


Consume using KafkaIO in batch processing mode using Google Dataflow































My Google Dataflow job uses Apache Beam's KafkaIO library together with AvroIO and windowed writes to write ".avro" files to a Google Cloud Storage bucket. However, the job defaults to the Streaming processing type when run on production data.



Is it possible to consume data from a Kafka topic with KafkaIO in Dataflow using batch processing? This Dataflow job does not require near-real-time (streaming) processing. Is there also a way to insert the incoming records into a BigQuery table without incurring streaming-insert costs, by enabling batch-type processing?
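For reference, a minimal sketch of the BigQuery part of that idea, assuming the records have already been converted to a tableRows PCollection of TableRow; the table reference and schema are placeholders. In a batch pipeline, BigQueryIO uses load jobs by default, which avoids streaming-insert charges:

// Hypothetical sketch: write via BigQuery load jobs (the default in batch mode)
// rather than streaming inserts. Table reference and schema are placeholders.
tableRows.apply("WriteToBigQuery", BigQueryIO.writeTableRows()
        .to("PROJECT:DATASET.TABLE")
        .withSchema(tableSchema)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)   // explicit here for clarity
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));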



Batch processing with less frequent runs would work for this use case and should reduce memory, vCPU, and overall compute costs.



As per https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/kafka/KafkaIO.html:



KafkaIO source returns unbounded collection of Kafka records as PCollection&lt;KafkaRecord&lt;K, V&gt;&gt;.



Does this mean that Kafka, being an unbounded source, cannot be run in batch mode?



In testing, adding the .withMaxNumRecords(1000000) condition runs the job in batch mode. However, to run the job on live incoming data, I need to remove this condition.



I have tried using windowing and setting the streaming-mode options flag to false, without success, as in the code below.




// not streaming mode
options.setStreaming(false);

...
PCollection<String> collection = p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("IPADDRESS:9092")
        .withTopic(topic)
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(props)
        .withConsumerFactoryFn(new ConsumerFactory())
        // .withMaxNumRecords(1000000)
        .withoutMetadata())
    .apply(Values.<String>create())
    .apply(Window.into(FixedWindows.of(Duration.standardDays(1))));

...
// convert to Avro GenericRecord

.apply("AvroToGCS", AvroIO.writeGenericRecords(AVRO_SCHEMA)
        .withWindowedWrites()
        .withNumShards(1)
        .to("gs://BUCKET/FOLDER/")
        .withSuffix(".avro"));



This code still resulted in a Streaming job type, with 1 worker (4 vCPUs) processing 1.8 million records over 9 minutes. After that, I had to stop (drain) the job to prevent further costs.



When enforcing batch processing in Dataflow on incoming data, is it possible to collect a batch of records, write it out as Avro files, and keep doing so until the offset has caught up to the latest?



Any examples or sample code would be greatly appreciated.










      java google-cloud-platform apache-kafka google-cloud-dataflow apache-beam






edited Mar 26 at 6:44
asked Mar 26 at 6:30 by Amogh Antarkar






















          1 Answer
































          Unbounded sources cannot be run in batch mode. This is by design, as batch pipelines expect a finite amount of data to be read and to terminate when done processing it.



However, you can convert an unbounded source into a bounded source by constraining how many records it reads, which you have done. Note: there is no guarantee which records will be read.
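As an illustration of that point, a minimal sketch of a bounded KafkaIO read; the limits are arbitrary placeholders, and withMaxReadTime can be used instead of, or together with, withMaxNumRecords:

// Sketch: cap the Kafka read so the source becomes bounded and Dataflow
// runs the pipeline as a batch job. Whichever limit is hit first ends the read.
PCollection<String> values = p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("IPADDRESS:9092")
        .withTopic(topic)
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withMaxNumRecords(1000000)                      // bound by record count
        .withMaxReadTime(Duration.standardMinutes(30))   // and/or bound by wall-clock time
        .withoutMetadata())
    .apply(Values.<String>create());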



          Streaming pipelines are meant to be always up, so that they are available to read live data. Batch pipelines are meant to read backlogs of stored data.



A batch pipeline will not respond well to reading live data; it will read whatever data is there when you launch the pipeline and then terminate.






answered Mar 29 at 22:19 by Alex Amato























• Since I can commit the Kafka offset back using commitOffsetsInFinalize(), I am thinking it can be used with withMaxNumRecords(), expecting N billion records every day. The offset would move ahead according to the run schedule frequency.

  – Amogh Antarkar, Apr 1 at 4:58
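A possible shape for that idea, sketched under the assumption that a fixed group.id is set so committed offsets survive between runs; the group name and record cap are placeholders, and whether checkpoint finalization (and therefore the offset commit) fires reliably for a bounded read on Dataflow is something to verify:

// Sketch: bounded read that commits consumed offsets back to Kafka, so each
// scheduled batch run resumes roughly where the previous run stopped.
// The group.id and the record cap are placeholders.
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.GROUP_ID_CONFIG, "dataflow-batch-ingest");   // fixed consumer group
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");       // first run starts at the beginning

PCollection<KV<String, String>> records = p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("IPADDRESS:9092")
        .withTopic(topic)
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(props)
        .commitOffsetsInFinalize()        // commit offsets when the read is finalized
        .withMaxNumRecords(1000000)       // bound the read so the job terminates
        .withoutMetadata());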









