


Consume using KafkaIO in batch processing mode using Google Dataflow































My Google Dataflow job uses Apache Beam's KafkaIO library together with AvroIO and windowed writes to write ".avro" files to a Google Cloud Storage bucket. However, the job defaults to the Streaming processing type when run on production data.



Is it possible to consume data from a Kafka topic with KafkaIO in Dataflow using batch processing? This Dataflow job does not require near-real-time (streaming) processing. Is there also a way to insert the incoming records into a BigQuery table without incurring streaming-insert costs, by enabling batch-type processing?
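For reference, a minimal sketch of the BigQuery part of that idea, assuming the records have already been converted to a tableRows PCollection of TableRow; the table reference and schema are placeholders. In a batch pipeline, BigQueryIO uses load jobs by default, which avoids streaming-insert charges:

// Hypothetical sketch: write via BigQuery load jobs (the default in batch mode)
// rather than streaming inserts. Table reference and schema are placeholders.
tableRows.apply("WriteToBigQuery", BigQueryIO.writeTableRows()
        .to("PROJECT:DATASET.TABLE")
        .withSchema(tableSchema)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)   // explicit here for clarity
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));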



Batch processing with less frequent runs would work for this use case and should reduce memory, vCPU, and overall compute costs.



As per https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/kafka/KafkaIO.html:



KafkaIO source returns unbounded collection of Kafka records as PCollection&lt;KafkaRecord&lt;K, V&gt;&gt;.



Does this mean that Kafka, being an unbounded source, cannot be run in batch mode?



In testing, adding the .withMaxNumRecords(1000000) condition runs the job in batch mode. However, to run the job on live incoming data, I need to remove this condition.



I have tried using windowing and setting the streaming-mode options flag to false, without success, as in the code below.




// not streaming mode
options.setStreaming(false);

...
PCollection<String> collection = p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("IPADDRESS:9092")
        .withTopic(topic)
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(props)
        .withConsumerFactoryFn(new ConsumerFactory())
        // .withMaxNumRecords(1000000)
        .withoutMetadata())
    .apply(Values.<String>create())
    .apply(Window.into(FixedWindows.of(Duration.standardDays(1))));

...
// convert to Avro GenericRecord

.apply("AvroToGCS", AvroIO.writeGenericRecords(AVRO_SCHEMA)
        .withWindowedWrites()
        .withNumShards(1)
        .to("gs://BUCKET/FOLDER/")
        .withSuffix(".avro"));



This code still resulted in a Streaming job type, with 1 worker (4 vCPUs) processing 1.8 million records over 9 minutes. After that, I had to stop (drain) the job to prevent further costs.



When enforcing batch processing in Dataflow on incoming data, is it possible to collect a batch of records, write it out as Avro files, and keep doing so until the offset has caught up to the latest?



Any examples or sample code would be greatly appreciated.










      java google-cloud-platform apache-kafka google-cloud-dataflow apache-beam






edited Mar 26 at 6:44
asked Mar 26 at 6:30 by Amogh Antarkar






















          1 Answer
































          Unbounded sources cannot be run in batch mode. This is by design, as batch pipelines expect a finite amount of data to be read and to terminate when done processing it.



However, you can convert an unbounded source into a bounded source by constraining how many records it reads, which you have done. Note: there is no guarantee which records will be read.
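As an illustration of that point, a minimal sketch of a bounded KafkaIO read; the limits are arbitrary placeholders, and withMaxReadTime can be used instead of, or together with, withMaxNumRecords:

// Sketch: cap the Kafka read so the source becomes bounded and Dataflow
// runs the pipeline as a batch job. Whichever limit is hit first ends the read.
PCollection<String> values = p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("IPADDRESS:9092")
        .withTopic(topic)
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withMaxNumRecords(1000000)                      // bound by record count
        .withMaxReadTime(Duration.standardMinutes(30))   // and/or bound by wall-clock time
        .withoutMetadata())
    .apply(Values.<String>create());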



          Streaming pipelines are meant to be always up, so that they are available to read live data. Batch pipelines are meant to read backlogs of stored data.



A batch pipeline will not respond well to reading live data; it will read whatever data is there when you launch the pipeline and then terminate.






answered Mar 29 at 22:19 by Alex Amato























• Since I can commit the Kafka offset back using commitOffsetsInFinalize(), I am thinking it can be used with withMaxNumRecords(), expecting N billion records every day. The offset would move ahead according to the run schedule frequency.

  – Amogh Antarkar, Apr 1 at 4:58
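A possible shape for that idea, sketched under the assumption that a fixed group.id is set so committed offsets survive between runs; the group name and record cap are placeholders, and whether checkpoint finalization (and therefore the offset commit) fires reliably for a bounded read on Dataflow is something to verify:

// Sketch: bounded read that commits consumed offsets back to Kafka, so each
// scheduled batch run resumes roughly where the previous run stopped.
// The group.id and the record cap are placeholders.
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.GROUP_ID_CONFIG, "dataflow-batch-ingest");   // fixed consumer group
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");       // first run starts at the beginning

PCollection<KV<String, String>> records = p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("IPADDRESS:9092")
        .withTopic(topic)
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(props)
        .commitOffsetsInFinalize()        // commit offsets when the read is finalized
        .withMaxNumRecords(1000000)       // bound the read so the job terminates
        .withoutMetadata());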









