How to break a long RDD lineage in order to avoid StackOverflowError
I'm trying to consolidate a large number of small Avro files (in HDFS) into Parquet files. When there are too many Avro files in the directory, the job fails with ERROR yarn.ApplicationMaster: User class threw exception: java.lang.StackOverflowError
Error:
19/03/26 15:14:14 INFO avro.AvroRelation: Listing hdfs://rc-hddd701.dev.local:8020/ur/source/Avro/urlog-avro-dm.dmlog/2019/03/21/20190321235700-dm-appd703-9abf19b8-2f6f-4341-87d7-74c0175e980d.avro on driver
19/03/26 15:14:14 INFO avro.AvroRelation: Listing hdfs://rc-hddd701.dev.local:8020/ur/source/Avro/urlog-avro-dm.dmlog/2019/03/21/20190321235800-DM-APPTSTD701-6af176ba-68f8-4420-b1b0-2f2be6abf003.avro on driver
19/03/26 15:14:14 INFO avro.AvroRelation: Listing hdfs://rc-hddd701.dev.local:8020/ur/source/Avro/urlog-avro-dm.dmlog/2019/03/21/20190321235800-dm-appd701-70b0ff1c-1664-4ce7-8321-149e12961627.avro on driver
19/03/26 15:14:14 INFO avro.AvroRelation: Listing hdfs://rc-hddd701.dev.local:8020/ur/source/Avro/urlog-avro-dm.dmlog/2019/03/21/20190321235800-dm-appd702-3dcbe094-14c9-4a4f-b326-57256df78b50.avro on driver
19/03/26 15:14:14 INFO avro.AvroRelation: Listing hdfs://rc-hddd701.dev.local:8020/ur/source/Avro/urlog-avro-dm.dmlog/2019/03/21/20190321235800-dm-appd703-a8a3ef8b-4dc0-41c1-a69a-2ef432fee0af.avro on driver
19/03/26 15:14:56 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.StackOverflowError
java.lang.StackOverflowError
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
The code I'm using:
// One DataFrame per input file
val df_array = filePaths.map(path =>
  sqlContext.read.format("com.databricks.spark.avro").load(path.toString))

// Union everything into a single DataFrame (builds one long, linear plan)
val df_mid = df_array.reduce((df1, df2) => df1.unionAll(df2))

val df = df_mid
  .withColumn("dt", date_format(df_mid.col("timeStamp"), "yyyy-MM-dd"))
  .filter("dt != 'null'")

df
  .repartition(df.col("dt")) // repartition vs coalesce: https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce
  .write.partitionBy("dt")
  .mode(SaveMode.Append)
  .option("compression", "snappy")
  .parquet(avroConsolidator.parquetFilePathSpark.toString)
where filePaths is an Array[Path].
This code works when I process a smaller number of paths.
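One way to sidestep the deep unionAll chain entirely is to pass all the paths to a single load call, since DataFrameReader.load accepts varargs. A minimal sketch (untested against this cluster; assumes the same sqlContext and filePaths as above):

```scala
// Sketch: a single multi-path load produces one scan node instead of
// a unionAll plan whose depth grows with the number of files.
val df_all = sqlContext.read
  .format("com.databricks.spark.avro")
  .load(filePaths.map(_.toString): _*)
```

With this shape, the plan depth no longer depends on how many files are in the directory, so the recursive plan traversal that overflows the stack never gets deep in the first place.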
After some searching, I found that checkpointing the DataFrame might be a way to mitigate the issue, but I'm not sure how to achieve that.
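For reference, Spark 2.1 exposes Dataset.checkpoint(), which materializes the data and truncates the logical plan. A sketch of how it could be applied here (untested; batchSize is a hypothetical tuning knob, and a checkpoint directory must be configured first):

```scala
// Sketch: union in batches and checkpoint between batches so the plan
// depth stays bounded instead of growing linearly with the file count.
// Prerequisite (assumed path): a checkpoint directory on HDFS, e.g.
//   sqlContext.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val batchSize = 50 // hypothetical; tune for your workload

val df_mid = df_array
  .grouped(batchSize)                      // split the per-file DataFrames into batches
  .map(_.reduce((a, b) => a.unionAll(b)))  // shallow union within each batch
  .reduce { (a, b) =>
    // checkpoint() writes the intermediate result out and returns a
    // DataFrame whose lineage starts fresh from the checkpointed data
    a.unionAll(b).checkpoint()
  }
```

The trade-off is extra I/O for each checkpoint, so the batch size should be large enough that checkpointing happens rarely but small enough that each batch's plan stays shallow.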
Spark version: 2.1
apache-spark apache-spark-sql hdfs spark-checkpoint
asked Mar 27 at 19:59 by Neel_sama (103 bronze badges) · edited Apr 19 at 17:41