How to combine and sort different dataframes into one?


Given two dataframes that may have completely different schemas, except for an index column (timestamp in this case), such as df1 and df2 below:

df1:

timestamp | length | width
        1 |     10 |    20
        3 |      5 |     3

df2:

timestamp | name     | length
        0 | "sample" |      3
        2 | "test"   |      6

How can I combine these two dataframes into one that would look something like this:

df3:

          |      df1       |       df2
timestamp | length | width | name     | length
        0 | null   | null  | "sample" | 3
        1 | 10     | 20    | null     | null
        2 | null   | null  | "test"   | 6
        3 | 5      | 3     | null     | null

I am extremely new to Spark, so this might not actually make a lot of sense. The problem I am trying to solve is this: I need to combine these dataframes so that I can later convert each row to a given object. However, the rows have to be ordered by timestamp, so that when I write these objects out, they are in the correct order.

So for example, given the df3 above, I would be able to generate the following list of objects:

objs = [
 ObjectType1(timestamp=0, name="sample", length=3),
 ObjectType2(timestamp=1, length=10, width=20),
 ObjectType1(timestamp=2, name="test", length=6),
 ObjectType2(timestamp=3, length=5, width=3)
]

Perhaps combining the dataframes does not make sense, but how could I sort the dataframes individually and somehow grab the Rows from each one of them ordered by timestamp globally?

P.S.: Note that I repeated length in both dataframes. That was done on purpose to illustrate that they may have columns of the same name and type that represent completely different data, so merging the schemas is not a possibility.

asked Mar 26 at 23:05

Vini


apache-spark pyspark apache-spark-sql


1 Answer

What you need is a full outer join, possibly after renaming one of the clashing columns, something like df1.join(df2.withColumnRenamed("length", "length2"), Seq("timestamp"), "full_outer").
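To make the semantics of that join concrete, here is a minimal plain-Python sketch (no Spark involved) of a full outer join on timestamp using the data from the question. The helper `full_outer_join` and its `rename_right` parameter are hypothetical names made up for this illustration, and the sketch assumes non-empty inputs with unique timestamps on each side; it is not how Spark implements the join.

```python
# Plain-Python illustration of a full outer join on "timestamp".
# Rows are dicts; None plays the role of Spark's null.
df1 = [
    {"timestamp": 1, "length": 10, "width": 20},
    {"timestamp": 3, "length": 5, "width": 3},
]
df2 = [
    {"timestamp": 0, "name": "sample", "length": 3},
    {"timestamp": 2, "name": "test", "length": 6},
]


def full_outer_join(left, right, key, rename_right):
    """Join two non-empty lists of dicts on `key`, keeping unmatched rows.

    `rename_right` (hypothetical parameter) maps clashing right-hand column
    names to new names, mirroring the withColumnRenamed step.
    Assumes `key` is unique within each input.
    """
    left_cols = [c for c in left[0] if c != key]
    right_cols = [rename_right.get(c, c) for c in right[0] if c != key]
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    joined = []
    # Visit every key seen on either side, in sorted order.
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        out = {key: k}
        # Start with every non-key column set to None, then fill in
        # whatever the left and right sides actually provide.
        out.update({c: None for c in left_cols + right_cols})
        for c, v in left_by_key.get(k, {}).items():
            if c != key:
                out[c] = v
        for c, v in right_by_key.get(k, {}).items():
            if c != key:
                out[rename_right.get(c, c)] = v
        joined.append(out)
    return joined


df3 = full_outer_join(df1, df2, "timestamp", {"length": "length2"})
for row in df3:
    print(row)
```

Each timestamp appears exactly once in the result, and the columns coming from the dataframe that has no row for that timestamp are left as None.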

See this example, built from yours (just with shorter names, to save typing):

// data shaped as your example
// (outside spark-shell, toDF also needs: import spark.implicits._)
case class t1(ts: Int, width: Int, l: Int)
case class t2(ts: Int, name: String, l: Int)
// create data frames
val df1 = Seq(t1(1, 10, 20), t1(3, 5, 3)).toDF
val df2 = Seq(t2(0, "sample", 3), t2(2, "test", 6)).toDF
// rename the clashing column, full outer join on ts, then sort by ts
df1.join(df2.withColumnRenamed("l", "l2"), Seq("ts"), "full_outer").sort("ts").show
+---+-----+----+------+----+
| ts|width|   l|  name|  l2|
+---+-----+----+------+----+
|  0| null|null|sample|   3|
|  1|   10|  20|  null|null|
|  2| null|null|  test|   6|
|  3|    5|   3|  null|null|
+---+-----+----+------+----+
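Once the rows are joined and sorted, turning each one into the objects from the question could look roughly like the plain-Python sketch below. `ObjectType1` and `ObjectType2` are stand-ins for the asker's classes (their real definitions are not shown in the question), the rows are hard-coded dicts using the timestamp/length/width/name/length2 column names, and deciding the source by checking which side's columns are non-null is an assumption that only holds if those columns are never null in the source data. In PySpark the dicts could plausibly come from something like `row.asDict()` over `df3.toLocalIterator()`.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the classes named in the question.
@dataclass
class ObjectType1:
    timestamp: int
    name: str
    length: int


@dataclass
class ObjectType2:
    timestamp: int
    length: int
    width: int


# Joined rows, already sorted by timestamp; None stands for null.
joined_rows = [
    {"timestamp": 0, "length": None, "width": None, "name": "sample", "length2": 3},
    {"timestamp": 1, "length": 10, "width": 20, "name": None, "length2": None},
    {"timestamp": 2, "length": None, "width": None, "name": "test", "length2": 6},
    {"timestamp": 3, "length": 5, "width": 3, "name": None, "length2": None},
]


def row_to_objects(row):
    """Return the objects encoded in one joined row.

    Assumption: a side is considered present when its columns are non-null.
    A row can yield two objects if both dataframes share a timestamp.
    """
    objects = []
    if row["name"] is not None:  # columns that came from df2
        objects.append(ObjectType1(row["timestamp"], row["name"], row["length2"]))
    if row["width"] is not None:  # columns that came from df1
        objects.append(ObjectType2(row["timestamp"], row["length"], row["width"]))
    return objects


# Flatten while preserving the timestamp order of the joined rows.
objs = [obj for row in joined_rows for obj in row_to_objects(row)]
for obj in objs:
    print(obj)
```

Because the rows are processed in order, the resulting list keeps the global timestamp ordering without any further sorting.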

answered Mar 26 at 23:55

Roberto Congiu
