How to combine and sort different dataframes into one?


Given two dataframes that may have completely different schemas except for an index column (timestamp in this case), such as df1 and df2 below:

df1:

    timestamp | length | width
    1         | 10     | 20
    3         | 5      | 3

df2:

    timestamp | name     | length
    0         | "sample" | 3
    2         | "test"   | 6

How can I combine these two dataframes into one that would look something like this:

df3:

    timestamp | df1            | df2
              | length | width | name     | length
    0         | null   | null  | "sample" | 3
    1         | 10     | 20    | null     | null
    2         | null   | null  | "test"   | 6
    3         | 5      | 3     | null     | null

I am extremely new to Spark, so this might not make a lot of sense. The problem I am trying to solve is this: I need to combine these dataframes so that I can later convert each row to a given object. The rows have to be ordered by timestamp, so that when I write these objects out they are in the correct order.

So, for example, given the df3 above, I would be able to generate the following list of objects:

    objs = [
        ObjectType1(timestamp=0, name="sample", length=3),
        ObjectType2(timestamp=1, length=10, width=20),
        ObjectType1(timestamp=2, name="test", length=6),
        ObjectType2(timestamp=3, length=5, width=3)
    ]

Perhaps combining the dataframes does not make sense. If so, how could I sort the dataframes individually and then grab the rows from each of them in global timestamp order?

P.S.: Note that I repeated length in both dataframes. That was done on purpose to illustrate that they may have columns with the same name and type that represent completely different data, so merging the schemas is not a possibility.
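The second idea above (sort each source on its own, then take rows in global timestamp order) can be sketched in plain Python without merging any schemas. This is only an illustration under assumptions: the rows are assumed to be already collected into lists of dicts small enough to fit in memory, and the ObjectType1 and ObjectType2 dataclasses are hypothetical stand-ins for the real object types.

```python
import heapq
from dataclasses import dataclass
from operator import attrgetter


# Hypothetical stand-ins for the two object types named in the question.
@dataclass
class ObjectType1:
    timestamp: int
    name: str
    length: int


@dataclass
class ObjectType2:
    timestamp: int
    length: int
    width: int


# Rows as they might look after collecting each dataframe separately
# (deliberately unsorted here).
df1_rows = [{"timestamp": 3, "length": 5, "width": 3},
            {"timestamp": 1, "length": 10, "width": 20}]
df2_rows = [{"timestamp": 2, "name": "test", "length": 6},
            {"timestamp": 0, "name": "sample", "length": 3}]

by_timestamp = attrgetter("timestamp")

# Convert each source to its own type and sort it independently.
type2_objs = sorted((ObjectType2(**row) for row in df1_rows), key=by_timestamp)
type1_objs = sorted((ObjectType1(**row) for row in df2_rows), key=by_timestamp)

# heapq.merge lazily interleaves inputs that are already sorted,
# so the two schemas never have to be combined.
objs = list(heapq.merge(type1_objs, type2_objs, key=by_timestamp))
```

Because each source keeps its own type, the two unrelated length columns never meet. The trade-off is that this runs on a single machine rather than inside Spark.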










apache-spark pyspark apache-spark-sql

asked Mar 26 at 23:05 by Vini

1 Answer

What you need is a full outer join, possibly renaming one of the columns, something like df1.join(df2.withColumnRenamed("length","length2"), Seq("timestamp"),"full_outer")

See this example, built from yours (just less typing):

    // data shaped as your example
    // (outside spark-shell, toDF also needs: import spark.implicits._)
    case class t1(ts: Int, width: Int, l: Int)
    case class t2(ts: Int, name: String, l: Int)
    // create data frames
    val df1 = Seq(t1(1,10,20), t1(3,5,3)).toDF
    val df2 = Seq(t2(0,"sample",3), t2(2,"test",6)).toDF
    // rename the clashing column, join on ts keeping unmatched rows from both sides, then sort
    df1.join(df2.withColumnRenamed("l","l2"), Seq("ts"), "full_outer").sort("ts").show
    +---+-----+----+------+----+
    | ts|width|   l|  name|  l2|
    +---+-----+----+------+----+
    |  0| null|null|sample|   3|
    |  1|   10|  20|  null|null|
    |  2| null|null|  test|   6|
    |  3|    5|   3|  null|null|
    +---+-----+----+------+----+
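Since the question is tagged pyspark, here is a plain-Python sketch of what the full outer join above does to the rows, so the null-filling can be followed without a Spark session. It is not the Spark API: the full_outer_join helper and the df1_/df2_ column prefixes are made up for this illustration, and it assumes each timestamp appears at most once per side and that neither input is empty.

```python
def full_outer_join(left, right, key, prefixes=("df1_", "df2_")):
    """Join two lists of dict rows on `key`, keeping unmatched rows from both sides.

    Assumes `key` is unique within each side and both lists are non-empty.
    """
    sides = list(zip((left, right), prefixes))
    # Prefix every non-key column so same-named columns (length) cannot clash.
    columns = [prefix + col for rows, prefix in sides for col in rows[0] if col != key]
    joined = {}
    for rows, prefix in sides:
        for row in rows:
            # Start each output row with all columns set to None (null).
            out = joined.setdefault(row[key], {key: row[key], **dict.fromkeys(columns)})
            for col, value in row.items():
                if col != key:
                    out[prefix + col] = value
    # Return the rows ordered by the join key, like .sort("ts") above.
    return [joined[k] for k in sorted(joined)]


df1 = [{"timestamp": 1, "length": 10, "width": 20},
       {"timestamp": 3, "length": 5, "width": 3}]
df2 = [{"timestamp": 0, "name": "sample", "length": 3},
       {"timestamp": 2, "name": "test", "length": 6}]

df3 = full_outer_join(df1, df2, "timestamp")
```

In PySpark itself, the one-liner from the top of this answer should translate to df1.join(df2.withColumnRenamed("length", "length2"), ["timestamp"], "full_outer").orderBy("timestamp"), with a Python list in place of Seq.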





answered Mar 26 at 23:55 by Roberto Congiu