Derive multiple columns from a single column in a Spark DataFrameHow to update multiple columns of DataFrame in Spark Scalaspark case class udf output as dataframeQuerying Spark SQL DataFrame with complex typesApache Spark — Assign the result of UDF to multiple dataframe columnsHow to get keys and values from MapType column in SparkSQL DataFramepyspark split a column to multiple columns without pandasSpark: Applying UDF to Dataframe Generating new Columns based on Values in DFSpark UDF returning more than one itemhow to add new column to data frame in spark by deriving edit distance data frame columns (String)Trying to turn a blob into multiple columns in SparkHow to sort a dataframe by multiple column(s)Selecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrame by column nameHow to drop rows of Pandas DataFrame whose value in certain columns is NaNSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersHow to change column types in Spark SQL's DataFrame?Transforming Spark Dataframe ColumnHow to add multiple row and multiple column from single row in pyspark?

Meaning of "half-crown enclosure"

Is it normal to "extract a paper" from a master thesis?

Anatomically correct Guivre

Shell builtin `printf` line limit?

Keeping the dodos out of the field

Sony VAIO Duo 13 Wifi not working on Ubuntu 16.04

Must every right-inverse of a linear transformation be a linear transformation?

Why do testers need root cause analysis?

(For training purposes) Are there any openings with rook pawns that are more effective than others (and if so, what are they)?

What happens when redirecting with 3>&1 1>/dev/null?

What defines a person who is circumcised "of the heart"?

Real Analysis: Proof of the equivalent definitions of the derivative.

Existence of a model of ZFC in which the natural numbers are really the natural numbers

Why is this python script running in background consuming 100 % CPU?

Why "strap-on" boosters, and how do other people say it?

Passport queue length in UK in relation to arrival method

Why is a weak base more able to deprotonate a strong acid than a weak acid?

size of pointers and architecture

Was murdering a slave illegal in American slavery, and if so, what punishments were given for it?

How to create razor wire

What is this dime sized black bug with white on the segments near Loveland Colorodao?

Split into three!

Can a UK national work as a paid shop assistant in the USA?

nginx conf: http2 module not working in Chrome in ubuntu 18.04



Derive multiple columns from a single column in a Spark DataFrame


How to update multiple columns of DataFrame in Spark Scalaspark case class udf output as dataframeQuerying Spark SQL DataFrame with complex typesApache Spark — Assign the result of UDF to multiple dataframe columnsHow to get keys and values from MapType column in SparkSQL DataFramepyspark split a column to multiple columns without pandasSpark: Applying UDF to Dataframe Generating new Columns based on Values in DFSpark UDF returning more than one itemhow to add new column to data frame in spark by deriving edit distance data frame columns (String)Trying to turn a blob into multiple columns in SparkHow to sort a dataframe by multiple column(s)Selecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrame by column nameHow to drop rows of Pandas DataFrame whose value in certain columns is NaNSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersHow to change column types in Spark SQL's DataFrame?Transforming Spark Dataframe ColumnHow to add multiple row and multiple column from single row in pyspark?






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;








46















I have a DF with a huge parseable metadata as a single string column in a Dataframe, lets call it DFA, with ColmnA.



I would like to break this column, ColmnA into multiple columns thru a function, ClassXYZ = Func1(ColmnA). This function returns a class ClassXYZ, with multiple variables, and each of these variables now has to be mapped to new Column, such a ColmnA1, ColmnA2 etc.



How would I do such a transformation from 1 Dataframe to another with these additional columns by calling this Func1 just once, and not have to repeat-it to create all the columns.



Its easy to solve if I were to call this huge function every time to add a new column, but that what I wish to avoid.



Kindly please advise with a working or pseudo code.



Thanks



Sanjay










share|improve this question






























    46















    I have a DF with a huge parseable metadata as a single string column in a Dataframe, lets call it DFA, with ColmnA.



    I would like to break this column, ColmnA into multiple columns thru a function, ClassXYZ = Func1(ColmnA). This function returns a class ClassXYZ, with multiple variables, and each of these variables now has to be mapped to new Column, such a ColmnA1, ColmnA2 etc.



    How would I do such a transformation from 1 Dataframe to another with these additional columns by calling this Func1 just once, and not have to repeat-it to create all the columns.



    Its easy to solve if I were to call this huge function every time to add a new column, but that what I wish to avoid.



    Kindly please advise with a working or pseudo code.



    Thanks



    Sanjay










    share|improve this question


























      46












      46








      46


      31






      I have a DF with a huge parseable metadata as a single string column in a Dataframe, lets call it DFA, with ColmnA.



      I would like to break this column, ColmnA into multiple columns thru a function, ClassXYZ = Func1(ColmnA). This function returns a class ClassXYZ, with multiple variables, and each of these variables now has to be mapped to new Column, such a ColmnA1, ColmnA2 etc.



      How would I do such a transformation from 1 Dataframe to another with these additional columns by calling this Func1 just once, and not have to repeat-it to create all the columns.



      Its easy to solve if I were to call this huge function every time to add a new column, but that what I wish to avoid.



      Kindly please advise with a working or pseudo code.



      Thanks



      Sanjay










      share|improve this question
















      I have a DF with a huge parseable metadata as a single string column in a Dataframe, lets call it DFA, with ColmnA.



      I would like to break this column, ColmnA into multiple columns thru a function, ClassXYZ = Func1(ColmnA). This function returns a class ClassXYZ, with multiple variables, and each of these variables now has to be mapped to new Column, such a ColmnA1, ColmnA2 etc.



      How would I do such a transformation from 1 Dataframe to another with these additional columns by calling this Func1 just once, and not have to repeat-it to create all the columns.



      Its easy to solve if I were to call this huge function every time to add a new column, but that what I wish to avoid.



      Kindly please advise with a working or pseudo code.



      Thanks



      Sanjay







      scala apache-spark dataframe apache-spark-sql user-defined-functions






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Jan 6 at 14:07









      Community

      11




      11










      asked Aug 25 '15 at 5:33









      sshroffsshroff

      4012812




      4012812






















          5 Answers
          5






          active

          oldest

          votes


















          63














          Generally speaking what you want is not directly possible. UDF can return only a single column at the time. There are two different ways you can overcome this limitation:




          1. Return a column of complex type. The most general solution is a StructType but you can consider ArrayType or MapType as well.



            import org.apache.spark.sql.functions.udf

            val df = Seq(
            (1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
            ).toDF("x", "y", "z")

            case class Foobar(foo: Double, bar: Double)

            val foobarUdf = udf((x: Long, y: Double, z: String) =>
            Foobar(x * y, z.head.toInt * y))

            val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))
            df1.show
            // +---+----+---+------------+
            // | x| y| z| foobar|
            // +---+----+---+------------+
            // | 1| 3.0| a| [3.0,291.0]|
            // | 2|-1.0| b|[-2.0,-98.0]|
            // | 3| 0.0| c| [0.0,0.0]|
            // +---+----+---+------------+

            df1.printSchema
            // root
            // |-- x: long (nullable = false)
            // |-- y: double (nullable = false)
            // |-- z: string (nullable = true)
            // |-- foobar: struct (nullable = true)
            // | |-- foo: double (nullable = false)
            // | |-- bar: double (nullable = false)


            This can be easily flattened later but usually there is no need for that.




          2. Switch to RDD, reshape and rebuild DF:



            import org.apache.spark.sql.types._
            import org.apache.spark.sql.Row

            def foobarFunc(x: Long, y: Double, z: String): Seq[Any] =
            Seq(x * y, z.head.toInt * y)

            val schema = StructType(df.schema.fields ++
            Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))

            val rows = df.rdd.map(r => Row.fromSeq(
            r.toSeq ++
            foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))

            val df2 = sqlContext.createDataFrame(rows, schema)

            df2.show
            // +---+----+---+----+-----+
            // | x| y| z| foo| bar|
            // +---+----+---+----+-----+
            // | 1| 3.0| a| 3.0|291.0|
            // | 2|-1.0| b|-2.0|-98.0|
            // | 3| 0.0| c| 0.0| 0.0|
            // +---+----+---+----+-----+






          share|improve this answer




















          • 2





            When you say "usually there's no for [flattening a column]", why is that? Or does spark allow most things that you do with top-level columns to also be done with hierarchical data (like df1.foobar.foo)?

            – max
            Jun 20 '16 at 17:03







          • 2





            @max Because simple structs can be used in pretty much any context when one would normally use flat structure (with simple dot syntax fooobar.foo). It doesn't apply to collection types though. You can also check stackoverflow.com/a/33850490/1560062

            – zero323
            Jun 20 '16 at 17:33



















          16














          Assume that after your function there will be a sequence of elements, giving an example as below:



          val df = sc.parallelize(List(("Mike,1986,Toronto", 30), ("Andre,1980,Ottawa", 36), ("jill,1989,London", 27))).toDF("infoComb", "age")
          df.show
          +------------------+---+
          | infoComb|age|
          +------------------+---+
          |Mike,1986,Toronto| 30|
          | Andre,1980,Ottawa| 36|
          | jill,1989,London| 27|
          +------------------+---+


          now what you can do with this infoComb is that you can start split the string and get more columns with:



          df.select(expr("(split(infoComb, ','))[0]").cast("string").as("name"), expr("(split(infoComb, ','))[1]").cast("integer").as("yearOfBorn"), expr("(split(infoComb, ','))[2]").cast("string").as("city"), $"age").show
          +-----+----------+-------+---+
          | name|yearOfBorn| city|age|
          +-----+----------+-------+---+
          |Mike| 1986|Toronto| 30|
          |Andre| 1980| Ottawa| 36|
          | jill| 1989| London| 27|
          +-----+----------+-------+---+


          Hope this helps.






          share|improve this answer























          • Couldn't you just say df.select('infoComb.*', 'age') The .* on a column name selects each field in the struct as a new column.

            – Malcolm McRoberts
            Sep 28 '18 at 17:53


















          5














          If your resulting columns will be of the same length as the original one, you can create brand new columns with withColumn function and by applying an udf. After this you can drop your original column, eg:



           val newDf = myDf.withColumn("newCol1", myFun(myDf("originalColumn")))
          .withColumn("newCol2", myFun2(myDf("originalColumn"))
          .drop(myDf("originalColumn"))


          where myFun is an udf defined like this:



           def myFun= udf(
          (originalColumnContent : String) =>
          // do something with your original column content and return a new one

          )





          share|improve this answer

























          • Hi Niemand, I appreciate your reply...but it does not solve the question... in you code, you are calling the function "myDF" several times, whereas I would like that function to be called once, generate a class having multiple fields , and each field variable be returned as a new columns

            – sshroff
            Aug 25 '15 at 20:09












          • Well I', afraid that I presented the only one possible way for know, I don't think that any other way exists, but hopefully I am wrong ;). Also not that I did not call myFun several times - you can call other functions like myFun2, myFun3 etc. to create columns that you need.

            – Niemand
            Aug 25 '15 at 20:15


















          2














          I opted to create a function to flatten one column and then just call it simultaneously with the udf.



          First define this:



          implicit class DfOperations(df: DataFrame) 

          def flattenColumn(col: String) =
          def addColumns(df: DataFrame, cols: Array[String]): DataFrame =
          if (cols.isEmpty) df
          else addColumns(
          df.withColumn(col + "_" + cols.head, df(col + "." + cols.head)),
          cols.tail
          )


          val field = df.select(col).schema.fields(0)
          val newCols = field.dataType.asInstanceOf[StructType].fields.map(x => x.name)

          addColumns(df, newCols).drop(col)


          def withColumnMany(colName: String, col: Column) =
          df.withColumn(colName, col).flattenColumn(colName)





          Then usage is very simple:



          case class MyClass(a: Int, b: Int)

          val df = sc.parallelize(Seq(
          (0),
          (1)
          )).toDF("x")

          val f = udf((x: Int) => MyClass(x*2,x*3))

          df.withColumnMany("test", f($"x")).show()

          // +---+------+------+
          // | x|test_a|test_b|
          // +---+------+------+
          // | 0| 0| 0|
          // | 1| 2| 3|
          // +---+------+------+





          share|improve this answer























          • You don't have to do the whole withColumnMany thing. Just use select("select.*") to flatten it.

            – Assaf Mendelson
            Feb 28 '17 at 8:17


















          -2














          This can be easily achieved by using pivot function



          df4.groupBy("year").pivot("course").sum("earnings").collect() 





          share|improve this answer

























          • Pang ,Thank-you for formatting it

            – Abhishek Kgsk
            Jan 20 '17 at 7:08











          • I don't see "year", "course" or "earnings" in any of the answers or o.p... what data frame are you talking about in this very terse answer (not)?

            – Kai
            May 26 '17 at 15:27












          Your Answer






          StackExchange.ifUsing("editor", function ()
          StackExchange.using("externalEditor", function ()
          StackExchange.using("snippets", function ()
          StackExchange.snippets.init();
          );
          );
          , "code-snippets");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "1"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f32196207%2fderive-multiple-columns-from-a-single-column-in-a-spark-dataframe%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          5 Answers
          5






          active

          oldest

          votes








          5 Answers
          5






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          63














          Generally speaking what you want is not directly possible. UDF can return only a single column at the time. There are two different ways you can overcome this limitation:




          1. Return a column of complex type. The most general solution is a StructType but you can consider ArrayType or MapType as well.



            import org.apache.spark.sql.functions.udf

            val df = Seq(
            (1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
            ).toDF("x", "y", "z")

            case class Foobar(foo: Double, bar: Double)

            val foobarUdf = udf((x: Long, y: Double, z: String) =>
            Foobar(x * y, z.head.toInt * y))

            val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))
            df1.show
            // +---+----+---+------------+
            // | x| y| z| foobar|
            // +---+----+---+------------+
            // | 1| 3.0| a| [3.0,291.0]|
            // | 2|-1.0| b|[-2.0,-98.0]|
            // | 3| 0.0| c| [0.0,0.0]|
            // +---+----+---+------------+

            df1.printSchema
            // root
            // |-- x: long (nullable = false)
            // |-- y: double (nullable = false)
            // |-- z: string (nullable = true)
            // |-- foobar: struct (nullable = true)
            // | |-- foo: double (nullable = false)
            // | |-- bar: double (nullable = false)


            This can be easily flattened later but usually there is no need for that.




          2. Switch to RDD, reshape and rebuild DF:



            import org.apache.spark.sql.types._
            import org.apache.spark.sql.Row

            def foobarFunc(x: Long, y: Double, z: String): Seq[Any] =
            Seq(x * y, z.head.toInt * y)

            val schema = StructType(df.schema.fields ++
            Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))

            val rows = df.rdd.map(r => Row.fromSeq(
            r.toSeq ++
            foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))

            val df2 = sqlContext.createDataFrame(rows, schema)

            df2.show
            // +---+----+---+----+-----+
            // | x| y| z| foo| bar|
            // +---+----+---+----+-----+
            // | 1| 3.0| a| 3.0|291.0|
            // | 2|-1.0| b|-2.0|-98.0|
            // | 3| 0.0| c| 0.0| 0.0|
            // +---+----+---+----+-----+






          share|improve this answer




















          • 2





            When you say "usually there's no for [flattening a column]", why is that? Or does spark allow most things that you do with top-level columns to also be done with hierarchical data (like df1.foobar.foo)?

            – max
            Jun 20 '16 at 17:03







          • 2





            @max Because simple structs can be used in pretty much any context when one would normally use flat structure (with simple dot syntax fooobar.foo). It doesn't apply to collection types though. You can also check stackoverflow.com/a/33850490/1560062

            – zero323
            Jun 20 '16 at 17:33
















          63














          Generally speaking what you want is not directly possible. UDF can return only a single column at the time. There are two different ways you can overcome this limitation:




          1. Return a column of complex type. The most general solution is a StructType but you can consider ArrayType or MapType as well.



            import org.apache.spark.sql.functions.udf

            val df = Seq(
            (1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
            ).toDF("x", "y", "z")

            case class Foobar(foo: Double, bar: Double)

            val foobarUdf = udf((x: Long, y: Double, z: String) =>
            Foobar(x * y, z.head.toInt * y))

            val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))
            df1.show
            // +---+----+---+------------+
            // | x| y| z| foobar|
            // +---+----+---+------------+
            // | 1| 3.0| a| [3.0,291.0]|
            // | 2|-1.0| b|[-2.0,-98.0]|
            // | 3| 0.0| c| [0.0,0.0]|
            // +---+----+---+------------+

            df1.printSchema
            // root
            // |-- x: long (nullable = false)
            // |-- y: double (nullable = false)
            // |-- z: string (nullable = true)
            // |-- foobar: struct (nullable = true)
            // | |-- foo: double (nullable = false)
            // | |-- bar: double (nullable = false)


            This can be easily flattened later but usually there is no need for that.




          2. Switch to RDD, reshape and rebuild DF:



            import org.apache.spark.sql.types._
            import org.apache.spark.sql.Row

            def foobarFunc(x: Long, y: Double, z: String): Seq[Any] =
            Seq(x * y, z.head.toInt * y)

            val schema = StructType(df.schema.fields ++
            Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))

            val rows = df.rdd.map(r => Row.fromSeq(
            r.toSeq ++
            foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))

            val df2 = sqlContext.createDataFrame(rows, schema)

            df2.show
            // +---+----+---+----+-----+
            // | x| y| z| foo| bar|
            // +---+----+---+----+-----+
            // | 1| 3.0| a| 3.0|291.0|
            // | 2|-1.0| b|-2.0|-98.0|
            // | 3| 0.0| c| 0.0| 0.0|
            // +---+----+---+----+-----+






          share|improve this answer




















          • 2





            When you say "usually there's no for [flattening a column]", why is that? Or does spark allow most things that you do with top-level columns to also be done with hierarchical data (like df1.foobar.foo)?

            – max
            Jun 20 '16 at 17:03







          • 2





            @max Because simple structs can be used in pretty much any context when one would normally use flat structure (with simple dot syntax fooobar.foo). It doesn't apply to collection types though. You can also check stackoverflow.com/a/33850490/1560062

            – zero323
            Jun 20 '16 at 17:33














          63












          63








          63







          Generally speaking what you want is not directly possible. UDF can return only a single column at the time. There are two different ways you can overcome this limitation:




          1. Return a column of complex type. The most general solution is a StructType but you can consider ArrayType or MapType as well.



            import org.apache.spark.sql.functions.udf

            val df = Seq(
            (1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
            ).toDF("x", "y", "z")

            case class Foobar(foo: Double, bar: Double)

            val foobarUdf = udf((x: Long, y: Double, z: String) =>
            Foobar(x * y, z.head.toInt * y))

            val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))
            df1.show
            // +---+----+---+------------+
            // | x| y| z| foobar|
            // +---+----+---+------------+
            // | 1| 3.0| a| [3.0,291.0]|
            // | 2|-1.0| b|[-2.0,-98.0]|
            // | 3| 0.0| c| [0.0,0.0]|
            // +---+----+---+------------+

            df1.printSchema
            // root
            // |-- x: long (nullable = false)
            // |-- y: double (nullable = false)
            // |-- z: string (nullable = true)
            // |-- foobar: struct (nullable = true)
            // | |-- foo: double (nullable = false)
            // | |-- bar: double (nullable = false)


            This can be easily flattened later but usually there is no need for that.




          2. Switch to RDD, reshape and rebuild DF:



            import org.apache.spark.sql.types._
            import org.apache.spark.sql.Row

            def foobarFunc(x: Long, y: Double, z: String): Seq[Any] =
            Seq(x * y, z.head.toInt * y)

            val schema = StructType(df.schema.fields ++
            Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))

            val rows = df.rdd.map(r => Row.fromSeq(
            r.toSeq ++
            foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))

            val df2 = sqlContext.createDataFrame(rows, schema)

            df2.show
            // +---+----+---+----+-----+
            // | x| y| z| foo| bar|
            // +---+----+---+----+-----+
            // | 1| 3.0| a| 3.0|291.0|
            // | 2|-1.0| b|-2.0|-98.0|
            // | 3| 0.0| c| 0.0| 0.0|
            // +---+----+---+----+-----+






          share|improve this answer















          Generally speaking what you want is not directly possible. UDF can return only a single column at the time. There are two different ways you can overcome this limitation:




          1. Return a column of complex type. The most general solution is a StructType but you can consider ArrayType or MapType as well.



            import org.apache.spark.sql.functions.udf

            val df = Seq(
            (1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
            ).toDF("x", "y", "z")

            case class Foobar(foo: Double, bar: Double)

            val foobarUdf = udf((x: Long, y: Double, z: String) =>
            Foobar(x * y, z.head.toInt * y))

            val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))
            df1.show
            // +---+----+---+------------+
            // | x| y| z| foobar|
            // +---+----+---+------------+
            // | 1| 3.0| a| [3.0,291.0]|
            // | 2|-1.0| b|[-2.0,-98.0]|
            // | 3| 0.0| c| [0.0,0.0]|
            // +---+----+---+------------+

            df1.printSchema
            // root
            // |-- x: long (nullable = false)
            // |-- y: double (nullable = false)
            // |-- z: string (nullable = true)
            // |-- foobar: struct (nullable = true)
            // | |-- foo: double (nullable = false)
            // | |-- bar: double (nullable = false)


            This can be easily flattened later but usually there is no need for that.




          2. Switch to RDD, reshape and rebuild DF:



            import org.apache.spark.sql.types._
            import org.apache.spark.sql.Row

            def foobarFunc(x: Long, y: Double, z: String): Seq[Any] =
            Seq(x * y, z.head.toInt * y)

            val schema = StructType(df.schema.fields ++
            Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))

            val rows = df.rdd.map(r => Row.fromSeq(
            r.toSeq ++
            foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))

            val df2 = sqlContext.createDataFrame(rows, schema)

            df2.show
            // +---+----+---+----+-----+
            // | x| y| z| foo| bar|
            // +---+----+---+----+-----+
            // | 1| 3.0| a| 3.0|291.0|
            // | 2|-1.0| b|-2.0|-98.0|
            // | 3| 0.0| c| 0.0| 0.0|
            // +---+----+---+----+-----+







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Mar 8 '17 at 17:45

























          answered Oct 26 '15 at 12:23









          zero323zero323

          178k42528596




          178k42528596







          • 2





            When you say "usually there's no for [flattening a column]", why is that? Or does spark allow most things that you do with top-level columns to also be done with hierarchical data (like df1.foobar.foo)?

            – max
            Jun 20 '16 at 17:03







          • 2





            @max Because simple structs can be used in pretty much any context when one would normally use flat structure (with simple dot syntax fooobar.foo). It doesn't apply to collection types though. You can also check stackoverflow.com/a/33850490/1560062

            – zero323
            Jun 20 '16 at 17:33













          • 2





            When you say "usually there's no for [flattening a column]", why is that? Or does spark allow most things that you do with top-level columns to also be done with hierarchical data (like df1.foobar.foo)?

            – max
            Jun 20 '16 at 17:03







          • 2





            @max Because simple structs can be used in pretty much any context when one would normally use flat structure (with simple dot syntax fooobar.foo). It doesn't apply to collection types though. You can also check stackoverflow.com/a/33850490/1560062

            – zero323
            Jun 20 '16 at 17:33








          2




          2





          When you say "usually there's no for [flattening a column]", why is that? Or does spark allow most things that you do with top-level columns to also be done with hierarchical data (like df1.foobar.foo)?

          – max
          Jun 20 '16 at 17:03






          When you say "usually there's no for [flattening a column]", why is that? Or does spark allow most things that you do with top-level columns to also be done with hierarchical data (like df1.foobar.foo)?

          – max
          Jun 20 '16 at 17:03





          2




          2





          @max Because simple structs can be used in pretty much any context when one would normally use flat structure (with simple dot syntax fooobar.foo). It doesn't apply to collection types though. You can also check stackoverflow.com/a/33850490/1560062

          – zero323
          Jun 20 '16 at 17:33






          @max Because simple structs can be used in pretty much any context when one would normally use flat structure (with simple dot syntax fooobar.foo). It doesn't apply to collection types though. You can also check stackoverflow.com/a/33850490/1560062

          – zero323
          Jun 20 '16 at 17:33














          16














          Assume that after your function there will be a sequence of elements, giving an example as below:



          val df = sc.parallelize(List(("Mike,1986,Toronto", 30), ("Andre,1980,Ottawa", 36), ("jill,1989,London", 27))).toDF("infoComb", "age")
          df.show
          +------------------+---+
          | infoComb|age|
          +------------------+---+
          |Mike,1986,Toronto| 30|
          | Andre,1980,Ottawa| 36|
          | jill,1989,London| 27|
          +------------------+---+


          now what you can do with this infoComb is that you can start split the string and get more columns with:



          df.select(expr("(split(infoComb, ','))[0]").cast("string").as("name"), expr("(split(infoComb, ','))[1]").cast("integer").as("yearOfBorn"), expr("(split(infoComb, ','))[2]").cast("string").as("city"), $"age").show
          +-----+----------+-------+---+
          | name|yearOfBorn| city|age|
          +-----+----------+-------+---+
          |Mike| 1986|Toronto| 30|
          |Andre| 1980| Ottawa| 36|
          | jill| 1989| London| 27|
          +-----+----------+-------+---+


          Hope this helps.






          share|improve this answer























          • Couldn't you just say df.select('infoComb.*', 'age') The .* on a column name selects each field in the struct as a new column.

            – Malcolm McRoberts
            Sep 28 '18 at 17:53















          16














          Assume that after your function there will be a sequence of elements, giving an example as below:



          val df = sc.parallelize(List(("Mike,1986,Toronto", 30), ("Andre,1980,Ottawa", 36), ("jill,1989,London", 27))).toDF("infoComb", "age")
          df.show
          +------------------+---+
          | infoComb|age|
          +------------------+---+
          |Mike,1986,Toronto| 30|
          | Andre,1980,Ottawa| 36|
          | jill,1989,London| 27|
          +------------------+---+


          now what you can do with this infoComb is that you can start split the string and get more columns with:



          df.select(expr("(split(infoComb, ','))[0]").cast("string").as("name"), expr("(split(infoComb, ','))[1]").cast("integer").as("yearOfBorn"), expr("(split(infoComb, ','))[2]").cast("string").as("city"), $"age").show
          +-----+----------+-------+---+
          | name|yearOfBorn| city|age|
          +-----+----------+-------+---+
          |Mike| 1986|Toronto| 30|
          |Andre| 1980| Ottawa| 36|
          | jill| 1989| London| 27|
          +-----+----------+-------+---+


          Hope this helps.






          share|improve this answer























          • Couldn't you just say df.select('infoComb.*', 'age') The .* on a column name selects each field in the struct as a new column.

            – Malcolm McRoberts
            Sep 28 '18 at 17:53













          16












          16








          16







          Assume that after your function there will be a sequence of elements, giving an example as below:



          val df = sc.parallelize(List(("Mike,1986,Toronto", 30), ("Andre,1980,Ottawa", 36), ("jill,1989,London", 27))).toDF("infoComb", "age")
          df.show
          +------------------+---+
          | infoComb|age|
          +------------------+---+
          |Mike,1986,Toronto| 30|
          | Andre,1980,Ottawa| 36|
          | jill,1989,London| 27|
          +------------------+---+


          now what you can do with this infoComb is that you can start split the string and get more columns with:



          df.select(expr("(split(infoComb, ','))[0]").cast("string").as("name"), expr("(split(infoComb, ','))[1]").cast("integer").as("yearOfBorn"), expr("(split(infoComb, ','))[2]").cast("string").as("city"), $"age").show
          +-----+----------+-------+---+
          | name|yearOfBorn| city|age|
          +-----+----------+-------+---+
          |Mike| 1986|Toronto| 30|
          |Andre| 1980| Ottawa| 36|
          | jill| 1989| London| 27|
          +-----+----------+-------+---+


          Hope this helps.






          share|improve this answer













          Assume that after your function there will be a sequence of elements, giving an example as below:



          val df = sc.parallelize(List(("Mike,1986,Toronto", 30), ("Andre,1980,Ottawa", 36), ("jill,1989,London", 27))).toDF("infoComb", "age")
          df.show
          +------------------+---+
          | infoComb|age|
          +------------------+---+
          |Mike,1986,Toronto| 30|
          | Andre,1980,Ottawa| 36|
          | jill,1989,London| 27|
          +------------------+---+


          now what you can do with this infoComb is that you can start split the string and get more columns with:



          df.select(expr("(split(infoComb, ','))[0]").cast("string").as("name"), expr("(split(infoComb, ','))[1]").cast("integer").as("yearOfBorn"), expr("(split(infoComb, ','))[2]").cast("string").as("city"), $"age").show
          +-----+----------+-------+---+
          | name|yearOfBorn| city|age|
          +-----+----------+-------+---+
          |Mike| 1986|Toronto| 30|
          |Andre| 1980| Ottawa| 36|
          | jill| 1989| London| 27|
          +-----+----------+-------+---+


          Hope this helps.







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Jul 13 '16 at 0:46









          EdwinGuoEdwinGuo

          81021323




          81021323












          • Couldn't you just say df.select('infoComb.*', 'age') The .* on a column name selects each field in the struct as a new column.

            – Malcolm McRoberts
            Sep 28 '18 at 17:53

















          • Couldn't you just say df.select('infoComb.*', 'age') The .* on a column name selects each field in the struct as a new column.

            – Malcolm McRoberts
            Sep 28 '18 at 17:53
















          Couldn't you just say df.select('infoComb.*', 'age') The .* on a column name selects each field in the struct as a new column.

          – Malcolm McRoberts
          Sep 28 '18 at 17:53





          Couldn't you just say df.select('infoComb.*', 'age') The .* on a column name selects each field in the struct as a new column.

          – Malcolm McRoberts
          Sep 28 '18 at 17:53











          5














          If your resulting columns will be of the same length as the original one, you can create brand new columns with withColumn function and by applying an udf. After this you can drop your original column, eg:



           val newDf = myDf.withColumn("newCol1", myFun(myDf("originalColumn")))
          .withColumn("newCol2", myFun2(myDf("originalColumn"))
          .drop(myDf("originalColumn"))


          where myFun is an udf defined like this:



           def myFun= udf(
          (originalColumnContent : String) =>
          // do something with your original column content and return a new one

          )





          share|improve this answer

























          • Hi Niemand, I appreciate your reply...but it does not solve the question... in you code, you are calling the function "myDF" several times, whereas I would like that function to be called once, generate a class having multiple fields , and each field variable be returned as a new columns

            – sshroff
            Aug 25 '15 at 20:09












          • Well I', afraid that I presented the only one possible way for know, I don't think that any other way exists, but hopefully I am wrong ;). Also not that I did not call myFun several times - you can call other functions like myFun2, myFun3 etc. to create columns that you need.

            – Niemand
            Aug 25 '15 at 20:15















          5














          If your resulting columns will be of the same length as the original one, you can create brand new columns with withColumn function and by applying an udf. After this you can drop your original column, eg:



           val newDf = myDf.withColumn("newCol1", myFun(myDf("originalColumn")))
          .withColumn("newCol2", myFun2(myDf("originalColumn"))
          .drop(myDf("originalColumn"))


          where myFun is an udf defined like this:



           def myFun= udf(
          (originalColumnContent : String) =>
          // do something with your original column content and return a new one

          )





          share|improve this answer

























          • Hi Niemand, I appreciate your reply...but it does not solve the question... in you code, you are calling the function "myDF" several times, whereas I would like that function to be called once, generate a class having multiple fields , and each field variable be returned as a new columns

            – sshroff
            Aug 25 '15 at 20:09












          • Well I', afraid that I presented the only one possible way for know, I don't think that any other way exists, but hopefully I am wrong ;). Also not that I did not call myFun several times - you can call other functions like myFun2, myFun3 etc. to create columns that you need.

            – Niemand
            Aug 25 '15 at 20:15













          5












          5








          5







          If your resulting columns will be of the same length as the original one, you can create brand new columns with withColumn function and by applying an udf. After this you can drop your original column, eg:



           val newDf = myDf.withColumn("newCol1", myFun(myDf("originalColumn")))
          .withColumn("newCol2", myFun2(myDf("originalColumn"))
          .drop(myDf("originalColumn"))


          where myFun is an udf defined like this:



           def myFun= udf(
          (originalColumnContent : String) =>
          // do something with your original column content and return a new one

          )





          share|improve this answer















          If your resulting columns will be of the same length as the original one, you can create brand new columns with withColumn function and by applying an udf. After this you can drop your original column, eg:



           val newDf = myDf.withColumn("newCol1", myFun(myDf("originalColumn")))
          .withColumn("newCol2", myFun2(myDf("originalColumn"))
          .drop(myDf("originalColumn"))


          where myFun is an udf defined like this:



           def myFun= udf(
          (originalColumnContent : String) =>
          // do something with your original column content and return a new one

          )






          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Aug 25 '15 at 8:08

























          answered Aug 25 '15 at 7:59









          NiemandNiemand

          5,21643163




          5,21643163












          • Hi Niemand, I appreciate your reply...but it does not solve the question... in you code, you are calling the function "myDF" several times, whereas I would like that function to be called once, generate a class having multiple fields , and each field variable be returned as a new columns

            – sshroff
            Aug 25 '15 at 20:09












          • Well I', afraid that I presented the only one possible way for know, I don't think that any other way exists, but hopefully I am wrong ;). Also not that I did not call myFun several times - you can call other functions like myFun2, myFun3 etc. to create columns that you need.

            – Niemand
            Aug 25 '15 at 20:15

















          • Hi Niemand, I appreciate your reply...but it does not solve the question... in you code, you are calling the function "myDF" several times, whereas I would like that function to be called once, generate a class having multiple fields , and each field variable be returned as a new columns

            – sshroff
            Aug 25 '15 at 20:09












          • Well I', afraid that I presented the only one possible way for know, I don't think that any other way exists, but hopefully I am wrong ;). Also not that I did not call myFun several times - you can call other functions like myFun2, myFun3 etc. to create columns that you need.

            – Niemand
            Aug 25 '15 at 20:15
















          Hi Niemand, I appreciate your reply...but it does not solve the question... in you code, you are calling the function "myDF" several times, whereas I would like that function to be called once, generate a class having multiple fields , and each field variable be returned as a new columns

          – sshroff
          Aug 25 '15 at 20:09






          Hi Niemand, I appreciate your reply...but it does not solve the question... in you code, you are calling the function "myDF" several times, whereas I would like that function to be called once, generate a class having multiple fields , and each field variable be returned as a new columns

          – sshroff
          Aug 25 '15 at 20:09














          Well I', afraid that I presented the only one possible way for know, I don't think that any other way exists, but hopefully I am wrong ;). Also not that I did not call myFun several times - you can call other functions like myFun2, myFun3 etc. to create columns that you need.

          – Niemand
          Aug 25 '15 at 20:15





          Well I', afraid that I presented the only one possible way for know, I don't think that any other way exists, but hopefully I am wrong ;). Also not that I did not call myFun several times - you can call other functions like myFun2, myFun3 etc. to create columns that you need.

          – Niemand
          Aug 25 '15 at 20:15











          2














          I opted to create a function to flatten one column and then just call it simultaneously with the udf.



          First define this:



          implicit class DfOperations(df: DataFrame) 

          def flattenColumn(col: String) =
          def addColumns(df: DataFrame, cols: Array[String]): DataFrame =
          if (cols.isEmpty) df
          else addColumns(
          df.withColumn(col + "_" + cols.head, df(col + "." + cols.head)),
          cols.tail
          )


          val field = df.select(col).schema.fields(0)
          val newCols = field.dataType.asInstanceOf[StructType].fields.map(x => x.name)

          addColumns(df, newCols).drop(col)


          def withColumnMany(colName: String, col: Column) =
          df.withColumn(colName, col).flattenColumn(colName)





          Then usage is very simple:



          case class MyClass(a: Int, b: Int)

          val df = sc.parallelize(Seq(
          (0),
          (1)
          )).toDF("x")

          val f = udf((x: Int) => MyClass(x*2,x*3))

          df.withColumnMany("test", f($"x")).show()

          // +---+------+------+
          // | x|test_a|test_b|
          // +---+------+------+
          // | 0| 0| 0|
          // | 1| 2| 3|
          // +---+------+------+





          share|improve this answer























          • You don't have to do the whole withColumnMany thing. Just use select("select.*") to flatten it.

            – Assaf Mendelson
            Feb 28 '17 at 8:17















          2














          I opted to create a function to flatten one column and then just call it simultaneously with the udf.



          First define this:



          implicit class DfOperations(df: DataFrame) 

          def flattenColumn(col: String) =
          def addColumns(df: DataFrame, cols: Array[String]): DataFrame =
          if (cols.isEmpty) df
          else addColumns(
          df.withColumn(col + "_" + cols.head, df(col + "." + cols.head)),
          cols.tail
          )


          val field = df.select(col).schema.fields(0)
          val newCols = field.dataType.asInstanceOf[StructType].fields.map(x => x.name)

          addColumns(df, newCols).drop(col)


          def withColumnMany(colName: String, col: Column) =
          df.withColumn(colName, col).flattenColumn(colName)





          Then usage is very simple:



          case class MyClass(a: Int, b: Int)

          val df = sc.parallelize(Seq(
          (0),
          (1)
          )).toDF("x")

          val f = udf((x: Int) => MyClass(x*2,x*3))

          df.withColumnMany("test", f($"x")).show()

          // +---+------+------+
          // | x|test_a|test_b|
          // +---+------+------+
          // | 0| 0| 0|
          // | 1| 2| 3|
          // +---+------+------+





          share|improve this answer























          • You don't have to do the whole withColumnMany thing. Just use select("select.*") to flatten it.

            – Assaf Mendelson
            Feb 28 '17 at 8:17













          2












          2








          2







          I opted to create a function to flatten one column and then just call it simultaneously with the udf.



          First define this:



          implicit class DfOperations(df: DataFrame) 

          def flattenColumn(col: String) =
          def addColumns(df: DataFrame, cols: Array[String]): DataFrame =
          if (cols.isEmpty) df
          else addColumns(
          df.withColumn(col + "_" + cols.head, df(col + "." + cols.head)),
          cols.tail
          )


          val field = df.select(col).schema.fields(0)
          val newCols = field.dataType.asInstanceOf[StructType].fields.map(x => x.name)

          addColumns(df, newCols).drop(col)


          def withColumnMany(colName: String, col: Column) =
          df.withColumn(colName, col).flattenColumn(colName)





          Then usage is very simple:



          case class MyClass(a: Int, b: Int)

          val df = sc.parallelize(Seq(
          (0),
          (1)
          )).toDF("x")

          val f = udf((x: Int) => MyClass(x*2,x*3))

          df.withColumnMany("test", f($"x")).show()

          // +---+------+------+
          // | x|test_a|test_b|
          // +---+------+------+
          // | 0| 0| 0|
          // | 1| 2| 3|
          // +---+------+------+





          share|improve this answer













          I opted to create a function to flatten one column and then just call it simultaneously with the udf.



          First define this:



          implicit class DfOperations(df: DataFrame) 

          def flattenColumn(col: String) =
          def addColumns(df: DataFrame, cols: Array[String]): DataFrame =
          if (cols.isEmpty) df
          else addColumns(
          df.withColumn(col + "_" + cols.head, df(col + "." + cols.head)),
          cols.tail
          )


          val field = df.select(col).schema.fields(0)
          val newCols = field.dataType.asInstanceOf[StructType].fields.map(x => x.name)

          addColumns(df, newCols).drop(col)


          def withColumnMany(colName: String, col: Column) =
          df.withColumn(colName, col).flattenColumn(colName)





          Then usage is very simple:



          case class MyClass(a: Int, b: Int)

          val df = sc.parallelize(Seq(
          (0),
          (1)
          )).toDF("x")

          val f = udf((x: Int) => MyClass(x*2,x*3))

          df.withColumnMany("test", f($"x")).show()

          // +---+------+------+
          // | x|test_a|test_b|
          // +---+------+------+
          // | 0| 0| 0|
          // | 1| 2| 3|
          // +---+------+------+






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Jun 7 '16 at 13:49









          PekkaPekka

          1,40811126




          1,40811126












          • You don't have to do the whole withColumnMany thing. Just use select("select.*") to flatten it.

            – Assaf Mendelson
            Feb 28 '17 at 8:17

















          • You don't have to do the whole withColumnMany thing. Just use select("select.*") to flatten it.

            – Assaf Mendelson
            Feb 28 '17 at 8:17
















          You don't have to do the whole withColumnMany thing. Just use select("select.*") to flatten it.

          – Assaf Mendelson
          Feb 28 '17 at 8:17





          You don't have to do the whole withColumnMany thing. Just use select("select.*") to flatten it.

          – Assaf Mendelson
          Feb 28 '17 at 8:17











          -2














          This can be easily achieved by using pivot function



          df4.groupBy("year").pivot("course").sum("earnings").collect() 





          share|improve this answer

























          • Pang ,Thank-you for formatting it

            – Abhishek Kgsk
            Jan 20 '17 at 7:08











          • I don't see "year", "course" or "earnings" in any of the answers or o.p... what data frame are you talking about in this very terse answer (not)?

            – Kai
            May 26 '17 at 15:27
















          -2














          This can be easily achieved by using pivot function



          df4.groupBy("year").pivot("course").sum("earnings").collect() 





          share|improve this answer

























          • Pang ,Thank-you for formatting it

            – Abhishek Kgsk
            Jan 20 '17 at 7:08











          • I don't see "year", "course" or "earnings" in any of the answers or o.p... what data frame are you talking about in this very terse answer (not)?

            – Kai
            May 26 '17 at 15:27














          -2












          -2








          -2







          This can be easily achieved by using pivot function



          df4.groupBy("year").pivot("course").sum("earnings").collect() 





          share|improve this answer















          This can be easily achieved by using pivot function



          df4.groupBy("year").pivot("course").sum("earnings").collect() 






          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Mar 27 '17 at 8:04









          David Arenburg

          79.7k1297165




          79.7k1297165










          answered Jan 19 '17 at 6:02









          Abhishek KgskAbhishek Kgsk

          228




          228












          • Pang ,Thank-you for formatting it

            – Abhishek Kgsk
            Jan 20 '17 at 7:08











          • I don't see "year", "course" or "earnings" in any of the answers or o.p... what data frame are you talking about in this very terse answer (not)?

            – Kai
            May 26 '17 at 15:27


















          • Pang ,Thank-you for formatting it

            – Abhishek Kgsk
            Jan 20 '17 at 7:08











          • I don't see "year", "course" or "earnings" in any of the answers or o.p... what data frame are you talking about in this very terse answer (not)?

            – Kai
            May 26 '17 at 15:27

















          Pang ,Thank-you for formatting it

          – Abhishek Kgsk
          Jan 20 '17 at 7:08





          Pang ,Thank-you for formatting it

          – Abhishek Kgsk
          Jan 20 '17 at 7:08













          I don't see "year", "course" or "earnings" in any of the answers or o.p... what data frame are you talking about in this very terse answer (not)?

          – Kai
          May 26 '17 at 15:27






          I don't see "year", "course" or "earnings" in any of the answers or o.p... what data frame are you talking about in this very terse answer (not)?

          – Kai
          May 26 '17 at 15:27


















          draft saved

          draft discarded
















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f32196207%2fderive-multiple-columns-from-a-single-column-in-a-spark-dataframe%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Kamusi Yaliyomo Aina za kamusi | Muundo wa kamusi | Faida za kamusi | Dhima ya picha katika kamusi | Marejeo | Tazama pia | Viungo vya nje | UrambazajiKuhusu kamusiGo-SwahiliWiki-KamusiKamusi ya Kiswahili na Kiingerezakuihariri na kuongeza habari

          SQL error code 1064 with creating Laravel foreign keysForeign key constraints: When to use ON UPDATE and ON DELETEDropping column with foreign key Laravel error: General error: 1025 Error on renameLaravel SQL Can't create tableLaravel Migration foreign key errorLaravel php artisan migrate:refresh giving a syntax errorSQLSTATE[42S01]: Base table or view already exists or Base table or view already exists: 1050 Tableerror in migrating laravel file to xampp serverSyntax error or access violation: 1064:syntax to use near 'unsigned not null, modelName varchar(191) not null, title varchar(191) not nLaravel cannot create new table field in mysqlLaravel 5.7:Last migration creates table but is not registered in the migration table

          은진 송씨 목차 역사 본관 분파 인물 조선 왕실과의 인척 관계 집성촌 항렬자 인구 같이 보기 각주 둘러보기 메뉴은진 송씨세종실록 149권, 지리지 충청도 공주목 은진현