New to PySpark - importing a CSV and creating a parquet file with array columns
I am new to PySpark and I've been pulling my hair out trying to accomplish something I believe is fairly simple. I am trying to do an ETL process where a CSV file is converted to a parquet file. The CSV file has a few simple columns, but one column is a delimited array of integers that I want to expand/unzip into the parquet file. The parquet file is consumed by a .NET Core microservice, which uses a parquet reader to do calculations downstream. To keep this question simple, the structure of the column is:
"geomap" 5:3:7|4:2:1|8:2:78 -> this represents an array of 3 items, it is split at the "|" and then a tuple is build of the values (5,3,7), (4,2,1), (8,2,78)
I have tried various processes and schemas and I can't get this right. Via a UDF I am creating either a list of lists or a list of tuples, but I can't get the schema correct or unzip/explode the data for the parquet write operation. I either get nulls, an error, or other problems. Do I need to approach this differently? Relevant code is below; I am showing only the problem column for simplicity since I have the rest working. This is my first PySpark attempt, so apologies if I'm missing something obvious:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

def convert_geo(geo):
    # Split 'a:b:c|d:e:f|...' into a list of (a, b, c) tuples.
    return [tuple(x.split(':')) for x in geo.split('|')]

compression_type = 'snappy'

schema = ArrayType(StructType([
    StructField("c1", IntegerType(), False),
    StructField("c2", IntegerType(), False),
    StructField("c3", IntegerType(), False)
]))

spark_convert_geo = udf(lambda z: convert_geo(z), schema)

source_path = '...path to csv'
destination_path = 'path for generated parquet file'

df = spark.read.option('delimiter', ',').option('header', 'true').csv(source_path) \
    .withColumn("geomap", spark_convert_geo(col('geomap')).alias("geomap"))
df.write.mode("overwrite").format('parquet').option('compression', compression_type).save(destination_path)
EDIT: Per request, adding the printSchema() output; I'm not sure what's wrong here either. I still can't get the split string values to show up or render properly. This contains all the columns. I do see the c1, c2, and c3 struct names...
root
 |-- lrsegid: integer (nullable = true)
 |-- loadsourceid: integer (nullable = true)
 |-- agencyid: integer (nullable = true)
 |-- acres: float (nullable = true)
 |-- sourcemap: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- geomap: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- c1: integer (nullable = false)
 |    |    |-- c2: integer (nullable = false)
 |    |    |-- c3: integer (nullable = false)
python apache-spark dataframe pyspark parquet
asked Mar 22 at 1:55, edited Mar 22 at 4:12
– MGK
Can you post the output of df.printSchema
– sramalingam24
Mar 22 at 3:27
Sure, I have edited the post with the output of printSchema(). It contains all the other columns I left out for simplicity purposes.
– MGK
Mar 22 at 4:10
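Aside (not raised in the thread): on Spark 2.4 and later, this conversion can also be done without a Python UDF, using the transform higher-order function in a SQL expression. A minimal sketch, assuming the same c1/c2/c3 struct layout as the question's schema:

from pyspark.sql import functions as F

# split() takes a Java regex, so the pipe delimiter must be escaped.
df = df.withColumn(
    "geomap",
    F.expr(
        "transform(split(geomap, '\\\\|'), s -> struct("
        "cast(split(s, ':')[0] as int) as c1, "
        "cast(split(s, ':')[1] as int) as c2, "
        "cast(split(s, ':')[2] as int) as c3))"
    ),
)

Keeping the conversion in Spark SQL avoids the Python serialization overhead that comes with a UDF.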
1 Answer
The problem is that the convert_geo function returns a list of tuples with string elements rather than the ints specified in the schema. If you modify it as follows, it will work:
def convert_geo(geo):
    # Cast each ':'-separated token to int so the tuples match the IntegerType schema.
    return [tuple([int(y) for y in x.split(':')]) for x in geo.split('|')]
– ags29, answered Mar 22 at 11:09
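For readers who want to reproduce the fix end to end, here is a minimal self-contained sketch; the SparkSession setup and the one-row demo frame are illustrative, not from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

geo_schema = ArrayType(StructType([
    StructField("c1", IntegerType(), False),
    StructField("c2", IntegerType(), False),
    StructField("c3", IntegerType(), False)
]))

def convert_geo(geo):
    # The int() cast is the fix: tuple elements now match IntegerType.
    return [tuple(int(y) for y in x.split(':')) for x in geo.split('|')]

spark_convert_geo = udf(convert_geo, geo_schema)

# One-row frame standing in for the CSV read.
demo = spark.createDataFrame([("5:3:7|4:2:1|8:2:78",)], ["geomap"])
demo.withColumn("geomap", spark_convert_geo(col("geomap"))).show(truncate=False)

Note that udf() accepts the function directly; the lambda wrapper in the question is harmless but unnecessary.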
I could have sworn I tried making the schema all StringType() and it still did not work. Let me check again. Also, what if the third item in the tuple actually needs to be a double? How do I edit the UDF to make the 3rd item a different value type?
– MGK
Mar 22 at 14:42
It worked for me with the above tweak. You could replace the list comprehension with a for loop and some conditional logic if you want different dtypes for the struct elements.
– ags29
Mar 22 at 14:47
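A sketch of that suggestion (not from the original answer), assuming hypothetically that the third field should be a double; the third StructField in the schema would change to DoubleType() to match:

from pyspark.sql.types import ArrayType, DoubleType, IntegerType, StructField, StructType

mixed_schema = ArrayType(StructType([
    StructField("c1", IntegerType(), False),
    StructField("c2", IntegerType(), False),
    StructField("c3", DoubleType(), False)  # third element is now a double
]))

def convert_geo_mixed(geo):
    out = []
    for x in geo.split('|'):
        row = []
        for i, token in enumerate(x.split(':')):
            # Conditional logic per position: last field as float, others as int.
            row.append(float(token) if i == 2 else int(token))
        out.append(tuple(row))
    return out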
You sir, are correct. Marking as answered. I was playing with a bunch of different schemas and structures, and I must have never tried matching the value types with the proper schema definition. I don't do a lot in Python and I wasn't sure if a list comprehension had a way to mix the value types on creation. A for loop would probably be a little slower, I assume? I guess it depends on the implementation of the list comprehension internally. But, yes, my parquet file now has 3 structures, same length, same repetition levels, and with the proper data.
– MGK
Mar 22 at 14:59
Thanks for accepting the answer. Thinking about it, you could probably avoid the for loop by having a list of type-casting functions the same length as the tuple (then zip with the colon-split list and use a list comprehension as before).
– ags29
Mar 22 at 15:07
Hmmm, perhaps. Do you happen to have an example of that? If not, no worries, I can play with the idea in a bit. Thanks again for your help. I'm a .NET developer and somewhat rusty in my Python. I'm sure I can figure it out.
– MGK
Mar 22 at 15:09
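For what it's worth, that zip idea might look like the sketch below (not from the thread, reusing the hypothetical int/int/double layout):

# One casting function per struct field, zipped against the ':'-split tokens.
casts = (int, int, float)

def convert_geo_zip(geo):
    return [tuple(f(v) for f, v in zip(casts, x.split(':')))
            for x in geo.split('|')]

# convert_geo_zip("5:3:7|4:2:1|8:2:78")
# -> [(5, 3, 7.0), (4, 2, 1.0), (8, 2, 78.0)]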