New to Pyspark - importing a CSV and creating a parquet file with array columns

Tags: python apache-spark dataframe pyspark parquet


I am new to PySpark and I've been pulling my hair out trying to accomplish something I believe is fairly simple. I am trying to do an ETL process where a CSV file is converted to a Parquet file. The CSV file has a few simple columns, but one column is a delimited array of integers that I want to expand into the Parquet file. That Parquet file is consumed by a .NET Core microservice, which uses a Parquet reader to do calculations downstream. To keep this question simple, the structure of the column is:



"geomap" 5:3:7|4:2:1|8:2:78 -> this represents an array of 3 items, it is split at the "|" and then a tuple is build of the values (5,3,7), (4,2,1), (8,2,78)



I have tried various processes and schemas and I can't get this right. Via a UDF I am creating either a list of lists or a list of tuples, but I can't get the schema correct or explode the data into the Parquet write operation; I either get nulls, an error, or other problems. Do I need to approach this differently? Relevant code is below. I am only showing the problem column for simplicity, since I have the rest working. This is my first PySpark attempt, so apologies if I'm missing something obvious:



from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

def convert_geo(geo):
    return [tuple(x.split(':')) for x in geo.split('|')]

compression_type = 'snappy'

schema = ArrayType(StructType([
    StructField("c1", IntegerType(), False),
    StructField("c2", IntegerType(), False),
    StructField("c3", IntegerType(), False)
]))

spark_convert_geo = udf(lambda z: convert_geo(z), schema)

source_path = '...path to csv'
destination_path = 'path for generated parquet file'

df = spark.read.option('delimiter', ',').option('header', 'true').csv(source_path) \
    .withColumn("geomap", spark_convert_geo(col('geomap')).alias("geomap"))
df.write.mode("overwrite").format('parquet').option('compression', compression_type).save(destination_path)


EDIT: Per request, adding the printSchema() output. I'm not sure what's wrong here either; I still can't get the split string values to show up or render properly. This output contains all the columns. I do see the c1, c2 and c3 struct names...



root
 |-- lrsegid: integer (nullable = true)
 |-- loadsourceid: integer (nullable = true)
 |-- agencyid: integer (nullable = true)
 |-- acres: float (nullable = true)
 |-- sourcemap: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- geomap: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- c1: integer (nullable = false)
 |    |    |-- c2: integer (nullable = false)
 |    |    |-- c3: integer (nullable = false)
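One quick way to see whether the UDF is producing structs or just nulls is to inspect a few converted rows before writing the Parquet file. This is a minimal sketch, assuming the df built in the code above:

from pyspark.sql.functions import explode

# Peek at the converted column; all-null values would point at a type/schema mismatch.
df.select("geomap").show(5, truncate=False)

# Flatten the array of structs into rows to check the individual c1/c2/c3 values.
df.select(explode("geomap").alias("g")).select("g.c1", "g.c2", "g.c3").show(5)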









  • Can you post the output of df.printSchema()?

    – sramalingam24
    Mar 22 at 3:27











  • Sure, I have edited the post with the output of printSchema(). It contains all the other columns I left out for simplicity purposes.

    – MGK
    Mar 22 at 4:10

















1 Answer
The problem is that the convert_geo function returns a list of tuples whose elements are strings rather than the ints declared in the schema, which is why the column comes back as nulls. If you modify it as follows it will work:



def convert_geo(geo):
    return [tuple([int(y) for y in x.split(':')]) for x in geo.split('|')]
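As a quick sanity check of the fix, the parser can be run in plain Python without a Spark session (the sample string is the one from the question):

# The corrected parser from the answer above.
def convert_geo(geo):
    return [tuple([int(y) for y in x.split(':')]) for x in geo.split('|')]

print(convert_geo("5:3:7|4:2:1|8:2:78"))
# [(5, 3, 7), (4, 2, 1), (8, 2, 78)]

With integer elements the tuples now match the IntegerType() fields declared in the schema, so the UDF produces proper structs instead of nulls.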





  • I could have sworn I tried making the schema all StringType() and it still did not work. Let me check again. Also, what if the third item in the tuple actually needs to be a double? How do I edit the UDF to make the 3rd item a different value type?

    – MGK
    Mar 22 at 14:42











  • It worked for me with the above tweak. You could replace the list comprehension with a for loop and some conditional logic if you want different dtypes for the struct elements.

    – ags29
    Mar 22 at 14:47











  • You sir, are correct. Marking as answered. I was playing with a bunch of different schemas and structures, and I must have never tried matching the value types with the proper schema definition. I don't do a lot in Python and I wasn't sure if a list comprehension had a way to mix the value types on creation. A for loop would probably be a little slower, I assume? I guess it depends on the implementation of the list comprehension internally. But, yes, my parquet file now has 3 structures, same length, same repetition levels, and with the proper data.

    – MGK
    Mar 22 at 14:59












  • Thanks for accepting the answer. Thinking about it, you could probably avoid the for loop by having a list of type-casting functions the same length as the tuple (then zip with the colon split list and use a list comprehension as before)

    – ags29
    Mar 22 at 15:07











  • Hmmm, perhaps. Do you happen to have an example of that? If not, no worries, I can play with the idea in a bit. Thanks again for your help. I'm a .NET developer and somewhat rusty in my Python. I'm sure I can figure it out.

    – MGK
    Mar 22 at 15:09
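For reference, a minimal sketch of the idea ags29 describes (a list of casting functions zipped with the colon-split values); the int/int/float choice here is purely illustrative, and the schema's field types have to agree with whichever casts are used:

from pyspark.sql.types import ArrayType, DoubleType, IntegerType, StructField, StructType

# One casting function per position in the ':'-separated triple (illustrative choice).
CASTS = (int, int, float)

def convert_geo(geo):
    # Zip each value with its casting function, so each position can get a different type.
    return [tuple(cast(value) for cast, value in zip(CASTS, item.split(':')))
            for item in geo.split('|')]

# The UDF schema must agree with the casts: a Python float maps to DoubleType.
schema = ArrayType(StructType([
    StructField("c1", IntegerType(), False),
    StructField("c2", IntegerType(), False),
    StructField("c3", DoubleType(), False)
]))

print(convert_geo("5:3:7|4:2:1|8:2:78"))
# [(5, 3, 7.0), (4, 2, 1.0), (8, 2, 78.0)]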










