
How to read large zip files in pyspark


I have a number of .zip files on S3 that I want to process to extract some data. Each zip file contains a single JSON file. Spark can read .gz files directly, but I haven't found a way to read the data inside .zip files. Can someone help me process large zip files in Spark using Python? I came across options such as newAPIHadoopFile, but had no luck with them and couldn't find a way to use them from PySpark. Please note that the zip files are larger than 1 GB, and some are as large as 20 GB.

Below is the code I used:

import io
import zipfile

file_name = "s3 file path for zip file"

def zip_extract(x):
    # x is a (path, content) pair from binaryFiles; content is the whole zip as bytes
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    # Map each member name to its fully decompressed contents
    return {name: file_obj.open(name).read() for name in file_obj.namelist()}


# sc is the active SparkContext
zips = sc.binaryFiles(file_name)
files_data = zips.map(zip_extract)

But it fails with the error below. The instance type I'm using is r4.2xlarge.

Exit code: 52
Stack trace: ExitCodeException exitCode=52: 
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0

edited Jul 8 at 14:02 by thebluephantom

asked Mar 28 at 14:29 by Sandie

Possible duplicate of "How to open/stream .zip files through Spark?" – Jim Todd, Mar 28 at 17:06

Had a look already, that is not working. – Sandie, Mar 28 at 17:22

Briefly add the code that you tried, and also the error you get. That would be great. – Jim Todd, Mar 28 at 17:23
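
For reference, here is a minimal sketch of a streaming alternative to the code in the question. It is not a confirmed fix. It assumes the failure is memory-related: exit code 52 is what Spark typically reports when an executor dies with an OutOfMemoryError, and sc.binaryFiles hands each zip to the task as a single in-memory bytes object. The helper name iter_zip_json_lines is hypothetical. It also assumes the single member is newline-delimited JSON and that the zip is available as a local file on the worker, for example after the task has downloaded it from S3 to a temporary file.

```python
import io
import json
import zipfile


def iter_zip_json_lines(zip_path, encoding="utf-8"):
    """Yield one parsed JSON record per non-empty line of the zip's first member.

    Assumption: the member is newline-delimited JSON. A single huge JSON
    document would need an incremental parser instead.
    """
    with zipfile.ZipFile(zip_path, "r") as archive:
        # The question states that each zip holds exactly one JSON file.
        member = archive.namelist()[0]
        # archive.open() returns a file-like object that decompresses lazily,
        # so only a small buffer is held in memory at a time.
        with archive.open(member) as raw:
            for line in io.TextIOWrapper(raw, encoding=encoding):
                line = line.strip()
                if line:
                    yield json.loads(line)


# Hypothetical Spark usage (not run here): distribute paths rather than bytes,
# so that no task ever materialises a whole archive in memory.
# local_paths = [...]  # placeholder: zip files already present on each worker
# records = sc.parallelize(local_paths, len(local_paths)).flatMap(iter_zip_json_lines)
```

Because a zip archive is not splittable, each file is still processed by one task. The gain is only that memory use stays roughly constant per task instead of growing with the archive size.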


python pyspark amazon-emr
