How to read large zip files in pyspark
I have a number of .zip files on S3 that I want to process to extract some data. Each zip file contains a single JSON file. Spark can read .gz files, but I haven't found a way to read the data inside .zip files. Can someone please help me process large zip files in Spark using Python? I came across options like newAPIHadoopFile, but had no luck with them and couldn't find a way to use them from PySpark. Please note that the zip files are larger than 1 GB, and some are as large as 20 GB.
Below is the code I used:
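import io
import zipfile

file_name = "s3 file path for zip file"

def zip_extract(x):
    # x is a (path, content) pair from sc.binaryFiles; content is the whole archive as bytes
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = file_obj.namelist()
    # Map each member name to its fully decompressed contents
    return {name: file_obj.open(name).read() for name in files}

# sc is the SparkContext provided by the PySpark shell
zips = sc.binaryFiles(file_name)
files_data = zips.map(zip_extract)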
But it fails with the error below. The instance type I'm using is r4.2xlarge.
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
python pyspark amazon-emr
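A possible direction, offered as a sketch rather than a confirmed fix: the snippet above holds each whole archive in memory as bytes (that is how `sc.binaryFiles` delivers it) and then calls `.read()` on every member, so a 20 GB zip needs far more memory than a task is likely to have. Spark's exit code 52 is commonly associated with an out-of-memory failure, which would be consistent with this. The example below uses only the standard library and shows the lower-memory idea: open the archive from a seekable file-like object and iterate over a member lazily instead of reading it whole. The helper name `iter_json_lines` is hypothetical, and the code assumes the JSON member is newline-delimited, which the question does not state.

```python
import io
import json
import zipfile


def iter_json_lines(fileobj):
    """Yield (member_name, record) pairs from a zip archive, one line at a time.

    Assumption: each member is newline-delimited JSON. A single large JSON
    document would need an incremental parser instead.
    """
    with zipfile.ZipFile(fileobj, "r") as archive:
        for name in archive.namelist():
            # archive.open() returns a file-like object that decompresses
            # lazily, so only a small buffer is held in memory at a time.
            with archive.open(name) as member:
                for raw_line in member:
                    line = raw_line.strip()
                    if line:
                        yield name, json.loads(line)


# Small in-memory demo archive, standing in for a real file on disk or S3.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.json", '{"id": 1}\n{"id": 2}\n')
buf.seek(0)

records = list(iter_json_lines(buf))
```

In a Spark job, a generator like this could be called inside `flatMap` over an RDD of archive paths, with each executor opening its own archive as a seekable file object. How to obtain such an object for an S3 path depends on the cluster setup and is not shown here. Because a zip archive is not splittable, parallelism would be per archive, not within one.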
Possible duplicate of How to open/stream .zip files through Spark?
– Jim Todd
Mar 28 at 17:06
Had a look already, that is not working.
– Sandie
Mar 28 at 17:22
Briefly add your code that you tried, and also the error you get. That would be great.
– Jim Todd
Mar 28 at 17:23
edited Jul 8 at 14:02 by thebluephantom
asked Mar 28 at 14:29 by Sandie