How to read large zip files in pyspark


I have a number of .zip files on S3 that I want to process to extract some data. Each zip file contains a single JSON file. Spark can read .gz files directly, but I haven't found a way to read the data inside .zip files. How can I process large zip files in Spark using Python? I came across options such as newAPIHadoopFile, but had no luck with them and couldn't find a way to use them from PySpark. Note that the zip files are larger than 1 GB, and some are as large as 20 GB.



Below is the code I used:



import zipfile
import io

file_name = "s3 file path for zip file"

def zip_extract(x):
    # x is a (path, content) pair from sc.binaryFiles; the whole archive is held in memory
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = file_obj.namelist()
    # Map each member name to its fully decompressed bytes
    return dict(zip(files, [file_obj.open(file).read() for file in files]))


zips = sc.binaryFiles(file_name)
files_data = zips.map(zip_extract)


But it fails with the error below. The instance type I'm using is r4.2xlarge.



Exit code: 52
Stack trace: ExitCodeException exitCode=52:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0









python pyspark amazon-emr






edited Jul 8 at 14:02 by thebluephantom
asked Mar 28 at 14:29 by Sandie















  • Possible duplicate of How to open/stream .zip files through Spark?

    – Jim Todd
    Mar 28 at 17:06











  • I had a look already; that does not work.

    – Sandie
    Mar 28 at 17:22











  • Please add the code you tried and the error you got. That would be great.

    – Jim Todd
    Mar 28 at 17:23
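
A possible direction, offered as a sketch rather than a tested solution for this cluster: sc.binaryFiles returns each file as a single record, so zip_extract above has to hold a whole archive, plus everything read() decompresses from it, in memory at once. As far as I know, exit code 52 is what a Spark executor reports when it runs out of memory, which would fit 20 GB inputs. The standard-library zipfile module can instead decompress a member incrementally. The helper below, iter_json_lines (a hypothetical name), assumes the archive is already on local disk and that the JSON is newline-delimited, neither of which the question states.

```python
import io
import json
import zipfile


def iter_json_lines(zip_path):
    """Yield one parsed JSON object per non-empty line of every archive member.

    Assumption (not stated in the question): the JSON is newline-delimited.
    """
    with zipfile.ZipFile(zip_path, "r") as archive:
        for name in archive.namelist():
            # archive.open returns a file-like object that decompresses
            # incrementally, so the member is never fully held in memory.
            with archive.open(name) as member:
                for line in io.TextIOWrapper(member, encoding="utf-8"):
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

In a Spark job, one option would be to distribute the list of archive paths with sc.parallelize and call a function like this from flatMap, after each task has copied its archive from S3 to local disk. That copy step is not shown here and depends on the cluster setup. If the JSON file is a single large document rather than one record per line, this line-by-line approach does not apply as written.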
















