


Data processing using AWS Glue

















I am trying to use PySpark on AWS Glue for data processing and data cleaning. The data is in CSV format and stored in S3. It has around 7k columns and 7k rows.
The cleaning is defined as a set of rules in another CSV file. I need to loop through each rule, query the data frame based on the rule's condition, and update the data based on the rule's action.

I loaded the data into a data frame, and the cleaning takes more than 3 hours.

How can I improve the performance? How can I parallelise the cleaning? In plain Python, I can divide the data into chunks and apply the cleaning rules to each chunk in parallel.

Please suggest whether AWS Glue is suited for this.

Regards
MaX
































  • Explain your scenario with sample data and a sample rule; it will help others understand your question and give you a better response. If you are using a DataFrame in AWS Glue and partitioning the data, you can process the data in parallel.

    – Abraham
    Mar 26 at 18:34
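
For illustration, the plain-Python approach mentioned in the question (split the rows into chunks and apply every rule to each chunk in parallel) could be sketched roughly as below. This is only a sketch under assumptions: the question does not show the rule file, so the rule format used here (`column`, `equals`, `set_to`) and the helper names `clean_row`, `clean_chunk`, `split_into_chunks` and `clean_parallel` are hypothetical, and rows are modelled as plain dicts rather than a Spark DataFrame.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Hypothetical rule format (not taken from the question):
# when row[column] == equals, set row[column] to set_to.
RULES = [
    {"column": "age", "equals": "-1", "set_to": ""},
    {"column": "country", "equals": "UK", "set_to": "GB"},
]


def clean_row(row, rules):
    """Return a cleaned copy of one row (a dict of column name -> value)."""
    cleaned = dict(row)
    for rule in rules:
        col = rule["column"]
        if cleaned.get(col) == rule["equals"]:
            cleaned[col] = rule["set_to"]
    return cleaned


def clean_chunk(chunk, rules):
    """Apply every rule to every row of one chunk."""
    return [clean_row(row, rules) for row in chunk]


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous, roughly equal slices."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def clean_parallel(rows, rules, executor, n_chunks=4):
    """Clean the chunks concurrently on the given executor, keeping row order."""
    chunks = split_into_chunks(rows, n_chunks)
    results = executor.map(partial(clean_chunk, rules=rules), chunks)
    return [row for chunk in results for row in chunk]


if __name__ == "__main__":
    sample = [
        {"age": "-1", "country": "UK"},
        {"age": "30", "country": "FR"},
    ]
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(clean_parallel(sample, RULES, pool, n_chunks=2))
```

A thread pool is used here only to keep the sketch easy to run. For CPU-bound cleaning in CPython, a `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and can be passed in instead. The key point is that each row is visited once with all rules applied, rather than the whole data set being scanned once per rule.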
























bigdata aws-glue
















asked Mar 25 at 9:55
mAx























