Data processing using AWS Glue
I am trying to use PySpark on AWS Glue for data processing and data cleaning. The data is in CSV format and stored in S3. It has around 7k columns and 7k rows.
The cleaning rules are in another CSV file. I need to loop through each rule, query the data frame based on the rule's condition, and update the data based on the rule's action.
I loaded the data into a data frame, and the cleaning takes more than 3 hours.
How can I improve the performance? How can I parallelise the cleaning? In plain Python, I can divide the data into chunks and apply the cleaning rules to each chunk in parallel.
Please suggest whether AWS Glue is suited for this.
Regards
MaX
bigdata aws-glue
Please explain your scenario with sample data and a sample rule; it will help others understand your question and give you a better response. If you are using a DataFrame in AWS Glue and partitioning the data, you can process the data in parallel.
– Abraham
Mar 26 at 18:34
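The chunking idea mentioned in the question (which is roughly the role partitions play in Spark) can be sketched in plain Python as below. This is only an illustration under stated assumptions: the question does not show its rule file, so the rule layout (`if_col`, `equals`, `set_col`, `to`) and the helper names `clean_chunk`, `split`, and `clean_parallel` are hypothetical, and rows are modelled as plain dicts rather than a Spark DataFrame.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rule shape: the question does not show its rule CSV, so this
# layout (condition column/value, action column/value) is an assumption.
RULES = [
    {"if_col": "status", "equals": "N/A", "set_col": "status", "to": None},
    {"if_col": "country", "equals": "UK", "set_col": "country", "to": "GB"},
]


def clean_chunk(rows, rules):
    """Apply every rule, in order, to each row of one chunk."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        for rule in rules:
            if row.get(rule["if_col"]) == rule["equals"]:
                row[rule["set_col"]] = rule["to"]
        cleaned.append(row)
    return cleaned


def split(rows, n_chunks):
    """Split rows into at most n_chunks contiguous, order-preserving chunks."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def clean_parallel(rows, rules, n_chunks=4, executor_cls=ThreadPoolExecutor):
    """Clean each chunk in a worker pool and stitch the results back in order."""
    chunks = split(rows, n_chunks)
    with executor_cls(max_workers=n_chunks) as pool:
        results = pool.map(clean_chunk, chunks, [rules] * len(chunks))
        return [row for chunk in results for row in chunk]
```

Threads are used here only to keep the sketch short. For CPU-bound pure-Python work the global interpreter lock limits what threads gain, so passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls` would be the more likely choice in practice. Each rule is still applied to every row, so the total work stays proportional to rows times rules; chunking only spreads it across workers.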
asked Mar 25 at 9:55 by mAx