EMR notebook session times out within seconds (using PySpark) on a large dataframe
I am trying to do some operations on a pyspark dataframe. The dataframe looks something like this:
user  domain1  domain2  ...  domain100  conversions
abcd        1        0  ...          0            1
gcea        0        0  ...          1            0
 ...      ...      ...  ...        ...          ...
The code I use works fine for further operations on this dataframe as long as it is small; for example, everything works for a dataframe of the following shape:
(148457,41)
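(Shape here means (row count, number of columns); since a PySpark dataframe has no shape attribute, I read it roughly like this:)
# Assumed helper, shown only to make the shape tuples in this question concrete.
def df_shape(df):
    return (df.count(), len(df.columns))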
But if I increase the size of the dataframe to, for example:
(2184934,324)
I cannot proceed, because the notebook times out or throws a session timeout error as soon as I execute any code on the dataframe; even a simple count() operation times out.
This is what the timeout message looks like:
An error was encountered:
Invalid status code '400' from
https://172.31.12.103:18888/sessions/5/statements/20 with error
payload:
"requirement failed: Session isn't active."
The timeout happens within one or two seconds; it does not take long to fail.
I am not calling collect() or toPandas(), so it is not one of those operations that is timing out.
What I am trying to do with the dataframe is undersample it, but I cannot get even a simple .count() operation to work once the dataframe size is increased.
I have already tried different instance types in my EMR cluster. For the smaller dataframe a c5.2xlarge instance is enough, but the larger dataframe does not work even with c5.18xlarge instances. My cluster has 1 master node and 2 slave nodes.
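Since the "Session isn't active" message appears to come from the Livy session behind the EMR notebook, one thing worth checking is what resources that session was actually given; the defaults do not necessarily scale with the instance size. A minimal sketch (run in a notebook cell, assuming the spark session object that EMR notebooks provide):
# Sketch only: print the memory/executor settings the notebook session actually got.
# The keys below are standard Spark properties; anything never set falls back to
# the "<not set>" placeholder.
for key in ("spark.driver.memory", "spark.executor.memory",
            "spark.executor.cores", "spark.dynamicAllocation.enabled"):
    print(key, "=", spark.conf.get(key, "<not set>"))
If those come back small, they can presumably be raised before the session starts with sparkmagic's %%configure -f cell magic, or with the livy-conf / spark-defaults configuration classifications when the cluster is created, rather than only by choosing bigger instances.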
This is what I am trying to do to the dataframe:
# Undersampling: keep all positive rows, sample the negatives (with replacement)
# down to roughly ratio * number_of_positives rows in expectation.
from pyspark.sql.functions import col

def resample(base_features, ratio, class_field, base_class):
    pos = base_features.filter(col(class_field) == base_class)
    neg = base_features.filter(col(class_field) != base_class)
    total_pos = pos.count()
    total_neg = neg.count()
    fraction = float(total_pos * ratio) / float(total_neg)
    # sample() with replacement; fraction is the expected per-row sampling rate
    sampled = neg.sample(True, fraction)
    return sampled.union(pos)

undersampled_df = resample(df, 10, 'conversions', 1)
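For reference, here is a tiny sanity check of resample() on a toy dataframe with the same kind of columns (column names are only illustrative; the real frame has hundreds of domain columns):
# Toy sanity check of resample(); "spark" is the session provided by the notebook.
toy = spark.createDataFrame(
    [("abcd", 1, 0, 1), ("gcea", 0, 1, 0), ("hijk", 1, 1, 0), ("lmno", 0, 0, 0)],
    ["user", "domain1", "domain2", "conversions"],
)
resample(toy, 2, "conversions", 1).show()
As noted above, the same code runs fine at small scale; the problem only appears at the larger size.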
How can I solve this issue? Any suggestions on what steps I should take?
python amazon-web-services apache-spark pyspark amazon-emr
asked Mar 28 at 18:25 by gara of the sand (edited Mar 29 at 20:16)