EMR notebook session times out within seconds (using PySpark) on a large DataFrame


I am trying to do some operations on a PySpark DataFrame. The DataFrame looks something like this:



 user  domain1  domain2  ........  domain100  conversions
 abcd        1        0  ........          0            1
 gcea        0        0  ........          1            0
  ...      ...      ...  ........        ...          ...
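The schema above can be mocked with a few plain-Python rows (illustrative values only, not real data), which is handy for testing the sampling logic locally before running it on the cluster:

```python
# Illustrative rows mirroring the schema: user, domain indicator columns, conversions.
# Only two of the ~100 domain columns are shown to keep the mock small.
rows = [
    {"user": "abcd", "domain1": 1, "domain2": 0, "conversions": 1},
    {"user": "gcea", "domain1": 0, "domain2": 0, "conversions": 0},
]

# A DataFrame could be built from such rows with spark.createDataFrame(rows)
# (assuming an active SparkSession); here we just count the positive class.
positives = sum(1 for r in rows if r["conversions"] == 1)
print(positives)  # 1
```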


The code I use works perfectly fine for further operations on this DataFrame when it is small; for example, everything works for a DataFrame of the following shape:



 (148457,41)


But if I increase the size of the DataFrame to, for example:



 (2184934,324)


I cannot proceed any further, because the notebook times out or throws a session-timeout error as soon as I execute any code against the DataFrame; even a simple count() operation times out.
This is what the timeout message looks like:



 An error was encountered:
Invalid status code '400' from
https://172.31.12.103:18888/sessions/5/statements/20 with error
payload:
"requirement failed: Session isn't active."


The timeout happens within 1 or 2 seconds; it does not take long to fail.



I am not using collect() or any toPandas() operations that could explain the timeout.
What I'm trying to do to the above DataFrame is undersample the data, but I can't get even a simple .count() operation to work once the DataFrame size is increased.



I have already tried different instance types in my EMR cluster to make it work. For the smaller DataFrame a c5.2xlarge instance is enough, but the larger DataFrame doesn't work even with c5.18xlarge instances. I have 1 master node and 2 slave nodes in my cluster.



This is what I'm trying to do to the DataFrame:



# Undersampling.
from pyspark.sql.functions import col

def resample(base_features, ratio, class_field, base_class):
    pos = base_features.filter(col(class_field) == base_class)
    neg = base_features.filter(col(class_field) != base_class)
    total_pos = pos.count()
    total_neg = neg.count()
    fraction = float(total_pos * ratio) / float(total_neg)
    sampled = neg.sample(True, fraction)
    return sampled.union(pos)

undersampled_df = resample(df, 10, 'conversions', 1)
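For a sense of scale, the fraction passed to neg.sample() is just total_pos * ratio / total_neg. With hypothetical class counts (the real counts are not shown above), the arithmetic looks like this:

```python
# Hypothetical class counts, for illustration only -- not measured from the real data.
total_pos = 50_000       # rows where conversions == 1
total_neg = 2_134_934    # remaining rows of the (2184934, 324) DataFrame
ratio = 10

# Same arithmetic as in resample(): the fraction of negatives to keep.
fraction = float(total_pos * ratio) / float(total_neg)
print(round(fraction, 4))  # 0.2342
```

Note that because the sample is drawn with replacement (withReplacement=True), Spark treats the fraction as an expected count per row, so it may legitimately exceed 1.0 when ratio is large relative to the class imbalance.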


How can I solve this issue? Any suggestions on what steps I should take?










      python amazon-web-services apache-spark pyspark amazon-emr






edited Mar 29 at 20:16 by gara of the sand
asked Mar 28 at 18:25 by gara of the sand
























