EMR notebook session times out within seconds (using PySpark) on a large DataFrame


I am trying to do some operations on a PySpark DataFrame. The DataFrame looks something like this:



 user  domain1  domain2  ........  domain100  conversions
 abcd        1        0  ........          0            1
 gcea        0        0  ........          1            0
  ...      ...      ...  ........        ...          ...
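The schema above can be mocked with a few plain-Python rows (illustrative values only, not real data), which is handy for testing the sampling logic locally before running it on the cluster:

```python
# Illustrative rows mirroring the schema: user, domain indicator columns, conversions.
# Only two of the ~100 domain columns are shown to keep the mock small.
rows = [
    {"user": "abcd", "domain1": 1, "domain2": 0, "conversions": 1},
    {"user": "gcea", "domain1": 0, "domain2": 0, "conversions": 0},
]

# A DataFrame could be built from such rows with spark.createDataFrame(rows)
# (assuming an active SparkSession); here we just count the positive class.
positives = sum(1 for r in rows if r["conversions"] == 1)
print(positives)  # 1
```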


The code I use works perfectly fine for further operations on this DataFrame when it is small; for example, everything works for a DataFrame of the following shape:



 (148457,41)


But if I increase the size of the DataFrame to, for example:



 (2184934,324)


I cannot proceed any further, because the notebook times out or throws a session-timeout error as soon as I execute any code against the DataFrame; even a simple count() operation times out.
This is what the timeout message looks like:



 An error was encountered:
Invalid status code '400' from
https://172.31.12.103:18888/sessions/5/statements/20 with error
payload:
"requirement failed: Session isn't active."


The timeout happens within 1 or 2 seconds; it does not take long to fail.



I am not using collect() or any toPandas() operations that could explain the timeout.
What I'm trying to do to the above DataFrame is undersample the data, but I can't get even a simple .count() operation to work once the DataFrame size is increased.



I have already tried different instance types in my EMR cluster to make it work. For the smaller DataFrame a c5.2xlarge instance is enough, but the larger DataFrame doesn't work even with c5.18xlarge instances. I have 1 master node and 2 slave nodes in my cluster.



This is what I'm trying to do to the DataFrame:



# Undersampling.
from pyspark.sql.functions import col

def resample(base_features, ratio, class_field, base_class):
    pos = base_features.filter(col(class_field) == base_class)
    neg = base_features.filter(col(class_field) != base_class)
    total_pos = pos.count()
    total_neg = neg.count()
    fraction = float(total_pos * ratio) / float(total_neg)
    sampled = neg.sample(True, fraction)
    return sampled.union(pos)

undersampled_df = resample(df, 10, 'conversions', 1)
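For a sense of scale, the fraction passed to neg.sample() is just total_pos * ratio / total_neg. With hypothetical class counts (the real counts are not shown above), the arithmetic looks like this:

```python
# Hypothetical class counts, for illustration only -- not measured from the real data.
total_pos = 50_000       # rows where conversions == 1
total_neg = 2_134_934    # remaining rows of the (2184934, 324) DataFrame
ratio = 10

# Same arithmetic as in resample(): the fraction of negatives to keep.
fraction = float(total_pos * ratio) / float(total_neg)
print(round(fraction, 4))  # 0.2342
```

Note that because the sample is drawn with replacement (withReplacement=True), Spark treats the fraction as an expected count per row, so it may legitimately exceed 1.0 when ratio is large relative to the class imbalance.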


How can I solve this issue? Any suggestions on what steps I should take?










      python amazon-web-services apache-spark pyspark amazon-emr






edited Mar 29 at 20:16 by gara of the sand
asked Mar 28 at 18:25 by gara of the sand
























