
EMR notebook session times out within seconds (using PySpark) on a large DataFrame


I am trying to run some operations on a PySpark DataFrame. The DataFrame looks something like this:



 user  domain1  domain2  ........  domain100  conversions
 abcd     1        0     ........      0           1
 gcea     0        0     ........      1           0
  .       .        .     ........      .           .
  .       .        .     ........      .           .
  .       .        .     ........      .           .

The code I use to operate on this DataFrame works perfectly fine when the DataFrame is small; for example, it runs without issue on a DataFrame of the following shape:



 (148457,41)


But if I increase the size of the DataFrame to, for example:



 (2184934,324)


I cannot proceed, because the notebook times out or throws a session-timeout error as soon as I execute any code against the DataFrame; even a simple count() operation times out.
This is what the timeout message looks like:



 An error was encountered:
Invalid status code '400' from
https://172.31.12.103:18888/sessions/5/statements/20 with error
payload:
"requirement failed: Session isn't active."


The timeout happens within 1 or 2 seconds; it does not take long to fail.



I am not calling collect(), toPandas(), or anything similar that would pull the data to the driver.
What I'm trying to do is undersample the data, but I can't even get a simple .count() operation to work once the DataFrame size is increased.



I have already tried different instance types in my EMR cluster. For the smaller DataFrame a c5.2xlarge instance is enough, but for the larger DataFrame it doesn't work even with c5.18xlarge instances. The cluster has 1 master node and 2 core (slave) nodes.
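(For reference, the Livy session behind an EMR notebook can also be given resources from the first cell via Sparkmagic's %%configure magic; the values below are illustrative placeholders, not settings I know to fix this:)

```
%%configure -f
{
    "driverMemory": "8g",
    "executorMemory": "8g",
    "executorCores": 4,
    "conf": {"spark.dynamicAllocation.enabled": "true"}
}
```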



This is what I'm trying to do to the DataFrame:



# Undersampling: sample the negative class down to roughly `ratio` times the positives.
from pyspark.sql.functions import col

def resample(base_features, ratio, class_field, base_class):
    pos = base_features.filter(col(class_field) == base_class)
    neg = base_features.filter(col(class_field) != base_class)
    total_pos = pos.count()
    total_neg = neg.count()
    fraction = float(total_pos * ratio) / float(total_neg)
    sampled = neg.sample(True, fraction)  # sample negatives with replacement
    return sampled.union(pos)

undersampled_df = resample(df, 10, 'conversions', 1)
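To make the intent of the sampling fraction concrete, here is the same arithmetic in plain Python with made-up counts (the numbers are illustrative, not from my data):

```python
# Illustrative arithmetic for the sampling fraction used in resample():
# keep enough negatives to be ~`ratio` times the number of positives.
total_pos = 1_000      # hypothetical count of positive rows
total_neg = 100_000    # hypothetical count of negative rows
ratio = 10             # desired negatives-per-positive

fraction = float(total_pos * ratio) / float(total_neg)
print(fraction)  # 0.1 -> sample ~10% of negatives, i.e. ~10x the positives
```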


How can I solve this issue? Any suggestions on what steps I should take?










      python amazon-web-services apache-spark pyspark amazon-emr






      edited Mar 29 at 20:16







      gara of the sand

















      asked Mar 28 at 18:25









gara of the sand
