How to do multiple Dask computations without re-loading my large CSV



I have to process a number of large (ca. 10GB) CSV files. I'm currently using Dask to pre-process the data into some aggregated statistics, which I then further analyze with regular Pandas.



The problem I'm having is that Dask reloads the data for every call to compute(). Some dummy code to illustrate the problem:



import dask.dataframe as dd

ddf = dd.read_csv('very_large_file.csv')  # ca. 10GB

# Every line seems to trigger painfully slow re-reading of the CSV file from disk!
groupstats_A = ddf.groupby(['col1', 'col2']).mean().compute()
groupstats_B = ddf.groupby(['col3']).mean().compute()
groupstats_C = ddf.groupby(['col1', 'col2', 'col3']).mean().compute()


Is there a way to optimize this code so that compute() does not have to read the large file from disk on every call?

      python dask






asked Mar 22 at 12:50 by Gijs

          1 Answer
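This looks like a duplicate, but I cannot find the original.

You can pass several collections to a single dask.compute call, as follows. Any intermediate tasks they have in common (here, reading and parsing the CSV) are then shared instead of being repeated for each result.

import dask

# Build the lazy aggregations first; nothing is read at this point.
# ddf is the dataframe from the question.
groupstats_A = ddf.groupby(['col1', 'col2']).mean()
groupstats_B = ddf.groupby(['col3']).mean()
groupstats_C = ddf.groupby(['col1', 'col2', 'col3']).mean()

# One compute call for all three results
A, B, C = dask.compute(groupstats_A, groupstats_B, groupstats_C)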
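
To see the sharing of intermediates in isolation, here is a minimal sketch that uses dask.delayed instead of a CSV. The functions load, total and largest and the call counter are made up for illustration, with load standing in for the expensive file read. It assumes the default local threaded scheduler, so the counter lives in the same process as the tasks.

```python
import dask

# Hypothetical stand-in for the expensive CSV read; counts how often it runs.
calls = {"load": 0}


@dask.delayed
def load():
    calls["load"] += 1
    return [1, 2, 3, 4]


@dask.delayed
def total(xs):
    return sum(xs)


@dask.delayed
def largest(xs):
    return max(xs)


data = load()        # lazy: nothing has run yet
t = total(data)      # both results depend on the same "load" task
m = largest(data)

# Separate compute() calls: the "load" task is executed once per call.
t.compute()
m.compute()
separate_calls = calls["load"]

# A single dask.compute() call: the graphs are merged, so "load" runs once.
calls["load"] = 0
total_result, largest_result = dask.compute(t, m)
shared_calls = calls["load"]

print(separate_calls, shared_calls, total_result, largest_result)
```

Under those assumptions this should print 2 1 10 4: two loads for the separate calls, one load for the combined call. The dataframe case in the answer works the same way, with the CSV-reading tasks playing the role of load.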





• Thanks! I did look for duplicates as well, but couldn't find one either. This answer is exactly what I was looking for; it makes everything so much faster.
– Gijs, Mar 25 at 9:46

answered Mar 22 at 16:32 by mdurant