How to do multiple Dask computations without re-loading my large CSV
I have to process a number of large (ca. 10GB) CSV files. I'm currently using Dask to pre-process the data into some aggregated statistics, which I then further analyze with regular Pandas.
The problem I'm having is that Dask reloads the data for every call to compute(). Some dummy code to illustrate the problem:
import dask.dataframe as dd

ddf = dd.read_csv('very_large_file.csv')  # ca. 10GB

# Every line seems to trigger painfully slow re-reading of the CSV file from disk!
groupstats_A = ddf.groupby(['col1', 'col2']).mean().compute()
groupstats_B = ddf.groupby(['col3']).mean().compute()
groupstats_C = ddf.groupby(['col1', 'col2', 'col3']).mean().compute()
Is there a way to optimize this code so that compute() does not have to read the large file from disk on every call?
python dask
asked Mar 22 at 12:50 by Gijs
1 Answer
This is a lot like a duplicate, but I cannot find the original.
You can pass multiple lazy results to a single dask.compute() call, as follows, and any shared intermediates (such as the tasks that read the CSV) will be computed only once.
import dask

groupstats_A = ddf.groupby(['col1', 'col2']).mean()
groupstats_B = ddf.groupby(['col3']).mean()
groupstats_C = ddf.groupby(['col1', 'col2', 'col3']).mean()

# A single compute() call shares common tasks (including reading the CSV) across all three results.
A, B, C = dask.compute(groupstats_A, groupstats_B, groupstats_C)
Thanks! I did look for duplicates as well, but couldn't find it either. This answer is exactly what I was looking for, it makes everything so much faster.
– Gijs, Mar 25 at 9:46
answered Mar 22 at 16:32 by mdurant