How to do multiple Dask computations without re-loading my large CSV
I have to process a number of large (ca. 10GB) CSV files. I'm currently using Dask to pre-process the data into some aggregated statistics, which I then further analyze with regular Pandas.
The problem I'm having is that Dask reloads the data for every call to compute(). Some dummy code to illustrate the problem:
import dask.dataframe as dd

ddf = dd.read_csv('very_large_file.csv')  # ca. 10GB

# Every line seems to trigger painfully slow re-reading of the CSV file from disk!
groupstats_A = ddf.groupby(['col1', 'col2']).mean().compute()
groupstats_B = ddf.groupby(['col3']).mean().compute()
groupstats_C = ddf.groupby(['col1', 'col2', 'col3']).mean().compute()
Is there a way to optimize this code so that compute() does not have to read the large file from disk on every call?
python dask
asked Mar 22 at 12:50 by Gijs
1 Answer
This is a lot like a duplicate, but I cannot find the original.
You can pass multiple lazy results to a single dask.compute() call, as follows, and any shared intermediates (such as the tasks that read the CSV) will be computed only once.
import dask

groupstats_A = ddf.groupby(['col1', 'col2']).mean()
groupstats_B = ddf.groupby(['col3']).mean()
groupstats_C = ddf.groupby(['col1', 'col2', 'col3']).mean()

# A single compute() call shares common tasks (including reading the CSV) across all three results.
A, B, C = dask.compute(groupstats_A, groupstats_B, groupstats_C)
Thanks! I did look for duplicates as well, but couldn't find it either. This answer is exactly what I was looking for, it makes everything so much faster.
– Gijs, Mar 25 at 9:46
answered Mar 22 at 16:32 by mdurant