Write output directly from a dask workerHow to flush output of print function?How to randomly select an item from a list?Correct way to write line to file?dask s3 access on ec2 workersHow to load dataframe on all dask workersDask: longer run time than PandasHow to replicate data when it is faster to compute than transfer in dask distributed?Returning a dataframe in DaskAnalyzing data flow of Dask dataframesWriting a dask bag of data frame to disk (Generating 2 million features with dask and featuretools)
Is a USB 3.0 device possible with a four contact USB 2.0 connector?
What should I do if actually I found a serious flaw in someone's PhD thesis and an article derived from that PhD thesis?
If it isn't [someone's name]!
What is the opposite of "hunger level"?
Interaction between Leonin Warleader and Divine Visitation
What does a comma signify in inorganic chemistry?
Parse a simple key=value config file in C
Output the list of musical notes
Why is su world executable?
Why should P.I be willing to write strong LOR even if that means losing a undergraduate from his/her lab?
Not fallen in Latin
How to render "have ideas above his station" into German
Animate flow lines of time-dependent 3D dynamical system
What would cause a nuclear power plant to break down after 2000 years, but not sooner?
What would be synonyms for "be into something"?
Gofer work in exchange for LoR
Vegetarian dishes on Russian trains (European part)
Expressing a chain of boolean ORs using ILP
When does The Truman Show take place?
Adding things to bunches of things vs multiplication
Are there any rules on how characters go from 0th to 1st level in a class?
When did Bilbo and Frodo learn that Gandalf was a Maia?
What are the advantages of this gold finger shape?
If a person claims to know anything could it be disproven by saying 'prove that we are not in a simulation'?
Write output directly from a dask worker
How to flush output of print function?How to randomly select an item from a list?Correct way to write line to file?dask s3 access on ec2 workersHow to load dataframe on all dask workersDask: longer run time than PandasHow to replicate data when it is faster to compute than transfer in dask distributed?Returning a dataframe in DaskAnalyzing data flow of Dask dataframesWriting a dask bag of data frame to disk (Generating 2 million features with dask and featuretools)
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;
I have a pipeline that transforms (maps) a dataframe. The output is large - rows in the input dataframe contain audio in binary format and rows in the output dataframe contain extracted binary features.
I'm reading the input from a partitioned parquet file and writing it back to a different parquet file(s) - both on a network share.
From my understanding, in distributed dask, each worker will send the output back to the scheduler (and then maybe the scheduler sends it back to the client??) and only then will the scheduler (or the client) write it to the network share. Is this correct?
If yes, if the data is big and bandwidth is an issue it seems there is redundant communication in this scenario - why can't the workers send the output directly to the final destination (network share in this case)? Certainly, the share needs to be available to all workers, and someone needs to synchronize the writes, but isn't this what the magic of dask is about?
python dask
add a comment |
I have a pipeline that transforms (maps) a dataframe. The output is large - rows in the input dataframe contain audio in binary format and rows in the output dataframe contain extracted binary features.
I'm reading the input from a partitioned parquet file and writing it back to a different parquet file(s) - both on a network share.
From my understanding, in distributed dask, each worker will send the output back to the scheduler (and then maybe the scheduler sends it back to the client??) and only then will the scheduler (or the client) write it to the network share. Is this correct?
If yes, if the data is big and bandwidth is an issue it seems there is redundant communication in this scenario - why can't the workers send the output directly to the final destination (network share in this case)? Certainly, the share needs to be available to all workers, and someone needs to synchronize the writes, but isn't this what the magic of dask is about?
python dask
add a comment |
I have a pipeline that transforms (maps) a dataframe. The output is large - rows in the input dataframe contain audio in binary format and rows in the output dataframe contain extracted binary features.
I'm reading the input from a partitioned parquet file and writing it back to a different parquet file(s) - both on a network share.
From my understanding, in distributed dask, each worker will send the output back to the scheduler (and then maybe the scheduler sends it back to the client??) and only then will the scheduler (or the client) write it to the network share. Is this correct?
If yes, if the data is big and bandwidth is an issue it seems there is redundant communication in this scenario - why can't the workers send the output directly to the final destination (network share in this case)? Certainly, the share needs to be available to all workers, and someone needs to synchronize the writes, but isn't this what the magic of dask is about?
python dask
I have a pipeline that transforms (maps) a dataframe. The output is large - rows in the input dataframe contain audio in binary format and rows in the output dataframe contain extracted binary features.
I'm reading the input from a partitioned parquet file and writing it back to a different parquet file(s) - both on a network share.
From my understanding, in distributed dask, each worker will send the output back to the scheduler (and then maybe the scheduler sends it back to the client??) and only then will the scheduler (or the client) write it to the network share. Is this correct?
If yes, if the data is big and bandwidth is an issue it seems there is redundant communication in this scenario - why can't the workers send the output directly to the final destination (network share in this case)? Certainly, the share needs to be available to all workers, and someone needs to synchronize the writes, but isn't this what the magic of dask is about?
python dask
python dask
asked Mar 27 at 12:58
StavStav
6066 silver badges23 bronze badges
6066 silver badges23 bronze badges
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
Your understanding is incorrect: the workers will read and write to shared storage or cloud/network services directly, this is the normal way that things are calculated.
df = dd.read_parquet(url)
df_out = do_work(df)
df_out.to_parquet(url2)
In this case, the data is never seen by the scheduler or the client. They do communicate, though: the client will load metadata about the dataset, so that it can make inferences about how to split up the work to be done, and the scheduler talks to both the client and the workers to farm out these task specifications and check when they are done.
You can optionally bring the whole dataset into the memory of the client as a pandas dataframe with
local_df = df.compute()
but this is optional and obviously not recommended where the data size is bigger than memory. You usually never need to do this for the whole dataset, only maybe for some aggregate result much smaller than the original. Even in this case, the scheduler itself does not store the results.
Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?
– Stav
Mar 27 at 14:57
1
If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.
– mdurant
Mar 27 at 15:13
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55377801%2fwrite-output-directly-from-a-dask-worker%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Your understanding is incorrect: the workers will read and write to shared storage or cloud/network services directly, this is the normal way that things are calculated.
df = dd.read_parquet(url)
df_out = do_work(df)
df_out.to_parquet(url2)
In this case, the data is never seen by the scheduler or the client. They do communicate, though: the client will load metadata about the dataset, so that it can make inferences about how to split up the work to be done, and the scheduler talks to both the client and the workers to farm out these task specifications and check when they are done.
You can optionally bring the whole dataset into the memory of the client as a pandas dataframe with
local_df = df.compute()
but this is optional and obviously not recommended where the data size is bigger than memory. You usually never need to do this for the whole dataset, only maybe for some aggregate result much smaller than the original. Even in this case, the scheduler itself does not store the results.
Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?
– Stav
Mar 27 at 14:57
1
If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.
– mdurant
Mar 27 at 15:13
add a comment |
Your understanding is incorrect: the workers will read and write to shared storage or cloud/network services directly, this is the normal way that things are calculated.
df = dd.read_parquet(url)
df_out = do_work(df)
df_out.to_parquet(url2)
In this case, the data is never seen by the scheduler or the client. They do communicate, though: the client will load metadata about the dataset, so that it can make inferences about how to split up the work to be done, and the scheduler talks to both the client and the workers to farm out these task specifications and check when they are done.
You can optionally bring the whole dataset into the memory of the client as a pandas dataframe with
local_df = df.compute()
but this is optional and obviously not recommended where the data size is bigger than memory. You usually never need to do this for the whole dataset, only maybe for some aggregate result much smaller than the original. Even in this case, the scheduler itself does not store the results.
Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?
– Stav
Mar 27 at 14:57
1
If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.
– mdurant
Mar 27 at 15:13
add a comment |
Your understanding is incorrect: the workers will read and write to shared storage or cloud/network services directly, this is the normal way that things are calculated.
df = dd.read_parquet(url)
df_out = do_work(df)
df_out.to_parquet(url2)
In this case, the data is never seen by the scheduler or the client. They do communicate, though: the client will load metadata about the dataset, so that it can make inferences about how to split up the work to be done, and the scheduler talks to both the client and the workers to farm out these task specifications and check when they are done.
You can optionally bring the whole dataset into the memory of the client as a pandas dataframe with
local_df = df.compute()
but this is optional and obviously not recommended where the data size is bigger than memory. You usually never need to do this for the whole dataset, only maybe for some aggregate result much smaller than the original. Even in this case, the scheduler itself does not store the results.
Your understanding is incorrect: the workers will read and write to shared storage or cloud/network services directly, this is the normal way that things are calculated.
df = dd.read_parquet(url)
df_out = do_work(df)
df_out.to_parquet(url2)
In this case, the data is never seen by the scheduler or the client. They do communicate, though: the client will load metadata about the dataset, so that it can make inferences about how to split up the work to be done, and the scheduler talks to both the client and the workers to farm out these task specifications and check when they are done.
You can optionally bring the whole dataset into the memory of the client as a pandas dataframe with
local_df = df.compute()
but this is optional and obviously not recommended where the data size is bigger than memory. You usually never need to do this for the whole dataset, only maybe for some aggregate result much smaller than the original. Even in this case, the scheduler itself does not store the results.
answered Mar 27 at 13:34
mdurantmdurant
13.4k1 gold badge20 silver badges44 bronze badges
13.4k1 gold badge20 silver badges44 bronze badges
Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?
– Stav
Mar 27 at 14:57
1
If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.
– mdurant
Mar 27 at 15:13
add a comment |
Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?
– Stav
Mar 27 at 14:57
1
If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.
– mdurant
Mar 27 at 15:13
Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?
– Stav
Mar 27 at 14:57
Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?
– Stav
Mar 27 at 14:57
1
1
If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.
– mdurant
Mar 27 at 15:13
If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.
– mdurant
Mar 27 at 15:13
add a comment |
Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.
Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55377801%2fwrite-output-directly-from-a-dask-worker%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown