Write output directly from a dask workerHow to flush output of print function?How to randomly select an item from a list?Correct way to write line to file?dask s3 access on ec2 workersHow to load dataframe on all dask workersDask: longer run time than PandasHow to replicate data when it is faster to compute than transfer in dask distributed?Returning a dataframe in DaskAnalyzing data flow of Dask dataframesWriting a dask bag of data frame to disk (Generating 2 million features with dask and featuretools)

Is a USB 3.0 device possible with a four contact USB 2.0 connector?

What should I do if actually I found a serious flaw in someone's PhD thesis and an article derived from that PhD thesis?

If it isn't [someone's name]!

What is the opposite of "hunger level"?

Interaction between Leonin Warleader and Divine Visitation

What does a comma signify in inorganic chemistry?

Parse a simple key=value config file in C

Output the list of musical notes

Why is su world executable?

Why should P.I be willing to write strong LOR even if that means losing a undergraduate from his/her lab?

Not fallen in Latin

How to render "have ideas above his station" into German

Animate flow lines of time-dependent 3D dynamical system

What would cause a nuclear power plant to break down after 2000 years, but not sooner?

What would be synonyms for "be into something"?

Gofer work in exchange for LoR

Vegetarian dishes on Russian trains (European part)

Expressing a chain of boolean ORs using ILP

When does The Truman Show take place?

Adding things to bunches of things vs multiplication

Are there any rules on how characters go from 0th to 1st level in a class?

When did Bilbo and Frodo learn that Gandalf was a Maia?

What are the advantages of this gold finger shape?

If a person claims to know anything could it be disproven by saying 'prove that we are not in a simulation'?



Write output directly from a dask worker


How to flush output of print function?How to randomly select an item from a list?Correct way to write line to file?dask s3 access on ec2 workersHow to load dataframe on all dask workersDask: longer run time than PandasHow to replicate data when it is faster to compute than transfer in dask distributed?Returning a dataframe in DaskAnalyzing data flow of Dask dataframesWriting a dask bag of data frame to disk (Generating 2 million features with dask and featuretools)






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








0















I have a pipeline that transforms (maps) a dataframe. The output is large - rows in the input dataframe contain audio in binary format and rows in the output dataframe contain extracted binary features.



I'm reading the input from a partitioned parquet file and writing it back to a different parquet file(s) - both on a network share.



From my understanding, in distributed dask, each worker will send the output back to the scheduler (and then maybe the scheduler sends it back to the client??) and only then will the scheduler (or the client) write it to the network share. Is this correct?



If yes, if the data is big and bandwidth is an issue it seems there is redundant communication in this scenario - why can't the workers send the output directly to the final destination (network share in this case)? Certainly, the share needs to be available to all workers, and someone needs to synchronize the writes, but isn't this what the magic of dask is about?










share|improve this question






























    0















    I have a pipeline that transforms (maps) a dataframe. The output is large - rows in the input dataframe contain audio in binary format and rows in the output dataframe contain extracted binary features.



    I'm reading the input from a partitioned parquet file and writing it back to a different parquet file(s) - both on a network share.



    From my understanding, in distributed dask, each worker will send the output back to the scheduler (and then maybe the scheduler sends it back to the client??) and only then will the scheduler (or the client) write it to the network share. Is this correct?



    If yes, if the data is big and bandwidth is an issue it seems there is redundant communication in this scenario - why can't the workers send the output directly to the final destination (network share in this case)? Certainly, the share needs to be available to all workers, and someone needs to synchronize the writes, but isn't this what the magic of dask is about?










    share|improve this question


























      0












      0








      0








      I have a pipeline that transforms (maps) a dataframe. The output is large - rows in the input dataframe contain audio in binary format and rows in the output dataframe contain extracted binary features.



      I'm reading the input from a partitioned parquet file and writing it back to a different parquet file(s) - both on a network share.



      From my understanding, in distributed dask, each worker will send the output back to the scheduler (and then maybe the scheduler sends it back to the client??) and only then will the scheduler (or the client) write it to the network share. Is this correct?



      If yes, if the data is big and bandwidth is an issue it seems there is redundant communication in this scenario - why can't the workers send the output directly to the final destination (network share in this case)? Certainly, the share needs to be available to all workers, and someone needs to synchronize the writes, but isn't this what the magic of dask is about?










      share|improve this question














      I have a pipeline that transforms (maps) a dataframe. The output is large - rows in the input dataframe contain audio in binary format and rows in the output dataframe contain extracted binary features.



      I'm reading the input from a partitioned parquet file and writing it back to a different parquet file(s) - both on a network share.



      From my understanding, in distributed dask, each worker will send the output back to the scheduler (and then maybe the scheduler sends it back to the client??) and only then will the scheduler (or the client) write it to the network share. Is this correct?



      If yes, if the data is big and bandwidth is an issue it seems there is redundant communication in this scenario - why can't the workers send the output directly to the final destination (network share in this case)? Certainly, the share needs to be available to all workers, and someone needs to synchronize the writes, but isn't this what the magic of dask is about?







      python dask






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Mar 27 at 12:58









      StavStav

      6066 silver badges23 bronze badges




      6066 silver badges23 bronze badges

























          1 Answer
          1






          active

          oldest

          votes


















          1














          Your understanding is incorrect: the workers will read and write to shared storage or cloud/network services directly, this is the normal way that things are calculated.



          df = dd.read_parquet(url)
          df_out = do_work(df)
          df_out.to_parquet(url2)


          In this case, the data is never seen by the scheduler or the client. They do communicate, though: the client will load metadata about the dataset, so that it can make inferences about how to split up the work to be done, and the scheduler talks to both the client and the workers to farm out these task specifications and check when they are done.



          You can optionally bring the whole dataset into the memory of the client as a pandas dataframe with



          local_df = df.compute()


          but this is optional and obviously not recommended where the data size is bigger than memory. You usually never need to do this for the whole dataset, only maybe for some aggregate result much smaller than the original. Even in this case, the scheduler itself does not store the results.






          share|improve this answer

























          • Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?

            – Stav
            Mar 27 at 14:57







          • 1





            If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.

            – mdurant
            Mar 27 at 15:13










          Your Answer






          StackExchange.ifUsing("editor", function ()
          StackExchange.using("externalEditor", function ()
          StackExchange.using("snippets", function ()
          StackExchange.snippets.init();
          );
          );
          , "code-snippets");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "1"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55377801%2fwrite-output-directly-from-a-dask-worker%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1














          Your understanding is incorrect: the workers will read and write to shared storage or cloud/network services directly, this is the normal way that things are calculated.



          df = dd.read_parquet(url)
          df_out = do_work(df)
          df_out.to_parquet(url2)


          In this case, the data is never seen by the scheduler or the client. They do communicate, though: the client will load metadata about the dataset, so that it can make inferences about how to split up the work to be done, and the scheduler talks to both the client and the workers to farm out these task specifications and check when they are done.



          You can optionally bring the whole dataset into the memory of the client as a pandas dataframe with



          local_df = df.compute()


          but this is optional and obviously not recommended where the data size is bigger than memory. You usually never need to do this for the whole dataset, only maybe for some aggregate result much smaller than the original. Even in this case, the scheduler itself does not store the results.






          share|improve this answer

























          • Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?

            – Stav
            Mar 27 at 14:57







          • 1





            If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.

            – mdurant
            Mar 27 at 15:13















          1














          Your understanding is incorrect: the workers will read and write to shared storage or cloud/network services directly, this is the normal way that things are calculated.



          df = dd.read_parquet(url)
          df_out = do_work(df)
          df_out.to_parquet(url2)


          In this case, the data is never seen by the scheduler or the client. They do communicate, though: the client will load metadata about the dataset, so that it can make inferences about how to split up the work to be done, and the scheduler talks to both the client and the workers to farm out these task specifications and check when they are done.



          You can optionally bring the whole dataset into the memory of the client as a pandas dataframe with



          local_df = df.compute()


          but this is optional and obviously not recommended where the data size is bigger than memory. You usually never need to do this for the whole dataset, only maybe for some aggregate result much smaller than the original. Even in this case, the scheduler itself does not store the results.






          share|improve this answer

























          • Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?

            – Stav
            Mar 27 at 14:57







          • 1





            If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.

            – mdurant
            Mar 27 at 15:13













          1












          1








          1







          Your understanding is incorrect: the workers will read and write to shared storage or cloud/network services directly, this is the normal way that things are calculated.



          df = dd.read_parquet(url)
          df_out = do_work(df)
          df_out.to_parquet(url2)


          In this case, the data is never seen by the scheduler or the client. They do communicate, though: the client will load metadata about the dataset, so that it can make inferences about how to split up the work to be done, and the scheduler talks to both the client and the workers to farm out these task specifications and check when they are done.



          You can optionally bring the whole dataset into the memory of the client as a pandas dataframe with



          local_df = df.compute()


          but this is optional and obviously not recommended where the data size is bigger than memory. You usually never need to do this for the whole dataset, only maybe for some aggregate result much smaller than the original. Even in this case, the scheduler itself does not store the results.






          share|improve this answer













          Your understanding is incorrect: the workers will read and write to shared storage or cloud/network services directly, this is the normal way that things are calculated.



          df = dd.read_parquet(url)
          df_out = do_work(df)
          df_out.to_parquet(url2)


          In this case, the data is never seen by the scheduler or the client. They do communicate, though: the client will load metadata about the dataset, so that it can make inferences about how to split up the work to be done, and the scheduler talks to both the client and the workers to farm out these task specifications and check when they are done.



          You can optionally bring the whole dataset into the memory of the client as a pandas dataframe with



          local_df = df.compute()


          but this is optional and obviously not recommended where the data size is bigger than memory. You usually never need to do this for the whole dataset, only maybe for some aggregate result much smaller than the original. Even in this case, the scheduler itself does not store the results.







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Mar 27 at 13:34









          mdurantmdurant

          13.4k1 gold badge20 silver badges44 bronze badges




          13.4k1 gold badge20 silver badges44 bronze badges















          • Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?

            – Stav
            Mar 27 at 14:57







          • 1





            If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.

            – mdurant
            Mar 27 at 15:13

















          • Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?

            – Stav
            Mar 27 at 14:57







          • 1





            If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.

            – mdurant
            Mar 27 at 15:13
















          Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?

          – Stav
          Mar 27 at 14:57






          Thanks @mdurant! That's what I would expect but couldn't find an explicit documentation that states so. Also, how does dask decide if a worker can write directly? i.e if I save to a local file, where would the file be?

          – Stav
          Mar 27 at 14:57





          1




          1





          If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.

          – mdurant
          Mar 27 at 15:13





          If you supply a local path, the piece will be in the file-system of the worker. You will want to write to a shared network resource or cloud storage - this is also how you get parallel bandwidth.

          – mdurant
          Mar 27 at 15:13








          Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.







          Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.



















          draft saved

          draft discarded
















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55377801%2fwrite-output-directly-from-a-dask-worker%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          SQL error code 1064 with creating Laravel foreign keysForeign key constraints: When to use ON UPDATE and ON DELETEDropping column with foreign key Laravel error: General error: 1025 Error on renameLaravel SQL Can't create tableLaravel Migration foreign key errorLaravel php artisan migrate:refresh giving a syntax errorSQLSTATE[42S01]: Base table or view already exists or Base table or view already exists: 1050 Tableerror in migrating laravel file to xampp serverSyntax error or access violation: 1064:syntax to use near 'unsigned not null, modelName varchar(191) not null, title varchar(191) not nLaravel cannot create new table field in mysqlLaravel 5.7:Last migration creates table but is not registered in the migration table

          용인 삼성생명 블루밍스 목차 통계 역대 감독 선수단 응원단 경기장 같이 보기 외부 링크 둘러보기 메뉴samsungblueminx.comeh선수 명단용인 삼성생명 블루밍스용인 삼성생명 블루밍스ehsamsungblueminx.comeheheheh

          155 수학 과학 기타 둘러보기 메뉴eh추가해eh문서를 완성해