OpenMP On-Demand Nested Parallelism









I have a list of jobs, which I am processing in parallel with OpenMP:



    void processAllJobs()
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            processJob(i);
    }



Each job has some sequential parts and some parts that could be parallelized if the job were called on its own:



    void processJob(int i)
    {
        for (int iteration = 0; iteration < iterationCount; ++iteration)
        {
            doSomePreparation(i);
            std::vector<Subtask> subtasks = getSubtasks(i);
            #pragma omp parallel for
            for (int j = 0; j < subtasks.size(); ++j)
                subtasks[j].Process();
            doSomePostProcessing(i);
        }
    }




When I run processAllJobs(), threads are created for the outer loop (over the jobs), and the inner loop (over the subtasks) is executed sequentially within each thread. This is all fine and intended.
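The behavior described above can be checked with a small experiment. This is a minimal sketch, not code from the question: largestInnerTeam is a hypothetical helper, and it assumes the runtime is limited to one active parallel level (set explicitly here with omp_set_max_active_levels(1) so the result does not depend on the implementation's default). The OpenMP calls are guarded so the code also builds without OpenMP enabled.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns the largest inner team size seen by any outer iteration.
int largestInnerTeam(int jobs)
{
#ifdef _OPENMP
    omp_set_max_active_levels(1); // only the outermost parallel region may be active
#endif
    int largest = 0;
    #pragma omp parallel for reduction(max : largest)
    for (int i = 0; i < jobs; ++i)
    {
        int innerTeam = 1; // value used when OpenMP is disabled
        #pragma omp parallel
        {
#ifdef _OPENMP
            // One thread of the inner team records how large that team is.
            #pragma omp single
            innerTeam = omp_get_num_threads();
#endif
        }
        if (innerTeam > largest)
            largest = innerTeam;
    }
    return largest;
}
```

With one active level, every inner region runs with a team of one thread, so the inner loop is effectively sequential inside each outer thread, which matches what the question reports.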



Sometimes there are very large jobs that take a long time to process, long enough that all other threads in the outer loop finish well before the last one and then sit idle. Is there a way to repurpose the idle threads to parallelize the inner loop as soon as they are finished? I imagine something that checks the number of idle threads each time the inner parallel region is entered.



I cannot predict how long a job runs. There might also be more than one long-running job, maybe two or three.







c++ parallel-processing openmp
















asked Mar 26 at 22:10









Nico Schertler











  • You could use dynamic scheduling in the outer loop with a small number of threads, and nested parallelism in the inner loop, also with a controlled number of threads. If your total number of threads is 16, you can try num_threads(4) in both cases. With dynamic scheduling, fast threads finish early and can process several small chunks while a long job is still running. With nested parallelism you guarantee that several threads are used for long jobs.

    – Alain Merigot
    Mar 26 at 22:54

  • Try task loops for both levels. Whether or not that will be of benefit is impossible to tell from the information in the question.

    – Zulan
    Mar 26 at 22:59

  • @Alain: Thanks for the suggestion. That would probably help a bit, although it only shifts the problem. In the example with a single long-running job, the initial approach leaves 15 threads idle, while the modified approach (with four working threads) still leaves 12 idle. And the sequential parts would not benefit from maximum parallelization.

    – Nico Schertler
    Mar 26 at 23:55
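For reference, the first comment's suggestion (dynamic scheduling outside, a fixed-size nested team inside) could be sketched roughly as follows. This is an illustration under assumptions, not code from the thread: the job sizes, the processedSubtasks counter, and the integer vector standing in for std::vector<Subtask> are all made up so that the example is self-contained, and the 4 x 4 split assumes 16 available threads as in the comment.

```cpp
#include <atomic>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

std::atomic<long> processedSubtasks{0}; // hypothetical counter for the demo

// Hypothetical stand-in for processJob(): job i owns (i + 1) * 100 subtasks.
void processJob(int i)
{
    std::vector<int> subtasks((i + 1) * 100, 1);
    const int count = static_cast<int>(subtasks.size());
    // Inner team: up to 4 threads per job when nesting is enabled.
    #pragma omp parallel for num_threads(4)
    for (int j = 0; j < count; ++j)
        processedSubtasks += subtasks[j]; // placeholder for subtasks[j].Process()
}

long processAllJobs(int n)
{
#ifdef _OPENMP
    omp_set_max_active_levels(2); // allow one level of nested parallelism
#endif
    processedSubtasks = 0;
    // Outer team: 4 threads fetch jobs one at a time as they become free.
    #pragma omp parallel for schedule(dynamic, 1) num_threads(4)
    for (int i = 0; i < n; ++i)
        processJob(i);
    return processedSubtasks;
}
```

As the follow-up comment points out, the inner team size is fixed when the inner region is entered, so outer threads that run out of jobs still cannot join a long-running job's inner team.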












1 Answer
From your description of the problem, OpenMP tasking sounds like a much better choice. Your code would then look like this:



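    void processAllJobs()
    {
        // "parallel master" is a combined construct from OpenMP 5.0;
        // with older compilers, use "parallel" followed by "single".
        #pragma omp parallel master
        for (int i = 0; i < n; ++i)
            #pragma omp task
            processJob(i);
    }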



Then the processing of the job would look like this:



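    void processJob(int i)
    {
        for (int iteration = 0; iteration < iterationCount; ++iteration)
        {
            doSomePreparation(i);
            std::vector<Subtask> subtasks = getSubtasks(i);
            // shared(subtasks): a local variable would otherwise default to
            // firstprivate here, giving each generated task its own copy of the vector.
            // Add a grainsize() clause if Process() is very short.
            #pragma omp taskloop shared(subtasks)
            for (int j = 0; j < subtasks.size(); ++j)
                subtasks[j].Process();
            doSomePostProcessing(i);
        }
    }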




That way you get natural load balancing (assuming that you have enough tasks) without having to rely on nested parallelism: any thread that runs out of jobs can pick up taskloop chunks from a job that is still running.
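A self-contained variant of the tasking approach is sketched below so it can be compiled and experimented with. Several things here are assumptions rather than part of the answer: it uses parallel plus single instead of parallel master (for compilers without OpenMP 5.0 support), the job sizes and the fixed three iterations are invented, an integer vector stands in for std::vector<Subtask>, and the processedSubtasks counter exists only to make the result observable.

```cpp
#include <atomic>
#include <vector>

std::atomic<long> processedSubtasks{0}; // hypothetical counter for the demo

// Hypothetical job: job i has (i + 1) * 100 subtasks and runs 3 iterations.
void processJob(int i)
{
    for (int iteration = 0; iteration < 3; ++iteration)
    {
        std::vector<int> subtasks((i + 1) * 100, 1); // stands in for getSubtasks(i)
        const int count = static_cast<int>(subtasks.size());
        // taskloop ends with an implicit taskgroup, so all chunks finish
        // before the next sequential step of this job would run.
        #pragma omp taskloop shared(subtasks) grainsize(16)
        for (int j = 0; j < count; ++j)
            processedSubtasks += subtasks[j]; // placeholder for subtasks[j].Process()
    }
}

long processAllJobs(int n)
{
    processedSubtasks = 0;
    // One thread creates a task per job; the whole team executes them.
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; ++i)
    {
        #pragma omp task firstprivate(i)
        processJob(i);
    }
    return processedSubtasks;
}
```

The grainsize(16) value is arbitrary; a suitable value depends on how long a single Process() call takes relative to the task creation overhead.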

















answered Mar 27 at 7:07









Michael Klemm
















  • One more thing: You could even use tasks for doSomePreparation(i) and doSomePostProcessing(i) when you add task dependences to synchronize prep and post-processing tasks with the Process() tasks.

    – Michael Klemm
    Mar 27 at 15:56
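One possible reading of this comment is sketched below, with the caveat that it is an interpretation and not the commenter's code. Preparation, processing, and post-processing each become a task, ordered through depend clauses on the subtasks vector. Because taskloop has no depend clause in OpenMP 5.x, the loop is wrapped in an ordinary task here. The job sizes, the two counters, and the extra iterationCount parameter are hypothetical.

```cpp
#include <atomic>
#include <vector>

std::atomic<long> processedSubtasks{0};      // hypothetical counters for the demo
std::atomic<int> postProcessedIterations{0};

void processJob(int i, int iterationCount)
{
    std::vector<int> subtasks; // shared by the child tasks below

    for (int iteration = 0; iteration < iterationCount; ++iteration)
    {
        // Preparation: must finish before the subtasks are processed.
        #pragma omp task depend(out : subtasks) shared(subtasks) firstprivate(i)
        subtasks.assign((i + 1) * 10, 1); // stands in for doSomePreparation + getSubtasks

        // Processing: the taskloop inside waits for all of its chunks.
        #pragma omp task depend(inout : subtasks) shared(subtasks)
        {
            const int count = static_cast<int>(subtasks.size());
            #pragma omp taskloop shared(subtasks)
            for (int j = 0; j < count; ++j)
                processedSubtasks += subtasks[j]; // placeholder for Process()
        }

        // Post-processing: runs only after the processing task has completed.
        #pragma omp task depend(in : subtasks)
        ++postProcessedIterations; // stands in for doSomePostProcessing(i)
    }
    // The local vector is shared with child tasks, so wait before returning.
    #pragma omp taskwait
}

long processAllJobs(int n, int iterationCount)
{
    processedSubtasks = 0;
    postProcessedIterations = 0;
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; ++i)
    {
        #pragma omp task firstprivate(i, iterationCount)
        processJob(i, iterationCount);
    }
    return processedSubtasks;
}
```

The three stages of one job still run in order, so this does not shorten a single job. The possible gain is that the thread generating the stages is not tied to executing them, and whether that helps depends on the workload.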
















