OpenMP On-Demand Nested Parallelism
I have a list of jobs, which I am processing in parallel with OpenMP:
void processAllJobs()
{
    #pragma omp parallel for
    for(int i = 0; i < n; ++i)
        processJob(i);
}
All jobs have some sequential parts and parts that could be parallelized if called alone:
void processJob(int i)
{
    for(int iteration = 0; iteration < iterationCount; ++iteration)
    {
        doSomePreparation(i);
        std::vector<Subtask> subtasks = getSubtasks(i);
        #pragma omp parallel for
        for(int j = 0; j < subtasks.size(); ++j)
            subtasks[j].Process();
        doSomePostProcessing(i);
    }
}
When I run processAllJobs(), threads are created for the outer loop (over the jobs), and the inner loop (over the subtasks) runs sequentially within each thread. This is all fine and intended.
Sometimes there are very large jobs that take a long time to process, long enough that all other threads in the outer loop finish well before the last one and then sit idle. Is there a way to re-purpose those idle threads to parallelize the inner loop as soon as they are finished? I imagine something that checks the number of unused threads each time the inner parallel region is entered.
I cannot predict how long a job runs, and there might be more than one long-running job; maybe two or three.
c++ parallel-processing openmp
You could use dynamic scheduling in the outer loop with a small number of threads, and use nested parallelism in the inner loop, also controlling the number of threads. If your total number of threads is 16, you can try num_threads(4) in both cases. With dynamic scheduling, fast threads finish early and can process several small chunks while a long job is still running. With nested parallelism you guarantee that several threads are used for long jobs.
– Alain Merigot
Mar 26 at 22:54
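A minimal sketch of what this comment describes, assuming a 4 x 4 split of 16 threads. The function name processAllJobsNested, the representation of jobs as vectors of numbers, and the doubling of each element are hypothetical placeholders for the question's processJob() and Subtask::Process(). Nested parallelism is usually disabled by default, so the sketch enables two active levels first; the code also compiles and runs serially when OpenMP is switched off.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in: each job is a list of values, each "subtask" doubles one value.
void processAllJobsNested(std::vector<std::vector<long>>& jobs)
{
#ifdef _OPENMP
    // Allow the inner parallel region to create its own team of threads.
    omp_set_max_active_levels(2);
#endif
    const int n = static_cast<int>(jobs.size());

    // Outer level: 4 threads fetch jobs one at a time, so a thread that
    // finishes a short job immediately picks up the next one.
    #pragma omp parallel for schedule(dynamic, 1) num_threads(4)
    for (int i = 0; i < n; ++i) {
        std::vector<long>& job = jobs[i];
        const int m = static_cast<int>(job.size());

        // Inner level: each job splits its subtasks across up to 4 threads.
        #pragma omp parallel for num_threads(4)
        for (int j = 0; j < m; ++j)
            job[j] *= 2;
    }
}
```

The split is fixed when the regions are entered, so threads that run out of jobs at the outer level are not handed to a long-running job at the inner level.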
Try task loops for both levels. Whether or not that will be of benefit is impossible to tell from the information in the question.
– Zulan
Mar 26 at 22:59
@Alain: Thanks for the suggestion. That would probably help a bit, although it only shifts the problem. In the example with a single long-lasting job, the initial approach leaves 15 threads idle, and the modified approach (with four working threads) still leaves 12 idle. And the sequential parts would not benefit from maximum parallelization.
– Nico Schertler
Mar 26 at 23:55
asked Mar 26 at 22:10
Nico Schertler
1 Answer
From your description of the problem, OpenMP tasking sounds like a much better choice. Your code would then look like this:
void processAllJobs()
{
    #pragma omp parallel master
    for(int i = 0; i < n; ++i)
    {
        #pragma omp task
        processJob(i);
    }
}
Then the processing of the job would look like this:
void processJob(int i)
{
    for(int iteration = 0; iteration < iterationCount; ++iteration)
    {
        doSomePreparation(i);
        std::vector<Subtask> subtasks = getSubtasks(i);
        // shared(subtasks): a local variable is otherwise firstprivate, i.e. copied into each task
        // add grainsize() clause, if Process() is very short
        #pragma omp taskloop shared(subtasks)
        for(int j = 0; j < subtasks.size(); ++j)
            subtasks[j].Process();
        doSomePostProcessing(i);
    }
}
That way you get natural load balancing (assuming that you have enough tasks) without having to rely on nested parallelism.
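For reference, here is a self-contained sketch of the same pattern that can be compiled and tried out. The Job struct and the squaring of each input are hypothetical stand-ins for the question's Subtask and Process(), and grainsize(64) is an arbitrary value. It uses parallel plus single instead of parallel master, which is an assumption made for wider compiler support; the effect is the same in that one thread creates the tasks and the whole team executes them.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical job: every subtask squares one input value.
struct Job {
    std::vector<long> inputs;
    std::vector<long> outputs;
};

void processJob(Job& job)
{
    // Sequential preparation.
    job.outputs.assign(job.inputs.size(), 0);
    const int count = static_cast<int>(job.inputs.size());

    // The iterations become tasks that any idle thread of the team can pick up.
    // shared(job) keeps the tasks from working on private copies of the job.
    #pragma omp taskloop grainsize(64) shared(job)
    for (int j = 0; j < count; ++j)
        job.outputs[j] = job.inputs[j] * job.inputs[j];

    // taskloop waits for its tasks, so the outputs are complete here
    // and sequential post-processing could follow.
}

void processAllJobs(std::vector<Job>& jobs)
{
    #pragma omp parallel
    #pragma omp single
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        // One task per job; the barrier at the end of the parallel region waits for all of them.
        #pragma omp task firstprivate(i) shared(jobs)
        processJob(jobs[i]);
    }
}
```

When only one long job is left, the tasks created by its taskloop are the only work available, so the threads that have finished their own jobs pick them up.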
One more thing: You could even use tasks for doSomePreparation(i) and doSomePostProcessing(i) when you add task dependences to synchronize prep and post-processing tasks with the Process() tasks.
– Michael Klemm
Mar 27 at 15:56
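A rough sketch of how such dependences could look, assuming the three phases communicate through a single local variable. The function name processJobWithDependences and the concrete work (fill, double, sum) are hypothetical; only the depend clauses are the point. Since taskloop does not take a depend clause, the processing phase is wrapped in a task that runs the taskloop inside.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical job: prepare n values, double each one, then sum them up.
long processJobWithDependences(int n)
{
    std::vector<long> data;   // stands in for the subtasks of one job
    long result = 0;

    // Preparation: produces data.
    #pragma omp task shared(data) firstprivate(n) depend(out: data)
    data.assign(static_cast<std::size_t>(n), 1L);

    // Processing: starts only after preparation has finished.
    #pragma omp task shared(data) depend(inout: data)
    {
        const int count = static_cast<int>(data.size());
        #pragma omp taskloop shared(data)
        for (int j = 0; j < count; ++j)
            data[j] *= 2;
    }

    // Post-processing: starts only after processing has finished.
    #pragma omp task shared(data, result) depend(in: data)
    result = std::accumulate(data.begin(), data.end(), 0L);

    // The tasks share the local variables, so wait before they go out of scope.
    #pragma omp taskwait
    return result;
}
```

The dependences only order these three sibling tasks; called from within a parallel region, each phase can be executed by whichever thread is free.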
answered Mar 27 at 7:07
Michael Klemm