OpenMP On-Demand Nested Parallelism

I have a list of jobs, which I am processing in parallel with OpenMP:

void processAllJobs()
{
#pragma omp parallel for
    for(int i = 0; i < n; ++i)
        processJob(i);
}

All jobs have some sequential parts and parts that could be parallelized if called alone:

void processJob(int i)
{
    for(int iteration = 0; iteration < iterationCount; ++iteration)
    {
        doSomePreparation(i);
        std::vector<Subtask> subtasks = getSubtasks(i);
#pragma omp parallel for
        for(int j = 0; j < subtasks.size(); ++j)
            subtasks[j].Process();
        doSomePostProcessing(i);
    }
}

When I run processAllJobs(), threads are created for the outer loop (over the jobs), and the inner loop (over the subtasks) runs sequentially within each thread. This is all fine and intended.

Sometimes there are very large jobs that take a long time to process, long enough that all other threads in the outer loop finish well before the last thread and then sit idle. Is there a way to repurpose the idle threads to parallelize the inner loop as soon as they are finished? I imagine something that checks the number of idle threads each time the inner parallel region is entered.

I cannot predict how long a job will run. There might also be more than one long-running job, perhaps two or three.

asked Mar 26 at 22:10

Nico Schertler

26.9k · 4 gold badges · 24 silver badges · 53 bronze badges

You could use dynamic scheduling in the outer loop with a small number of threads, and nested parallelism in the inner loop, also with a limited number of threads. If your total number of threads is 16, you can try num_threads(4) in both cases. With dynamic scheduling, fast threads finish early, so several small chunks can be processed while a long job is still running. With nested parallelism, you guarantee that several threads are used for long jobs.

– Alain Merigot
Mar 26 at 22:54
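
A minimal sketch of what this comment describes, assuming a hypothetical job representation (each job is a list of integers whose squares are summed) and a hypothetical function name processAllJobsNested. The thread count per level is a parameter, and omp_set_max_active_levels(2) is called so that the inner region is not serialized. It is an illustration under these assumptions, not the commenter's code:

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for the jobs: job i is a list of subtask values,
// and its result is the sum of the squares of those values.
std::vector<long> processAllJobsNested(const std::vector<std::vector<int>>& jobs,
                                       int threadsPerLevel)
{
    assert(threadsPerLevel > 0);
#ifdef _OPENMP
    // Allow two active levels so the inner parallel region gets its own team.
    omp_set_max_active_levels(2);
#endif
    std::vector<long> results(jobs.size(), 0);
    const int jobCount = static_cast<int>(jobs.size());

    // Outer level: few threads and a dynamic schedule, so a thread that
    // finishes a short job immediately picks up the next one.
#pragma omp parallel for schedule(dynamic, 1) num_threads(threadsPerLevel)
    for (int i = 0; i < jobCount; ++i)
    {
        const int subtaskCount = static_cast<int>(jobs[i].size());
        long sum = 0;
        // Inner level: every job gets a small team of its own.
#pragma omp parallel for reduction(+ : sum) num_threads(threadsPerLevel)
        for (int j = 0; j < subtaskCount; ++j)
            sum += static_cast<long>(jobs[i][j]) * jobs[i][j];
        results[i] = sum;
    }
    return results;
}
```

With threadsPerLevel = 4 this uses at most 4 x 4 = 16 threads. The team sizes are fixed up front, so threads that run out of outer iterations do not join the inner teams of jobs that are still running.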

Try task loops for both levels. Whether or not that will be of benefit is impossible to tell from the information in the question.

– Zulan
Mar 26 at 22:59
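
One possible reading of "task loops for both levels", sketched with hypothetical names (processAllJobsTaskloop, and a result table where entry [i][j] stands in for the work of subtask j of job i). The grainsize(1) on the outer loop is an assumption made here so that each job becomes its own task:

```cpp
#include <cassert>
#include <vector>

// Hypothetical workload: results[i][j] = i * j stands in for subtask j of job i.
std::vector<std::vector<int>> processAllJobsTaskloop(int jobCount, int subtaskCount)
{
    assert(jobCount >= 0 && subtaskCount >= 0);
    std::vector<std::vector<int>> results(jobCount, std::vector<int>(subtaskCount, 0));

#pragma omp parallel
#pragma omp single
    {
        // Outer level: one task per job, so a long job does not hold back short ones.
#pragma omp taskloop grainsize(1) shared(results)
        for (int i = 0; i < jobCount; ++i)
        {
            // Inner level: more tasks that otherwise idle threads can pick up.
#pragma omp taskloop shared(results)
            for (int j = 0; j < subtaskCount; ++j)
                results[i][j] = i * j;
        }
    }
    return results;
}
```

Both loops feed the same task pool of a single team, so no nested parallel region is needed. Whether this pays off depends on how expensive the subtasks are compared to the tasking overhead.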

@Alain: Thanks for the suggestion. That would probably help a bit, although it mostly shifts the problem. In the example with a single long-running job, where the initial approach leaves 15 threads idle, the modified approach would still leave 12 threads idle (with four working threads). And the sequential parts would not benefit from maximum parallelization.

– Nico Schertler
Mar 26 at 23:55


c++ parallel-processing openmp

1 Answer

From your description of the problem, OpenMP tasking sounds like a much better choice. Your code would then look like this:

void processAllJobs()
{
    // One thread creates a task per job; the whole team executes them.
#pragma omp parallel master
    for(int i = 0; i < n; ++i)
#pragma omp task
        processJob(i);
}

Then the processing of the job would look like this:

void processJob(int i)
{
    for(int iteration = 0; iteration < iterationCount; ++iteration)
    {
        doSomePreparation(i);
        std::vector<Subtask> subtasks = getSubtasks(i);
#pragma omp taskloop // add grainsize() clause, if Process() is very short
        for(int j = 0; j < subtasks.size(); ++j)
            subtasks[j].Process();
        doSomePostProcessing(i);
    }
}

That way you get natural load balancing (assuming that you have enough tasks) without having to rely on nested parallelism.
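
For experimenting, here is a self-contained sketch of the same structure. The Subtask type, its workload (squaring a number) and the return values are hypothetical stand-ins, since the real ones are not shown in the question. Two deliberate deviations, both assumptions on my side: it uses parallel plus single instead of the combined parallel master construct (which, as far as I know, needs an OpenMP 5.0 compiler), and it marks subtasks as shared on the taskloop, because a local variable referenced in an orphaned taskloop would otherwise be firstprivate, that is, copied into each task:

```cpp
#include <cassert>
#include <vector>

// Hypothetical subtask: squares its input and stores the result.
struct Subtask
{
    int input = 0;
    long output = 0;
    void Process() { output = static_cast<long>(input) * input; }
};

// One job: sequential preparation, a taskloop over the subtasks,
// then sequential post-processing.
long processJob(int i)
{
    // Stand-in for doSomePreparation(i) and getSubtasks(i).
    std::vector<Subtask> subtasks(i + 1);
    for (int j = 0; j <= i; ++j)
        subtasks[j].input = j;

    const int count = static_cast<int>(subtasks.size());
    // Threads of the enclosing team that have nothing else to do can pick
    // up these tasks. taskloop waits for its tasks unless nogroup is given.
#pragma omp taskloop shared(subtasks)
    for (int j = 0; j < count; ++j)
        subtasks[j].Process();

    // Stand-in for doSomePostProcessing(i).
    long sum = 0;
    for (const Subtask& s : subtasks)
        sum += s.output;
    return sum;
}

std::vector<long> processAllJobs(int n)
{
    assert(n >= 0);
    std::vector<long> results(n, 0);
#pragma omp parallel
#pragma omp single
    for (int i = 0; i < n; ++i)
    {
        // One task per job; each task writes only its own result slot.
#pragma omp task shared(results) firstprivate(i)
        results[i] = processJob(i);
    }
    return results;
}
```

Compiled without OpenMP support, the pragmas are ignored and the code runs sequentially with the same results, which makes it easy to compare both variants.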

answered Mar 27 at 7:07

Michael Klemm

1,161 · 7 silver badges · 12 bronze badges

One more thing: You could even use tasks for doSomePreparation(i) and doSomePostProcessing(i) when you add task dependences to synchronize prep and post-processing tasks with the Process() tasks.

– Michael Klemm
Mar 27 at 15:56
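
A rough sketch of how such dependences could be expressed, assuming a hypothetical function processJobWithDependences and a dummy variable phase that serves only as a dependence token. Since taskloop does not, as far as I know, accept a depend clause, the Process() step is written as individual tasks here:

```cpp
#include <cassert>
#include <vector>

// Runs prep -> concurrent Process() tasks -> post for one job.
// The workload (squaring and summing) is a hypothetical stand-in.
long processJobWithDependences(int i)
{
    assert(i >= 0);
    std::vector<long> values(i + 1, 0);
    long sum = 0;
    char phase = 0; // never read or written: only used as the dependence token

    // The taskgroup keeps the locals alive until all tasks below have finished.
#pragma omp taskgroup
    {
        // Preparation: "out" on the token, so the readers below wait for it.
#pragma omp task shared(values, phase) depend(out : phase)
        for (int j = 0; j <= i; ++j)
            values[j] = j;

        // Process(): "in" tasks wait for the preparation but may run
        // concurrently with each other.
        for (int j = 0; j <= i; ++j)
        {
#pragma omp task shared(values, phase) firstprivate(j) depend(in : phase)
            values[j] = values[j] * values[j];
        }

        // Post-processing: "out" again, so it waits for all "in" tasks.
#pragma omp task shared(values, sum, phase) depend(out : phase)
        for (long v : values)
            sum += v;
    }
    return sum;
}
```

Called from inside a job task, this lets the thread that generated the tasks help with other work instead of running the preparation itself. Dependences only order sibling tasks, which is why all three kinds of task are created by the same parent.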

