CUDA execution time compared to block size
The goal is simple: plot the effect of block size on execution time in CUDA. What one would expect to see is that execution time is lowest for each block size that is a multiple of 32, and that it jumps up just past these multiples (e.g. at 33, 65, 97, 129, ...). However, this is not the result I'm getting: the execution time simply goes down and then flattens out.
I'm running CUDA runtime 10.0 on an NVIDIA GeForce 940M.
I've tried several ways of measuring the execution time. The approach recommended in the CUDA documentation suggests the following should work:
cudaEvent_t gpu_execution_start, gpu_execution_end;
cudaEventCreate(&gpu_execution_start);
cudaEventCreate(&gpu_execution_end);

cudaEventRecord(gpu_execution_start);
kernel<<<n_blocks, blocksize>>>(device_a, device_b, device_out, arraysize);
cudaEventRecord(gpu_execution_end);
cudaEventSynchronize(gpu_execution_end);  // block until the end event has been reached

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, gpu_execution_start, gpu_execution_end);
This way of timing, however, produces the result described above.
Does the issue lie in how I time the execution? Or could this specific GPU be the cause?
cuda nvidia
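For reference, here is a minimal, self-contained sketch of how such a block-size sweep could be timed. The vecAdd kernel and all names in it are placeholders for illustration, not taken from the original post:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: element-wise vector add, used only to give the sweep something to time.
__global__ void vecAdd(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 24;  // large enough to keep the GPU busy
    float *a, *b, *out;     // contents left uninitialized; only timing matters here
    cudaMalloc((void **)&a, n * sizeof(float));
    cudaMalloc((void **)&b, n * sizeof(float));
    cudaMalloc((void **)&out, n * sizeof(float));

    cudaEvent_t start, end;
    cudaEventCreate(&start);
    cudaEventCreate(&end);

    for (int blocksize = 1; blocksize <= 1024; ++blocksize) {
        int n_blocks = (n + blocksize - 1) / blocksize;  // round up to cover all n elements

        vecAdd<<<n_blocks, blocksize>>>(a, b, out, n);   // warm-up launch, not timed

        cudaEventRecord(start);
        vecAdd<<<n_blocks, blocksize>>>(a, b, out, n);
        cudaEventRecord(end);
        cudaEventSynchronize(end);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, end);
        printf("%d,%f\n", blocksize, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(end);
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}

Note that n_blocks is recomputed at every step so that the total amount of useful work stays constant; as the comments below point out, whether you hold the total thread count or the block count fixed changes what the sweep actually measures.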
edited Mar 25 at 18:59 by talonmies · asked Mar 25 at 18:15 by jonas
What does your kernel actually do? Do you keep the overall number of threads the same and just change how many blocks you use, or do you keep the number of blocks the same and just keep adding threads to the blocks? If what you're doing is increasing the overall number of threads, then what do the additional threads do?
– Michael Kenzel
Mar 25 at 21:35
In one of my applications an 8x8x8 block takes 8 seconds, a 9x9x9 block takes 5 seconds, and a 10x10x10 block takes 11 seconds. There is no general answer for block size vs. time; it depends on the kernel.
– Ander Biguri
Mar 26 at 14:10
1 Answer
Each of those thread blocks will be translated into warps, and as you increase the number of threads per thread block by 32, you decrease the percentage of diverged (inactive) lanes each time. For example, if you launch 33 threads per thread block, each block will have one warp with all 32 lanes active and another with only 1 lane active. So at each increment of your test, you are not increasing the amount of divergence; you are just adding one more active warp to the block.
If, in addition, you are not scaling the problem size correctly, all of your work can be scheduled at the same time anyway, so there won't be any visible effect on execution time.
Hope this helps!
answered Mar 25 at 19:11 by user11127113
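To make the warp arithmetic in this answer concrete, here is a small host-side sketch (an editorial illustration, not part of the original answer) that computes, for a given block size, how many warps a block occupies and what fraction of their lanes are active:

#include <cstdio>

// Warps are 32 lanes wide on all current NVIDIA GPUs.
const int WARP_SIZE = 32;

int main()
{
    for (int blocksize = 30; blocksize <= 36; ++blocksize) {
        int warps = (blocksize + WARP_SIZE - 1) / WARP_SIZE;      // round up to whole warps
        double active = 100.0 * blocksize / (warps * WARP_SIZE);  // fraction of lanes doing work
        printf("blocksize %3d -> %d warp(s), %5.1f%% of lanes active\n",
               blocksize, warps, active);
    }
    return 0;
}

For blocksize = 33 this prints 2 warps with about 51.6% of lanes active, and the fraction climbs back toward 100% as the block size approaches the next multiple of 32, which is why the penalty for a partially filled warp shrinks rather than grows as the sweep continues.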