CUDA execution time compared to block size
The goal is simple: plot the effect of block size on execution time in CUDA. What one would expect to see is that execution time is lowest for each block size that is a multiple of 32, and that it jumps up just past these multiples (e.g. at 33, 65, 97, 129, ...). However, this is not the result I'm getting: the execution time simply goes down and then flattens out.
I'm running CUDA runtime 10.0 on an NVIDIA GeForce 940M.
I've tried several ways of measuring the execution time. The approach recommended in the CUDA documentation suggests the following should work:
cudaEvent_t gpu_execution_start, gpu_execution_end;
cudaEventCreate(&gpu_execution_start);
cudaEventCreate(&gpu_execution_end);

cudaEventRecord(gpu_execution_start);
kernel<<<n_blocks, blocksize>>>(device_a, device_b, device_out, arraysize);
cudaEventRecord(gpu_execution_end);
cudaEventSynchronize(gpu_execution_end);  // block until the end event has been reached

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, gpu_execution_start, gpu_execution_end);
This way of timing, however, produces the result described above.
Does the issue lie in how I time the execution? Or could this specific GPU be the cause?
cuda nvidia
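For reference, here is a minimal, self-contained sketch of how such a block-size sweep could be timed. The vecAdd kernel and all names in it are placeholders for illustration, not taken from the original post:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: element-wise vector add, used only to give the sweep something to time.
__global__ void vecAdd(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 24;  // large enough to keep the GPU busy
    float *a, *b, *out;     // contents left uninitialized; only timing matters here
    cudaMalloc((void **)&a, n * sizeof(float));
    cudaMalloc((void **)&b, n * sizeof(float));
    cudaMalloc((void **)&out, n * sizeof(float));

    cudaEvent_t start, end;
    cudaEventCreate(&start);
    cudaEventCreate(&end);

    for (int blocksize = 1; blocksize <= 1024; ++blocksize) {
        int n_blocks = (n + blocksize - 1) / blocksize;  // round up to cover all n elements

        vecAdd<<<n_blocks, blocksize>>>(a, b, out, n);   // warm-up launch, not timed

        cudaEventRecord(start);
        vecAdd<<<n_blocks, blocksize>>>(a, b, out, n);
        cudaEventRecord(end);
        cudaEventSynchronize(end);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, end);
        printf("%d,%f\n", blocksize, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(end);
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}

Note that n_blocks is recomputed at every step so that the total amount of useful work stays constant; as the comments below point out, whether you hold the total thread count or the block count fixed changes what the sweep actually measures.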
edited Mar 25 at 18:59 by talonmies · asked Mar 25 at 18:15 by jonas
What does your kernel actually do? Do you keep the overall number of threads the same and just change how many blocks you use, or do you keep the number of blocks the same and just keep adding threads to the blocks? If what you're doing is increasing the overall number of threads, then what do the additional threads do?
– Michael Kenzel
Mar 25 at 21:35
In one of my applications an 8x8x8 block takes 8 seconds, a 9x9x9 block takes 5 seconds, and a 10x10x10 block takes 11 seconds. There is no general answer for block size vs. time; it depends on the kernel.
– Ander Biguri
Mar 26 at 14:10
1 Answer
Each of those thread blocks will be translated into warps, and as you increase the number of threads per thread block by 32, you decrease the percentage of diverged (inactive) lanes each time. For example, if you launch 33 threads per thread block, each block will have one warp with all 32 lanes active and another with only 1 lane active. So at each increment of your test, you are not increasing the amount of divergence; you are just adding one more active warp to the block.
If, in addition, you are not scaling the problem size correctly, all of your work can be scheduled at the same time anyway, so there won't be any visible effect on execution time.
Hope this helps!
answered Mar 25 at 19:11 by user11127113
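To make the warp arithmetic in this answer concrete, here is a small host-side sketch (an editorial illustration, not part of the original answer) that computes, for a given block size, how many warps a block occupies and what fraction of their lanes are active:

#include <cstdio>

// Warps are 32 lanes wide on all current NVIDIA GPUs.
const int WARP_SIZE = 32;

int main()
{
    for (int blocksize = 30; blocksize <= 36; ++blocksize) {
        int warps = (blocksize + WARP_SIZE - 1) / WARP_SIZE;      // round up to whole warps
        double active = 100.0 * blocksize / (warps * WARP_SIZE);  // fraction of lanes doing work
        printf("blocksize %3d -> %d warp(s), %5.1f%% of lanes active\n",
               blocksize, warps, active);
    }
    return 0;
}

For blocksize = 33 this prints 2 warps with about 51.6% of lanes active, and the fraction climbs back toward 100% as the block size approaches the next multiple of 32, which is why the penalty for a partially filled warp shrinks rather than grows as the sweep continues.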