
"Best practices" for formulating MIPs

What is the right way to query an I2C device from an interrupt service routine?

Will greasing clutch parts make it softer

Did Stalin kill all Soviet officers involved in the Winter War?

Puzzling Knight has a Message for all- Especially Newcomers

Is there any connection between "Whispers of the heart" and "The cat returns"?

A student "completes" 2-week project in 3 hours and lies about doing it himself

Construction of the word подтвержда́ть

My players like to search everything. What do they find?

Bypass with wrong cvv of debit card and getting OTP

Why did moving the mouse cursor cause Windows 95 to run more quickly?

What verb goes with "coup"?

Old story where computer expert digitally animates The Lord of the Rings

Should I hide my travel history to the UK when I apply for an Australian visa?

Isn't "Dave's protocol" good if only the database, and not the code, is leaked?

Finding integer database columns that may have their data type changed to reduce size

What does "another" mean in this case?

PhD: When to quit and move on?

Sleepy tired vs physically tired

What does the ash content of broken wheat really mean?

what is the meaning of "stock" dilution on the Massive Dev Chart Website?

gzip compress a local folder and extract it to remote server

How can solar sailed ships be protected from space debris?

Should I cross-validate metrics that were not optimised?



CUDA execution time compared to block size


The goal is simple: plot the effect of block size on execution time in CUDA. What one would expect to see is that the execution time is lowest for each block size that is a multiple of 32, and that just past those multiples (e.g. 33, 65, 97, 129, ...) the execution time should increase. However, this is not the result I'm getting. The execution time simply goes down and then flattens out.
[Plot: measured execution time vs. block size — the curve decreases and then flattens]



I'm running CUDA runtime 10.0 on an NVIDIA GeForce 940M.



I've tried several ways of measuring the execution time. The approach recommended in the CUDA documentation says the following should work:



cudaEvent_t gpu_execution_start, gpu_execution_end;   // declarations added for completeness
cudaEventCreate(&gpu_execution_start);
cudaEventCreate(&gpu_execution_end);

cudaEventRecord(gpu_execution_start);
kernel<<<n_blocks,blocksize>>> (device_a, device_b, device_out, arraysize);
cudaEventRecord(gpu_execution_end);

cudaEventSynchronize(gpu_execution_end);  // block until the stop event (and the kernel) has completed

float gpu_time_ms = 0.0f;
cudaEventElapsedTime(&gpu_time_ms, gpu_execution_start, gpu_execution_end);  // elapsed time in ms


This way of timing, however, produces the result described above.



Does the issue lie in how I time the execution? Or could this specific GPU be the cause of the result?





























  • What does your kernel actually do? Do you keep the overall number of threads the same and just change how many blocks you use, or do you keep the number of blocks the same and just keep adding threads to the blocks? If you're increasing the overall number of threads, then what do the additional threads do?

    – Michael Kenzel, Mar 25 at 21:35












  • In one of my applications an 8x8x8 block takes 8 seconds, a 9x9x9 block takes 5 seconds, and a 10x10x10 block takes 11 seconds. There is no general answer for block size vs. time; it depends on the kernel.

    – Ander Biguri, Mar 26 at 14:10


















cuda nvidia














edited Mar 25 at 18:59 by talonmies (60.6k; 17 gold, 140 silver, 208 bronze badges)

asked Mar 25 at 18:15 by jonas (1 bronze badge)








1 Answer
































Each of those thread blocks is translated into warps of 32 threads, and as you increase the number of threads per block by 32, you decrease the percentage of diverged (inactive) lanes each time. For example, if you launch 33 threads per block, each block will have one warp with all 32 lanes active and another warp with only 1 lane active. So at each increment of your test you are not increasing the amount of divergence; you are just adding one more active warp to the block.



If, in addition, you are not scaling the total amount of work correctly, all of your work can be scheduled at the same time anyway, so block size will have no visible effect on execution time.



Hope this helps!















answered Mar 25 at 19:11 by user11127113
























