
Nvidia GPU Architecture

TBD

CUDA core

  • A pipeline that can perform 32-bit floating-point addition, 32-bit floating-point multiplication, and 32-bit down to 8-bit integer operations such as shifts, adds, and multiplies. It also issues memory requests to its SM so that it is continuously fed with data, requests special-function calculations, and is synchronized by the SM so that parallel algorithms compute correctly.
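
    A minimal sketch of that instruction mix in a single kernel: FP32 multiply/add, integer shift/add for addressing, a global-memory request, and a special-function call. The kernel name, parameters, and the assumption of 256 threads per block are illustrative only:

      __global__ void pipelineMix(const float *in, float *out, int n)
      {
          int i = (blockIdx.x << 8) + threadIdx.x;  // integer shift + add (assumes 256 threads per block)
          if (i < n) {
              float x = in[i];                      // memory request served through the SM
              float y = x * 1.5f + 0.5f;            // 32-bit FP multiply and add
              out[i] = __sinf(y);                   // special-function calculation
          }
      }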

  • When a CUDA kernel is running, it is cloned across all “CUDA threads”, and these threads flow through the “CUDA pipelines”. Each pipeline supports up to 16-way threading, which lets all of its compute resources (integer units, floating-point units, data requests, ...) be used efficiently. When a “warp” of CUDA threads becomes eligible for issue, it is mapped onto 32 pipelines and locked to them. Those 32 pipelines then work as a team, implicitly stepping through the kernel's instructions in lock-step, and this happens concurrently with other warps.
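
    A sketch of that 32-wide grouping: every thread computes its lane and warp index, and a warp-level shuffle moves a value among the 32 lanes that issue together. Kernel and variable names are illustrative; __shfl_sync needs CUDA 9 or later:

      __global__ void warpDemo(float *out)
      {
          int tid  = blockIdx.x * blockDim.x + threadIdx.x;
          int lane = threadIdx.x % warpSize;   // position within the warp (0..31)
          int warp = threadIdx.x / warpSize;   // which warp of the block

          float v = (float)lane;
          // All 32 lanes of the warp issue this together; each reads lane 0's value.
          float fromLane0 = __shfl_sync(0xffffffffu, v, 0);
          out[tid] = fromLane0 + (float)warp;
      }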

  • A “CUDA core” is a “pipeline” onto which “CUDA threads” are mapped in a synchronized manner, so that it can efficiently take on continuous computation.

  • The number of CUDA cores only matters when all a kernel does is 32-bit multiplication and 32-bit addition. When it needs 64-bit operations, you should look at other parts of the chip, such as the “double precision” units, which are generally a scarce resource in mainstream hardware. Special function units are also scarce, around 1/4 as many as the “cores”, so if you are doing mostly square-root calculations you should take that into account. For example, the Kepler architecture has only 1/6 as many special function units as cores, while the latest generations generally have a 1/4 ratio and are therefore better at this.
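
    A sketch contrasting the plentiful single-precision path with the scarcer double-precision and special-function resources; it only illustrates the instruction mix, and the names are made up for the example:

      __global__ void precisionMix(const float *inF, const double *inD,
                                   float *outF, double *outD, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) {
              outF[i] = sqrtf(inF[i]);  // single precision; maps to the fast SFU path with -use_fast_math
              outD[i] = sqrt(inD[i]);   // double precision; runs on the scarcer FP64 units
          }
      }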

  • Because of architectural differences, a CUDA core (pipeline) on an old architecture does not deliver the same performance as a CUDA core on a new architecture. You shouldn’t compare Kepler’s cores to Volta’s cores: Volta has much more bandwidth per CUDA pipeline than Kepler had, so 1 GFLOPS of Volta is effectively like 1.5-2.0 GFLOPS of Kepler.
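
    Since per-core throughput varies between generations, querying the device you actually run on is more informative than raw core counts. A host-side sketch using the standard CUDA runtime API, with error handling omitted for brevity:

      #include <cstdio>
      #include <cuda_runtime.h>

      int main()
      {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, 0);  // properties of device 0
          printf("%s: compute capability %d.%d, %d SMs\n",
                 prop.name, prop.major, prop.minor, prop.multiProcessorCount);
          return 0;
      }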

  • A CUDA core is not a full core; it is a pipeline. It only issues requests to other resources and responds with the results. It resembles a single-core, 16-thread Pentium without I/O.