
TBD

CUDA core

  • A pipeline that can perform 32-bit floating-point addition, 32-bit floating-point multiplication, and 32-bit to 8-bit integer operations such as shifts, adds, and multiplies. It also issues memory requests to its SM so that it is constantly fed with data, requests special-function calculations, and is synchronized by the SM so that parallel algorithms compute correctly.

  • When a CUDA kernel runs, it is cloned onto all “CUDA threads”, and these threads flow through the “CUDA pipelines”. Each pipeline is capable of up to 16-way threading, which lets all of its compute resources (integer, floating-point, data requests, ...) be used efficiently. When a “warp” of CUDA threads becomes eligible for issue, it is locked onto 32 “CUDA pipelines”, which then work as a team to step through the kernel instructions in lock step. This happens concurrently with other warps.

  • A “CUDA core” is a “pipeline” onto which CUDA threads are mapped in a synchronized manner, so that it can take on continuous computation efficiently.

  • The number of CUDA cores matters only when all a kernel does is 32-bit multiplication and 32-bit addition. When it needs 64-bit operations, you should look at other resources, such as the “double precision” units; these are generally scarce in mainstream hardware. Special function units are also scarce, at roughly 1/4 the number of “cores”, so if you are doing mostly square-root calculations you should take that into consideration. The Kepler architecture, for example, has only 1/6 as many special function units as cores, while the latest generations generally have a 1/4 ratio, so they are better at such workloads.

  • Because of architectural differences, a CUDA core (pipeline) on an old architecture does not have the same performance as one on a new architecture. You shouldn’t compare Kepler’s cores to Volta’s: Volta has much more bandwidth per CUDA pipeline than Kepler had, so 1 GFLOPS on Volta is effectively worth 1.5-2.0 GFLOPS on Kepler.

  • A CUDA core is not a full core; it is a pipeline. It issues requests to other resources and responds with results. It resembles a single-core, 16-thread Pentium without I/O.
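The lock-step behavior described above can be modeled on the host. This is a hypothetical sketch, not CUDA code: one "instruction" (a fused multiply-add) is issued once and carried out by all 32 lanes of a warp together, each lane on its own data element.

```cpp
#include <array>

constexpr int kWarpSize = 32;

// Host-side model of one warp executing a single instruction in lock step:
// every lane performs the same 32-bit multiply and add on its own element.
std::array<float, kWarpSize> warp_fma(const std::array<float, kWarpSize>& a,
                                      const std::array<float, kWarpSize>& b,
                                      const std::array<float, kWarpSize>& c) {
    std::array<float, kWarpSize> out{};
    for (int lane = 0; lane < kWarpSize; ++lane)
        out[lane] = a[lane] * b[lane] + c[lane];  // same op, 32 lanes at once
    return out;
}
```

On real hardware the loop body does not iterate; all 32 lanes advance together in one issue, which is what makes divergence within a warp expensive.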

Introduction

  • CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. Together they give conventional C code, written for one thread of the hierarchy, a clear parallel structure.

Thread Hierarchy

  • The programmer organizes these threads into a hierarchy of grids of thread blocks. A thread block is a set of concurrent threads that can cooperate among themselves through barrier synchronization and shared access to a memory space private to the block. A grid is a set of thread blocks that may each be executed independently and thus may execute in parallel.

  • Each thread is given a unique thread ID number threadIdx within its thread block, numbered 0, 1, 2, …, blockDim–1, and each thread block is given a unique block ID number blockIdx within its grid. Early CUDA devices supported thread blocks of up to 512 threads; current GPUs support up to 1024.

  • A kernel is launched as kernel<<<dimGrid, dimBlock>>>(... parameter list ...);, where dimGrid and dimBlock are three-element vectors of type dim3 that specify the dimensions of the grid in blocks and the dimensions of the blocks in threads.

  • CUDA requires that thread blocks execute independently. It must be possible to execute blocks in any order, in parallel or in series. Different blocks have no means of direct communication, although they may coordinate their activities using atomic memory operations on the global memory visible to all threads, for example by atomically incrementing queue pointers.
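The indexing scheme above can be checked with plain arithmetic. This is a host-side sketch, no GPU needed: it reproduces the computation a kernel performs with blockIdx.x * blockDim.x + threadIdx.x, and shows that because every (block, thread) pair maps to a distinct global ID, blocks can run in any order without colliding.

```cpp
#include <set>

// The per-thread global index a 1-D kernel computes for itself.
int global_id(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

// Enumerate every thread's ID; blocks are visited in reverse order on
// purpose, since CUDA guarantees nothing about block execution order.
std::set<int> all_ids(int dimGrid, int dimBlock) {
    std::set<int> ids;
    for (int b = dimGrid - 1; b >= 0; --b)
        for (int t = 0; t < dimBlock; ++t)
            ids.insert(global_id(b, dimBlock, t));
    return ids;
}
```

If the set's size equals dimGrid * dimBlock, every thread got a unique ID regardless of the order blocks ran in.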


TensorFlow

  • ProtoBuf: a ProtoBuf file can be saved in either binary or text format. It can store the graph definition, model weights, etc.

C++ Syntax


C/C++ code standard

File, Naming and Type of Data

  • Follow the Google C++ Style Guide (only for independent projects; code integrated with other projects follows the corresponding style)
  • External library headers are included with quotation marks, for example #include "gtest/gtest.h"
  • Macro definitions should start with the project name, PROJECTNAME_
  • Other scenarios where global namespace conflicts can arise are handled similarly; for example, script directory names can start with projectname-
  • Forbid C-style cast
  • Use nullptr instead of NULL
  • Be careful with precision-sensitive code: use uint8_t, int64_t, etc. instead of int, unsigned long, and other platform-dependent types
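A short snippet applying the rules above. MYPROJ_ and the helper names are made up for illustration; the point is the project-prefixed macro, fixed-width types, nullptr instead of NULL, and a C++-style cast where a C-style cast is forbidden.

```cpp
#include <cstdint>

// Macro prefixed with the (hypothetical) project name, per the rule above.
#define MYPROJ_MAX_RETRIES 3

// Fixed-width types instead of int / unsigned long; static_cast makes the
// widening explicit where a C-style cast would hide it.
int64_t scale(uint8_t value, int64_t factor) {
    return static_cast<int64_t>(value) * factor;
}

// nullptr, never NULL, for a pointer that may be absent.
const char* find_user(bool exists) {
    return exists ? "admin" : nullptr;
}
```

Using uint8_t here also makes the 0-255 input range part of the function's signature rather than an unstated assumption.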