
HugeCTR notes

Introduction

  • HugeCTR is a GPU-accelerated training framework for CTR estimation. It supports both model-parallel and data-parallel scaling.
  • It explicitly prevents users from developing their models in sub-optimal ways, constraining layer widths and memory sizes to optimal values in order to achieve significant performance benefits.
  • Figure: (a) A DL-based CTR model. (b) HugeCTR architecture.
  • HugeCTR is also a reference design for framework developers who want to port their CPU solutions to GPU or optimize their current GPU solutions.

Basic element

  • tensor2.hpp::Class Tensor2

    • It has three data members:
      std::vector<size_t> dimensions_;
      size_t num_elements_;
      std::shared_ptr<TensorBuffer2> buffer_;
    • It can be constructed with no arguments, in which case it does not allocate any memory.
    • The shrink() method shrinks the current Tensor2 object into a TensorBag2 object. A TensorBag2 object can be transformed back into a Tensor2 object by calling stretch_from(const TensorBag2 &bag).
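    • A minimal sketch of the shrink()/stretch_from() round trip described above, using simplified stand-in types (the buffer is held as a plain std::shared_ptr rather than a TensorBuffer2); this is an illustration, not the real HugeCTR implementation:
      #include <cstddef>
      #include <functional>
      #include <memory>
      #include <numeric>
      #include <vector>

      // Type-erased handle that shares the underlying buffer (stand-in for TensorBag2).
      struct TensorBag2 {
        std::vector<std::size_t> dimensions;
        std::shared_ptr<void> buffer;
      };

      template <typename T>
      class Tensor2 {
       public:
        Tensor2() = default;  // empty tensor: no memory is allocated

        explicit Tensor2(std::vector<std::size_t> dims)
            : dimensions_(std::move(dims)),
              num_elements_(std::accumulate(dimensions_.begin(), dimensions_.end(),
                                            std::size_t{1}, std::multiplies<std::size_t>())),
              buffer_(new T[num_elements_], std::default_delete<T[]>()) {}

        // shrink(): hand out a type-erased bag that shares this tensor's buffer.
        TensorBag2 shrink() const { return TensorBag2{dimensions_, buffer_}; }

        // stretch_from(): recover a typed Tensor2 from a TensorBag2.
        static Tensor2 stretch_from(const TensorBag2& bag) {
          Tensor2 t;
          t.dimensions_ = bag.dimensions;
          t.num_elements_ = std::accumulate(bag.dimensions.begin(), bag.dimensions.end(),
                                            std::size_t{1}, std::multiplies<std::size_t>());
          t.buffer_ = std::static_pointer_cast<T>(bag.buffer);
          return t;
        }

       private:
        std::vector<std::size_t> dimensions_;
        std::size_t num_elements_ = 0;
        std::shared_ptr<T> buffer_;
      };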
  • general_buffer2.hpp::Class CudaManagedAllocator:

    • Encapsulates cudaMallocManaged and cudaFree with error checking.
    • Allocates memory that will be automatically managed by the Unified Memory system.
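    • A minimal sketch of such an allocator wrapping cudaMallocManaged/cudaFree with error checking; the class and method names here are illustrative stand-ins, not necessarily the real HugeCTR interface:
      #include <cuda_runtime.h>
      #include <cstddef>
      #include <cstdio>
      #include <cstdlib>

      // Stand-in for an allocator that encapsulates Unified Memory allocation.
      class ManagedAllocatorSketch {
       public:
        void* allocate(std::size_t size) const {
          void* ptr = nullptr;
          cudaError_t err = cudaMallocManaged(&ptr, size);  // managed (Unified Memory) allocation
          if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
            std::abort();
          }
          return ptr;  // pages migrate between host and device on demand
        }

        void deallocate(void* ptr) const {
          cudaError_t err = cudaFree(ptr);
          if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaFree failed: %s\n", cudaGetErrorString(err));
            std::abort();
          }
        }
      };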
  • embedding.hpp::Struct BufferBag

    • It only has three members:
      TensorBag2 keys;
      TensorBag2 slot_id;
      Tensor2<float> embedding;
  • embedding.hpp::Struct SparseEmbeddingHashParams

    • It includes parameters such as the batch size, vocabulary size, slot size array, embedding vector size, max feature number, slot number, and optimizer params.
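    • An illustrative sketch of the kind of fields this struct carries; the exact names and types below are assumptions based on the list above, not the real HugeCTR definition:
      #include <cstddef>
      #include <vector>

      // Placeholder for the optimizer hyper-parameters mentioned above.
      struct OptParamsSketch {
        float learning_rate = 0.001f;
        // optimizer type, update type, momentum/beta terms, ...
      };

      // Illustrative only: groups the hyper-parameters listed above in one struct.
      struct SparseEmbeddingHashParamsSketch {
        std::size_t batch_size;                    // training batch size
        std::size_t max_vocabulary_size_per_gpu;   // vocabulary size budget per GPU
        std::vector<std::size_t> slot_size_array;  // per-slot vocabulary sizes
        std::size_t embedding_vec_size;            // width of each embedding vector
        std::size_t max_feature_num;               // max number of features per sample
        std::size_t slot_num;                      // number of slots
        OptParamsSketch opt_params;                // optimizer parameters
      };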
  • resource_manager.hpp::Class ResourceManager

    • It is the GPU resource container for one node. An instance includes the GPU resource vector, the thread pool for training, and the NCCL communicators.
    • Its key data members are device_map_, cpu_resource_, gpu_resource_, and device_memory_resource.
    • Here are some device_map_-related methods:
      • const std::vector<int>& get_local_gpu_device_id_list() const
      • int get_process_id_from_gpu_global_id(size_t global_gpu_id) const
      • size_t get_gpu_local_id_from_global_id(size_t global_gpu_id) const // sequential GPU indices
      • size_t get_gpu_global_id_from_local_id(size_t local_gpu_id) const // sequential GPU indices
    • A device_memory_resource-related method:
      • const std::shared_ptr<rmm::mr::device_memory_resource>& get_device_rmm_device_memory_resource(int local_gpu_id) const
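    • A sketch of the indexing these methods imply, assuming every node (process) owns the same number of GPUs and global ids are assigned sequentially per node (process 0 holds globals 0..N-1, process 1 holds N..2N-1, and so on); this illustrates the bookkeeping only, not the real ResourceManager/DeviceMap implementation:
      #include <cstddef>
      #include <vector>

      struct DeviceMapSketch {
        std::vector<int> local_gpu_device_ids;  // CUDA device ids owned by this node
        std::size_t my_process_id;              // rank of this node

        std::size_t gpus_per_node() const { return local_gpu_device_ids.size(); }

        const std::vector<int>& get_local_gpu_device_id_list() const {
          return local_gpu_device_ids;
        }
        // Which process (node) owns a given global GPU id.
        int get_process_id_from_gpu_global_id(std::size_t global_gpu_id) const {
          return static_cast<int>(global_gpu_id / gpus_per_node());
        }
        // Position of a global GPU id within its owning node.
        std::size_t get_gpu_local_id_from_global_id(std::size_t global_gpu_id) const {
          return global_gpu_id % gpus_per_node();
        }
        // Global id of one of this node's local GPUs.
        std::size_t get_gpu_global_id_from_local_id(std::size_t local_gpu_id) const {
          return my_process_id * gpus_per_node() + local_gpu_id;
        }
      };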
  • embedding.hpp::Class IEmbedding

    • It is the interface of the Embedding classes. It can load parameters from a standard file stream or a BufferBag object.
  • metadata.hpp::Class Metadata: Defines the spec of a layer from a JSON file.

Multi-slot embedding with in-memory GPU hash table

  • The hash table is distributed across multiple GPUs.
  • The embedding layer includes a GPU-accelerated hash table and harnesses NCCL as its inter-GPU communication primitive.
  • The hash table implementation is based on RAPIDS cuDF, which is a GPU DataFrame library from NVIDIA.
  • During an embedding lookup, the input sparse features that belong to the same slot, after being converted to the corresponding dense embedding vectors independently, are reduced to a single embedding vector. Then, the embedding vectors from the different slots are concatenated together (a CPU sketch of this reduce-then-concatenate step appears after this list).
  • Figure: Multi-slot embedding supported by HugeCTR. The features in the same slot are reduced, while those from different slots are concatenated.
  • Slot: In HugeCTR, a slot is a feature field or table. The features in a slot can be one-hot or multi-hot. The number of features in different slots can vary. You can specify the number of slots (slot_num) in the data layer of your configuration file.
  • About the LocalizedSlotEmbedding and DistributedSlotEmbedding: There are two sub-classes of Embedding layer, LocalizedSlotEmbedding and DistributedSlotEmbedding. They are distinguished by different methods of distributing embedding tables on multiple GPUs as model parallelism. For LocalizedSlotEmbedding, the features in the same slot will be stored in one GPU (that is why we call it “localized slot”), and different slots may be stored in different GPUs according to the index number of the slot. For DistributedSlotEmbedding, all the features are distributed to different GPUs according to the index number of the feature, regardless of the index number of the slot. That means the features in the same slot may be stored in different GPUs (that is why we call it “distributed slot”).
  • Thus LocalizedSlotEmbedding is optimized for the case where each embedding is smaller than the memory size of a GPU. Because local reduction per slot is used in LocalizedSlotEmbedding and there is no global reduction between GPUs, the overall data transaction in the embedding is much less than with DistributedSlotEmbedding. DistributedSlotEmbedding is made for the case where some of the embeddings are larger than the memory size of a GPU. Since global reduction is required, DistributedSlotEmbedding incurs many more memory transactions between GPUs.
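  • A CPU sketch of the reduce-then-concatenate lookup described above; the table layout, key type, and function signature are simplified illustrations, not the GPU hash-table implementation HugeCTR actually uses:
    #include <cstddef>
    #include <unordered_map>
    #include <vector>

    // For each slot: look up each feature's dense vector, sum-reduce them,
    // then append (concatenate) the per-slot result to the output.
    std::vector<float> embedding_lookup(
        const std::unordered_map<long long, std::vector<float>>& table,  // key -> embedding vector
        const std::vector<std::vector<long long>>& slots,                // feature keys grouped by slot
        std::size_t embedding_vec_size) {
      std::vector<float> output;
      output.reserve(slots.size() * embedding_vec_size);
      for (const auto& slot_keys : slots) {
        std::vector<float> reduced(embedding_vec_size, 0.0f);
        for (long long key : slot_keys) {
          const std::vector<float>& vec = table.at(key);  // dense vector for this feature
          for (std::size_t i = 0; i < embedding_vec_size; ++i) reduced[i] += vec[i];
        }
        output.insert(output.end(), reduced.begin(), reduced.end());  // concatenate slots
      }
      return output;
    }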

DataReader

  • HugeCTR implements a dedicated data reader that is inherently asynchronous and multi-threaded, so that the data transfer time overlaps with the GPU computation (a producer/consumer sketch of this prefetching pattern follows this list).
  • The DataReader is required to read the same batch of data on each node at each step.
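  • A generic producer/consumer sketch of this prefetching pattern: a worker thread keeps reading batches into a bounded queue so host-side reading overlaps with GPU compute; names and the synthetic "read" are illustrative, and this is not HugeCTR's actual DataReader:
    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Batch { std::vector<float> samples; };

    class PrefetchingReaderSketch {
     public:
      // The training loop is expected to call next() exactly num_steps times
      // before this object is destroyed (otherwise the worker would block).
      PrefetchingReaderSketch(std::size_t num_steps, std::size_t queue_capacity)
          : capacity_(queue_capacity), worker_([this, num_steps] {
              for (std::size_t step = 0; step < num_steps; ++step) {
                Batch b{std::vector<float>(1024, static_cast<float>(step))};  // stand-in for read + parse
                std::unique_lock<std::mutex> lock(mu_);
                not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
                queue_.push(std::move(b));
                not_empty_.notify_one();
              }
            }) {}
      ~PrefetchingReaderSketch() { worker_.join(); }

      // Called by the training loop; blocks only if the reader has fallen behind.
      Batch next() {
        std::unique_lock<std::mutex> lock(mu_);
        not_empty_.wait(lock, [this] { return !queue_.empty(); });
        Batch b = std::move(queue_.front());
        queue_.pop();
        not_full_.notify_one();
        return b;
      }

     private:
      std::size_t capacity_;
      std::queue<Batch> queue_;
      std::mutex mu_;
      std::condition_variable not_empty_, not_full_;
      std::thread worker_;  // declared last so all other members exist when it starts
    };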
