Multi GPU Training Notes
Tips for Faster Multi-GPU Training
This article collects notes and practical tips that can significantly improve the speed and efficiency of multi-GPU deep learning training. By paying attention to these small details, you can avoid common pitfalls and optimize your training workflow.
1. Optimizers and Schedulers
- Learning Rate: For fine-grained control over your learning rate, you can configure `CosineAnnealingLR` to adjust the rate on a per-step basis in PyTorch Lightning. This allows for more dynamic learning rate decay throughout the training process.
- Optimizers: `AdamW` remains a reliable and highly effective choice for most deep learning tasks. While some newer optimizers like Muon show promise, they often come with added complexity. Muon, for instance, requires you to use other standard optimizers like AdamW for specific parameters (embeddings, biases, classifier heads), which can be cumbersome to manage. Sticking with AdamW simplifies your setup and provides excellent performance. A configuration sketch combining both tips follows this list.
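Below is a minimal sketch of how these two choices can be wired together in a PyTorch Lightning `configure_optimizers`, with the scheduler stepping once per batch via `interval: "step"`. The `LitModel` name and all hyperparameter values are illustrative placeholders, not recommendations from these notes.

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    # ... layers and training_step omitted ...

    def configure_optimizers(self):
        # AdamW as the default, dependable optimizer choice.
        optimizer = torch.optim.AdamW(
            self.parameters(), lr=3e-4, weight_decay=0.01  # placeholder values
        )
        # Anneal over the total number of optimizer steps so the cosine
        # schedule spans the whole run rather than a single epoch.
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=self.trainer.estimated_stepping_batches
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "interval": "step",  # update the LR every step, not every epoch
            },
        }
```

With `interval` set to `"step"`, Lightning calls `scheduler.step()` after every optimizer step, which produces the smooth per-step decay described above.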
2. Efficient Data Management
- Image Loading: Avoid using OpenCV's `cv2.imread` to load a full image before cropping. This is an expensive operation that consumes a lot of CPU memory. A much more efficient approach is to use PIL's `Image.open` followed by the `.crop` method. This handles the image data more judiciously and saves a significant amount of memory (see the sketch after this list).
- On-the-fly Computations: Calculating normal maps or other derived data within the `__getitem__` method of your dataset, especially with NumPy on the CPU, can be a major bottleneck. Pre-computing this data offline and storing it is the most performant strategy. If that's not possible, consider using a faster, GPU-accelerated method.
- Batching: It is crucial to process your data in batches. Even if you have to pad some samples with zeros to ensure uniform batch sizes, the performance benefits of processing multiple samples simultaneously on the GPU far outweigh the overhead.
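The sketch below combines these ideas: `Image.open` plus `.crop` inside `__getitem__`, no heavy derived-data computation in the loader, and a zero-padding `collate_fn` so variable-sized crops still form uniform batches. The `CropDataset` layout, the `pad_collate` helper, and all parameter values are hypothetical.

```python
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision.transforms.functional import to_tensor


class CropDataset(Dataset):
    """Hypothetical dataset: one pre-defined crop box per image."""

    def __init__(self, image_paths, crop_boxes):
        self.image_paths = image_paths
        self.crop_boxes = crop_boxes  # (left, upper, right, lower) per image

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Open the file and keep only the cropped region; the full decoded
        # image is not retained the way a full cv2.imread array would be.
        with Image.open(self.image_paths[idx]) as img:
            patch = img.crop(self.crop_boxes[idx]).convert("RGB")
        # Keep __getitem__ light: normal maps and other derived data are
        # assumed to be pre-computed offline, not calculated here.
        return to_tensor(patch)


def pad_collate(batch):
    """Zero-pad variable-sized crops to the largest H and W in the batch."""
    channels = batch[0].shape[0]
    max_h = max(t.shape[1] for t in batch)
    max_w = max(t.shape[2] for t in batch)
    out = torch.zeros(len(batch), channels, max_h, max_w)
    for i, t in enumerate(batch):
        out[i, :, : t.shape[1], : t.shape[2]] = t
    return out


# Example usage (paths and boxes are placeholders):
# loader = DataLoader(CropDataset(paths, boxes), batch_size=32,
#                     num_workers=8, collate_fn=pad_collate)
```

Padding wastes a few zeroed pixels per sample, but it lets every step run as one batched GPU operation, which is the trade-off the batching tip above is about.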
3. Troubleshooting Multi-GPU Training
- DDP Instantiation: Never mix `torchrun` with `ddp_spawn`. Using both simultaneously can lead to multiple processes being created on a single GPU, resulting in corrupted checkpoints and duplicate output files (e.g., `version_#`). Use one or the other for distributed training.
- NCCL Hangs: If your training process hangs during NCCL (NVIDIA Collective Communications Library) initialization, it's often a sign of a problem with inter-GPU communication. This can be related to PCI Express Access Control Services (ACS).
  - Diagnosis: Check if ACS is enabled by running `sudo lspci -vvv | grep ACSCtl`. If you see `SrcValid+`, ACS may be a factor.
  - Quick Fix: A temporary workaround is to disable P2P communication by setting the environment variable `NCCL_P2P_DISABLE=1` (see the sketch at the end of these notes). While this may resolve the hang, it can lead to slower training speeds.
  - Further Investigation: Use the `p2pBandwidthLatencyTest` tool from the CUDA samples to diagnose the specific communication issue between your GPUs. The latency before and after disabling ACS changes a lot!
  - Disable ACS in BIOS: In the BIOS, go to the `Advanced` tab, then find and select the `AMD CBS` menu. Inside `AMD CBS`, find and select `NBIO Common Options` and change `ACS Enable` to `Disabled`. I also set `pcie_acs_override=downstream` in `/etc/default/grub`, then ran `sudo update-grub` and rebooted the workstation.
- GPU Interconnect Topology: The way your GPUs are connected significantly impacts communication speed. Use `nvidia-smi topo -m` to check your setup.
  - `NV#`: Your GPUs are connected via NVLink, which offers very high-speed communication. This is the optimal configuration.
  - `PIX`: This indicates a PCIe connection, which is still a good, common setup.
  - `NODE`: This means your GPUs sit on different NUMA nodes (for example, separate CPU sockets), so their communication must traverse the CPU and the inter-node interconnect, which is very slow.
  - BIOS Settings for AMD: If you have an AMD system, whether multi-socket or single-socket, whose topology shows `NODE` connections, check your BIOS settings. Navigate to `AMD CBS` -> `DF Common Options` -> `Memory Addressing` and change the `NUMA nodes` setting from `AUTO`, `NPS2`, or `NPS4` to `NPS1`. This configures the system to treat all memory as a single NUMA domain, often improving inter-GPU communication.
  - Verify the setting in Ubuntu by running `lscpu` and confirming it reports a single `NUMA node`.
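To tie the troubleshooting tips together, here is a minimal, hypothetical launch-script sketch: it uses plain DDP (launched with `torchrun`, never `ddp_spawn`) and shows where the `NCCL_P2P_DISABLE=1` quick fix would go if ACS-related hangs appear. The `Trainer` arguments are placeholders for your own setup, and the workaround line should be removed once ACS is properly disabled in the BIOS.

```python
# Hypothetical train.py, launched with:
#   torchrun --nproc_per_node=4 train.py
import os

import pytorch_lightning as pl

# Optional quick fix for NCCL hangs caused by PCIe ACS. It must be set
# before the first NCCL communicator is created, and it trades the hang
# for slower GPU-to-GPU transfers, so prefer disabling ACS in the BIOS.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")


def main():
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,          # torchrun creates one process per GPU
        strategy="ddp",     # plain DDP; never combine torchrun with ddp_spawn
    )
    # trainer.fit(model, datamodule=datamodule)  # placeholders for your own code


if __name__ == "__main__":
    main()
```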