Multi-GPU Training Notes
Tips for Faster Multi-GPU Training
This article collects notes and practical tips to significantly improve the speed and efficiency of multi-GPU deep learning training. By paying attention to these small details, you can avoid common pitfalls and optimize your training workflow.
1. Optimizers and Schedulers
- Learning Rate: For fine-grained control over your learning rate, you can configure `CosineAnnealingLR` to adjust the rate on a per-step basis in PyTorch Lightning. This allows for more dynamic learning rate decay throughout the training process (see the sketch after this list).
- Optimizers: `AdamW` remains a reliable and highly effective choice for most deep learning tasks. While some newer optimizers such as Muon show promise, they often come with added complexity: Muon, for instance, still requires a standard optimizer such as AdamW for specific parameters (embeddings, biases, classifier heads), which can be cumbersome to manage. Sticking with AdamW simplifies your setup and still provides excellent performance.
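As a minimal sketch of both points (assuming a recent PyTorch Lightning release that provides `estimated_stepping_batches`, and a placeholder model), `configure_optimizers` can return `AdamW` together with a `CosineAnnealingLR` scheduler that is stepped every batch:

```python
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(32, 10)  # placeholder model

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=3e-4, weight_decay=1e-2)
        # T_max covers the whole run so the cosine decay ends exactly at the last step.
        scheduler = CosineAnnealingLR(
            optimizer, T_max=self.trainer.estimated_stepping_batches
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "interval": "step",  # step the scheduler every batch instead of every epoch
            },
        }
```

The `"interval": "step"` key is what switches Lightning from epoch-wise to step-wise scheduling.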
2. Efficient Data Management
- Image Loading: Avoid using OpenCV's `cv2.imread` to load a full image before cropping. This is an expensive operation that consumes a lot of CPU memory. A much more efficient approach is to use PIL's `Image.open` followed by the `.crop` method, which handles the image data more judiciously and saves a significant amount of memory (see the image-loading sketch after this list).
- On-the-fly Computations: Calculating normal maps or other derived data inside the `__getitem__` method of your dataset, especially with NumPy on the CPU, can be a major bottleneck. Pre-computing this data offline and storing it is the most performant strategy. If that's not possible, consider a faster, GPU-accelerated method.
- Batching: It is crucial to process your data in batches. Even if you have to pad some samples with zeros to ensure uniform batch sizes, the performance benefit of processing multiple samples simultaneously on the GPU far outweighs the padding overhead (see the padding sketch further below).
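A minimal sketch of the loading pattern (the file name and crop box are placeholders):

```python
import numpy as np
from PIL import Image

# cv2.imread decodes the entire image into a NumPy array that stays in memory
# until you explicitly drop it. With the pattern below, only the small crop
# survives once the with-block exits.
with Image.open("frame_000.png") as img:       # placeholder file name
    patch = img.crop((100, 100, 356, 356))     # (left, upper, right, lower)

patch_array = np.asarray(patch)                # compact array to feed the pipeline
```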
 
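For the batching point, a hypothetical `collate_fn` that zero-pads variable-length samples into a uniform batch might look like this (names and shapes are illustrative):

```python
import torch
from torch.utils.data import DataLoader


def pad_collate(batch):
    """Zero-pad variable-length 1-D feature tensors so they stack into one batch."""
    # batch is a list of (features, label) pairs with features of differing lengths.
    max_len = max(features.shape[0] for features, _ in batch)
    padded = torch.zeros(len(batch), max_len)
    for i, (features, _) in enumerate(batch):
        padded[i, : features.shape[0]] = features
    labels = torch.tensor([label for _, label in batch])
    return padded, labels


# loader = DataLoader(dataset, batch_size=32, num_workers=8, collate_fn=pad_collate)
```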
3. Troubleshooting Multi-GPU Training
- DDP Instantiation: Never mix `torchrun` with `ddp_spawn`. Using both simultaneously can lead to multiple processes being created on a single GPU, resulting in corrupted checkpoints and duplicate output directories (e.g., `version_#`). Use one launch method or the other for distributed training (see the launch sketch at the end of this section).
- NCCL Hangs: If your training process hangs during NCCL (NVIDIA Collective Communications Library) initialization, it's often a sign of a problem with inter-GPU communication, which can be related to PCI Express Access Control Services (ACS).
  - Diagnosis: Check whether ACS is enabled by running `sudo lspci -vvv | grep ACSCtl`. If you see `SrcValid+`, ACS may be a factor.
  - Quick Fix: A temporary workaround is to disable P2P communication by setting the environment variable `NCCL_P2P_DISABLE=1`. While this may resolve the hang, it can lead to slower training.
  - Further Investigation: Use the `p2pBandwidthLatencyTest` tool from the CUDA samples to diagnose the specific communication issue between your GPUs. The measured latency before and after disabling ACS differs dramatically.
  - Disable ACS in BIOS: In the BIOS, go to the `Advanced` tab and select the `AMD CBS` menu. Inside `AMD CBS`, select `NBIO Common Options` and change `ACS Enable` to `Disabled`. I also set `pcie_acs_override=downstream` in `/etc/default/grub`, then ran `sudo update-grub` and rebooted the workstation.
- GPU Interconnect Topology: The way your GPUs are connected significantly impacts communication speed. Use `nvidia-smi topo -m` to check your setup.
  - `NV#`: Your GPUs are connected via NVLink, which offers very high-speed communication. This is the optimal configuration.
  - `PIX`: This indicates a PCIe connection, which is still a good, common setup.
  - `NODE`: The GPUs hang off different PCIe host bridges, so their communication must traverse the CPU rather than a direct GPU-to-GPU link, which is very slow.
  - BIOS Settings for AMD: If an AMD system (single- or multi-socket) shows `NODE` connections, check your BIOS settings. Navigate to `AMD CBS` -> `DF Common Options` -> `Memory Addressing` and change the `NUMA nodes` setting from `[AUTO, NPS2, NPS4]` to `NPS1`. This configures the system to treat all memory as a single NUMA domain, often improving inter-GPU communication.
  - Verify the setting in Ubuntu by running `lscpu` and confirming it reports a single NUMA node.
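As a sketch of the DDP launch point above (the script name, module import, and GPU count are placeholders), keep a single launch method: either let Lightning create the DDP worker processes itself, or hand that job to `torchrun`, but never both:

```python
# train.py -- hypothetical entry point
import pytorch_lightning as pl

from model import LitModel  # hypothetical module holding the LightningModule sketched earlier


def main():
    model = LitModel()  # assumed to define its own train_dataloader()
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,          # placeholder GPU count
        strategy="ddp",     # plain DDP; do not combine with ddp_spawn
    )
    trainer.fit(model)


if __name__ == "__main__":
    main()

# Launch with ONE of the following, never both:
#   python train.py                         # Lightning starts the DDP worker processes
#   torchrun --nproc_per_node=4 train.py    # torchrun starts them; Lightning picks up the env
```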