Multi GPU Training Notes
Tips for Faster Multi-GPU Training
This article collects notes and practical tips that can significantly improve the speed and efficiency of multi-GPU deep learning training. By paying attention to these small details, you can avoid common pitfalls and optimize your training workflow.
1. Optimizers and Schedulers
- Learning Rate: For fine-grained control over your learning rate, you can configure `CosineAnnealingLR` to adjust the rate on a per-step basis in PyTorch Lightning. This allows for more dynamic learning rate decay throughout the training process.
- Optimizers: `AdamW` remains a reliable and highly effective choice for most deep learning tasks. While some newer optimizers like Muon show promise, they often come with added complexity. Muon, for instance, requires you to use other standard optimizers like AdamW for specific parameters (embeddings, biases, classifier heads), which can be cumbersome to manage. Sticking with AdamW simplifies your setup and provides excellent performance. A configuration sketch combining both tips follows this list.
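Below is a minimal sketch of how these two choices can be wired together in a PyTorch Lightning `configure_optimizers`, with the scheduler stepping once per batch via `interval: "step"`. The `LitModel` name and all hyperparameter values are illustrative placeholders, not recommendations from these notes.

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    # ... layers and training_step omitted ...

    def configure_optimizers(self):
        # AdamW as the default, dependable optimizer choice.
        optimizer = torch.optim.AdamW(
            self.parameters(), lr=3e-4, weight_decay=0.01  # placeholder values
        )
        # Anneal over the total number of optimizer steps so the cosine
        # schedule spans the whole run rather than a single epoch.
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=self.trainer.estimated_stepping_batches
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "interval": "step",  # update the LR every step, not every epoch
            },
        }
```

With `interval` set to `"step"`, Lightning calls `scheduler.step()` after every optimizer step, which produces the smooth per-step decay described above.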
2. Efficient Data Management
- Image Loading: Avoid using OpenCV's `cv2.imread` to load a full image before cropping. This is an expensive operation that consumes a lot of CPU memory. A much more efficient approach is to use PIL's `Image.open` followed by the `.crop` method. This handles the image data more judiciously and saves a significant amount of memory (see the sketch after this list).
- On-the-fly Computations: Calculating normal maps or other derived data within the `__getitem__` method of your dataset, especially with NumPy on the CPU, can be a major bottleneck. Pre-computing this data offline and storing it is the most performant strategy. If that's not possible, consider using a faster, GPU-accelerated method.
- Batching: It is crucial to process your data in batches. Even if you have to pad some samples with zeros to ensure uniform batch sizes, the performance benefits of processing multiple samples simultaneously on the GPU far outweigh the overhead.
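The sketch below combines these ideas: `Image.open` plus `.crop` inside `__getitem__`, no heavy derived-data computation in the loader, and a zero-padding `collate_fn` so variable-sized crops still form uniform batches. The `CropDataset` layout, the `pad_collate` helper, and all parameter values are hypothetical.

```python
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision.transforms.functional import to_tensor


class CropDataset(Dataset):
    """Hypothetical dataset: one pre-defined crop box per image."""

    def __init__(self, image_paths, crop_boxes):
        self.image_paths = image_paths
        self.crop_boxes = crop_boxes  # (left, upper, right, lower) per image

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Open the file and keep only the cropped region; the full decoded
        # image is not retained the way a full cv2.imread array would be.
        with Image.open(self.image_paths[idx]) as img:
            patch = img.crop(self.crop_boxes[idx]).convert("RGB")
        # Keep __getitem__ light: normal maps and other derived data are
        # assumed to be pre-computed offline, not calculated here.
        return to_tensor(patch)


def pad_collate(batch):
    """Zero-pad variable-sized crops to the largest H and W in the batch."""
    channels = batch[0].shape[0]
    max_h = max(t.shape[1] for t in batch)
    max_w = max(t.shape[2] for t in batch)
    out = torch.zeros(len(batch), channels, max_h, max_w)
    for i, t in enumerate(batch):
        out[i, :, : t.shape[1], : t.shape[2]] = t
    return out


# Example usage (paths and boxes are placeholders):
# loader = DataLoader(CropDataset(paths, boxes), batch_size=32,
#                     num_workers=8, collate_fn=pad_collate)
```

Padding wastes a few zeroed pixels per sample, but it lets every step run as one batched GPU operation, which is the trade-off the batching tip above is about.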
3. Troubleshooting Multi-GPU Training
- DDP Instantiation: Never mix `torchrun` with `ddp_spawn`. Using both simultaneously can lead to multiple processes being created on a single GPU, resulting in corrupted checkpoints and duplicate output files (e.g., `version_#`). Use one or the other for distributed training.
- NCCL Hangs: If your training process hangs during NCCL (NVIDIA Collective Communications Library) initialization, it's often a sign of a problem with inter-GPU communication. This can be related to PCI Express Access Control Services (ACS).
  - Diagnosis: Check if ACS is enabled by running `sudo lspci -vvv | grep ACSCtl`. If you see `SrcValid+`, ACS may be a factor.
  - Quick Fix: A temporary workaround is to disable P2P communication by setting the environment variable `NCCL_P2P_DISABLE=1` (see the sketch at the end of these notes). While this may resolve the hang, it can lead to slower training speeds.
  - Further Investigation: Use the `p2pBandwidthLatencyTest` tool from the CUDA samples to diagnose the specific communication issue between your GPUs. The latency before and after disabling ACS changes a lot!
  - Disable ACS in BIOS: In the BIOS, go to the `Advanced` tab, then find and select the `AMD CBS` menu. Inside `AMD CBS`, find and select `NBIO Common Options` and change `ACS Enable` to `Disabled`. I also set `pcie_acs_override=downstream` in `/etc/default/grub`, then ran `sudo update-grub` and rebooted the workstation.
- GPU Interconnect Topology: The way your GPUs are connected significantly impacts communication speed. Use `nvidia-smi topo -m` to check your setup.
  - `NV#`: Your GPUs are connected via NVLink, which offers very high-speed communication. This is the optimal configuration.
  - `PIX`: This indicates a PCIe connection, which is still a good, common setup.
  - `NODE`: This means your GPUs sit on different NUMA nodes (for example, separate CPU sockets), so their communication must traverse the CPU and the inter-node interconnect, which is very slow.
  - BIOS Settings for AMD: If you have an AMD system, whether multi-socket or single-socket, whose topology shows `NODE` connections, check your BIOS settings. Navigate to `AMD CBS` -> `DF Common Options` -> `Memory Addressing` and change the `NUMA nodes` setting from `AUTO`, `NPS2`, or `NPS4` to `NPS1`. This configures the system to treat all memory as a single NUMA domain, often improving inter-GPU communication.
  - Verify the setting in Ubuntu by running `lscpu` and confirming it reports a single `NUMA node`.
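To tie the troubleshooting tips together, here is a minimal, hypothetical launch-script sketch: it uses plain DDP (launched with `torchrun`, never `ddp_spawn`) and shows where the `NCCL_P2P_DISABLE=1` quick fix would go if ACS-related hangs appear. The `Trainer` arguments are placeholders for your own setup, and the workaround line should be removed once ACS is properly disabled in the BIOS.

```python
# Hypothetical train.py, launched with:
#   torchrun --nproc_per_node=4 train.py
import os

import pytorch_lightning as pl

# Optional quick fix for NCCL hangs caused by PCIe ACS. It must be set
# before the first NCCL communicator is created, and it trades the hang
# for slower GPU-to-GPU transfers, so prefer disabling ACS in the BIOS.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")


def main():
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,          # torchrun creates one process per GPU
        strategy="ddp",     # plain DDP; never combine torchrun with ddp_spawn
    )
    # trainer.fit(model, datamodule=datamodule)  # placeholders for your own code


if __name__ == "__main__":
    main()
```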