Create README.md

Signed-off-by: David Rotermund <54365609+davrot@users.noreply.github.com>
David Rotermund 2024-03-09 16:43:12 +01:00 committed by GitHub
parent e6a609a295
commit 0d2b78be8f

# torchrun for multi-node but single GPU -- checking for network problems
## Top
Questions to [David Rotermund](mailto:davrot@uni-bremen.de)
This is not a tutorial on how to use DistributedDataParallel.
We use only three of the following four elements:
* [distributed.init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)
* [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel)
* [DistributedSampler](https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler)
* [torch.distributed.destroy_process_group](https://pytorch.org/tnt/stable/utils/generated/torchtnt.utils.distributed.destroy_process_group.html#torchtnt.utils.distributed.destroy_process_group)
The DistributedSampler is not used in this network test example. It would look something like this:
```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# dataset is your torch.utils.data.Dataset instance
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, shuffle=False, sampler=sampler)
```
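If you do use it for real training, note that DistributedSampler shuffles by default and has to be told the current epoch, otherwise every epoch sees the same order. A minimal sketch (num_epochs and the training step are placeholders):
```python
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # re-seed the shuffling for this epoch
    for batch in loader:
        ...  # your training step
```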
Also, you might not want to use [@record](https://pytorch.org/docs/stable/elastic/errors.html#torch.distributed.elastic.multiprocessing.errors.record) during training.
Another note:
If you want to evaluate the performance on the test data on only one node (which you probably want to do, even if you don't know it yet), check the ranks:
```python
local_rank = int(os.environ["LOCAL_RANK"])
world_rank = int(os.environ["RANK"])
```
Make sure that local_rank AND world_rank are both == 0.
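A minimal sketch of such a guard (run_test() is a placeholder for whatever evaluation you do):
```python
# Only the master process evaluates on the test data.
if (local_rank == 0) and (world_rank == 0):
    run_test()  # placeholder: your evaluation on the test data
```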
Which brings me to what these numbers (i.e. the lingo behind them) mean:
The local rank is the identifier of the process on the local computer. In the single-GPU case, this is always 0. If you run several processes per computer (e.g. one per GPU, or several CPU processes), it will differ.
```python
local_rank = int(os.environ["LOCAL_RANK"])
```
The world size is the total number of processes in the distributed network. Since we start one process (one GPU) per node, it equals the number of nodes (i.e. computers). In the following I will often use 2, because I used two computers for testing.
```python
world_size = int(os.environ["WORLD_SIZE"])
```
The world rank is the identifier of the computer in the distributed network. The master computer is always 0! Do not start a node that is not the master node with 0! Furthermore, don't start two nodes / computers with the same identifier.
Never ever. It will crash the distributed network.
```python
world_rank = int(os.environ["RANK"])
```
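Putting the three together for the single-GPU-per-node setup used here (torchrun sets all of these environment variables for you), a small sanity check could look like this:
```python
import os

world_size = int(os.environ["WORLD_SIZE"])  # number of nodes (one process per node)
world_rank = int(os.environ["RANK"])  # 0 on the master, up to world_size - 1 on the clients
local_rank = int(os.environ["LOCAL_RANK"])  # always 0 with one process / GPU per computer

assert 0 <= world_rank < world_size
assert local_rank == 0
```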
## torchrun script
The script needs to be started on the computer with the master IP first.
You need to change master_ip, master_port and python_file:
```shell
master_ip="10.10.10.10"
master_port="40001"
python_file="main.py"

# 1 if this computer owns the master IP, 0 otherwise
ip_check=$(ip addr | grep "$master_ip" | wc -l)

if [[ "$#" -ne 2 ]]
then
    echo "Illegal number of parameters. Usage: $0 WORLD_SIZE WORLD_RANK"
    exit 1
fi

WORLD_SIZE=$1
WORLD_RANK=$2

echo "World Size: $WORLD_SIZE"
echo "World Rank: $WORLD_RANK"

if [[ "$WORLD_SIZE" -le "$WORLD_RANK" ]]
then
    echo "WORLD_RANK needs to be smaller than WORLD_SIZE."
    exit 1
fi

if [[ "$ip_check" == "1" ]]
then
    if [[ "$WORLD_RANK" -ne 0 ]]
    then
        echo "Master: You need to have the WORLD_RANK of 0."
        exit 1
    fi
else
    if [[ "$WORLD_RANK" -eq 0 ]]
    then
        echo "Client: You are not allowed to have the WORLD_RANK of 0."
        exit 1
    fi
fi

torchrun \
    --nproc_per_node=1 --nnodes="$WORLD_SIZE" --node_rank="$WORLD_RANK" \
    --master_addr="$master_ip" --master_port="$master_port" \
    "$python_file"
```
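Assuming you saved the script as run.sh (the name is up to you), you would start it on the master (the computer that owns 10.10.10.10 in this example) as `bash run.sh 2 0` and then on the second computer as `bash run.sh 2 1`.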
## Back to the problem at hand
Let's say you ran the following Python program with backend="gloo" for torch.distributed.init_process_group and everything worked. Now you switch to the recommended backend for GPUs: backend="nccl". However, you get a totally useless error message:
```shell
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
```
What happened?
I don't know what happened in your case, but for me it was the following problem:
NCCL uses a peer-to-peer network for data exchange, and it uses a lot of ports. Which port range? This port range:
```shell
cat /proc/sys/net/ipv4/ip_local_port_range
```
```shell
32768 60999
```
You need to configure all your firewalls to allow data exchange between the nodes on these ports. Please don't just open the ports for the whole world; limit them to the IP range of your nodes.
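If the error persists, NCCL can tell you what it is trying to do. As the error message already hints, setting NCCL_DEBUG=INFO makes NCCL log the interfaces and ports it uses, and NCCL_SOCKET_IFNAME pins it to a specific network interface. A sketch, placed before init_process_group (the interface name eth0 is an assumption; check `ip addr` on your nodes):
```python
import os

# Assumption: the nodes reach each other via "eth0"; check `ip addr` on your machines.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
# Make NCCL log which interfaces and ports it picks (as the error message suggests).
os.environ.setdefault("NCCL_DEBUG", "INFO")
```
For reference, this is the Python test program in question: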
```python
import os

import torch
from torch.distributed.elastic.multiprocessing.errors import record

# torchrun sets these environment variables for every process it starts.
local_rank = int(os.environ["LOCAL_RANK"])
world_rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])


@record
def main() -> None:
    gpu_identifier: str = f"cuda:{local_rank}"
    assert torch.cuda.device(gpu_identifier)
    device: torch.device = torch.device(gpu_identifier)

    print("init_process_group")
    torch.distributed.init_process_group(
        backend="nccl",
        rank=world_rank,
        world_size=world_size,
    )  # gloo, nccl
    assert torch.distributed.is_initialized()

    print("Sequential")
    model = torch.nn.Sequential()
    model.append(torch.nn.Linear(in_features=100, out_features=10))
    model = model.to(device=device)

    print("DistributedDataParallel")
    ddp_model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )
    print(ddp_model)

    print("destroy_process_group")
    torch.distributed.destroy_process_group()
    print("END")


if __name__ == "__main__":
    assert torch.cuda.is_available()
    assert torch.distributed.is_available()
    assert torch.distributed.is_nccl_available()
    assert torch.distributed.is_torchelastic_launched()

    print(f"world size: {world_size} world rank: {world_rank} local rank: {local_rank}")
    main()
    exit()
```