# torchrun for multi-node but single GPU -- checking for network problems

## Top Questions to [David Rotermund](mailto:davrot@uni-bremen.de)

This is not a tutorial on how to use DistributedDataParallel. Of the following four elements, we use only three:

* [distributed.init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)
* [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel)
* [DistributedSampler](https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler)
* [torch.distributed.destroy_process_group](https://pytorch.org/tnt/stable/utils/generated/torchtnt.utils.distributed.destroy_process_group.html#torchtnt.utils.distributed.destroy_process_group)

The DistributedSampler is not used in this network test example. It would look something like this:

```python
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, shuffle=False, sampler=sampler)
```

Also, you might not want to use [@record](https://pytorch.org/docs/stable/elastic/errors.html#torch.distributed.elastic.multiprocessing.errors.record) during actual training.

Another note: if you want to evaluate the performance on test data locally (which you probably want to do, even if you don't know it yet), make sure that local_rank AND world_rank are both == 0:

```python
local_rank = int(os.environ["LOCAL_RANK"])
world_rank = int(os.environ["RANK"])
```

Which brings me to what these numbers (i.e. the lingo behind them) mean:

The local rank is the identifier of the process on the local computer. In the single GPU case, this is always 0. If you have more GPUs (or even CPU threads) per node, it will differ.

```python
local_rank = int(os.environ["LOCAL_RANK"])
```

The world size is the number of nodes (i.e. computers) in the distributed network. In the following I will often use 2, because I used two computers for testing.

```python
world_size = int(os.environ["WORLD_SIZE"])
```

The world rank is the identifier of the computer in the distributed network. The master computer is always 0! Do not start a node which is not the master node with 0! Furthermore, don't start two nodes / computers with the same identifier. Never ever. It will crash the distributed network.

```python
world_rank = int(os.environ["RANK"])
```

## torchrun script

The script needs to be started on the computer with the master IP first. You need to change master_ip, master_port and python_file:

```shell
master_ip="10.10.10.10"
master_port="40001"
python_file="main.py"

ip_check=`ip addr | grep $master_ip | wc -l`

if [[ "$#" -ne 2 ]]
then
    echo "Illegal number of parameters"
    exit 1
fi

WORLD_SIZE=$1
WORLD_RANK=$2

echo World Size: $WORLD_SIZE
echo World Rank: $WORLD_RANK

if [[ "$WORLD_SIZE" -le "$WORLD_RANK" ]]
then
    echo "WORLD_RANK needs to be smaller than WORLD_SIZE."
    exit 1
fi

if [[ $ip_check == "1" ]]
then
    if [[ "$2" -ne 0 ]]
    then
        echo "Master: You need to have the WORLD_RANK of 0."
        exit 1
    fi
else
    if [[ "$2" -eq 0 ]]
    then
        echo "Client: You are not allowed to have the WORLD_RANK of 0."
        exit 1
    fi
fi

torchrun \
    --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$WORLD_RANK \
    --master_addr=$master_ip --master_port=$master_port \
    $python_file
```
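As a usage sketch (the file name run_torchrun.sh is just an example, not part of the original setup): the script takes WORLD_SIZE and WORLD_RANK as its two arguments, and the master node (the one owning master_ip) must be started first.

```shell
# On the master node (the machine that owns 10.10.10.10), start first:
# WORLD_SIZE=2, WORLD_RANK=0
bash run_torchrun.sh 2 0

# On the second node, start afterwards:
# WORLD_SIZE=2, WORLD_RANK=1
bash run_torchrun.sh 2 1
```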
## Back to the problem at hand

Let's say you ran the Python program below with backend="gloo" for torch.distributed.init_process_group and everything worked. Now you switch to the recommended backend for GPUs: backend="nccl". However, you get a totally useless error message:

```shell
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. Last error:
```

What happened? I don't know what happened in your case, but for me it was the following problem: NCCL uses a peer-to-peer network for data exchange, and it uses a lot of ports. Which port range? This port range:

```shell
cat /proc/sys/net/ipv4/ip_local_port_range
```

```shell
32768   60999
```

You need to configure all your firewalls to allow data exchange between the nodes. Please don't just open the ports to everyone; limit them to the IP range of your nodes.

For reference, this is the test program:

```python
import os

import torch
from torch.distributed.elastic.multiprocessing.errors import record

local_rank = int(os.environ["LOCAL_RANK"])
world_rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])


@record
def main() -> None:
    gpu_identifier: str = f"cuda:{local_rank}"
    assert torch.cuda.device(gpu_identifier)
    device: torch.device = torch.device(gpu_identifier)

    print("init_process_group")
    torch.distributed.init_process_group(
        backend="nccl",
        rank=world_rank,
        world_size=world_size,
    )  # gloo, nccl
    assert torch.distributed.is_initialized()

    print("Sequential")
    model = torch.nn.Sequential()
    model.append(torch.nn.Linear(in_features=100, out_features=10))
    model = model.to(device=device)

    print("DistributedDataParallel")
    ddp_model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )
    print(ddp_model)

    print("destroy_process_group")
    torch.distributed.destroy_process_group()
    print("END")


if __name__ == "__main__":
    assert torch.cuda.is_available()
    assert torch.distributed.is_available()
    assert torch.distributed.is_nccl_available()
    assert torch.distributed.is_torchelastic_launched()

    print(f"world size: {world_size} world rank: {world_rank} local rank: {local_rank}")
    main()
    exit()
```
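As an illustration only: opening the rendezvous port and the NCCL port range just for the other nodes could look like the sketch below with plain iptables. The subnet 10.10.10.0/24 is an assumption derived from the example master IP; your setup may use firewalld, ufw or nftables instead, so adapt accordingly.

```shell
# Assumed example subnet of the nodes: 10.10.10.0/24
# Rendezvous port from the torchrun script above (master_port=40001):
iptables -A INPUT -p tcp -s 10.10.10.0/24 --dport 40001 -j ACCEPT
# Ephemeral port range that NCCL picks its sockets from
# (see /proc/sys/net/ipv4/ip_local_port_range above):
iptables -A INPUT -p tcp -s 10.10.10.0/24 --dport 32768:60999 -j ACCEPT
```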