This repository was archived by the owner on Jun 6, 2024. It is now read-only.
Merged
56 commits
Add distributed training examples of PyTorch
vvfreesoul committed Sep 4, 2020
commit 6373f3aa77baacdd95eb52eeeacff769e34904b8
36 changes: 18 additions & 18 deletions examples/Distributed-example/readme.md
@@ -1,35 +1,35 @@
# How OpenPAI Models Distributed Jobs
## Taskrole and Instance
When we execute distributed programs on PAI, we can add different task roles for our job. For single server jobs, there is only one task role. For distributed jobs, there may be multiple task roles. For example, when TensorFlow is used to running distributed jobs, it has two roles, including the parameter server and the worker. In distributed jobs, it depends on how many instances are needed for a task role. For example, if it's 8 in a worker role of TensorFlow. It means there should be 8 Docker containers for the worker role.[Please visit this link for operations.](https://openpai.readthedocs.io/en/latest/manual/cluster-user/how-to-use-advanced-job-settings.html#multiple-task-roles)
## How OpenPAI Models Distributed Jobs
### Taskrole and Instance
When we run distributed programs on PAI, we can add different task roles to our job. A single-server job has only one task role, while a distributed job may have several. For example, when TensorFlow is used to run distributed jobs, it has two roles: the parameter server and the worker. In a distributed job, each role may have one or more instances. For example, if the worker role of a TensorFlow job has 8 instances, there should be 8 Docker containers for the worker role. Please visit [here](https://openpai.readthedocs.io/en/latest/manual/cluster-user/how-to-use-advanced-job-settings.html#multiple-task-roles) for specific operations.

## Environmental variables
In a distributed job, one task might communicate with others (When we say task, we mean a single instance of a task role). So a task need to be aware of other tasks' runtime information such as IP, port, etc. The system exposes such runtime information as environment variables to each task's Docker container. For mutual communication, users can write code in the container to access those runtime environment variables.[Please visit this link for operations.](https://openpai.readthedocs.io/en/latest/manual/cluster-user/how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation)

## Retry policy and Completion policy;
If unknown error happens, PAI will retry the job according to user settings. To set a retry policy and completion policy for your job.
[Please visit this link for operations.](https://openpai.readthedocs.io/en/latest/manual/cluster-user/how-to-use-advanced-job-settings.html#job-exit-spec-retry-policy-and-completion-policy)
## Run PyTorch Distributed Jobs in OpenPAI
### Environmental variables
In a distributed job, one task might communicate with others (by task, we mean a single instance of a task role), so a task needs to be aware of other tasks' runtime information such as IP, port, etc. The system exposes such runtime information as environment variables to each task's Docker container. For mutual communication, users can write code in the container to access those runtime environment variables. Please visit [here](https://openpai.readthedocs.io/en/latest/manual/cluster-user/how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation) for specific operations.

### Retry policy and Completion policy
If an unknown error happens, PAI will retry the job according to user settings. To set a retry policy and completion policy for a job, PAI asks the user to switch to Advanced mode. Please visit [here](https://openpai.readthedocs.io/en/latest/manual/cluster-user/how-to-use-advanced-job-settings.html#job-exit-spec-retry-policy-and-completion-policy) for specific operations.
### Run PyTorch Distributed Jobs in OpenPAI
Example Name | Multi-GPU | Multi-Node | Backend |Apex| Job protocol |
---|---|---|---|---|---|
Single-Node DataParallel CIFAR-10 | ✓| x | -|-| [cifar10-single-node-gpus-cpu-DP.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/cifar10-single-node-gpus-cpu-DP.yaml)|
cifar10-single-mul-DDP-gloo | ✓| ✓ | gloo|-| [cifar10-single-mul-DDP-gloo.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/cifar10-single-mul-DDP-gloo.yaml)|
cifar10-single-mul-DDP-nccl | ✓| ✓ |nccl|-| [cifar10-single-mul-DDP-nccl.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/cifar10-single-mul-DDP-nccl.yaml)|
cifar10-single-mul-DDP-gloo-Apex-mixed | ✓| ✓ | gloo|✓ | [cifar10-single-mul-DDP-gloo-Apex-mixed.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/cifar10-single-mul-DDP-gloo-Apex-mixed.yaml)|
cifar10-single-mul-DDP-nccl-Apex-mixed | ✓| ✓ | nccl| ✓ | [cifar10-single-mul-DDP-gloo-Apex-mixed.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/cifar10-single-mul-DDP-gloo-Apex-mixed.yaml)|
imagenet-single-mul-DDP-gloo | ✓| ✓| gl00|-| [imagenet-single-mul-DDP-gloo.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/Lite-imagenet-single-mul-DDP-gloo.yaml)|
imagenet-single-mul-DDP-gloo | ✓| ✓| gloo|-| [imagenet-single-mul-DDP-gloo.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/Lite-imagenet-single-mul-DDP-gloo.yaml)|

## DataParallel
Contributor review comment:

This should start with `###`, and the same goes for the headings that follow.

The single node program is simple. The program executed in PAI is exactly the same as the program in our machine. It should be noted that an Worker can be applied in PAI and a Instance can be applied in Worker. In a worker, we can apply for GPUs that we need. [So let's give an example of DP here.]()
The single-node program is simple. The program executed in PAI is exactly the same as the program on our own machine. Note that in PAI we can apply for a Worker, and for an Instance within that Worker. In a worker, we can apply for the GPUs that we need. We provide an [example]() of DP.

## DistributedDataParallel
Of course, running distributed programs, we can also use DDP, which is a little more complex than DP programs. When we need to use DDP, we need to consider the IP and Port of the master node, and ensure that all nodes can access the same host port for process synchronization. In Pai, we can apply for a Port dedicated for multi process synchronization in the job submission interface. The reason for this is that we try our best to avoid the occupation of our distributed programs that have been allocated to the Port for other tasks. Of course, we also need wordd-size in the DDP program, which represents the total number of our processes. In Pai, we can also get it by reading environment variables. If we want to implement DDP on multiple nodes, we can apply for an Worker and then apply for multiple Instances to correspond to multiple nodes. If we want to run DDP on a single node, we only need to apply for an Instance and a Worker.The specific code for reading environment variables in Pai is as follows:
1. os.environ['MASTER_ADDR'] = os.environ['PAI_HOST_IP_worker_0']
2. os.environ['MASTER_PORT'] = os.environ['PAI_worker_0_SynPort_PORT']
DDP requires users to set a master node IP and port for synchronization in PyTorch. For the port, you can simply pick a fixed port, such as `5000`, as your master port. However, this port may conflict with others. To prevent port conflicts, you can reserve a port in OpenPAI, as mentioned [here](https://openpai.readthedocs.io/en/latest/manual/cluster-user/how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation). The port you reserved is available in environment variables like `PAI_PORT_LIST_$taskRole_$taskIndex_$portLabel`, where `$taskIndex` means the instance index of that task role. For example, if your task role name is `worker` and port label is `SynPort`, you can add the following code in your PyTorch DDP program:

```
os.environ['MASTER_ADDR'] = os.environ['PAI_HOST_IP_worker_0']
os.environ['MASTER_PORT'] = os.environ['PAI_worker_0_SynPort_PORT']
```

DDP communication back-end using GLOO, you need to add a command in yaml file, otherwise there will be communication errors.We must add export GLOO_SOCKET_IFNAME=eth0 for GLOO.
If you are using `gloo` as your DDP communication backend, please set the correct network interface, for example `export GLOO_SOCKET_IFNAME=eth0`.

[So let's give an example of DDP(nccl) here.]()

[So let's give an example of DDP(gloo) here.]()
We provide examples with [nccl]() and [gloo]() as backend.
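
The environment-variable handling described above can be collected into one small helper. The following is a minimal sketch, not part of the PR: the function name `pai_master_env` and its parameters are hypothetical, the variable names `PAI_HOST_IP_worker_0` and `PAI_worker_0_SynPort_PORT` are taken from the README snippet, and the default interface `eth0` is the value the README suggests and may differ on your cluster.

```python
def pai_master_env(env, task_role="worker", port_label="SynPort",
                   backend="nccl", ifname="eth0"):
    """Return the rendezvous variables for PyTorch DDP, derived from OpenPAI variables.

    `env` is any mapping of environment variables (for example os.environ).
    The key patterns follow the README snippet; instance 0 of the task role
    is assumed to be the master node.
    """
    addr_key = f"PAI_HOST_IP_{task_role}_0"
    port_key = f"PAI_{task_role}_0_{port_label}_PORT"

    # Fail early with a readable message if the job did not reserve the port
    # or uses a different task role name or port label.
    missing = [key for key in (addr_key, port_key) if key not in env]
    if missing:
        raise KeyError(f"missing OpenPAI variables: {missing}")

    result = {"MASTER_ADDR": env[addr_key], "MASTER_PORT": env[port_key]}

    if backend == "gloo":
        # Keep a value already exported in the job YAML; otherwise fall back
        # to the assumed default interface.
        result["GLOO_SOCKET_IFNAME"] = env.get("GLOO_SOCKET_IFNAME", ifname)

    return result
```

In a training script, one way to use this would be `os.environ.update(pai_master_env(os.environ, backend="gloo"))` before calling `torch.distributed.init_process_group` with `init_method="env://"`. Taking the mapping as a parameter, instead of reading `os.environ` directly, keeps the helper easy to check outside a PAI container.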