This repository was archived by the owner on Jun 6, 2024. It is now read-only.
Add distributed training examples of PyTorch #4821
Merged
commit 6373f3aa77baacdd95eb52eeeacff769e34904b8
## How OpenPAI Models Distributed Jobs
### Taskrole and Instance
When we execute distributed programs on OpenPAI, we can add different task roles to our job. For single-server jobs, there is only one task role. For distributed jobs, there may be multiple task roles. For example, when TensorFlow is used to run distributed jobs, it has two task roles: the parameter server and the worker. In distributed jobs, each task role may have one or more instances. For example, if a TensorFlow worker role has 8 instances, there should be 8 Docker containers for the worker role. Please visit [here](https://openpai.readthedocs.io/en/latest/manual/cluster-user/how-to-use-advanced-job-settings.html#multiple-task-roles) for specific operations.

### Environmental variables
In a distributed job, one task might communicate with others (when we say "task", we mean a single instance of a task role). So a task needs to be aware of other tasks' runtime information such as IP, port, etc. The system exposes such runtime information as environment variables to each task's Docker container. For mutual communication, users can write code in the container to access those runtime environment variables. Please visit [here](https://openpai.readthedocs.io/en/latest/manual/cluster-user/how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation) for specific operations.

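As a minimal sketch of accessing such runtime environment variables from Python, consider the snippet below. The variable name `PAI_HOST_IP_worker_0` follows the DDP example later in this document; the helper function and the fallback value are purely illustrative, not part of the PAI API.

```python
import os

def get_peer_address(task_role="worker", task_index=0):
    """Illustrative helper: look up the host IP of another task's container."""
    return os.environ.get("PAI_HOST_IP_{}_{}".format(task_role, task_index))

# Inside a real PAI container this variable is already set by the system;
# the value below is only a stand-in so the sketch runs outside PAI.
os.environ.setdefault("PAI_HOST_IP_worker_0", "10.0.0.1")
print(get_peer_address("worker", 0))
```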
### Retry policy and Completion policy
If an unknown error happens, PAI will retry the job according to user settings. To set a retry policy and completion policy for your job, PAI asks you to switch to Advanced mode. Please visit [here](https://openpai.readthedocs.io/en/latest/manual/cluster-user/how-to-use-advanced-job-settings.html#job-exit-spec-retry-policy-and-completion-policy) for specific operations.

### Run PyTorch Distributed Jobs in OpenPAI
Example Name | Multi-GPU | Multi-Node | Backend | Apex | Job protocol
---|---|---|---|---|---
Single-Node DataParallel CIFAR-10 | ✓ | x | - | - | [cifar10-single-node-gpus-cpu-DP.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/cifar10-single-node-gpus-cpu-DP.yaml)
cifar10-single-mul-DDP-gloo | ✓ | ✓ | gloo | - | [cifar10-single-mul-DDP-gloo.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/cifar10-single-mul-DDP-gloo.yaml)
cifar10-single-mul-DDP-nccl | ✓ | ✓ | nccl | - | [cifar10-single-mul-DDP-nccl.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/cifar10-single-mul-DDP-nccl.yaml)
cifar10-single-mul-DDP-gloo-Apex-mixed | ✓ | ✓ | gloo | ✓ | [cifar10-single-mul-DDP-gloo-Apex-mixed.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/cifar10-single-mul-DDP-gloo-Apex-mixed.yaml)
cifar10-single-mul-DDP-nccl-Apex-mixed | ✓ | ✓ | nccl | ✓ | [cifar10-single-mul-DDP-gloo-Apex-mixed.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/cifar10-single-mul-DDP-gloo-Apex-mixed.yaml)
imagenet-single-mul-DDP-gloo | ✓ | ✓ | gloo | - | [imagenet-single-mul-DDP-gloo.yaml](https://github.com/vvfreesoul/pai/blob/master/examples/Yaml/Lite-imagenet-single-mul-DDP-gloo.yaml)

## DataParallel
The single-node program is simple. The program executed in PAI is exactly the same as the program on our own machine. Note that we can add a worker task role in PAI and set the number of instances for it; in a worker, we can request the GPUs we need. We provide an [example]() of DP.

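To make the DP idea concrete, here is a minimal hedged sketch; the toy model and batch size are illustrative and not taken from the examples in this PR.

```python
import torch
import torch.nn as nn

# A toy model; any nn.Module works the same way with DataParallel.
model = nn.Linear(10, 2)

# nn.DataParallel splits each input batch across all visible GPUs and
# gathers the outputs on the default device; when no GPUs are available
# it simply runs the wrapped module, so this sketch also works on CPU.
model = nn.DataParallel(model)

inputs = torch.randn(8, 10)   # batch of 8 samples, 10 features each
outputs = model(inputs)
print(outputs.shape)          # torch.Size([8, 2])
```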
## DistributedDataParallel
DDP requires users to set a master node IP and port for synchronization in PyTorch. For the port, you can simply set one certain port, such as `5000`, as your master port. However, this port may conflict with others. To prevent port conflicts, you can reserve a port in OpenPAI, as we mentioned [here](https://openpai.readthedocs.io/en/latest/manual/cluster-user/how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation). The port you reserved is available in environment variables like `PAI_PORT_LIST_$taskRole_$taskIndex_$portLabel`, where `$taskIndex` means the instance index of that task role. For example, if your task role name is `worker` and your port label is `SynPort`, you can add the following code in your PyTorch DDP program:

```
import os

os.environ['MASTER_ADDR'] = os.environ['PAI_HOST_IP_worker_0']
os.environ['MASTER_PORT'] = os.environ['PAI_worker_0_SynPort_PORT']
```

If you are using `gloo` as your DDP communication backend, please set the correct network interface in your job's YAML command, such as `export GLOO_SOCKET_IFNAME=eth0`; otherwise there will be communication errors.

We provide examples with [nccl]() and [gloo]() as backend.

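Putting the pieces together, a minimal sketch of initializing DDP inside a PAI container might look like the following. The task role name `worker` and port label `SynPort` follow the snippet above; the localhost fallbacks and the single-process `rank=0, world_size=1` setup are assumptions so the sketch can run outside PAI, not part of the PAI API.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Inside a PAI container these variables are set by the system; the
# defaults below only make the sketch runnable outside PAI.
os.environ['MASTER_ADDR'] = os.environ.get('PAI_HOST_IP_worker_0', '127.0.0.1')
os.environ['MASTER_PORT'] = os.environ.get('PAI_worker_0_SynPort_PORT', '29500')

# Rank and world size would normally be derived from PAI environment
# variables; a single-process group is used here for illustration.
dist.init_process_group(backend='gloo', rank=0, world_size=1)

model = DDP(nn.Linear(10, 2))      # wrap the model for gradient syncing
out = model(torch.randn(4, 10))
print(out.shape)

dist.destroy_process_group()
```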
This should start with `###`, and the same applies to the headings that follow.