This repository was archived by the owner on Jun 6, 2024. It is now read-only.
extend prerequisite field in job protocol #5145
Motivation
The OpenPAI protocol lets users specify prerequisites (e.g. dockerimage, data, and script) and then reference them in a taskrole. The current version has some limitations.
- The current solution only supports parameter (e.g. uri) definition. This is enough for the most frequently used dockerimage type, because docker plays the role of the corresponding runtime executor. However, it is too limited for other types. For example, in the job config below, commands have to be injected into every taskrole to make the data ready.
  - It is not well organized (object-oriented). The wget command is an action on the data, but the two cannot be placed together.
  - It is hard to reuse. If the data is referenced by more than one taskrole, the wget commands must be injected everywhere.
  - It is hard to use. The user (or marketplace plugin) must modify more than one place to enable a data.
- A taskrole can only reference one data (or script, output).

prerequisites:
  - name: covid_data
    type: data
    uri:
      - https://x.x.x/yyy.zip # data uri
  - name: default_image
    type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
taskRoles:
  taskrole:
    dockerImage: default_image
    data: covid_data
    commands:
      - mkdir -p /data/covid19/data/
      - cd /data/covid19/data/
      - 'wget <% $data.uri[0] %>'
      - export DATA_DIR=/data/covid19/data/

Goal
- propose protocol updates and a runtime plugin to make prerequisites well organized and object-oriented. Besides defining parameters, a prerequisite also supports real functions (callbacks on specific events).
- make reuse of data, script, and other prerequisites easy and flexible
- better support management of dataset (via marketplace)
- enable advanced features (e.g. cluster data set, data location aware scheduling) in the future
- backward compatible (this version should support previous config).
Proposal
- support callbacks in prerequisites
- taskrole could reference a list of prerequisites
- runtime plugin implementation
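
To illustrate how a list of prerequisites per taskrole could stay backward compatible with the old single references, here is a minimal Python sketch. It is not part of the proposal: the function name `normalize_taskrole` and the assumption that the legacy fields are exactly `data`, `script`, and `output` are hypothetical, chosen only to match the limitation described in the motivation.

```python
# Hypothetical helper, not an OpenPAI API: fold legacy single references
# into the proposed `prerequisites` list of a taskrole.
LEGACY_FIELDS = ("data", "script", "output")  # assumed legacy field names


def normalize_taskrole(taskrole: dict) -> dict:
    """Return a copy of `taskrole` whose legacy references are moved
    into the `prerequisites` list, without duplicates."""
    # Keep everything except the legacy fields (dockerImage is left untouched).
    result = {k: v for k, v in taskrole.items() if k not in LEGACY_FIELDS}
    names = list(taskrole.get("prerequisites", []))
    for field in LEGACY_FIELDS:
        name = taskrole.get(field)
        if name is not None and name not in names:
            names.append(name)
    if names:
        result["prerequisites"] = names
    return result
```

With this sketch, an old-style taskrole that sets `data: covid_data` would end up with `prerequisites: [covid_data]`, so a previous config and a new config could share the same handling path.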
Examples
- defining actions with data
  - Different data requires different pre-commands: e.g. wget, nfs mount, azure blob download

prerequisites:
  - name: covid_data
    type: data
    callbacks:
      - event: containerStart
        commands:
          - mkdir -p /data/covid19/data/
          - cd /data/covid19/data/
          - 'wget https://x.x.x/yyy.zip'
          - export DATA_DIR=/data/covid19/data/
taskRoles:
  taskrole:
    dockerImage: default_image
    prerequisites:
      - covid_data
    commands:
      - ls $DATA_DIR

- setup environment/script prerequisites:
  - Some should run before the script starts: e.g. install pip packages, install openpai sdk.
  - Some should run after the script completes / succeeds / fails: e.g. log uploading, reports, alert
  - Enhanced debuggability, such as starting a jupyter server (or ssh) in 30 mins after the user's command fails
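
As a rough illustration of the examples above, the following Python sketch collects the shell commands that a taskrole's prerequisites contribute for a given event. The function name `commands_for_event` and the plain-dict representation of the parsed config are assumptions made for this sketch; the actual runtime plugin may work differently.

```python
# Hypothetical sketch: gather callback commands for one event from the
# prerequisites that a taskrole references by name.
def commands_for_event(prerequisites, names, event):
    """Return the commands of all callbacks matching `event`,
    following the order in which the taskrole lists `names`."""
    by_name = {p["name"]: p for p in prerequisites}
    commands = []
    for name in names:
        # `callbacks` is optional, so default to an empty list.
        for callback in by_name[name].get("callbacks", []):
            if callback["event"] == event:
                commands.extend(callback["commands"])
    return commands


# The covid_data prerequisite from the example above, as parsed data.
example_prerequisites = [
    {
        "name": "covid_data",
        "type": "data",
        "callbacks": [
            {
                "event": "containerStart",
                "commands": [
                    "mkdir -p /data/covid19/data/",
                    "cd /data/covid19/data/",
                    "wget https://x.x.x/yyy.zip",
                    "export DATA_DIR=/data/covid19/data/",
                ],
            }
        ],
    }
]
start_commands = commands_for_event(
    example_prerequisites, ["covid_data"], "containerStart"
)
```

Here `start_commands` holds the four data-preparation commands, which would run before the taskrole's own `ls $DATA_DIR`. The same prerequisite contributes nothing on `containerExit`, because it defines no callback for that event.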
Full Spec:

prerequisites:
  - name: string # required, unique name to find the prerequisite (from local or marketplace)
    type: "dockerimage | script | data | output" # for survey purpose (except dockerimage), useless for backend
    plugin: string # optional, the executor to handle current prerequisite; default is com.microsoft.pai.runtimeplugin.cmd or docker (for dockerimage)
    require: [] # optional, other prerequisites on which the current one depends
    callbacks: # optional, commands to run on events
      - event: "containerStart | containerExit"
        commands: # commands translated by plugin
          - string # shell commands for com.microsoft.pai.runtimeplugin.cmd
          - string # TODO: other commands (e.g. python) for other plugins
    failurePolicy: "ignore | fail" # optional, same default as runtime plugin
    # plugin-specific properties
    uri: string | array # optional, for backward compatibility (it is required before)
    key1: value1 # referred by <% this.parameters.key1 %>
    key2: value2 # TODO: inheritable from required ones
taskRoles:
  taskrole:
    prerequisites: # optional, requirements will be automatically parsed and inserted
      - prerequisite-1 # on containerStart, will execute in order
      - prerequisite-2 # on containerExit, will execute in reverse order

Each of the prerequisites will be handled in a way like:

for prerequisite in prerequisites:
    plugin(**prerequisite)
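
The handling loop and the ordering rules in the spec (requirements inserted automatically, in order on containerStart, in reverse order on containerExit) can be sketched in Python as below. This is only an illustration under stated assumptions: `resolve_order`, `handle_event`, and the `plugins` registry mapping plugin names to callables are hypothetical names, and depth-first resolution of `require` is one possible reading of "automatically parsed and inserted". Only the default plugin name `com.microsoft.pai.runtimeplugin.cmd` comes from the spec; the separate docker default for dockerimage is left out for brevity.

```python
# Hypothetical sketch of ordering and dispatching prerequisites.
DEFAULT_PLUGIN = "com.microsoft.pai.runtimeplugin.cmd"  # default named in the spec


def resolve_order(prerequisites, names):
    """Expand `require` dependencies depth-first so that every
    prerequisite appears after the ones it depends on."""
    by_name = {p["name"]: p for p in prerequisites}
    ordered = []
    in_progress = set()

    def visit(name):
        if name in ordered:
            return
        if name in in_progress:
            raise ValueError(f"circular requirement involving {name!r}")
        in_progress.add(name)
        for dependency in by_name[name].get("require", []):
            visit(dependency)
        in_progress.discard(name)
        ordered.append(name)

    for name in names:
        visit(name)
    return ordered


def handle_event(prerequisites, names, event, plugins):
    """Call the plugin of each prerequisite: in resolved order on
    containerStart, in reverse order on containerExit."""
    by_name = {p["name"]: p for p in prerequisites}
    order = resolve_order(prerequisites, names)
    if event == "containerExit":
        order.reverse()
    for name in order:
        prerequisite = by_name[name]
        plugin = plugins[prerequisite.get("plugin", DEFAULT_PLUGIN)]
        # Mirrors `plugin(**prerequisite)`, with the event passed alongside.
        plugin(event, **prerequisite)
    return order
```

For example, if a taskrole lists `covid_data` and `report`, and `covid_data` requires `sdk`, this sketch handles `sdk`, `covid_data`, `report` on containerStart and the reverse on containerExit. Reversing on exit lets each prerequisite clean up before the ones it depends on are torn down.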