This repository was archived by the owner on Jun 6, 2024. It is now read-only.

extend prerequisite field in job protocol #5145

@hzy46

Motivation

The OpenPAI job protocol lets users specify prerequisites (e.g. dockerimage, data, and script) and then reference them in a taskrole. The current version has some limitations.

  • The current solution only supports parameter definitions (e.g. uri). This is enough for the most frequently used type, dockerimage, because Docker acts as the corresponding runtime executor. However, it is too limited for other types. For example, in the job config below, commands have to be injected into every taskrole to make the data ready.
  • It is not well organized (object-oriented). The wget command is an action on the data, but the two cannot be placed together.
    • It is hard to reuse. If the data is referenced by more than one taskrole, the wget commands must be injected into each of them.
    • It is hard to use. The user (or a marketplace plugin) must modify more than one place to enable a data prerequisite.
  • A taskrole can reference only one data (or script, output) prerequisite.
prerequisites:
  - name: covid_data
    type: data
    uri:
      - https://x.x.x/yyy.zip # data uri
  - name: default_image
    type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
taskRoles:
  taskrole:
    dockerImage: default_image
    data: covid_data
    commands:
      - mkdir -p /data/covid19/data/
      - cd /data/covid19/data/
      - 'wget <% $data.uri[0] %>'
      - export DATA_DIR=/data/covid19/data/

Goal

  • propose protocol updates and a runtime plugin so that prerequisites are well organized and object-oriented. Besides defining parameters, a prerequisite also supports real functions (callbacks on specific events).
  • make reuse of data, script, and other prerequisites easy and flexible
  • better support dataset management (via the marketplace)
  • enable advanced features in the future (e.g. cluster datasets, data-location-aware scheduling)
  • stay backward compatible (this version should support previous configs).

Proposal

  1. support callbacks in prerequisites
  2. a taskrole can reference a list of prerequisites
  3. runtime plugin implementation
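
To illustrate how item 2 could coexist with the backward-compatibility goal, here is a minimal sketch that collects every prerequisite a taskrole references, whether it uses the new `prerequisites` list or an older single-reference field. The function name `referenced_prerequisites` and the choice of `data`, `script`, and `output` as the legacy fields are assumptions made for illustration, not part of the proposal.

```python
from typing import Any, Dict, List

# Assumption: these are the legacy single-reference fields of a taskrole.
LEGACY_FIELDS = ("data", "script", "output")


def referenced_prerequisites(taskrole: Dict[str, Any]) -> List[str]:
    """Return the names of all prerequisites a taskrole refers to.

    Names from the new `prerequisites` list come first, followed by any
    names found in legacy fields that are not already listed.
    """
    names: List[str] = list(taskrole.get("prerequisites", []))
    for field in LEGACY_FIELDS:
        name = taskrole.get(field)
        if name is not None and name not in names:
            names.append(name)
    return names


# An old-style taskrole (single `data` reference) and a new-style one.
old_style = {"dockerImage": "default_image", "data": "covid_data"}
new_style = {"dockerImage": "default_image", "prerequisites": ["covid_data"]}

old_names = referenced_prerequisites(old_style)
new_names = referenced_prerequisites(new_style)
```

Under these assumptions both configs resolve to the same list, so the rest of the runtime would only need to understand the list form.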

Examples

  • defining actions with data
    • Different data sources require different pre-commands, e.g. wget, NFS mount, Azure Blob download
prerequisites:
  - name: covid_data
    type: data
    callbacks:
      - event: containerStart
        commands:
          - mkdir -p /data/covid19/data/
          - cd /data/covid19/data/
          - 'wget https://x.x.x/yyy.zip'
          - export DATA_DIR=/data/covid19/data/

taskRoles:
  taskrole:
    dockerImage: default_image
    prerequisites: 
      - covid_data
    commands:
      - ls $DATA_DIR
  • setting up environment/script prerequisites:
    • Some should run before the script starts: e.g. installing pip packages, installing the OpenPAI SDK.
    • Some should run after the script completes / succeeds / fails: e.g. log uploading, reports, alerts
    • Enhanced debuggability, such as starting a Jupyter server (or ssh) in the 30 minutes after the user's command fails

Full Spec:

prerequisites:
  - name: string # required, unique name to find the prerequisite (from local or marketplace)
    type: "dockerimage | script | data | output" # for survey purposes only (except dockerimage); not used by the backend
    plugin: string # optional, the executor to handle current prerequisite; default is com.microsoft.pai.runtimeplugin.cmd or docker (for dockerimage)
    require: [] # optional, other prerequisites on which the current one depends
    callbacks: # optional, commands to run on events
      - event: "containerStart | containerExit"
        commands: # commands translated by plugin
          - string # shell commands for com.microsoft.pai.runtimeplugin.cmd
          - string # TODO: other commands (e.g. python) for other plugins
    failurePolicy: "ignore | fail" # optional, same default as runtime plugin
    # plugin-specific properties
    uri: string | array # optional, kept for backward compatibility (it was required in previous versions)
    key1: value1 # referenced as <% this.parameters.key1 %>
    key2: value2 # TODO: inheritable from required ones

taskRoles:
  taskrole:
    prerequisites: # optional; entries listed under `require` are resolved and inserted automatically
      - prerequisite-1 # on containerStart, the list is executed in order
      - prerequisite-2 # on containerExit, the list is executed in reverse order
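
The ordering rules in the spec above can be sketched as follows. This is only an illustration under stated assumptions: prerequisites named in `require` are placed before the one that depends on them, each prerequisite appears once, and `containerExit` simply walks the same list backwards. The helper names `resolve_order` and `commands_for`, and the `base_tools` and `log_upload` entries, are hypothetical.

```python
from typing import Dict, List, Set


def resolve_order(requested: List[str], specs: Dict[str, dict]) -> List[str]:
    """Expand `require` depth-first: dependencies first, no duplicates."""
    ordered: List[str] = []
    visiting: Set[str] = set()

    def visit(name: str) -> None:
        if name in ordered:
            return
        if name in visiting:
            raise ValueError(f"circular require involving {name!r}")
        visiting.add(name)
        for dep in specs[name].get("require", []):
            visit(dep)
        visiting.discard(name)
        ordered.append(name)

    for name in requested:
        visit(name)
    return ordered


def commands_for(event: str, requested: List[str], specs: Dict[str, dict]) -> List[str]:
    """Collect callback commands for one event, reversed for containerExit."""
    order = resolve_order(requested, specs)
    if event == "containerExit":
        order = list(reversed(order))
    commands: List[str] = []
    for name in order:
        for callback in specs[name].get("callbacks", []):
            if callback["event"] == event:
                commands.extend(callback["commands"])
    return commands


specs = {
    "base_tools": {
        "callbacks": [{"event": "containerStart", "commands": ["pip install requests"]}],
    },
    "covid_data": {
        "require": ["base_tools"],
        "callbacks": [
            {"event": "containerStart", "commands": ["wget https://x.x.x/yyy.zip"]},
            {"event": "containerExit", "commands": ["rm -rf /data/covid19"]},
        ],
    },
    "log_upload": {
        "callbacks": [{"event": "containerExit", "commands": ["echo upload logs"]}],
    },
}

start_commands = commands_for("containerStart", ["covid_data", "log_upload"], specs)
exit_commands = commands_for("containerExit", ["covid_data", "log_upload"], specs)
```

Here the taskrole lists only `covid_data` and `log_upload`; `base_tools` is pulled in through `require` and runs first on start, while on exit `log_upload` runs before the cleanup of `covid_data`.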

Each prerequisite will be handled roughly as follows:

for prerequisite in prerequisites:
  plugin(**prerequisite)
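
A slightly fuller sketch of that loop might look like the following. It assumes a registry that maps plugin names to callables, falls back to `com.microsoft.pai.runtimeplugin.cmd` when no `plugin` is given, and treats a missing `failurePolicy` as `fail`; the spec only says the default matches the runtime plugin, so that last choice is a guess. `cmd_plugin`, `broken_plugin`, `example.broken`, and `handle_all` are hypothetical names, and the real runtime plugin interface may differ.

```python
from typing import Any, Callable, Dict, List, Sequence

CMD_PLUGIN = "com.microsoft.pai.runtimeplugin.cmd"


def cmd_plugin(name: str, callbacks: Sequence[dict] = (), **properties: Any) -> List[str]:
    """Hypothetical cmd plugin: return the containerStart shell commands."""
    commands: List[str] = []
    for callback in callbacks:
        if callback["event"] == "containerStart":
            commands.extend(callback["commands"])
    return commands


def broken_plugin(**prerequisite: Any) -> List[str]:
    """Hypothetical plugin that always fails, used to show failurePolicy."""
    raise RuntimeError("plugin failed")


# Assumption: plugins are looked up by name in a simple registry.
PLUGINS: Dict[str, Callable[..., List[str]]] = {
    CMD_PLUGIN: cmd_plugin,
    "example.broken": broken_plugin,
}


def handle_all(prerequisites: List[dict]) -> List[str]:
    """Call the selected plugin for each prerequisite, honoring failurePolicy."""
    commands: List[str] = []
    for prerequisite in prerequisites:
        plugin = PLUGINS[prerequisite.get("plugin", CMD_PLUGIN)]
        try:
            commands.extend(plugin(**prerequisite))
        except Exception:
            # Assumption: an absent failurePolicy behaves like "fail".
            if prerequisite.get("failurePolicy", "fail") != "ignore":
                raise
    return commands


prerequisites = [
    {
        "name": "covid_data",
        "type": "data",
        "callbacks": [
            {"event": "containerStart", "commands": ["mkdir -p /data/covid19/data/"]},
        ],
    },
    {"name": "optional_step", "plugin": "example.broken", "failurePolicy": "ignore"},
]

generated = handle_all(prerequisites)
```

Passing the whole prerequisite as keyword arguments mirrors `plugin(**prerequisite)`: each plugin picks the fields it understands and receives the plugin-specific properties (such as `uri` or `key1`) through `**properties`.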
