New RestServer Architecture: RestServer -> DB -> ApiServer; P0 Items by hzy46 · Pull Request #4761 · microsoft/pai

hzy46 · 2020-07-23T06:45:54Z

Review

webportal
rest-server
database controller

Database Controller Test Cases

End-to-end Test
Path Test
Upgrade Test
Stress Test

End-to-end Test

Test Jobs

A job with simple output

Case: The job command is echo success.

Expect: The job succeeds and outputs success.

Stop a job

Case: The job command is sleep 1h; then stop this job.

Expect: The job can be stopped.

Job with retries

Case: The job command is exit 1; set max retry to 10.

Expect: The job retry history can be viewed.

Job with docker secret

Case: Submit a job with docker auth info.

Expect: The job runs successfully.

Job with priority class

Case: Submit a job with a priority class.

Expect: The job priority class is correct.

Job with secret

Case: Submit a job with secret info.

Expect: The secret info is successfully written into the job.

Bulk job

Case: Submit a job with 1000+ instances.

Expect: The job runs successfully.

Bulk job with retries

Case: Submit a job with 1000+ instances and 50+ retries.

Expect: The job runs successfully. The job retry history can be viewed.

Test in Certain Case

Database failure

Case: submit a job; shut down the database; try to submit another job; start the database

Expect: the first job is not affected; the second job cannot be submitted.

Database controller failure

Case: submit a job; shut down the database controller; try to submit another job; start the database controller

Expect: the first job is not affected; the second job cannot be submitted.

Framework controller failure

Case: shut down the framework controller; try to submit a job; start the framework controller

Expect: the job can be submitted, but stays in WAITING status. After the framework controller is started, its state turns to RUNNING.

Path Test

Rest-server checks conflicts

Case: Submit a job twice

Expect: The second job submission is not allowed.

Case: Stop a non-existent job

Expect: The operation is not allowed.

(Write merger) Add/Update FR (If not equal, override and mark !requestSynced, else no-op)

Environment: The write merger doesn't forward requests; shut down the database poller.

Case: Send two identical fake framework requests to the write merger.

Expect: The second request will be ignored.
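The add/update rule in this test case can be sketched as a pure function. This is only an illustration under assumptions, not the code in handler.js: `mergeFrameworkRequest` and the record shape are hypothetical, and equality is approximated with Node's `util.isDeepStrictEqual`.

```javascript
const { isDeepStrictEqual } = require('util');

// Hypothetical helper: decide what the write merger stores for an incoming
// framework request. `existing` is the current database record, or null.
function mergeFrameworkRequest(existing, incomingRequest) {
  if (existing && isDeepStrictEqual(existing.request, incomingRequest)) {
    // Same request as before: no-op, the record is left untouched.
    return { changed: false, record: existing };
  }
  // New or different request: override it and mark it as not yet synced.
  return {
    changed: true,
    record: { ...(existing || {}), request: incomingRequest, requestSynced: false },
  };
}

const first = mergeFrameworkRequest(null, { name: 'job-a', spec: { retries: 10 } });
const synced = { ...first.record, requestSynced: true };
const second = mergeFrameworkRequest(synced, { name: 'job-a', spec: { retries: 10 } });
console.log(first.changed, second.changed); // true false
```

The second, identical request leaves the record (including `requestSynced: true`) as it was, which is the behavior the test expects.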

(Write merger) If retain mode is off, delete frameworks which are not submitted through db.

Environment: Turn on retain mode

Case: Create a framework directly in API server;

Expect: The framework is not deleted.

Environment: Turn off retain mode

Case: Create a framework directly in API server;

Expect: The framework is deleted.
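The two retain-mode cases can be summarized as a small decision function. This is a sketch under assumptions: `frameworksToDelete` and its arguments are hypothetical names, and the real write merger may make this decision per request rather than over a list.

```javascript
// Hypothetical helper: list frameworks that exist in the API server but were
// never submitted through the database.
function frameworksToDelete(apiServerNames, databaseNames, retainModeEnabled) {
  if (retainModeEnabled) {
    return []; // retain mode on: unknown frameworks are kept
  }
  const known = new Set(databaseNames);
  return apiServerNames.filter((name) => !known.has(name));
}

const inApiServer = ['job-a', 'manual-fw']; // 'manual-fw' was created directly
const inDatabase = ['job-a'];
console.log(frameworksToDelete(inApiServer, inDatabase, true));  // []
console.log(frameworksToDelete(inApiServer, inDatabase, false)); // [ 'manual-fw' ]
```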

(Write merger) If not equal, override and mark as !requestSynced

Environment: Shut down the database poller

Case: Submit a job; then change its requestGeneration in the API server.

Expect: The framework is marked as requestSynced=false.

(Database poller) If API server 404 error, mock a delete Framework

Case: Submit a job; shut down the watcher after the job starts; wait until the job succeeds; stop the poller; start the watcher; stop the watcher; delete the framework manually; start the poller.

Expect:

After the watcher is restarted, the job is marked as state=Completed and requestSynced=true.
After the poller is restarted, it sends a mock delete-framework event to the write merger.
Finally, the job is marked as apiServerDeleted=true.

Upgrade Test

Upgrade from v1.0.y

Drop the database; deploy a v1.0.y test bed and submit some jobs (must include: one running job, one completed job with retry history, and one completed job without retry history). Then upgrade it. Make sure the job information is correct and running jobs are not affected.

Upgrade from v1.1.y

Drop the database; deploy a v1.1.y test bed and submit some jobs (must include: one running job, one completed job with retry history, and one completed job without retry history). Then upgrade it. Make sure the job information is correct and running jobs are not affected.

Stress Test

Submit 100000+ jobs in 1 hour. It should be handled properly.

coveralls · 2020-07-23T06:49:44Z

Coverage increased (+0.2%) to 34.564% when pulling 7c4878d on zhiyuhe/db_integration into fbba438 on master.

src/database-controller/build/build-pre.sh

.github/workflows/continuous-integration.yml

src/database-controller/sdk/package.json

src/database-controller/src/.dockerignore

src/rest-server/package.json

yqwang-ms · 2020-08-06T03:14:52Z

Move our Architecture doc to here?

#4651 #Closed

Refers to: src/database-controller/README.md:9 in 345d3af.

src/database-controller/test-case.md

yqwang-ms · 2020-08-06T03:17:06Z

src/database-controller/deploy/database-controller.yaml.template

+          value: "{{ cluster_cfg['database-controller']['k8s-connection-timeout-second'] }}"
+        - name: WRITE_MERGER_CONNECTION_TIMEOUT_SECOND
+          value: "{{ cluster_cfg['database-controller']['write-merger-connection-timeout-second'] }}"
+        - name: RECOVERY_MODE_ENABLED


Add doc for all your flags? #Closed

Done. #Closed

yqwang-ms · 2020-08-06T03:18:27Z

src/database-controller/deploy/database-controller.yaml.template

+        - name: RECOVERY_MODE_ENABLED
+{% if cluster_cfg['database-controller']['recovery-mode'] %}
+          value: "true"
+{% else %}


If recovery-mode=false, but you found table is empty, will you still recover from ApiServer? #Closed

doc the behavior

Done. The recovery mode is now initializer. #Closed

yqwang-ms · 2020-08-06T03:20:20Z

src/database-controller/deploy/database-controller.yaml.template

+        - name: K8S_CONNECTION_TIMEOUT_SECOND
+          value: "{{ cluster_cfg['database-controller']['k8s-connection-timeout-second'] }}"
+        - name: WRITE_MERGER_CONNECTION_TIMEOUT_SECOND
+          value: "{{ cluster_cfg['database-controller']['write-merger-connection-timeout-second'] }}"


Do you also implement the [ApiServerRetain Mode]:

Do not delete exceeded Frameworks in ApiServer DB Service [ApiServerRetain Mode] -> As the exceeded Frameworks are not visible to user, but still occupy resources, in [Default: Normal Mode], we should auto delete them.

#Closed

And do you have a test case for the [Normal Mode], i.e. you will delete jobs which are in ApiServer but not in DB?

Done. #Closed

yqwang-ms · 2020-08-06T03:33:08Z

Disable FC GC: frameworkCompletedRetainSec: 2147483600 #Closed

src/database-controller/src/write-merger/handler.js

yqwang-ms · 2020-08-06T08:16:31Z

src/database-controller/src/write-merger/handler.js

+            {},
+            snapshot.getAllUpdate(),
+            addOns.getUpdate(),
+          );


Explicitly init requestSynced=false, Deleted=false? #Closed

Done.

const record = _.assign(
  {requestSynced: false, apiServerDeleted: false},
  snapshot.getAllUpdate(),
  addOns.getUpdate(),
);

#Closed
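To illustrate why the defaults go first in the argument list, here is a minimal sketch that uses the built-in `Object.assign` in place of lodash's `_.assign` (both copy own enumerable properties from left to right, so later sources win). The two source objects are hypothetical stand-ins for `snapshot.getAllUpdate()` and `addOns.getUpdate()`, and their fields are assumptions.

```javascript
// Hypothetical stand-ins for snapshot.getAllUpdate() and addOns.getUpdate().
const snapshotUpdate = { name: 'job-a', requestSynced: true };
const addOnsUpdate = { virtualCluster: 'default' };

const record = Object.assign(
  { requestSynced: false, apiServerDeleted: false }, // explicit defaults
  snapshotUpdate, // overrides requestSynced because it comes later
  addOnsUpdate,
);

console.log(record.requestSynced);    // true  (taken from the snapshot)
console.log(record.apiServerDeleted); // false (default kept)
```

A field that no source sets keeps its explicit default instead of being left undefined, which is the point of the review request.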

src/database-controller/src/write-merger/handler.js

yqwang-ms · 2020-08-06T08:28:19Z

src/rest-server/src/models/v2/job/k8s.js

+      'metadata.annotations',
+      'spec',
+    ]);
+    frameworkRequest.spec.executionType = `${executionType.charAt(0)}${executionType.slice(1).toLowerCase()}`;


There is no lock during this read-modify-write (RMW), so updating executionType may cause other updates made during this period to be lost.

Can you move this RMW to the write merger, or let the write merger support something like a k8s patch? #Closed

Currently we only modify one field executionType. So the lost-update problem won't happen.

I suggest adding a patch API to the write merger in the future to handle multiple-field updates. #Closed

Better to solve it in current PR, since we will also update tasknumber, completionpolicy etc.
If it is too hard, pls explain the reason, comment it and track in issue #Closed

It is not hard. I will implement a patch interface using json merge patch. #Closed

Great. Given we only have a few update scenarios, your implementation does not need to be too complex or too general #Closed
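For reference, a JSON Merge Patch (RFC 7396) can be applied in a few lines. The sketch below is a generic implementation of that standard and not the patch interface added in this PR; the framework object and its fields are illustrative assumptions.

```javascript
// Minimal JSON Merge Patch (RFC 7396): a null value deletes a key, objects
// merge recursively, anything else (including arrays) replaces the target.
function mergePatch(target, patch) {
  if (patch === null || typeof patch !== 'object' || Array.isArray(patch)) {
    return patch;
  }
  const base =
    target !== null && typeof target === 'object' && !Array.isArray(target)
      ? { ...target }
      : {};
  for (const [key, value] of Object.entries(patch)) {
    if (value === null) {
      delete base[key];
    } else {
      base[key] = mergePatch(base[key], value);
    }
  }
  return base;
}

// Illustrative framework request; the field values are assumptions.
const framework = {
  spec: { executionType: 'Start', taskRoles: [{ name: 'worker', taskNumber: 1 }] },
};
const stopped = mergePatch(framework, { spec: { executionType: 'Stop' } });
console.log(stopped.spec.executionType);           // Stop
console.log(stopped.spec.taskRoles[0].taskNumber); // 1
```

Because the patch names only the fields it changes, the write merger can apply it to the latest stored request, so a caller no longer overwrites fields it read earlier.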

src/rest-server/src/models/v2/job-attempt.js

yqwang-ms · 2020-08-06T08:34:47Z

src/rest-server/src/models/v2/job-attempt.js

+        attributes: ['snapshot'],
+        where: {frameworkName: encodedFrameworkName},
+        order: [['attemptIndex', 'ASC']],
+      }


Do we need to dedup here?
For example, use the latest snapshot for the same framework and attemptIndex. #Closed

Use hash(frameworkName, attemptIndex, historyType) as primary key to solve this problem. #Closed

yqwang-ms · 2020-08-06T08:49:21Z

src/rest-server/src/models/v2/job-attempt.js

+      const historyFramework = await databaseModel.FrameworkHistory.findOne({
+        attributes: ['snapshot'],
+        where: {frameworkName: encodedFrameworkName, attemptIndex: jobAttemptIndex},
+      });


use the latest snapshot for the same framework and attemptIndex?

To judge if it is the latest snapshot, could you also store the snapshot last write time in (history) DB #Closed

There are insertedAt and updatedAt fields in the history DB.

What do you mean by use the latest snapshot for the same framework and attemptIndex? #Closed

In the history DB, for the same framework name and attemptIndex, there may be many snapshots (some in running state, some in completed state). When showing the retry history, we should give the user the latest snapshot, which is generally the completed-state snapshot. #Closed

Using updatedAt or the snapshot's last write time does not seem right. We should use the snapshot generation time rather than the time it was written to the DB. If so, it seems we should use framework.status.transitionTime #Closed

For the same framework name and attemptIndex, there is only one record in the history db.

The history is recorded when fc log outputs something like framework xxx is retried. #Closed

Will the history table store more snapshots in the future, such as rescale snapshots, or only retry snapshots?
BTW, even if it is just a retry snapshot table, framework xxx is retried. may be produced by FC more than once in some corner cases. Will you override the previous one to make sure there is always only one?

#Closed

Currently only retry snapshots are available. For your corner case, there is no overwrite. Both records will be stored. #Closed

So you need to get the last one #Closed

Use hash(frameworkName, attemptIndex, historyType) as primary key to solve this problem.

#Closed

yqwang-ms · 2020-08-11T07:50:54Z

For rescale, historyType will be rescale. So the md5 hash is totally different from retry snapshot. We should calculate its hash using a different logic (do not need to be the same with retry snapshot).

But they are in the same table. The way to calculate primary key should be the same? #Closed

hzy46 · 2020-08-11T07:56:48Z

For rescale, historyType will be rescale. So the md5 hash is totally different from retry snapshot. We should calculate its hash using a different logic (do not need to be the same with retry snapshot).

But they are in the same table. The way to calculate primary key should be the same?

I think the logic could be different as long as we have a clear definition. Here the uid only indicates an identical descriptor for snapshots in OpenPAI. #Closed

yqwang-ms · 2020-08-11T07:58:01Z

src/fluentd/src/fluent-plugin-pgjson/lib/fluent/plugin/out_pgjson.rb

+              # use frameworkName + attemptIndex + historyType to generate a uid
+              uid = Digest::MD5.hexdigest "#{frameworkName}+#{attemptIndex}+#{historyType}"
+              thread[:conn].put_copy_data "#{time}\x01#{time}\x01#{uid}\x01#{frameworkName}\x01#{attemptIndex}\x01#{historyType}\x01#{snapshot}\n"
            elsif kind == "Pod"


Add a comment: if there is a duplicate, what is the behavior here?
Crash, retry, ignore or log? #Closed

It will raise an error and log the error. #Closed
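The uid scheme and the duplicate behavior described here can be mimicked in JavaScript. This is a sketch, not the fluentd plugin: the md5 input mirrors the Ruby line above, while the in-memory `Map` standing in for a table keyed by uid, the `insertHistory` helper, and the `'retry'` value are assumptions.

```javascript
const crypto = require('crypto');

// Same input string as the Ruby plugin: "frameworkName+attemptIndex+historyType".
function historyUid(frameworkName, attemptIndex, historyType) {
  return crypto
    .createHash('md5')
    .update(`${frameworkName}+${attemptIndex}+${historyType}`)
    .digest('hex');
}

// Hypothetical in-memory stand-in for a table whose primary key is the uid.
const table = new Map();
function insertHistory(row) {
  const uid = historyUid(row.frameworkName, row.attemptIndex, row.historyType);
  if (table.has(uid)) {
    throw new Error(`duplicate key: ${uid}`);
  }
  table.set(uid, row);
  return uid;
}

const row = { frameworkName: 'job-a', attemptIndex: 0, historyType: 'retry' };
insertHistory(row);
try {
  insertHistory(row); // same (frameworkName, attemptIndex, historyType)
} catch (err) {
  console.error(err.message); // the duplicate is reported, not overwritten
}
console.log(table.size); // 1
```

A different historyType for the same framework and attempt yields a different uid, so retry snapshots and any future snapshot kinds would not collide.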

yqwang-ms · 2020-08-11T08:01:02Z

For rescale, historyType will be rescale. So the md5 hash is totally different from retry snapshot. We should calculate its hash using a different logic (do not need to be the same with retry snapshot).

But they are in the same table. The way to calculate primary key should be the same?

I think the logic could be different as long as we have a clear definition. Here the uid only indicates an identical descriptor for snapshots in OpenPAI.

OK #Closed

yqwang-ms · 2020-08-11T08:46:14Z

src/database-controller/src/write-merger/handler.js

+        requestSynced: false,
+      }),
+      { where: { name: snapshot.getName() } },
+    );


If databaseModel.Framework.update failed, will silentSynchronizeRequest still be called?

Make sure silentSynchronizeRequest is called only when DB success #Closed

This is guaranteed. await will throw any error that happened. #Closed

yqwang-ms · 2020-08-11T08:48:57Z

src/database-controller/src/common/framework.js

+    synchronizeRequest(snapshot, addOns).catch(logError);
+  } catch (err) {
+    logError(err);
+  }


Why do we need 2 catches? #Closed

One is for normal error. The other is for promise-rejected error. They are different. #Closed
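The difference between the two handlers can be shown with a small self-contained sketch. The functions here are hypothetical stand-ins for the ones in framework.js: `synchronizeRequest` is reduced to an async function that always rejects, and `logError` collects messages instead of writing to a logger.

```javascript
const logged = [];
const logError = (err) => logged.push(err.message);

// Stand-in for synchronizeRequest: an async function whose promise rejects.
async function synchronizeRequest() {
  throw new Error('rejected later');
}

function handler(snapshot) {
  try {
    if (!snapshot) {
      throw new Error('thrown now'); // synchronous error, caught by try/catch
    }
    // Not awaited: a rejection surfaces after this function has returned,
    // so only the .catch() attached to the promise can observe it.
    synchronizeRequest().catch(logError);
  } catch (err) {
    logError(err);
  }
}

handler(null); // handled by the try/catch block
handler({});   // handled by .catch() once the promise rejects
setImmediate(() => console.log(logged)); // [ 'thrown now', 'rejected later' ]
```

Without the `.catch()`, the second call would produce an unhandled rejection, because the surrounding `try` block has already exited by the time the promise rejects.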

yqwang-ms · 2020-08-11T08:51:35Z

src/database-controller/src/write-merger/handler.js

+        if (config.retainModeEnabled) {
+          // If database doesn't have the corresponding framework,
+          // and retain mode is enabled
+          // tolerate the error and create framework in database.


"tolerate the error and create framework in database" ->
"retain the framework, i.e. do not delete it" #Closed

Done. #Closed

yqwang-ms

Sign off, pls test log flush

hzy46 and others added 4 commits July 22, 2020 22:20

fix

3039d89

fix

eec3e80

trigger

429dc4e

scarlett2018 mentioned this pull request Jul 23, 2020

2020 July ~ Aug Release #4642

Closed

39 tasks

hzy46 added 14 commits July 23, 2020 14:57

fix

dbe94ac

fix

c2adce5

fix

642c868

fix

e46f0ac

fix

c6900f6

fix

e4e7e72

Merge branch 'master' of github.com:microsoft/pai into zhiyuhe/db_int…

b2ec6f5

…egration

fix rbac

4a64f65

fix

3850a10

change rest-server & initializer

a3e2cab

fix

d2ed448

fix

197ca82

fix

d235d4e

fix

1cb37d0

yiyione approved these changes Aug 3, 2020

src/database-controller/build/build-pre.sh (outdated)

hzy46 added 2 commits August 3, 2020 14:53

change license comment

54d219f

fix lint

345d3af

abuccts reviewed Aug 5, 2020

yqwang-ms reviewed Aug 6, 2020

src/database-controller/test-case.md

yqwang-ms reviewed Aug 6, 2020

src/database-controller/src/write-merger/handler.js

yqwang-ms reviewed Aug 6, 2020

src/database-controller/src/write-merger/handler.js (outdated)

yqwang-ms reviewed Aug 6, 2020

src/rest-server/src/models/v2/job-attempt.js

yqwang-ms reviewed Aug 6, 2020

hzy46 added 8 commits August 7, 2020 15:33

fix

979fe5d

fix

94d8ae9

resolve conflicts

fd13d0e

fluentd fix

2dce698

fix

e459a88

fix

a20d16f

fix

77352f4

fix

7355131

yqwang-ms reviewed Aug 11, 2020

fix

da47640

yqwang-ms reviewed Aug 11, 2020

hzy46 added 4 commits August 11, 2020 16:52

fix

bcd1b3c

fix

ae73bfe

fix

e203fd4

fix

7c4878d

yqwang-ms approved these changes Aug 12, 2020

hzy46 merged commit 5fc32be into master Aug 12, 2020

hzy46 deleted the zhiyuhe/db_integration branch September 3, 2020 07:36
