[display event] add event watcher in database controller by hzy46 · Pull Request #4939 · microsoft/pai

hzy46 · 2020-09-29T06:48:46Z

Database size control strategy:

When the event watcher starts, it checks the disk usage. If the usage is above 80%, the event watcher stops and exits with a non-zero code.
The same check then runs every 60s. If the usage is above 80%, the event watcher stops and exits with a non-zero code.
Both the 60s interval and the 80% threshold are configurable.
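
The strategy above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: every name here (`usedPercentOf`, `assertUsageWithinLimit`, `startDiskGuard`, `readUsagePercent`) is hypothetical, and the disk reading is injected as a function so the sketch does not assume any particular disk library.

```javascript
// Hypothetical sketch of the "check at start, then every 60s" strategy.
// The 80 and 60 defaults come from the PR description; all names are assumptions.
const DEFAULT_MAX_DISK_USAGE_PERCENT = 80;
const DEFAULT_CHECK_INTERVAL_SECONDS = 60;

// Percentage of the volume in use, given total and free bytes.
function usedPercentOf(totalBytes, freeBytes) {
  if (totalBytes <= 0) {
    throw new RangeError('totalBytes must be positive');
  }
  return ((totalBytes - freeBytes) * 100) / totalBytes;
}

// Throws when usage is strictly above the limit ("> 80%" in the description).
function assertUsageWithinLimit(usagePercent, maxPercent = DEFAULT_MAX_DISK_USAGE_PERCENT) {
  if (usagePercent > maxPercent) {
    throw new Error(`disk usage ${usagePercent.toFixed(1)}% exceeds limit ${maxPercent}%`);
  }
}

// Runs one check immediately, then one per interval.
// readUsagePercent: injected async function returning the current usage in percent.
// onUnhealthy: what to do on failure; by default exit with a non-zero code.
// A failure to read the usage is also treated as unhealthy in this sketch.
async function startDiskGuard(readUsagePercent, {
  maxPercent = DEFAULT_MAX_DISK_USAGE_PERCENT,
  intervalSeconds = DEFAULT_CHECK_INTERVAL_SECONDS,
  onUnhealthy = () => process.exit(1),
} = {}) {
  const check = async () => {
    try {
      assertUsageWithinLimit(await readUsagePercent(), maxPercent);
      return true;
    } catch (err) {
      console.error(err.message);
      onUnhealthy(err);
      return false;
    }
  };
  if (!(await check())) {
    return null; // unhealthy at startup: no periodic check is scheduled
  }
  return setInterval(check, intervalSeconds * 1000);
}
```

Injecting `readUsagePercent` and `onUnhealthy` keeps the threshold logic testable without touching a real disk or terminating the process.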

Problem found:

If a job generates too many events, it will affect all other jobs.

coveralls · 2020-09-29T06:52:37Z

Coverage remained the same at 34.383% when pulling 87d4225 on zhiyuhe/add_event_watcher into 9755553 on master.

yqwang-ms · 2020-09-29T07:00:50Z

src/database-controller/src/watcher/cluster-event/index.js

+  }
+};
+
+async function assertDiskUsageHealthy() {


It would be better to limit the quota both per job and globally, so that this does not impact the critical path.

Recorded in #4953. We can solve this problem in the future.
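
To make the reviewer's suggestion concrete, here is a minimal sketch of what a per-job plus global quota could look like. It is purely illustrative and not part of this PR or of #4953. The class `EventQuota`, its options, and the in-memory counters are all assumptions; a real implementation would more likely enforce the limits in the database.

```javascript
// Hypothetical sketch: per-job and global event quotas kept in memory.
class EventQuota {
  constructor({ maxEventsPerJob, maxEventsGlobal }) {
    this.maxEventsPerJob = maxEventsPerJob;
    this.maxEventsGlobal = maxEventsGlobal;
    this.perJobCounts = new Map(); // job name -> number of accepted events
    this.globalCount = 0;
  }

  // Returns true and records the event if both quotas allow it.
  // Returns false otherwise, in which case the caller would drop the event.
  tryAccept(jobName) {
    const jobCount = this.perJobCounts.get(jobName) || 0;
    if (jobCount >= this.maxEventsPerJob || this.globalCount >= this.maxEventsGlobal) {
      return false;
    }
    this.perJobCounts.set(jobName, jobCount + 1);
    this.globalCount += 1;
    return true;
  }
}
```

With a per-job limit, a single noisy job exhausts only its own quota, so events of other jobs are still accepted until the global limit is reached.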

yqwang-ms · 2020-09-29T07:19:19Z

src/database-controller/config/database-controller.yaml

+# Max connection number to database in cluster event watcher.
+cluster-event-max-db-connection: 40
+# Max disk usage in internal storage for cluster event watcher
+cluster-event-watcher-max-disk-usage-percent: 80


Should there also be a limit for history? Why not move non-critical things to another DB server?

Recorded in #4954. We can solve this problem in the future.
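
As an illustration of how the two settings from the diff above might be consumed, the following sketch validates them and falls back to the values shown in the diff (40 and 80). The function `loadClusterEventConfig` and the idea of passing the settings as a plain object are assumptions; the PR may wire the configuration differently.

```javascript
// Hypothetical loader for the two cluster event watcher settings.
// Keys and default values are taken from the diff; everything else is assumed.
const CLUSTER_EVENT_DEFAULTS = {
  'cluster-event-max-db-connection': 40,
  'cluster-event-watcher-max-disk-usage-percent': 80,
};

function loadClusterEventConfig(raw = {}) {
  const merged = { ...CLUSTER_EVENT_DEFAULTS, ...raw };
  const maxDbConnection = Number(merged['cluster-event-max-db-connection']);
  const maxDiskUsagePercent = Number(merged['cluster-event-watcher-max-disk-usage-percent']);

  // The connection count must be a positive integer.
  if (!Number.isInteger(maxDbConnection) || maxDbConnection < 1) {
    throw new RangeError('cluster-event-max-db-connection must be a positive integer');
  }
  // The threshold must be a percentage in the range (0, 100].
  if (!Number.isFinite(maxDiskUsagePercent) || maxDiskUsagePercent <= 0 || maxDiskUsagePercent > 100) {
    throw new RangeError('cluster-event-watcher-max-disk-usage-percent must be in (0, 100]');
  }
  return { maxDbConnection, maxDiskUsagePercent };
}
```

Rejecting out-of-range values at startup avoids a watcher that silently never trips (for example a threshold above 100).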

hzy46 · 2020-10-23T04:02:41Z

Event Watcher Test Cases:

Test: the event watcher works properly

Submit a job that will always stay in waiting status (e.g. one that requests a lot of resources). After a few minutes, check whether a "failed scheduling" event appears on the job event page.

Test: the event watcher can handle a large number of events.

Submit a job with 2000+ tasks. After a few minutes, check that the event page works properly.

Test: the event watcher exits if too much disk space is used.

Enter the internal storage pod to see the current usage:

kubectl exec -it "$(kubectl get po | grep internal-storage | awk '{print $1}')" -- bash
df -h

Note the usage of the loop device /paiInternal/storage.

Create a big file under /paiInternal/storage so that its usage rises above 80%.

After a few minutes, confirm that: 1. a NodeFilesystemUsage alert is shown on webportal; 2. the event watcher exits automatically.

Remove the big file. After a few minutes, confirm that: 1. the NodeFilesystemUsage alert is gone; 2. the event watcher works properly again, and events of new jobs are visible on webportal.
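
For the "create a big file" step above, the required file size can be estimated from the total and used bytes of the volume. The helper below is a hypothetical convenience, not part of the PR; it ignores filesystem block granularity and reserved blocks, so in practice it is worth adding some margin and checking that the result still fits in the free space.

```javascript
// Hypothetical helper: smallest whole number of extra bytes needed so that
// usage becomes strictly greater than targetPercent of totalBytes.
function extraBytesToExceed(totalBytes, usedBytes, targetPercent = 80) {
  const thresholdBytes = (totalBytes * targetPercent) / 100;
  const needed = Math.floor(thresholdBytes) + 1 - usedBytes;
  return Math.max(0, needed); // 0 when usage is already above the threshold
}
```

For example, on a volume of 1000 bytes with 300 bytes used, `extraBytesToExceed(1000, 300)` returns 501, which brings usage to 801 bytes, just above the 80% threshold.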

hzy46 added 2 commits September 29, 2020 14:44

fix

8693cf4

fix fix fix fix fix fix fix fix Add index on task uid instead of framework name and attempt index (#4938) fix fix fix fix fix fix fix

fix

e74f0cf

hzy46 requested a review from yqwang-ms September 29, 2020 06:48

fix

755047b

yqwang-ms reviewed Sep 29, 2020

src/database-controller/src/watcher/cluster-event/index.js

yqwang-ms reviewed Sep 29, 2020

fix

87d4225

hzy46 mentioned this pull request Oct 12, 2020

2020 Sept ~ Oct release plan #4898

Closed

31 tasks

yqwang-ms approved these changes Oct 13, 2020

hzy46 merged commit 16f55e5 into master Oct 13, 2020

hzy46 deleted the zhiyuhe/add_event_watcher branch November 3, 2020 09:13

hzy46 merged 4 commits into master from zhiyuhe/add_event_watcher
