New RestServer Architecture: RestServer -> DB -> ApiServer; P0 Items#4761
New RestServer Architecture: RestServer -> DB -> ApiServer; P0 Items#4761
Conversation
* fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * save * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * save * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix
| value: "{{ cluster_cfg['database-controller']['k8s-connection-timeout-second'] }}" | ||
| - name: WRITE_MERGER_CONNECTION_TIMEOUT_SECOND | ||
| value: "{{ cluster_cfg['database-controller']['write-merger-connection-timeout-second'] }}" | ||
| - name: RECOVERY_MODE_ENABLED |
There was a problem hiding this comment.
Add doc for all your flags? #Closed
| - name: RECOVERY_MODE_ENABLED | ||
| {% if cluster_cfg['database-controller']['recovery-mode'] %} | ||
| value: "true" | ||
| {% else %} |
There was a problem hiding this comment.
If recovery-mode=false, but you found table is empty, will you still recover from ApiServer? #Closed
There was a problem hiding this comment.
There was a problem hiding this comment.
Done. The recovery mode is now initializer. #Closed
| - name: K8S_CONNECTION_TIMEOUT_SECOND | ||
| value: "{{ cluster_cfg['database-controller']['k8s-connection-timeout-second'] }}" | ||
| - name: WRITE_MERGER_CONNECTION_TIMEOUT_SECOND | ||
| value: "{{ cluster_cfg['database-controller']['write-merger-connection-timeout-second'] }}" |
There was a problem hiding this comment.
There was a problem hiding this comment.
And do you have a test case for the [Normal Mode], i.e. you will delete jobs which in apiserver but not in DB
In reply to: 466121708 [](ancestors = 466121708)
|
Disable FC GC: frameworkCompletedRetainSec: 2147483600 #Closed |
| {}, | ||
| snapshot.getAllUpdate(), | ||
| addOns.getUpdate(), | ||
| ); |
There was a problem hiding this comment.
Explicitly init requestSynced=false, Deleted=false? #Closed
There was a problem hiding this comment.
Done.
const record = _.assign(
{requestSynced: false, apiServerDeleted: false},
snapshot.getAllUpdate(),
addOns.getUpdate(),
); #Closed
| 'metadata.annotations', | ||
| 'spec', | ||
| ]); | ||
| frameworkRequest.spec.executionType = `${executionType.charAt(0)}${executionType.slice(1).toLowerCase()}`; |
There was a problem hiding this comment.
No lock during this RMW, so update executionType may cause other's update during this period lost.
Can you move this RMW to write merge or let write merger support things like k8s patch? #Closed
There was a problem hiding this comment.
Currently we only modify one field executionType. So the lost-update problem won't happen.
I suggest to add a patch api in write merger in future to handle multiple-field updates. #Closed
There was a problem hiding this comment.
Better to solve it in current PR, since we will also update tasknumber, completionpolicy etc.
If it is too hard, pls explain the reason, comment it and track in issue #Closed
There was a problem hiding this comment.
It is not hard. I will implement a patch interface using json merge patch. #Closed
There was a problem hiding this comment.
Great. Given we only have a few update scenarios, your implementation does not need too complex or too general #Closed
| attributes: ['snapshot'], | ||
| where: {frameworkName: encodedFrameworkName}, | ||
| order: [['attemptIndex', 'ASC']], | ||
| } |
There was a problem hiding this comment.
Do we need to dedup here?
Such as use the latest snapshot for the same framework and attemptIndex #Closed
There was a problem hiding this comment.
Use hash(frameworkName, attemptIndex, historyType) as primary key to solve this problem. #Closed
| const historyFramework = await databaseModel.FrameworkHistory.findOne({ | ||
| attributes: ['snapshot'], | ||
| where: {frameworkName: encodedFrameworkName, attemptIndex: jobAttemptIndex}, | ||
| }); |
There was a problem hiding this comment.
use the latest snapshot for the same framework and attemptIndex?
To judge if it is the latest snapshot, could you also store the snapshot last write time in (history) DB #Closed
There was a problem hiding this comment.
There is insertedAt and updatedAt field in history db.
What do you mean by use the latest snapshot for the same framework and attemptIndex? #Closed
There was a problem hiding this comment.
In the history DB, for the same framework name and attemptIndex, there may be many snapshots (some in running state, some in completed state), when show the retry history, we should give user the latest snapshot which is generally the completed state snapshot. #Closed
There was a problem hiding this comment.
Seems use updatedAt or snapshot last write time is not right. We should use the snapshot generation time instead of time write to db. If so, seems we should use framework.status.transitionTime #Closed
There was a problem hiding this comment.
For the same framework name and attemptIndex, there is only one record in the history db.
The history is recorded when fc log outputs something like framework xxx is retried. #Closed
There was a problem hiding this comment.
Will the history table store more snapshots in future? Such as rescale snapshot, or it is just retry snapshot?
BTW, even if it is just retry snapshot table, framework xxx is retried. may be produced by FC more than 1 time in some coner cases, will you override previous one, to make sure there is always only one?
#Closed
There was a problem hiding this comment.
Currently only retry snapshots are available. For your corner case, there is no overwrite. Two records will be all recorded. #Closed
There was a problem hiding this comment.
So you need to get the last one #Closed
There was a problem hiding this comment.
Use hash(frameworkName, attemptIndex, historyType) as primary key to solve this problem.
#Closed
But they are in the same table. The way to calculate primary key should be the same? #Closed |
I think the logic could be different as long as we have a clear definition. Here the uid only indicates an identical descriptor for snapshots in OpenPAI. #Closed |
| # use frameworkName + attemptIndex + historyType to generate a uid | ||
| uid = Digest::MD5.hexdigest "#{frameworkName}+#{attemptIndex}+#{historyType}" | ||
| thread[:conn].put_copy_data "#{time}\x01#{time}\x01#{uid}\x01#{frameworkName}\x01#{attemptIndex}\x01#{historyType}\x01#{snapshot}\n" | ||
| elsif kind == "Pod" |
There was a problem hiding this comment.
Add comment that if duplicate, what is the behaiour here?
Crash, retry, ignore or log #Closed
There was a problem hiding this comment.
It will raise an error and log the error. #Closed
OK #Closed |
| requestSynced: false, | ||
| }), | ||
| { where: { name: snapshot.getName() } }, | ||
| ); |
There was a problem hiding this comment.
If databaseModel.Framework.update failed, will silentSynchronizeRequest still be called?
Make sure silentSynchronizeRequest is called only when DB success #Closed
There was a problem hiding this comment.
This is guaranteed. await will throw any error that happened. #Closed
| synchronizeRequest(snapshot, addOns).catch(logError); | ||
| } catch (err) { | ||
| logError(err); | ||
| } |
There was a problem hiding this comment.
One is for normal error. The other is for promise-rejected error. They are different. #Closed
| if (config.retainModeEnabled) { | ||
| // If database doesn't have the corresponding framework, | ||
| // and retain mode is enabled | ||
| // tolerate the error and create framework in database. |
There was a problem hiding this comment.
tolerate the error and create framework in database. [](start = 13, length = 52)
"tolerate the error and create framework in database" ->
"retain the framework, i.e. do not delete it" #Closed
yqwang-ms
left a comment
There was a problem hiding this comment.
Sign off, pls test log flush

Review
Database Controller Test Cases
End-to-end Test
Test Jobs
A job with simple output
Case: job command
echo successExpect: Succeed, output
success.Stop a job
Case: job command
sleep 1h; then stop this jobExpect: The job can be stopped.
Job with retries
Case: job command:
exit 1; set max retry to be 10Expect: The job retry history can be viewed.
Job with docker secret
Case: Submit a job with docker auth info
Expect: The job can run successfully.
Job with priority class
Case: Submit a job with priority class
Expect: The job priority class is correct.
Job with secret
Case: Submit a job with secret info
Expect: The secret info is successfully written into job.
Bulk job
Case: Submit a job with 1000+ instances.
Expect: The job can run successfully.
Bulk job with retries
Case: Submit a job with 1000+ instances, and 50+ retries.
Expect: The job can run successfully. The job retry history can be viewed.
Test in Certain Case
Database failure
Case: submit a job; shutdown the database; try to submit another job; start the database
Expect: the first job is not affected; the second job cannot be submitted.
Database controller failure
Case: submit a job; shutdown the database controller; try to submit another job; start the database
Expect: the first job is not affected; the second job cannot be submitted.
Framework controller failure
Case: shutdown the framework controller; try to submit a job; start the framework controller
Expect: the job can be submitted, but in
WAITINGstatus. After the framework controller is started, its state will turn toRUNNING.Path Test
Rest-server checks conflicts
Case: Submit a job twice
Expect: The second job submission is not allowed.
Case: Stop a non-existed job
Expect: The operation is not allowed.
(Write merger) Add/Update FR (If not equal, override and mark !requestSynced, else no-op)
Environment: Write-merger doesn't forward request; Shutdown database poller;
Case: Fake two same framework requests to write-merger.
Expect: The second request will be ignored.
(Write merger) If retain mode is off, delete frameworks which are not submitted through db.
Environment: Turn on retain mode
Case: Create a framework directly in API server;
Expect: The framework is not deleted.
Environment: Turn off retain mode
Case: Create a framework directly in API server;
Expect: The framework is deleted.
(Write merger) If not equal, override and mark as !requestSynced
Environment: Shutdown database poller
Case: Submit a job; Then change its requestGeneration in API server.
Expect: The framework is marked as
requestSynced=false.(Database poller) If API server 404 error, mock a delete Framework
Case: Submit A job; Then shutdown watcher after the job starts; Wait until the job succeeded; Stop poller; Start watcher; Stop watcher; Delete this framework manually; Start Poller
Expect:
After watcher is restarted, the job is marked as
state=CompletedandrequestSynced=true;After poller is restarted, it can mock a delete framework event to write merger.
Finally, the job will be marked as
apiServerDeleted=true.Upgrade Test
Upgrade from
v1.0.yDrop the database; Deploy a
v1.0.ybed, submit some jobs (must include: one running job, one completed job with retry history, and one completed job without retry history). Then upgrade it. Make sure the job information is correct, and running jobs are not affected.Upgrade from
v1.1.yDrop the database; Deploy a
v1.1.ybed, submit some jobs (must include: one running job, one completed job with retry history, and one completed job without retry history). Then upgrade it. Make sure the job information is correct, and running jobs are not affected.Stress Test