Skip to content
This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

New RestServer Architecture: RestServer -> DB -> ApiServer; P0 Items #4761

Merged
merged 34 commits into from
Aug 12, 2020

Conversation

hzy46
Copy link
Contributor

@hzy46 hzy46 commented Jul 23, 2020

Review

  • webportal
  • rest-server
  • database controller

Database Controller Test Cases

End-to-end Test

Test Jobs

  • A job with simple output

    Case: job command echo success

    Expect: Succeed, output success.

  • Stop a job

    Case: job command sleep 1h; then stop this job

    Expect: The job can be stopped.

  • Job with retries

    Case: job command: exit 1; set max retry to be 10

    Expect: The job retry history can be viewed.

  • Job with docker secret

    Case: Submit a job with docker auth info

    Expect: The job can run successfully.

  • Job with priority class

    Case: Submit a job with priority class

    Expect: The job priority class is correct.

  • Job with secret

    Case: Submit a job with secret info

    Expect: The secret info is successfully written into job.

  • Bulk job

    Case: Submit a job with 1000+ instances.

    Expect: The job can run successfully.

  • Bulk job with retries

    Case: Submit a job with 1000+ instances, and 50+ retries.

    Expect: The job can run successfully. The job retry history can be viewed.

Test in Certain Case

  • Database failure

    Case: submit a job; shutdown the database; try to submit another job; start the database

    Expect: the first job is not affected; the second job cannot be submitted.

  • Database controller failure

    Case: submit a job; shutdown the database controller; try to submit another job; start the database

    Expect: the first job is not affected; the second job cannot be submitted.

  • Framework controller failure

    Case: shutdown the framework controller; try to submit a job; start the framework controller

    Expect: the job can be submitted, but in WAITING status. After the framework controller is started, its state will turn to RUNNING.

Path Test

Rest-server checks conflicts

  • Case: Submit a job twice

    Expect: The second job submission is not allowed.

  • Case: Stop a non-existed job

    Expect: The operation is not allowed.

(Write merger) Add/Update FR (If not equal, override and mark !requestSynced, else no-op)

  • Environment: Write-merger doesn't forward request; Shutdown database poller;

    Case: Fake two same framework requests to write-merger.

    Expect: The second request will be ignored.

(Write merger) If retain mode is off, delete frameworks which are not submitted through db.

  • Environment: Turn on retain mode

    Case: Create a framework directly in API server;

    Expect: The framework is not deleted.

  • Environment: Turn off retain mode

    Case: Create a framework directly in API server;

    Expect: The framework is deleted.

(Write merger) If not equal, override and mark as !requestSynced

  • Environment: Shutdown database poller

    Case: Submit a job; Then change its requestGeneration in API server.

    Expect: The framework is marked as requestSynced=false.

(Database poller) If API server 404 error, mock a delete Framework

  • Case: Submit A job; Then shutdown watcher after the job starts; Wait until the job succeeded; Stop poller; Start watcher; Stop watcher; Delete this framework manually; Start Poller

    Expect:

    After watcher is restarted, the job is marked as state=Completed and requestSynced=true;
    After poller is restarted, it can mock a delete framework event to write merger.
    Finally, the job will be marked as apiServerDeleted=true.

Upgrade Test

  • Upgrade from v1.0.y

    Drop the database; Deploy a v1.0.y bed, submit some jobs (must include: one running job, one completed job with retry history, and one completed job without retry history). Then upgrade it. Make sure the job information is correct, and running jobs are not affected.

  • Upgrade from v1.1.y

    Drop the database; Deploy a v1.1.y bed, submit some jobs (must include: one running job, one completed job with retry history, and one completed job without retry history). Then upgrade it. Make sure the job information is correct, and running jobs are not affected.

Stress Test

  • Submit 100000+ jobs in 1 hour. It should be handled properly.

hzy46 and others added 4 commits July 22, 2020 22:20
* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* save

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* save

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix
@coveralls
Copy link

coveralls commented Jul 23, 2020

Coverage Status

Coverage increased (+0.2%) to 34.564% when pulling 7c4878d on zhiyuhe/db_integration into fbba438 on master.

@scarlett2018 scarlett2018 mentioned this pull request Jul 23, 2020
39 tasks
src/database-controller/build/build-pre.sh Outdated Show resolved Hide resolved
.github/workflows/continuous-integration.yml Outdated Show resolved Hide resolved
.github/workflows/continuous-integration.yml Outdated Show resolved Hide resolved
src/database-controller/sdk/package.json Outdated Show resolved Hide resolved
src/database-controller/src/.dockerignore Outdated Show resolved Hide resolved
src/rest-server/package.json Outdated Show resolved Hide resolved
@yqwang-ms
Copy link
Member

yqwang-ms commented Aug 6, 2020

Move our Architecture doc to here?

#4651 #Closed


Refers to: src/database-controller/README.md:9 in 345d3af. [](commit_id = 345d3af, deletion_comment = False)

value: "{{ cluster_cfg['database-controller']['k8s-connection-timeout-second'] }}"
- name: WRITE_MERGER_CONNECTION_TIMEOUT_SECOND
value: "{{ cluster_cfg['database-controller']['write-merger-connection-timeout-second'] }}"
- name: RECOVERY_MODE_ENABLED
Copy link
Member

@yqwang-ms yqwang-ms Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add doc for all your flags? #Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. #Closed

- name: RECOVERY_MODE_ENABLED
{% if cluster_cfg['database-controller']['recovery-mode'] %}
value: "true"
{% else %}
Copy link
Member

@yqwang-ms yqwang-ms Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If recovery-mode=false, but you found table is empty, will you still recover from ApiServer? #Closed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc the behavior


In reply to: 466121240 [](ancestors = 466121240)

Copy link
Contributor Author

@hzy46 hzy46 Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The recovery mode is now initializer. #Closed

- name: K8S_CONNECTION_TIMEOUT_SECOND
value: "{{ cluster_cfg['database-controller']['k8s-connection-timeout-second'] }}"
- name: WRITE_MERGER_CONNECTION_TIMEOUT_SECOND
value: "{{ cluster_cfg['database-controller']['write-merger-connection-timeout-second'] }}"
Copy link
Member

@yqwang-ms yqwang-ms Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you also implement the [ApiServerRetain Mode]:

Do not delete exceeded Frameworks in ApiServer
DB Service [ApiServerRetain Mode] ->
As the exceeded Frameworks are not visible to user, but still occupy resources, in [Default: Normal Mode], we should auto delete them.

image

#Closed

Copy link
Member

@yqwang-ms yqwang-ms Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And do you have a test case for the [Normal Mode], i.e. you will delete jobs which in apiserver but not in DB


In reply to: 466121708 [](ancestors = 466121708)

Copy link
Contributor Author

@hzy46 hzy46 Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. #Closed

@yqwang-ms
Copy link
Member

yqwang-ms commented Aug 6, 2020

Disable FC GC: frameworkCompletedRetainSec: 2147483600 #Closed

{},
snapshot.getAllUpdate(),
addOns.getUpdate(),
);
Copy link
Member

@yqwang-ms yqwang-ms Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly init requestSynced=false, Deleted=false? #Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

      const record = _.assign(
        {requestSynced: false, apiServerDeleted: false},
        snapshot.getAllUpdate(),
        addOns.getUpdate(),
      ); #Closed

'metadata.annotations',
'spec',
]);
frameworkRequest.spec.executionType = `${executionType.charAt(0)}${executionType.slice(1).toLowerCase()}`;
Copy link
Member

@yqwang-ms yqwang-ms Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No lock during this RMW, so update executionType may cause other's update during this period lost.

Can you move this RMW to write merge or let write merger support things like k8s patch? #Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we only modify one field executionType. So the lost-update problem won't happen.

I suggest to add a patch api in write merger in future to handle multiple-field updates. #Closed

Copy link
Member

@yqwang-ms yqwang-ms Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to solve it in current PR, since we will also update tasknumber, completionpolicy etc.
If it is too hard, pls explain the reason, comment it and track in issue #Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not hard. I will implement a patch interface using json merge patch. #Closed

Copy link
Member

@yqwang-ms yqwang-ms Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Given we only have a few update scenarios, your implementation does not need too complex or too general #Closed

attributes: ['snapshot'],
where: {frameworkName: encodedFrameworkName},
order: [['attemptIndex', 'ASC']],
}
Copy link
Member

@yqwang-ms yqwang-ms Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to dedup here?
Such as use the latest snapshot for the same framework and attemptIndex #Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use hash(frameworkName, attemptIndex, historyType) as primary key to solve this problem. #Closed

const historyFramework = await databaseModel.FrameworkHistory.findOne({
attributes: ['snapshot'],
where: {frameworkName: encodedFrameworkName, attemptIndex: jobAttemptIndex},
});
Copy link
Member

@yqwang-ms yqwang-ms Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the latest snapshot for the same framework and attemptIndex?

To judge if it is the latest snapshot, could you also store the snapshot last write time in (history) DB #Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is insertedAt and updatedAt field in history db.

What do you mean by use the latest snapshot for the same framework and attemptIndex? #Closed

Copy link
Member

@yqwang-ms yqwang-ms Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the history DB, for the same framework name and attemptIndex, there may be many snapshots (some in running state, some in completed state), when show the retry history, we should give user the latest snapshot which is generally the completed state snapshot. #Closed

Copy link
Member

@yqwang-ms yqwang-ms Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems use updatedAt or snapshot last write time is not right. We should use the snapshot generation time instead of time write to db. If so, seems we should use framework.status.transitionTime #Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the same framework name and attemptIndex, there is only one record in the history db.

The history is recorded when fc log outputs something like framework xxx is retried. #Closed

Copy link
Member

@yqwang-ms yqwang-ms Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the history table store more snapshots in future? Such as rescale snapshot, or it is just retry snapshot?
BTW, even if it is just retry snapshot table, framework xxx is retried. may be produced by FC more than 1 time in some coner cases, will you override previous one, to make sure there is always only one?

#Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently only retry snapshots are available. For your corner case, there is no overwrite. Two records will be all recorded. #Closed

Copy link
Member

@yqwang-ms yqwang-ms Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you need to get the last one #Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use hash(frameworkName, attemptIndex, historyType) as primary key to solve this problem.

#Closed

@yqwang-ms
Copy link
Member

yqwang-ms commented Aug 11, 2020

For rescale, historyType will be rescale. So the md5 hash is totally different from retry snapshot. We should caculate its hash using a different logic (do not need to be the same with retry snapshot).

But they are in the same table. The way to calculate primary key should be the same? #Closed

@hzy46
Copy link
Contributor Author

hzy46 commented Aug 11, 2020

For rescale, historyType will be rescale. So the md5 hash is totally different from retry snapshot. We should caculate its hash using a different logic (do not need to be the same with retry snapshot).

But they are in the same table. The way to calculate primary key should be the same?

I think the logic could be different as long as we have a clear definition. Here the uid only indicates an identical descriptor for snapshots in OpenPAI. #Closed

thread[:conn].put_copy_data "#{time}\x01#{frameworkName}\x01#{attemptIndex}\x01#{historyType}\x01#{snapshot}\n"
# use frameworkName + attemptIndex + historyType to generate a uid
uid = Digest::MD5.hexdigest "#{frameworkName}+#{attemptIndex}+#{historyType}"
thread[:conn].put_copy_data "#{time}\x01#{time}\x01#{uid}\x01#{frameworkName}\x01#{attemptIndex}\x01#{historyType}\x01#{snapshot}\n"
elsif kind == "Pod"
Copy link
Member

@yqwang-ms yqwang-ms Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment that if duplicate, what is the behaiour here?
Crash, retry, ignore or log #Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will raise an error and log the error. #Closed

@yqwang-ms
Copy link
Member

yqwang-ms commented Aug 11, 2020

For rescale, historyType will be rescale. So the md5 hash is totally different from retry snapshot. We should caculate its hash using a different logic (do not need to be the same with retry snapshot).

But they are in the same table. The way to calculate primary key should be the same?

I think the logic could be different as long as we have a clear definition. Here the uid only indicates an identical descriptor for snapshots in OpenPAI.

OK #Closed

requestSynced: false,
}),
{ where: { name: snapshot.getName() } },
);
Copy link
Member

@yqwang-ms yqwang-ms Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If databaseModel.Framework.update failed, will silentSynchronizeRequest still be called?

Make sure silentSynchronizeRequest is called only when DB success #Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is guaranteed. await will throw any error that happened. #Closed

synchronizeRequest(snapshot, addOns).catch(logError);
} catch (err) {
logError(err);
}
Copy link
Member

@yqwang-ms yqwang-ms Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why need 2 catch? #Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One is for normal error. The other is for promise-rejected error. They are different. #Closed

if (config.retainModeEnabled) {
// If database doesn't have the corresponding framework,
// and retain mode is enabled
// tolerate the error and create framework in database.
Copy link
Member

@yqwang-ms yqwang-ms Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tolerate the error and create framework in database. [](start = 13, length = 52)

"tolerate the error and create framework in database" ->
"retain the framework, i.e. do not delete it" #Closed

Copy link
Contributor Author

@hzy46 hzy46 Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. #Closed

Copy link
Member

@yqwang-ms yqwang-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign off, pls test log flush

@hzy46 hzy46 merged commit 5fc32be into master Aug 12, 2020
@hzy46 hzy46 deleted the zhiyuhe/db_integration branch September 3, 2020 07:36
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants