[ci] Remove hardcoded test shards#10743

Merged
areusch merged 1 commit into apache:main from driazati:generate_shard
Apr 8, 2022

Conversation


@driazati driazati commented Mar 23, 2022

This moves the sharding logic from being inlined in the Jenkinsfile into a template, so we can change just the number of shards and the test allocation in conftest.py and the Jenkinsfile will be generated to match. This also changes the test allocation from manual balancing to a pseudo-random assignment between shards. Each shard needs to know only its shard number and the total number of shards; it then hashes each test and skips it unless that hash falls within its allocated range. This breaks up related tests across shards, but has the downside that any change to the number of shards will shuffle where the tests end up (ideally this is rare once we settle on a good number of shards). Some tests are also manually allocated round-robin to different shards to ensure that long-running tests run on different shards as much as possible.

This only does this for the GPU frontend tests but eventually we could expand it to more.

cc @areusch

@driazati driazati changed the title generate shard [ci] Remove hardcoded test shards Mar 23, 2022
@driazati driazati marked this pull request as ready for review March 25, 2022 23:00
@github-actions github-actions Bot requested a review from areusch March 25, 2022 23:03
@driazati driazati marked this pull request as draft March 29, 2022 20:25
@areusch areusch left a comment (Contributor)
thanks @driazati this looks pretty good! just a couple small suggestions, could also defer if you feel strongly against them.

Comment thread jenkins/macros.j2

{% macro sharded_test_step(name, num_shards, node, ws) %}
{% for shard_index in range(1, num_shards + 1) %}
'{{ name }} {{ shard_index }} of {{ num_shards }}': {
@areusch (Contributor)

minor nit: want to 0-pad shard_index and num_shards here?

@driazati (Member Author)

meh, this is intended for humans and everything is text-align: center-ed anyways so 0-padding won't make it easier to read IMO (also we are at like 3 shards max not 10 so we can revisit later)

@areusch (Contributor)

ok sg. the main intent was that Jenkins sorts these alphabetically iiuc.

@driazati (Member Author)

ah I see, we should still be ok on that front but would probably need to pad if we shard more than n=9
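If the shard count ever grows past 9, zero-padding the labels would keep Jenkins' alphabetical ordering numeric. A hypothetical sketch of such label formatting (not part of the PR):

```python
def shard_label(name: str, shard_index: int, num_shards: int) -> str:
    """Zero-pad the shard number so lexicographic sort matches numeric order."""
    width = len(str(num_shards))  # pad to the width of the largest shard number
    return f"{name} {shard_index:0{width}d} of {num_shards}"

labels = [shard_label("frontend: GPU", i, 12) for i in range(1, 13)]
# With padding, "02" sorts before "10"; without it, "10" would sort before "2".
assert labels == sorted(labels)
```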

Comment thread jenkins/macros.j2 Outdated
@driazati driazati marked this pull request as ready for review April 6, 2022 18:40
@driazati driazati requested a review from areusch April 6, 2022 18:40
@driazati driazati marked this pull request as draft April 7, 2022 19:33
@driazati driazati marked this pull request as ready for review April 8, 2022 00:58
@driazati (Member Author)

driazati commented Apr 8, 2022

I also verified the tests run in the various shards against the tests from a recent main run; the overall set of tests run is the same, so the sharding is working correctly. Code is here https://gist.github.com/f636700cd68b5717350c107a0eaaee4e for the curious.
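A comparison like that can be sketched as: collect the test set assigned to each shard, then check the shards are disjoint and together cover the full unsharded run. The names here are assumptions, not the gist's code:

```python
import hashlib

def find_shard_index(test_name: str, num_shards: int) -> int:
    """Stable hash-based shard assignment (same scheme as the sharding itself)."""
    digest = hashlib.sha256(test_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def verify_sharding(all_tests, num_shards):
    """Return per-shard test sets, asserting they partition the full set."""
    shards = [
        {t for t in all_tests if find_shard_index(t, num_shards) == i}
        for i in range(num_shards)
    ]
    # Every test runs exactly once: shards cover the full set with no overlap.
    assert set().union(*shards) == set(all_tests)
    assert sum(len(s) for s in shards) == len(all_tests)
    return shards
```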

@areusch areusch left a comment (Contributor)

thanks @driazati , couple small comments here and there but could defer them as they're less important than reducing CI time

Comment thread conftest.py

def pytest_collection_modifyitems(config, items):
if not all(k in os.environ for k in ["CI", "TVM_NUM_SHARDS", "TVM_SHARD_INDEX"]):
# Only apportion tests if in CI and in a job that is set up for it
@areusch (Contributor)

nit: could log here if CI is present
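That gate plus the suggested logging could look like the following sketch. The helper name and message wording are assumptions, not the PR's code:

```python
import logging
import os

def should_apportion_tests(environ=os.environ) -> bool:
    """Only shard when CI and both sharding variables are present."""
    required = ["CI", "TVM_NUM_SHARDS", "TVM_SHARD_INDEX"]
    missing = [k for k in required if k not in environ]
    if not missing:
        return True
    if "CI" in environ:
        # Surface partially configured CI jobs instead of silently
        # running the full, unsharded test suite.
        logging.info("Not sharding tests: missing env vars %s", missing)
    return False
```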

Comment thread conftest.py
"tests/python/topi/python/test_topi_conv2d_winograd.py::test_conv2d_nchw",
"tests/python/relay/test_py_converter.py::test_global_recursion",
]
HARDCODED_ALLOCATIONS = {}
@areusch (Contributor)

nit: could do with dict comprehension: HARDCODED_ALLOCATIONS = {i: v for i, v in enumerate(_slowest_tests)}

@areusch areusch merged commit 0c17f07 into apache:main Apr 8, 2022
mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Apr 11, 2022
altanh pushed a commit to altanh/tvm that referenced this pull request Apr 28, 2022