Explicitly specify nodes in srun calls #630

Merged: xylar merged 1 commit into MPAS-Dev:main from xylar:fix-nodes-and-pio-tasks on May 14, 2023
xylar (Collaborator) commented May 10, 2023

We want to explicitly specify the number of nodes in `srun` calls for Slurm job steps. Otherwise, we may end up with a job running on more nodes than needed and an incompatible PIO task count.
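
A minimal sketch of the idea, not the code changed in this PR: compute the smallest node count that fits a step and pass it to `srun` with `-N` alongside `-n` and `-c`. The helper names (`nodes_needed`, `srun_command`), the executable `./ocean_model`, and the value of 128 cores per node are assumptions made for illustration.

```python
import math


def nodes_needed(ntasks, cpus_per_task, cores_per_node):
    """Return the smallest number of nodes that can hold all tasks."""
    if ntasks < 1 or cpus_per_task < 1 or cores_per_node < 1:
        raise ValueError("all arguments must be positive integers")
    if cpus_per_task > cores_per_node:
        raise ValueError("a single task does not fit on one node")
    # Tasks do not span nodes, so count how many whole tasks fit per node.
    tasks_per_node = cores_per_node // cpus_per_task
    return math.ceil(ntasks / tasks_per_node)


def srun_command(args, ntasks, cpus_per_task, cores_per_node):
    """Build an srun argument list with an explicit node count (-N)."""
    nodes = nodes_needed(ntasks, cpus_per_task, cores_per_node)
    return [
        "srun",
        "-c", str(cpus_per_task),  # CPUs per task
        "-N", str(nodes),          # explicit node count for this step
        "-n", str(ntasks),         # total MPI tasks
    ] + list(args)


# Hypothetical step: 256 single-threaded tasks on nodes assumed to have 128 cores.
print(" ".join(srun_command(["./ocean_model"], ntasks=256,
                            cpus_per_task=1, cores_per_node=128)))
# prints: srun -c 1 -N 2 -n 256 ./ocean_model
```

With the node count stated explicitly, a small step launched inside a larger allocation is not left to spread over every node in the job.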

Checklist

  • Document (in a comment titled Testing in this PR) any testing that was used to verify the changes

closes #629

xylar added the bug and framework labels on May 10, 2023
xylar requested a review from matthewhoffman on May 10, 2023, 01:51
xylar self-assigned this on May 10, 2023
xylar (Collaborator, Author) commented May 10, 2023

@matthewhoffman, it would be good to make sure the full_integration suite works fine on Perlmutter with these changes. It might be worth testing on 3 nodes even if you don't have any tests that big, because that seemed to be the point at which the problem emerged.

xylar (Collaborator, Author) commented May 10, 2023

Testing

The pr test suite passes on Perlmutter with 3 nodes using this branch and is bit-for-bit (BFB) identical to a baseline from main run on 2 nodes. Without this fix, the pr suite hangs on 3 nodes, as reported in #629.

xylar marked this pull request as draft on May 10, 2023, 15:23
xylar (Collaborator, Author) commented May 10, 2023

On further testing, the problem persists even with the fix in this branch. I'm putting this PR in draft mode. @matthewhoffman, please don't review or test just yet.

xylar marked this pull request as ready for review on May 10, 2023, 22:17
xylar (Collaborator, Author) commented May 10, 2023

Okay, things are now working for me in #631, so I think this is ready for you to test when you can, @matthewhoffman.

xylar mentioned this pull request on May 11, 2023
xylar removed the request for review from matthewhoffman on May 14, 2023, 04:08
xylar (Collaborator, Author) commented May 14, 2023

@matthewhoffman, I ran the full_integration suite on Chrysalis and everything passed so I think we're good. I'm going to go ahead with this one since it's the first in a chain that's getting pretty long.

xylar merged commit f12807b into MPAS-Dev:main on May 14, 2023
xylar deleted the fix-nodes-and-pio-tasks branch on May 14, 2023, 04:09
matthewhoffman (Member) commented

@xylar, sorry not to have noticed this until now. Glad you were able to take care of it without me holding you up.


Development

Successfully merging this pull request may close these issues.

Hanging on pr suite tests with 3 Perlmutter nodes
