Skip to content

HDDS-15208. OM should learn to finalize from SCM#10236

Open
sodonnel wants to merge 28 commits into
apache:HDDS-14496-zdufrom
sodonnel:HDDS-15208
Open

HDDS-15208. OM should learn to finalize from SCM#10236
sodonnel wants to merge 28 commits into
apache:HDDS-14496-zdufrom
sodonnel:HDDS-15208

Conversation

@sodonnel
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

When OM is started, it checks if finalization is needed. If so, it spawns a background service which polls SCM to see if it is time to finalize or not. If SCM indicates it should finalize, then it will finalize the OMs via Ratis and then shutdown the background service to avoid any further polling.

The poll interval defaults to 60 seconds, but its configurable via ozone.om.upgrade.finalization.check.interval.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15208

How was this patch tested?

Existing integration tests were modified to remove the calls to the old OM finalize command, and instead trigger finalize via SCM. New unit tests have been added for the new polling service.

@github-actions github-actions Bot added the zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496 label May 11, 2026
@errose28 errose28 self-requested a review May 11, 2026 15:44
Copy link
Copy Markdown
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this. Overall looks good, just left some minor comments inline.

Comment on lines +115 to +118
conf.setInt(OMStorage.TESTING_INIT_APPARENT_VERSION_KEY, OzoneManagerVersion.ATOMIC_REWRITE_KEY.serialize());
conf.setInt(SCMStorageConfig.TESTING_INIT_LAYOUT_VERSION_KEY,
HDDSLayoutFeature.HADOOP_PRC_PORTS_IN_DATANODEDETAILS.layoutVersion());
conf.set(OMConfigKeys.OZONE_OM_UPGRADE_FINALIZATION_CHECK_INTERVAL, "10ms");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were these versions changed? The OM version used here is older than the unified versioning framework but not an OMLayoutFeature so it is not meant to be written to the version file.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was making changes to try to get this to work locally, but was failing due to the annotation / aspectj stuff and then forgot to change it back. I've reset it and will see if it passes the CI now.

Comment on lines +237 to +238
conf.setInt(SCMStorageConfig.TESTING_INIT_LAYOUT_VERSION_KEY,
HDDSLayoutFeature.HADOOP_PRC_PORTS_IN_DATANODEDETAILS.layoutVersion());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to change the HDDS version in this and other OM tests?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If SCM does not need finalized then OM will immediately finalize on the first poll of SCM and it makes it difficult to control the flow in the test. Therefore we need to make it so that SCM is unfinalized, then we finalize it and that triggers the OM finalize.

Comment on lines +218 to +219
cluster.getStorageContainerLocationClient().finalizeUpgrade();
OMUpgradeTestUtils.waitForFinalization(omClient);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure these will always be called together and only in an integration test scenario. Should we combine them into a single method in MiniOzoneCluster to finalize and wait, similar to waitForClusterToBeReady?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know - the discoverability of these sort of methods isn't great and there may be places we want to have something in between.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants