[AMORO-4223][ams] Fix AMS startup crash when recovering Iceberg maintenance processes #4224
Merged
zhoujinsong merged 2 commits on May 19, 2026
…enance processes

IcebergProcessFactory.recover() was a stub that unconditionally threw RecoverProcessFailedException. Since ProcessService.recoverProcesses() had no per-record error handling, any SUBMITTED/RUNNING Iceberg expire-snapshots or clean-orphan-files record made AMS fail to start and enter a crash loop with no automatic recovery.

Changes:
- Implement IcebergProcessFactory.recover(): dispatch by the persisted action and rebuild SnapshotsExpiringProcess / OrphanFilesCleaningProcess bound to the local engine. Both are stateless, idempotent one-shot local maintenance tasks, so re-running them on recovery is safe.
- Harden ProcessService.recoverProcesses(): recover each record in its own try/catch; on failure log it, mark the record FAILED and skip it so a single un-recoverable record can no longer abort the whole AMS startup.
- Add Action.toString() returning the action name so diagnostic logs print a readable name instead of org.apache.amoro.Action@hash.
- Add unit tests for IcebergProcessFactory.recover() and Action.toString(), and a regression test covering the recoverProcesses() fail-safe path.
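The dispatch-by-action recovery and the readable Action.toString() described in this commit message can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code from the patch: every type here is a simplified stand-in that only borrows names from the description, SketchProcessFactory and the String-typed engine are hypothetical, the exception is made unchecked for brevity, and the action names are matched case-insensitively because the exact persisted casing is not shown.

```java
import java.util.Locale;
import java.util.Objects;

// Simplified stand-in for org.apache.amoro.Action (assumption: it carries a name).
final class Action {
  private final String name;

  Action(String name) {
    this.name = Objects.requireNonNull(name, "name");
  }

  String getName() {
    return name;
  }

  // Without this override, logs would print something like Action@86552c00.
  @Override
  public String toString() {
    return name;
  }
}

// Stand-in exception; unchecked here only to keep the sketch short.
class RecoverProcessFailedException extends RuntimeException {
  RecoverProcessFailedException(String message) {
    super(message);
  }
}

// Stand-in for a recovered process bound to an execution engine.
interface TableProcess {
  String engine();
}

final class SnapshotsExpiringProcess implements TableProcess {
  private final String engine;
  SnapshotsExpiringProcess(String engine) { this.engine = engine; }
  @Override public String engine() { return engine; }
}

final class OrphanFilesCleaningProcess implements TableProcess {
  private final String engine;
  OrphanFilesCleaningProcess(String engine) { this.engine = engine; }
  @Override public String engine() { return engine; }
}

// Hypothetical factory: rebuilds a stateless one-shot process from the persisted action.
final class SketchProcessFactory {
  private final String localEngine; // null models "no local engine available"

  SketchProcessFactory(String localEngine) {
    this.localEngine = localEngine;
  }

  TableProcess recover(Action action) {
    if (localEngine == null) {
      throw new RecoverProcessFailedException(
          "No local engine available to recover action " + action);
    }
    // Case-insensitive match is an assumption of this sketch.
    switch (action.getName().toLowerCase(Locale.ROOT)) {
      case "expire-snapshots":
        return new SnapshotsExpiringProcess(localEngine);
      case "clean-orphan-files":
        return new OrphanFilesCleaningProcess(localEngine);
      default:
        // String concatenation uses Action.toString(), so the message is readable.
        throw new RecoverProcessFailedException(
            "Unsupported action for recovery: " + action);
    }
  }
}
```

Because both processes are described as stateless and idempotent, the sketch rebuilds them from the action alone and restores no checkpoint.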
09141bd to 5043946
zhoujinsong (Contributor) approved these changes on May 19, 2026 and left a comment:
LGTM.
Thanks a lot for the work!
czy006 pushed a commit that referenced this pull request on May 20, 2026:
…enance processes (#4224)

Co-authored-by: lintingbin <lintingbin31@gmail.com>
Co-authored-by: ZhouJinsong <zhoujinsong0505@163.com>
(cherry picked from commit 1d4bc38)
j1wonpark pushed a commit to j1wonpark/amoro that referenced this pull request on Jun 4, 2026:
Sync 11 upstream commits: AMS startup crash fix (apache#4224, AMORO-4223), snapshot/data/orphan/dangling cleaning ProcessFactory refactor (apache#4226/apache#4218/apache#4209/apache#4214), JUnit5 migration (apache#4199/apache#4204), etc. Preserves the in-house JP CI (ts-ci-jp.yml: JDK11, tag-base).
Why are the changes needed?
Close #4223.
IcebergProcessFactory.recover() was a stub that unconditionally threw RecoverProcessFailedException for every action. Because ProcessService.recoverProcesses() had no per-record error handling, the exception (wrapped as IllegalStateException by DefaultActionCoordinator) propagated out of the synchronous AMS startup path: AmoroServiceContainer.main → transitionToLeader → startOptimizingService → DefaultTableService.initialize → RuntimeHandlerChain.initialize → ProcessService.initialize → recoverProcesses().

As a result, after an AMS restart, a single SUBMITTED/RUNNING Iceberg expire-snapshots or clean-orphan-files record in table_process made AMS fail to start and enter a crash loop with no automatic recovery. This is a regression introduced by the recent ProcessFactory refactor (#4107, #4209).

Reported stack trace:
Brief change log
- IcebergProcessFactory.recover(): implemented. It now dispatches on the persisted action and rebuilds SnapshotsExpiringProcess / OrphanFilesCleaningProcess bound to the local execution engine. Both are stateless, idempotent one-shot local maintenance tasks (no checkpoint), so re-running them on recovery is safe. An unsupported action or a missing local engine still throws RecoverProcessFailedException with a clear message.
- ProcessService.recoverProcesses(): hardened. Each record is now recovered in its own try/catch; on failure it is logged, the offending record is marked FAILED and skipped, so one un-recoverable record can no longer abort the whole AMS startup. The affected periodic maintenance action is re-scheduled normally by its scheduler.
- Action.toString(): added, returning the action name, so diagnostic logs print a readable name (e.g. EXPIRE-SNAPSHOTS) instead of org.apache.amoro.Action@86552c00.

How was this patch tested?
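As a rough illustration of the fail-safe property this patch tests for, the per-record try/catch recovery loop might look like the sketch below. It is an assumption-laden toy model, not the actual ProcessService code: ProcessRecord, ProcessStatus, FailSafeRecovery and the Consumer-based recovery callback are all hypothetical names, and the "persisted" status is just an in-memory field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical subset of the statuses a table_process row can have.
enum ProcessStatus { SUBMITTED, RUNNING, FAILED }

// Hypothetical in-memory model of a persisted process record.
final class ProcessRecord {
  final long id;
  final String action;
  ProcessStatus status;

  ProcessRecord(long id, String action, ProcessStatus status) {
    this.id = id;
    this.action = action;
    this.status = status;
  }
}

final class FailSafeRecovery {
  private FailSafeRecovery() {}

  /**
   * Tries to recover every SUBMITTED/RUNNING record on its own.
   * A record whose recovery throws is logged, marked FAILED and skipped,
   * so the loop (and therefore startup) continues with the next record.
   * Returns the ids of the records that were recovered successfully.
   */
  static List<Long> recoverProcesses(
      List<ProcessRecord> records, Consumer<ProcessRecord> recoverOne) {
    List<Long> recovered = new ArrayList<>();
    for (ProcessRecord record : records) {
      boolean active =
          record.status == ProcessStatus.SUBMITTED || record.status == ProcessStatus.RUNNING;
      if (!active) {
        continue; // FAILED records are not picked up again on a later restart
      }
      try {
        recoverOne.accept(record);
        recovered.add(record.id);
      } catch (RuntimeException e) {
        System.err.println(
            "Failed to recover process " + record.id + " (" + record.action + "): "
                + e.getMessage());
        record.status = ProcessStatus.FAILED;
      }
    }
    return recovered;
  }
}
```

Calling recoverProcesses twice with a callback that always fails for one record models two consecutive restarts: the first call marks that record FAILED, and the second call never passes it to the callback.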
- TestIcebergProcessFactory: recovering expire-snapshots / clean-orphan-files returns the right process bound to the local engine; an unsupported action and a missing local engine each throw RecoverProcessFailedException.
- TestAction: toString() returns the action name.
- TestDefaultProcessService#testRecoverProcessFailSafe: a coordinator whose recover() always fails no longer aborts recoverProcesses(); the bad record is marked FAILED and a subsequent restart neither throws nor re-picks it.
- (amoro-common + amoro-ams affected tests pass; spotless & checkstyle clean)

Documentation