Skip to content

[AMORO-4223][ams] Fix AMS startup crash when recovering Iceberg maintenance processes#4224

Merged
zhoujinsong merged 2 commits into
apache:masterfrom
lintingbin:AMORO-4223-fix-iceberg-process-recover
May 19, 2026
Merged

[AMORO-4223][ams] Fix AMS startup crash when recovering Iceberg maintenance processes#4224
zhoujinsong merged 2 commits into
apache:masterfrom
lintingbin:AMORO-4223-fix-iceberg-process-recover

Conversation

@lintingbin

Copy link
Copy Markdown
Contributor

Why are the changes needed?

Close #4223.

IcebergProcessFactory.recover() was a stub that unconditionally threw RecoverProcessFailedException for every action. Because ProcessService.recoverProcesses() had no per-record error handling, the exception (wrapped as IllegalStateException by DefaultActionCoordinator) propagated out of the synchronous AMS startup path:

AmoroServiceContainer.maintransitionToLeaderstartOptimizingServiceDefaultTableService.initializeRuntimeHandlerChain.initializeProcessService.initializerecoverProcesses().

As a result, after an AMS restart, a single SUBMITTED/RUNNING Iceberg expire-snapshots or clean-orphan-files record in table_process made AMS fail to start and enter a crash loop with no automatic recovery. This is a regression introduced by the recent ProcessFactory refactor (#4107, #4209).

Reported stack trace:

java.lang.IllegalStateException: Failed to recover table process for format ICEBERG,
  action org.apache.amoro.Action@86552c00, table iceberg_catalog.ods.xxxxx(tableId=10)
  at org.apache.amoro.server.table.DefaultActionCoordinator.recoverTableProcess(DefaultActionCoordinator.java:118)
Caused by: org.apache.amoro.process.RecoverProcessFailedException:
  Unsupported action for IcebergProcessFactory: org.apache.amoro.Action@86552c00
  at org.apache.amoro.server.process.iceberg.IcebergProcessFactory.recover(IcebergProcessFactory.java:102)

Brief change log

  • IcebergProcessFactory.recover(): implemented. It now dispatches on the persisted action and rebuilds SnapshotsExpiringProcess / OrphanFilesCleaningProcess bound to the local execution engine. Both are stateless, idempotent one-shot local maintenance tasks (no checkpoint), so re-running them on recovery is safe. An unsupported action or a missing local engine still throws RecoverProcessFailedException with a clear message.
  • ProcessService.recoverProcesses(): hardened. Each record is now recovered in its own try/catch; on failure it is logged, the offending record is marked FAILED and skipped, so one un-recoverable record can no longer abort the whole AMS startup. The affected periodic maintenance action is re-scheduled normally by its scheduler.
  • Action.toString(): added, returning the action name, so diagnostic logs print a readable name (e.g. EXPIRE-SNAPSHOTS) instead of org.apache.amoro.Action@86552c00.

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible
    • TestIcebergProcessFactory: recover of expire-snapshots / clean-orphan-files returns the right process bound to the local engine; unsupported action and missing local engine throw RecoverProcessFailedException.
    • TestAction: toString() returns the action name.
    • TestDefaultProcessService#testRecoverProcessFailSafe: a coordinator whose recover() always fails no longer aborts recoverProcesses(); the bad record is marked FAILED and a subsequent restart neither throws nor re-picks it.
  • Run test locally before making a pull request (amoro-common + amoro-ams affected tests pass; spotless & checkstyle clean)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

…enance processes

IcebergProcessFactory.recover() was a stub that unconditionally threw
RecoverProcessFailedException. Since ProcessService.recoverProcesses()
had no per-record error handling, any SUBMITTED/RUNNING Iceberg
expire-snapshots or clean-orphan-files record made AMS fail to start and
enter a crash loop with no automatic recovery.

Changes:
- Implement IcebergProcessFactory.recover(): dispatch by the persisted
  action and rebuild SnapshotsExpiringProcess / OrphanFilesCleaningProcess
  bound to the local engine. Both are stateless, idempotent one-shot local
  maintenance tasks, so re-running them on recovery is safe.
- Harden ProcessService.recoverProcesses(): recover each record in its own
  try/catch; on failure log it, mark the record FAILED and skip it so a
  single un-recoverable record can no longer abort the whole AMS startup.
- Add Action.toString() returning the action name so diagnostic logs print
  a readable name instead of org.apache.amoro.Action@hash.
- Add unit tests for IcebergProcessFactory.recover() and Action.toString(),
  and a regression test covering the recoverProcesses() fail-safe path.
@lintingbin lintingbin force-pushed the AMORO-4223-fix-iceberg-process-recover branch from 09141bd to 5043946 Compare May 18, 2026 10:43

@zhoujinsong zhoujinsong left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Thanks a lot for the work!

@zhoujinsong zhoujinsong merged commit 1d4bc38 into apache:master May 19, 2026
6 checks passed
czy006 pushed a commit that referenced this pull request May 20, 2026
…enance processes (#4224)

IcebergProcessFactory.recover() was a stub that unconditionally threw
RecoverProcessFailedException. Since ProcessService.recoverProcesses()
had no per-record error handling, any SUBMITTED/RUNNING Iceberg
expire-snapshots or clean-orphan-files record made AMS fail to start and
enter a crash loop with no automatic recovery.

Changes:
- Implement IcebergProcessFactory.recover(): dispatch by the persisted
  action and rebuild SnapshotsExpiringProcess / OrphanFilesCleaningProcess
  bound to the local engine. Both are stateless, idempotent one-shot local
  maintenance tasks, so re-running them on recovery is safe.
- Harden ProcessService.recoverProcesses(): recover each record in its own
  try/catch; on failure log it, mark the record FAILED and skip it so a
  single un-recoverable record can no longer abort the whole AMS startup.
- Add Action.toString() returning the action name so diagnostic logs print
  a readable name instead of org.apache.amoro.Action@hash.
- Add unit tests for IcebergProcessFactory.recover() and Action.toString(),
  and a regression test covering the recoverProcesses() fail-safe path.

Co-authored-by: lintingbin <lintingbin31@gmail.com>
Co-authored-by: ZhouJinsong <zhoujinsong0505@163.com>
(cherry picked from commit 1d4bc38)
j1wonpark pushed a commit to j1wonpark/amoro that referenced this pull request Jun 4, 2026
upstream 11커밋 동기화: AMS startup crash fix(apache#4224 AMORO-4223),
snapshot/data/orphan/dangling cleaning ProcessFactory 리팩터(apache#4226/apache#4218/apache#4209/apache#4214),
JUnit5 마이그레이션(apache#4199/apache#4204) 등. 사내 고유 JP CI(ts-ci-jp.yml: JDK11, tag-base) 보존.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Failed to recover table process for format ICEBERG

2 participants