[Improvement]: Refactor snapshot-expiring via ProcessFactory plugin by baiyangtx · Pull Request #4107 · apache/amoro

baiyangtx · 2026-03-05T14:11:10Z

Why are the changes needed?

Refactor snapshot expiring action via ProcessFactory plugin

Brief change log

Refactor Iceberg snapshot expiration from the inline scheduler into a pluggable process model (ProcessFactory + ExecuteEngine).
Introduce IcebergProcessFactory and SnapshotsExpiringProcess, executed by the default LocalExecutionEngine with async execution, status tracking and cancellation.
Externalize enablement/interval knobs via plugin YAML and new config options (e.g. expire-snapshots.enabled, expire-snapshots.interval).
Add unit tests for IcebergProcessFactory and LocalExecutionEngine, and update the “last snapshots expiring” timestamp only after the actual expiration run finishes to preserve correct interval semantics.

How was this patch tested?

Add some test cases that check the changes thoroughly including negative and positive cases if possible
Add screenshots for manual tests if appropriate
Run test locally before making a pull request

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

czy006 · 2026-03-13T08:12:22Z

+      ConfigOptions.key("expire-snapshots.enabled").booleanType().defaultValue(true);
+
+  public static final ConfigOption<Duration> SNAPSHOT_EXPIRE_INTERVAL =
+      ConfigOptions.key("expire-snapshot.interval")


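The plugin YAML uses expire-snapshots.interval, but the key defined here is expire-snapshot.interval (missing the "s"), so the two do not match.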

czy006 · 2026-03-13T08:17:19Z

+        properties.keySet().stream()
+            .filter(key -> key.startsWith(POOL_CONFIG_PREFIX))
+            .map(key -> key.substring(POOL_CONFIG_PREFIX.length()))
+            .map(key -> key.substring(0, key.indexOf(".") + 1))


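With this mapping the extracted pool name keeps its trailing dot, so the rebuilt keys end up as pool.default..thread-count / pool.snapshots-expiring..thread-count, and the pool configuration is never found.

I have fixed it and added some tests.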
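To make the off-by-one concrete, here is a minimal, self-contained sketch. The class name, the `"pool."` value of `POOL_CONFIG_PREFIX`, and the `fixed` variant are assumptions for illustration; the actual fix in the PR may look different.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Illustrative only: reproduces the reviewed mapping and one possible fix. */
final class PoolNameExtraction {
  // Assumed value; the real constant is defined in the PR.
  static final String POOL_CONFIG_PREFIX = "pool.";

  // Mapping as reviewed: "indexOf(".") + 1" keeps the trailing dot ("default.").
  static Set<String> buggy(Map<String, String> properties) {
    return properties.keySet().stream()
        .filter(key -> key.startsWith(POOL_CONFIG_PREFIX))
        .map(key -> key.substring(POOL_CONFIG_PREFIX.length()))
        .map(key -> key.substring(0, key.indexOf(".") + 1))
        .collect(Collectors.toCollection(TreeSet::new));
  }

  // One possible fix: ignore keys without a sub-key and cut before the dot.
  static Set<String> fixed(Map<String, String> properties) {
    return properties.keySet().stream()
        .filter(key -> key.startsWith(POOL_CONFIG_PREFIX))
        .map(key -> key.substring(POOL_CONFIG_PREFIX.length()))
        .filter(key -> key.indexOf('.') > 0)
        .map(key -> key.substring(0, key.indexOf('.')))
        .collect(Collectors.toCollection(TreeSet::new));
  }
}
```

With the buggy variant, concatenating the prefix, the extracted name, and `.thread-count` gives the double-dot keys quoted above.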

czy006 · 2026-03-13T08:20:30Z

org.apache.amoro.process.ExecuteEngine

czy006 · 2026-03-13T08:28:29Z

+  @Override
+  public void run() {
+    try {
+      AmoroTable<?> amoroTable = tableRuntime.loadTable();


The problem is that the new scheduling path no longer preserves the old “run, then record cleanup time” behavior for snapshot expiration.

In the old implementation, SnapshotsExpiringExecutor.java executed tableMaintainer.expireSnapshots() synchronously. Only after that finished did PeriodicTableScheduler.java (line 125) update lastCleanTime and schedule the next run. So the interval was effectively measured from the end of the previous cleanup.

In the new path, ActionCoordinatorScheduler.java (line 103) only submits/registers a process and returns immediately. After that return, PeriodicTableScheduler still updates lastCleanTime right away, even though the real cleanup work has not finished yet. The actual cleanup now happens later in SnapshotsExpiringProcess.java (line 53).
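One way to keep the old "run, then record cleanup time" semantics with asynchronous submission is to update the timestamp in a completion callback instead of right after submit. The sketch below is only an illustration built on `CompletableFuture`; `ExpireScheduleSketch` and its methods are hypothetical names, not classes from this PR.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: record the clean time only after the work has finished. */
final class ExpireScheduleSketch {
  private final AtomicLong lastCleanTime = new AtomicLong(0L);
  private final Executor executor;

  ExpireScheduleSketch(Executor executor) {
    this.executor = executor;
  }

  // Submits the cleanup asynchronously. The timestamp is set by the dependent
  // stage, which runs only if the cleanup completes normally.
  CompletableFuture<Void> submit(Runnable cleanup) {
    return CompletableFuture.runAsync(cleanup, executor)
        .thenRun(() -> lastCleanTime.set(System.currentTimeMillis()));
  }

  long lastCleanTime() {
    return lastCleanTime.get();
  }
}
```

Because `thenRun` is skipped on exceptional completion, a failed run leaves the timestamp untouched in this sketch, so the next interval check would still see the previous successful run.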

Building on your observation — the async submission also introduces a state-loss issue in LocalExecutionEngine.getStatus().

getStatus() removes the Future from the map on terminal states (isDone/isCancelled), making it non-idempotent:

Call 1: future.isDone() == true → remove → SUCCESS
Call 2: future == null → KILLED (wrong!)

TableProcessExecutor polls getStatus() in a loop (line 107), so if any retry or concurrent access queries the same identifier twice after completion, it gets KILLED instead of the real result.

There's also a TOCTOU race between containsKey and get across cancelingInstances/activeInstances (lines 67-70), since the compound check-then-act isn't atomic even with ConcurrentHashMap.

Information about a completed process is not cleaned up immediately. It is kept for a TTL period, which is long enough for the process executor to obtain the final status and persist it to the database.
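A rough sketch of that idea, assuming a tracker that keeps terminal entries until a TTL has elapsed: `getStatus` only reads, so repeated calls return the same result, and a single `get` on the map avoids the check-then-act pair. All names and the status enum here are hypothetical and do not mirror `LocalExecutionEngine`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Hypothetical tracker; names and statuses are illustrative only. */
final class StatusTrackerSketch {
  enum Status { RUNNING, SUCCESS, FAILED, CANCELED, UNKNOWN }

  private static final class Entry {
    final Future<?> future;
    volatile long finishedAtMillis = -1L; // -1: completion not observed yet

    Entry(Future<?> future) {
      this.future = future;
    }
  }

  private final Map<String, Entry> instances = new ConcurrentHashMap<>();
  private final long ttlMillis;

  StatusTrackerSketch(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  void register(String id, Future<?> future) {
    instances.put(id, new Entry(future));
  }

  // Idempotent: never removes the entry, so a second call sees the same state.
  Status getStatus(String id, long nowMillis) {
    Entry entry = instances.get(id); // single lookup, no containsKey-then-get
    if (entry == null) return Status.UNKNOWN;
    if (!entry.future.isDone()) return Status.RUNNING;
    if (entry.finishedAtMillis < 0) entry.finishedAtMillis = nowMillis;
    if (entry.future.isCancelled()) return Status.CANCELED;
    try {
      entry.future.get(); // already done, so this does not block
      return Status.SUCCESS;
    } catch (ExecutionException e) {
      return Status.FAILED;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Status.UNKNOWN;
    }
  }

  // Called periodically: drops entries whose completion was observed TTL ago.
  void evictExpired(long nowMillis) {
    instances.values().removeIf(
        e -> e.finishedAtMillis >= 0 && nowMillis - e.finishedAtMillis >= ttlMillis);
  }
}
```

In this sketch the TTL clock starts when completion is first observed by `getStatus`, so an entry that is never polled is never evicted; a real implementation would likely stamp the finish time when the task itself ends.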

czy006 · 2026-03-13T08:38:43Z

It looks like IcebergProcessFactory receives available execute engines too early.

In AmoroServiceContainer, availableExecuteEngines(executeEngineManager.installedPlugins()) is called before executeEngineManager.initialize(), so installedPlugins() is still empty at that point. As a result, IcebergProcessFactory.localEngine is never set.

Later, when snapshot expiration is triggered, triggerExpireSnapshot() returns Optional.empty() because localEngine == null, so no SnapshotsExpiringProcess is ever created or submitted.

In other words, the new expire-snapshots path is effectively disabled due to initialization order. We probably need to initialize execute engines before injecting them into process factories, or re-inject them after engine initialization.
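The ordering problem can be reproduced with a toy model. The classes below are stand-ins invented for illustration (engines are plain strings, and `EngineManager` / `Factory` are not the real `AmoroServiceContainer` types); they only show why injecting before `initialize()` leaves the factory without an engine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Toy model of the initialization-order issue; all types are hypothetical. */
final class InitOrderSketch {
  static final class EngineManager {
    private final List<String> installed = new ArrayList<>();

    // Engines only become visible after initialization.
    void initialize() {
      installed.add("local");
    }

    List<String> installedPlugins() {
      return new ArrayList<>(installed);
    }
  }

  static final class Factory {
    private String localEngine; // stays null if no "local" engine was injected

    void availableExecuteEngines(List<String> engines) {
      localEngine = engines.contains("local") ? "local" : null;
    }

    // Mirrors the described behavior: no engine means no process is created.
    Optional<String> trigger() {
      return localEngine == null
          ? Optional.empty()
          : Optional.of("process-on-" + localEngine);
    }
  }
}
```

Calling `availableExecuteEngines(manager.installedPlugins())` before `manager.initialize()` makes `trigger()` return `Optional.empty()`; swapping the two calls makes it return a value.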


Add unit test coverage for `IcebergProcessFactory` and `LocalExecutionEngine`.
- IcebergProcessFactory: covers open/supportedActions, the trigger-strategy interval, and the triggered / not-triggered / disabled branches
- LocalExecutionEngine: covers tag-based thread pool selection, the cancellation flow, failure status, TTL expiry of completed states, and invalid identifiers
Local verification:
- mvn -pl amoro-common -am -Dtest=TestLocalExecutionEngine test
- mvn -pl amoro-ams -am -Dtest=TestIcebergProcessFactory test
Co-Authored-By: Aime <aime@bytedance.com>
Change-Id: I14dcd3c1286f2be72f8135777e1a81568d060b7d


zhoujinsong

Overall this is a clean refactor that migrates snapshot expiring from the inline executor into the ProcessFactory framework. A few minor notes:

Duplicate interval check (IcebergProcessFactory.triggerExpireSnapshot): The time-since-last-execution guard is redundant since ActionCoordinatorScheduler already controls trigger frequency via ProcessTriggerStrategy.FIXED_RATE. Consider removing the manual check from the factory to avoid two sources of truth for scheduling state.
retryNumber on recovery: processMeta.getRetryNumber() may deserialize as 0 if the field is not persisted, causing already-failed processes to reset their retry count on AMS restart. Worth verifying the DB mapping.
Empty getSummary(): SnapshotsExpiringProcess.getSummary() returns an empty map, so nothing is recorded in the process store after execution. Even a minimal summary (e.g. elapsed time) would help with debugging.
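For the last note, a minimal summary could be as small as one elapsed-time entry. The sketch below is an assumption about shape only: `TimedProcessSketch`, the `elapsed-ms` key, and the `Map<String, String>` return type are hypothetical and not taken from `SnapshotsExpiringProcess`.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical wrapper that records how long the wrapped action took. */
final class TimedProcessSketch implements Runnable {
  private final Runnable action;
  private volatile long elapsedNanos = -1L; // -1: not executed yet

  TimedProcessSketch(Runnable action) {
    this.action = action;
  }

  @Override
  public void run() {
    long start = System.nanoTime();
    try {
      action.run();
    } finally {
      // Recorded even if the action throws, so failures are also timed.
      elapsedNanos = System.nanoTime() - start;
    }
  }

  // Empty before execution; afterwards a single elapsed-time entry.
  Map<String, String> getSummary() {
    if (elapsedNanos < 0) {
      return Collections.emptyMap();
    }
    Map<String, String> summary = new LinkedHashMap<>();
    summary.put("elapsed-ms", String.valueOf(elapsedNanos / 1_000_000L));
    return summary;
  }
}
```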

None of these are blockers. LGTM.

github-actions Bot added the module:ams-server, type:infra, type:build, and module:common labels Mar 5, 2026

czy006 self-requested a review March 12, 2026 13:42

baiyangtx force-pushed the upstream/SnapshotsExpiring-processFactory branch from 709fc06 to b5e817f Compare March 12, 2026 13:50

baiyangtx marked this pull request as ready for review March 12, 2026 13:51

czy006 reviewed Mar 13, 2026


baiyangtx force-pushed the upstream/SnapshotsExpiring-processFactory branch from 2381805 to d4fb5e7 Compare March 16, 2026 06:19

[Improvement] Refactor SnapshotExpiring inline executor with ProcessF…

1846d5b

…actory Co-Authored-By: Aime <aime@bytedance.com>

baiyangtx force-pushed the upstream/SnapshotsExpiring-processFactory branch from d4fb5e7 to 1846d5b Compare March 16, 2026 06:22

github-actions Bot removed the type:infra label Mar 16, 2026

zhangyongxiang.alpha and others added 8 commits March 16, 2026 15:26

chore: fix test file permissions

6fe31f5

Fix the newly added unit test files being mistakenly marked as executable (100755), to avoid a meaningless mode diff in the MR. Co-Authored-By: Aime <aime@bytedance.com> Change-Id: I83c82894a66514cabecff2751e22c1c418469ac5

CI

104b05e

CI

d7a2337

CI

4cd7eeb

FIX update lastExpireSnapshot logic

4bbf471

Merge branch 'master' into upstream/SnapshotsExpiring-processFactory

f174f19

remove uesless config

d28da76

zhoujinsong approved these changes Mar 17, 2026


czy006 approved these changes Mar 17, 2026


baiyangtx merged commit 0721df9 into apache:master Mar 17, 2026
6 checks passed

baiyangtx deleted the upstream/SnapshotsExpiring-processFactory branch March 17, 2026 13:51

xxubai added this to the Release 0.9.0 milestone Mar 27, 2026

This was referenced May 5, 2026

[Improvement]: Migrate remaining table maintenance Executors to Process architecture #4202

Open

[AMORO-4208] Refactor orphan-files-cleaning via ProcessFactory plugin #4209

Merged

zhangwl9 mentioned this pull request May 14, 2026

[Subtask]: Remove invalid code and configuration about SnapshotsExpiringExecutor #4217

Closed

3 tasks

lintingbin mentioned this pull request May 18, 2026

[AMORO-4223][ams] Fix AMS startup crash when recovering Iceberg maintenance processes #4224

Merged

2 tasks

baiyangtx merged 9 commits into apache:master from baiyangtx:upstream/SnapshotsExpiring-processFactory
