[AMORO-4216] Refactor dangling-delete-files-cleaning via ProcessFactory plugin#4214
Conversation
# Conflicts: # amoro-ams/src/main/java/org/apache/amoro/server/AmoroServiceContainer.java # amoro-ams/src/main/java/org/apache/amoro/server/process/iceberg/IcebergProcessFactory.java # amoro-ams/src/main/java/org/apache/amoro/server/scheduler/inline/InlineTableExecutors.java
1a78771 to
88b9736
Compare
|
@zhoujinsong cc |
| tableRuntime.updateState( | ||
| DefaultTableRuntime.CLEANUP_STATE_KEY, | ||
| cleanUp -> cleanUp.setLastDanglingDeleteFilesCleanTime(System.currentTimeMillis())); | ||
| } catch (Throwable t) { |
There was a problem hiding this comment.
LocalExecutionEngine tracks process status via CompletableFuture.whenComplete: if run() throws, the process is marked FAILED; if it returns normally, it is marked SUCCESS.
By catching Throwable and swallowing it here, exceptions never propagate to the CompletableFuture, so the process status is always reported as SUCCESS even when the actual operation failed. This makes failure invisible to the framework and to operators.
The fix is to let exceptions propagate out of run(). The same issue exists in OrphanFilesCleaningProcess and should be fixed there as well.
@Override
public void run() {
AmoroTable<?> amoroTable = tableRuntime.loadTable();
TableMaintainer tableMaintainer = TableMaintainerFactory.create(amoroTable, tableRuntime);
tableMaintainer.cleanDanglingDeleteFiles();
tableRuntime.updateState(
DefaultTableRuntime.CLEANUP_STATE_KEY,
cleanUp -> cleanUp.setLastDanglingDeleteFilesCleanTime(System.currentTimeMillis()));
}If a caught-and-logged pattern is intentional (e.g., to avoid flooding failure metrics for transient errors), consider wrapping in a RuntimeException so the framework still sees a failure:
} catch (Throwable t) {
LOG.error("Unexpected dangling delete files cleaning error for table {}",
tableRuntime.getTableIdentifier(), t);
throw new RuntimeException(t);
}…ing updated to FAILED Problem Analysis: - In DanglingDeleteFilesCleaningProcess, SnapshotsExpiringProcess, and OrphanFilesCleaningProcess, exceptions in the run() method were caught but not re-thrown - This caused LocalExecutionEngine.ProcessHolder.onComplete() to never detect failures - Process status was always set to SUCCESS even when execution actually failed Fix: - Add in the catch block to re-throw exceptions - Keep existing logging for troubleshooting - Ensure exceptions properly propagate to ProcessHolder so status is correctly updated to FAILED Modified Files: - amoro-ams/src/main/java/.../DanglingDeleteFilesCleaningProcess.java - amoro-ams/src/main/java/.../SnapshotsExpiringProcess.java - amoro-ams/src/main/java/.../OrphanFilesCleaningProcess.java
49a116a to
1996e60
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #4214 +/- ##
============================================
+ Coverage 29.80% 29.86% +0.06%
- Complexity 4284 4286 +2
============================================
Files 677 679 +2
Lines 54978 55054 +76
Branches 7013 7037 +24
============================================
+ Hits 16385 16441 +56
- Misses 37370 37386 +16
- Partials 1223 1227 +4
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…ry plugin (#4214) * Refactor dangling-delete-files-cleaning via ProcessFactory plugin # Conflicts: # amoro-ams/src/main/java/org/apache/amoro/server/AmoroServiceContainer.java # amoro-ams/src/main/java/org/apache/amoro/server/process/iceberg/IcebergProcessFactory.java # amoro-ams/src/main/java/org/apache/amoro/server/scheduler/inline/InlineTableExecutors.java * Fix IcebergActions init failure by increasing MAX_NAME_LENGTH to 32 * rename action to clean-dangling-delete-files to fix pool tag mismatch * Fix: LocalProcess exceptions were swallowed, preventing state from being updated to FAILED Problem Analysis: - In DanglingDeleteFilesCleaningProcess, SnapshotsExpiringProcess, and OrphanFilesCleaningProcess, exceptions in the run() method were caught but not re-thrown - This caused LocalExecutionEngine.ProcessHolder.onComplete() to never detect failures - Process status was always set to SUCCESS even when execution actually failed Fix: - Add in the catch block to re-throw exceptions - Keep existing logging for troubleshooting - Ensure exceptions properly propagate to ProcessHolder so status is correctly updated to FAILED Modified Files: - amoro-ams/src/main/java/.../DanglingDeleteFilesCleaningProcess.java - amoro-ams/src/main/java/.../SnapshotsExpiringProcess.java - amoro-ams/src/main/java/.../OrphanFilesCleaningProcess.java * Fix logging message for dangling delete files error * Fix formatting of class documentation comment * fixup style --------- Co-authored-by: 张文领 <zhangwl9@chinatelecom.cn> (cherry picked from commit 0c5d3d8)
upstream 11커밋 동기화: AMS startup crash fix(apache#4224 AMORO-4223), snapshot/data/orphan/dangling cleaning ProcessFactory 리팩터(apache#4226/apache#4218/apache#4209/apache#4214), JUnit5 마이그레이션(apache#4199/apache#4204) 등. 사내 고유 JP CI(ts-ci-jp.yml: JDK11, tag-base) 보존.
Why are the changes needed?
Close #4216.
Brief change log
Replaced the old
DanglingDeleteFilesCleaningExecutorwithDanglingDeleteFilesCleaningProcess, a process-based implementation usingTableMaintainerFactory.Moved config from
AmoroManagementConf/config.yamltoprocess-factories.yaml/execute-engines.yaml, and removed theDANGLING_DELETE_FILES_CLEANINGenum value fromCleanupOperationalong with related cleanup methods inDefaultTableRuntime.Deleted old scheduler tests and invalid configuration.
updated
deployment.mdwith the new configuration.How was this patch tested?
Add some test cases that check the changes thoroughly including negative and positive cases if possible
Add screenshots for manual tests if appropriate
Run test locally before making a pull request
Documentation