Spark 3.5: Set log level to WARN for rewrite task failure with partial progress#9400
Spark 3.5: Set log level to WARN for rewrite task failure with partial progress#9400manuzhang wants to merge 1 commit into
Conversation
|
@ajantha-bhat please help review. |
| .onFailure( | ||
| (fileGroup, exception) -> { | ||
| LOG.error("Failure during rewrite group {}", fileGroup.info(), exception); | ||
| LOG.warn("Failure during rewrite group {}", fileGroup.info(), exception); |
There was a problem hiding this comment.
Hm I think I get the rationale that during partial progress it could be too noisy if we error logged every failure. But the issue is let's say most of the commit tasks are failures, and we only see warn logs which are easy to get masked. The only case we error log is when all tasks fail which seems insufficient logging in case of failure.
I actually think error is fine here. Are you seeing a lot of noisiness in this in any systems you may be running?
I also noticed that the non-partial progress case is also warn, but tbh that also doesn't seem right. What do you think @manuzhang also cc @szehon-ho @RussellSpitzer
There was a problem hiding this comment.
Another idea is that we log an error if the commit results size is less than some threshold rather than completely empty. This would eliminate the task level noise from errors but at the same time indicate to logging systems that the compaction isn't super effective.
if (commitResults.size() == 0) {
LOG.error(
"{} is true but no rewrite commits succeeded. Check the logs to determine why the individual "
+ "commits failed. If this is persistent it may help to increase {} which will break the rewrite operation "
+ "into smaller commits.",
PARTIAL_PROGRESS_ENABLED,
PARTIAL_PROGRESS_MAX_COMMITS);
}
to
if (commitResults.size()/commitAttempts <= SOME_THRESHOLD) {
LOG.error(
"{} is true but less than SOME_THRESHOLD rewrite commits succeeded. Check the logs to determine why the individual "
+ "commits failed. If this is persistent it may help to increase {} which will break the rewrite operation "
+ "into smaller commits.",
PARTIAL_PROGRESS_ENABLED,
PARTIAL_PROGRESS_MAX_COMMITS);
}
```
but maybe determining commitAttempts is overcomplicated.
There was a problem hiding this comment.
I agree ERROR is better for non-partial case. For partial progress, since failure is allowed, WARN sends the signal that failure is not fatal to the whole procedure.
Even the rewrite task failure without partial progress is logged as a WARN. It shouldn't be an ERROR with partial progress.