Fail retention app when the columnPattern mismatch partition spec for…#552
Conversation
mkuchenbecker
left a comment
There was a problem hiding this comment.
- How do we know when this is failing?
- When this is failing, what is the impact on the customer's limited retention? What is the AI for the customer / us?
There was a problem hiding this comment.
this will add toil to operations, the operations is already at peak so we can't choose to add toil.
why toilsome?
- user calls SET POLICY ddl
- 1 out of 60,000 async jobs starts failing. impossible to alert on
- many weeks go by
- user raises ticket about unmatched expectation between their policy, and data
- this problem also is obfuscated when the retention policy comes from UI instead of table properties.
- oncall needs to dig into retention app logs and triage to dig out this exception in the spark driver logs.
multiply that^ ticket burden across N tables which have retention policy on non-partitioned column.
Instead this needs to be fixed at the root cause:
Sometimes users use a non-partitioned column for retention policy
Somewhere there is a bug:
a) either explicitly disallow DDL to SET POLICY ... retention on a non-partition column
b) if that^ is already done, how are these tables leaking through? it must be an edge case we're not tolerating somehow e.g. maybe a gap in server validation.
It won't add much toil. This fail model only applies to the HCR-enabled tables, which is the tracking tables. The HDFS team is enabling hcr in a batch, and they will monitor the effects. This fail model is the safety catch so that they won't archive the wrong data.
Tracking tables uses CPS to config retention policies, so we can't prevent it from the |
We don't know it is failing since we don't set up alerts on individual tables.
This fail model only apply to HCR-enabled tables, which is the tracking tables. The HDFS team is enabling hcr in a batch, so they will contact us if the job fails and we will tell them the config is wrong. So no impact on LR for other customers. |
teamurko
left a comment
There was a problem hiding this comment.
Overall looks good. The trade-off here is data loss of back-ed up files vs partition retention job failure. This code runs on backup enabled tables. It's a protection against enabling backup on table with misaligned retention policy. Since now users can directly update retention policy on CPS, we can't enforce it in the catalog. This is a trade-off we have to make to enable further storage cost reduction. PTAL @mkuchenbecker @cbb330, discuss any further ideas.
cbb330
left a comment
There was a problem hiding this comment.
discussed with Stas offline,
- we unblock this solution knowing the cons, and mitigating with future fixes
- we will decide whether to block merging this PR until retention policy service implements a write-path fix, or else we merge this and deploy it
- we will ask policy service to write through to OpenHouse table properties, so that we can block retention policies on columns that aren't partitioned
- this aligns with our strategy that OH implements the chokepoint for datalake, and policy service can be a reflection of OH as the SOT. allows OH set technical constraints for downstreams/upstreams in the API in one place.
Reverts the 17 commits that landed on main after v0.5.417, bringing the tree back to exactly the v0.5.417 state. Squashed into a single revert commit for reviewability and to allow reinstating everything as one unit (revert this commit to bring all 17 changes back). Reverted commits (v0.5.417..main, newest first): - Revert #579 (HTS fields in table list api) (#610) - feat(optimizer): [3/N] Analyzer (#533) - [DataLoader] Handle Cast(Literal, TIMESTAMP/DATE/TIME) in scan optimizer (#569 follow-up) (#583) - Skip metadata.json parse in drop path (#589) - feat(optimizer): [2/N] Optimizer REST Service and Controller (#531) - [BDP-102028] feat(optimizer): [1/N] Optimizer Database (#530) - [RTAS]: Fix bug - remove fs scheme from tableLocation in commit (cont) (#594) - Trigger ELR process (#593) - [BDP-102028] feat(optimizer): [0/N] Optimizer API and internal model (#527) - Fail retention app when the columnPattern mismatch partition spec (#552) - [DataLoader] Drop OpenTelemetry minimum version to 1.38.0 (#590) - [DataLoader] Emit OpenTelemetry metrics for read operations (#582) - Cache iceberg metadata to reduce redundant requests to storage (#509) - bump iceberg 1.2 version to 1.2.0.17 (#587) - Support returning HTS fields in table list api (#579) - [DataLoader] Add unique id property to OpenHouseDataLoader (#580) - [DataLoader] Add OpenTelemetry metrics support (#575) ## Summary <!--- HINT: Replace #nnn with corresponding Issue number, if you are fixing an existing issue --> [Issue](https://github.com/linkedin/openhouse/issues/#nnn)] Briefly discuss the summary of the changes made in this pull request in 2-3 lines. ## Changes - [ ] Client-facing API Changes - [ ] Internal API Changes - [ ] Bug Fixes - [ ] New Features - [ ] Performance Improvements - [ ] Code Style - [ ] Refactoring - [ ] Documentation - [ ] Tests For all the boxes checked, please include additional details of the changes made in this pull request. ## Testing Done <!--- Check any relevant boxes with "x" --> - [ ] Manually Tested on local docker setup. Please include commands ran, and their output. - [ ] Added new tests for the changes made. - [ ] Updated existing tests to reflect the changes made. - [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help. - [ ] Some other form of testing like staging or soak time in production. Please explain. For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request. # Additional Information - [ ] Breaking Changes - [ ] Deprecations - [ ] Large PR broken into smaller PRs, and PR plan linked in the description. For all the boxes checked, include additional details of the changes made in this pull request. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Fail retention app when the columnPattern mismatch partition spec for HCR tables. Sometimes users use a non-partitioned column for retention policy, and this will cause rewrite data files in the retention job. If we enable rename, this will cause data loss. So we decided to fail the operation when the delete is not a metdata-only operation. We detect it by checking if all the planned files have no residual from the the patition data.
Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.