Fail retention app when the columnPattern mismatch partition spec for… by jiang95-dev · Pull Request #552 · linkedin/openhouse

jiang95-dev · 2026-04-17T20:36:19Z

Summary

Fail retention app when the columnPattern mismatch partition spec for HCR tables. Sometimes users use a non-partitioned column for retention policy, and this will cause rewrite data files in the retention job. If we enable rename, this will cause data loss. So we decided to fail the operation when the delete is not a metdata-only operation. We detect it by checking if all the planned files have no residual from the the patition data.

Changes

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

Manually Tested on local docker setup. Please include commands ran, and their output.
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

Breaking Changes
Deprecations
Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

… HCR tables

mkuchenbecker

How do we know when this is failing?
When this is failing, what is the impact on the customer's limited retention? What is the AI for the customer / us?

cbb330

this will add toil to operations, the operations is already at peak so we can't choose to add toil.

why toilsome?

user calls SET POLICY ddl
1 out of 60,000 async jobs starts failing. impossible to alert on
many weeks go by
user raises ticket about unmatched expectation between their policy, and data
this problem also is obfuscated when the retention policy comes from UI instead of table properties.
oncall needs to dig into retention app logs and triage to dig out this exception in the spark driver logs.

multiply that^ ticket burden across N tables which have retention policy on non-partitioned column.

Instead this needs to be fixed at the root cause:

Sometimes users use a non-partitioned column for retention policy

Somewhere there is a bug:
a) either explicitly disallow DDL to SET POLICY ... retention on a non-partition column
b) if that^ is already done, how are these tables leaking through? it must be an edge case we're not tolerating somehow e.g. maybe a gap in server validation.

jiang95-dev · 2026-05-15T23:13:55Z

this will add toil to operations, the operations is already at peak so we can't choose to add toil.

It won't add much toil. This fail model only applies to the HCR-enabled tables, which is the tracking tables. The HDFS team is enabling hcr in a batch, and they will monitor the effects. This fail model is the safety catch so that they won't archive the wrong data.

Somewhere there is a bug: a) either explicitly disallow DDL to SET POLICY ... retention on a non-partition column b) if that^ is already done, how are these tables leaking through? it must be an edge case we're not tolerating somehow e.g. maybe a gap in server validation.

Tracking tables uses CPS to config retention policies, so we can't prevent it from the SET POLICY command. But agree that we should add the validations for other tables where currently we are missing. I'll send another PR for this.

jiang95-dev · 2026-05-15T23:31:42Z

How do we know when this is failing?

We don't know it is failing since we don't set up alerts on individual tables.

When this is failing, what is the impact on the customer's limited retention? What is the AI for the customer / us?

This fail model only apply to HCR-enabled tables, which is the tracking tables. The HDFS team is enabling hcr in a batch, so they will contact us if the job fails and we will tell them the config is wrong. So no impact on LR for other customers.

teamurko

Overall looks good. The trade-off here is data loss of back-ed up files vs partition retention job failure. This code runs on backup enabled tables. It's a protection against enabling backup on table with misaligned retention policy. Since now users can directly update retention policy on CPS, we can't enforce it in the catalog. This is a trade-off we have to make to enable further storage cost reduction. PTAL @mkuchenbecker @cbb330, discuss any further ideas.

cbb330

discussed with Stas offline,

we unblock this solution knowing the cons, and mitigating with future fixes
we will decide whether to block merging this PR until retention policy service implements a write-path fix, or else we merge this and deploy it
we will ask policy service to write through to OpenHouse table properties, so that we can block retention policies on columns that aren't partitioned
this aligns with our strategy that OH implements the chokepoint for datalake, and policy service can be a reflection of OH as the SOT. allows OH set technical constraints for downstreams/upstreams in the API in one place.

Reverts the 17 commits that landed on main after v0.5.417, bringing the tree back to exactly the v0.5.417 state. Squashed into a single revert commit for reviewability and to allow reinstating everything as one unit (revert this commit to bring all 17 changes back). Reverted commits (v0.5.417..main, newest first): - Revert #579 (HTS fields in table list api) (#610) - feat(optimizer): [3/N] Analyzer (#533) - [DataLoader] Handle Cast(Literal, TIMESTAMP/DATE/TIME) in scan optimizer (#569 follow-up) (#583) - Skip metadata.json parse in drop path (#589) - feat(optimizer): [2/N] Optimizer REST Service and Controller (#531) - [BDP-102028] feat(optimizer): [1/N] Optimizer Database (#530) - [RTAS]: Fix bug - remove fs scheme from tableLocation in commit (cont) (#594) - Trigger ELR process (#593) - [BDP-102028] feat(optimizer): [0/N] Optimizer API and internal model (#527) - Fail retention app when the columnPattern mismatch partition spec (#552) - [DataLoader] Drop OpenTelemetry minimum version to 1.38.0 (#590) - [DataLoader] Emit OpenTelemetry metrics for read operations (#582) - Cache iceberg metadata to reduce redundant requests to storage (#509) - bump iceberg 1.2 version to 1.2.0.17 (#587) - Support returning HTS fields in table list api (#579) - [DataLoader] Add unique id property to OpenHouseDataLoader (#580) - [DataLoader] Add OpenTelemetry metrics support (#575) ## Summary  [Issue](https://github.com/linkedin/openhouse/issues/#nnn)] Briefly discuss the summary of the changes made in this pull request in 2-3 lines. ## Changes - [ ] Client-facing API Changes - [ ] Internal API Changes - [ ] Bug Fixes - [ ] New Features - [ ] Performance Improvements - [ ] Code Style - [ ] Refactoring - [ ] Documentation - [ ] Tests For all the boxes checked, please include additional details of the changes made in this pull request. ## Testing Done  - [ ] Manually Tested on local docker setup. Please include commands ran, and their output. - [ ] Added new tests for the changes made. - [ ] Updated existing tests to reflect the changes made. - [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help. - [ ] Some other form of testing like staging or soak time in production. Please explain. For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request. # Additional Information - [ ] Breaking Changes - [ ] Deprecations - [ ] Large PR broken into smaller PRs, and PR plan linked in the description. For all the boxes checked, include additional details of the changes made in this pull request. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fail retention app when the columnPattern mismatch partition spec for…

6c78617

… HCR tables

mkuchenbecker reviewed Apr 18, 2026

View reviewed changes

Comment thread apps/spark/src/main/java/com/linkedin/openhouse/jobs/spark/Operations.java Outdated

mkuchenbecker reviewed Apr 18, 2026

View reviewed changes

Comment thread apps/spark/src/main/java/com/linkedin/openhouse/jobs/spark/Operations.java Outdated

mkuchenbecker reviewed Apr 18, 2026

View reviewed changes

cbb330 requested changes Apr 19, 2026

View reviewed changes

jiang95-dev mentioned this pull request May 16, 2026

Disallow setting retention on non-partitioned columns #588

Open

17 tasks

address comments

cf5bdfb

teamurko previously approved these changes May 18, 2026

View reviewed changes

Comment thread apps/spark/src/main/java/com/linkedin/openhouse/jobs/spark/Operations.java Outdated

cbb330 previously approved these changes May 18, 2026

View reviewed changes

address comment

0bf7de2

jiang95-dev dismissed stale reviews from cbb330 and teamurko via 0bf7de2 May 19, 2026 00:33

teamurko approved these changes May 19, 2026

View reviewed changes

jiang95-dev merged commit 6d6ff0d into linkedin:main May 19, 2026
1 check passed

maluchari mentioned this pull request May 29, 2026

Revert all commits after v0.5.417 (restore v0.5.417 tree state) #619

Merged

17 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fail retention app when the columnPattern mismatch partition spec for…#552

Fail retention app when the columnPattern mismatch partition spec for…#552
jiang95-dev merged 3 commits into
linkedin:mainfrom
jiang95-dev:lejiang/hcr-spec-mismatch

jiang95-dev commented Apr 17, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

mkuchenbecker left a comment

Uh oh!

cbb330 left a comment •

edited

Loading

Uh oh!

jiang95-dev commented May 15, 2026

Uh oh!

jiang95-dev commented May 15, 2026

Uh oh!

teamurko left a comment

Uh oh!

Uh oh!

cbb330 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

jiang95-dev commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing Done

Additional Information

Uh oh!

Uh oh!

Uh oh!

mkuchenbecker left a comment

Choose a reason for hiding this comment

Uh oh!

cbb330 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jiang95-dev commented May 15, 2026

Uh oh!

jiang95-dev commented May 15, 2026

Uh oh!

teamurko left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cbb330 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jiang95-dev commented Apr 17, 2026 •

edited

Loading

cbb330 left a comment •

edited

Loading