out_s3: Provide options for inactive chunks behavior #11816

Open
cosmo0920 wants to merge 3 commits into master from cosmo0920-provide-options-for-inactive-chunks-behavior
Contributor

@cosmo0920 cosmo0920 commented May 18, 2026

Currently, out_s3 simply leaves inactive files on disk as-is and takes no further action (a no-op) on these quarantined files.
This PR adds options to change how these quarantined files are handled.
By default, inactive files are quarantined, and a limit on the total size of quarantined files is also provided.

Closes #11759.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • New Features

    • Added retry_exhausted_action setting to control local buffer handling when uploads permanently fail (delete or quarantine; default: quarantine). Invalid values now fail initialization.
    • Added quarantine storage handling with directory size limits and enforcement; quarantined buffers tracked and managed.
  • Bug Fixes

    • Initialize buffered file sizes from actual chunk sizes.
  • Tests

    • Added integration and runtime tests covering default quarantine and retry-exhaustion behavior.
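
The `retry_exhausted_action` contract summarized above (two accepted values, `quarantine` as the default, invalid values rejected at initialization) can be modeled with a minimal Python sketch. This is not the plugin's C code: the function name `parse_retry_exhausted_action`, the enum, and the case-insensitive matching are assumptions made for illustration only.

```python
from enum import Enum
from typing import Optional


class RetryExhaustedAction(Enum):
    DELETE = "delete"
    QUARANTINE = "quarantine"


def parse_retry_exhausted_action(value: Optional[str]) -> RetryExhaustedAction:
    """Map a config string to an action; an unset value falls back to quarantine."""
    if value is None:
        return RetryExhaustedAction.QUARANTINE
    normalized = value.strip().lower()  # assumption: matching ignores case
    for action in RetryExhaustedAction:
        if action.value == normalized:
            return action
    # Mirrors "invalid values now fail initialization" by refusing to continue.
    raise ValueError(f"invalid retry_exhausted_action: {value!r}")
```

Raising on an unknown value, rather than silently defaulting, keeps a typo in the configuration from changing what happens to failed buffers.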

coderabbitai Bot commented May 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds configurable terminal retry-exhaustion handling to the S3 output: when retries are exhausted a configured action (delete or quarantine) is applied, with quarantine stream, directory-size enforcement, accounting updates, and tests exercising the new behavior.

Changes

S3 Retry-Exhausted Terminal Action

| Layer | File(s) | Summary |
|---|---|---|
| Data structures and configuration contract | plugins/out_s3/s3.h, plugins/out_s3/s3_store.h | Adds S3_RETRY_EXHAUSTED_DELETE and S3_RETRY_EXHAUSTED_QUARANTINE; extends struct flb_s3 with retry_exhausted_action_str, retry_exhausted_action, quarantine_dir_limit_size, quarantine_buffer_size, and stream_quarantine; declares s3_store_file_quarantine. |
| Initialization parsing and config entries | plugins/out_s3/s3.c | Parses retry_exhausted_action in cb_s3_init via s3_get_retry_exhausted_action; invalid values abort init. Adds quarantine_dir_limit_size and retry_exhausted_action to the plugin config map. |
| Core retry-exhausted cleanup routine | plugins/out_s3/s3.c | Implements s3_chunk_retry_exhausted_cleanup and s3_get_retry_exhausted_action; cleanup deletes or quarantines exhausted buffers, enforces quarantine_dir_limit_size, and updates accounting. Minor encoding/header formatting refactors included. |
| Cleanup integration into upload/flush paths | plugins/out_s3/s3.c | Replaces previous inactive/quarantine handling with s3_chunk_retry_exhausted_cleanup across put_all_chunks, upload-queue retry path, cb_s3_upload, and cb_s3_flush. Skips quarantine stream when scanning buffered chunks for upload. |
| Store file size initialization | plugins/out_s3/s3_store.c | set_files_context() now reads the underlying chunk real size and sets s3_file->size when positive. |
| Fstore quarantine stream and file quarantine | plugins/out_s3/s3_store.c | Creates a "quarantine" fstore stream during store init and adds s3_store_file_quarantine() to copy file content/metadata into the quarantine stream and remove the original file. |
| Runtime test harness and test case | tests/runtime/out_s3.c | Adds portable filesystem helpers (count_files_recursive, ensure_test_directory), a runtime test that forces PutObject failures to validate default quarantine behavior, and registers the test. |
| Integration test scenario and pytest validation | tests/integration/scenarios/out_s3/config/out_s3_retry_exhausted_default_quarantine.yaml, tests/integration/scenarios/out_s3/tests/test_out_s3_001.py | Adds an integration config with use_put_object, retry_limit: 1, and store path limits plus a pytest test_out_s3_default_retry_exhausted_action_quarantines_file which starts Fluent Bit, polls for quarantined files, and asserts their presence. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • fluent/fluent-bit#11669: Modifies retry_limit comparison semantics and defaults that determine when this PR's exhaustion cleanup triggers.

Suggested reviewers

  • edsiper
  • TheanLim

Poem

🐰 I dug a stream where failed uploads go,
A cozy quarantine when retries say no.
Files tucked away where the disk keeps score,
The rabbit hums and guards each lost byte's door. 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'out_s3: Provide options for inactive chunks behavior' accurately describes the main change: adding configurable options for how inactive/quarantined files are handled in the S3 plugin, including delete vs quarantine actions and quarantine size limits. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d712df473

Comment thread plugins/out_s3/s3.c
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/out_s3/s3.c`:
- Around line 1987-1992: When s3_store_file_inactive(ctx, chunk_file) fails we
must enforce terminal semantics by falling back to deleting the chunk instead of
returning early; after the flb_plg_error call, invoke the existing chunk-delete
helper (e.g., s3_unlink_chunk(ctx, chunk_file) or the plugin's equivalent delete
function) and log its success/failure, and if no helper exists perform a safe
file unlink of chunk_file and log any error—ensure the function still returns
afterwards so the terminal cleanup path is always applied.

In `@tests/integration/scenarios/out_s3/tests/test_out_s3_001.py`:
- Around line 237-241: Replace the fixed time.sleep(6) with a polling loop that
repeatedly checks for files under store_dir (using the same glob pattern) until
either at least one file is found or a timeout elapses, then call
service.stop(); keep the final assertion that files exist. Concretely, remove
the time.sleep(6) call, implement a short-interval loop that re-evaluates the
glob (the same expression used to build files) up to a configurable timeout
(e.g., several seconds) and breaks when len(files) > 0, and then call
service.stop() and assert len(files) > 0 as before.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 58eccc69-6fde-42e1-856a-5bcc210112c2

📥 Commits

Reviewing files that changed from the base of the PR and between 6474297 and 6d712df.

📒 Files selected for processing (5)
  • plugins/out_s3/s3.c
  • plugins/out_s3/s3.h
  • tests/integration/scenarios/out_s3/config/out_s3_retry_exhausted_default_quarantine.yaml
  • tests/integration/scenarios/out_s3/tests/test_out_s3_001.py
  • tests/runtime/out_s3.c

Comment thread plugins/out_s3/s3.c Outdated
Comment thread tests/integration/scenarios/out_s3/tests/test_out_s3_001.py Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/integration/scenarios/out_s3/tests/test_out_s3_001.py (2)

227-233: 💤 Low value

Consider more robust cleanup with shutil.rmtree.

The current cleanup only removes files and silently swallows all OSError exceptions. Using shutil.rmtree(store_dir, ignore_errors=True) followed by os.makedirs(store_dir, exist_ok=True) would be more concise and handle both files and directories.

♻️ Proposed refactor
+import shutil
+
 def test_out_s3_default_retry_exhausted_action_quarantines_file():
     store_dir = "/tmp/fluent-bit-test-suite-s3-retry-exhausted"
-    if os.path.exists(store_dir):
-        for path in glob.glob(f"{store_dir}/**", recursive=True):
-            if os.path.isfile(path):
-                try:
-                    os.remove(path)
-                except OSError:
-                    pass
+    shutil.rmtree(store_dir, ignore_errors=True)
+    os.makedirs(store_dir, exist_ok=True)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/scenarios/out_s3/tests/test_out_s3_001.py` around lines 227
- 233, Replace the manual file-by-file cleanup loop that checks
os.path.exists(store_dir) and iterates glob.glob(...) with a robust directory
remove and recreate: call shutil.rmtree(store_dir, ignore_errors=True) to remove
the whole tree and then os.makedirs(store_dir, exist_ok=True) to recreate the
empty directory; update imports to include shutil and remove the try/except
OSError block and the glob usage around store_dir.

237-246: 💤 Low value

Polling implementation properly addresses previous review feedback.

The timeout-based polling with 10-second deadline and 0.2-second intervals correctly replaces the previous fixed sleep approach, reducing CI flakiness. Consider adding a descriptive message to the assertion for easier debugging if the test fails.

💡 Optional: Add assertion message
-    assert len(files) > 0
+    assert len(files) > 0, f"Expected quarantined files in {store_dir}, but none were found after {timeout}s"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/scenarios/out_s3/tests/test_out_s3_001.py` around lines 237
- 246, Add a descriptive assertion message to the final check so test failures
are easier to debug: update the assertion "assert len(files) > 0" (in the
polling block that uses timeout, files, store_dir and service.stop()) to include
a message that shows the expected condition and relevant context (e.g.,
store_dir path and sampled files list or count) so test output clearly indicates
why no files were found.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/out_s3/s3.c`:
- Around line 1976-2005: The quarantine branch currently calls
s3_store_file_inactive which makes quarantined chunks recoverable; change this
to create a true terminal quarantine state by either (A) storing the file via a
new s3_store_file_quarantine(file, ctx) helper that moves the file into a
dedicated quarantine directory/name (and does not mark it as "inactive"), or (B)
setting a persistent quarantine flag/metadata on the chunk so recovery code can
skip it; also update the recovery path (put_all_chunks or its scan) to
explicitly ignore files marked as quarantined. Ensure you adjust bookkeeping
(ctx->quarantine_buffer_size, ctx->current_buffer_size) exactly as done now and
replace the s3_store_file_inactive call with the new terminal-quarantine
operation and its error handling, keeping existing calls to s3_store_file_delete
for failures.
- Around line 1987-1991: The current error path calls
s3_store_file_inactive(ctx, chunk_file) which frees the struct s3_file, and then
calls s3_store_file_delete(ctx, chunk_file), causing a use-after-free; to fix
this, avoid dereferencing chunk_file after s3_store_file_inactive by either
caching any fields needed for deletion (e.g., the fstore handle/size or other
identifiers) into local variables before calling s3_store_file_inactive, and
pass those cached values into s3_store_file_delete (or a new helper that deletes
by handle), or move the fallback-delete logic into
s3_store_file_inactive/s3_store routines so deletion happens before/free-safe
after inactivation; update calls around s3_store_file_inactive and
s3_store_file_delete accordingly and ensure no access to chunk_file after it may
have been freed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7e0b9912-ffdc-4309-9be3-1b1da43ad418

📥 Commits

Reviewing files that changed from the base of the PR and between 6d712df and d602373.

📒 Files selected for processing (2)
  • plugins/out_s3/s3.c
  • tests/integration/scenarios/out_s3/tests/test_out_s3_001.py

Comment thread plugins/out_s3/s3.c
Comment thread plugins/out_s3/s3.c Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/scenarios/out_s3/tests/test_out_s3_001.py`:
- Around line 227-233: The cleanup loop that swallows OSError can leave stale
files; replace it by fully removing and recreating the store_dir to hard-reset
state: call shutil.rmtree(store_dir) (guarded by os.path.exists(store_dir)),
then os.makedirs(store_dir, exist_ok=True) to recreate an empty directory, and
remove the try/except that silences os.remove failures so any cleanup error
surfaces and fails the test; update imports to include shutil and keep
references to store_dir, glob.glob, and os.path.exists as needed.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4cd97470-539a-4234-a326-9718d6fcb378

📥 Commits

Reviewing files that changed from the base of the PR and between d602373 and 1311268.

📒 Files selected for processing (3)
  • plugins/out_s3/s3.c
  • plugins/out_s3/s3_store.c
  • tests/integration/scenarios/out_s3/tests/test_out_s3_001.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • plugins/out_s3/s3.c

Comment thread tests/integration/scenarios/out_s3/tests/test_out_s3_001.py Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/out_s3/s3_store.c`:
- Around line 330-338: s3_store_has_data() is counting the newly created
quarantine stream (ctx->stream_quarantine) as regular startup data; update the
detection logic in s3_store_has_data() to skip the quarantine stream — either by
comparing the stream name to "quarantine" (from flb_fstore_stream_create) or by
checking pointer equality to ctx->stream_quarantine — so quarantined files are
not treated as backlog/regular data during startup.
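
The suggested check can be illustrated with a small Python model. This is only a sketch of the "skip the quarantine stream by name" option, not the C code in s3_store.c: the `Stream` class and `store_has_data` function are hypothetical stand-ins for the fstore structures and s3_store_has_data().

```python
from dataclasses import dataclass, field
from typing import List

QUARANTINE_STREAM_NAME = "quarantine"


@dataclass
class Stream:
    name: str
    files: List[str] = field(default_factory=list)


def store_has_data(streams: List[Stream]) -> bool:
    """Return True if any stream other than the quarantine stream holds files."""
    for stream in streams:
        if stream.name == QUARANTINE_STREAM_NAME:
            continue  # quarantined files are terminal, not startup backlog
        if stream.files:
            return True
    return False
```

Comparing by name is the simpler of the two options in the comment. The pointer-equality variant would instead compare each stream against the stored quarantine stream handle.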
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 54f56865-1b17-4174-a437-3249739e22e7

📥 Commits

Reviewing files that changed from the base of the PR and between 1311268 and 20d867a.

📒 Files selected for processing (4)
  • plugins/out_s3/s3.c
  • plugins/out_s3/s3.h
  • plugins/out_s3/s3_store.c
  • plugins/out_s3/s3_store.h
🚧 Files skipped from review as they are similar to previous changes (2)
  • plugins/out_s3/s3.h
  • plugins/out_s3/s3.c

Comment thread plugins/out_s3/s3_store.c
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/scenarios/out_s3/tests/test_out_s3_001.py`:
- Around line 227-231: Replace the fixed /tmp path assigned to store_dir with a
unique temp directory (use
tempfile.mkdtemp(prefix="fluent-bit-test-suite-s3-retry-exhausted-")) and remove
the manual os.makedirs call; then ensure the directory is removed in a finally
block (shutil.rmtree(store_dir)) so cleanup always runs; update the test that
sets store_dir to use tempfile.mkdtemp and wrap the test body in try/finally to
guarantee removal.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 00e75e0d-40e6-4305-8833-a79ac6240035

📥 Commits

Reviewing files that changed from the base of the PR and between 20d867a and 1d0d330.

📒 Files selected for processing (1)
  • tests/integration/scenarios/out_s3/tests/test_out_s3_001.py

Comment thread tests/integration/scenarios/out_s3/tests/test_out_s3_001.py
cosmo0920 added 3 commits May 18, 2026 16:35
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Development

Successfully merging this pull request may close these issues.

S3 output buffer consumed by stale files after multiple failed retries

1 participant