cmetrics: add cmt_expire for expiring metrics.#246

Open
pwhelan wants to merge 11 commits into master from pwhelan-metrics-expire
Conversation

@pwhelan
Contributor

@pwhelan pwhelan commented Nov 5, 2025

Add two new functions: cmt_map_metrics_expire, which expires metrics in a cmt_map, and cmt_expire, which uses it to expire all the metrics inside a cmetrics context.

These functions are generally useful for removing metrics whose labels refer to objects that will disappear, e.g. PIDs.

This will also be used in fluent/fluent-bit#7615.
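
To make the idea concrete, here is a minimal sketch of timestamp-based expiry over a list of labelled samples. It does not use the cmetrics API: `toy_metric`, `toy_metric_add`, `toy_metrics_expire` and `toy_metrics_count` are hypothetical names invented for this illustration, and the sketch assumes that "expire" means unlinking and freeing every sample whose last update is strictly older than a cutoff.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one labelled sample; not a cmetrics type. */
struct toy_metric {
    uint64_t timestamp;          /* last update time */
    int pid_label;               /* label value, e.g. the PID being tracked */
    struct toy_metric *next;
};

/* Prepend a sample; returns the new head (the old head if malloc fails). */
static struct toy_metric *toy_metric_add(struct toy_metric *head,
                                         int pid_label, uint64_t timestamp)
{
    struct toy_metric *m = malloc(sizeof(*m));

    if (m == NULL) {
        return head;
    }
    m->timestamp = timestamp;
    m->pid_label = pid_label;
    m->next = head;
    return m;
}

/* Unlink and free every sample strictly older than `expiration`.
 * Returns the number of samples removed. */
static size_t toy_metrics_expire(struct toy_metric **head, uint64_t expiration)
{
    size_t removed = 0;
    struct toy_metric **link;

    assert(head != NULL);
    link = head;
    while (*link != NULL) {
        struct toy_metric *cur = *link;

        if (cur->timestamp < expiration) {
            *link = cur->next;   /* bypass the node before freeing it */
            free(cur);
            removed++;
        }
        else {
            link = &cur->next;   /* keep the node, advance */
        }
    }
    return removed;
}

/* Count the samples still in the list. */
static size_t toy_metrics_count(const struct toy_metric *head)
{
    size_t n = 0;

    for (; head != NULL; head = head->next) {
        n++;
    }
    return n;
}
```

With samples stamped 10, 20 and 30, calling `toy_metrics_expire(&head, 25)` in this sketch removes the first two and leaves only the sample stamped 30. Walking the list through a pointer to the link, rather than a pointer to the node, lets the loop delete the current node without tracking a separate "previous" pointer.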

@pwhelan pwhelan requested a review from edsiper as a code owner November 5, 2025 21:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment thread src/cmetrics.c Outdated
@pwhelan
Contributor Author

pwhelan commented Nov 6, 2025

Results from a long-running test under valgrind:

==2706702== Memcheck, a memory error detector
==2706702== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==2706702== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==2706702== Command: ./bin/fluent-bit -i mem_metrics -o stdout -f 1
==2706702== Parent PID: 2898109
==2706702== 
==2706702== 
==2706702== HEAP SUMMARY:
==2706702==     in use at exit: 0 bytes in 0 blocks
==2706702==   total heap usage: 27,895,617 allocs, 27,895,617 frees, 4,055,355,463,094 bytes allocated
==2706702== 
==2706702== All heap blocks were freed -- no leaks are possible
==2706702== 
==2706702== For lists of detected and suppressed errors, rerun with: -s
==2706702== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

@pwhelan
Contributor Author

pwhelan commented Nov 6, 2025

I ran an instance of fluent-bit and monitored it from another while running several instances of the following script:

while true ; do sleep 1; echo -n . ; done

I then graphed the memory usage over time, which appears to remain consistent:

(image: graph of memory usage over time)

@pwhelan
Contributor Author

pwhelan commented Nov 6, 2025

I also monitored the CPU usage:

❯ ./bin/fluent-bit -i cpu -p pid=2780850 -o stdout -f 1
Fluent Bit v4.2.0
* Copyright (C) 2015-2025 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _             ___   __
|  ___| |                | |   | ___ (_) |           /   | /  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __/ /| | `| |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| |  | |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /\___  |__| |_
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/     |_(_)___/


[2025/11/06 13:19:40.713673173] [ info] [fluent bit] version=4.2.0, commit=a83b287a08, pid=2817490
[2025/11/06 13:19:40.713718778] [ info] [storage] ver=1.5.3, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/11/06 13:19:40.713723247] [ info] [simd    ] disabled
[2025/11/06 13:19:40.713724629] [ info] [cmetrics] version=1.0.5
[2025/11/06 13:19:40.713727324] [ info] [ctraces ] version=0.6.6
[2025/11/06 13:19:40.713752181] [ info] [input:cpu:cpu.0] initializing
[2025/11/06 13:19:40.713753603] [ info] [input:cpu:cpu.0] storage_strategy='memory' (memory only)
[2025/11/06 13:19:40.718439987] [ info] [sp] stream processor started
[2025/11/06 13:19:40.718454424] [ info] [engine] Shutdown Grace Period=5, Shutdown Input Grace Period=2
[2025/11/06 13:19:40.718503666] [ info] [output:stdout:stdout.0] worker #0 started
[0] cpu.0: [[1762445981.089773594, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762445982.089794776, {}], {"cpu_p"=>0.031250, "user_p"=>0.031250, "system_p"=>0.000000}]
[0] cpu.0: [[1762445983.089775271, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762445984.089789067, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762445985.089779689, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762445986.089774128, {}], {"cpu_p"=>0.750000, "user_p"=>0.000000, "system_p"=>0.750000}]
[0] cpu.0: [[1762445987.089758506, {}], {"cpu_p"=>0.062500, "user_p"=>0.031250, "system_p"=>0.031250}]
[0] cpu.0: [[1762445988.089752954, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762445989.089761226, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762445990.089787390, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762445991.089760685, {}], {"cpu_p"=>0.750000, "user_p"=>0.031250, "system_p"=>0.718750}]
[0] cpu.0: [[1762445992.089789383, {}], {"cpu_p"=>0.062500, "user_p"=>0.031250, "system_p"=>0.031250}]
[0] cpu.0: [[1762445993.089766153, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762445994.089746818, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762445995.089785843, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762445996.089791685, {}], {"cpu_p"=>0.781250, "user_p"=>0.000000, "system_p"=>0.781250}]
[0] cpu.0: [[1762445997.089822553, {}], {"cpu_p"=>0.031250, "user_p"=>0.031250, "system_p"=>0.000000}]
[0] cpu.0: [[1762445998.089773421, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762445999.089761116, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762446000.089788144, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762446001.089792790, {}], {"cpu_p"=>0.843750, "user_p"=>0.031250, "system_p"=>0.812500}]
[0] cpu.0: [[1762446002.089789551, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762446003.089806348, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762446004.089793499, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762446005.089772063, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762446006.089785030, {}], {"cpu_p"=>0.781250, "user_p"=>0.031250, "system_p"=>0.750000}]
[0] cpu.0: [[1762446007.089756350, {}], {"cpu_p"=>0.062500, "user_p"=>0.031250, "system_p"=>0.031250}]
[0] cpu.0: [[1762446008.089761842, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762446009.089763876, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762446010.089796197, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762446011.089782130, {}], {"cpu_p"=>0.812500, "user_p"=>0.062500, "system_p"=>0.750000}]
[0] cpu.0: [[1762446012.089754117, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]
[0] cpu.0: [[1762446013.089782929, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000}]

@piwai

piwai commented May 4, 2026

Hey @pwhelan, need any help to rebase/update the branch to help get it merged? As you can see in this fluent-bit issue, I'd be extremely interested in this feature, and ended up implementing the same thing but in the wrong place 😅

@pwhelan
Contributor Author

pwhelan commented May 4, 2026

> Hey @pwhelan, need any help to rebase/update the branch to help get it merged? As you can see in this fluent-bit issue, I'd be extremely interested in this feature, and ended up implementing the same thing but in the wrong place 😅

gimme a bit, let me see what I can do.

** edit ** rebased. hopefully master has been fixed and tests can pass now.

@piwai tests passed.

@cosmo0920 this PR looks ready to be merged. should I add a unit test or two first?

pwhelan added 3 commits May 4, 2026 12:18
…assed expiration timestamp.

Signed-off-by: Phillip Whelan <pwhelan@exis.cl>
Signed-off-by: Phillip Whelan <pwhelan@exis.cl>
Signed-off-by: Phillip Whelan <pwhelan@exis.cl>
@pwhelan pwhelan force-pushed the pwhelan-metrics-expire branch from ae8ca7e to c961ba6 Compare May 4, 2026 16:20
Contributor

@cosmo0920 cosmo0920 left a comment

I found two concerns and added them as comments.

Comment thread src/cmetrics.c
Comment thread tests/expire.c
Signed-off-by: Phillip Whelan <pwhelan@exis.cl>
@pwhelan pwhelan force-pushed the pwhelan-metrics-expire branch 3 times, most recently from 251639b to a8a1fc9 Compare May 16, 2026 19:14
Signed-off-by: Phillip Whelan <pwhelan@exis.cl>
@pwhelan pwhelan force-pushed the pwhelan-metrics-expire branch from a8a1fc9 to 42ae0d0 Compare May 16, 2026 19:15
pwhelan added 3 commits May 16, 2026 15:20
Signed-off-by: Phillip Whelan <pwhelan@exis.cl>
Signed-off-by: Phillip Whelan <pwhelan@exis.cl>
… the other remain.

Signed-off-by: Phillip Whelan <pwhelan@exis.cl>
@pwhelan
Contributor Author

pwhelan commented May 16, 2026

> I found two concerns and added them as comments.

@cosmo0920 I added the NULL parameter check and the test for expiring cmt_untyped.

@edsiper
Member

edsiper commented May 17, 2026

A few things I think we should settle before merging:

  1. cmt_map_metrics_expire() currently removes metrics with timestamp <= expiration. The description says “older than”, so I think this should probably be < expiration. That also avoids deleting metrics updated during the same collection pass when the caller uses that pass timestamp as the cutoff.

  2. cmt_expire() does not walk exp_histograms, so the context-wide expiration is incomplete.

  3. cmt_map_metrics_expire() returns int but always returns 0. It may be clearer to either return the expired count or make it void.

Could you also add tests for the equal-timestamp boundary and exponential histograms?
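
The difference between the two comparisons in point 1 can be sketched in isolation. This is an illustration only, not code from the PR: `expired_strict`, `expired_inclusive` and `count_expired` are hypothetical helpers, and the timestamps are made-up values standing in for one collection pass whose timestamp doubles as the cutoff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Strict rule: only samples older than the cutoff expire. */
static int expired_strict(uint64_t sample_ts, uint64_t cutoff)
{
    return sample_ts < cutoff;
}

/* Inclusive rule: samples stamped exactly at the cutoff expire too. */
static int expired_inclusive(uint64_t sample_ts, uint64_t cutoff)
{
    return sample_ts <= cutoff;
}

/* Count how many of `n` timestamps the given rule would expire.
 * Returning the count (rather than a constant 0) gives the caller
 * something it can check or log. */
static size_t count_expired(const uint64_t *ts, size_t n, uint64_t cutoff,
                            int (*rule)(uint64_t, uint64_t))
{
    size_t i;
    size_t count = 0;

    assert(ts != NULL || n == 0);
    assert(rule != NULL);
    for (i = 0; i < n; i++) {
        if (rule(ts[i], cutoff)) {
            count++;
        }
    }
    return count;
}
```

Take three samples stamped 900, 1000 and 1000, where 1000 is the timestamp of the current pass and also the cutoff. Under the strict rule only the stale sample (900) counts as expired; under the inclusive rule all three do, including the two that were just refreshed.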

pwhelan added 2 commits May 17, 2026 17:09
…p, return void.

Signed-off-by: Phillip Whelan <pwhelan@exis.cl>
Signed-off-by: Phillip Whelan <pwhelan@exis.cl>
@pwhelan pwhelan force-pushed the pwhelan-metrics-expire branch from 8b719fc to 4fc4027 Compare May 17, 2026 21:18
@pwhelan
Contributor Author

pwhelan commented May 17, 2026

> A few things I think we should settle before merging:

@edsiper all done for now.

** edit ** I'll add an off-by-one test today.

…n boundary.

Signed-off-by: Phillip Whelan <pwhelan@exis.cl>
@pwhelan
Contributor Author

pwhelan commented May 22, 2026

I added a test called test_expire_off_by_one that exercises the boundary by expiring three labelled counters with cutoffs of t0-2, t0-1, t0 and t0+1.
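
The shape of such a boundary sweep might look like the sketch below. It is not the actual test_expire_off_by_one: it assumes, purely for illustration, that the three counters are stamped t0-2, t0-1 and t0, that expiry uses the strict "older than" comparison, and that each cutoff is checked against the full set of counters rather than against what a previous cutoff left behind. All names (`T0`, `counter_ts`, `boundary_cases`, `count_older_than`, `run_boundary_cases`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define T0 UINT64_C(1000)   /* arbitrary base timestamp, illustration only */

/* Assumed timestamps of the three labelled counters. */
static const uint64_t counter_ts[3] = { T0 - 2, T0 - 1, T0 };

/* One row of the sweep: a cutoff and how many counters it should expire. */
struct boundary_case {
    uint64_t cutoff;
    size_t expected;
};

static const struct boundary_case boundary_cases[4] = {
    { T0 - 2, 0 },   /* nothing is older than the oldest counter */
    { T0 - 1, 1 },
    { T0,     2 },   /* the counter stamped exactly t0 survives */
    { T0 + 1, 3 },   /* every counter is now older than the cutoff */
};

/* Count the timestamps strictly older than `cutoff`. */
static size_t count_older_than(const uint64_t *ts, size_t n, uint64_t cutoff)
{
    size_t i;
    size_t count = 0;

    assert(ts != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (ts[i] < cutoff) {
            count++;
        }
    }
    return count;
}

/* Run the sweep; returns the number of rows that do not match. */
static size_t run_boundary_cases(void)
{
    size_t i;
    size_t failures = 0;

    for (i = 0; i < 4; i++) {
        size_t got = count_older_than(counter_ts, 3, boundary_cases[i].cutoff);

        if (got != boundary_cases[i].expected) {
            failures++;
        }
    }
    return failures;
}
```

Under these assumptions the row at cutoff t0 is the interesting one: a `<=` comparison would expire all three counters there, while the strict comparison leaves the counter stamped t0 in place.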
