
[pkg/stanza] [receiver/windowseventlogreceiver]: speed up receiver #43195

Merged
ChrsMark merged 6 commits into open-telemetry:main from MrAnno:eventlog-speedup
Dec 17, 2025

Conversation

@MrAnno
Contributor

@MrAnno MrAnno commented Oct 7, 2025

Description

With the default max_reads: 100 and poll_interval: 1s config fields, the receiver could process at most 100 events per second.

This PR improves the performance of the Windows Event Log receiver by introducing an interruptible readAll() method, which reads all available messages in max_reads-sized batches and falls back to polling only when it reaches the end of the event log channel. The poll timer is now started after each read cycle rather than before it.

Note that the same performance cannot be achieved by simply raising max_reads, because RPC_S_INVALID_BOUND errors cap the maximum configurable batch size, and lowering poll_interval instead would add constant polling overhead.

@MrAnno MrAnno requested review from a team and andrzej-stencel as code owners October 7, 2025 15:59
@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 7, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: pjanotti / name: Paulo Janotti (12e0f6a)

@github-actions github-actions bot added the first-time contributor (PRs made by new contributors) label Oct 7, 2025
@github-actions
Contributor

github-actions bot commented Oct 7, 2025

Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib.

Important reminders:

A maintainer will review your pull request soon. Thank you for helping make OpenTelemetry better!

Contributor

@pjanotti pjanotti left a comment

Thanks for your contribution @MrAnno!

I would describe this change as removing the per-poll upper bound and collecting events until there are none left to consume. This can be useful in various scenarios. That said, I'm a bit worried that there is now no upper bound on how many events the receiver reads. It seems a good defense to have some default upper bound, whether expressed as time or as a maximum number of events; I'm not sure which at this point. Such an upper bound can avoid cases where something generates a flood of events (and is probably already using lots of CPU) while the collector never backs off, causing CPU oversubscription on the box.

Consider the case where the collector is also configured to collect all events: it should back off from time to time, even if that makes it take longer to read everything. That said, this change will make it catch up much faster in such cases.

@MrAnno
Contributor Author

MrAnno commented Oct 8, 2025

@pjanotti Thank you for the quick response.

I've added a new config option called rate_limit so that the user has control over the mentioned upper bound expressed in events/sec.

I think setting the default value to anything other than 0 would be pretty arbitrary, because everything we want to achieve here depends on the given system's resources (for a PC, 1000 is a good default, but on a real collector server, I probably wouldn't go under 20 000).

If my speed-up patch is considered a breaking change due to its possible peak resource consumption, we could come up with a fair default. Otherwise, I would prefer 0, because the CPU usage of otelcontribcol stayed under 40% even when I flooded events with a custom C tool (50,000 events/sec sent into the Event Viewer on 4 ordinary CPU cores).

Contributor

@pjanotti pjanotti left a comment

@MrAnno I'm fine with defaulting to no rate limit, but I think that just counting the overall number of events read in a single call to readAll, and letting readThrottle return false when that limit is reached, achieves the effect we want. Simply having the limit per poll is easier for users to understand and control.

@MrAnno
Contributor Author

MrAnno commented Oct 9, 2025

Simply having the limit per poll is easier for users to understand and control.

My thought process was that users shouldn't even need to know about the poll interval under normal circumstances, because it is an implementation detail and the default value should "just work" in 99% of the use cases.
I'm not sure my previous sentence is true; I may be wrong here. If it is not, then a rate limit option that is independent of any other option still seems easier for me to understand, and it can even serve multiple purposes (not only controlling CPU load).

The intention behind the rate_limit implementation was to make things easier for users, but I certainly complicated things for the code maintainers (I hand-coded a float-precise token bucket throttle).

I just wanted to share this, but I don't want to be a hindrance. Please let me know your preference, and I will change the PR accordingly.

@pjanotti
Contributor

pjanotti commented Oct 9, 2025

Thanks for being flexible here @MrAnno

My thought process was that users shouldn't even know about the poll interval under normal circumstances, because that is an implementation detail and the default value should "just work" in 99% of the use-cases.

Yes, that seems reasonable to me. I'm asking for the simpler throttling for the few cases where the user does get to the point of configuring it. "It will consume at most X events until the next poll interval" is very straightforward to understand. In this context, just counting has the effect of backing off, and "refilling the bucket" automatically occurs at the poll interval.

@MrAnno
Contributor Author

MrAnno commented Oct 10, 2025

@pjanotti Thanks.

I've reimplemented the rate limit option; the new implementation has the following two side effects:

  • the actual measured rate is unstable: with a limit of 1000, for example, it oscillates between 200 and 1000, though fortunately it leans toward the 1000 side;
  • the rate_limit option collides with max_reads. I made rate_limit "stronger", so it overrides the batch size when the per-poll budget is about to run out.

Contributor

@pjanotti pjanotti left a comment

Sorry for the delay @MrAnno. I think it is almost there: a few small things, plus the need for a test, at least for when the limit is not zero.

@pjanotti
Contributor

@MrAnno note that I merged main into this PR branch, so remember to pull before you make any changes. My understanding is that you are looking into adding a basic test for the new feature; it would also be good to have some config tests.

Contributor

@pjanotti pjanotti left a comment

@MrAnno MrAnno force-pushed the eventlog-speedup branch 2 times, most recently from 404fbca to 60c6c9e on October 25, 2025 20:35
@MrAnno
Contributor Author

MrAnno commented Oct 27, 2025

@pjanotti Can we give the CI another try, please? I missed a C-API boundary in tests.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 11, 2025
@MrAnno
Contributor Author

MrAnno commented Nov 25, 2025

Sorry, I forgot about this. I'll try to fix the test soon.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@MrAnno
Contributor Author

MrAnno commented Dec 12, 2025

@pjanotti Sorry for the delay. I think the test is finally ready as well.

@pjanotti
Contributor

@MrAnno we are still having issues on CI; if you are too busy I can help with that so we can get this merged soon.

It was impossible to implement proper mocks for the evt* functions by patching the LazyProc calls directly, as runtime/checkptr.go applies checks that are too strict for that.

Now the evt* functions themselves are patched; these have proper type parameters.
@MrAnno
Contributor Author

MrAnno commented Dec 15, 2025

Sorry again. I'm finally on vacation; I'm trying to fix the last remaining issue now.

@MrAnno
Contributor Author

MrAnno commented Dec 15, 2025

Hm, it seems the goleak error about the unexpected goroutine was not introduced by my PR; I can see the same error, with a slightly different stack trace, on main in my environment:

PS C:\Users\Anno\Desktop\opentelemetry-collector-contrib\pkg\stanza\operator\input\windows> go test -gcflags=-d=checkptr -run ^TestInputStart_RemoteSessionWithDomain$
PASS
goleak: Errors on successful test run: found unexpected goroutines:
[Goroutine 35 in state select, with github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows.(*Input).readOnInterval on top of the stack:
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows.(*Input).readOnInterval(0xc0001da200, {0x7ff736f8af98, 0xc0001922d0})
        C:/Users/Anno/Desktop/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows/input.go:209 +0x117
created by github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows.(*Input).Start in goroutine 34
        C:/Users/Anno/Desktop/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows/input.go:169 +0x9e8
]
exit status 1
FAIL    github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows     3.237s

@pjanotti Can you take a quick look, please?

@pjanotti
Contributor

@MrAnno I'll take a look at it today

@pjanotti
Contributor

pjanotti commented Dec 17, 2025

@MrAnno the fix for the unrelated failure was merged. I'm updating the branch so we can run the tests with it. If any other changes are needed (likely not), you will have to pull your branch before making them; if it passes, I will approve and add the "ready to merge" label.

Contributor

@pjanotti pjanotti left a comment

Thanks for your diligent work on this @MrAnno!

@pjanotti pjanotti added the ready to merge (Code review completed; ready to merge by maintainers) label Dec 17, 2025
@ChrsMark ChrsMark merged commit c051ed6 into open-telemetry:main Dec 17, 2025
205 checks passed
@github-actions github-actions bot added this to the next release milestone Dec 17, 2025
@otelbot
Contributor

otelbot bot commented Dec 17, 2025

Thank you for your contribution @MrAnno! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help.


Labels

first-time contributor · pkg/stanza · ready to merge · receiver/windowseventlog

5 participants