
[pkg/stanza] [receiver/windowseventlogreceiver]: speed up receiver #43195

Merged
ChrsMark merged 6 commits into open-telemetry:main from MrAnno:eventlog-speedup
Dec 17, 2025

Conversation

@MrAnno
Contributor

@MrAnno MrAnno commented Oct 7, 2025

Description

With the default max_reads: 100 and poll_interval: 1s config fields, the receiver could process at most 100 events per second.

This PR improves the performance of the Windows Event Log receiver by introducing an interruptible readAll() method, which reads all available messages in max_reads-sized batches and falls back to polling only when it reaches the end of the event log channel. The poll timer is now started after each read cycle rather than before it.

Note that the same performance cannot be achieved by simply raising max_reads, because RPC_S_INVALID_BOUND errors cap the maximum configurable batch size, and lowering poll_interval instead would add constant polling overhead.

@MrAnno MrAnno requested review from a team and andrzej-stencel as code owners October 7, 2025 15:59
@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 7, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: pjanotti / name: Paulo Janotti (12e0f6a)

@github-actions github-actions bot added the first-time contributor (PRs made by new contributors) label Oct 7, 2025
@github-actions
Contributor

github-actions bot commented Oct 7, 2025

Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib.

Important reminders:

A maintainer will review your pull request soon. Thank you for helping make OpenTelemetry better!

Contributor

@pjanotti pjanotti left a comment

Thanks for your contribution @MrAnno!

I would describe this change as removing the per-poll upper bound and collecting events until there are none left to consume. This can be useful in various scenarios. That said, I'm a bit worried that there is now no upper bound on how many events the receiver reads. It seems a good defense to have some default upper bound, whether expressed as time or as a maximum number of events; I'm not sure which at this point. Such an upper bound can avoid cases where something generates a flood of events (and is probably already using lots of CPU) while the collector never backs off, causing CPU oversubscription on the box.

Consider the case where the collector is also configured to collect all events: it should back off from time to time, even if that makes it take longer to read everything. That said, this change will make it catch up much faster in such cases.

@MrAnno
Contributor Author

MrAnno commented Oct 8, 2025

@pjanotti Thank you for the quick response.

I've added a new config option called rate_limit so that the user has control over the mentioned upper bound expressed in events/sec.

I think setting the default value to anything other than 0 would be pretty arbitrary, because everything we want to achieve here depends on the given system's resources (for a PC, 1000 is a good default, but on a real collector server, I probably wouldn't go under 20 000).

If my speed-up patch is considered a breaking change due to its possible peak resource consumption, we could come up with a fair default. Otherwise, I would prefer 0, because the CPU usage of otelcontribcol stayed under 40% even when I flooded events with a custom C tool (50,000 events/sec sent into the Event Viewer on 4 ordinary CPU cores).

Contributor

@pjanotti pjanotti left a comment

@MrAnno I'm fine with defaulting to no rate limit, but I think that just counting the overall number of events read in a single call to readAll, and letting readThrottle return false when that limit is reached, achieves the effect we want. Simply having the limit per poll is easier for users to understand and control.

@MrAnno
Contributor Author

MrAnno commented Oct 9, 2025

Simply having the limit per poll is easier for users to understand and control.

My thought process was that users shouldn't even need to know about the poll interval under normal circumstances, because it is an implementation detail and the default value should "just work" in 99% of the use cases.
I'm not sure my previous sentence is true; I may be wrong here. If it is not, then a rate limit option that is independent of any other option still seems easier for me to understand, and it can even serve multiple purposes (not only controlling CPU load).

The intention behind the rate_limit implementation was to make things easier for users, but I certainly complicated things for the code maintainers (I hand-coded a float-precise token bucket throttle).

I just wanted to share this, but I don't want to be a hindrance. Please let me know your preference, and I will change the PR accordingly.

@pjanotti
Contributor

pjanotti commented Oct 9, 2025

Thanks for being flexible here @MrAnno

My thought process was that users shouldn't even know about the poll interval under normal circumstances, because that is an implementation detail and the default value should "just work" in 99% of the use-cases.

Yes, that seems reasonable to me. I'm asking for the simpler throttling for the few cases where the user does get to the point of configuring it. "It will consume at most X events until the next poll interval" is very straightforward to understand. In this context, just counting has the effect of backing off, and "refilling the bucket" automatically occurs at the poll interval.

@MrAnno
Contributor Author

MrAnno commented Oct 10, 2025

@pjanotti Thanks.

I've reimplemented the rate limit option; the new implementation has the following two side effects:

  • the actual measured rate is unstable: with a limit of 1000, for example, it oscillates between 200 and 1000, though fortunately it leans toward the 1000 side;
  • the rate_limit option collides with max_reads. I made rate_limit "stronger", so it overrides the batch size when the per-poll budget is about to run out.

Contributor

@pjanotti pjanotti left a comment

Sorry for the delay @MrAnno. I think it is almost there: a few small things, plus the need for a test, at least for when the limit is not zero.

@pjanotti
Contributor

@MrAnno note that I merged main into this PR branch, so remember to pull before you make any changes. My understanding is that you are looking into adding a basic test for the new feature; it would also be good to have some config tests.

Contributor

@pjanotti pjanotti left a comment

@MrAnno MrAnno force-pushed the eventlog-speedup branch 2 times, most recently from 404fbca to 60c6c9e on October 25, 2025 20:35
@MrAnno
Contributor Author

MrAnno commented Oct 27, 2025

@pjanotti Can we give the CI another try, please? I missed a C-API boundary in tests.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 11, 2025
@MrAnno
Contributor Author

MrAnno commented Nov 25, 2025

Sorry, I forgot about this. I'll try to fix the test soon.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@MrAnno
Contributor Author

MrAnno commented Dec 12, 2025

@pjanotti Sorry for the delay. I think the test is finally ready as well.

@pjanotti
Contributor

@MrAnno we are still having issues on CI; if you are too busy I can help with that so we can get this merged soon.

It was impossible to implement proper mocks for the evt* functions by patching the LazyProc calls directly, as runtime/checkptr.go applies checks that are too strict for that.

Now the evt* functions themselves are patched; these have proper type parameters.
@MrAnno
Contributor Author

MrAnno commented Dec 15, 2025

Sorry again. I'm finally on vacation; I'm trying to fix the last remaining issue now.

@MrAnno
Contributor Author

MrAnno commented Dec 15, 2025

Hm, it seems the goleak error about the unexpected goroutine was not introduced by my PR; I can see the same error, with a slightly different stack trace, on main in my environment:

PS C:\Users\Anno\Desktop\opentelemetry-collector-contrib\pkg\stanza\operator\input\windows> go test -gcflags=-d=checkptr -run ^TestInputStart_RemoteSessionWithDomain$
PASS
goleak: Errors on successful test run: found unexpected goroutines:
[Goroutine 35 in state select, with github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows.(*Input).readOnInterval on top of the stack:
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows.(*Input).readOnInterval(0xc0001da200, {0x7ff736f8af98, 0xc0001922d0})
        C:/Users/Anno/Desktop/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows/input.go:209 +0x117
created by github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows.(*Input).Start in goroutine 34
        C:/Users/Anno/Desktop/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows/input.go:169 +0x9e8
]
exit status 1
FAIL    github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows     3.237s

@pjanotti Can you take a quick look, please?

@pjanotti
Contributor

@MrAnno I'll take a look at it today

@pjanotti
Contributor

pjanotti commented Dec 17, 2025

@MrAnno the fix for the unrelated failure was merged. I'm updating the branch so we can run the tests with it. If any other changes are needed (likely not), you will have to pull your branch before making them; if it passes, I will approve and add the "ready to merge" label.

Contributor

@pjanotti pjanotti left a comment

Thanks for your diligent work on this @MrAnno!

@pjanotti pjanotti added the ready to merge (Code review completed; ready to merge by maintainers) label Dec 17, 2025
@ChrsMark ChrsMark merged commit c051ed6 into open-telemetry:main Dec 17, 2025
205 checks passed
@github-actions github-actions bot added this to the next release milestone Dec 17, 2025
@otelbot
Contributor

otelbot bot commented Dec 17, 2025

Thank you for your contribution @MrAnno! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help.


Labels

first-time contributor · pkg/stanza · ready to merge · receiver/windowseventlog

5 participants