RSP (pronounced rasp) is designed to be an easy and practical scoped profiler for C++. It is not the most feature-rich profiler available, but it is efficient, predictable, and extremely easy to understand, integrate, and extend.
Most profilers make it difficult to extract the specific timing information algorithm developers actually need. They often generate enormous event streams with high collection overhead and require heavyweight GUI tools just to make sense of the results.
Our thesis is that while visualization is valuable, in practice algorithm developers gain far more insight from flexible analytics and aggregations over well-structured raw measurements.
Rather than building yet another profiler GUI, we designed the RSP pipeline around a compact message format that makes it straightforward to build custom, task-specific analysis tools tailored to your workflow. The provided FlatBuffer schema keeps output efficient to serialize, memory-friendly, and easily consumable from C++, Go, Python, Rust, or any other language with FlatBuffers support.
RSP has been used internally in production for quite some time. It originated as part of a larger codebase and has been lightly adapted for public release. The profiler is intentionally opinionated: it focuses on the needs of algorithm developers—people who require precise, targeted measurements—rather than broad, system-wide tracing. It excels at focused, low-overhead profiling, though it performs adequately for more general workloads as well.
The cli directory contains a standalone binary that can be
built with some common analysis functions. See cli/README.md.
- Minimal overhead
- Scoped profiling WITH metadata
- Support for nested scoping
- Support for multithreading
- Serialized output (binary) in FlatBuffers format
- Profiling directives can be left in the code and "compiled out"
- Lightweight (header only), with a single dependency that is not included: FlatBuffers
- Configurable "sinks": currently, streaming to cout or to a file on disk is supported
- Permissively licensed (ISC)
- AMD64 CPU with invariant TSC. We do not support "varying" TSCs; this should not be an issue for most modern chips.
- ARM64 chip with access to the cntvct_el0 and cntfrq_el0 registers
- Recent compiler supporting modern C++ standards (clang 18 or newer is recommended, as is -std=c++23, but -std=c++17 works)
- Linux or macOS (macOS support is limited to Apple Silicon)
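To make the hardware requirements above concrete, here is a minimal sketch of reading the counters they refer to. It is not RSP's implementation: `ReadTicks`, `ReadCounterFrequencyHz`, and `TicksSince` are hypothetical names, and the `std::chrono` fallback exists only so the sketch compiles on other platforms.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__x86_64__)
#include <x86intrin.h>  // __rdtsc (GCC and Clang)
#endif

// Hypothetical helper: read the raw tick counter for the current platform.
inline std::uint64_t ReadTicks() {
#if defined(__x86_64__)
    return __rdtsc();  // time stamp counter
#elif defined(__aarch64__)
    std::uint64_t ticks;
    asm volatile("mrs %0, cntvct_el0" : "=r"(ticks));  // virtual counter register
    return ticks;
#else
    // Fallback for this sketch only; not a hardware counter.
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical helper: counter frequency in Hz, or 0 when the platform does
// not expose it directly (on AMD64 the rate has to be determined another way).
inline std::uint64_t ReadCounterFrequencyHz() {
#if defined(__aarch64__)
    std::uint64_t hz;
    asm volatile("mrs %0, cntfrq_el0" : "=r"(hz));  // counter frequency register
    return hz;
#else
    return 0;
#endif
}

// Ticks elapsed since an earlier ReadTicks() call on the same machine.
inline std::uint64_t TicksSince(std::uint64_t start) {
    const std::uint64_t now = ReadTicks();
    assert(now >= start && "the counter is expected to be monotonic");
    return now - start;
}
```

An invariant TSC matters because the conversion from ticks to time assumes a constant tick rate regardless of CPU frequency scaling.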
We use a Nix flake to bring in our dependencies and
toolchain. However, this isn't strictly needed: if your host system
has the binaries and libraries available, that is fine. To use our
toolchain, set up Nix and then run nix develop; this will ensure
our toolchain is on $PATH and usable.
Our example scripts assume that clang++ and flatc are on $PATH; if
your system has these already, you can just use those.
The only external dependency needed for the profiler is FlatBuffers,
which is brought in by Nix in our environment (specifically, version
25.9.23). Nothing requires that specific version; any recent version
will suffice.
If you use a different version of FlatBuffers, you will need to rebuild
the generated headers to match. Our build_flatbuffers.sh script
takes care of that for both the profiler and the CLI tools.
To use the profiler in your codebase, two steps are required.
- In your sources, add #include <afware/rsp/API.hpp>
- In your build options, or in a header that gets included BEFORE our header, #define RSP_ENABLE
The #define RSP_ENABLE is crucial, as otherwise no profiling data will be collected. Similarly, removing or omitting
this definition is how you "compile out" the profiler. You can leave the profiling directives in place
as long as you keep our headers included: we provide macros that are no-ops when profiling is not present in the binary.
Our library is entirely header-only, so there is nothing to link with.
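The "compile out" behaviour can be illustrated with a small, self-contained sketch. This is not RSP's actual macro definition: `DEMO_PROFILE_ENABLE`, `DEMO_SCOPE`, and `DemoScope` are hypothetical stand-ins that only show the general technique of switching a macro between a real RAII object and a no-op.

```cpp
#include <cassert>
#include <cstdio>

#ifdef DEMO_PROFILE_ENABLE
// Enabled build: the macro declares an RAII object that lives until the end
// of the enclosing block.
struct DemoScope {
    explicit DemoScope(const char* name) : name_(name) { std::printf("enter %s\n", name_); }
    ~DemoScope() { std::printf("leave %s\n", name_); }
    const char* name_;
};
#define DEMO_CONCAT_INNER(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT_INNER(a, b)
// __LINE__ gives each scope object a distinct variable name.
#define DEMO_SCOPE(name) DemoScope DEMO_CONCAT(demo_scope_, __LINE__)(name)
#else
// Disabled build: the directive stays in the source but expands to nothing useful.
#define DEMO_SCOPE(name) static_cast<void>(0)
#endif

// The call site is identical in both builds.
inline int SumToN(int n) {
    assert(n >= 0);
    DEMO_SCOPE("SumToN");
    int total = 0;
    for (int i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}
```

Building the same file with and without `-DDEMO_PROFILE_ENABLE` changes only whether the scope object exists; the function's result is unaffected.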
In your main() somewhere, you should then add something like this:
//
// This will check if profiling is available: i.e. it's compiled in, and we were able to figure out
// that the CPU met the appropriate requirements.
//
if (rsp::Available()) {
    auto sink_ptr = rsp::Profiler::CreateBinaryDiskSink("/path/to/profiling/output/on/disk");
    rsp::Instance().SetSinkToBinaryDisk(sink_ptr);
    //
    // A call to rsp::Start() is necessary to ensure that:
    // - All internal resources for the profiler are allocated and available
    // - The configuration is correct
    // - The aggregation/sinking thread has started
    //
    if (!rsp::Start()) {
        throw std::runtime_error("Could not start profiling!");
    }
}
//
// Do work...
//
// And then at the very end:
//
rsp::Stop();
The rsp::Start() function spins up the I/O thread and allows events to be queued. It returns
true if everything started successfully. If rsp::Available() is true and you don't call rsp::Start(), that's ok: nothing will be queued, because queueing
is gated on the I/O thread being up and running. This is a little wasteful, as scope events are
created but never queued, but not awful.
The rsp::Stop() function doesn't simply prevent collection from occurring; it also
stops the I/O thread (which is not free).
In most cases, you should call rsp::Start() near the beginning of your program, and rsp::Stop() somewhere toward the end. Since neither call is free, think carefully about where you call them.
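One way to keep a start/stop pair balanced on every exit path is a small RAII guard. The sketch below is not part of RSP: `SessionGuard` is a hypothetical helper that takes the start and stop operations as callables, so it makes no assumptions about the library beyond "start returns a bool, stop returns nothing".

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper: runs `start` once on construction and runs `stop` on
// destruction, but only if `start` reported success.
class SessionGuard {
public:
    SessionGuard(const std::function<bool()>& start, std::function<void()> stop)
        : stop_(std::move(stop)), started_(false) {
        assert(start && stop_ && "both callables must be set");
        started_ = start();
    }

    ~SessionGuard() {
        if (started_) {
            stop_();
        }
    }

    // Non-copyable: exactly one stop per successful start.
    SessionGuard(const SessionGuard&) = delete;
    SessionGuard& operator=(const SessionGuard&) = delete;

    bool started() const { return started_; }

private:
    std::function<void()> stop_;
    bool started_;
};
```

With RSP, a guard like this might be constructed in main() after the sink is configured, passing a lambda that returns rsp::Available() && rsp::Start() and a lambda that calls rsp::Stop(). Whether stopping during exception unwinding is appropriate for your program is a judgment call this sketch does not settle.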
Your first profiling operation might look like:
void SomeFunction() {
    RSP_SCOPE("SomeFunction");
    // Do work...
}
That's it. You will see scoping information tagged as SomeFunction. It's recommended you name the scopes as something
fairly easy to parse so you can filter them later.
Let's say you wanted to add some metadata:
void SomeFunction() {
    RSP_SCOPE("SomeFunction");
    const auto some_list_of_items = GetWorkFromSomewhere();
    RSP_SCOPE_METADATA("ItemsInList", some_list_of_items.size());
    DoMoreWork(some_list_of_items);
}
Now, each scope will also include metadata about the number of items in the list.
Nesting is also supported:
void SomeFunction() {
    RSP_SCOPE("SomeFunction");
    const auto some_list_of_items = GetWorkFromSomewhere();
    RSP_SCOPE_METADATA("ItemsInList", some_list_of_items.size());
    DoMoreWork(some_list_of_items);
    const auto more_items = GetWorkFromSomewhere();
    RSP_SCOPE("Secondary");
    RSP_SCOPE_METADATA("ItemsInList", more_items.size());
}
In this case, the first scope (SomeFunction) will contain the timing related to the entire execution of the function, and will have
the ItemsInList metadata from the first RSP_SCOPE_METADATA directive associated with it. The second scope is the "child" scope,
and will contain timing information only for the portion of the execution after it was instantiated. Similarly, the second RSP_SCOPE_METADATA is associated with the Secondary scope.
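The parent/child behaviour described above follows naturally from RAII and a per-thread stack of open scopes. The following is a minimal sketch of that idea, not RSP's implementation: `DemoScope`, `DemoRecord`, and `DemoFinished` are hypothetical, and `std::chrono::steady_clock` stands in for the hardware tick counter.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record for one finished scope.
struct DemoRecord {
    std::string name;
    std::size_t depth = 0;        // 0 for the outermost scope
    std::int64_t elapsed_ns = 0;
    std::vector<std::pair<std::string, std::uint64_t>> metadata;
};

// Finished scopes for the calling thread, in order of completion.
inline std::vector<DemoRecord>& DemoFinished() {
    thread_local std::vector<DemoRecord> finished;
    return finished;
}

class DemoScope {
public:
    explicit DemoScope(std::string name) : start_(std::chrono::steady_clock::now()) {
        record_.name = std::move(name);
        record_.depth = Stack().size();  // number of scopes already open
        Stack().push_back(this);
    }

    ~DemoScope() {
        // Scopes close in reverse order of construction.
        assert(!Stack().empty() && Stack().back() == this);
        Stack().pop_back();
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        record_.elapsed_ns = static_cast<std::int64_t>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count());
        DemoFinished().push_back(std::move(record_));
    }

    DemoScope(const DemoScope&) = delete;
    DemoScope& operator=(const DemoScope&) = delete;

    // Attach a key/value pair to the innermost scope that is currently open.
    static void AttachToInnermost(std::string key, std::uint64_t value) {
        if (!Stack().empty()) {
            Stack().back()->record_.metadata.emplace_back(std::move(key), value);
        }
    }

private:
    static std::vector<DemoScope*>& Stack() {
        thread_local std::vector<DemoScope*> stack;
        return stack;
    }

    std::chrono::steady_clock::time_point start_;
    DemoRecord record_;
};

// Mirrors the shape of the nested example: metadata goes to whichever scope
// was opened most recently.
inline void DemoFunction() {
    DemoScope outer("SomeFunction");
    DemoScope::AttachToInnermost("ItemsInList", 3);
    DemoScope inner("Secondary");
    DemoScope::AttachToInnermost("ItemsInList", 5);
}
```

Because the inner object is destroyed first, the child record is emitted before its parent, and the parent's elapsed time always covers the child's.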
Each scope contains the following elements:
- Start tick (upon construction)
- Stop tick (upon destruction)
- Metadata associated with the scope
Post-aggregation, these tick-deltas can be combined with the detected nominal clock rate and converted into timings.
The metadata is tagged with a key, and the value can be of a number of types: 8, 16, 32 or 64 bit ints (signed and unsigned), float or double.
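As an illustration of the tick-to-time conversion mentioned above, here is a minimal sketch. `TicksToNanoseconds` is a hypothetical helper, not an RSP function, and it assumes the nominal clock rate is available as ticks per second.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert a start/stop tick pair into nanoseconds,
// given the nominal counter rate in ticks per second (Hz).
inline double TicksToNanoseconds(std::uint64_t start_tick,
                                 std::uint64_t stop_tick,
                                 double ticks_per_second) {
    assert(stop_tick >= start_tick);
    assert(ticks_per_second > 0.0);
    const double delta_ticks = static_cast<double>(stop_tick - start_tick);
    return delta_ticks * 1e9 / ticks_per_second;
}
```

For example, a delta of 3000 ticks at a nominal rate of 3 GHz corresponds to 1000 ns. Keeping raw ticks in the output and converting afterwards keeps the division out of the measured code path.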
We also offer a convenience macro to create a scope at function level:
void MyFunction() {
    RSP_FUNCTION_SCOPE;
}
This works identically to a regular scope (in terms of attaching
metadata), and will be named according to
std::source_location::current().function_name() (in this case, the
scope will be named void MyFunction()).
For illustrative examples, we recommend reviewing the following:
- examples/simple.cpp: A simple example that prints its output to stdout, making it easy to see the association between scopes and their metadata
- examples/threaded.cpp: An example showing how data from all threads gets aggregated into a single output
- examples/disk_consumer.cpp and examples/disk_producer.cpp: Examples demonstrating serialization to disk
These examples can be built by running build_examples.sh (assuming you have clang installed).
RSP includes a comprehensive test suite built with Google Test.
If using the Nix toolchain (nix develop), Google Test is provided automatically:
./build_tests.sh
This will:
- Build the main test binary (bin/rsp_tests) with RSP_ENABLE defined
- Build the disabled-API test binary (bin/rsp_tests_disabled) without RSP_ENABLE
- Run both binaries
To run individual tests or filter by name:
./bin/rsp_tests --gtest_filter='Metadata*'
./bin/rsp_tests --gtest_filter='Threading.*'
Tests are organized by component in tests/:
- test_constexpr_string.cpp - ConstexprString construction, truncation, null termination
- test_metadata.cpp - MakeScopeMetadata for all types, boundary values, enums, bools
- test_slots.cpp - MetadataSlot and MetadataSlotStorage: acquire/release, expansion, thread safety
- test_scope_info.cpp - ScopeInfo construction, metadata attachment, streaming
- test_scope_manager.cpp - ScopeManager stack operations, thread locality
- test_machine.cpp - Machine detection, Now() monotonicity, platform-specific checks
- test_serialization.cpp - FlatBuffer serialization roundtrips for all types
- test_sinks.cpp - BinaryDiskSink write/read verification, append mode
- test_profiler.cpp - Singleton, Ready(), sink configuration, slot storage
- test_active_scope.cpp - ActiveScope timing, nesting, RSP_SCOPE/RSP_SCOPE_METADATA/RSP_FUNCTION_SCOPE macros
- test_api_enabled.cpp - API surface with RSP_ENABLE
- test_api_disabled.cpp - API surface without RSP_ENABLE (separate binary)
- test_integration.cpp - End-to-end pipeline: scope creation through serialization to disk and deserialization
- test_threading.cpp - Concurrent scope creation, thread-local isolation, stress testing
If not using Nix, ensure gtest headers and libraries are available on your system, then:
clang++ -std=c++23 -Wall -Wextra -pedantic -Iinclude/ -DRSP_ENABLE \
tests/profiler_test_env.cpp tests/test_*.cpp \
-lgtest -lgtest_main -lpthread -o bin/rsp_tests
# Exclude test_api_disabled.cpp from the above and compile separately:
clang++ -std=c++23 -Wall -Wextra -pedantic -Iinclude/ \
tests/test_api_disabled.cpp \
-lgtest -lgtest_main -lpthread -o bin/rsp_tests_disabled
We do our best to ensure that we aren't allocating in the hot parts of
the code, and that we really only do the bare minimum in the critical
path. In particular, we use a bump-style memory manager that
preallocates a block of memory up-front for storing our
metadata. These memory slots are managed by the Profiler and get
re-used. Should more slots be needed, only then do we
reallocate. Ideally, such reallocations will be rare as we set a
fairly conservative highwater mark: the relevant setting can be found
by grepping for RSP_PROFILER_DEFAULT_STORAGE_SLOTS. It can be
changed at compile time without modifying the code, should that be
needed.
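The general idea of preallocating slots, reusing them, and growing only when the pool is exhausted can be sketched as follows. This is a simplified free-list pool rather than RSP's bump-style manager, it is not thread-safe, and `DemoSlot` and `DemoSlotPool` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical slot payload for this sketch.
struct DemoSlot {
    std::uint64_t value = 0;
};

class DemoSlotPool {
public:
    // All initial slots are allocated up front.
    explicit DemoSlotPool(std::size_t initial_slots) { Grow(initial_slots); }

    // Hand out the index of a free slot. The pool reallocates only when no
    // free slot remains; in that case its capacity doubles.
    std::size_t Acquire() {
        if (free_.empty()) {
            Grow(slots_.empty() ? 1 : slots_.size());
            ++growth_count_;
        }
        const std::size_t index = free_.back();
        free_.pop_back();
        return index;
    }

    // Return a slot for reuse. No allocation happens here, because the free
    // list was reserved to the full capacity when the pool last grew.
    void Release(std::size_t index) {
        assert(index < slots_.size());
        free_.push_back(index);
    }

    // Indices stay valid across growth; references obtained here do not.
    DemoSlot& At(std::size_t index) {
        assert(index < slots_.size());
        return slots_[index];
    }

    std::size_t capacity() const { return slots_.size(); }
    std::size_t growth_count() const { return growth_count_; }

private:
    void Grow(std::size_t extra) {
        const std::size_t old_size = slots_.size();
        slots_.resize(old_size + extra);
        free_.reserve(slots_.size());
        // Push new indices in descending order so the lowest is handed out first.
        for (std::size_t i = old_size + extra; i > old_size; --i) {
            free_.push_back(i - 1);
        }
    }

    std::vector<DemoSlot> slots_;
    std::vector<std::size_t> free_;
    std::size_t growth_count_ = 0;
};
```

Sizing the initial capacity above the expected peak number of simultaneously live slots keeps the growth path, and therefore allocation, out of steady-state operation.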
As a basic measure of performance, we defined a test program that performs a large number of trials of two different algorithms for computing digits of pi.
See: examples/speedtest.cpp
Our test environment was openSUSE Tumbleweed running in WSL2 on a machine with an AMD Ryzen Threadripper PRO 7965WX. We
compiled with clang 18.1.8 using -O3 -march=native -mtune=native.
The build_examples.sh script will produce two versions of the
speedtest binary - one with profiling enabled, the other without. In
each run, we perform 100000 trials of each of the two algorithms and
report the total time taken.
Our results are:
(opensuse) [HYDRA] -> ./bin/speedtest
Profiling enabled.
Profiling started.
Time doing actual work: 0.188837 seconds
(opensuse) [HYDRA] -> ./bin/speedtest_no_profiler
Profiling unavailable.
Time doing actual work: 0.165545 seconds
We declared this to be "good enough". Your mileage may vary; however, we have found that in actual workloads (rather than highly contrived tests) the footprint is barely noticeable.
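To put the two timings above in perspective, the arithmetic can be written out as a short sketch. `SummarizeOverhead` is a hypothetical helper, and the per-trial figure assumes 200000 trials in total (100000 for each of the two algorithms); the number of scopes created per trial is not stated here, so this is not a per-scope cost.

```cpp
#include <cassert>

// Hypothetical summary of a profiled run versus an unprofiled baseline.
struct OverheadSummary {
    double absolute_seconds;       // profiled minus baseline
    double relative_percent;       // absolute overhead as a share of baseline
    double per_trial_nanoseconds;  // absolute overhead spread over all trials
};

inline OverheadSummary SummarizeOverhead(double profiled_seconds,
                                         double baseline_seconds,
                                         double total_trials) {
    assert(baseline_seconds > 0.0);
    assert(total_trials > 0.0);
    const double absolute = profiled_seconds - baseline_seconds;
    return OverheadSummary{absolute,
                           100.0 * absolute / baseline_seconds,
                           absolute * 1e9 / total_trials};
}
```

Calling SummarizeOverhead(0.188837, 0.165545, 200000.0) gives an absolute overhead of about 0.023 seconds, roughly 14% relative to the baseline, or on the order of 116 ns per trial.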
Please email ajf <at> afware <dot> io. Paid commercial support is available.
Copyright (c) 2025, AFWare LLC. All Rights Reserved.
Permission to use, copy, modify, and/or distribute this software
for any purpose with or without fee is hereby granted, provided
that the above copyright notice and this permission notice appear
in all copies.
THE SOFTWARE IS PROVIDED “AS IS” AND ISC DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ISC BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY
DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
OF THIS SOFTWARE.
We depend upon (and include a copy of) the excellent
moodycamel::ConcurrentQueue, which can be found here:
https://github.com/cameron314/concurrentqueue
It is licensed as follows:
Copyright (c) 2013-2016, Cameron Desrochers. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.