Call Dispatch Benchmark
This note captures the playground experiment that compares several ways to dispatch a repeatable execution step. The benchmark was originally implemented in playground/srcs/main.cpp, and the essential code is recorded below so the experiment can be reproduced.
Benchmark Setup
- Run count: 100'000'000 iterations per strategy.
- Timer: GetSystemTimePreciseAsFileTime, with the 100 ns FILETIME ticks converted to nanoseconds. Note that despite the helper's _mono suffix, this reads the system wall clock, which can jump if the clock is adjusted.
- Workload: an atomic increment guarded by memory_order_relaxed.
- Environment: Windows build; results were collected for both Debug and Release configurations.
- Timing helper:
inline uint64_t now_ns_mono()
{
FILETIME ft;
GetSystemTimePreciseAsFileTime(&ft);
ULARGE_INTEGER u;
u.LowPart = ft.dwLowDateTime;
u.HighPart = ft.dwHighDateTime;
return (u.QuadPart * 100ULL); // FILETIME ticks are 100 ns each, so scale to nanoseconds
}
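The helper above is Windows-specific and wall-clock based. A portable, genuinely monotonic alternative (not part of the original benchmark) can be built on std::chrono::steady_clock; substituting it for now_ns_mono leaves the rest of the code unchanged:

```cpp
#include <chrono>
#include <cstdint>

// Monotonic timestamp in nanoseconds; steady_clock never goes backwards,
// unlike the wall-clock FILETIME read by the Windows-specific helper.
inline uint64_t now_ns_steady()
{
    const auto since_epoch = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(since_epoch).count());
}
```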
Implementation Variants
ApplicationWithStdFunction - std::function dispatch
Stores a std::function<void()> and invokes it inside the loop. This exercises the type-erased call path introduced by std::function.
class ApplicationWithStdFunction
{
bool _isRunning = false;
std::function<void()> _executionStep = nullptr;
uint64_t _executionTime = 0;
public:
void setExecutionStep(const std::function<void()> &step) { _executionStep = step; }
uint64_t run(uint64_t iterations)
{
if (_executionStep == nullptr)
{
_executionTime = 0;
return 0;
}
_isRunning = true;
const uint64_t t0 = now_ns_mono();
for (uint64_t i = 0; i < iterations; ++i)
{
_executionStep();
}
const uint64_t t1 = now_ns_mono();
_isRunning = false;
_executionTime = (t1 - t0);
return 0;
}
uint64_t executionTime() const { return _executionTime; }
};
In main, the execution step is defined as follows:
std::atomic<uint64_t> nbExecution{0};
appWithStdFunction.setExecutionStep([&nbExecution]() { nbExecution.fetch_add(1, std::memory_order_relaxed); });
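For reference, the type-erased path can be exercised standalone; this sketch (not from the benchmark file, names are illustrative) wraps a capturing lambda in std::function and invokes it in a loop, which is exactly the call shape the benchmark times:

```cpp
#include <atomic>
#include <cstdint>
#include <functional>

// The capturing lambda is type-erased behind std::function; every call in
// the loop goes through the wrapper's indirect invoke, not a direct call.
inline uint64_t countViaStdFunction(uint64_t iterations)
{
    std::atomic<uint64_t> n{0};
    std::function<void()> step = [&n] { n.fetch_add(1, std::memory_order_relaxed); };
    for (uint64_t i = 0; i < iterations; ++i)
        step();
    return n.load(std::memory_order_relaxed);
}
```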
ApplicationWithPredefinedLoop - direct member call
Hard-wires the increment logic inside the class and calls a private helper directly, showing the baseline cost without any form of indirection.
class ApplicationWithPredefinedLoop
{
bool _isRunning = false;
std::atomic<uint64_t> _nbExecution{0};
uint64_t _executionTime = 0;
void incrementExecution() { _nbExecution.fetch_add(1, std::memory_order_relaxed); }
public:
uint64_t run(uint64_t iterations)
{
_isRunning = true;
const uint64_t t0 = now_ns_mono();
for (uint64_t i = 0; i < iterations; ++i)
{
incrementExecution();
}
const uint64_t t1 = now_ns_mono();
_isRunning = false;
_executionTime = (t1 - t0);
return 0;
}
uint64_t executionTime() const { return _executionTime; }
uint64_t executionCount() const { return _nbExecution.load(std::memory_order_relaxed); }
};
ApplicationWithFunctionPointer - raw function pointer
Accepts a plain C-style function pointer and calls it from the hot loop. This isolates the cost of an indirect call without the additional machinery of std::function.
class ApplicationWithFunctionPointer
{
bool _isRunning = false;
void (*_executionStep)() = nullptr;
uint64_t _executionTime = 0;
public:
void setExecutionStep(void (*step)()) { _executionStep = step; }
uint64_t run(uint64_t iterations)
{
if (_executionStep == nullptr)
{
_executionTime = 0;
return 0;
}
_isRunning = true;
const uint64_t t0 = now_ns_mono();
for (uint64_t i = 0; i < iterations; ++i)
{
_executionStep();
}
const uint64_t t1 = now_ns_mono();
_isRunning = false;
_executionTime = (t1 - t0);
return 0;
}
uint64_t executionTime() const { return _executionTime; }
};
Here the execution step must be a plain function (a raw function pointer cannot capture state), so it is defined at file scope and registered in main:
std::atomic<uint64_t> nbPublicExecution{0};
void incrementStep()
{
nbPublicExecution.fetch_add(1, std::memory_order_relaxed);
}
appWithFunctionPointer.setExecutionStep(&incrementStep);
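A captureless lambda would also work in place of the free function, since such lambdas convert implicitly to a plain function pointer. A small standalone sketch of that conversion (names here are illustrative, not from the benchmark):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

// A captureless lambda decays to void(*)(), so it can fill the same
// function-pointer slot as a free function like incrementStep.
using StepFn = void (*)();

inline StepFn makeStep()
{
    return [] { counter.fetch_add(1, std::memory_order_relaxed); };
}
```

A lambda that captures anything loses this conversion, which is precisely why the std::function variant exists.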
ApplicationVirtual / ApplicationConcrete - virtual dispatch
Defines an abstract base with executeStep() and lets ApplicationConcrete implement the actual work, covering the cost of a virtual call and vtable lookup.
class ApplicationVirtual
{
bool _isRunning = false;
uint64_t _executionTime = 0;
virtual void executeStep() = 0;
protected:
std::atomic<uint64_t> _steps{0};
void incrementStep() { _steps.fetch_add(1, std::memory_order_relaxed); }
public:
uint64_t run(uint64_t iterations)
{
_isRunning = true;
const uint64_t t0 = now_ns_mono();
for (uint64_t i = 0; i < iterations; ++i)
{
executeStep();
}
const uint64_t t1 = now_ns_mono();
_isRunning = false;
_executionTime = (t1 - t0);
return 0;
}
uint64_t executionTime() const { return _executionTime; }
uint64_t executionCount() const { return _steps.load(std::memory_order_relaxed); }
};
class ApplicationConcrete : public ApplicationVirtual
{
void executeStep() override { incrementStep(); }
};
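One mitigation worth noting: marking the derived class (or the override) final can allow the optimizer to devirtualize calls when the concrete type is visible at the call site. Whether this actually helps depends on the compiler and how the object is reached; a minimal sketch under that assumption:

```cpp
#include <atomic>
#include <cstdint>

struct StepBase
{
    virtual ~StepBase() = default;
    virtual void executeStep() = 0;
};

// 'final' tells the compiler no further overrides can exist, so calls made
// through a StepFinal reference or pointer may be devirtualized and inlined.
struct StepFinal final : StepBase
{
    std::atomic<uint64_t> steps{0};
    void executeStep() override { steps.fetch_add(1, std::memory_order_relaxed); }
};
```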
Benchmark harness
Each variant is exercised with the same iteration count so their execution times are directly comparable. (The file-scope counter and free function from the function-pointer section are repeated here so the listing compiles on its own.)
std::atomic<uint64_t> nbPublicExecution{0};
void incrementStep()
{
nbPublicExecution.fetch_add(1, std::memory_order_relaxed);
}
int main()
{
constexpr uint64_t iterations = 100'000'000ULL;
ApplicationWithStdFunction appWithStdFunction;
ApplicationWithPredefinedLoop appWithPredefinedLoop;
ApplicationWithFunctionPointer appWithFunctionPointer;
ApplicationConcrete appWithVTableLookup;
std::atomic<uint64_t> nbExecution{0};
appWithStdFunction.setExecutionStep([&nbExecution]() { nbExecution.fetch_add(1, std::memory_order_relaxed); });
appWithFunctionPointer.setExecutionStep(&incrementStep);
appWithStdFunction.run(iterations);
appWithPredefinedLoop.run(iterations);
appWithFunctionPointer.run(iterations);
appWithVTableLookup.run(iterations);
std::cout << "Time for " << iterations << " iterations" << std::endl;
std::cout << "std::function -> Time: " << appWithStdFunction.executionTime() << " ns (" << (static_cast<float>(appWithStdFunction.executionTime()) / static_cast<float>(iterations)) << " ns per execution pass)" << std::endl;
std::cout << "Hard coded loop -> Time: " << appWithPredefinedLoop.executionTime() << " ns (" << (static_cast<float>(appWithPredefinedLoop.executionTime()) / static_cast<float>(iterations)) << " ns per execution pass)" << std::endl;
std::cout << "FunctionPointer -> Time: " << appWithFunctionPointer.executionTime() << " ns (" << (static_cast<float>(appWithFunctionPointer.executionTime()) / static_cast<float>(iterations)) << " ns per execution pass)" << std::endl;
std::cout << "Virtual -> Time: " << appWithVTableLookup.executionTime() << " ns (" << (static_cast<float>(appWithVTableLookup.executionTime()) / static_cast<float>(iterations)) << " ns per execution pass)" << std::endl;
}
Results
| Configuration | Strategy | Total Time (ns) | Per Iteration (ns) |
| --- | --- | --- | --- |
| Debug | std::function | 995807400 | 9.95807 |
| Debug | Hard coded loop | 680789100 | 6.80789 |
| Debug | Function pointer | 687073000 | 6.87073 |
| Debug | Virtual | 708402400 | 7.08402 |
| Release | std::function | 420704600 | 4.20705 |
| Release | Hard coded loop | 415345000 | 4.15345 |
| Release | Function pointer | 416335200 | 4.16335 |
| Release | Virtual | 430545000 | 4.30545 |
Takeaways
- The direct call and raw function pointer paths cluster together as the fastest options in Release builds.
- std::function adds a small but measurable overhead (~1% in Release, about 46% in Debug per the table above) due to type erasure and target management.
- Virtual calls remain competitive but trail the direct approaches by roughly 4% in Release, reflecting the vtable indirection cost.
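An alternative these variants do not measure is compile-time dispatch: passing the callable as a template parameter, so its concrete type is known and the step can be inlined into the loop with no indirection at all. A minimal sketch (the function name is illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Static dispatch: the callable's concrete type is a template parameter,
// so the compiler can inline the step directly into the loop body.
template <typename Step>
uint64_t runSteps(uint64_t iterations, Step step)
{
    for (uint64_t i = 0; i < iterations; ++i)
        step();
    return iterations;
}
```

The trade-off is that the step must be fixed at compile time, which is exactly the constraint the conclusion below accepts.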
Why keep separate classes?
Maintaining a dedicated class for each dispatch style keeps the benchmark logic isolated, so changes to one experiment do not affect the others. Even if some scaffolding code looks duplicated, the explicit classes make it easier to reason about lifetimes, members, and future tweaks (for example, swapping in a different workload or instrumentation). Avoiding inheritance between the variants also prevents accidental sharing of state that could skew timing results and keeps the benchmark faithful to the real-world code paths each strategy would use in production.
Conclusion
When implementing the classes responsible for application management inside Sparkle, we will create two separate classes, ConsoleApplication and GraphicalApplication, each fine-tuned for its use case by writing its loop directly in its own code rather than inheriting from a shared base class, thereby avoiding both the virtual-call and the std::function cost. This will probably rule out adding jobs to the loop dynamically, but we can consider the application's behavior complete and assume it won't need any dynamic insertion.
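As a rough illustration of that direction (the class and member names below are hypothetical, not Sparkle's actual API), the loop body lives directly in the class, so each iteration is an ordinary direct call:

```cpp
#include <cstdint>

// Hypothetical sketch: the per-iteration work is hard-coded in the class,
// so there is no virtual call or std::function indirection in the loop.
class ConsoleApplication
{
    uint64_t _frames = 0;
    bool _running = false;

public:
    uint64_t run(uint64_t frames)
    {
        _running = true;
        for (uint64_t i = 0; i < frames; ++i)
        {
            // Console-specific work would be written inline here.
            ++_frames;
        }
        _running = false;
        return _frames;
    }
};
```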