Call Dispatch Benchmark

This note captures the playground experiment that compares several ways to dispatch a repeatable execution step. The benchmark was originally implemented in playground/srcs/main.cpp, and the essential code is recorded below so the experiment can be reproduced.

Benchmark Setup

  • Run count: 100'000'000 iterations per strategy.
  • Timer: GetSystemTimePreciseAsFileTime converted to nanoseconds.
  • Workload: an atomic increment guarded by memory_order_relaxed.
  • Environment: Windows build; results were collected for both Debug and Release configurations.
  • Time request function
    inline uint64_t now_ns_mono()
    {
        // Requires <windows.h>. Note: despite the name, this is wall-clock
        // time (UTC), not a monotonic clock; FILETIME counts 100 ns ticks,
        // hence the * 100 conversion to nanoseconds.
        FILETIME ft;
        GetSystemTimePreciseAsFileTime(&ft);
        ULARGE_INTEGER u;
        u.LowPart = ft.dwLowDateTime;
        u.HighPart = ft.dwHighDateTime;
        return u.QuadPart * 100ULL;
    }
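Since the Windows call above reports wall-clock time, a run can be skewed if the system clock is adjusted mid-benchmark. A portable alternative, assuming a C++11 toolchain, is std::chrono::steady_clock, which is guaranteed monotonic; the name now_ns_steady below is a hypothetical sketch, not part of the original playground code:

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical portable timer: steady_clock never goes backwards,
// unlike GetSystemTimePreciseAsFileTime, which tracks the wall clock.
inline uint64_t now_ns_steady()
{
    const auto t = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(t).count());
}
```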

Implementation Variants

ApplicationWithStdFunction - std::function dispatch

Stores a std::function<void()> and invokes it inside the loop. This exercises the type-erased call path introduced by std::function.

class ApplicationWithStdFunction
{
    bool _isRunning = false;
    std::function<void()> _executionStep = nullptr;
    uint64_t _executionTime = 0;

public:
    void setExecutionStep(const std::function<void()> &step) { _executionStep = step; }

    uint64_t run(uint64_t iterations)
    {
        if (_executionStep == nullptr)
        {
            _executionTime = 0;
            return 0;
        }
        _isRunning = true;
        const uint64_t t0 = now_ns_mono();
        for (uint64_t i = 0; i < iterations; ++i)
        {
            _executionStep();
        }
        const uint64_t t1 = now_ns_mono();
        _isRunning = false;
        _executionTime = (t1 - t0);
        return 0;
    }

    uint64_t executionTime() const { return _executionTime; }
};

In main, the execution step is defined as follows:

std::atomic<uint64_t> nbExecution{0};
appWithStdFunction.setExecutionStep([&nbExecution]() { nbExecution.fetch_add(1, std::memory_order_relaxed); });

ApplicationWithPredefinedLoop - direct member call

Hard-wires the increment logic inside the class and calls a private helper directly, showing the baseline cost without any form of indirection.

class ApplicationWithPredefinedLoop
{
    bool _isRunning = false;
    std::atomic<uint64_t> _nbExecution{0};
    uint64_t _executionTime = 0;

    void incrementExecution() { _nbExecution.fetch_add(1, std::memory_order_relaxed); }

public:
    uint64_t run(uint64_t iterations)
    {
        _isRunning = true;
        const uint64_t t0 = now_ns_mono();
        for (uint64_t i = 0; i < iterations; ++i)
        {
            incrementExecution();
        }
        const uint64_t t1 = now_ns_mono();
        _isRunning = false;
        _executionTime = (t1 - t0);
        return 0;
    }

    uint64_t executionTime() const { return _executionTime; }
    uint64_t executionCount() const { return _nbExecution.load(std::memory_order_relaxed); }
};

ApplicationWithFunctionPointer - raw function pointer

Accepts a plain C-style function pointer and calls it from the hot loop. This isolates the cost of an indirect call without the additional machinery of std::function.

class ApplicationWithFunctionPointer
{
    bool _isRunning = false;
    void (*_executionStep)() = nullptr;
    uint64_t _executionTime = 0;

public:
    void setExecutionStep(void (*step)()) { _executionStep = step; }

    uint64_t run(uint64_t iterations)
    {
        if (_executionStep == nullptr)
        {
            _executionTime = 0;
            return 0;
        }
        _isRunning = true;
        const uint64_t t0 = now_ns_mono();
        for (uint64_t i = 0; i < iterations; ++i)
        {
            _executionStep();
        }
        const uint64_t t1 = now_ns_mono();
        _isRunning = false;
        _executionTime = (t1 - t0);
        return 0;
    }

    uint64_t executionTime() const { return _executionTime; }
};

In main, the execution step is defined as follows:

std::atomic<uint64_t> nbPublicExecution{0};
void incrementStep()
{
nbPublicExecution.fetch_add(1, std::memory_order_relaxed);
}
appWithFunctionPointer.setExecutionStep(&incrementStep);

ApplicationVirtual / ApplicationConcrete - virtual dispatch

Defines an abstract base with executeStep() and lets ApplicationConcrete implement the actual work, covering the cost of a virtual call and vtable lookup.

class ApplicationVirtual
{
    bool _isRunning = false;
    uint64_t _executionTime = 0;

    virtual void executeStep() = 0;

protected:
    std::atomic<uint64_t> _steps{0};

    void incrementStep() { _steps.fetch_add(1, std::memory_order_relaxed); }

public:
    virtual ~ApplicationVirtual() = default; // polymorphic base needs a virtual destructor

    uint64_t run(uint64_t iterations)
    {
        _isRunning = true;
        const uint64_t t0 = now_ns_mono();
        for (uint64_t i = 0; i < iterations; ++i)
        {
            executeStep();
        }
        const uint64_t t1 = now_ns_mono();
        _isRunning = false;
        _executionTime = (t1 - t0);
        return 0;
    }

    uint64_t executionTime() const { return _executionTime; }
    uint64_t executionCount() const { return _steps.load(std::memory_order_relaxed); }
};

class ApplicationConcrete : public ApplicationVirtual
{
    void executeStep() override { incrementStep(); }
};
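A low-cost tweak not measured in this benchmark: marking the derived class (or the override) `final` tells the compiler no further override can exist, which can let the optimizer devirtualize and inline the call when the concrete type is visible. A minimal self-contained sketch, with names (StepBase, ConcreteStep, runSteps) that are illustrative and not from the playground code:

```cpp
#include <cstdint>

struct StepBase
{
    virtual ~StepBase() = default;
    virtual void executeStep() = 0;
};

// 'final' closes the hierarchy here, so a call made through a ConcreteStep
// reference is a candidate for devirtualization.
struct ConcreteStep final : StepBase
{
    uint64_t steps = 0;
    void executeStep() override { ++steps; }
};

uint64_t runSteps(ConcreteStep &app, uint64_t iterations)
{
    // The static type is the final class, so the compiler may
    // replace the virtual call with a direct (inlinable) one.
    for (uint64_t i = 0; i < iterations; ++i)
        app.executeStep();
    return app.steps;
}
```

Whether devirtualization actually fires depends on the compiler and optimization level, so this would need to be measured the same way as the other variants.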

Benchmark Harness

Each variant is exercised with the same iteration count so their execution times are directly comparable.

#include <atomic>
#include <cstdint>
#include <functional>
#include <iostream>
// plus <windows.h> for now_ns_mono() and the variant classes above

std::atomic<uint64_t> nbPublicExecution{0};

void incrementStep()
{
    nbPublicExecution.fetch_add(1, std::memory_order_relaxed);
}

int main()
{
    constexpr uint64_t iterations = 100'000'000ULL;

    ApplicationWithStdFunction appWithStdFunction;
    ApplicationWithPredefinedLoop appWithPredefinedLoop;
    ApplicationWithFunctionPointer appWithFunctionPointer;
    ApplicationConcrete appWithVTableLookup;

    std::atomic<uint64_t> nbExecution{0};
    appWithStdFunction.setExecutionStep([&nbExecution]() { nbExecution.fetch_add(1, std::memory_order_relaxed); });
    appWithFunctionPointer.setExecutionStep(&incrementStep);

    appWithStdFunction.run(iterations);
    appWithPredefinedLoop.run(iterations);
    appWithFunctionPointer.run(iterations);
    appWithVTableLookup.run(iterations);

    std::cout << "Time for " << iterations << " iterations" << std::endl;
    std::cout << "std::function -> Time: " << appWithStdFunction.executionTime() << " ns (" << (static_cast<float>(appWithStdFunction.executionTime()) / static_cast<float>(iterations)) << " ns per execution pass)" << std::endl;
    std::cout << "Hard coded loop -> Time: " << appWithPredefinedLoop.executionTime() << " ns (" << (static_cast<float>(appWithPredefinedLoop.executionTime()) / static_cast<float>(iterations)) << " ns per execution pass)" << std::endl;
    std::cout << "FunctionPointer -> Time: " << appWithFunctionPointer.executionTime() << " ns (" << (static_cast<float>(appWithFunctionPointer.executionTime()) / static_cast<float>(iterations)) << " ns per execution pass)" << std::endl;
    std::cout << "Virtual -> Time: " << appWithVTableLookup.executionTime() << " ns (" << (static_cast<float>(appWithVTableLookup.executionTime()) / static_cast<float>(iterations)) << " ns per execution pass)" << std::endl;
}

Results

Configuration   Strategy           Total Time (ns)   Per Iteration (ns)
Debug           std::function      995807400         9.95807
Debug           Hard coded loop    680789100         6.80789
Debug           Function pointer   687073000         6.87073
Debug           Virtual            708402400         7.08402
Release         std::function      420704600         4.20705
Release         Hard coded loop    415345000         4.15345
Release         Function pointer   416335200         4.16335
Release         Virtual            430545000         4.30545

Takeaways

  • The direct call and raw function pointer paths cluster together as the fastest options in Release builds.
  • std::function adds a small but measurable overhead (~1% over the hard-coded loop in Release, roughly 46% in Debug) due to type erasure and target management.
  • Virtual calls remain competitive but trail the direct approaches by roughly 4% in Release, reflecting the vtable indirection cost.
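A fifth option, not measured in this benchmark, is to pass the step as a template parameter: the callable's concrete type is known at compile time, so the compiler can inline the call, avoiding both type erasure and any indirect call. A hedged sketch, where ApplicationTemplated is a hypothetical name rather than playground code (requires C++17 for class template argument deduction):

```cpp
#include <cstdint>
#include <utility>

// Hypothetical variant: the callable's type is baked into the class,
// so the call in the hot loop can be inlined with no indirection.
template <typename Step>
class ApplicationTemplated
{
    Step _executionStep;

public:
    explicit ApplicationTemplated(Step step) : _executionStep(std::move(step)) {}

    void run(uint64_t iterations)
    {
        for (uint64_t i = 0; i < iterations; ++i)
            _executionStep();
    }
};
```

The trade-off is that the step must be fixed at construction time and each distinct callable instantiates a new class, which is exactly the kind of static commitment the conclusion below argues for.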

Why keep separate classes?

Maintaining a dedicated class for each dispatch style keeps the benchmark logic isolated, so changes to one experiment do not affect the others. Even if some scaffolding code looks duplicated, the explicit classes make it easier to reason about lifetimes, members, and future tweaks (for example, swapping in a different workload or instrumentation). Avoiding inheritance between the variants also prevents accidental sharing of state that could skew timing results and keeps the benchmark faithful to the real-world code paths each strategy would use in production.

Conclusion

When implementing the classes responsible for application management inside Sparkle, we will create two separate classes, ConsoleApplication and GraphicalApplication, each fine-tuned for its use case by writing its loop directly in its own code instead of inheriting from a shared base class, thereby avoiding both the virtual-call and std::function costs. This will likely rule out injecting jobs into the loop at runtime, but we can consider the application's behavior complete and not in need of dynamic insertion.
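The proposed design could look roughly like the sketch below. The class names come from this note, but the loop bodies (a placeholder tick counter and frame counter) are illustrative assumptions, not Sparkle's actual code:

```cpp
#include <cstdint>

// Illustrative sketch only: each application owns its loop directly,
// so the hot path involves neither virtual dispatch nor std::function.
class ConsoleApplication
{
    uint64_t _ticks = 0;

public:
    void run(uint64_t iterations)
    {
        for (uint64_t i = 0; i < iterations; ++i)
            ++_ticks; // placeholder for the real console update step
    }
    uint64_t ticks() const { return _ticks; }
};

class GraphicalApplication
{
    uint64_t _frames = 0;

public:
    void run(uint64_t iterations)
    {
        for (uint64_t i = 0; i < iterations; ++i)
            ++_frames; // placeholder for the real render step
    }
    uint64_t frames() const { return _frames; }
};
```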