Call Dispatch Benchmark
This note captures the playground experiment that compares several ways to dispatch a repeatable execution step. The benchmark was originally implemented in playground/srcs/main.cpp, and the essential code is recorded below so the experiment can be reproduced.
Benchmark Setup
- Run count: 100'000'000 iterations per strategy.
- Timer: GetSystemTimePreciseAsFileTime, with the 100 ns FILETIME ticks converted to nanoseconds. Note that despite the helper's _mono suffix, this reads the system wall clock, which can jump if the clock is adjusted.
- Workload: an atomic increment guarded by memory_order_relaxed.
- Environment: Windows build; results were collected for both Debug and Release configurations.
- Timing helper:
inline uint64_t now_ns_mono()
{
FILETIME ft;
GetSystemTimePreciseAsFileTime(&ft);
ULARGE_INTEGER u;
u.LowPart = ft.dwLowDateTime;
u.HighPart = ft.dwHighDateTime;
return (u.QuadPart * 100ULL); // FILETIME ticks are 100 ns each, so scale to nanoseconds
}
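The helper above is Windows-specific and wall-clock based. A portable, genuinely monotonic alternative (not part of the original benchmark) can be built on std::chrono::steady_clock; substituting it for now_ns_mono leaves the rest of the code unchanged:

```cpp
#include <chrono>
#include <cstdint>

// Monotonic timestamp in nanoseconds; steady_clock never goes backwards,
// unlike the wall-clock FILETIME read by the Windows-specific helper.
inline uint64_t now_ns_steady()
{
    const auto since_epoch = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(since_epoch).count());
}
```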
Implementation Variants
ApplicationWithStdFunction - std::function dispatch
Stores a std::function<void()> and invokes it inside the loop. This exercises the type-erased call path introduced by std::function.
class ApplicationWithStdFunction
{
bool _isRunning = false;
std::function<void()> _executionStep = nullptr;
uint64_t _executionTime = 0;
public:
void setExecutionStep(const std::function<void()> &step) { _executionStep = step; }
uint64_t run(uint64_t iterations)
{
if (_executionStep == nullptr)
{
_executionTime = 0;
return 0;
}
_isRunning = true;
const uint64_t t0 = now_ns_mono();
for (uint64_t i = 0; i < iterations; ++i)
{
_executionStep();
}
const uint64_t t1 = now_ns_mono();
_isRunning = false;
_executionTime = (t1 - t0);
return 0;
}
uint64_t executionTime() const { return _executionTime; }
};
In main, the execution step is defined as follows:
std::atomic<uint64_t> nbExecution{0};
appWithStdFunction.setExecutionStep([&nbExecution]() { nbExecution.fetch_add(1, std::memory_order_relaxed); });
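For reference, the type-erased path can be exercised standalone; this sketch (not from the benchmark file, names are illustrative) wraps a capturing lambda in std::function and invokes it in a loop, which is exactly the call shape the benchmark times:

```cpp
#include <atomic>
#include <cstdint>
#include <functional>

// The capturing lambda is type-erased behind std::function; every call in
// the loop goes through the wrapper's indirect invoke, not a direct call.
inline uint64_t countViaStdFunction(uint64_t iterations)
{
    std::atomic<uint64_t> n{0};
    std::function<void()> step = [&n] { n.fetch_add(1, std::memory_order_relaxed); };
    for (uint64_t i = 0; i < iterations; ++i)
        step();
    return n.load(std::memory_order_relaxed);
}
```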
ApplicationWithPredefinedLoop - direct member call
Hard-wires the increment logic inside the class and calls a private helper directly, showing the baseline cost without any form of indirection.
class ApplicationWithPredefinedLoop
{
bool _isRunning = false;
std::atomic<uint64_t> _nbExecution{0};
uint64_t _executionTime = 0;
void incrementExecution() { _nbExecution.fetch_add(1, std::memory_order_relaxed); }
public:
uint64_t run(uint64_t iterations)
{
_isRunning = true;
const uint64_t t0 = now_ns_mono();
for (uint64_t i = 0; i < iterations; ++i)
{
incrementExecution();
}
const uint64_t t1 = now_ns_mono();
_isRunning = false;
_executionTime = (t1 - t0);
return 0;
}
uint64_t executionTime() const { return _executionTime; }
uint64_t executionCount() const { return _nbExecution.load(std::memory_order_relaxed); }
};
ApplicationWithFunctionPointer - raw function pointer
Accepts a plain C-style function pointer and calls it from the hot loop. This isolates the cost of an indirect call without the additional machinery of std::function.
class ApplicationWithFunctionPointer
{
bool _isRunning = false;
void (*_executionStep)() = nullptr;
uint64_t _executionTime = 0;
public:
void setExecutionStep(void (*step)()) { _executionStep = step; }
uint64_t run(uint64_t iterations)
{
if (_executionStep == nullptr)
{
_executionTime = 0;
return 0;
}
_isRunning = true;
const uint64_t t0 = now_ns_mono();
for (uint64_t i = 0; i < iterations; ++i)
{
_executionStep();
}
const uint64_t t1 = now_ns_mono();
_isRunning = false;
_executionTime = (t1 - t0);
return 0;
}
uint64_t executionTime() const { return _executionTime; }
};
Here the execution step must be a plain function (a raw function pointer cannot capture state), so it is defined at file scope and registered in main:
std::atomic<uint64_t> nbPublicExecution{0};
void incrementStep()
{
nbPublicExecution.fetch_add(1, std::memory_order_relaxed);
}
appWithFunctionPointer.setExecutionStep(&incrementStep);
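A captureless lambda would also work in place of the free function, since such lambdas convert implicitly to a plain function pointer. A small standalone sketch of that conversion (names here are illustrative, not from the benchmark):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

// A captureless lambda decays to void(*)(), so it can fill the same
// function-pointer slot as a free function like incrementStep.
using StepFn = void (*)();

inline StepFn makeStep()
{
    return [] { counter.fetch_add(1, std::memory_order_relaxed); };
}
```

A lambda that captures anything loses this conversion, which is precisely why the std::function variant exists.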
ApplicationVirtual / ApplicationConcrete - virtual dispatch
Defines an abstract base with executeStep() and lets ApplicationConcrete implement the actual work, covering the cost of a virtual call and vtable lookup.
class ApplicationVirtual
{
bool _isRunning = false;
uint64_t _executionTime = 0;
virtual void executeStep() = 0;
protected:
std::atomic<uint64_t> _steps{0};
void incrementStep() { _steps.fetch_add(1, std::memory_order_relaxed); }
public:
uint64_t run(uint64_t iterations)
{
_isRunning = true;
const uint64_t t0 = now_ns_mono();
for (uint64_t i = 0; i < iterations; ++i)
{
executeStep();
}
const uint64_t t1 = now_ns_mono();
_isRunning = false;
_executionTime = (t1 - t0);
return 0;
}
uint64_t executionTime() const { return _executionTime; }
uint64_t executionCount() const { return _steps.load(std::memory_order_relaxed); }
};
class ApplicationConcrete : public ApplicationVirtual
{
void executeStep() override { incrementStep(); }
};
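One mitigation worth noting: marking the derived class (or the override) final can allow the optimizer to devirtualize calls when the concrete type is visible at the call site. Whether this actually helps depends on the compiler and how the object is reached; a minimal sketch under that assumption:

```cpp
#include <atomic>
#include <cstdint>

struct StepBase
{
    virtual ~StepBase() = default;
    virtual void executeStep() = 0;
};

// 'final' tells the compiler no further overrides can exist, so calls made
// through a StepFinal reference or pointer may be devirtualized and inlined.
struct StepFinal final : StepBase
{
    std::atomic<uint64_t> steps{0};
    void executeStep() override { steps.fetch_add(1, std::memory_order_relaxed); }
};
```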
Benchmark harness
Each variant is exercised with the same iteration count so their execution times are directly comparable. (The file-scope counter and free function from the function-pointer section are repeated here so the listing compiles on its own.)
std::atomic<uint64_t> nbPublicExecution{0};
void incrementStep()
{
nbPublicExecution.fetch_add(1, std::memory_order_relaxed);
}
int main()
{
constexpr uint64_t iterations = 100'000'000ULL;
ApplicationWithStdFunction appWithStdFunction;
ApplicationWithPredefinedLoop appWithPredefinedLoop;
ApplicationWithFunctionPointer appWithFunctionPointer;
ApplicationConcrete appWithVTableLookup;
std::atomic<uint64_t> nbExecution{0};
appWithStdFunction.setExecutionStep([&nbExecution]() { nbExecution.fetch_add(1, std::memory_order_relaxed); });
appWithFunctionPointer.setExecutionStep(&incrementStep);
appWithStdFunction.run(iterations);
appWithPredefinedLoop.run(iterations);
appWithFunctionPointer.run(iterations);
appWithVTableLookup.run(iterations);
std::cout << "Time for " << iterations << " iterations" << std::endl;
std::cout << "std::function -> Time: " << appWithStdFunction.executionTime() << " ns (" << (static_cast<float>(appWithStdFunction.executionTime()) / static_cast<float>(iterations)) << " ns per execution pass)" << std::endl;
std::cout << "Hard coded loop -> Time: " << appWithPredefinedLoop.executionTime() << " ns (" << (static_cast<float>(appWithPredefinedLoop.executionTime()) / static_cast<float>(iterations)) << " ns per execution pass)" << std::endl;
std::cout << "FunctionPointer -> Time: " << appWithFunctionPointer.executionTime() << " ns (" << (static_cast<float>(appWithFunctionPointer.executionTime()) / static_cast<float>(iterations)) << " ns per execution pass)" << std::endl;
std::cout << "Virtual -> Time: " << appWithVTableLookup.executionTime() << " ns (" << (static_cast<float>(appWithVTableLookup.executionTime()) / static_cast<float>(iterations)) << " ns per execution pass)" << std::endl;
}
Results
| Configuration | Strategy | Total Time (ns) | Per Iteration (ns) |
| --- | --- | --- | --- |
| Debug | std::function | 995807400 | 9.95807 |
| Debug | Hard coded loop | 680789100 | 6.80789 |
| Debug | Function pointer | 687073000 | 6.87073 |
| Debug | Virtual | 708402400 | 7.08402 |
| Release | std::function | 420704600 | 4.20705 |
| Release | Hard coded loop | 415345000 | 4.15345 |
| Release | Function pointer | 416335200 | 4.16335 |
| Release | Virtual | 430545000 | 4.30545 |
Takeaways
- The direct call and raw function pointer paths cluster together as the fastest options in Release builds.
- std::function adds a small but measurable overhead (~1% in Release, about 46% in Debug per the table above) due to type erasure and target management.
- Virtual calls remain competitive but trail the direct approaches by roughly 4% in Release, reflecting the vtable indirection cost.
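An alternative these variants do not measure is compile-time dispatch: passing the callable as a template parameter, so its concrete type is known and the step can be inlined into the loop with no indirection at all. A minimal sketch (the function name is illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Static dispatch: the callable's concrete type is a template parameter,
// so the compiler can inline the step directly into the loop body.
template <typename Step>
uint64_t runSteps(uint64_t iterations, Step step)
{
    for (uint64_t i = 0; i < iterations; ++i)
        step();
    return iterations;
}
```

The trade-off is that the step must be fixed at compile time, which is exactly the constraint the conclusion below accepts.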
Why keep separate classes?
Maintaining a dedicated class for each dispatch style keeps the benchmark logic isolated, so changes to one experiment do not affect the others. Even if some scaffolding code looks duplicated, the explicit classes make it easier to reason about lifetimes, members, and future tweaks (for example, swapping in a different workload or instrumentation). Avoiding inheritance between the variants also prevents accidental sharing of state that could skew timing results and keeps the benchmark faithful to the real-world code paths each strategy would use in production.
Conclusion
When implementing the classes responsible for application management inside Sparkle, we will create two separate classes, ConsoleApplication and GraphicalApplication, each fine-tuned for its use case by writing its loop directly in its own code rather than inheriting from a shared base class, thereby avoiding both the virtual-call and the std::function cost. This will probably rule out adding jobs to the loop dynamically, but we can consider the application's behavior complete and assume it won't need any dynamic insertion.
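As a rough illustration of that direction (the class and member names below are hypothetical, not Sparkle's actual API), the loop body lives directly in the class, so each iteration is an ordinary direct call:

```cpp
#include <cstdint>

// Hypothetical sketch: the per-iteration work is hard-coded in the class,
// so there is no virtual call or std::function indirection in the loop.
class ConsoleApplication
{
    uint64_t _frames = 0;
    bool _running = false;

public:
    uint64_t run(uint64_t frames)
    {
        _running = true;
        for (uint64_t i = 0; i < frames; ++i)
        {
            // Console-specific work would be written inline here.
            ++_frames;
        }
        _running = false;
        return _frames;
    }
};
```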