
NervLand: Adding support to perform timestamp queries

In this short article, we will discuss how to add support in NervLand (or any other Dawn-based WebGPU engine) to handle timestamp queries.

The support for timestamp queries in Dawn is disabled by default and protected by the “allow_unsafe_apis” toggle. It took me some time to figure out how to enable that toggle to get the features I wanted, so I thought I should share some code on this here, in case it is useful to someone else. So here is what I'm doing myself:

    // Creating device:
    logDEBUG("Creating wgpu device...");

    _requiredFeatures = {
        WGPUFeatureName_Float32Filterable,
        WGPUFeatureName_TextureCompressionBC,
#if NV_USE_UNSAFE_WGPU_APIS
        WGPUFeatureName_TimestampQuery,
        WGPUFeatureName_TimestampQueryInsidePasses,
#endif
        // WGPUFeatureName_ChromiumExperimentalSubgroups,
        //     (guarded by toggle allow_unsafe_apis)
        // WGPUFeatureName_ChromiumExperimentalReadWriteStorageTexture,
        //     (guarded by toggle allow_unsafe_apis)
    };

    WGPUDeviceDescriptor deviceDesc = {
        .requiredFeatureCount = _requiredFeatures.size(),
        .requiredFeatures = _requiredFeatures.data(),
    };

#if NV_USE_UNSAFE_WGPU_APIS
    logDEBUG("Unsafe WGPU APIs ENABLED.");
    // cf. https://groups.google.com/g/dawn-graphics/c/Y7Xu6a4n_T0

    WGPUDawnTogglesDescriptor deviceTogglesDesc{};
    deviceDesc.nextInChain =
        reinterpret_cast<WGPUChainedStruct*>(&deviceTogglesDesc);
    const char* const enabledToggles[] = {"allow_unsafe_apis"};
    deviceTogglesDesc.chain.next = nullptr;
    deviceTogglesDesc.chain.sType = WGPUSType_DawnTogglesDescriptor;
    deviceTogglesDesc.enabledToggles = enabledToggles;
    deviceTogglesDesc.enabledToggleCount = 1;
    deviceTogglesDesc.disabledToggles = nullptr;
    deviceTogglesDesc.disabledToggleCount = 0;
#else
    logDEBUG("Unsafe WGPU APIs DISABLED.");
#endif

    DawnProcTable backendProcs = dawn::native::GetProcs();
    dawnProcSetProcs(&backendProcs);

    auto* dev = backendAdapter.CreateDevice(&deviceDesc);
    NVCHK(dev != nullptr, "Invalid WGPU device.");

And in my CMake files, I defined the value NV_USE_UNSAFE_WGPU_APIS depending on the build type:

# In DEBUG mode we enable the unsafe WGPU APIs:
if(CMAKE_BUILD_TYPE STREQUAL "Debug")
  add_definitions(-DNV_USE_UNSAFE_WGPU_APIS=1)
else()
  add_definitions(-DNV_USE_UNSAFE_WGPU_APIS=0)
endif()

⇒ And this seems to do the trick since I can now build and run the engine without a validation error.
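
If you want to double-check at runtime that the device really exposes the requested features, a small sanity check with the standard wgpuDeviceHasFeature() call can be added right after the device creation shown above (this is just an optional assertion I'm sketching here, not something required):

#if NV_USE_UNSAFE_WGPU_APIS
    // Optional sanity check (plain webgpu.h API): confirm that the device we
    // just created really exposes the timestamp-query features we asked for.
    NVCHK(wgpuDeviceHasFeature(dev, WGPUFeatureName_TimestampQuery),
          "timestamp-query feature not enabled on device");
    NVCHK(wgpuDeviceHasFeature(dev, WGPUFeatureName_TimestampQueryInsidePasses),
          "timestamp-query-inside-passes feature not enabled on device");
#endif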

Side note: if the toggle is not enabled and we try to require the timestamp features, we get the following validation error:

Error: Feature timestamp-query is guarded by toggle allow_unsafe_apis.
    at ValidateFeatureSupportedWithToggles (D:/Projects/NervProj/build/libraries/dawn-git-20231004/src/dawn/native/PhysicalDevice.cpp:167)
    at CreateDevice (D:/Projects/NervProj/build/libraries/dawn-git-20231004/src/dawn/native/Adapter.cpp:207)
The allow_unsafe_apis toggle is something we should only enable in debug builds, not in final production code ;-) [I think]
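
With the features enabled, creating a timestamp query set itself boils down to a single call to wgpuDeviceCreateQuerySet() with the Timestamp query type. Here is a minimal sketch of what a create_timestamp_query_set() helper could look like (the _device member and the exact signature are assumptions on my engine internals; only the webgpu.h call is standard):

    // Minimal sketch: create a query set with 'capacity' timestamp slots.
    // '_device' is assumed to be the engine's WGPUDevice handle.
    auto WGPUEngine::create_timestamp_query_set(U32 capacity) -> WGPUQuerySet {
        WGPUQuerySetDescriptor desc{};
        desc.type = WGPUQueryType_Timestamp;
        desc.count = capacity;

        auto* qset = wgpuDeviceCreateQuerySet(_device, &desc);
        NVCHK(qset != nullptr, "Invalid timestamp query set.");
        return qset;
    }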

After adding the base functions to create QuerySets and collect timestamps, I wrote the following first version of a GPU reduction unit test:

BOOST_AUTO_TEST_CASE(test_reduction) {

    // Create a compute pass:
    auto* eng = WGPUEngine::instance();

    auto cpass = create_ref_object<WGPUComputePass>();
    BOOST_CHECK(cpass != nullptr);

    // Prepare a ComputeNode:
    String code = R"(WGSL:
@group(0) @binding(0) var<storage,read> inputBuffer: array<u32>;
@group(0) @binding(1) var<storage,read_write> output: atomic<u32>;

@compute @workgroup_size(32)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    // Accumulate in buffer:
    if(id.x<1000000) {
        atomicAdd(&output, inputBuffer[id.x]);
    }
}
)";

    WGPUComputeNode cnode(code);

    U32 num = 1e6;
    GPUBufferProxy input(num, 1, sizeof(U32), BufferUsage::Storage);
    GPUBuffer output(1 * sizeof(U32),
                     BufferUsage::Storage | BufferUsage::CopySrc);
    GPUBuffer staging(num * sizeof(U32),
                      BufferUsage::CopyDst | BufferUsage::MapRead);

    // Fill the input buffer:
    U32* ptr = input.ptr<U32>();
    for (U32 i = 0; i < num; ++i) {
        (*ptr++) = 42;
    }
    input.update();

    // We need to define some minimal bindings here:

    // Create a bind group layout:
    cnode.define_grp(input.as_sto(), output.as_rw_sto());

    // Create a bind group:
    // cnode.bind(0, input.as_sto(), output.as_rw_sto());
    auto grp0 = cnode.create_bind_grp(0, input.as_sto(), output.as_rw_sto());

    // Add a compute step:
    // Our buffer holds 'num' elements and our work groups are of size 32, so
    // we need ceil(num/32) groups on X:
    U32 xsize = (num + 31) / 32;
    cnode.add({xsize}, {grp0});

    // Check that we can get a pipeline from that node:
    BOOST_CHECK(cnode.get_pipeline() != nullptr);

    // Add the compute node to the compute pass:
    cpass->add(cnode);

    // create our query set:
    U32 capacity = 3;
    auto qset = eng->create_timestamp_query_set(capacity);

    // Prepare a buffer to hold the data:
    auto buf = eng->get_read_buffer(8ULL * capacity, BufferUsage::QueryResolve |
                                                         BufferUsage::CopySrc);
    auto queryStagingBuf = eng->get_read_buffer(
        8ULL * capacity, BufferUsage::MapRead | BufferUsage::CopyDst);

    auto& bld = eng->build_commands();
    bld.write_timestamp(qset, 0);
    bld.execute_compute_pass(*cpass);
    bld.write_timestamp(qset, 1);
    bld.copy_buffer_to_buffer(output.buffer(), staging.buffer(), 0, 0,
                              output.get_buffer_size());
    bld.write_timestamp(qset, 2);
    bld.resolve_query_set(qset, 0, capacity, buf.buffer(), 0);
    bld.copy_buffer_to_buffer(buf.buffer(), queryStagingBuf.buffer(), 0, 0,
                              buf.get_buffer_size());
    bld.submit(false);

    // Should have updated the output buffer now:
    const U32* data2 = (U32*)staging.read_sync();

    BOOST_REQUIRE(data2 != nullptr);
    BOOST_CHECK_EQUAL(*data2, 42000000);

    // Read the timestamps:
    const U64* tdata = (U64*)queryStagingBuf.read_sync();
    U64 t0 = tdata[0];
    U64 t1 = tdata[1];

    BOOST_CHECK_LE(t0, t1);
    U64 elapsed = t1 - t0;

    buf.release();
    queryStagingBuf.release();

    logNOTE("Compute shader took {}ns", elapsed);

    // Release the pass:
    cpass.reset();
}

… And this worked like a charm, except that it's still sometimes (often!) reporting that the compute shader took 0ns to execute 😲. That seems “a bit too fast” to be true. So I'm thinking maybe this is due to some caching mechanism on the GPU: since we are not changing anything in the code or the input data, maybe the GPU somehow doesn't have to execute that pipeline again on subsequent runs? (I'm just speculating here…)

Anyway, let's make the input random and larger.

Here is the second version of this unit test, this time using random values in a larger input buffer and a slightly more compact syntax:

const char* reducShader1 = R"(WGSL:
@group(0) @binding(0) var<storage,read> inputBuffer: array<u32>;
@group(0) @binding(1) var<storage,read_write> output: atomic<u32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    // Accumulate in buffer:
    if(id.x<4194304) {
        atomicAdd(&output, inputBuffer[id.x]);
    }
}
)";

BOOST_AUTO_TEST_SUITE(reduction)

BOOST_AUTO_TEST_CASE(test_reduction) {

    // Create a compute pass:
    auto* eng = WGPUEngine::instance();

    auto cpass = create_ref_object<WGPUComputePass>();

    // U32 num = 1e6;
    U32 num = 4194304; // 2^22

    GPUBufferProxy input(num, 1, sizeof(U32), BufferUsage::Storage);
    GPUBuffer output(1 * sizeof(U32),
                     BufferUsage::Storage | BufferUsage::CopySrc);

    // Fill the input buffer:
    U32 total = 0;
    RandGen rnd;

    U32* ptr = input.ptr<U32>();
    for (U32 i = 0; i < num; ++i) {
        U32 val = rnd.uniform_i32(0, 4);
        (*ptr++) = val;
        total += val;
    }
    input.update();

    // We need to define some minimal bindings here:
    cpass->add_simple_compute({.shaderFile = reducShader1,
                               .entries = {input.as_sto(), output.as_rw_sto()},
                               .dims = {(num + 255) / 256}});

    // create our query set:
    U32 capacity = 3;
    auto qset = eng->create_timestamp_query_set(capacity);

    // Prepare a buffer to hold the data:
    auto& buf = eng->acquire_buffer(8ULL * capacity, BufferUsage::QueryResolve |
                                                         BufferUsage::CopySrc);

    auto& bld = eng->build_commands();
    bld.write_timestamp(qset, 0);
    bld.execute_compute_pass(*cpass);
    bld.write_timestamp(qset, 1);
    bld.resolve_query_set(qset, 0, capacity, buf.buffer(), 0);
    bld.submit(false);

    // Should have updated the output buffer now:
    const U32* data2 = (U32*)output.copy_to_staged().read_sync();

    logNOTE("Expected reduction total: {}", total);
    BOOST_REQUIRE(data2 != nullptr);
    BOOST_CHECK_EQUAL(*data2, total);

    // Read the timestamps:
    const U64* tdata = (U64*)buf.copy_to_staged().read_sync();
    U64 t0 = tdata[0];
    U64 t1 = tdata[1];

    BOOST_CHECK_LE(t0, t1);
    U64 elapsed = t1 - t0;

    buf.release();

    logNOTE("Compute shader took {}ns", elapsed);

    // Release the pass:
    cpass.reset();
}

The next thing I think I could improve is the ComputePass creation process: I could embed it completely in the command building process. Done!
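
For the record, here is a rough sketch of the idea behind that embedding (purely illustrative: SimpleComputeDesc and the _passes bookkeeping are hypothetical names, not the actual engine code):

    // Illustrative sketch only: the command builder creates a transient
    // compute pass on the fly and forwards the simple-compute description.
    void CommandBuilder::execute_simple_compute(const SimpleComputeDesc& desc) {
        auto cpass = create_ref_object<WGPUComputePass>();
        cpass->add_simple_compute(desc);
        execute_compute_pass(*cpass);
        // Keep the pass alive until the command buffer gets submitted:
        _passes.push_back(cpass);
    }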

And while I'm at it, I could maybe also encapsulate the timestamp query set as an internal engine resource ? 🤔 Let's see…

So here is the updated unit test code now:

const char* reducShader1 = R"(WGSL:
@group(0) @binding(0) var<storage,read> inputBuffer: array<u32>;
@group(0) @binding(1) var<storage,read_write> output: atomic<u32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    // Accumulate in buffer:
    if(id.x<4194304) {
        atomicAdd(&output, inputBuffer[id.x]);
    }
}
)";

BOOST_AUTO_TEST_SUITE(reduction)

BOOST_AUTO_TEST_CASE(test_reduction) {

    // Create a compute pass:
    auto* eng = WGPUEngine::instance();

    // U32 num = 1e6;
    U32 num = 4194304; // 2^22

    GPUBufferProxy input(num, 1, sizeof(U32), BufferUsage::Storage);
    GPUBuffer output(1 * sizeof(U32),
                     BufferUsage::Storage | BufferUsage::CopySrc);

    // Fill the input buffer:
    U32 total = 0;
    RandGen rnd;

    U32* ptr = input.ptr<U32>();
    for (U32 i = 0; i < num; ++i) {
        U32 val = rnd.uniform_i32(0, 4);
        (*ptr++) = val;
        total += val;
    }
    input.update();

    auto& bld = eng->build_commands();
    bld.write_timestamp(0);
    bld.execute_simple_compute({.shaderFile = reducShader1,
                                .entries = {input.as_sto(), output.as_rw_sto()},
                                .dims = {(num + 255) / 256}});
    bld.write_timestamp(1);
    bld.submit(false);

    // Should have updated the output buffer now:
    const U32* data2 = (U32*)output.copy_to_staged().read_sync();

    logNOTE("Expected reduction total: {}", total);
    BOOST_REQUIRE(data2 != nullptr);
    BOOST_CHECK_EQUAL(*data2, total);

    // Read the elapsed time:
    F64 elapsed = eng->get_timestamp_delta_ns(0, 1);
    logNOTE("Compute shader took {}ns", elapsed);
}

As shown above, we don't have to create the timestamp query set or buffer directly in the unit tests anymore: this is handled automatically in the CommandBuilder/WGPUEngine, and the only things we need to specify are when to write the timestamps (with the write_timestamp() method) and how to retrieve those timestamps, or even better, directly their differences with the get_timestamp_delta_ns() method 😎. And of course this is still working just fine.
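
For reference, once the query set has been resolved and copied into an internal MapRead buffer during submit(), get_timestamp_delta_ns() doesn't need to do anything fancy: it is essentially a read of two 64-bit values, since WebGPU timestamps are expressed in nanoseconds. A minimal sketch (the _timestampStagingBuf member is an assumption on my side):

    // Illustrative sketch: read back the resolved timestamps and return the
    // difference between two entries (WebGPU timestamps are in nanoseconds).
    auto WGPUEngine::get_timestamp_delta_ns(U32 idx0, U32 idx1) -> F64 {
        const auto* tdata = (const U64*)_timestampStagingBuf.read_sync();
        NVCHK(tdata != nullptr, "Cannot read timestamp staging buffer.");
        return (F64)(tdata[idx1] - tdata[idx0]);
    }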

Next, I think I could extend my helper RandGen class a bit to also support generating random vectors in addition to single random values. Let's do that.

Here is the method I added in the RandGen class:

    template <typename T>
    [[nodiscard]] auto uniform_int_vector(U32 count, T min, T max) const
        -> Vector<T> {
        Vector<T> res(count);
        std::uniform_int_distribution<T> dis(min, max);
        std::generate(res.begin(), res.end(),
                      [&dis, this]() { return dis(_gen); });
        return res;
    }

And now, here is the further reduced unit test code:

const char* reducShader1 = R"(WGSL:
@group(0) @binding(0) var<storage,read> inputBuffer: array<u32>;
@group(0) @binding(1) var<storage,read_write> output: atomic<u32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    // Accumulate in buffer:
    if(id.x<4194304) {
        atomicAdd(&output, inputBuffer[id.x]);
    }
}
)";

BOOST_AUTO_TEST_SUITE(reduction)

BOOST_AUTO_TEST_CASE(test_reduction) {

    // Create a compute pass:
    auto* eng = WGPUEngine::instance();

    // U32 num = 1e6;
    U32 num = 4194304; // 2^22

    RandGen rnd;
    auto in_data = rnd.uniform_int_vector<U32>(num, 0, 4);

    GPUBuffer input(num * sizeof(U32), BufferUsage::Storage, in_data.data());
    GPUBuffer output(1 * sizeof(U32),
                     BufferUsage::Storage | BufferUsage::CopySrc);

    // Get the accumulated value:
    U32 total = std::accumulate(in_data.begin(), in_data.end(), 0U);

    auto& bld = eng->build_commands();
    bld.write_timestamp(0);
    bld.execute_simple_compute({.shaderFile = reducShader1,
                                .entries = {input.as_sto(), output.as_rw_sto()},
                                .dims = {(num + 255) / 256}});
    bld.write_timestamp(1);
    bld.submit(false);

    // Should have updated the output buffer now:
    const U32* data2 = (U32*)output.copy_to_staged().read_sync();

    logNOTE("Expected reduction total: {}", total);
    BOOST_REQUIRE(data2 != nullptr);
    BOOST_CHECK_EQUAL(*data2, total);

    // Read the elapsed time:
    F64 elapsed = eng->get_timestamp_delta_ns(0, 1);
    logNOTE("Compute shader took {}ns", elapsed);
}

And finally, since I found the reported execution time to be really small, I added support for executing the shader a few times in a row between the timestamps:

    // Get the accumulated value:
    U32 niters = 10;
    U32 total = std::accumulate(in_data.begin(), in_data.end(), 0U) * niters;

    auto& bld = eng->build_commands();
    bld.write_timestamp(0);
    for (I32 i = 0; i < niters; ++i) {
        bld.execute_simple_compute(
            {.shaderFile = reducShader1,
             .entries = {input.as_sto(), output.as_rw_sto()},
             .dims = {(num + 255) / 256}});
    }
    bld.write_timestamp(1);
    bld.submit(false);

Of course, as done in the code above, we also need to multiply the expected total by the number of iterations for the unit test to pass, since the output atomic keeps accumulating across the iterations.

And this concludes our little journey on the timestamps retrieval process. Now back to my work on the tree designer tool ;-)!

I actually just pushed the support for timestamp retrieval a bit further, introducing support to get the timestamps at the beginning/end of a compute pass. So I updated the reduction unit test code as follows:

    auto& bld = eng->build_commands();
    eng->increase_num_timestamps(niters * 2);
    // bld.write_timestamp(0);
    for (I32 i = 0; i < niters; ++i) {
        bld.execute_simple_compute(
            {.shaderFile = code,
             .entries = {input.as_sto(), output.as_rw_sto()},
             .defs = defs != nullptr ? *defs : StringVector{},
             .dims = {ngrps}},
            2 * i, 2 * i + 1);
    }
    // bld.write_timestamp(1);
    bld.submit(false);

    // Should have updated the output buffer now:
    const U32* data2 = (U32*)output.copy_to_staged().read_sync();

    logNOTE("Expected reduction total: {}", total);
    BOOST_REQUIRE(data2 != nullptr);
    BOOST_CHECK_EQUAL(*data2, total);

    // Read the elapsed time:
    // F64 elapsed = eng->get_timestamp_delta_ns(0, 1);
    F64 elapsed = 0.0;
    for (I32 i = 0; i < niters; ++i) {
        elapsed += eng->get_timestamp_delta_ns(2 * i, 2 * i + 1);
    }

This still seems to work just fine, and the duration values I'm retrieving “feel” a little bit more precise… But I'm really not sure about that to be honest 😁
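
For the curious, these begin/end-of-pass timestamps don't rely on explicit writeTimestamp() calls anymore but on the timestampWrites member of the compute pass descriptor. Here is a rough sketch of the wiring, assuming a Dawn revision where WGPUComputePassTimestampWrites exposes beginningOfPassWriteIndex/endOfPassWriteIndex (older revisions use an array of {querySet, queryIndex, location} entries instead); _timestampQuerySet and _encoder are assumed internal engine members:

    // Rough sketch: request a timestamp at the beginning and end of the pass.
    WGPUComputePassTimestampWrites twrites{};
    twrites.querySet = _timestampQuerySet;
    twrites.beginningOfPassWriteIndex = startIdx; // e.g. 2 * i
    twrites.endOfPassWriteIndex = endIdx;         // e.g. 2 * i + 1

    WGPUComputePassDescriptor passDesc{};
    passDesc.timestampWrites = &twrites;
    auto* pass = wgpuCommandEncoderBeginComputePass(_encoder, &passDesc);

The two indices passed to execute_simple_compute() above simply end up in those write-index fields.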