
NervLand DevLog #30: Initial study and design setup for Proland ElevationProducer

Hi! In this article, I'm mostly focusing on the ElevationProducer from Proland: trying to understand how it works and how I should implement it in my own engine (replacing OpenGL with WebGPU in the process). Overall, not an easy task, but I'm making some progress at least.

The structure of the article is actually a bit fuzzy, with ideas going in multiple directions, and unfortunately this work is still not complete, since I eventually decided I should try and move to a compute pipeline implementation to perform the actual tile generation. So we end up preparing support for the ComputePass and ComputeNode classes, and will continue on this journey in our next episode ;-).

Youtube videos (3 parts) for this article available at:

The ElevationProducer will use the noise Texture2D Array, in R16F format: so the first thing I think I should do here is to construct a simple unit test checking that I can write data to an R16F texture and read this data back.

⇒ Introducing support to copy a Texture to a Buffer:

auto WGPUCommandBuilder::copy_texture_to_buffer(wgpu::Texture src,
                                                wgpu::Buffer dst, I32 mip_level)
    -> WGPUCommandBuilder& {

    // Collect the source texture properties:
    U32 width = src.GetWidth();
    U32 height = src.GetHeight();
    U32 nlayers = src.GetDepthOrArrayLayers();
    auto fmt = src.GetFormat();
    U32 nmips = src.GetMipLevelCount();

    auto bsize = dst.GetSize();

    U32 numChans = get_texture_format_num_channels(fmt);
    U32 chanSize = get_texture_format_channel_size(fmt);

    ImageCopyTexture srcDesc{.texture = std::move(src)};
    ImageCopyBuffer dstDesc{
        .layout = {.offset = 0,
                   .bytesPerRow = width * numChans * chanSize,
                   .rowsPerImage = height},
        .buffer = std::move(dst),
    };

    Extent3D csize{width, height, nlayers};

    if (nmips == 1 && mip_level < 0) {
        // Just copy the only available mip_level:
        mip_level = 0;
    }

    if (mip_level == -1) {
        THROW_MSG("No implementation yet to copy all mip levels");
        // // Copy all the mip levels:
        // for (U32 i = 0; i < nmips; ++i) {
        //     srcDesc.mipLevel = i;
        //     dstDesc.mipLevel = i;
        //     csize.width = maximum(width / (1 << i), 1U);
        //     csize.height = maximum(height / (1 << i), 1U);
        //     _encoder.CopyTextureToTexture(&srcDesc, &dstDesc, &csize);
        // }
    } else {
        NVCHK(mip_level < nmips, "Out of range mip level.");

        // Copy a single mip level:
        srcDesc.mipLevel = mip_level;
        csize.width = maximum(width / (1 << mip_level), 1U);
        csize.height = maximum(height / (1 << mip_level), 1U);
        dstDesc.layout.bytesPerRow = csize.width * numChans * chanSize;
        dstDesc.layout.rowsPerImage = csize.height;

        // Check the total size that we will copy:
        U32 tot_size = csize.width * csize.height * numChans * chanSize;
        NVCHK(bsize == tot_size,
              "Mismatch in dest buffer size for texture copy.");

        _encoder.CopyTextureToBuffer(&srcDesc, &dstDesc, &csize);
    }

    return *this;
}

And also introducing support to wait for the WGPU queue to signal that all the submitted work is done:

static void workDoneCallback(WGPUQueueWorkDoneStatus status, void* userdata) {
    logNOTE("In workDoneCallback with status: {}", status);
    auto* bptr = (std::atomic<bool>*)userdata;
    *bptr = true;
}

void WGPUEngine::wait_idle() {
    std::atomic<bool> done(false);
    _queue.OnSubmittedWorkDone(0U, workDoneCallback, &done);
    logNOTE("Waiting for GPU...");
    while (!done.load()) {
        process_events();
        sleep_us(100);
    }
    logNOTE("GPU is idle.");
}

To be honest, in the code above the first argument for OnSubmittedWorkDone is the signalValue, and I currently have no idea what this is exactly, but we will discover that soon enough.

⇒ Update: Looking into the Dawn unit tests code, it seems that the signalValue should be set to 0 for now, and it is an error to use any other value (?).
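With the copy helper and wait_idle() in place, the R16F round-trip test mentioned at the top of this section becomes straightforward. Here is roughly what it could look like (only a sketch assembled from the engine helpers used elsewhere in this article, not the actual unit test code):

    // Sketch of the R16F round-trip test:
    auto* eng = WGPUEngine::instance();

    // 128x128 R16Float texture => bytesPerRow = 128 * 2 = 256 bytes:
    auto tex = eng->create_texture_2d(128, 128, TextureFormat::R16Float,
                                      TextureUsage::CopyDst |
                                          TextureUsage::CopySrc);
    // (writing the reference data into the texture is omitted here)

    // Readback buffer, initialized with zeros (2 bytes per R16F texel):
    Vector<U8> zeros(128 * 128 * 2, 0);
    auto buf = wgpu_create_buffer(zeros.data(), zeros.size(),
                                  wgpu::BufferUsage::CopyDst |
                                      wgpu::BufferUsage::MapRead);

    // Copy the texture into the buffer and wait for the GPU to finish:
    auto& bld = eng->build_commands();
    bld.copy_texture_to_buffer(tex, buf, 0);
    bld.submit(false);
    eng->wait_idle();

    // Finally, map the buffer with MapAsync(MapMode::Read, ...) and compare the
    // values against the reference data (same pattern as in the compute test
    // further below).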

Then also added support to generate the DEM Noise texture 2D Array: no idea yet how this will be used, but we'll figure this out eventually ;-).

Note: Oh crap… just realized that the blockingconcurrentqueue I'm using now by default in the LogManager will not build on EMSCRIPTEN: I get a link error about an undefined sem_timedwait symbol:

cmd.exe /C "cd . && D:\Projects\NervProj\tools\windows\emsdk-git\upstream\emscripten\em++.bat -std=c++20 -fpch-instantiate-templates -DNDEBUG -O3 -s WASM=1 -O3 -s ALLOW_MEMORY_GROWTH=1 -s FETCH=1 "@D:/Projects/NervProj/build/nervland_emcc_release/NervSeed_emcc_preload_file.rsp" sources/apps/NervSeed/CMakeFiles/NervSeed.dir/src/main.cpp.o -o sources\apps\NervSeed\NervSeed.html  sources/nvCore/lite/libnvCore_lite.a  D:/Projects/NervProj/libraries/windows_emcc/fmt-9.1.1/lib/libfmt.a  D:/Projects/NervProj/libraries/windows_emcc/yamlcpp-0.7.0/lib/libyaml-cpp.a  -lpthread  D:/Projects/NervProj/libraries/windows_emcc/SDL2-2.26.5/lib/libSDL2.a  D:/Projects/NervProj/libraries/windows_emcc/yamlcpp-0.7.0/lib/libyaml-cpp.a  D:/Projects/NervProj/libraries/windows_emcc/freetype-git-2.13/lib/libfreetype.a  D:/Projects/NervProj/libraries/windows_emcc/harfbuzz-git-7.3/lib/libharfbuzz.a  D:/Projects/NervProj/libraries/windows_emcc/zlib-1.2.13/lib/libz.a  D:/Projects/NervProj/libraries/windows_emcc/brotli-git-1.0.9/lib/libbrotlidec.a  D:/Projects/NervProj/libraries/windows_emcc/brotli-git-1.0.9/lib/libbrotlicommon.a  D:/Projects/NervProj/libraries/windows_emcc/libpng-1.6.39/lib/libpng16.a  -lpthread && cd ."
wasm-ld: error: sources/nvCore/lite/libnvCore_lite.a(LogManager.cpp.o): undefined symbol: sem_timedwait

⇒ I will have to investigate this some day, but for now let's just disable the logger thread when building for EMSCRIPTEN.
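In practice this can just be an #ifdef around the logger thread startup, something along these lines (a sketch only: the names below are placeholders, not the actual LogManager internals):

// Sketch: only start the logger worker thread on platforms where the
// blockingconcurrentqueue (and thus sem_timedwait) is available.
#ifndef __EMSCRIPTEN__
    // Desktop builds: messages are queued and written from a dedicated thread.
    start_logger_thread();
#else
    // EMSCRIPTEN builds: fall back to writing the messages synchronously.
    _useThread = false;
#endif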

To me it seems the tile creation process would start with a call to ElevationProducer::startCreateTile().

The startCreateTile() method is a protected virtual method, inherited from the base TileProducer class itself, and it is called from the base CreateTile task implementation.

⇒ So we need some implementation for this CreateTile class.

Note: the CreateTile class makes extensive checks on its “owner” member to see if it's a null pointer: I'm going to take a different road here and instead assume that the CreateTile tasks should all be deleted before their parent TileProducer, so this owner pointer should always be valid.

⇒ I have now prepared a base/simplified implementation of the CreateTile class: one thing to note here is that I'm currently bypassing the begin()/end() methods from this class: as far as I understand, these are only needed to set up a “framebuffer” before doing some rendering, and I don't think this will apply directly anymore with a WebGPU implementation.

Next, let's continue with the reimplementation of start_create_tile in the ElevationProducer class. But actually the first thing this will require is a call to createTaskGraph which itself will need the implementation of the CreateTileTaskGraph class, so let's add this one now.

Arrggghh… except that the CreateTileTaskGraph class is mainly concerned with deleting/restoring a task graph, thus relying on the doRelease() method from ORK: not very good for us. So yeah… for the moment I don't think it would quite make sense to introduce this implementation as is. But let's keep monitoring this point.

OK, now that we have the support for start_create_tile, the next thing we need is the implementation of ElevationProducer::do_create_tile(), which will be called in CreateTile::run() (and in fact, this is where most of the magic would happen if I'm not mistaken 🤔). Let's see…

Okay, hmmm, this is tricky, but working on the do_create_tile() method, I think I now need to replace the concept of a framebuffer with a render pass.

Side note: for now, I'm going to use some explicit synchronization but eventually it would be good to have some kind of per thread storage using a fixed thread id like this:

#include <thread>
#include <atomic>

// Thread-local storage for the identifier
thread_local int threadId = -1;

// Function to get and assign a unique identifier per thread
int GetThreadId() {
    // Atomic counter to generate unique IDs
    static std::atomic<int> counter(0);

    if (threadId == -1) {
        threadId = counter.fetch_add(1);
    }
    return threadId;
}
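The idea would then be to index a fixed-size array of per-thread resources with that id, something like this (sketch only, PerThreadData being just a placeholder type):

#include <cassert>

// Placeholder for whatever resources each worker thread should own:
struct PerThreadData {
    // e.g. staging buffers, per-thread render/compute resources, etc.
};

// Each worker thread gets its own slot, indexed by the stable id
// returned by GetThreadId():
constexpr int kMaxWorkerThreads = 8;
static PerThreadData g_perThreadData[kMaxWorkerThreads];

auto get_thread_data() -> PerThreadData& {
    int id = GetThreadId();
    assert(id < kMaxWorkerThreads);
    return g_perThreadData[id];
}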

Still in the do_create_tile method, the next thing we will need is the implementation for the CPUTileStorage and also the capability to copy CPU data into a given GPU slot:

    if (_residualTiles != nullptr &&
        _residualTiles->has_tile(level, tx / mod, tz / mod)) {

        residualOSH.set(0.25F / tileWidth, 0.25F / tileWidth, 2.0F / tileWidth,
                        1.0F);

        I32 rx = (tx % mod) * tileSize;
        I32 ry = (tz % mod) * tileSize;

        TileCache::Tile* t =
            _residualTiles->find_tile(level, tx / mod, tz / mod);
        NVCHK(t != nullptr, "Invalid residual tile");

        CPUTileStorage<float>::CPUSlot* cpuTile =
            dynamic_cast<CPUTileStorage<float>::CPUSlot*>(t->getData());
        assert(cpuTile != NULL);
        for (I32 y = 0; y < tileWidth; ++y) {
            for (I32 x = 0; x < tileWidth; ++x) {

                float r =
                    cpuTile->data[(x + rx) + (y + ry) * residualTileWidth];
                assert(isFinite(r));
                residualTile[x + y * tileWidth] = r;
            }
        }

        residualTexture->setSubImage(0, 0, 0, tileWidth, tileWidth, RED, FLOAT,
                                     Buffer::Parameters(),
                                     CPUBuffer(residualTile));
    }

⇒ For the moment we don't need support for the “residualTiles” yet, so we will come back to those extensions afterwards.

Note: Drawing a fullscreen quad using only a single triangle could prove slightly more efficient (cf. https://stackoverflow.com/questions/2588875/whats-the-best-way-to-draw-a-fullscreen-quad-in-opengl-3-2).
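The classic trick is to generate that triangle directly from the vertex index in the vertex shader, roughly like this in WGSL (just the well-known pattern for reference, not something wired into the engine yet):

// Fullscreen "quad" drawn as a single oversized triangle (3 vertices, no
// vertex buffer needed): produces (-1,-1), (3,-1), (-1,3) in clip space.
@vertex
fn vs_main(@builtin(vertex_index) idx: u32) -> @builtin(position) vec4<f32> {
    let x = f32(i32(idx & 1u) * 4 - 1);
    let y = f32(i32(idx >> 1u) * 4 - 1);
    return vec4<f32>(x, y, 0.0, 1.0);
}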

I'm now reaching the end of the do_create_tile() method, and with this we now have to actually do some rendering. And I think I now have a better understanding of how this should be done:

As mentioned above, we will have to replace the framebuffer concept with a dedicated RenderPass. In this render pass we will be able to render to the demTexture which is an RGBA32F texture.

The rendering will be done with the upsample program, so we will need to create a dedicated pipeline for this program.

This pipeline should accept multiple bind groups:

  • group0: [static textures] providing noiseTexture and residualTexture
  • group1: [dynamic data] providing the correct parent coarse texture (2D array) and the uniform buffer.

⇒ For the binding group 1 we should prepare as many bind groups as there are texture 2D arrays available in the GPUTileStorage corresponding to this producer.

In fact we should probably separate the uniform buffer into its own binding group (because that buffer will be allocated per thread), so we will have:

  • group0: [static textures] providing noiseTexture and residualTexture
  • group1: [parent coarse texture] providing the correct parent coarse texture (2D array)
  • group2: Uniform buffer

Another point to keep in mind here is that we are limited to only 4 binding groups. So here we are OK, but in other cases we might have to merge group0 and group1 together to provide all the textures in a single binding group.

And thinking about it: I'm now wondering how the rendering will work if we have multiple threads writing data to the same texture, then rendering to the same texture, then copying that texture to different layers in a texture 2D array. ⇒ maybe it's a better idea to use different texture targets upfront 🤔?

Even if we use multiple threads we can certainly share a single WGPU pipeline to execute the upsample program. So we will use a single RenderNode to build that:

WGPURenderNode rnode(0, "proland/upsample");

Next, on this RenderNode we must first define the layout we want for the binding groups. And on this point I want to introduce an update already: I want to have a unified DefEntry struct which I could use to define any kind of binding entry (just like we already have the BindEntry struct to provide any kind of actual binding).

⇒ Let's provide that! And here is the new struct:

struct NVGPU_EXPORT DefEntry {
    DefEntryType type{DEF_ENTRY_UNDEFINED};

    // Common settings:
    wgpu::ShaderStage stage{wgpu::ShaderStage::Vertex};
    I32 location{-1};
    I32 group{-1};

    // Sampler settings:
    wgpu::SamplerBindingType sampler_type{wgpu::SamplerBindingType::Filtering};

    // Texture settings:
    wgpu::TextureSampleType texture_stype = wgpu::TextureSampleType::Float;
    wgpu::TextureViewDimension texture_dim = wgpu::TextureViewDimension::e2D;
    bool multisampled{false};

    // constructors:
    DefEntry() = default;
    DefEntry(const DefEntry& rhs) = default;
    DefEntry(DefEntry&& rhs) = default;
    void operator=(const DefEntry&) = delete;
};

Note: Eventually I should also add the support to define buffers in this DefEntry struct, but for the moment this doesn't seem that critical.
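If/when that becomes necessary, it would probably just mean a few extra fields on the struct, along these lines (purely hypothetical, not implemented yet):

    // Buffer settings (hypothetical, mirroring wgpu::BufferBindingLayout):
    wgpu::BufferBindingType buffer_type{wgpu::BufferBindingType::Uniform};
    bool has_dynamic_offset{false};
    U64 min_binding_size{0};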

While working on the DefEntry implementation I also realized I had to update my DynamicBuffer implementation a bit ⇒ I have now extracted a non-templated base class for this, and we can use a reference on such a base in a generic way in the BufferProvider class:

class NVGPU_EXPORT DynamicBufferBase : public RefObject {
  public:
    /** Default constructor */
    DynamicBufferBase() = default;

    /** Custom buffer constructor */
    DynamicBufferBase(U32 elemCount, U32 alignment, U32 elemSize,
                      wgpu::BufferUsage usage) {
        init(elemCount, alignment, elemSize);
        create_buffer(usage);
    }

    // Get the buffer:
    [[nodiscard]] auto buffer() const -> const wgpu::Buffer& { return _buffer; }

    /** get Aligned size */
    [[nodiscard]] auto get_aligned_size() const -> U32 { return _alignedSize; }

    /** Get the buffer size */
    [[nodiscard]] auto get_buffer_size() const -> U64 { return _data.size(); }

    /** Get the buffer offset */
    [[nodiscard]] auto get_buffer_offset() const -> U64 { return 0; }

    /** Update the buffer */
    void update() { wgpu_write_buffer(_buffer, 0, _data.data(), _data.size()); }

    /** Get the data element offset from the start of the buffer */
    auto get_dyn_offset(U32 idx) const -> U32 { return _alignedSize * idx; }

    auto as_ubo(Bool frag = false) const -> BindEntry {
        BindEntry entry;
        entry.type = BIND_ENTRY_BUFFER;
        entry.buffer_usage =
            wgpu::BufferUsage::CopyDst | wgpu::BufferUsage::Uniform;
        entry.stage =
            frag ? wgpu::ShaderStage::Fragment : wgpu::ShaderStage::Vertex;
        entry.buffer = _buffer;
        entry.dynamic = _isDynamic;
        entry.buffer_size = _isDynamic ? _elemSize : _elemSize * _elemCount;
        entry.buffer_offset = 0;
        return entry;
    };

    auto as_frag_ubo() const -> BindEntry { return as_ubo(true); }

    auto as_sto(Bool readonly = true, Bool frag = false) const -> BindEntry {
        BindEntry entry;
        entry.type = BIND_ENTRY_BUFFER;
        entry.buffer_usage =
            wgpu::BufferUsage::CopyDst | wgpu::BufferUsage::Storage;
        entry.stage =
            frag ? wgpu::ShaderStage::Fragment : wgpu::ShaderStage::Vertex;
        entry.buffer = _buffer;
        entry.dynamic = _isDynamic;
        entry.buffer_size = _isDynamic ? _elemSize : _elemSize * _elemCount;
        entry.buffer_offset = 0;
        entry.read_only = readonly;
        return entry;
    };

    /** get the data pointer */
    template <typename T> auto data(U32 idx = 0) -> T& {
        NVCHK(idx < _elemCount, "Out of range access in DynamicBuffer");
        NVCHK(sizeof(T) == 1 || sizeof(T) == _elemSize,
              "data(): Invalid type size {}!={}", sizeof(T), _elemSize);
        T* obj = (T*)((U8*)_data.data() + (size_t)(_alignedSize * idx));
        return *obj;
    }

    /** Fill the data memory with content */
    template <typename T> void set_data(const T& rhs) {
        NVCHK(sizeof(T) == _elemSize, "set_data(T): Invalid type size {}!={}",
              sizeof(T), _elemSize);
        U8* ptr = _data.data();

        for (U32 i = 0; i < _elemCount; ++i) {
            // Cast to our structure type:
            T* obj = (T*)ptr;

            // Write the data:
            *obj = rhs;

            // Move to next location:
            ptr += _alignedSize;
        }
    }

    /** Set the data from an array */
    template <typename T> void set_data(const Vector<T>& vec) {
        NVCHK(sizeof(T) == _elemSize,
              "set_data(Vector): Invalid type size {}!={}", sizeof(T),
              _elemSize);
        U8* ptr = _data.data();

        NVCHK(vec.size() == _elemCount, "Unexpected vector length.");

        for (U32 i = 0; i < _elemCount; ++i) {
            // Cast to our structure type:
            T* obj = (T*)ptr;

            // Write the data:
            *obj = vec[i];

            // Move to next location:
            ptr += _alignedSize;
        }
    }

    /** Set data from a vector */
    template <typename T> void set_data(const Vector<T>& vec, U32 offset) {
        NVCHK(sizeof(T) == _elemSize,
              "set_data(Vector, offset): Invalid type size {}!={}", sizeof(T),
              _elemSize);
        U8* ptr = _data.data();

        NVCHK(vec.size() >= (_elemCount + offset),
              "unexpected number of elements");

        for (U32 i = 0; i < _elemCount; ++i) {
            // Cast to our structure type:
            T* obj = (T*)ptr;

            // Write the data:
            *obj = vec[offset + i];

            // Move to next location:
            ptr += _alignedSize;
        }
    }

  protected:
    /** Create the containers for this object */
    void init(U32 elemCount, U32 alignment, U32 elemSize) {
        // Create the buffer:
        _elemCount = elemCount;
        _elemSize = elemSize;
        _isDynamic = elemCount > 1 && alignment == 0;

        // We need to figure out the total size of the buffer:
        U32 align_offset = alignment;
        if (align_offset == 0) {
            // Retrieve the buffer alignment from the engine:
            align_offset = wgpu_get_buffer_offset_alignment();
        }

        // Compute our structure aligned size:
        _alignedSize = get_aligned_element_size(_elemSize, align_offset);
        logDEBUG("Computed alignSize: {}", _alignedSize);

        // Compute the buffer total size:
        // The last element doesn't require any special alignment consideration:
        U32 buf_size = _elemSize + (elemCount - 1) * _alignedSize;
        logDEBUG("Computed buf_size: {}", buf_size);

        // Prepare the storage:
        _data.resize(buf_size, '\0');
        NVCHK(_data.size() == buf_size, "Invalid data size.");
    }

    /** Create and fill the initial buffer */
    void create_buffer(wgpu::BufferUsage usage) {
        _buffer = wgpu_create_buffer(_data.data(), _data.size(), usage);
    }

    /** The data object*/
    Vector<U8> _data;

    /** The WGPU buffer */
    wgpu::Buffer _buffer{};

    /** Aligned element size */
    U32 _alignedSize{0};

    /** Element count */
    U32 _elemCount{0};

    /** Element size */
    U32 _elemSize{0};

    /** Dynamic offset flags*/
    Bool _isDynamic{false};
};
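As a quick usage illustration, here is how such a buffer could be set up to hold one uniform structure per worker thread, ready to be bound with a dynamic offset (a sketch only, the UniformData fields below are placeholders):

    // Placeholder uniform structure (one instance per worker thread):
    struct UniformData {
        float tileParams[4];
        float coarseParams[4];
    };

    // 4 elements with alignment == 0: the alignment is retrieved from the
    // engine and the buffer is flagged as "dynamic":
    U32 numThreads = 4;
    DynamicBufferBase unifBuffer(numThreads, 0, sizeof(UniformData),
                                 wgpu::BufferUsage::CopyDst |
                                     wgpu::BufferUsage::Uniform);

    // Fill the slot for a given thread and upload the CPU copy to the GPU:
    unifBuffer.data<UniformData>(2).tileParams[0] = 1.0F;
    unifBuffer.update();

    // And turn it into a BindEntry usable when creating a bind group:
    auto entry = unifBuffer.as_ubo(true);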

Crap… 😅 I'm not even done with this initial Render pass setup and my brain is already on the next step: considering the replacement of the render pipeline I'm building right now with a compute pipeline instead ⇒ Sometimes it's really a pain when you realize your brain is going faster than your hands 🤣.

For the moment I was just preparing a RenderNode in the ElevationProducer constructor as follows:

    // Use a shared render node to create a single pipeline:
    _upsampleNode = create_ref_object<WGPURenderNode>(0, "proland/upsample");
    _upsampleNode->set_blend_state({false});
    _upsampleNode->set_depth_state({false});

    // Define the binding group 0 layout:
    // Containing a Sampler, the noise texture
    // (Tex2dArray) and the parent coarse texture
    _upsampleNode->define_grp(DefSampler, DefTex2dArray, DefTex2dArray);

    // Define the binding group 1 layout:
    // Containing our uniform buffer data and the residual texture (Tex2d)
    _upsampleNode->define_grp(sizeof(UniformData), DefTex2d);

    // Next we should generate as many bind groups for binding group 0 as we
    // have textures in the GPUTileStorage:
    auto* gpustorage = storage->as_gpu_tile_storage();
    NVCHK(gpustorage != nullptr, "Invalid GPUTileStorage");
    U32 ntexs = gpustorage->get_num_textures();

    for (U32 i = 0; i < ntexs; ++i) {
        auto bindgrp = _upsampleNode->create_bind_grp(
            0, BindLinearSampler, BindTexture{_noiseTexture},
            BindTexture{gpustorage->get_texture(i)});
        _bind0Groups.push_back(bindgrp);
    }

    // Get the render pipeline from the RenderNode:
    _upsamplePipeline =
        _upsampleNode->get_pipeline({TextureFormat::RGBA32Float});

And next, for each worker thread, I was allocating a dedicated render pass and additional resources:

auto ElevationProducer::get_or_create_thread_data()
    -> const RefPtr<ThreadData>& {
    I32 threadId = get_thread_index();
    auto& slot = _threadData[threadId];
    if (slot == nullptr) {
        logNOTE("ElevationProducer: creating specific data for thread {}",
                threadId);
        slot = create_ref_object<ThreadData>();

        I32 tileWidth = (I32)_cache->get_storage()->get_tile_size();

        auto* eng = WGPUEngine::instance();

        // Create the residual texture:
        slot->residualTexture = eng->create_texture_2d(
            tileWidth, tileWidth, TextureFormat::R32Float,
            TextureUsage::TextureBinding | TextureUsage::CopyDst);

        // Add a dedicated render pass to render our data
        slot->demTexture = eng->create_texture_2d(
            tileWidth, tileWidth, TextureFormat::RGBA32Float,
            TextureUsage::RenderAttachment | TextureUsage::CopySrc);

        const RenderPassDesc& desc = {.width = (U32)tileWidth,
                                      .height = (U32)tileWidth,
                                      .with_depth = false,
                                      .swapchain_idx = -1,
                                      .clear_color = true,
                                      .clear_depth = true,
                                      .render_to_texture = true,
                                      .target_texture = slot->demTexture};
        slot->renderPass = create_ref_object<WGPURenderPass>(desc);

        // Create the bind1 group:
        // Note: Accessing the BindGroupBuilder to create a new bind group is
        // currently not thread safe, so we need to protect this section:
        WITH_SPINLOCK(_sp);
        slot->bind1Group =
            _upsampleNode->create_bind_grp(1, slot->unifBuffer.as_ubo(true),
                                           BindTexture{slot->residualTexture});
    }

    return slot;
}

⇒ Some of this code could stay as is, but I think the RenderNode should be replaced with a ComputeNode and the RenderPass should be replaced with a ComputePass. Which all sounds like a great idea and all, except that I have not implemented those classes yet lol. So more work ahead unfortunately 😇.

So here is the initial version of my ComputePass class (pretty small compared to the RenderPass):

class NVGPU_EXPORT WGPUComputePass : public WGPUObject {
    NV_DECLARE_NO_COPY(WGPUComputePass)
    NV_DECLARE_NO_MOVE(WGPUComputePass)

  public:
    // Definition of a computation step:
    struct ComputeStep {
        wgpu::ComputePipeline pipeline;
        Vector<wgpu::BindGroup> bindGroups;
        U32 countX{1};
        U32 countY{1};
        U32 countZ{1};
    };

    WGPUComputePass();
    explicit WGPUComputePass(const ComputePassDesc& desc);
    ~WGPUComputePass() override;

    class NVGPU_EXPORT EncodeHandler : public RefObject {
      public:
        virtual void operator()(wgpu::ComputePassEncoder& pass) = 0;
    };

    template <typename F> class EncodeFunctor : public EncodeHandler {
        F _func;

      public:
        explicit EncodeFunctor(F func) : _func(func){};
        void operator()(wgpu::ComputePassEncoder& pass) override {
            _func(pass);
        }
    };

    /** Specify the encoding function */
    auto set_encode_handler(RefPtr<EncodeHandler> handler) -> WGPUComputePass& {
        _encodeHandler = std::move(handler);
        return *this;
    }

    template <typename F> auto set_encode_func(F func) -> WGPUComputePass& {
        auto handler = nv::create_ref_object<EncodeFunctor<F>>(func);
        return set_encode_handler(handler);
    }

    /** Encode this render pass */
    void encode(const wgpu::CommandEncoder& encoder);

    /** Add a compute preprocessor definition */
    auto add_def(String def) -> WGPUComputePass& {
        _computeDefinitions.insert(std::make_pair(std::move(def), "1"));
        return *this;
    }

    auto add_compute_step(ComputeStep step) -> WGPUComputePass& {
        _computeSteps.emplace_back(std::move(step));
        return *this;
    }

    // auto add_compute_steps(WGPUComputeNode& cnode) -> WGPUComputePass& {
    //     return *this;
    // }

    void execute();

  protected:
    /** Desc for this compute pass */
    ComputePassDesc _desc;

    /** ComputePass descriptor */
    wgpu::ComputePassDescriptor _descriptor;

    /** Encode handler */
    RefPtr<EncodeHandler> _encodeHandler;

    // Definition map:
    Preprocessor::DefMap _computeDefinitions;

    // List of computation steps to be executed:
    Vector<ComputeStep> _computeSteps;
};
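The execute() method is only declared above; conceptually it just wraps encode() into a one-shot command submission, presumably along these lines (a sketch, since the actual implementation is not shown here):

void WGPUComputePass::execute() {
    // Build a command buffer containing only this compute pass and submit it:
    auto* eng = WGPUEngine::instance();
    auto& bld = eng->build_commands();
    bld.execute_compute_pass(*this);
    bld.submit(false);
}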

The ComputeNode was also relatively simple to implement, and on top of that I also had to create the class ComputePipelineBuilder and add a couple of functions in the WGPUEngine class, but nothing too fancy really:

    // Create a Compute pipeline:
    auto create_compute_pipeline(const wgpu::ComputePipelineDescriptor& desc)
        -> wgpu::ComputePipeline;
        
    // Build a default compute pipeline:
    auto build_compute_pipeline(const String& compute_code,
                                Preprocessor::DefMap* computeDefs = nullptr,
                                bool reset_all = true,
                                ShaderCodeType ctype = CODE_WGSL)
        -> WGPUComputePipelineBuilder&;

Now preparing some minimal unit tests on the compute pipelines. And the first thing I realize is that in the unit tests, when building a render node or a compute node, it might be handy to be able to pass the WGSL code directly instead of providing a filename containing that code. Let's see if there is an easy way to achieve this.

OK: I have now added this function:

auto WGPUEngine::get_shader_code(const String& shader_file) -> String {
    // This "shader_file" might actually contain some code directly:
    if (starts_with(shader_file, "WGSL:")) {
        // Just remove this "WGSL:" prefix:
        return shader_file.substr(5);
    }

    String sFile = shader_file;
    if (get_path_extension(sFile).empty()) {
        sFile += ".wgsl";
    }

    validate_resource_path(WGPUEngine::cat_shader, sFile);

    return read_file(sFile.c_str());
}

And I'm using it in the RenderNode and the ComputeNode just before building our pipelines:

        logDEBUG("Building pipeline...");
        String vs_code = eng->get_shader_code(_shaderDesc.vertFile);
        String fs_code = _shaderDesc.vertFile == _shaderDesc.fragFile
                             ? vs_code
                             : eng->get_shader_code(_shaderDesc.fragFile);

        logDEBUG("Building pipeline...");
        String cs_code = eng->get_shader_code(_shaderDesc.computeFile);

⇒ Now all I need to do is to prefix any WGSL code I want to use directly with the string “WGSL:” 👍! So I can create a test compute node with this:

    // Prepare a ComputeNode:
    String code = R"(WGSL:
@compute @workgroup_size(32)
fn main() {
    // Compute stuff
}
)";

    WGPUComputeNode cnode(code);


Next I need to add some storage bindings.

Note: in an attempt to automate a bit the handling of the “ShaderStage” when preparing binding groups for a compute pipeline, I'm now specifying the “target pipeline” as part of the constructor of the WGPUBindGroupBuilder:

explicit WGPUBindGroupBuilder(TargetPipeline tgt = TGT_PIPELINE_RENDER);

(And obviously, we will set this to TGT_PIPELINE_COMPUTE when creating a ComputeNode.)
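So when building a ComputeNode, this should boil down to something like (sketch):

// With the Compute target, the builder can default the visibility of the
// generated layout entries to the compute stage instead of vertex/fragment:
WGPUBindGroupBuilder builder(TGT_PIPELINE_COMPUTE);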

And I also updated the define_grp() to accept BindEntry objects now, so the following code works to define the binding group 0:

    WGPUComputeNode cnode(code);

    U32 num = 64;
    DynamicBufferBase input(num, 1, sizeof(F32), BufferUsage::Storage);
    DynamicBufferBase output(num, 1, sizeof(F32), BufferUsage::Storage);

    // Fill the input buffer:
    F32* ptr = input.ptr<F32>();
    for (U32 i = 0; i < num; ++i) {
        (*ptr++) = (F32)i / 10.0F;
    }
    input.update();

    // We need to define some minimal bindings here:
    cnode.define_grp(input.as_sto(), output.as_rw_sto());

Continuing on the implementation I added the possibility to define a “compute_step” in the ComputeNode with the add() method [similar to what we do in the RenderNode]:

    auto add(WorkDim dims, Vector<wgpu::BindGroup> grps, DynOffsets dynOffsets)
        -> WGPUComputeNode&;

    auto add(WorkDim dims, Vector<wgpu::BindGroup> grps) -> WGPUComputeNode&;

and also introduced support to “add” a ComputeNode inside a ComputePass:

auto WGPUComputePass::add(WGPUComputeNode& cnode) -> WGPUComputePass& {
    cnode.collect_steps(_computeSteps, &_computeDefinitions);
    return *this;
}

So here is what my initial compute unit test looks like now:

    WGPUComputeNode cnode(code);

    U32 num = 64;
    DynamicBufferBase input(num, 1, sizeof(F32), BufferUsage::Storage);
    DynamicBufferBase output(num, 1, sizeof(F32), BufferUsage::Storage);

    // Fill the input buffer:
    F32* ptr = input.ptr<F32>();
    for (U32 i = 0; i < num; ++i) {
        (*ptr++) = (F32)i / 10.0F;
    }
    input.update();

    // We need to define some minimal bindings here:

    // Create a bind group layout:
    cnode.define_grp(input.as_sto(), output.as_rw_sto());

    // Create a bind group:
    // cnode.bind(0, input.as_sto(), output.as_rw_sto());
    auto grp0 = cnode.create_bind_grp(0, input.as_sto(), output.as_rw_sto());

    // Add a compute step:
    cnode.add({64}, {grp0});

    // Check that we can get a pipeline from that node:
    BOOST_CHECK(cnode.get_pipeline() != nullptr);

    // Add the compute node to the compute pass:
    cpass->add(cnode);

    // Release the pass:
    cpass.reset();

Next, time to actually execute this compute pass somehow!

I already have a simple encode function derived from the RenderPass class, but now I realize I don't have support for dynamic offsets in this simplified version. Let's try to add this back. OK! Here is the updated version, which I think will work 😅:

void WGPUComputePass::encode(const wgpu::CommandEncoder& encoder) {
    ComputePassEncoder pass = encoder.BeginComputePass(&_descriptor);

    if (_encodeHandler != nullptr) {
        (*_encodeHandler)(pass);
    }

    // Execute the compute steps
    ComputePipeline currentPipeline = nullptr;
    const Vector<DynamicGroupBind>* currentBinds = nullptr;

    for (const auto& step : _computeSteps) {
        // Set the pipeline:
        if (step.pipeline.Get() != currentPipeline.Get()) {
            currentPipeline = step.pipeline;
            pass.SetPipeline(currentPipeline);
        }

        // Set the bindings:
        U32 idx = 0;
        U32 curSize = currentBinds == nullptr ? 0 : currentBinds->size();

        for (const auto& it : step.bindGroups) {
            if (idx >= curSize ||
                it.group.Get() != (*currentBinds)[idx].group.Get() ||
                !all_equal(it.dynamic_offsets,
                           (*currentBinds)[idx].dynamic_offsets)) {
                pass.SetBindGroup(idx, it.group, it.dynamic_offsets.size(),
                                  it.dynamic_offsets.empty()
                                      ? nullptr
                                      : it.dynamic_offsets.data());
            }
            // Don't forget to advance the bind group index:
            ++idx;
        }
        currentBinds = &step.bindGroups;

        // Dispatch the computation:
        pass.DispatchWorkgroups(step.dims.x, step.dims.y, step.dims.z);
    }

    pass.End();
}

Alright! So now we can execute our ComputePass by just calling the execute method 😎:

    // Create a bind group layout:
    cnode.define_grp(input.as_sto(), output.as_rw_sto());

    // Create a bind group:
    // cnode.bind(0, input.as_sto(), output.as_rw_sto());
    auto grp0 = cnode.create_bind_grp(0, input.as_sto(), output.as_rw_sto());

    // Add a compute step:
    // Our buffer is of size 64 and we have workgroups of size 32, so we need
    // 2 groups on X
    cnode.add({2}, {grp0});

    // Check that we can get a pipeline from that node:
    BOOST_CHECK(cnode.get_pipeline() != nullptr);

    // Add the compute node to the compute pass:
    cpass->add(cnode);

    // Execute the pass:
    cpass->execute();

Which means that in theory we should have the appropriate output now in the output buffer since we have updated the shader code to handle those storage bindings:

@group(0) @binding(0) var<storage,read> inputBuffer: array<f32,64>;
@group(0) @binding(1) var<storage,read_write> outputBuffer: array<f32,64>;

@compute @workgroup_size(32)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    // Compute stuff
    outputBuffer[id.x] = inputBuffer[id.x]*2.0;
}

⇒ So how could we read the data back?

First trying the simplest option:

    // Should have updated the output buffer now:
    auto buf = output.buffer();
    buf.MapAsync(
        wgpu::MapMode::Read, 0, buf.GetSize(),
        [](WGPUBufferMapAsyncStatus status, void*) {
            BOOST_CHECK_EQUAL(status, WGPUBufferMapAsyncStatus_Success);
        },
        nullptr);
    eng->wait_idle();

    // callback, void *userdata)
    const F32* data2 = (F32*)buf.GetConstMappedRange(0);
    BOOST_REQUIRE(data2 != nullptr);

    for (U32 i = 0; i < num; ++i) {
        F32 val = (*data2++);
        BOOST_CHECK_EQUAL(val, (F32)i * 2.0F / 10.0F);
    }

But this failed because I don't have the MapRead usage on the output buffer:

2023-09-13 20:40:17.309170 [ERROR] Dawn: Validation error: The buffer usages (BufferUsage::(CopyDst|Storage|BufferUsage::80000000)) do not contain BufferUsage::MapRead.
 - While calling [Buffer].MapAsync(MapMode::Read, 0, 256, ...).

error: in "compute/test_simple_buffer_compute": check status == WGPUBufferMapAsyncStatus_Success has failed [1 != 0]

⇒ Let's try to add this flag (but I don't actually expect this to work, since I read that MapRead is not compatible with the Storage flag). And as expected, this failed:

D:/Projects/NervLand/tests/test_nvGPU/compute_spec.cpp(892023-09-13 20:44:08.647824 [ERROR] Dawn: Validation error: Buffer usages (BufferUsage::(MapRead|CopyDst|Storage)) is invalid. If a buffer usage contains BufferUsage::MapRead the only other allowed usage is BufferUsage::CopyDst.
 - While calling [Device].CreateBuffer([BufferDescriptor]).

So now, let's create a third buffer that we will use to copy the data back to the CPU:

    U32 num = 64;
    DynamicBufferBase input(num, 1, sizeof(F32), BufferUsage::Storage);
    DynamicBufferBase output(num, 1, sizeof(F32),
                             BufferUsage::Storage | BufferUsage::CopySrc);
    DynamicBufferBase staging(num, 1, sizeof(F32),
                              BufferUsage::CopyDst | BufferUsage::MapRead);

And bingo! Now we have a first working unit test on the Compute pipeline usage:

    cnode.add({2}, {grp0});

    // Check that we can get a pipeline from that node:
    BOOST_CHECK(cnode.get_pipeline() != nullptr);

    // Add the compute node to the compute pass:
    cpass->add(cnode);

    // Execute the pass:
    // cpass->execute();

    auto& bld = eng->build_commands();
    bld.execute_compute_pass(*cpass);
    bld.copy_buffer_to_buffer(output.buffer(), staging.buffer(), 0, 0,
                              output.get_buffer_size());
    bld.submit(false);

    // Should have updated the output buffer now:
    const auto& buf = staging.buffer();
    buf.MapAsync(
        wgpu::MapMode::Read, 0, buf.GetSize(),
        [](WGPUBufferMapAsyncStatus status, void*) {
            BOOST_CHECK_EQUAL(status, WGPUBufferMapAsyncStatus_Success);
        },
        nullptr);
    eng->wait_idle();

    // callback, void *userdata)
    const F32* data2 = (F32*)buf.GetConstMappedRange(0);
    BOOST_REQUIRE(data2 != nullptr);

    for (U32 i = 0; i < num; ++i) {
        F32 val = (*data2++);
        BOOST_CHECK_EQUAL(val, (F32)i * 2.0F / 10.0F);
    }

⇒ I then introduced the GPUBuffer as base class for the DynamicBufferBase (which I should probably rename in fact 🤔).

Actually, just decided to rename DynamicBufferBase to GPUBufferProxy: I think that's a better choice!

Problem: how to use capturing lambdas to perform an async read on a buffer, when the MapAsync() method only expects a function pointer and a userdata pointer?

The following will not work:

    // asynchronous read of this buffer:
    template <typename Func>
    void read_async(Func func, U64 offset = 0, U64 size = WGPU_WHOLE_MAP_SIZE) {
        NVCHK((_buffer.GetUsage() & wgpu::BufferUsage::MapRead) != 0,
              "Cannot read from non MapReadable buffer");

        if (size != WGPU_WHOLE_MAP_SIZE) {
            NVCHK(offset + size <= _buffer.GetSize(),
                  "Out of range buffer mapping");
        }

        _buffer.MapAsync(
            wgpu::MapMode::Read, offset, size,
            [this, func, offset, size](WGPUBufferMapAsyncStatus status, void*) {
                NVCHK(status == WGPUBufferMapAsyncStatus_Success,
                      "Failed to map-read buffer");
                const void* ptr = _buffer.GetConstMappedRange(offset, size);
                NVCHK(ptr != nullptr, "Invalid const mapped pointer.");
                func(ptr);
            },
            nullptr);
    }

Disclaimer: I'm pretty sure you could achieve that almost out of the box by using an std::function, but I would like to avoid using this if possible 😁.

⇒ OK, I came up with something on this, and I could now add the read_async() and read_sync() methods to the GPUBuffer class.
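The general idea is simply to move the callable onto the heap and pass that pointer as the userdata of a non-capturing trampoline lambda (which does convert to a plain function pointer). Roughly, this gives something like the following (a sketch of the idea only, not the exact code that ended up in GPUBuffer):

    // asynchronous read of this buffer (sketch):
    template <typename Func>
    void read_async(Func func, U64 offset = 0, U64 size = WGPU_WHOLE_MAP_SIZE) {
        // Pack everything the callback will need into a small heap object:
        struct Ctx {
            GPUBuffer* self;
            Func func;
            U64 offset;
            U64 size;
        };
        auto* ctx = new Ctx{this, std::move(func), offset, size};

        _buffer.MapAsync(
            wgpu::MapMode::Read, offset, size,
            // Non-capturing lambda: converts to the expected function pointer.
            [](WGPUBufferMapAsyncStatus status, void* userdata) {
                auto* ctx = (Ctx*)userdata;
                NVCHK(status == WGPUBufferMapAsyncStatus_Success,
                      "Failed to map-read buffer");
                const void* ptr = ctx->self->_buffer.GetConstMappedRange(
                    ctx->offset, ctx->size);
                ctx->func(ptr);
                delete ctx;
            },
            ctx);
    }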

And I can now replace some test code like this:

    staging.read_async([](const void* data) { BOOST_CHECK(data != nullptr);
    });

    const auto& buf = staging.buffer();
    buf.MapAsync(
        wgpu::MapMode::Read, 0, buf.GetSize(),
        [](WGPUBufferMapAsyncStatus status, void*) {
            BOOST_CHECK_EQUAL(status, WGPUBufferMapAsyncStatus_Success);
        },
        nullptr);
    eng->wait_idle();

    // callback, void *userdata)
    const F32* data2 = (F32*)staging.buffer().GetConstMappedRange(0);

with simply:

const F32* data2 = (F32*)staging.read_sync();
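And read_sync() is essentially the blocking version of the same thing: map the buffer, spin until the callback fires (using the same polling pattern as wait_idle() earlier), and return the mapped pointer. Roughly (again just a sketch of the idea, not the exact implementation):

auto GPUBuffer::read_sync(U64 offset, U64 size) -> const void* {
    std::atomic<bool> done(false);
    _buffer.MapAsync(
        wgpu::MapMode::Read, offset, size,
        [](WGPUBufferMapAsyncStatus status, void* userdata) {
            NVCHK(status == WGPUBufferMapAsyncStatus_Success,
                  "Failed to map-read buffer");
            ((std::atomic<bool>*)userdata)->store(true);
        },
        &done);

    // Poll the device until the mapping callback has been called:
    auto* eng = WGPUEngine::instance();
    while (!done.load()) {
        eng->process_events();
        sleep_us(100);
    }

    return _buffer.GetConstMappedRange(offset, size);
}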
