====== NervLand DevLog #30: Initial study and design setup for Proland ElevationProducer ======

{{tag>dev cpp webgpu nervland devlog proland}}

Hi! In this article, I'm mostly focusing on the ElevationProducer from ProLand: trying to understand how it works and how I should implement it in my own engine (replacing OpenGL with WebGPU in the process). Overall not an easy task, but I'm making some progress at least. The structure of the article is admittedly a bit fuzzy, with ideas going in multiple directions, and unfortunately this work is still not completed, since I eventually decided I should try to move to a compute pipeline implementation to perform the actual tile generation. So we end up preparing support for the ComputePass and ComputeNode classes, and will continue on this journey in the next episode ;-).

Youtube videos (3 parts) for this article available at:

;#;
{{ youtube>YlBZ4DqiCo4?large }}
;#;

;#;
{{ youtube>uwPC-0sVVO4?large }}
;#;

;#;
{{ youtube>J_G6K6vz-9Y?large }}
;#;

===== Initial steps on ElevationProducer implementation =====

The ElevationProducer will use the noise Texture2D Array, in R16F format: so the first thing I think I should do here is to build a simple unit test to check that I can write data to an R16F texture, and read this data back.

=> Introducing support to copy a Texture to a Buffer:

<code cpp>
auto WGPUCommandBuilder::copy_texture_to_buffer(wgpu::Texture src, wgpu::Buffer dst,
                                                I32 mip_level) -> WGPUCommandBuilder& {
    // Textures should match to call this method:
    U32 width = src.GetWidth();
    U32 height = src.GetHeight();
    U32 nlayers = src.GetDepthOrArrayLayers();
    auto fmt = src.GetFormat();
    U32 nmips = src.GetMipLevelCount();
    auto bsize = dst.GetSize();

    U32 numChans = get_texture_format_num_channels(fmt);
    U32 chanSize = get_texture_format_channel_size(fmt);

    ImageCopyTexture srcDesc{.texture = std::move(src)};
    ImageCopyBuffer dstDesc{
        .layout = {.offset = 0,
                   .bytesPerRow = width * numChans * chanSize,
                   .rowsPerImage = height},
        .buffer = std::move(dst),
    };

    Extent3D csize{width, height, nlayers};

    if (nmips == 1 && mip_level < 0) {
        // Just copy the only available mip_level:
        mip_level = 0;
    }

    if (mip_level == -1) {
        THROW_MSG("No implementation yet to copy all mip levels");
        // // Copy all the mip levels:
        // for (U32 i = 0; i < nmips; ++i) {
        //     srcDesc.mipLevel = i;
        //     dstDesc.mipLevel = i;
        //     csize.width = maximum(width / (1 << i), 1U);
        //     csize.height = maximum(height / (1 << i), 1U);
        //     _encoder.CopyTextureToTexture(&srcDesc, &dstDesc, &csize);
        // }
    } else {
        NVCHK(mip_level < nmips, "Out of range mip level.");
        // Copy a single mip level:
        srcDesc.mipLevel = mip_level;
        csize.width = maximum(width / (1 << mip_level), 1U);
        csize.height = maximum(height / (1 << mip_level), 1U);
        dstDesc.layout.bytesPerRow = csize.width * numChans * chanSize;
        dstDesc.layout.rowsPerImage = csize.height;

        // Check the total size that we will copy:
        U32 tot_size = csize.width * csize.height * numChans * chanSize;
        NVCHK(bsize == tot_size, "Mismatch in dest buffer size for texture copy.");

        _encoder.CopyTextureToBuffer(&srcDesc, &dstDesc, &csize);
    }

    return *this;
}
</code>

And also introducing support for waiting until the WGPU queue signals that all the submitted work is done:

<code cpp>
static void workDoneCallback(WGPUQueueWorkDoneStatus status, void* userdata) {
    logNOTE("In workDoneCallback with status: {}", status);
    auto* bptr = (std::atomic<bool>*)userdata;
    *bptr = true;
}

void WGPUEngine::wait_idle() {
    std::atomic<bool> done(false);
    _queue.OnSubmittedWorkDone(0U, workDoneCallback, &done);
    logNOTE("Waiting for GPU...");
    while (!done.load()) {
        process_events();
        sleep_us(100);
    }
    logNOTE("GPU is idle.");
}
</code>

To be honest, in the code above the first argument for ''OnSubmittedWorkDone'' is the **signalValue**, and I currently have no idea what this is exactly, but we will discover that soon enough.

**=> Update**: Looking into the Dawn unit tests code, it seems that the signalValue should be set to 0 for now, and it is an error to use another value (?).

Then I also added support to generate the **DEM Noise texture 2D Array**: no idea yet how this will be used, but we'll figure this out eventually ;-).

**Note**: Oh crap... I just realized that the blockingconcurrentqueue I'm now using by default in the LogManager will not build on EMSCRIPTEN: I get an error when trying to link to **sem_timedwait**:

<code>
cmd.exe /C "cd . && D:\Projects\NervProj\tools\windows\emsdk-git\upstream\emscripten\em++.bat -std=c++20 -fpch-instantiate-templates -DNDEBUG -O3 -s WASM=1 -O3 -s ALLOW_MEMORY_GROWTH=1 -s FETCH=1 "@D:/Projects/NervProj/build/nervland_emcc_release/NervSeed_emcc_preload_file.rsp" sources/apps/NervSeed/CMakeFiles/NervSeed.dir/src/main.cpp.o -o sources\apps\NervSeed\NervSeed.html sources/nvCore/lite/libnvCore_lite.a D:/Projects/NervProj/libraries/windows_emcc/fmt-9.1.1/lib/libfmt.a D:/Projects/NervProj/libraries/windows_emcc/yamlcpp-0.7.0/lib/libyaml-cpp.a -lpthread D:/Projects/NervProj/libraries/windows_emcc/SDL2-2.26.5/lib/libSDL2.a D:/Projects/NervProj/libraries/windows_emcc/yamlcpp-0.7.0/lib/libyaml-cpp.a D:/Projects/NervProj/libraries/windows_emcc/freetype-git-2.13/lib/libfreetype.a D:/Projects/NervProj/libraries/windows_emcc/harfbuzz-git-7.3/lib/libharfbuzz.a D:/Projects/NervProj/libraries/windows_emcc/zlib-1.2.13/lib/libz.a D:/Projects/NervProj/libraries/windows_emcc/brotli-git-1.0.9/lib/libbrotlidec.a D:/Projects/NervProj/libraries/windows_emcc/brotli-git-1.0.9/lib/libbrotlicommon.a D:/Projects/NervProj/libraries/windows_emcc/libpng-1.6.39/lib/libpng16.a -lpthread && cd ."
wasm-ld: error: sources/nvCore/lite/libnvCore_lite.a(LogManager.cpp.o): undefined symbol: sem_timedwait
</code>

=> I will have to investigate this some day, but for now let's just disable the logger thread when building for EMSCRIPTEN.

===== Investigations on tile creation process =====

To me it seems the tile creation process would start with a call to ''ElevationProducer::startCreateTile()''. The ''startCreateTile()'' method is a protected virtual method, inherited from the base TileProducer class, and it is called in the base **CreateTile** task implementation.

=> So we need some implementation for this **CreateTile** class.

**Note**: the **CreateTile** task makes extensive checks on its "owner" member to see if it's a null pointer: I'm going to take a different road here and assume instead that the CreateTile tasks should all be deleted before their parent TileProducer, so this owner pointer should always be valid.

=> I have now prepared a base/simplified implementation of the **CreateTile** class: one thing to note here is that I'm currently bypassing the begin()/end() methods from this class: as far as I understand they are only needed to set up a "framebuffer" before doing some rendering, and I don't think this will apply directly anymore with a WebGPU implementation.

Next, let's continue with the reimplementation of ''start_create_tile'' in the ElevationProducer class. But actually the first thing this will require is a call to **createTaskGraph**, which itself will need the implementation of the **CreateTileTaskGraph** class, so let's add this one now.
Arrggghh... except that the CreateTileTaskGraph class is mainly concerned with deleting/restoring a task graph, thus relying on the doRelease() method from ORK: not very good for us. So yeah... for the moment I don't think it would quite make sense to introduce this implementation as is. But let's keep monitoring this point.

**OK**, now that we have the support for ''start_create_tile'', the next thing we need is the implementation for ''ElevationProducer::do_create_tile()'' which will be called in **CreateTile::run()** (and in fact, this is where most of the magic would happen if I'm not mistaken 🤔).

Let's see... Okay, hmmm, this is tricky, but working on the do_create_tile() method, I think I now need to replace the concept of a **framebuffer** with a **render pass**.

**Side note**: for now, I'm going to use some explicit synchronization, but eventually it would be good to have some kind of per-thread storage using a fixed thread id like this:

<code cpp>
#include <atomic>
#include <thread>

// Thread-local storage for the identifier
thread_local int threadId = -1;

// Function to get and assign a unique identifier per thread
int GetThreadId() {
    // Atomic counter to generate unique IDs
    static std::atomic<int> counter(0);

    if (threadId == -1) {
        threadId = counter.fetch_add(1);
    }

    return threadId;
}
</code>

Still in the **do_create_tile** method, the next thing we will need is the implementation of the **CPUTileStorage** and also the capability to copy CPU data into a given GPU slot:

<code cpp>
if (_residualTiles != nullptr && _residualTiles->has_tile(level, tx / mod, tz / mod)) {
    residualOSH.set(0.25F / tileWidth, 0.25F / tileWidth, 2.0F / tileWidth, 1.0F);
    I32 rx = (tx % mod) * tileSize;
    I32 ry = (tz % mod) * tileSize;
    TileCache::Tile* t = _residualTiles->find_tile(level, tx / mod, tz / mod);
    NVCHK(t != nullptr, "Invalid residual tile");
    CPUTileStorage<float>::CPUSlot* cpuTile =
        dynamic_cast<CPUTileStorage<float>::CPUSlot*>(t->getData());
    assert(cpuTile != NULL);

    for (I32 y = 0; y < tileWidth; ++y) {
        for (I32 x = 0; x < tileWidth; ++x) {
            float r = cpuTile->data[(x + rx) + (y + ry) * residualTileWidth];
            assert(isFinite(r));
            residualTile[x + y * tileWidth] = r;
        }
    }

    residualTexture->setSubImage(0, 0, 0, tileWidth, tileWidth, RED, FLOAT,
                                 Buffer::Parameters(), CPUBuffer(residualTile));
}
</code>

=> Yet, for the moment we don't need support for the "residualTiles", so we will come back to those extensions afterwards.
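For reference, here is a rough sketch of what that final ''setSubImage()'' call could translate to on the WebGPU side, using ''wgpu::Queue::WriteTexture'' to upload the CPU-side ''residualTile'' float array into an R32Float residual texture. This is not the actual engine code: the ''queue'' and ''residualTexture'' handles are assumed to come from the WGPUEngine.

<code cpp>
// Sketch: upload tileWidth x tileWidth float residuals into an R32Float texture.
// 'residualTexture' is assumed to be a wgpu::Texture created with CopyDst usage,
// and 'queue' the wgpu::Queue retrieved from the engine.
wgpu::ImageCopyTexture dst{.texture = residualTexture, .mipLevel = 0, .origin = {0, 0, 0}};
wgpu::TextureDataLayout layout{.offset = 0,
                               .bytesPerRow = U32(tileWidth) * 4, // 4 bytes per R32Float texel
                               .rowsPerImage = U32(tileWidth)};
wgpu::Extent3D size{U32(tileWidth), U32(tileWidth), 1};
queue.WriteTexture(&dst, residualTile, size_t(tileWidth) * tileWidth * 4, &layout, &size);
</code>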
===== Preparing rendering process for ElevationProducer =====

**Note**: Drawing a fullscreen quad using only a single triangle could prove slightly more efficient (cf. https://stackoverflow.com/questions/2588875/whats-the-best-way-to-draw-a-fullscreen-quad-in-opengl-3-2)

I'm now reaching the end of the ''do_create_tile()'' method, and with this we now have to actually do some rendering. And I think I now have a better understanding of how this should be done: as mentioned above, we will have to replace the **framebuffer** concept with a dedicated **RenderPass**. In this render pass we will be able to render to the **demTexture** which is an RGBA32F texture. The rendering will be done with the **upsample** program, so we will need to create a dedicated pipeline for this program. This pipeline should accept multiple bind groups:

  * group0: [static textures] providing noiseTexture and residualTexture
  * group1: [dynamic data] providing the correct parent coarse texture (2D array) and the uniform buffer.

=> For the binding group 1 we should prepare as many bind groups as the number of texture 2D arrays available in the **GPUTileStorage** corresponding to this producer.

In fact we should probably separate the uniform buffer into its own binding group (because that buffer will be allocated per thread), so we will have:

  * group0: [static textures] providing noiseTexture and residualTexture
  * group1: [parent coarse texture] providing the correct parent coarse texture (2D array)
  * group2: Uniform buffer

Another point to keep in mind here is that we are limited to only 4 binding groups. So here we are OK, but in other cases we might have to merge group0 and group1 together to provide all the textures in a single binding group.

And thinking about it: I'm now wondering how the rendering will work if we have multiple threads writing data to the same texture, then rendering to the same texture, then copying that texture to different layers in a texture 2D array. => **maybe it's a better idea to use different texture targets upfront** 🤔?

Even if we use multiple threads we can certainly share a single WGPU pipeline to execute the upsample program. So we will use a single **RenderNode** to build that:

<code cpp>
WGPURenderNode rnode(0, "proland/upsample");
</code>

Next, on this RenderNode we must first define the layout we want for the binding groups. And on this point I want to introduce an update already: I want to have a unified ''DefEntry'' struct which I could use to define any kind of binding entry (just like we already have the ''BindEntry'' struct to provide any kind of actual binding). => Let's provide that! And here is the new struct:

<code cpp>
struct NVGPU_EXPORT DefEntry {
    DefEntryType type{DEF_ENTRY_UNDEFINED};

    // Common settings:
    wgpu::ShaderStage stage{wgpu::ShaderStage::Vertex};
    I32 location{-1};
    I32 group{-1};

    // Sampler settings:
    wgpu::SamplerBindingType sampler_type{wgpu::SamplerBindingType::Filtering};

    // Texture settings:
    wgpu::TextureSampleType texture_stype = wgpu::TextureSampleType::Float;
    wgpu::TextureViewDimension texture_dim = wgpu::TextureViewDimension::e2D;
    bool multisampled{false};

    // constructors:
    DefEntry() = default;
    DefEntry(const DefEntry& rhs) = default;
    DefEntry(DefEntry&& rhs) = default;
    void operator=(const DefEntry&) = delete;
};
</code>

**Note**: Eventually I should also add support to define buffers in this DefEntry struct, but for the moment this doesn't seem that critical.
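Just to make the intent a bit more concrete, here is a small sketch of how the group-0 layout of the upsample pipeline could be described with this struct. The ''make_*'' helpers and the ''DEF_ENTRY_SAMPLER'' / ''DEF_ENTRY_TEXTURE'' enum values below are hypothetical (only ''DEF_ENTRY_UNDEFINED'' is shown above); this is only to illustrate how the fields are meant to be filled:

<code cpp>
// Hypothetical helpers building DefEntry values (enum values assumed):
auto make_sampler_def = [](wgpu::ShaderStage stage) {
    DefEntry e;
    e.type = DEF_ENTRY_SAMPLER; // assumed enum value
    e.stage = stage;
    e.sampler_type = wgpu::SamplerBindingType::Filtering;
    return e;
};
auto make_tex2darray_def = [](wgpu::ShaderStage stage) {
    DefEntry e;
    e.type = DEF_ENTRY_TEXTURE; // assumed enum value
    e.stage = stage;
    e.texture_stype = wgpu::TextureSampleType::Float;
    e.texture_dim = wgpu::TextureViewDimension::e2DArray;
    return e;
};

// Group 0 of the upsample pipeline: a sampler, the noise texture and the
// parent coarse texture, all visible from the fragment stage:
auto s0 = make_sampler_def(wgpu::ShaderStage::Fragment);
auto t0 = make_tex2darray_def(wgpu::ShaderStage::Fragment);
auto t1 = make_tex2darray_def(wgpu::ShaderStage::Fragment);
</code>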
===== Updated DynamicBuffer structure =====

While working on the ''DefEntry'' implementation I also realized I had to update my **DynamicBuffer** implementation a bit => I have now extracted a non-templated base class for it, and we can use a reference on such a base in a generic way in the **BufferProvider** class:

<code cpp>
class NVGPU_EXPORT DynamicBufferBase : public RefObject {
  public:
    /** Default constructor */
    DynamicBufferBase() = default;

    /** Custom buffer constructor */
    DynamicBufferBase(U32 elemCount, U32 alignment, U32 elemSize,
                      wgpu::BufferUsage usage) {
        init(elemCount, alignment, elemSize);
        create_buffer(usage);
    }

    // Get the buffer:
    [[nodiscard]] auto buffer() const -> const wgpu::Buffer& { return _buffer; }

    /** get Aligned size */
    [[nodiscard]] auto get_aligned_size() const -> U32 { return _alignedSize; }

    /** Get the buffer size */
    [[nodiscard]] auto get_buffer_size() const -> U64 { return _data.size(); }

    /** Get the buffer offset */
    [[nodiscard]] auto get_buffer_offset() const -> U64 { return 0; }

    /** Update the buffer */
    void update() { wgpu_write_buffer(_buffer, 0, _data.data(), _data.size()); }

    /** Get the data element offset from the start of the buffer */
    auto get_dyn_offset(U32 idx) const -> U32 { return _alignedSize * idx; }

    auto as_ubo(Bool frag = false) const -> BindEntry {
        BindEntry entry;
        entry.type = BIND_ENTRY_BUFFER;
        entry.buffer_usage = wgpu::BufferUsage::CopyDst | wgpu::BufferUsage::Uniform;
        entry.stage = frag ? wgpu::ShaderStage::Fragment : wgpu::ShaderStage::Vertex;
        entry.buffer = _buffer;
        entry.dynamic = _isDynamic;
        entry.buffer_size = _isDynamic ? _elemSize : _elemSize * _elemCount;
        entry.buffer_offset = 0;
        return entry;
    };

    auto as_frag_ubo() const -> BindEntry { return as_ubo(true); }
    auto as_sto(Bool readonly = true, Bool frag = false) const -> BindEntry {
        BindEntry entry;
        entry.type = BIND_ENTRY_BUFFER;
        entry.buffer_usage = wgpu::BufferUsage::CopyDst | wgpu::BufferUsage::Storage;
        entry.stage = frag ? wgpu::ShaderStage::Fragment : wgpu::ShaderStage::Vertex;
        entry.buffer = _buffer;
        entry.dynamic = _isDynamic;
        entry.buffer_size = _isDynamic ? _elemSize : _elemSize * _elemCount;
        entry.buffer_offset = 0;
        entry.read_only = readonly;
        return entry;
    };

    /** get the data pointer */
    template <typename T> auto data(U32 idx = 0) -> T& {
        NVCHK(idx < _elemCount, "Out of range access in DynamicBuffer");
        NVCHK(sizeof(T) == 1 || sizeof(T) == _elemSize,
              "data(): Invalid type size {}!={}", sizeof(T), _elemSize);
        T* obj = (T*)((U8*)_data.data() + (size_t)(_alignedSize * idx));
        return *obj;
    }

    /** Fill the data memory with content */
    template <typename T> void set_data(const T& rhs) {
        NVCHK(sizeof(T) == _elemSize, "set_data(T): Invalid type size {}!={}",
              sizeof(T), _elemSize);
        U8* ptr = _data.data();
        for (U32 i = 0; i < _elemCount; ++i) {
            // Cast to our structure type:
            T* obj = (T*)ptr;
            // Write the data:
            *obj = rhs;
            // Move to next location:
            ptr += _alignedSize;
        }
    }

    /** Set the data from an array */
    template <typename T> void set_data(const Vector<T>& vec) {
        NVCHK(sizeof(T) == _elemSize, "set_data(Vector): Invalid type size {}!={}",
              sizeof(T), _elemSize);
        U8* ptr = _data.data();
        NVCHK(vec.size() == _elemCount, "Unexpected vector length.");
        for (U32 i = 0; i < _elemCount; ++i) {
            // Cast to our structure type:
            T* obj = (T*)ptr;
            // Write the data:
            *obj = vec[i];
            // Move to next location:
            ptr += _alignedSize;
        }
    }

    /** Set data from a vector */
    template <typename T> void set_data(const Vector<T>& vec, U32 offset) {
        NVCHK(sizeof(T) == _elemSize,
              "set_data(Vector, offset): Invalid type size {}!={}", sizeof(T),
              _elemSize);
        U8* ptr = _data.data();
        NVCHK(vec.size() >= (_elemCount + offset), "unexpected number of elements");
        for (U32 i = 0; i < _elemCount; ++i) {
            // Cast to our structure type:
            T* obj = (T*)ptr;
            // Write the data:
            *obj = vec[offset + i];
            // Move to next location:
            ptr += _alignedSize;
        }
    }

  protected:
    /** Create the containers for this object */
    void init(U32 elemCount, U32 alignment, U32 elemSize) {
        // Create the buffer:
        _elemCount = elemCount;
        _elemSize = elemSize;
        _isDynamic = elemCount > 1 && alignment == 0;

        // We need to figure out the total size of the buffer:
        U32 align_offset = alignment;
        if (align_offset == 0) {
            // Retrieve the buffer alignment from the engine:
            align_offset = wgpu_get_buffer_offset_alignment();
        }

        // Compute our structure aligned size:
        _alignedSize = get_aligned_element_size(_elemSize, align_offset);
        logDEBUG("Computed alignSize: {}", _alignedSize);

        // Compute the buffer total size:
        // The last element doesn't require any special alignment consideration:
        U32 buf_size = _elemSize + (elemCount - 1) * _alignedSize;
        logDEBUG("Computed buf_size: {}", buf_size);

        // Prepare the storage:
        _data.resize(buf_size, '\0');
        NVCHK(_data.size() == buf_size, "Invalid data size.");
    }

    /** Create and fill the initial buffer */
    void create_buffer(wgpu::BufferUsage usage) {
        _buffer = wgpu_create_buffer(_data.data(), _data.size(), usage);
    }

    /** The data object */
    Vector<U8> _data;

    /** The WGPU buffer */
    wgpu::Buffer _buffer{};

    /** Aligned element size */
    U32 _alignedSize{0};

    /** Element count */
    U32 _elemCount{0};

    /** Element size */
    U32 _elemSize{0};

    /** Dynamic offset flags */
    Bool _isDynamic{false};
};
</code>
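Before getting back to the producer itself, here is a quick usage sketch for the class above: one dynamically-offset uniform element per worker thread. ''MAX_THREADS'' is an arbitrary value here, ''GetThreadId()'' is the helper from the side note earlier, and the actual fields of ''UniformData'' are not shown in this article:

<code cpp>
// Sketch: a per-thread uniform buffer with dynamic offsets (alignment=0 => dynamic).
const U32 MAX_THREADS = 8;
DynamicBufferBase unifBuffer(MAX_THREADS, 0, sizeof(UniformData),
                             wgpu::BufferUsage::Uniform | wgpu::BufferUsage::CopyDst);

// Fill the element owned by the current thread and upload the CPU-side storage:
U32 tid = (U32)GetThreadId();
UniformData& uni = unifBuffer.data<UniformData>(tid);
// ... fill the fields of 'uni' here ...
unifBuffer.update();

// Describe it as a (dynamic) uniform buffer binding for the fragment stage, and
// retrieve the dynamic offset to use when binding for this thread:
BindEntry entry = unifBuffer.as_ubo(true);
U32 dynOffset = unifBuffer.get_dyn_offset(tid);
</code>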
===== Back to ElevationProducer render pass =====

=> **Crap...** 😅 I'm not even done with this initial render pass setup and my brain is already on the next step: considering the replacement of the render pipeline I'm building right now with a compute pipeline instead => Sometimes it's really a pain when you realize your brain is going faster than your hands 🤣.

For the moment I was just preparing a **RenderNode** in the ElevationProducer constructor as follows:

<code cpp>
// Use a shared render node to create a single pipeline:
_upsampleNode = create_ref_object<WGPURenderNode>(0, "proland/upsample");
_upsampleNode->set_blend_state({false});
_upsampleNode->set_depth_state({false});

// Define the binding group 0 layout:
// Containing a Sampler, the noise texture
// (Tex2dArray) and the parent coarse texture
_upsampleNode->define_grp(DefSampler, DefTex2dArray, DefTex2dArray);

// Define the binding group 1 layout:
// Containing our uniform buffer data and the residual texture (Tex2d)
_upsampleNode->define_grp(sizeof(UniformData), DefTex2d);

// Next we should generate as many bind groups for the binding 0 as we have
// textures in the GPUTileStorage:
auto* gpustorage = storage->as_gpu_tile_storage();
NVCHK(gpustorage != nullptr, "Invalid GPUTileStorage");
U32 ntexs = gpustorage->get_num_textures();
for (U32 i = 0; i < ntexs; ++i) {
    auto bindgrp = _upsampleNode->create_bind_grp(
        0, BindLinearSampler, BindTexture{_noiseTexture},
        BindTexture{gpustorage->get_texture(i)});
    _bind0Groups.push_back(bindgrp);
}

// Get the render pipeline from the RenderNode:
_upsamplePipeline = _upsampleNode->get_pipeline({TextureFormat::RGBA32Float});
</code>

And next, for each worker thread, I was allocating a dedicated render pass and additional resources:

<code cpp>
auto ElevationProducer::get_or_create_thread_data() -> const RefPtr& {
    I32 threadId = get_thread_index();
    auto& slot = _threadData[threadId];
    if (slot == nullptr) {
        logNOTE("ElevationProducer: creating specific data for thread {}", threadId);
        slot = create_ref_object();
        I32 tileWidth = (I32)_cache->get_storage()->get_tile_size();

        auto* eng = WGPUEngine::instance();

        // Create the residual texture:
        slot->residualTexture = eng->create_texture_2d(
            tileWidth, tileWidth, TextureFormat::R32Float,
            TextureUsage::TextureBinding | TextureUsage::CopyDst);

        // Add a dedicated render pass to render our data
        slot->demTexture = eng->create_texture_2d(
            tileWidth, tileWidth, TextureFormat::RGBA32Float,
            TextureUsage::RenderAttachment | TextureUsage::CopySrc);

        const RenderPassDesc& desc = {.width = (U32)tileWidth,
                                      .height = (U32)tileWidth,
                                      .with_depth = false,
                                      .swapchain_idx = -1,
                                      .clear_color = true,
                                      .clear_depth = true,
                                      .render_to_texture = true,
                                      .target_texture = slot->demTexture};
        slot->renderPass = create_ref_object<WGPURenderPass>(desc);

        // Create the bind1 group:
        // Note: Accessing the BindGroupBuilder to create a new bind group is
        // currently not thread safe, so we need to protect this section:
        WITH_SPINLOCK(_sp);
        slot->bind1Group = _upsampleNode->create_bind_grp(
            1, slot->unifBuffer.as_ubo(true), BindTexture{slot->residualTexture});
    }

    return slot;
}
</code>

=> Some of this code could stay as is, but I think the **RenderNode** should be replaced with a **ComputeNode** and the **RenderPass** should be replaced with a **ComputePass**. Which all sounds like a great idea, except that I have not implemented those classes yet lol. So more work ahead unfortunately 😇.
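For reference, the per-thread data object returned by ''get_or_create_thread_data()'' above would look roughly like this (the field names come from the code above, but the struct name and the exact field types are my reconstruction, not the actual declaration):

<code cpp>
// Rough shape of the per-thread resources (reconstruction, types assumed):
struct ElevationThreadData : public RefObject {
    wgpu::Texture residualTexture;     // R32Float input tile, uploaded from CPU data
    wgpu::Texture demTexture;          // RGBA32Float target of the upsample render pass
    RefPtr<WGPURenderPass> renderPass; // dedicated pass rendering into demTexture
    DynamicBufferBase unifBuffer;      // per-thread UniformData (bound via as_ubo(true))
    wgpu::BindGroup bind1Group;        // or the engine's own bind-group wrapper type
};
</code>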
===== ComputePass & ComputeNode implementation =====

So here is the initial version of my ComputePass class (pretty small compared to the RenderPass):

<code cpp>
class NVGPU_EXPORT WGPUComputePass : public WGPUObject {
    NV_DECLARE_NO_COPY(WGPUComputePass)
    NV_DECLARE_NO_MOVE(WGPUComputePass)

  public:
    // Definition of a computation step:
    struct ComputeStep {
        wgpu::ComputePipeline pipeline;
        Vector bindGroups;
        U32 countX{1};
        U32 countY{1};
        U32 countZ{1};
    };

    WGPUComputePass();
    explicit WGPUComputePass(const ComputePassDesc& desc);
    ~WGPUComputePass() override;

    class NVGPU_EXPORT EncodeHandler : public RefObject {
      public:
        virtual void operator()(wgpu::ComputePassEncoder& pass) = 0;
    };

    template <typename F> class EncodeFunctor : public EncodeHandler {
        F _func;

      public:
        explicit EncodeFunctor(F func) : _func(func){};
        void operator()(wgpu::ComputePassEncoder& pass) override { _func(pass); }
    };

    /** Specify the encoding function */
    auto set_encode_handler(RefPtr<EncodeHandler> handler) -> WGPUComputePass& {
        _encodeHandler = std::move(handler);
        return *this;
    }

    template <typename F> auto set_encode_func(F func) -> WGPUComputePass& {
        auto handler = nv::create_ref_object<EncodeFunctor<F>>(func);
        return set_encode_handler(handler);
    }

    /** Encode this compute pass */
    void encode(const wgpu::CommandEncoder& encoder);

    /** Add a compute preprocessor definition */
    auto add_def(String def) -> WGPUComputePass& {
        _computeDefinitions.insert(std::make_pair(std::move(def), "1"));
        return *this;
    }

    auto add_compute_step(ComputeStep step) -> WGPUComputePass& {
        _computeSteps.emplace_back(std::move(step));
        return *this;
    }

    // auto add_compute_steps(WGPUComputeNode& cnode) -> WGPUComputePass& {
    //     return *this;
    // }

    void execute();

  protected:
    /** Desc for this compute pass */
    ComputePassDesc _desc;

    /** ComputePass descriptor */
    wgpu::ComputePassDescriptor _descriptor;

    /** Encode handler */
    RefPtr<EncodeHandler> _encodeHandler;

    // Definition map:
    Preprocessor::DefMap _computeDefinitions;

    // List of computation steps to be executed:
    Vector<ComputeStep> _computeSteps;
};
</code>

The **ComputeNode** was also relatively simple to implement, and on top of that I also had to create the **ComputePipelineBuilder** class and add a couple of functions in the WGPUEngine class, but nothing too fancy really:

<code cpp>
// Create a Compute pipeline:
auto create_compute_pipeline(const wgpu::ComputePipelineDescriptor& desc)
    -> wgpu::ComputePipeline;

// Build a default compute pipeline:
auto build_compute_pipeline(const String& compute_code,
                            Preprocessor::DefMap* computeDefs = nullptr,
                            bool reset_all = true,
                            ShaderCodeType ctype = CODE_WGSL)
    -> WGPUComputePipelineBuilder&;
</code>

===== Simple unit tests for compute pipelines =====

Now preparing some minimal unit tests for the compute pipelines. And the first thing I realize is that in the unit tests, when building a render node or a compute node, it might be handy to be able to pass WGSL code directly instead of providing a filename containing that code. Let's see if there is an easy way to achieve this.
**OK**: I have now added this function:

<code cpp>
auto WGPUEngine::get_shader_code(const String& shader_file) -> String {
    // This "shader_file" might actually contain some code directly:
    if (starts_with(shader_file, "WGSL:")) {
        // Just remove this "WGSL:" prefix:
        return shader_file.substr(5);
    }

    String sFile = shader_file;
    if (get_path_extension(sFile).empty()) {
        sFile += ".wgsl";
    }

    validate_resource_path(WGPUEngine::cat_shader, sFile);
    return read_file(sFile.c_str());
}
</code>

And I'm using it in the **RenderNode** and the **ComputeNode** just before building our pipelines:

<code cpp>
// In the RenderNode:
logDEBUG("Building pipeline...");
String vs_code = eng->get_shader_code(_shaderDesc.vertFile);
String fs_code = _shaderDesc.vertFile == _shaderDesc.fragFile
                     ? vs_code
                     : eng->get_shader_code(_shaderDesc.fragFile);

// In the ComputeNode:
logDEBUG("Building pipeline...");
String cs_code = eng->get_shader_code(_shaderDesc.computeFile);
</code>

=> Now all I need to do is to prefix any WGSL code I want to use directly with the string **"WGSL:"** 👍! So I can create a test compute node with this:

<code cpp>
// Prepare a ComputeNode:
String code = R"(WGSL:
@compute @workgroup_size(32) fn main() {
    // Compute stuff
}
)";

WGPUComputeNode cnode(code);
</code>

Next I need to add some storage bindings.

**Note**: in an attempt to automate a bit the handling of the "ShaderStage" when preparing binding groups for a compute pipeline, I'm now specifying the "target pipeline" as part of the constructor of the WGPUBindGroupBuilder:

<code cpp>
explicit WGPUBindGroupBuilder(TargetPipeline tgt = TGT_PIPELINE_RENDER);
</code>

(And obviously, we will set this to ''TGT_PIPELINE_COMPUTE'' when creating a ComputeNode.)

And I also updated ''define_grp()'' to accept BindEntry objects, so the following code works to define the binding group 0:

<code cpp>
WGPUComputeNode cnode(code);

U32 num = 64;
DynamicBufferBase input(num, 1, sizeof(F32), BufferUsage::Storage);
DynamicBufferBase output(num, 1, sizeof(F32), BufferUsage::Storage);

// Fill the input buffer:
F32* ptr = input.ptr();
for (U32 i = 0; i < num; ++i) {
    (*ptr++) = (F32)i / 10.0F;
}
input.update();

// We need to define some minimal bindings here:
cnode.define_grp(input.as_sto(), output.as_rw_sto());
</code>

Continuing on the implementation, I added the possibility to define a "compute step" in the ComputeNode with the ''add()'' method [similar to what we do in the RenderNode]:

<code cpp>
auto add(WorkDim dims, Vector grps, DynOffsets dynOffsets) -> WGPUComputeNode&;
auto add(WorkDim dims, Vector grps) -> WGPUComputeNode&;
</code>

and also introduced support to "add" a ComputeNode inside a ComputePass:

<code cpp>
auto WGPUComputePass::add(WGPUComputeNode& cnode) -> WGPUComputePass& {
    cnode.collect_steps(_computeSteps, &_computeDefinitions);
    return *this;
}
</code>

So here is what my initial compute unit test looks like now:

<code cpp>
WGPUComputeNode cnode(code);

U32 num = 64;
DynamicBufferBase input(num, 1, sizeof(F32), BufferUsage::Storage);
DynamicBufferBase output(num, 1, sizeof(F32), BufferUsage::Storage);

// Fill the input buffer:
F32* ptr = input.ptr();
for (U32 i = 0; i < num; ++i) {
    (*ptr++) = (F32)i / 10.0F;
}
input.update();

// We need to define some minimal bindings here:
// Create a bind group layout:
cnode.define_grp(input.as_sto(), output.as_rw_sto());

// Create a bind group:
// cnode.bind(0, input.as_sto(), output.as_rw_sto());
auto grp0 = cnode.create_bind_grp(0, input.as_sto(), output.as_rw_sto());

// Add a compute step:
cnode.add({64}, {grp0});

// Check that we can get a pipeline from that node:
BOOST_CHECK(cnode.get_pipeline() != nullptr);

// Add the compute node to the compute pass:
cpass->add(cnode);

// Release the pass:
cpass.reset();
</code>
Next, time to actually execute this compute pass somehow! I already have a simple encode function I derived from the RenderPass class, but now I realize I don't have support for dynamic offsets in this simplified version. Let's try to add this back. OK! Here is the updated version, which I think will work 😅:

<code cpp>
void WGPUComputePass::encode(const wgpu::CommandEncoder& encoder) {
    ComputePassEncoder pass = encoder.BeginComputePass(&_descriptor);

    if (_encodeHandler != nullptr) {
        (*_encodeHandler)(pass);
    }

    // Execute the compute steps
    ComputePipeline currentPipeline = nullptr;
    const Vector* currentBinds = nullptr;

    for (const auto& step : _computeSteps) {
        // Set the pipeline:
        if (step.pipeline.Get() != currentPipeline.Get()) {
            currentPipeline = step.pipeline;
            pass.SetPipeline(currentPipeline);
        }

        // Set the bindings:
        U32 idx = 0;
        U32 curSize = currentBinds == nullptr ? 0 : currentBinds->size();
        for (const auto& it : step.bindGroups) {
            if (idx >= curSize || it.group.Get() != (*currentBinds)[idx].group.Get() ||
                !all_equal(it.dynamic_offsets, (*currentBinds)[idx].dynamic_offsets)) {
                pass.SetBindGroup(idx, it.group, it.dynamic_offsets.size(),
                                  it.dynamic_offsets.empty()
                                      ? nullptr
                                      : it.dynamic_offsets.data());
            }
            ++idx;
        }
        currentBinds = &step.bindGroups;

        // Dispatch the computation:
        pass.DispatchWorkgroups(step.dims.x, step.dims.y, step.dims.z);
    }

    pass.End();
}
</code>

Alright! So now we can execute our ComputePass, just by calling the execute method 😎:

<code cpp>
// Create a bind group layout:
cnode.define_grp(input.as_sto(), output.as_rw_sto());

// Create a bind group:
// cnode.bind(0, input.as_sto(), output.as_rw_sto());
auto grp0 = cnode.create_bind_grp(0, input.as_sto(), output.as_rw_sto());

// Add a compute step:
// Our buffer is of size 64, and we have work groups of size 32, so we need 2
// groups on X
cnode.add({2}, {grp0});

// Check that we can get a pipeline from that node:
BOOST_CHECK(cnode.get_pipeline() != nullptr);

// Add the compute node to the compute pass:
cpass->add(cnode);

// Execute the pass:
cpass->execute();
</code>

Which means that in theory we should now have the appropriate results in the output buffer, since we have updated the shader code to handle those storage bindings:

<code>
@group(0) @binding(0) var<storage, read> inputBuffer: array<f32>;
@group(0) @binding(1) var<storage, read_write> outputBuffer: array<f32>;

@compute @workgroup_size(32) fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    // Compute stuff
    outputBuffer[id.x] = inputBuffer[id.x] * 2.0;
}
</code>

=> So how could we read the data back? First trying the simplest option:

<code cpp>
// Should have updated the output buffer now:
auto buf = output.buffer();
buf.MapAsync(
    wgpu::MapMode::Read, 0, buf.GetSize(),
    [](WGPUBufferMapAsyncStatus status, void*) {
        BOOST_CHECK_EQUAL(status, WGPUBufferMapAsyncStatus_Success);
    },
    nullptr);

eng->wait_idle();

const F32* data2 = (F32*)buf.GetConstMappedRange(0);
BOOST_REQUIRE(data2 != nullptr);
for (U32 i = 0; i < num; ++i) {
    F32 val = (*data2++);
    BOOST_CHECK_EQUAL(val, (F32)i * 2.0F / 10.0F);
}
</code>

But this failed because I don't have the MapRead usage on the output buffer:

<code>
2023-09-13 20:40:17.309170 [ERROR] Dawn: Validation error: The buffer usages (BufferUsage::(CopyDst|Storage|BufferUsage::80000000)) do not contain BufferUsage::MapRead.
 - While calling [Buffer].MapAsync(MapMode::Read, 0, 256, ...).

error: in "compute/test_simple_buffer_compute": check status == WGPUBufferMapAsyncStatus_Success has failed [1 != 0]
</code>
=> Let's try to add this flag (but I don't expect this to work, actually, since I read that MapRead is not compatible with the Storage flag). And as expected, this failed:

<code>
D:/Projects/NervLand/tests/test_nvGPU/compute_spec.cpp(892023-09-13 20:44:08.647824 [ERROR] Dawn: Validation error: Buffer usages (BufferUsage::(MapRead|CopyDst|Storage)) is invalid. If a buffer usage contains BufferUsage::MapRead the only other allowed usage is BufferUsage::CopyDst.
 - While calling [Device].CreateBuffer([BufferDescriptor]).
</code>

So now, let's create a third buffer that we will use to copy the data back to the CPU:

<code cpp>
U32 num = 64;
DynamicBufferBase input(num, 1, sizeof(F32), BufferUsage::Storage);
DynamicBufferBase output(num, 1, sizeof(F32),
                         BufferUsage::Storage | BufferUsage::CopySrc);
DynamicBufferBase staging(num, 1, sizeof(F32),
                          BufferUsage::CopyDst | BufferUsage::MapRead);
</code>

And **bingo!** Now we have a first working unit test on the compute pipeline usage:

<code cpp>
cnode.add({2}, {grp0});

// Check that we can get a pipeline from that node:
BOOST_CHECK(cnode.get_pipeline() != nullptr);

// Add the compute node to the compute pass:
cpass->add(cnode);

// Execute the pass:
// cpass->execute();
auto& bld = eng->build_commands();
bld.execute_compute_pass(*cpass);
bld.copy_buffer_to_buffer(output.buffer(), staging.buffer(), 0, 0,
                          output.get_buffer_size());
bld.submit(false);

// Should have updated the output buffer now:
const auto& buf = staging.buffer();
buf.MapAsync(
    wgpu::MapMode::Read, 0, buf.GetSize(),
    [](WGPUBufferMapAsyncStatus status, void*) {
        BOOST_CHECK_EQUAL(status, WGPUBufferMapAsyncStatus_Success);
    },
    nullptr);

eng->wait_idle();

const F32* data2 = (F32*)buf.GetConstMappedRange(0);
BOOST_REQUIRE(data2 != nullptr);
for (U32 i = 0; i < num; ++i) {
    F32 val = (*data2++);
    BOOST_CHECK_EQUAL(val, (F32)i * 2.0F / 10.0F);
}
</code>

===== Improvements on the buffers management =====

=> I then introduced **GPUBuffer** as a base class for the **DynamicBufferBase** (which I should probably rename, in fact 🤔). Actually, I just decided to rename **DynamicBufferBase** to **GPUBufferProxy**: I think that's a better choice!

**Problem**: How to set up the usage of capturing lambdas to perform an async read on a buffer, when the MapAsync() method only expects a function pointer and a userdata pointer? The following will not work:

<code cpp>
// asynchronous read of this buffer:
template <typename Func>
void read_async(Func func, U64 offset = 0, U64 size = WGPU_WHOLE_MAP_SIZE) {
    NVCHK((_buffer.GetUsage() & wgpu::BufferUsage::MapRead) != 0,
          "Cannot read from non MapReadable buffer");
    if (size != WGPU_WHOLE_MAP_SIZE) {
        NVCHK(offset + size <= _buffer.GetSize(), "Out of range buffer mapping");
    }

    _buffer.MapAsync(
        wgpu::MapMode::Read, offset, size,
        [this, func, offset, size](WGPUBufferMapAsyncStatus status, void*) {
            NVCHK(status == WGPUBufferMapAsyncStatus_Success,
                  "Failed to map-read buffer");
            const void* ptr = _buffer.GetConstMappedRange(offset, size);
            NVCHK(ptr != nullptr, "Invalid const mapped pointer.");
            func(ptr);
        },
        nullptr);
}
</code>

**Disclaimer**: I'm pretty sure you could achieve that almost out of the box by using an **std::function**, but I would like to avoid using this if possible 😁.

=> OK, I came up with something on this, and now I could add the methods ''read_async()'' and ''read_sync()'' to the GPUBuffer.
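One classic way to solve this is the C-callback trampoline: move the capturing lambda into a heap-allocated context, pass that context as the **userdata** pointer, and use a non-capturing lambda (which does decay to a plain function pointer) as the actual callback. Here is a minimal standalone sketch of that pattern (illustrative names, not necessarily the exact code I ended up with in GPUBuffer):

<code cpp>
// Minimal sketch of the trampoline pattern: the capturing lambda is moved into a
// heap-allocated context, the context pointer becomes the 'userdata', and a
// non-capturing lambda (convertible to a plain function pointer) is the C callback.
template <typename Func> struct MapReadContext {
    wgpu::Buffer buffer;
    Func func;
    U64 offset;
    U64 size;
};

template <typename Func>
void map_read_with_lambda(wgpu::Buffer buffer, Func func, U64 offset = 0,
                          U64 size = WGPU_WHOLE_MAP_SIZE) {
    auto* ctx = new MapReadContext<Func>{std::move(buffer), std::move(func), offset, size};
    ctx->buffer.MapAsync(
        wgpu::MapMode::Read, offset, size,
        [](WGPUBufferMapAsyncStatus status, void* userdata) {
            auto* c = (MapReadContext<Func>*)userdata;
            if (status == WGPUBufferMapAsyncStatus_Success) {
                // Hand the mapped pointer to the user-provided lambda:
                c->func(c->buffer.GetConstMappedRange(c->offset, c->size));
            }
            delete c;
        },
        ctx);
}
</code>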
And I can now replace some of the test code like this:

<code cpp>
staging.read_async([](const void* data) { BOOST_CHECK(data != nullptr); });

const auto& buf = staging.buffer();
buf.MapAsync(
    wgpu::MapMode::Read, 0, buf.GetSize(),
    [](WGPUBufferMapAsyncStatus status, void*) {
        BOOST_CHECK_EQUAL(status, WGPUBufferMapAsyncStatus_Success);
    },
    nullptr);

eng->wait_idle();

const F32* data2 = (F32*)staging.buffer().GetConstMappedRange(0);
</code>

with simply:

<code cpp>
const F32* data2 = (F32*)staging.read_sync();
</code>
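And for completeness, ''read_sync()'' itself can be little more than a blocking wrapper around ''read_async()''. Something along these lines (a sketch, assuming ''process_events()'' pumps the pending Dawn callbacks in the same way ''wait_idle()'' does above):

<code cpp>
// Sketch: a blocking read built on top of the async mapping.
auto GPUBuffer::read_sync(U64 offset, U64 size) -> const void* {
    const void* result = nullptr;
    std::atomic<bool> done(false);
    read_async(
        [&](const void* ptr) {
            result = ptr; // The mapped range stays valid until Unmap() is called.
            done = true;
        },
        offset, size);

    // Spin until the map-read callback has been delivered:
    while (!done.load()) {
        WGPUEngine::instance()->process_events();
        sleep_us(100);
    }
    return result;
}
</code>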