====== NervLand: Introducing support for push constants ====== {{tag>dev cpp 3D nervland vulkan SDL}} Continuing our journey in Vulkan, this time our target will be to introduce support for push constants in our command buffers. That sounded simple at first, but when I started thinking about it, I realized it might not be //that// simple in fact. Let's check why... ====== ====== ===== Simple support for push constant ===== * First, let's make it ultra simple and just introduce the minimal changes in our current VulkanApp to use push constants: we will simply record them in our existing command buffers, to provide, say, 2 vec2 components for the position offset and scale factors. * But first of all let's read our max size for push constants: local pdev = self.vkeng:get_physical_device() local props = pdev:get_properties() logDEBUG("Max push constant size: ", props.limits.maxPushConstantsSize, " bytes") * And the result of that is really unimpressive (even on my RTX 2080): Max push constant size: 256 bytes * => We won't go very far with that I'm afraid. 
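Since that limit is so small, it can be worth checking a range against it before building the pipeline layout. Here is a minimal sketch of such a check, based on the rules the Vulkan spec gives for ''VkPushConstantRange'' (offset and size are multiples of 4, size is non-zero, and the range fits below ''maxPushConstantsSize''). The helper name ''is_valid_push_range'' is made up for illustration and is not part of NervLand:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of NervLand): checks a push constant range
// against the VkPushConstantRange rules from the Vulkan spec.
constexpr bool is_valid_push_range(std::uint32_t offset, std::uint32_t size,
                                   std::uint32_t maxPushConstantsSize) {
    if (offset % 4 != 0 || size % 4 != 0) return false;  // both must be multiples of 4
    if (size == 0) return false;                         // empty ranges are not allowed
    if (offset >= maxPushConstantsSize) return false;    // offset must be below the limit
    return size <= maxPushConstantsSize - offset;        // range must fit in what is left
}

// The 32-byte range used in this post fits easily in the 256 bytes reported above,
// and also in the 128 bytes that the spec guarantees as a minimum.
static_assert(is_valid_push_range(0, 32, 256), "range used in this post");
static_assert(is_valid_push_range(0, 32, 128), "still fine with the guaranteed minimum");
```

With 256 bytes we could store at most sixteen vec4 values (or four mat4), so anything bigger has to go through uniform or storage buffers.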
* Anyway, we thus create a single push constant range in the pipeline layout: -- Define a single push constant range: pc:setStageFlags(vk.ShaderStageFlagBits.VERTEX_BIT + vk.ShaderStageFlagBits.FRAGMENT_BIT) pc:setOffset(0) pc:setSize(32) * Then we need to record a **vkCmdPushConstants** command, so let's add that one: void VulkanCommandBuffer::write_push_contants(VkPipelineLayout layout, VkShaderStageFlags stages, U32 offset, U32 size, const void* data) { ASSERT(is_recording()); _device->vkCmdPushConstants(_buffer, layout, stages, offset, size, data); } ===== Introducing support for ByteArray ===== * Next, we need a convenient way to write (and read) arbitrary data into a byte buffer, so let's build a **ByteArray** class to support this (with usage from lua in mind) * => So here is the initial version of my ByteArray class: class NVCORE_EXPORT ByteArray { public: ByteArray(); explicit ByteArray(U64 size, U8 defVal = 0); ByteArray(const ByteArray& rhs); ByteArray(ByteArray&& rhs) noexcept; auto operator=(const ByteArray& rhs) -> ByteArray&; auto operator=(ByteArray&& rhs) noexcept -> ByteArray&; virtual ~ByteArray(); /** Retrieve the size of this array */ [[nodiscard]] auto get_size() const -> U64 { return _data.size(); } /** Retrieve the data pointer */ [[nodiscard]] auto get_data(U64 offset = 0) const -> const U8* { return _data.data() + offset; } [[nodiscard]] auto get_data(U64 offset = 0) -> U8* { return _data.data() + offset; } /** Get the current position */ [[nodiscard]] auto get_position() const -> U64 { return _position; } /** Set the current position */ void set_position(U64 pos) { CHECK(pos < _data.size(), "out of range position."); _position = pos; }; /** Reset the position in this buffer. 
*/ void reset_position() { _position = 0; } /** Resize the array */ void resize(U64 newSize) { _data.resize(newSize); } /** Set a given value at a given byte position */ void write_data(const U8* data, U64 dataSize, U64 idx = U64_MAX); void write_u8(U8 value, U64 idx = U64_MAX); void write_i8(I8 value, U64 idx = U64_MAX); void write_u16(U16 value, U64 idx = U64_MAX); void write_i16(I16 value, U64 idx = U64_MAX); void write_u32(U32 value, U64 idx = U64_MAX); void write_i32(I32 value, U64 idx = U64_MAX); void write_u64(U64 value, U64 idx = U64_MAX); void write_i64(I64 value, U64 idx = U64_MAX); void write_f32(F32 value, U64 idx = U64_MAX); void write_f64(F64 value, U64 idx = U64_MAX); void write_vec2f(const Vec2f& value, U64 idx = U64_MAX); void write_vec3f(const Vec3f& value, U64 idx = U64_MAX); void write_vec4f(const Vec4f& value, U64 idx = U64_MAX); protected: /** Storage for the data of this byte array. */ nv::Vector<U8> _data; /** Current position in the buffer. */ U64 _position{0}; }; * Now, I'm wondering: could I just pass the **get_data()** pointer from lua directly to **write_push_contants()** 🤔 ? That would be amazing... Let's try it 😊. * => Whaoooo! That's actually just working out of the box! I can't believe it 😳! 
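The write functions are only declared above, so here is a small stand-alone sketch of how they could behave. It is a simplified stand-in, not the actual NervLand code: the class name ''MiniByteArray'', the ''read_f32'' helper and the ''kAppend'' constant are made up, and I assume that the default index (''U64_MAX'' in the real class) means "write at the current position and advance it", while an explicit index writes in place without moving the cursor:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Simplified stand-in for the ByteArray above (hypothetical, not the NervLand code).
class MiniByteArray {
  public:
    // Plays the role of the U64_MAX default index in the real class.
    static constexpr std::uint64_t kAppend = std::numeric_limits<std::uint64_t>::max();

    explicit MiniByteArray(std::uint64_t size) : _data(size, 0) {}

    std::uint64_t get_size() const { return _data.size(); }
    std::uint64_t get_position() const { return _position; }
    const std::uint8_t* get_data(std::uint64_t offset = 0) const { return _data.data() + offset; }

    // Assumed semantics: idx == kAppend writes at the cursor and advances it;
    // an explicit idx writes in place and leaves the cursor untouched.
    void write_data(const void* src, std::uint64_t n, std::uint64_t idx = kAppend) {
        const bool append = (idx == kAppend);
        const std::uint64_t pos = append ? _position : idx;
        assert(pos <= _data.size() && n <= _data.size() - pos && "write out of range");
        std::memcpy(_data.data() + pos, src, n);
        if (append) _position = pos + n;
    }

    void write_f32(float value, std::uint64_t idx = kAppend) {
        write_data(&value, sizeof(value), idx);
    }

    // Helper added for this sketch only, to read a float back.
    float read_f32(std::uint64_t idx) const {
        float v = 0.0f;
        std::memcpy(&v, _data.data() + idx, sizeof(v));
        return v;
    }

  private:
    std::vector<std::uint8_t> _data;
    std::uint64_t _position{0};
};
```

The pointer returned by ''get_data()'' can then be handed to anything expecting a ''const void*'' plus a size, which is all that **vkCmdPushConstants** needs.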
* I simply created a new shader file using those push constants: layout(push_constant) uniform Push { vec4 offset; vec4 scale; } push; #ifdef _VERTEX_ layout(location = 0) in vec2 position; layout(location = 1) in vec4 color; layout(location=0) out vec3 vertColor; void main() { gl_Position = vec4(position*push.scale.zw + push.offset.xy, 0.0, 1.0); vertColor = color.rgb; } #endif #ifdef _FRAGMENT_ layout(location=0) in vec3 fragColor; layout(location=0) out vec4 outColor; void main() { outColor = vec4(fragColor, 1.0); } #endif * Then in lua I create my push constants: -- We can also store our push constants array here: self.pushArr = nv.ByteArray(32) self.pushArr:write_vec4f(nv.Vec4f(0.5, 0.5, 0.0, 0.0)) self.pushArr:write_vec4f(nv.Vec4f(0.0, 0.0, 0.4, 0.4)) * Then I also update the pipeline layout accordingly: -- Prepare the push constant wrapper: local pc = nvk.VulkanPushConstantRange() -- Define a single push constant range: pc:setStageFlags(vk.ShaderStageFlagBits.VERTEX_BIT + vk.ShaderStageFlagBits.FRAGMENT_BIT) pc:setOffset(0) pc:setSize(32) -- 32 bytes = 2*4*4 => 2 vec4 lines * And finally I write the push constant data when recording the command buffers: -- add the push constants: cbuf:write_push_contants(playout, pstages, 0, 32, self.pushArr:get_data()) * And here is the result I get (which is what I was expecting, since I apply a scale of 0.4 and then a clip space offset of (0.5,0.5)): {{ blog:2022:1112:vulkan_01_push_constants.png?800 }} * The only slightly annoying point is that the LLS thinks that get_data() will return an integer: so let's fix that (it should return a "void" type here instead): **OK** fixed. ===== Recording command buffers every frame ===== * From the beginning of this post, what I had in mind was to try and use push constants to rotate my triangle on screen progressively. This implies that I will change a push constant angle or time value on every frame, and thus we need to record the main command buffer on each frame. 
* Question is: how to record a command buffer on each frame **efficiently** ? I'm thinking I should avoid using lua at all in callbacks that are executed on each frame... but maybe this perspective is incorrect and using lua here would not make a very big difference ? (=> To be tested one day) * A long time ago I also started investigating JIT compilation with the LLVM compiler: that was really interesting, but also quite complex and not working as expected in the end if I remember correctly (🤔 ?) * So what I'm thinking about now would be to create some kind of graph or "blueprint" system which could be used to generate a command buffer in C++ when executed: that graph could be assembled only once in lua and then reused on each frame to update the command buffers. * => First thing we really need here is the support to re-record a command buffer, so we add the reset bit on the command pool: -- Now we create a Command Pool on that family index: self.cmdpool = self.vkeng:create_command_pool(famIdx, vk.CommandPoolCreateFlagBits.RESET_COMMAND_BUFFER_BIT + vk.CommandPoolCreateFlagBits.TRANSIENT_BIT) logDEBUG("Created command pool"); * Next we should "reset" our current command buffers instead of re-creating them again: **OK** (in fact to reset the command buffer we simply start recording it again) * And now we will add a callback in the renderer to re-record the command buffer for each frame... * **Preliminary step**: I just refactored the Callback/LuaCallback implementation to completely hide the LuaCallback in the bindings. 
And now we create a "Callback" from lua before assigning it in the renderer: local cb = nv.Callback(function() self:recordCommandBuffers(rpass, vbuf, cfg) end) self.renderer:on_swapchain_updated(cb) * Next I can use the same design to provide the lua implementation for the **CmdBuffersProvider** interface: I start with the declaration of a custom constructor: auto _lunactr_CmdBuffersProvider(luna::LuaFunction& func) -> CmdBuffersProvider*; * And then we add the corresponding definition: struct LuaCmdBuffersProvider : public CmdBuffersProvider { NV_DECLARE_NO_COPY(LuaCmdBuffersProvider) NV_DECLARE_NO_MOVE(LuaCmdBuffersProvider) LuaCmdBuffersProvider(lua_State* L, I32 idx) : func(L, idx, true) {} ~LuaCmdBuffersProvider() override = default; void get_buffers(FrameDesc& fdesc, VkCommandBufferList& buffers) override { // Call the lua function: func(fdesc, buffers); } LuaFunction func; }; auto _lunactr_CmdBuffersProvider(luna::LuaFunction& func) -> CmdBuffersProvider* { auto cb = nv::create_ref_object<LuaCmdBuffersProvider>(func.state, func.index); return cb.release(); } * Now let's try to assign this provider in lua... okay, it works, but now I get about 2800fps instead of 3600fps in the previous version (ie. not rebuilding the buffer on each frame) * Here is the new function I'm using to generate a buffer for each frame: -- Generate the command buffers for a given frame description -- just before those buffers are submitted to the graphics queue ---@param fdesc nvk.FrameDesc Current frame description ---@param buffers nvk.VkCommandBufferList list of vk buffers. 
---@param rpass nvk.VulkanRenderPass Render pass to use ---@param vbuf nvk.VulkanVertexBuffer Vertex buffer object ---@param playout vk.PipelineLayout_T Pipeline layout ---@param cfg nvk.VulkanGraphicsPipelineCreateInfo create pipeline config function Class:generate_cmd_buffer(fdesc, buffers, rpass, vbuf, playout, cfg) -- logDEBUG("Should provide the cmd buffer for frame ", fdesc.frameNumber, " on image ", fdesc.swapchainImageIndex) local idx = fdesc.swapchainImageIndex -- Re-record the command buffer as above: local cbuf = self.cbufs:at(idx) local fbuf = self.renderer:get_swapchain_framebuffer(idx) -- Push constants stages: local pstages = vk.ShaderStageFlagBits.VERTEX_BIT + vk.ShaderStageFlagBits.FRAGMENT_BIT local width = self.renderer:get_swapchain_width() local height = self.renderer:get_swapchain_height() if self.width ~= width or self.height ~= height then -- We rebuild the pipeline with the correct size: self.width = width self.height = height -- Now we should update the viewport dimensions in our graphics pipeline config: local vp = cfg:getCurrentViewportState() vp:setViewport(width, height) if self.pipeline ~= nil then self.vkeng:remove_pipeline(self.pipeline) end self.pipeline = self.vkeng:create_graphics_pipeline(cfg, self.pipelineCache) end -- Check that we can reset the command buffer: fbuf:set_clear_color(0, 0.2, 0.2, 0.2, 1.0) -- fbuf:set_clear_color(0, 1, 1, 1, 1.0) -- cbuf:begin(vk.CommandBufferUsageFlagBits.ONE_TIME_SUBMIT_BIT) cbuf:begin(0) -- Begin rendering into the swapchain framebuffer: cbuf:begin_inline_pass(rpass, fbuf) -- Bind the graphics pipeline: cbuf:bind_graphics_pipeline(self.pipeline) -- Bind the vertex buffer: cbuf:bind_vertex_buffer(vbuf, 0) -- add the push constants: cbuf:write_push_contants(playout, pstages, 0, 32, self.pushArr:get_data()) -- Draw our triangle: cbuf:draw(3) -- End the render pass: cbuf:end_render_pass() -- Finish the command buffer: cbuf:finish() -- Add the buffer to the list: buffers:push_back(cbuf:getVk()) 
end * And here is the connection for the buffer provider: local func = function(fdesc, buffers) self:generate_cmd_buffer(fdesc, buffers, renderpass, vbuf, playout, cfg) end local prov = nvk.CmdBuffersProvider(func) self.renderer:set_cmd_buffer_provider(prov) * When testing directly from Saturn I get about **3800fps** with the previous version and **2900fps** with the continuous command buffer rebuild path: that's roughly a 23.7% performance loss, which is pretty large 🤣. ===== Implementing provider in C++ directly ===== * Now I'm wondering what kind of difference it would make to implement the provider directly in C++, so let's try that. * Here is the implementation I'm providing in C++ for this "**SimpleCmdBuffersProvider**": #include #include #include #include #include #include #include #include #include #include #include namespace nvk { SimpleCmdBuffersProvider::SimpleCmdBuffersProvider( VulkanRenderer* renderer, VulkanRenderPass* rpass, VulkanVertexBuffer* vbuf, VulkanPipelineLayout* playout, VulkanGraphicsPipelineCreateInfo* cfg, VulkanPipelineCache* pcache, const VulkanCommandBufferList& cbufs, const nv::ByteArray& pushArr) : _renderer(renderer), _pipelineCache(pcache), _rpass(rpass), _vbuf(vbuf), _playout(playout), _cfg(cfg), _cbufs(cbufs), _pushArr(pushArr) {} void SimpleCmdBuffersProvider::get_buffers(FrameDesc& fdesc, VkCommandBufferList& buffers) { // Write the command buffer: U32 idx = fdesc.swapchainImageIndex; // Re-record the command buffer as above: auto* cbuf = _cbufs[idx].get(); auto* fbuf = _renderer->get_swapchain_framebuffer(idx); // Push constants stages : U32 pstages = VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT; U32 width = _renderer->get_swapchain_width(); U32 height = _renderer->get_swapchain_height(); // Check if we need to rebuild the pipeline: if (_width != width || _height != height) { _cfg->getCurrentViewportState()->setViewport((float)width, (float)height); _pipeline = _renderer->get_device()->create_graphics_pipeline( 
_cfg->getVk(), _pipelineCache->getVk()); _width = width; _height = height; } fbuf->set_clear_color(0, 0.2, 0.2, 0.2, 1.0); cbuf->begin(0); // Begin rendering into the swapchain framebuffer: cbuf->begin_inline_pass(_rpass.get(), fbuf); // Bind the graphics pipeline: cbuf->push_bind_graphics_pipeline(_pipeline->getVk()); // Bind the vertex buffer: cbuf->bind_vertex_buffer(_vbuf.get(), 0); // add the push constants cbuf->write_push_contants(_playout->getVk(), pstages, 0, _pushArr.get_size(), _pushArr.get_data()); // Draw our triangle: cbuf->draw(3); // End the render pass cbuf->end_render_pass(); // Finish the command buffer: cbuf->finish(); // Add the buffer to the list: buffers.push_back(cbuf->getVk()); } } // namespace nvk * => With this version I can get about **3100fps**, which corresponds to a performance loss of "only" 18.4% compared to the initial version that does not re-record the buffers on each frame (**3800fps**): that's good, but not ultra impressive either. So it seems that the performance of luajit is pretty nice: we only lose about **5.3** percentage points (23.7% vs 18.4%) when using the luajit implementation instead of pure C++. * => Still, this is a path I think I need to investigate further: large "blocks" of code should be implemented in C++ and then assembled in Lua when possible... I still have this "fuzzy idea" about "blueprints" or other kinds of building block elements, or even LLVM JIT usage, which I need to investigate further (but let's keep that for our next post, shall we ?) 
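To make the arithmetic behind those percentages explicit, here is a tiny sketch that recomputes them from the three frame rates measured above (3800, 2900 and 3100 fps). The function names are made up for illustration; the second one converts a frame rate into a per-frame time, which is another way to look at the cost of re-recording:

```cpp
#include <cassert>

// Relative throughput loss, in percent, of `measured` against `baseline`.
constexpr double loss_percent(double baseline, double measured) {
    return (baseline - measured) / baseline * 100.0;
}

// Time spent per frame, in microseconds, for a given frame rate.
constexpr double frame_time_us(double fps) { return 1.0e6 / fps; }

// Frame rates reported in this post.
constexpr double kStaticFps = 3800.0;  // command buffers recorded once
constexpr double kLuaFps = 2900.0;     // re-recorded every frame from lua
constexpr double kCppFps = 3100.0;     // re-recorded every frame from C++

// loss_percent(kStaticFps, kLuaFps) is about 23.7
// loss_percent(kStaticFps, kCppFps) is about 18.4
// Extra time per frame: about 81.7 us from lua, about 59.4 us from C++.
constexpr double kLuaExtraUs = frame_time_us(kLuaFps) - frame_time_us(kStaticFps);
constexpr double kCppExtraUs = frame_time_us(kCppFps) - frame_time_us(kStaticFps);
```

Seen as time per frame, the lua path only costs around 22 µs more than the C++ one at these frame rates, and most of the overhead comes from the re-recording itself.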
===== Rotating our triangle with push constants ===== * Okay so now the final step I wanted to reach here: let's use the push constants to make our triangle rotate progressively 😉 * We simply send the current time as the z component of the first vector in the push constants as follows: // We update our push constants here to contain a time value: F32 time = (F32)fdesc.frameTime; // logDEBUG("Writing time value: {}", time); // We write the time as the z element of the first vec4 (byte offset 8): _pushArr.write_f32(time, 8); * Then I updated my shader to use that time value and build a rotation matrix with it: #define M_PI 3.1415926535897932384626433832795 void main() { // get the time value: float time = push.offset.z; // prepare rotation matrix: float theta = 2.0*M_PI*time/5.0; float ct = cos(theta); float st = sin(theta); mat2 rot = mat2(ct,st,-st,ct); // Rotate position: vec2 pos = position*rot; gl_Position = vec4(pos*push.scale.zw, 0.0, 1.0); // gl_Position = vec4(position*push.scale.zw + push.offset.xy, 0.0, 1.0); vertColor = color.rgb; } * And that did the job 😁: {{ blog:2022:1116:vulkan_rotating_triangle.gif?800 }} * => So now we are good on push constants usage 👍!
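As a sanity check of the two details above (the byte offset 8 used for the time value, and the rotation applied in the shader), here is a small CPU-side sketch. The ''PushBlock'' and ''Vec2'' structs and the ''rotate_like_shader'' function are made up for illustration, and I assume that ''frameTime'' is expressed in seconds. Note that in GLSL ''mat2(ct,st,-st,ct)'' is filled column by column, and ''position*rot'' treats the position as a row vector, so each output component is the dot product with one column:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// CPU-side mirror of the push constant block (two vec4 = 32 bytes).
struct PushBlock {
    float offset[4];  // offset.xy = clip space offset, offset.z = time
    float scale[4];   // scale.zw = scale factors
};
static_assert(sizeof(PushBlock) == 32, "two vec4 values");
static_assert(offsetof(PushBlock, offset) + 2 * sizeof(float) == 8, "offset.z is at byte 8");
static_assert(offsetof(PushBlock, scale) == 16, "second vec4 starts at byte 16");

struct Vec2 {
    float x, y;
};

// Same math as `position * rot` in the shader with rot = mat2(ct, st, -st, ct):
// columns are (ct, st) and (-st, ct), and a row vector times a matrix takes
// the dot product with each column.
inline Vec2 rotate_like_shader(Vec2 p, float time) {
    const float kPi = 3.14159265358979323846f;
    const float theta = 2.0f * kPi * time / 5.0f;  // one full turn every 5 time units
    const float ct = std::cos(theta);
    const float st = std::sin(theta);
    return {p.x * ct + p.y * st, -p.x * st + p.y * ct};
}
```

After a quarter of the period (''time = 1.25'') the point (1, 0) ends up at roughly (0, -1), and after a full period (''time = 5'') it is back at roughly (1, 0).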