
NervLand: Introducing support for push constants

Continuing our journey in Vulkan, this time our target will be to introduce support for push constants in our command buffers. That sounds simple at first, but when I started thinking about it, I realized it might not be that simple in fact. Let's check why…

  • First, let's make it ultra simple and just introduce the minimal changes in our current VulkanApp to use push constants: we will simply record them in our existing command buffers to provide, say, 2 vec2 components for the position offset and scale factors.
  • But before anything else, let's read our max size for push constants:
        local pdev = self.vkeng:get_physical_device()
        local props = pdev:get_properties()
        logDEBUG("Max push constant size: ", props.limits.maxPushConstantsSize, " bytes")
  • And the result of that is really unimpressive (even on my RTX 2080):
    Max push constant size: 256 bytes
  • ⇒ We won't go very far with that I'm afraid.
  • Anyway, we thus create a single push constant range in the pipeline layout:
        -- Define a single push constant range:
        pc:setStageFlags(vk.ShaderStageFlagBits.VERTEX_BIT + vk.ShaderStageFlagBits.FRAGMENT_BIT)
        pc:setOffset(0)
        pc:setSize(32)
  • Then we need to record a vkCmdPushConstants command, so let's add that one:
    void VulkanCommandBuffer::write_push_contants(VkPipelineLayout layout,
                                                  VkShaderStageFlags stages,
                                                  U32 offset, U32 size,
                                                  const void* data) {
        ASSERT(is_recording());
        _device->vkCmdPushConstants(_buffer, layout, stages, offset, size, data);
    }
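  • As a side note, vkCmdPushConstants comes with a few valid-usage rules that are cheap to check before recording: the offset and the size must both be multiples of 4, the size must be non-zero, and the range must fit inside maxPushConstantsSize. Here is a minimal sketch of such a check; the is_valid_push_range helper is hypothetical and not part of NervLand:

```cpp
#include <cstdint>

// Hypothetical helper (not part of NervLand): checks a push constant range
// against the valid-usage rules of vkCmdPushConstants.
// maxSize would come from VkPhysicalDeviceLimits::maxPushConstantsSize.
inline bool is_valid_push_range(std::uint32_t offset, std::uint32_t size,
                                std::uint32_t maxSize) {
    // The size must be non-zero.
    if (size == 0) return false;
    // Both the offset and the size must be multiples of 4.
    if (offset % 4 != 0 || size % 4 != 0) return false;
    // The range must start inside the limit...
    if (offset >= maxSize) return false;
    // ...and must not run past it.
    return size <= maxSize - offset;
}
```

  • With the 256 bytes reported above, is_valid_push_range(0, 32, 256) passes, while a range starting at byte 240 with the same 32 bytes would be rejected.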
  • Next, we need a convenient way to write (and read) arbitrary data into a byte buffer, so let's build a ByteArray class to support this (with usage from lua in mind).
  • ⇒ So here is the initial version of my ByteArray class:
    class NVCORE_EXPORT ByteArray {
      public:
        ByteArray();
        explicit ByteArray(U64 size, U8 defVal = 0);
    
        ByteArray(const ByteArray& rhs);
        ByteArray(ByteArray&& rhs) noexcept;
        auto operator=(const ByteArray& rhs) -> ByteArray&;
        auto operator=(ByteArray&& rhs) noexcept -> ByteArray&;
    
        virtual ~ByteArray();
    
        /** Retrieve the size of this array */
        [[nodiscard]] auto get_size() const -> U64 { return _data.size(); }
    
        /** Retrieve the data pointer */
        [[nodiscard]] auto get_data(U64 offset = 0) const -> const U8* {
            return _data.data() + offset;
        }
        [[nodiscard]] auto get_data(U64 offset = 0) -> U8* {
            return _data.data() + offset;
        }
    
        /** Get the current position */
        [[nodiscard]] auto get_position() const -> U64 { return _position; }
    
        /** Set the current position */
        void set_position(U64 pos) {
            CHECK(pos < _data.size(), "out of range position.");
            _position = pos;
        }
    
        /** Reset the position in this buffer. */
        void reset_position() { _position = 0; }
    
        /** Resize the array */
        void resize(U64 newSize) { _data.resize(newSize); }
    
        /** Set a given value at a given byte position */
        void write_data(const U8* data, U64 dataSize, U64 idx = U64_MAX);
    
        void write_u8(U8 value, U64 idx = U64_MAX);
        void write_i8(I8 value, U64 idx = U64_MAX);
        void write_u16(U16 value, U64 idx = U64_MAX);
        void write_i16(I16 value, U64 idx = U64_MAX);
        void write_u32(U32 value, U64 idx = U64_MAX);
        void write_i32(I32 value, U64 idx = U64_MAX);
        void write_u64(U64 value, U64 idx = U64_MAX);
        void write_i64(I64 value, U64 idx = U64_MAX);
        void write_f32(F32 value, U64 idx = U64_MAX);
        void write_f64(F64 value, U64 idx = U64_MAX);
        void write_vec2f(const Vec2f& value, U64 idx = U64_MAX);
        void write_vec3f(const Vec3f& value, U64 idx = U64_MAX);
        void write_vec4f(const Vec4f& value, U64 idx = U64_MAX);
    
      protected:
        /** Storage for the data of this byte array. */
        nv::Vector<U8> _data;
    
        /** Current position in the buffer. */
        U64 _position{0};
    };
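  • The header above does not show how the idx = U64_MAX default is handled. Below is a minimal standalone sketch of one plausible behavior, assuming that U64_MAX means "write at the current position and advance it", while an explicit idx writes in place without moving the position. The MiniByteArray class and the read_f32 helper are hypothetical stand-ins, not the actual NervLand code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Simplified stand-in for the ByteArray above (hypothetical, for illustration).
// Assumption: idx == kUseCursor means "write at the cursor and advance it".
class MiniByteArray {
  public:
    static constexpr std::uint64_t kUseCursor =
        std::numeric_limits<std::uint64_t>::max();

    explicit MiniByteArray(std::uint64_t size, std::uint8_t defVal = 0)
        : _data(size, defVal) {}

    std::uint64_t get_size() const { return _data.size(); }
    std::uint64_t get_position() const { return _position; }
    const std::uint8_t* get_data(std::uint64_t offset = 0) const {
        return _data.data() + offset;
    }

    // Copy dataSize bytes at idx, or at the cursor when idx is kUseCursor.
    void write_data(const void* data, std::uint64_t dataSize,
                    std::uint64_t idx = kUseCursor) {
        const bool useCursor = (idx == kUseCursor);
        const std::uint64_t pos = useCursor ? _position : idx;
        // The write must stay inside the buffer.
        assert(pos <= _data.size() && dataSize <= _data.size() - pos);
        std::memcpy(_data.data() + pos, data, dataSize);
        // Only cursor-based writes move the position.
        if (useCursor) _position = pos + dataSize;
    }

    void write_f32(float value, std::uint64_t idx = kUseCursor) {
        write_data(&value, sizeof(value), idx);
    }

  private:
    std::vector<std::uint8_t> _data;
    std::uint64_t _position{0};
};

// Read a float back from a byte offset (memcpy avoids aliasing issues).
inline float read_f32(const MiniByteArray& arr, std::uint64_t offset) {
    float v = 0.0f;
    std::memcpy(&v, arr.get_data(offset), sizeof(v));
    return v;
}
```

  • With this behavior, successive write_f32(value) calls fill the buffer front to back, and write_f32(value, 8) overwrites the third float without touching the position.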
  • Now, I'm wondering: could I just pass the get_data() pointer from lua directly to the write_push_contants() 🤔 ? That would be amazing… Let's try it 😊.
  • ⇒ Whaoooo! that's actually just working out of the box! I can't believe it 😳!
  • I simply created a new shader file using those push constants:
    layout(push_constant) uniform Push {
        vec4 offset;
        vec4 scale;
    } push;
    
    #ifdef _VERTEX_
    
    layout(location = 0) in vec2 position;
    layout(location = 1) in vec4 color;
    
    layout(location=0) out vec3 vertColor;
    
    void main() {
        gl_Position = vec4(position*push.scale.zw + push.offset.xy, 0.0, 1.0);
        vertColor = color.rgb;
    }
    
    #endif
    
    #ifdef _FRAGMENT_
    
    layout(location=0) in vec3 fragColor;
    layout(location=0) out vec4 outColor;
    
    void main() {
        outColor = vec4(fragColor, 1.0);
    }
    
    #endif
  • Then in lua I create my push constants:
        -- We can also store our push constants array here:
        self.pushArr = nv.ByteArray(32)
        self.pushArr:write_vec4f(nv.Vec4f(0.5, 0.5, 0.0, 0.0))
        self.pushArr:write_vec4f(nv.Vec4f(0.0, 0.0, 0.4, 0.4))
  • Then also update the pipeline layout accordingly:
        -- Prepare the push constant wrapper:
        local pc = nvk.VulkanPushConstantRange()
    
        -- Define a single push constant range:
        pc:setStageFlags(vk.ShaderStageFlagBits.VERTEX_BIT + vk.ShaderStageFlagBits.FRAGMENT_BIT)
        pc:setOffset(0)
        pc:setSize(32) -- 32 bytes = 2*4*4 => 2 vec4 lines
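  • To double-check the 32 bytes figure, the push block from the shader can be mirrored by a plain C++ struct and verified at compile time. This is only a sketch: the Vec4 and PushBlock types are hypothetical (they are not the NervLand types), and it assumes 4-byte floats:

```cpp
#include <cstddef>

// Hypothetical CPU-side mirror of the shader's push constant block.
struct Vec4 {
    float x, y, z, w;
};

struct PushBlock {
    Vec4 offset;  // bytes  0..15
    Vec4 scale;   // bytes 16..31
};

static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");
static_assert(sizeof(PushBlock) == 32, "must match pc:setSize(32)");
static_assert(offsetof(PushBlock, offset) == 0, "first vec4 starts at byte 0");
static_assert(offsetof(PushBlock, scale) == 16, "second vec4 starts at byte 16");

// Size to use for the push constant range and the ByteArray.
constexpr std::size_t push_block_size() { return sizeof(PushBlock); }
```

  • Since each float takes 4 bytes, the x, y, z and w components of the first vec4 sit at bytes 0, 4, 8 and 12 respectively.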
  • And finally I write the push constant data when recording the command buffers:
            -- add the push constants:
            cbuf:write_push_contants(playout, pstages, 0, 32, self.pushArr:get_data())
    
  • And here is the result I get (which is what I was expecting, since we apply a scale of 0.4 and then a clip space offset of (0.5,0.5)):

  • The only slightly annoying point is that the LLS thinks that get_data() returns an integer, so let's fix that (it should return a “void” type here instead): OK, fixed.
  • From the beginning of this post, what I had in mind was to try and use push constants to rotate my triangle on screen progressively. This implies that I will change a push constant angle or time value on every frame, and thus, we need to record the main command buffer on each frame.
  • Question is: how to record a command buffer on each frame efficiently? I'm thinking I should avoid using lua at all in callbacks that are executed on each frame… but maybe this perspective is incorrect and using lua here would not make a very big difference? (⇒ To be tested one day)
  • A long time ago I also started investigating JIT compilation with the LLVM compiler: that was really interesting, but also quite complex and not working as expected in the end if I remember correctly (🤔 ?)
  • So what I'm thinking about now would be to create some kind of graph or “blueprint” system which could be used to generate a command buffer in C++ when executed: that graph could be assembled only once in lua and then reused on each frame to update the command buffers.
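  • To make that idea a bit more concrete, here is a very rough sketch of what such a blueprint could look like: a list of steps assembled once and replayed into a command buffer on each frame. Everything here is hypothetical (MockCommandBuffer only logs the commands, and neither CommandBlueprint nor make_triangle_blueprint exist in NervLand); it only illustrates the "assemble once, replay per frame" pattern:

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a real command buffer: it only logs the recorded commands.
struct MockCommandBuffer {
    std::vector<std::string> log;
    void begin() { log.push_back("begin"); }
    void draw(int vertexCount) {
        log.push_back("draw " + std::to_string(vertexCount));
    }
    void finish() { log.push_back("finish"); }
};

// Hypothetical "blueprint": a list of steps assembled once, replayed per frame.
class CommandBlueprint {
  public:
    using Step = std::function<void(MockCommandBuffer&)>;

    // Append a step and return the blueprint to allow chaining.
    CommandBlueprint& add(Step step) {
        _steps.push_back(std::move(step));
        return *this;
    }

    // Replay every step, in order, into the given command buffer.
    void record(MockCommandBuffer& cbuf) const {
        for (const auto& step : _steps) step(cbuf);
    }

  private:
    std::vector<Step> _steps;
};

// Build the blueprint once (this is the part that could be driven from lua).
inline CommandBlueprint make_triangle_blueprint() {
    CommandBlueprint bp;
    bp.add([](MockCommandBuffer& c) { c.begin(); })
        .add([](MockCommandBuffer& c) { c.draw(3); })
        .add([](MockCommandBuffer& c) { c.finish(); });
    return bp;
}
```

  • The per-frame cost is then a plain C++ loop over the stored steps, with no call back into the scripting layer; whether that is actually faster in practice would still need to be measured.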
  • ⇒ The first thing we really need here is support for re-recording a command buffer, so we add the reset bit on the command pool:
        -- Now we create a Command Pool on that family index:
        self.cmdpool = self.vkeng:create_command_pool(famIdx,
            vk.CommandPoolCreateFlagBits.RESET_COMMAND_BUFFER_BIT + vk.CommandPoolCreateFlagBits.TRANSIENT_BIT)
        logDEBUG("Created command pool");
  • Next we should “reset” our current command buffers instead of re-creating them again: OK (in fact to reset the command buffer we simply start recording it again)
  • And now we will add a callback in the renderer to re-record the command buffer for each frame…
  • Preliminary step: I just refactored the Callback/LuaCallback implementation to completely hide the LuaCallback in the bindings. And now we create a “Callback” from lua before assigning it in the renderer:
        local cb = nv.Callback(function() self:recordCommandBuffers(rpass, vbuf, cfg) end)
        self.renderer:on_swapchain_updated(cb)
    
  • Next I can use the same design to provide the lua implementation for the CmdBuffersProvider implementation: I start with the definition of a custom constructor:
    auto _lunactr_CmdBuffersProvider(luna::LuaFunction& func)
        -> CmdBuffersProvider*;
  • And then we add the corresponding definition:
    struct LuaCmdBuffersProvider : public CmdBuffersProvider {
        NV_DECLARE_NO_COPY(LuaCmdBuffersProvider)
        NV_DECLARE_NO_MOVE(LuaCmdBuffersProvider)
    
        LuaCmdBuffersProvider(lua_State* L, I32 idx) : func(L, idx, true) {}
    
        ~LuaCmdBuffersProvider() override = default;
    
        void get_buffers(FrameDesc& fdesc, VkCommandBufferList& buffers) override {
            // Call the lua function:
            func(fdesc, buffers);
        }
    
        LuaFunction func;
    };
    
    auto _lunactr_CmdBuffersProvider(luna::LuaFunction& func)
        -> CmdBuffersProvider* {
        auto cb =
            nv::create_ref_object<LuaCmdBuffersProvider>(func.state, func.index);
        return cb.release();
    }
  • Now let's try to assign this provider in lua… okay, it works, but now I get about 2800fps instead of 3600fps in the previous version (i.e. without rebuilding the buffer on each frame)
  • Here is the new function I'm using to generate a buffer for each frame:
    -- Generate the command buffers for a given frame description
    -- just before those buffers are submitted to the graphics queue
    ---@param fdesc nvk.FrameDesc Current frame description
    ---@param buffers nvk.VkCommandBufferList list of vk buffers.
    ---@param rpass nvk.VulkanRenderPass Render pass to use
    ---@param vbuf nvk.VulkanVertexBuffer Vertex buffer object
    ---@param playout vk.PipelineLayout_T Pipeline layout
    ---@param cfg nvk.VulkanGraphicsPipelineCreateInfo create pipeline config
    function Class:generate_cmd_buffer(fdesc, buffers, rpass, vbuf, playout, cfg)
        -- logDEBUG("Should provide the cmd buffer for frame ", fdesc.frameNumber, " on image ", fdesc.swapchainImageIndex)
        local idx = fdesc.swapchainImageIndex
    
        -- Re-record the command buffer as above:
        local cbuf = self.cbufs:at(idx)
    
        local fbuf = self.renderer:get_swapchain_framebuffer(idx)
    
        -- Push constants stages:
        local pstages = vk.ShaderStageFlagBits.VERTEX_BIT + vk.ShaderStageFlagBits.FRAGMENT_BIT
        local width = self.renderer:get_swapchain_width()
        local height = self.renderer:get_swapchain_height()
    
        if self.width ~= width or self.height ~= height then
            -- We rebuild the pipeline with the correct size:
            self.width = width
            self.height = height
    
            -- Now we should update the viewport dimensions in our graphics pipeline config:
            local vp = cfg:getCurrentViewportState()
            vp:setViewport(width, height)
    
            if self.pipeline ~= nil then
                self.vkeng:remove_pipeline(self.pipeline)
            end
    
            self.pipeline = self.vkeng:create_graphics_pipeline(cfg, self.pipelineCache)
        end
    
        -- Check that we can reset the command buffer:
        fbuf:set_clear_color(0, 0.2, 0.2, 0.2, 1.0)
        -- fbuf:set_clear_color(0, 1, 1, 1, 1.0)
    
        -- cbuf:begin(vk.CommandBufferUsageFlagBits.ONE_TIME_SUBMIT_BIT)
        cbuf:begin(0)
    
        -- Begin rendering into the swapchain framebuffer:
        cbuf:begin_inline_pass(rpass, fbuf)
    
        -- Bind the graphics pipeline:
        cbuf:bind_graphics_pipeline(self.pipeline)
    
        -- Bind the vertex buffer:
        cbuf:bind_vertex_buffer(vbuf, 0)
    
        -- add the push constants:
        cbuf:write_push_contants(playout, pstages, 0, 32, self.pushArr:get_data())
    
        -- Draw our triangle:
        cbuf:draw(3)
    
        -- End the render pass:
        cbuf:end_render_pass()
    
        -- Finish the command buffer:
        cbuf:finish()
    
        -- Add the buffer to the list:
        buffers:push_back(cbuf:getVk())
    end
  • And here is the connection for the Buffer provider:
        local func = function(fdesc, buffers)
            self:generate_cmd_buffer(fdesc, buffers, renderpass, vbuf, playout, cfg)
        end
    
        local prov = nvk.CmdBuffersProvider(func)
        self.renderer:set_cmd_buffer_provider(prov)
  • When testing directly from Saturn I get about 3800fps with the previous version and 2900fps with the continuous command buffer rebuild path: that's roughly a 23.7% performance loss, pretty large 🤣.
  • Now I'm wondering what kind of difference it would make to implement the provider directly in C++, so let's try that.
  • Here is the implementation I'm providing in C++ for this “SimpleCmdBuffersProvider”:
    #include <vulkan_precomp.h>
    
    #include <base/VulkanCommandBuffer.h>
    #include <base/VulkanFramebuffer.h>
    #include <base/VulkanPipeline.h>
    #include <base/VulkanPipelineCache.h>
    #include <base/VulkanPipelineLayout.h>
    #include <base/VulkanRenderPass.h>
    #include <engine/VulkanRenderer.h>
    #include <engine/VulkanVertexBuffer.h>
    #include <providers/SimpleCmdBuffersProvider.h>
    #include <vulkan_wrappers.h>
    
    namespace nvk {
    
    SimpleCmdBuffersProvider::SimpleCmdBuffersProvider(
        VulkanRenderer* renderer, VulkanRenderPass* rpass, VulkanVertexBuffer* vbuf,
        VulkanPipelineLayout* playout, VulkanGraphicsPipelineCreateInfo* cfg,
        VulkanPipelineCache* pcache, const VulkanCommandBufferList& cbufs,
        const nv::ByteArray& pushArr)
        : _renderer(renderer), _pipelineCache(pcache), _rpass(rpass), _vbuf(vbuf),
          _playout(playout), _cfg(cfg), _cbufs(cbufs), _pushArr(pushArr) {}
    
    void SimpleCmdBuffersProvider::get_buffers(FrameDesc& fdesc,
                                               VkCommandBufferList& buffers) {
        // Write the command buffer:
        U32 idx = fdesc.swapchainImageIndex;
    
        // Re-record the command buffer as above:
        auto* cbuf = _cbufs[idx].get();
    
        auto* fbuf = _renderer->get_swapchain_framebuffer(idx);
    
        // Push constants stages :
        U32 pstages = VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT;
    
        U32 width = _renderer->get_swapchain_width();
        U32 height = _renderer->get_swapchain_height();
    
        // Check if we need to rebuild the pipeline:
        if (_width != width || _height != height) {
            _cfg->getCurrentViewportState()->setViewport((float)width,
                                                         (float)height);
            _pipeline = _renderer->get_device()->create_graphics_pipeline(
                _cfg->getVk(), _pipelineCache->getVk());
            _width = width;
            _height = height;
        }
    
        fbuf->set_clear_color(0, 0.2, 0.2, 0.2, 1.0);
    
        cbuf->begin(0);
    
        // Begin rendering into the swapchain framebuffer:
        cbuf->begin_inline_pass(_rpass.get(), fbuf);
    
        // Bind the graphics pipeline:
        cbuf->push_bind_graphics_pipeline(_pipeline->getVk());
    
        // Bind the vertex buffer:
        cbuf->bind_vertex_buffer(_vbuf.get(), 0);
    
        // add the push constants
        cbuf->write_push_contants(_playout->getVk(), pstages, 0,
                                  _pushArr.get_size(), _pushArr.get_data());
    
        // Draw our triangle:
        cbuf->draw(3);
    
        // End the render pass
        cbuf->end_render_pass();
    
        // Finish the command buffer:
        cbuf->finish();
    
        // Add the buffer to the list:
        buffers.push_back(cbuf->getVk());
    }
    
    } // namespace nvk
  • ⇒ With this version I get about 3100fps, which corresponds to a performance loss of “only” 18.4% compared to the static version (3800fps, without re-recording). That's better than the lua path, but not ultra impressive either, so it seems that luajit is performing pretty nicely here: going from 3100fps to 2900fps, the luajit implementation is only about 6.5% slower than the pure C++ one (or 5.3 points of the 3800fps baseline).
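  • For reference, the percentages quoted for these three measurements can be reproduced with a couple of helper functions. This is just a small sketch; the loss_percent and frame_time_ms names are made up for the occasion:

```cpp
#include <cassert>

// Relative slowdown (in percent) of `fps` with respect to `referenceFps`.
inline double loss_percent(double referenceFps, double fps) {
    assert(referenceFps > 0.0);
    return 100.0 * (referenceFps - fps) / referenceFps;
}

// Time spent on one frame, in milliseconds.
inline double frame_time_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}
```

  • loss_percent(3800, 2900) gives about 23.7 and loss_percent(3800, 3100) about 18.4, while loss_percent(3100, 2900) gives about 6.5. Expressed as frame times, 3800fps is about 0.26ms per frame and 2900fps about 0.34ms, so re-recording the buffer from lua costs roughly 0.08ms per frame in this test.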
  • ⇒ Still, this is a path I think I need to investigate further: large “blocks” of code should be implemented in C++ and then assembled in lua when possible… I still have this “fuzzy idea” about “blueprints” or other kinds of building block elements, or even LLVM JIT usage, which I need to investigate further (but let's keep that for our next post, shall we?)
  • Okay so now the final step I wanted to reach here: let's use the push constants to make our triangle rotate progressively 😉
  • We simply send the current time as the z component of the first vector in the push constants, as follows:
        // We update our push constants here to contain a time value:
        F32 time = (F32)fdesc.frameTime;
    
        // logDEBUG("Writing time value: {}", time);
    
        // We write the time as the z element of the first vec4:
        _pushArr.write_f32(time, 8);
  • Then I updated my shader to use that time value and build a rotation matrix with it:
    #define M_PI 3.1415926535897932384626433832795
    
    void main() {
        // get the time value:
        float time = push.offset.z;
    
        // prepare rotation matrix:
        float theta = 2.0*M_PI*time/5.0;
        float ct = cos(theta);
        float st = sin(theta);
    
        mat2 rot = mat2(ct,st,-st,ct);
        
        // Rotate position:
        vec2 pos = position*rot;
    
        gl_Position = vec4(pos*push.scale.zw, 0.0, 1.0);
        // gl_Position = vec4(position*push.scale.zw + push.offset.xy, 0.0, 1.0);
        vertColor = color.rgb;
    }
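  • One detail worth spelling out: in GLSL, mat2(ct,st,-st,ct) is filled column by column, and position*rot treats position as a row vector, which is equivalent to multiplying by the transposed matrix. The small CPU-side sketch below reproduces both products to show the difference (the Vec2/Mat2 types and the helper names are hypothetical, written only for this check):

```cpp
#include <cmath>

struct Vec2 {
    float x, y;
};

// Column-major 2x2 matrix, like GLSL's mat2(a, b, c, d):
// the first column is (a, b), the second column is (c, d).
struct Mat2 {
    Vec2 col0, col1;
};

// GLSL "v * m": v is treated as a row vector, so each result component
// is the dot product of v with one column of m.
inline Vec2 mul_row(const Vec2& v, const Mat2& m) {
    return {v.x * m.col0.x + v.y * m.col0.y,
            v.x * m.col1.x + v.y * m.col1.y};
}

// GLSL "m * v": v is treated as a column vector.
inline Vec2 mul_col(const Mat2& m, const Vec2& v) {
    return {m.col0.x * v.x + m.col1.x * v.y,
            m.col0.y * v.x + m.col1.y * v.y};
}

// Same construction as in the shader: theta = 2*pi*time/5,
// which means one full turn every 5 seconds.
inline Mat2 make_rot(float time) {
    const float theta = 2.0f * 3.14159265358979f * time / 5.0f;
    const float ct = std::cos(theta);
    const float st = std::sin(theta);
    return {{ct, st}, {-st, ct}};
}
```

  • At time = 1.25 (a quarter turn), mul_row maps (1,0) to approximately (0,-1) while mul_col maps it to approximately (0,1): position*rot rotates by -theta and rot*position by +theta. Either way the triangle makes one full turn every 5 seconds; only the direction differs.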
  • And that did the job 😁:

  • ⇒ So now we are good on push constants usage 👍!
  • Last modified: 2022/11/18 11:43