====== NervLand: Introducing support for push constants ======
{{tag>dev cpp 3D nervland vulkan SDL}}
Continuing our journey in Vulkan, this time our target is to introduce support for push constants in our command buffers. That sounds simple at first, but when I started thinking about it, I realized it might not be //that// simple after all. Let's check why...
====== ======
===== Simple support for push constants =====
* First, let's keep it ultra simple and only introduce the minimal changes in our current VulkanApp to use push constants: we will simply record them in our existing command buffers, providing, for instance, two vec2 components for a position offset and scale factors.
* But before anything else, let's read the maximum size allowed for push constants: <code lua>
local pdev = self.vkeng:get_physical_device()
local props = pdev:get_properties()
logDEBUG("Max push constant size: ", props.limits.maxPushConstantsSize, " bytes")
</code>
* And the result is really unimpressive (even on my RTX 2080): <code>
Max push constant size: 256 bytes
</code>
* => We won't go very far with that, I'm afraid.
* Anyway, we create a single push constant range in the pipeline layout: <code lua>
local pc = nvk.VulkanPushConstantRange()
-- Define a single push constant range:
pc:setStageFlags(vk.ShaderStageFlagBits.VERTEX_BIT + vk.ShaderStageFlagBits.FRAGMENT_BIT)
pc:setOffset(0)
pc:setSize(32)
</code>
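Besides the 256-byte limit, Vulkan's valid-usage rules for ''VkPushConstantRange'' require ''offset'' and ''size'' to be multiples of 4, ''size'' to be non-zero, ''offset'' to be less than ''maxPushConstantsSize'', and ''size'' to be at most ''maxPushConstantsSize - offset''. The helper below is a hypothetical sketch (not part of NervLand, plain integers instead of Vulkan types) of a check one could run before calling ''pc:setSize(32)''. It also shows how little 256 bytes is: 16 vec4 values, or 4 mat4.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the VkPushConstantRange valid-usage rules.
// maxSize is the value reported in limits.maxPushConstantsSize.
bool is_valid_push_constant_range(uint32_t offset, uint32_t size, uint32_t maxSize) {
    if (offset % 4 != 0 || size % 4 != 0) {
        return false; // 4-byte granularity for both offset and size
    }
    if (size == 0 || offset >= maxSize) {
        return false; // empty range, or range starting past the limit
    }
    return size <= maxSize - offset; // no underflow: offset < maxSize here
}

// Number of vec4 values (16 bytes each) that fit in the push constant space.
uint32_t max_vec4_count(uint32_t maxSize) {
    return maxSize / 16;
}
```

With the 256-byte limit above, the 32-byte range used in this post is valid, while a 32-byte range starting at byte 240 would not be.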
* Then we need to record a **vkCmdPushConstants** command, so let's add a wrapper for it: <code cpp>
void VulkanCommandBuffer::write_push_contants(VkPipelineLayout layout,
                                              VkShaderStageFlags stages,
                                              U32 offset, U32 size,
                                              const void* data) {
    ASSERT(is_recording());
    _device->vkCmdPushConstants(_buffer, layout, stages, offset, size, data);
}
</code>
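One property of ''vkCmdPushConstants'' matters for the rest of this post: the bytes pointed to by ''pValues'' are read when the command is recorded, so modifying the source buffer afterwards has no effect on an already-recorded command buffer. The mock below (a hypothetical ''MockCommandBuffer'', no Vulkan involved) only illustrates that copy-at-record-time behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a recorded push constant command: it keeps its
// own copy of the bytes, like vkCmdPushConstants does at record time.
struct PushConstantCmd {
    uint32_t offset;
    std::vector<uint8_t> bytes;
};

struct MockCommandBuffer {
    std::vector<PushConstantCmd> cmds;

    void push_constants(uint32_t offset, uint32_t size, const void* data) {
        const auto* src = static_cast<const uint8_t*>(data);
        // Copy the bytes now: later changes to `data` are not seen.
        cmds.push_back({offset, std::vector<uint8_t>(src, src + size)});
    }
};

// Read back a float from the first recorded push constant command.
float recorded_float(const MockCommandBuffer& cb, std::size_t byteOffset) {
    assert(!cb.cmds.empty());
    float v = 0.0F;
    std::memcpy(&v, cb.cmds.front().bytes.data() + byteOffset, sizeof(float));
    return v;
}
```

This is why changing a push constant value on every frame requires re-recording the command buffer.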
===== Introducing support for ByteArray =====
* Next, we need a convenient way to write (and read) arbitrary data into a byte buffer, so let's build a **ByteArray** class to support this (with usage from Lua in mind).
* => So here is the initial version of my ByteArray class: <code cpp>
class NVCORE_EXPORT ByteArray {
  public:
    ByteArray();
    explicit ByteArray(U64 size, U8 defVal = 0);
    ByteArray(const ByteArray& rhs);
    ByteArray(ByteArray&& rhs) noexcept;
    auto operator=(const ByteArray& rhs) -> ByteArray&;
    auto operator=(ByteArray&& rhs) noexcept -> ByteArray&;
    virtual ~ByteArray();

    /** Retrieve the size of this array */
    [[nodiscard]] auto get_size() const -> U64 { return _data.size(); }

    /** Retrieve the data pointer */
    [[nodiscard]] auto get_data(U64 offset = 0) const -> const U8* {
        return _data.data() + offset;
    }
    [[nodiscard]] auto get_data(U64 offset = 0) -> U8* {
        return _data.data() + offset;
    }

    /** Get the current position */
    [[nodiscard]] auto get_position() const -> U64 { return _position; }

    /** Set the current position */
    void set_position(U64 pos) {
        CHECK(pos < _data.size(), "out of range position.");
        _position = pos;
    }

    /** Reset the position in this buffer. */
    void reset_position() { _position = 0; }

    /** Resize the array */
    void resize(U64 newSize) { _data.resize(newSize); }

    /** Set a given value at a given byte position */
    void write_data(const U8* data, U64 dataSize, U64 idx = U64_MAX);
    void write_u8(U8 value, U64 idx = U64_MAX);
    void write_i8(I8 value, U64 idx = U64_MAX);
    void write_u16(U16 value, U64 idx = U64_MAX);
    void write_i16(I16 value, U64 idx = U64_MAX);
    void write_u32(U32 value, U64 idx = U64_MAX);
    void write_i32(I32 value, U64 idx = U64_MAX);
    void write_u64(U64 value, U64 idx = U64_MAX);
    void write_i64(I64 value, U64 idx = U64_MAX);
    void write_f32(F32 value, U64 idx = U64_MAX);
    void write_f64(F64 value, U64 idx = U64_MAX);
    void write_vec2f(const Vec2f& value, U64 idx = U64_MAX);
    void write_vec3f(const Vec3f& value, U64 idx = U64_MAX);
    void write_vec4f(const Vec4f& value, U64 idx = U64_MAX);

  protected:
    /** Storage for the data of this byte array. */
    nv::Vector<U8> _data;

    /** Current position in the buffer. */
    U64 _position{0};
};
</code>
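The write methods are only declared above. The sketch below shows one plausible way to implement the write path, under an assumption that is not confirmed by the class itself: ''idx == U64_MAX'' means "write at the current position and advance it", while an explicit index writes there and leaves the position unchanged. ''MiniByteArray'' is a hypothetical name, and ''write_vec4f'' takes four floats instead of NervLand's ''Vec4f''.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical sketch of a ByteArray write path (not the NervLand code).
class MiniByteArray {
  public:
    static constexpr uint64_t NO_INDEX = std::numeric_limits<uint64_t>::max();

    explicit MiniByteArray(uint64_t size, uint8_t defVal = 0)
        : _data(static_cast<std::size_t>(size), defVal) {}

    // NO_INDEX: write at the current position and advance it.
    // Explicit idx: write at idx, position unchanged.
    void write_data(const uint8_t* data, uint64_t dataSize, uint64_t idx = NO_INDEX) {
        const bool advance = (idx == NO_INDEX);
        const uint64_t pos = advance ? _position : idx;
        if (dataSize > _data.size() || pos > _data.size() - dataSize) {
            throw std::out_of_range("MiniByteArray: write out of range");
        }
        std::memcpy(_data.data() + pos, data, dataSize);
        if (advance) {
            _position += dataSize;
        }
    }

    void write_f32(float value, uint64_t idx = NO_INDEX) {
        write_data(reinterpret_cast<const uint8_t*>(&value), sizeof(value), idx);
    }

    void write_vec4f(float x, float y, float z, float w, uint64_t idx = NO_INDEX) {
        const float v[4] = {x, y, z, w};
        write_data(reinterpret_cast<const uint8_t*>(v), sizeof(v), idx);
    }

    float read_f32(uint64_t idx) const {
        if (idx > _data.size() || _data.size() - idx < sizeof(float)) {
            throw std::out_of_range("MiniByteArray: read out of range");
        }
        float v = 0.0F;
        std::memcpy(&v, _data.data() + idx, sizeof(v));
        return v;
    }

    uint64_t get_size() const { return _data.size(); }
    uint64_t get_position() const { return _position; }
    const uint8_t* get_data(uint64_t offset = 0) const { return _data.data() + offset; }

  private:
    std::vector<uint8_t> _data;
    uint64_t _position{0};
};
```

With these semantics, two consecutive ''write_vec4f'' calls fill a 32-byte array exactly, and a later indexed ''write_f32(value, 8)'' patches a single component in place.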
* Now, I'm wondering: could I just pass the **get_data()** pointer from Lua directly to **write_push_contants()** 🤔? That would be amazing... Let's try it 😊.
* => Whoa! That actually works out of the box! I can't believe it 😳!
* I simply created a new shader file using those push constants: <code glsl>
layout(push_constant) uniform Push {
    vec4 offset;
    vec4 scale;
} push;

#ifdef _VERTEX_
layout(location = 0) in vec2 position;
layout(location = 1) in vec4 color;

layout(location = 0) out vec3 vertColor;

void main() {
    gl_Position = vec4(position*push.scale.zw + push.offset.xy, 0.0, 1.0);
    vertColor = color.rgb;
}
#endif

#ifdef _FRAGMENT_
layout(location = 0) in vec3 fragColor;

layout(location = 0) out vec4 outColor;

void main() {
    outColor = vec4(fragColor, 1.0);
}
#endif
</code>
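On the C++ side, the ''Push'' block can be mirrored by a plain struct, which makes the 32-byte size explicit. For two vec4 members, std140 and std430 packing give the same layout (offsets 0 and 16), so the sketch below does not depend on which packing rule applies. ''PushData'', ''make_default_push'' and ''to_bytes'' are hypothetical names for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical C++ mirror of the GLSL block:
//   layout(push_constant) uniform Push { vec4 offset; vec4 scale; } push;
struct PushData {
    float offset[4]; // offset.xy: clip-space translation
    float scale[4];  // scale.zw: scale factors used by the vertex shader
};

static_assert(sizeof(PushData) == 32, "must match the 32-byte push constant range");
static_assert(offsetof(PushData, scale) == 16, "scale must start at byte 16");

// Same values as the Lua pushArr used in this post.
PushData make_default_push() {
    return PushData{{0.5F, 0.5F, 0.0F, 0.0F}, {0.0F, 0.0F, 0.4F, 0.4F}};
}

// Copy the struct into a raw byte buffer, as handed to write_push_contants().
void to_bytes(const PushData& p, unsigned char (&out)[sizeof(PushData)]) {
    std::memcpy(out, &p, sizeof(p));
}
```

The ''static_assert'' lines turn a mismatch with ''pc:setSize(32)'' into a compile-time error rather than garbage in the shader.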
* Then in Lua I create my push constants: <code lua>
-- We can also store our push constants array here:
self.pushArr = nv.ByteArray(32)
self.pushArr:write_vec4f(nv.Vec4f(0.5, 0.5, 0.0, 0.0))
self.pushArr:write_vec4f(nv.Vec4f(0.0, 0.0, 0.4, 0.4))
</code>
* Then I also update the pipeline layout accordingly: <code lua>
-- Prepare the push constant wrapper:
local pc = nvk.VulkanPushConstantRange()
-- Define a single push constant range:
pc:setStageFlags(vk.ShaderStageFlagBits.VERTEX_BIT + vk.ShaderStageFlagBits.FRAGMENT_BIT)
pc:setOffset(0)
pc:setSize(32) -- 32 bytes = 2*4*4 => 2 vec4 lines
</code>
* And finally I write the push constant data when recording the command buffers: <code lua>
-- add the push constants:
cbuf:write_push_contants(playout, pstages, 0, 32, self.pushArr:get_data())
</code>
* And here is the result I get (which is what I was expecting, since I apply a scale of 0.4 and then a clip-space offset of (0.5, 0.5)):
{{ blog:2022:1112:vulkan_01_push_constants.png?800 }}
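The expected result can be checked by hand with a CPU replica of the vertex shader line: every vertex is scaled by ''scale.zw'' and then translated by ''offset.xy''. The triangle vertices used in the test are hypothetical, since the vertex buffer content is not shown here. With Vulkan's default viewport the clip-space y axis points down, so a center of (0.5, 0.5) means the lower-right quadrant of the window.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 {
    float x;
    float y;
};

// CPU replica of:
//   gl_Position = vec4(position*push.scale.zw + push.offset.xy, 0.0, 1.0);
Vec2 apply_push_transform(Vec2 position, Vec2 scaleZW, Vec2 offsetXY) {
    return {position.x * scaleZW.x + offsetXY.x, position.y * scaleZW.y + offsetXY.y};
}

// Tolerance-based float comparison for checking results.
bool nearly_equal(float a, float b, float eps = 1e-6F) {
    return std::fabs(a - b) < eps;
}
```

For example, a vertex at (0, -0.5) lands at (0.5, 0.3), and the clip-space corner (1, 1) would land at (0.9, 0.9).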
* The only slightly annoying point is that the LLS (Lua language server) thinks that get_data() returns an integer, so let's fix that (it should return a "void" type here instead): **OK** fixed.
===== Recording command buffers every frame =====
* From the beginning of this post, what I had in mind was to use push constants to rotate my triangle on screen progressively. This implies changing a push constant (an angle or time value) on every frame, and since push constant values are captured when the command is recorded, we need to re-record the main command buffer on each frame.
* The question is: how can we record a command buffer on each frame **efficiently**? I'm thinking I should avoid using Lua at all in callbacks that are executed on each frame... but maybe this perspective is incorrect and using Lua here would not make a very big difference? (=> To be tested one day)
* A long time ago I also started investigating JIT compilation with the LLVM compiler: that was really interesting, but also quite complex, and in the end it was not working as expected if I remember correctly (🤔?)
* So what I'm thinking about now would be to create some kind of graph or "blueprint" system which could be used to generate a command buffer in C++ when executed: that graph could be assembled only once in Lua and then reused on each frame to update the command buffers.
* => The first thing we really need here is support for re-recording a command buffer, so we add the reset bit on the command pool: <code lua>
-- Now we create a Command Pool on that family index:
self.cmdpool = self.vkeng:create_command_pool(famIdx,
    vk.CommandPoolCreateFlagBits.RESET_COMMAND_BUFFER_BIT + vk.CommandPoolCreateFlagBits.TRANSIENT_BIT)
logDEBUG("Created command pool");
</code>
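The Lua code combines Vulkan flag bits with ''+''. That is equivalent to a bitwise OR only as long as the bits are distinct: adding the same bit twice would carry into a different flag. The check below uses local constants whose values are copied from the Vulkan headers; the ''add_equals_or'' helper is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the Vulkan headers, redeclared locally for the sketch.
constexpr uint32_t TRANSIENT_BIT = 0x1;            // VK_COMMAND_POOL_CREATE_TRANSIENT_BIT
constexpr uint32_t RESET_COMMAND_BUFFER_BIT = 0x2; // VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT
constexpr uint32_t VERTEX_BIT = 0x1;               // VK_SHADER_STAGE_VERTEX_BIT
constexpr uint32_t FRAGMENT_BIT = 0x10;            // VK_SHADER_STAGE_FRAGMENT_BIT

// Adding two flag masks gives the same result as OR-ing them
// only when they share no bit.
bool add_equals_or(uint32_t a, uint32_t b) {
    return (a + b) == (a | b);
}
```

Both combinations used in this post (reset + transient, vertex + fragment) are safe, but ''VERTEX_BIT + VERTEX_BIT'' would silently produce ''0x2'', which is a different stage bit.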
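* Next we should "reset" our current command buffers instead of re-creating them: **OK** (in fact, to reset a command buffer we simply start recording it again: since the pool was created with RESET_COMMAND_BUFFER_BIT, beginning a command buffer implicitly resets it).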
* And now we will add a callback in the renderer to re-record the command buffer for each frame...
* **Preliminary step**: I just refactored the Callback/LuaCallback implementation to completely hide the LuaCallback in the bindings. Now we create a "Callback" from Lua before assigning it in the renderer: <code lua>
local cb = nv.Callback(function() self:recordCommandBuffers(rpass, vbuf, cfg) end)
self.renderer:on_swapchain_updated(cb)
</code>
* Next I can use the same design to provide a Lua implementation of the **CmdBuffersProvider** interface. I start with the declaration of a custom constructor: <code cpp>
auto _lunactr_CmdBuffersProvider(luna::LuaFunction& func)
    -> CmdBuffersProvider*;
</code>
* And then we add the corresponding definition: <code cpp>
struct LuaCmdBuffersProvider : public CmdBuffersProvider {
    NV_DECLARE_NO_COPY(LuaCmdBuffersProvider)
    NV_DECLARE_NO_MOVE(LuaCmdBuffersProvider)

    LuaCmdBuffersProvider(lua_State* L, I32 idx) : func(L, idx, true) {}
    ~LuaCmdBuffersProvider() override = default;

    void get_buffers(FrameDesc& fdesc, VkCommandBufferList& buffers) override {
        // Call the lua function:
        func(fdesc, buffers);
    }

    LuaFunction func;
};

auto _lunactr_CmdBuffersProvider(luna::LuaFunction& func)
    -> CmdBuffersProvider* {
    auto cb =
        nv::create_ref_object<LuaCmdBuffersProvider>(func.state, func.index);
    return cb.release();
}
</code>
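Stripped of the Lua binding details, the provider is a virtual interface plus an adapter that forwards ''get_buffers()'' to a stored callable. The sketch below reproduces that shape with ''std::function'' in place of the Lua function; ''FrameDesc'', ''CommandBufferHandle'' and both classes are simplified stand-ins (hypothetical, not the nvk types), with integers replacing ''VkCommandBuffer'' handles.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Simplified stand-ins for nvk::FrameDesc and the VkCommandBuffer list.
struct FrameDesc {
    uint64_t frameNumber{0};
    uint32_t swapchainImageIndex{0};
};
using CommandBufferHandle = uint64_t;
using CommandBufferList = std::vector<CommandBufferHandle>;

// Interface queried by the renderer just before submitting a frame.
class CmdBuffersProvider {
  public:
    virtual ~CmdBuffersProvider() = default;
    virtual void get_buffers(FrameDesc& fdesc, CommandBufferList& buffers) = 0;
};

// Adapter forwarding to a callable, playing the role that
// LuaCmdBuffersProvider plays for a Lua function.
class FunctionCmdBuffersProvider : public CmdBuffersProvider {
  public:
    using Func = std::function<void(FrameDesc&, CommandBufferList&)>;

    explicit FunctionCmdBuffersProvider(Func func) : _func(std::move(func)) {}

    void get_buffers(FrameDesc& fdesc, CommandBufferList& buffers) override {
        _func(fdesc, buffers);
    }

  private:
    Func _func;
};
```

The renderer only sees the base class, so swapping a Lua-backed provider for a native C++ one (as done further down) requires no change on the renderer side.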
* Now let's try to assign this provider in Lua... Okay, it works, but now I get about 2800fps instead of 3600fps with the previous version (i.e. not rebuilding the buffers on each frame).
* Here is the new function I'm using to generate a buffer for each frame: <code lua>
-- Generate the command buffers for a given frame description
-- just before those buffers are submitted to the graphics queue
---@param fdesc nvk.FrameDesc Current frame description
---@param buffers nvk.VkCommandBufferList list of vk buffers.
---@param rpass nvk.VulkanRenderPass Render pass to use
---@param vbuf nvk.VulkanVertexBuffer Vertex buffer object
---@param playout vk.PipelineLayout_T Pipeline layout
---@param cfg nvk.VulkanGraphicsPipelineCreateInfo create pipeline config
function Class:generate_cmd_buffer(fdesc, buffers, rpass, vbuf, playout, cfg)
    -- logDEBUG("Should provide the cmd buffer for frame ", fdesc.frameNumber, " on image ", fdesc.swapchainImageIndex)
    local idx = fdesc.swapchainImageIndex

    -- Re-record the command buffer as above:
    local cbuf = self.cbufs:at(idx)
    local fbuf = self.renderer:get_swapchain_framebuffer(idx)

    -- Push constants stages:
    local pstages = vk.ShaderStageFlagBits.VERTEX_BIT + vk.ShaderStageFlagBits.FRAGMENT_BIT

    local width = self.renderer:get_swapchain_width()
    local height = self.renderer:get_swapchain_height()
    if self.width ~= width or self.height ~= height then
        -- We rebuild the pipeline with the correct size:
        self.width = width
        self.height = height

        -- Now we should update the viewport dimensions in our graphics pipeline config:
        local vp = cfg:getCurrentViewportState()
        vp:setViewport(width, height)

        if self.pipeline ~= nil then
            self.vkeng:remove_pipeline(self.pipeline)
        end

        self.pipeline = self.vkeng:create_graphics_pipeline(cfg, self.pipelineCache)
    end

    -- Check that we can reset the command buffer:
    fbuf:set_clear_color(0, 0.2, 0.2, 0.2, 1.0)
    -- fbuf:set_clear_color(0, 1, 1, 1, 1.0)
    -- cbuf:begin(vk.CommandBufferUsageFlagBits.ONE_TIME_SUBMIT_BIT)
    cbuf:begin(0)
    -- Begin rendering into the swapchain framebuffer:
    cbuf:begin_inline_pass(rpass, fbuf)
    -- Bind the graphics pipeline:
    cbuf:bind_graphics_pipeline(self.pipeline)
    -- Bind the vertex buffer:
    cbuf:bind_vertex_buffer(vbuf, 0)
    -- add the push constants:
    cbuf:write_push_contants(playout, pstages, 0, 32, self.pushArr:get_data())
    -- Draw our triangle:
    cbuf:draw(3)
    -- End the render pass:
    cbuf:end_render_pass()
    -- Finish the command buffer:
    cbuf:finish()

    -- Add the buffer to the list:
    buffers:push_back(cbuf:getVk())
end
</code>
* And here is the connection for the buffer provider: <code lua>
local func = function(fdesc, buffers)
    self:generate_cmd_buffer(fdesc, buffers, renderpass, vbuf, playout, cfg)
end
local prov = nvk.CmdBuffersProvider(func)
self.renderer:set_cmd_buffer_provider(prov)
</code>
* When testing directly from Saturn I get about **3800fps** with the previous version and **2900fps** with the continuous command buffer rebuild path: that's roughly a 23.7% performance loss, pretty large 🤣.
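At thousands of frames per second, percentages can exaggerate a small absolute cost, so it helps to convert the measurements into frame times: 3800fps is about 263.16 µs per frame and 2900fps about 344.83 µs, so rebuilding the command buffer from Lua costs roughly 82 µs per frame. The helper names below are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Frame time in microseconds for a given frame rate.
double frame_time_us(double fps) {
    return 1.0e6 / fps;
}

// Relative frame rate loss (in percent) going from baseFps to newFps.
double fps_loss_percent(double baseFps, double newFps) {
    return 100.0 * (baseFps - newFps) / baseFps;
}

// Extra time spent per frame (in microseconds) going from baseFps to newFps.
double added_frame_time_us(double baseFps, double newFps) {
    return frame_time_us(newFps) - frame_time_us(baseFps);
}
```

Frame time is additive, so it is the more useful unit for predicting what happens once the rest of the frame becomes heavier.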
===== Implementing provider in C++ directly =====
* Now I'm wondering what kind of difference it would make to implement the provider directly in C++, so let's try that.
* Here is the implementation I'm providing in C++ for this "**SimpleCmdBuffersProvider**": <code cpp>
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

namespace nvk {

SimpleCmdBuffersProvider::SimpleCmdBuffersProvider(
    VulkanRenderer* renderer, VulkanRenderPass* rpass, VulkanVertexBuffer* vbuf,
    VulkanPipelineLayout* playout, VulkanGraphicsPipelineCreateInfo* cfg,
    VulkanPipelineCache* pcache, const VulkanCommandBufferList& cbufs,
    const nv::ByteArray& pushArr)
    : _renderer(renderer), _pipelineCache(pcache), _rpass(rpass), _vbuf(vbuf),
      _playout(playout), _cfg(cfg), _cbufs(cbufs), _pushArr(pushArr) {}

void SimpleCmdBuffersProvider::get_buffers(FrameDesc& fdesc,
                                           VkCommandBufferList& buffers) {
    // Write the command buffer:
    U32 idx = fdesc.swapchainImageIndex;

    // Re-record the command buffer as above:
    auto* cbuf = _cbufs[idx].get();
    auto* fbuf = _renderer->get_swapchain_framebuffer(idx);

    // Push constants stages:
    U32 pstages = VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT;

    U32 width = _renderer->get_swapchain_width();
    U32 height = _renderer->get_swapchain_height();

    // Check if we need to rebuild the pipeline:
    if (_width != width || _height != height) {
        _cfg->getCurrentViewportState()->setViewport((float)width,
                                                     (float)height);
        _pipeline = _renderer->get_device()->create_graphics_pipeline(
            _cfg->getVk(), _pipelineCache->getVk());
        _width = width;
        _height = height;
    }

    fbuf->set_clear_color(0, 0.2, 0.2, 0.2, 1.0);
    cbuf->begin(0);
    // Begin rendering into the swapchain framebuffer:
    cbuf->begin_inline_pass(_rpass.get(), fbuf);
    // Bind the graphics pipeline:
    cbuf->push_bind_graphics_pipeline(_pipeline->getVk());
    // Bind the vertex buffer:
    cbuf->bind_vertex_buffer(_vbuf.get(), 0);
    // add the push constants
    cbuf->write_push_contants(_playout->getVk(), pstages, 0,
                              _pushArr.get_size(), _pushArr.get_data());
    // Draw our triangle:
    cbuf->draw(3);
    // End the render pass
    cbuf->end_render_pass();
    // Finish the command buffer:
    cbuf->finish();

    // Add the buffer to the list:
    buffers.push_back(cbuf->getVk());
}

} // namespace nvk
</code>
* => With this version I can get about **3100fps**, which corresponds to a performance loss of "only" 18.4% compared to the version that does not rebuild the command buffers on each frame (3800fps): that's good, but not ultra impressive either. So it seems LuaJIT performance is pretty nice: we only lose about **5.3** percentage points when using the LuaJIT implementation instead of pure C++.
* => Still, this is a path I think I need to investigate further: large "blocks" of code should be implemented in C++ and then assembled in Lua when possible... I still have this "fuzzy idea" about "blueprints" or other kinds of building block elements, or even LLVM JIT usage, which I need to investigate further (but let's keep that for our next post, shall we?)
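The 5.3 figure is a difference between two losses measured against the same 3800fps baseline, i.e. percentage points of the baseline. Measured against the C++ provider itself, the Lua provider is about 6.45% slower (2900 vs 3100fps), which is roughly 22 µs per frame (344.83 µs vs 322.58 µs). The helper names below are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Gap between two frame rates, expressed in percentage points
// of a common baseline frame rate.
double loss_points(double baseFps, double fpsA, double fpsB) {
    return 100.0 * (fpsA - fpsB) / baseFps;
}

// Slowdown of B relative to A, in percent of A's frame rate.
double relative_loss_percent(double fpsA, double fpsB) {
    return 100.0 * (fpsA - fpsB) / fpsA;
}

// Per-frame time difference in microseconds (B slower than A).
double frame_time_gap_us(double fpsA, double fpsB) {
    return 1.0e6 / fpsB - 1.0e6 / fpsA;
}
```

Both readings support the same conclusion: the Lua overhead is small next to the cost of rebuilding the command buffer itself.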
===== Rotating our triangle with push constants =====
* Okay so now the final step I wanted to reach here: let's use the push constants to make our triangle rotate progressively 😉
* Simply send the current time as the z component of the first vector in the push constants as follows: <code cpp>
// We update our push constants here to contain a time value:
F32 time = (F32)fdesc.frameTime;
// logDEBUG("Writing time value: {}", time);
// We write the time as the z element of the first vec4:
_pushArr.write_f32(time, 8);
</code>
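The byte index 8 follows from the block layout: the first vec4 starts at byte 0 and each float component takes 4 bytes, so x, y, z and w sit at bytes 0, 4, 8 and 12. The hypothetical helpers below compute that offset for any vec4 member and write a float there with ''memcpy''.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Byte offset of component `comp` (0=x, 1=y, 2=z, 3=w) of the vec4 member at
// index `member`, in a block made only of vec4 members (16 bytes each).
constexpr uint32_t vec4_component_offset(uint32_t member, uint32_t comp) {
    return member * 16U + comp * static_cast<uint32_t>(sizeof(float));
}

static_assert(vec4_component_offset(0, 2) == 8, "push.offset.z lives at byte 8");

// Write a float at a byte offset in a raw buffer.
void write_f32_at(uint8_t* buffer, uint32_t byteOffset, float value) {
    std::memcpy(buffer + byteOffset, &value, sizeof(value));
}
```

Using ''memcpy'' rather than casting the buffer to ''float*'' avoids alignment and strict-aliasing issues.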
* Then I updated my shader to use that time value and build a rotation matrix with it: <code glsl>
#define M_PI 3.1415926535897932384626433832795

void main() {
    // get the time value:
    float time = push.offset.z;

    // prepare rotation matrix:
    float theta = 2.0*M_PI*time/5.0;
    float ct = cos(theta);
    float st = sin(theta);
    mat2 rot = mat2(ct,st,-st,ct);

    // Rotate position:
    vec2 pos = position*rot;

    gl_Position = vec4(pos*push.scale.zw, 0.0, 1.0);
    // gl_Position = vec4(position*push.scale.zw + push.offset.xy, 0.0, 1.0);
    vertColor = color.rgb;
}
</code>
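A CPU replica of the shader math makes the conventions explicit: ''theta = 2*pi*time/5'' gives one full turn every 5 seconds, and since GLSL matrices are column-major, ''mat2(ct,st,-st,ct)'' has columns (ct, st) and (-st, ct). The row-vector product ''position*rot'' therefore rotates by -theta in the usual counter-clockwise convention (''rot*position'' would rotate by +theta). ''rotate_like_shader'' is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 {
    float x;
    float y;
};

// CPU replica of the vertex shader rotation:
//   theta = 2*pi*time/5, rot = mat2(ct, st, -st, ct), pos = position*rot
// For v*M, GLSL computes dot(v, column_i) for each output component i.
Vec2 rotate_like_shader(Vec2 p, float time) {
    const float pi = 3.14159265358979F;
    const float theta = 2.0F * pi * time / 5.0F;
    const float ct = std::cos(theta);
    const float st = std::sin(theta);
    return {p.x * ct + p.y * st, -p.x * st + p.y * ct};
}
```

After a quarter period (1.25 s) the point (1, 0) reaches (0, -1), and after 5 s it is back at (1, 0).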
* And that did the job 😁:
{{ blog:2022:1116:vulkan_rotating_triangle.gif?800 }}
* => So now we are good on push constants usage 👍!