====== NervLand DevLog #29: Introducing Proland core elements ======

{{tag>dev cpp webgpu nervland devlog proland}}

In this new devlog session I'm mainly focusing on introducing some of the base elements required to integrate/refactor the Inria ProLand project into NervLand: a pretty tricky task, so this will definitely take some good time, but we have to start somewhere, right? In this process we will also cover some additional steps, like some improvements on the Scene Management system, the introduction of the TaskGraph and multithread Scheduler subsystem, extensions on the LogManager to support multithreading with appropriate performance, and we will also review how to extract the frustum planes from an (almost) arbitrary transformation matrix! Let's get started 😎.

====== ======

Youtube videos (2 parts) for this article available at:

;#;
{{ youtube>fvs-Aag4DKc?large }}
;#;

;#;
{{ youtube>NB-egOvTkj8?large }}
;#;

**References:**
  * [[https://gitlab.inria.fr/neyret/proland|Source gitlab repository for Proland]]
  * [[https://gitlab.inria.fr/neyret/ork|Source gitlab repository for Ork]]
  * [[http://proland.imag.fr/doc/proland-4.0/core/html/index.html|Proland Core API documentation]]
  * [[http://proland.imag.fr/doc/proland-4.0/core/html/page-examples.html|Proland Core examples]]

===== ✅ Preparing initial classes for Core module =====

Adding minimal (skeleton) implementation for the following classes:
  - **TileStorage**
  - **TileProducer**
  - **TileLayer**
  - **TileCache**
  - **GPUTileStorage**
  - **TerrainNode**
  - **TerrainQuad**

Next, I'm thinking I could start with updating my implementation of the **SceneManager** and **Scene** classes: these could now be extended to support our WebGPU engine in fact 🤔?

Since I'm updating the Scene/SceneManager classes, I also need to rebuild the lua bindings, and strangely enough this led me to some binding conflicts between the "int" and "long" types. So I had to update the default type registration for "int" in **app/NervBind.lua** to fix this:

<code lua>
luna.tm:getOrCreateType("std::int32_t", { primitive = Type.BasicType.INT32 }):addNames("__int32_t", "int32_t", "int", "long")
</code>

Added initial implementation of the **SceneNode** class.
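For context, the **SceneNode** interface is really minimal at this stage: essentially a ref-counted object exposing a virtual ''process()'' method that the scene manager calls once per frame. Something along these lines (this is just an illustrative sketch, the base class name and exact members here are my shorthand, not the engine's actual code):

<code cpp>
// Illustrative sketch only: the real class derives from the engine's
// ref-counted base and is exposed to lua like the other scene classes.
class SceneNode : public RefObject {
  public:
    SceneNode() = default;
    ~SceneNode() override = default;

    /** Called once per frame by the owning Scene/SceneManager. */
    virtual void process() = 0;
};
</code>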
Implemented a first type of SceneNode to render a given WGPU RenderPass: the **SkyBox**:

<code cpp>
class NVGPU_EXPORT SkyBox : public SceneNode {
  public:
    struct CreateInfo {
        wgpu::Texture envmap{nullptr};
        U32 width{0};
        U32 height{0};
        I32 swapchain_idx{-1};
        // Window* window{nullptr};
    };

    explicit SkyBox(const CreateInfo& infos);

    static auto create(const CreateInfo& infos) -> RefPtr<SkyBox> {
        return nv::create_ref_object<SkyBox>(infos);
    }

    /** Process this skybox */
    void process() override;

  protected:
    /** Render pass for this Skybox */
    WGPURenderPass _rpass;
};
</code>

Fully encapsulated the frame rendering process in the ''WGPUEngine::render_frame()'' method:

<code cpp>
void WGPUEngine::render_frame() {
    auto* sman = SceneManager::instance();

    pre_render();

    auto& bld = build_commands();
    sman->process_all();
    bld.submit(true);

    present();
    post_render();
}
</code>

Next, I encapsulated the WGPUCamera instance into the **Scene** object, so now we can simplify the **setup_app()** function even further:

<code cpp>
void ExampleApp::setup_app() {
    // Enable IMgui:
    useImgui = true;

    // Get the env cubemap:
    // const char* tname = "tests/cmaps/yokohama_rgba";
    const char* tname = "tests/cmaps/bridge2";

    _ibl = generate_pbr_ibl_context(
        {.envmapFile = tname,
         .usage = TextureUsage::CopyDst | TextureUsage::TextureBinding});

    auto* sman = SceneManager::instance();
    auto& scene = sman->get_or_create_scene();

    auto& spass = scene.add_pass({.with_depth = true, .swapchain_idx = 0});
    spass.add_skybox(_ibl.filteredEnv);

    spass.add_pbr({
        .ibl = _ibl,
        .transform = Mat4f::scale(2.0, 2.0, 2.0) *
                     Mat4f::rotate(toRad(90.0), VEC3F_YAXIS) *
                     Mat4f::rotate(toRad(-90.0), VEC3F_XAXIS),
        .modelName = "tests/cerberus",
        .albedoFile = "tests/cerberus/albedo.png",
        .roughnessFile = "tests/cerberus/roughness.png",
        .metallicFile = "tests/cerberus/metallic.png",
        .normalFile = "tests/cerberus/normal.png",
        .aoFile = "tests/cerberus/ao.png",
    });
}
</code>

===== ✅ Analysing how to extract frustum planes from a matrix =====

In the original Proland code we do something like this (when updating the TerrainNode):

<code cpp>
deformedCameraPos = owner->getLocalToCamera().inverse() * vec3d::ZERO;
SceneManager::getFrustumPlanes(owner->getLocalToScreen(), deformedFrustumPlanes);
localCameraPos = deform->deformedToLocal(deformedCameraPos);
</code>

The first stop for me here is the computation of the frustum planes from the localToScreen matrix, which goes as follows:

<code cpp>
void SceneManager::getFrustumPlanes(const mat4d &toScreen, vec4d *frustumPlanes)
{
    const double *m = toScreen.coefficients();
    // Extract the LEFT plane
    frustumPlanes[0].x = m[12] + m[0];
    frustumPlanes[0].y = m[13] + m[1];
    frustumPlanes[0].z = m[14] + m[2];
    frustumPlanes[0].w = m[15] + m[3];
    // Extract the RIGHT plane
    frustumPlanes[1].x = m[12] - m[0];
    frustumPlanes[1].y = m[13] - m[1];
    frustumPlanes[1].z = m[14] - m[2];
    frustumPlanes[1].w = m[15] - m[3];
    // Extract the BOTTOM plane
    frustumPlanes[2].x = m[12] + m[4];
    frustumPlanes[2].y = m[13] + m[5];
    frustumPlanes[2].z = m[14] + m[6];
    frustumPlanes[2].w = m[15] + m[7];
    // Extract the TOP plane
    frustumPlanes[3].x = m[12] - m[4];
    frustumPlanes[3].y = m[13] - m[5];
    frustumPlanes[3].z = m[14] - m[6];
    frustumPlanes[3].w = m[15] - m[7];
    // Extract the NEAR plane
    frustumPlanes[4].x = m[12] + m[8];
    frustumPlanes[4].y = m[13] + m[9];
    frustumPlanes[4].z = m[14] + m[10];
    frustumPlanes[4].w = m[15] + m[11];
    // Extract the FAR plane
    frustumPlanes[5].x = m[12] - m[8];
    frustumPlanes[5].y = m[13] - m[9];
    frustumPlanes[5].z = m[14] - m[10];
    frustumPlanes[5].w = m[15] - m[11];
}
</code>

Nice and all, but I have no idea what is happening here, so I'm not
going to just drop this in my engine: I need to understand what this code is doing 😁.

**References**:
  * https://fgiesen.wordpress.com/2012/08/31/frustum-planes-from-the-projection-matrix/
  * https://www8.cs.umu.se/kurser/5DV051/HT12/lab/plane_extraction.pdf

=> So from what I read we need to use this "Gribb-Hartmann" algorithm here, but in WebGPU the clip space on the z axis is in the range \(0 \lt z^\prime \lt w^\prime\). So in fact we will use the Direct3D clipping space rules here:

\[
-w^\prime \lt x^\prime \lt w^\prime \\
-w^\prime \lt y^\prime \lt w^\prime \\
0 \lt z^\prime \lt w^\prime
\]

Where \( \mathbf{v^\prime} = \left(x^\prime,y^\prime,z^\prime,w^\prime\right) \) are the transformed coordinates of our point \((x,y,z,w)\) in clip space:

\[
\mathbf{v^\prime} = M \cdot \mathbf{v} \Leftrightarrow \begin{bmatrix}x^\prime\\y^\prime\\z^\prime\\w^\prime\end{bmatrix} = \begin{bmatrix}m_{11} & m_{12} & m_{13} & m_{14}\\m_{21} & m_{22} & m_{23} & m_{24}\\m_{31} & m_{32} & m_{33} & m_{34}\\m_{41} & m_{42} & m_{43} & m_{44}\end{bmatrix} \cdot \begin{bmatrix}x\\y\\z\\w\end{bmatrix}
\]

So now if we just consider the left clipping plane, we have the inequality \( -w^\prime < x^\prime \), which gives us (ignoring any transposition symbol here as this just makes everything more confusing from my perspective, and the meaning below is obvious):

\[
- \mathbf{v} \cdot \mathbf{\text{row}_4} \lt \mathbf{v} \cdot \mathbf{\text{row}_1} \Leftrightarrow 0 < \mathbf{v} \cdot \left( \mathbf{\text{row}_1} + \mathbf{\text{row}_4}\right)
\]

So this gives us the final plane equation (corresponding to the formula \(ax+by+cz+d=0\)):

\[
0 = x \cdot (m_{11} + m_{41}) + y \cdot (m_{12} + m_{42}) + z \cdot (m_{13}+m_{43}) + (m_{14} + m_{44})
\]

If we then normalize the plane normal \((a,b,c)\), the formula \( ax_0 + by_0 + cz_0 + d \) gives us the signed distance of a point \(\left(x_0,y_0,z_0\right)\) to the plane, and this distance must be positive for the point to be on the correct (inner) side of the plane.

=> Okay, so now I think I understand where I am, and I can build my own function for this. Let's do it 👍!

So this gives me the following final implementation:

<code cpp>
void SceneManager::get_frustum_planes(const Mat4d& toScreen,
                                      Vec4d* frustumPlanes) {
    const Mat4d& m = toScreen;
    Vec3d norm;

    // Extract the LEFT plane
    frustumPlanes[NV_FRUSTUM_LEFT].set(m(3, 0) + m(0, 0), m(3, 1) + m(0, 1),
                                       m(3, 2) + m(0, 2), m(3, 3) + m(0, 3));
    // Extract the RIGHT plane
    frustumPlanes[NV_FRUSTUM_RIGHT].set(m(3, 0) - m(0, 0), m(3, 1) - m(0, 1),
                                        m(3, 2) - m(0, 2), m(3, 3) - m(0, 3));
    // Extract the BOTTOM plane
    frustumPlanes[NV_FRUSTUM_BOTTOM].set(m(3, 0) + m(1, 0), m(3, 1) + m(1, 1),
                                         m(3, 2) + m(1, 2), m(3, 3) + m(1, 3));
    // Extract the TOP plane
    frustumPlanes[NV_FRUSTUM_TOP].set(m(3, 0) - m(1, 0), m(3, 1) - m(1, 1),
                                      m(3, 2) - m(1, 2), m(3, 3) - m(1, 3));
    // Extract the NEAR plane
    frustumPlanes[NV_FRUSTUM_NEAR].set(m(2, 0), m(2, 1), m(2, 2), m(2, 3));
    // Extract the FAR plane
    frustumPlanes[NV_FRUSTUM_FAR].set(m(3, 0) - m(2, 0), m(3, 1) - m(2, 1),
                                      m(3, 2) - m(2, 2), m(3, 3) - m(2, 3));
}
</code>

Now also using double precision in the **Camera** implementation.
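Just to make the convention explicit, here is a small usage sketch of those planes for culling (illustrative only, with assumed vector accessors; this is not the engine code): each plane stores the \((a,b,c,d)\) coefficients, and once \((a,b,c)\) is normalized, \(ax_0+by_0+cz_0+d\) is the signed distance from a point to that plane, so a bounding sphere can be rejected as soon as that distance drops below \(-radius\) for any plane:

<code cpp>
// Sketch: cull a bounding sphere against the 6 extracted planes.
// 'planes' come straight from get_frustum_planes() (not normalized yet).
// Uses std::sqrt from <cmath>; x()/y()/z()/w() accessors are assumed here.
auto is_sphere_visible(const Vec4d* planes, const Vec3d& center,
                       double radius) -> bool {
    for (int i = 0; i < 6; ++i) {
        const Vec4d& p = planes[i];
        // Length of the (a,b,c) normal, used to normalize the distance:
        double len = std::sqrt(p.x() * p.x() + p.y() * p.y() + p.z() * p.z());
        // Signed distance from the sphere center to the plane:
        double dist = (p.x() * center.x() + p.y() * center.y() +
                       p.z() * center.z() + p.w()) / len;
        if (dist < -radius) {
            // Completely on the outer side of this plane => not visible.
            return false;
        }
    }
    return true;
}
</code>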
===== ✅ Updating terrain orientation =====

**Important note**: Next we have the function ''Deformation::localToDeformedDifferential()'':

<code cpp>
auto Deformation::localToDeformedDifferential(const Vec3d& localPt,
                                              bool /*clamp*/) const -> Mat4d {
    return Mat4d::translate(Vec3d(localPt.x(), localPt.y(), 0.0));
}
</code>

=> I'm not quite sure if I really need to differentiate on the X/Y axis... **To be clarified**.

In fact, thinking more about this point: in our engine we use the canonical world orientation as follows:
  - **X axis** = Right
  - **Y axis** = Up
  - **Z axis** = Forward

This doesn't match the default terrain orientation in Proland, which uses the Z axis as up. We could probably work around this, but to avoid dealing with too many conventions I think it would be better if I just use the X/Z plane for the terrain and keep the Y axis as up here too.

Next:
  * Implemented support for ''TerrainNode::get_visibility(...)'' and the **Box3d** class.
  * Implemented initial processing of **TerrainQuad::update()** (without the occluder system).

===== ✅ Initial terrain QuadTree display results =====

Here is the initial display result I got, with a simple shader displaying each quad UV coords as color:

{{ blog:2023:055_nvl_proland_simple_terrain.png }}

Shortly after this initial result I updated the shader and terrain uniforms to display the sine wave output as in the Proland official core example:

{{ blog:2023:056_nvl_proland_sine_wave_terrain.png }}

===== ✅ Implementing support for ElevationProducer =====

=> Working on the implementation of the Terrain sample1 from Proland: http://proland.imag.fr/doc/proland-4.0/terrain/html/page-examples.html

Added implementation for the **Task** class.

Added implementation for the **TaskGraph** class.

**Side note**: to build NervLand on a linux system I now had to install **libstdc++-10-dev** (ie. ''sudo apt install libstdc++-10-dev''), otherwise I get an error on the ''#include''.

===== ✅ Implementing TaskGraph and Multithread Scheduler =====

Now that I have introduced support for the **Task** and **TaskGraph** classes, it's time to move to the **Scheduler** implementation. Except that for this one, I think there are a few points I could improve on: first, the reference ORK implementation makes a very extensive use of Task sets, and relies on a mutex and condition variables to synchronize the task execution on multiple threads. Instead, I think I could use the [[https://github.com/cameron314/concurrentqueue|Moodycamel concurrent queue]] for a lock-free implementation.
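To give an idea of the direction, here is a minimal sketch of the kind of lock-free worker loop I have in mind, built on the blocking flavour of the Moodycamel queue (the ''MiniScheduler'' name and the simplified ''Job'' type are just for this illustration; the actual Scheduler has to handle the task dependencies and pending sets on top of this):

<code cpp>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>
#include "blockingconcurrentqueue.h"

// Sketch: N worker threads popping ready jobs from a shared lock-free queue.
struct MiniScheduler {
    using Job = std::function<void()>;

    moodycamel::BlockingConcurrentQueue<Job> queue;
    std::vector<std::thread> workers;
    std::atomic<bool> stop{false};

    explicit MiniScheduler(unsigned numThreads) {
        for (unsigned i = 0; i < numThreads; ++i) {
            workers.emplace_back([this]() {
                Job job;
                while (!stop.load(std::memory_order_acquire)) {
                    // Wait up to 200us for a job, then re-check the stop flag:
                    if (queue.wait_dequeue_timed(job, 200)) {
                        job();
                    }
                }
            });
        }
    }

    void post(Job job) { queue.enqueue(std::move(job)); }

    ~MiniScheduler() {
        stop.store(true, std::memory_order_release);
        for (auto& t : workers) {
            t.join();
        }
    }
};
</code>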
**Side note**: I had to rebuild LuaJIT on linux during this process, as compatibility with lua 5.2 was not enabled yet and thus I got errors in the nvCore test app with the lua function ''unpack()''.

===== ✅ LogManager dedicated thread =====

While doing some initial tests on the MultithreadScheduler I realized that using the LogManager in its current implementation can really reduce the performances significantly, due to the time spent in the FileLogger/StdLogger actually writing stuff to the streams. If for instance I bypass writing the "NOTE" messages I'm outputting, with a temporary test change like this:

<code cpp>
void FileLogger::output(int level, const char* prefix, const char* msg,
                        size_t size) {
    if (level == LogManager::NOTE) {
        return;
    }

    _stream << prefix;
    _stream.write(msg, (std::streamsize)size) << std::endl;
    _stream.flush();
}
</code>

then I get 100 dummy tasks processed in **~0.000250 secs**, which seems fair, but if I keep writing this simple message, I get the same 100 tasks executed in **~0.002737 secs** instead 😲, so that's probably something that would at least deserve a little bit of optimization I think.

**Note**: I get even larger durations (**~6600us**) when I use the actual message I want to output instead of a simple one (but again, when just bypassing the writing to the streams I only get ~250us!):

<code cpp>
// logNOTE("Hello world!");
logNOTE("Scheduling {} immediate tasks, {} pending, {} + {} dependencies.",
        _bulkStagedTasks.size(), _pendingTasks.size(), _dependencies.size(),
        _inverseDependencies.size());
/*//*/
</code>

So, what I'm thinking here now is that maybe I could use a dedicated thread to write the log messages, and maybe I could use my new friend the **moodycamel::ConcurrentQueue** here to avoid any lock 😉! Let's see...

As a first step, I try to use simple std::string objects to pass them to the logger thread and then get them back for recycling:

<code cpp>
auto& str = get_output_string();

#if NV_USE_LOG_THREAD
    // If using the log thread we first try to recycle the memory from a
    // previously used string:
    _recycleQueue.try_dequeue_from_producer(_recycleProducerToken, str);

    // Note: whether the dequeue operation was successful or not should not be
    // relevant here: if there is nothing to dequeue, we just use the string as
    // is (eg. allocating some additional memory for it)
    str.reserve(strlen(buf.data()) + size);
    str.clear();
    str += buf.data();
    str.append(data, size);

    // send that message to the logger thread:
    _msgQueue.enqueue(std::move(str));
#else
    str.reserve(strlen(buf.data()) + size);
    str.clear();
    str += buf.data();
    str.append(data, size);

    // lock the logging spinlock:
    WITH_SPINLOCK(_logSP);
    output_message(lvl, str);
#endif
</code>

But to be honest, I'm not absolutely sure this will have the minimal possible dynamic memory allocation cost 🤔. To clarify that, I added the following code:

<code cpp>
U32 tsize = strlen(buf.data()) + size;
if (str.capacity() < tsize) {
    std::cout << "==========> Increasing string capacity to " << tsize
              << std::endl;
    str.reserve(tsize);
}
</code>

And unfortunately, I seem to get quite a lot of allocation messages with that:

<code>
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
2023-08-29 11:21:08.051274 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
==========> Increasing string capacity to 94
2023-08-29 11:21:08.051277 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
==========> Increasing string capacity to 94
2023-08-29 11:21:08.051279 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
==========> Increasing string capacity to 94
2023-08-29 11:21:08.051283 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
==========> Increasing string capacity to 94
2023-08-29 11:21:08.051286 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
==========> Increasing string capacity to 942023-08-29 11:21:08.051289 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
2023-08-29 11:21:08.051292 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
==========> Increasing string capacity to ==========> Increasing string capacity to 57
</code>

Now, is this really incorrect? I mean, we could be in a situation where we are posting multiple new messages from the same thread before we get a chance to recycle any allocated memory...
Let's try with a large number of tasks, and check the reuse as follows:

<code cpp>
U32 tsize = strlen(buf.data()) + size;
if (str.capacity() < tsize) {
    std::cout << "==========> Increasing string capacity to " << tsize
              << std::endl;
    str.reserve(tsize);
} else {
    std::cout << "==========> Reusing string with capacity " << str.capacity()
              << std::endl;
}
</code>

And then I added even more debug outputs to check if we have actual dequeueing of strings:

<code cpp>
if (!_recycleQueue.try_dequeue_from_producer(_recycleProducerToken, str)) {
    // Increase the count of "in flight strings"
    _numQueuedStrings.fetch_add(1, std::memory_order_release);
    std::cout << "==========> Num queued strings: "
              << _numQueuedStrings.load(std::memory_order_acquire) << std::endl;
    std::cout.flush();
} else {
    std::cout << "==========> Dequeuing string of capacity: " << str.capacity()
              << std::endl;
    std::cout.flush();
};
</code>

And indeed it seems we will queue some strings at first, but then we start re-using them: in my test I will queue up to **217 strings** before stabilization, **but** this is only because I slow down the task producer thread here with my access to std::cout: in a production scenario, I'm afraid this number could grow much more (which is not so good 😬).

Okay, then I also started to use bulk dequeuing of the strings on the logger thread, and this seems to help a lot:

<code cpp>
void LogManager::logger_thread() {
    std::string buffer;

    U32 maxNumStrings = NV_LOG_MSG_MAX_BULK_COUNT;
    std::vector<std::string> msgs(maxNumStrings);

    U32 lastNumQueuedStrings = 0;
    U32 lastMaxCount = 0;

    while (true) {
        U32 num = _numQueuedStrings.load(std::memory_order_acquire);
        if (num != lastNumQueuedStrings) {
            lastNumQueuedStrings = num;
            std::cout << "====> num queued strings: " << num << std::endl;
        }

        // if (!_msgQueue.wait_dequeue_timed(_msgConsumerToken, msg, 200)) {
        //     if (_stop) {
        //         break;
        //     }
        //     continue;
        // };

        U32 count = _msgQueue.wait_dequeue_bulk_timed(
            _msgConsumerToken, msgs.data(), maxNumStrings, 200);
        if (count == 0) {
            if (_stop) {
                break;
            }
            continue;
        };

        // Check if we have increased the count:
        if (count > lastMaxCount) {
            lastMaxCount = count;
            std::cout << "====> Max dequeue bulk count: " << count << std::endl;
        }

        // Concatenate all the strings in a single large buffer:
        U32 tsize = 0;
        for (U32 i = 0; i < count; ++i) {
            tsize += msgs[i].size();
        }

        // Add space for the newline character:
        tsize += count - 1;

        if (buffer.capacity() < tsize) {
            std::cout << "====> Increasing buffer capacity to: " << tsize
                      << std::endl;
            buffer.reserve(tsize);
        }

        // Copy the data:
        // char* ptr = buffer.data();
        buffer.clear();
        for (U32 i = 0; i < count; ++i) {
            auto& str = msgs[i];
            buffer += str;
            // memcpy(ptr, str.data(), str.size());
            // ptr += str.size();
            if (i < (count - 1)) {
                // Add the newline:
                // (*ptr++) = '\n';
                buffer += '\n';
            }
        }

        // recycle the strings:
        _recycleQueue.enqueue_bulk(_recycleProducerToken, msgs.data(), count);

        // We have a string to output:
        output_message(0, buffer);

        // // Recycle the message string:
        // _recycleQueue.enqueue(_recycleProducerToken, msg);
    }

    std::cout << "====> Exiting logger thread." << std::endl;
}
</code>

=> Trying to run 10000 tasks will take around **1.434630 secs** without the logger thread, while the "pure execution time" (ie. basically postponing the writing of the log messages to the sinks in the logger thread) is only **0.029723 secs** on my system otherwise, so that's definitely worth it.
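For reference, the numbers above come from a very basic benchmark along these lines (just a sketch reusing the ''TaskGraph''/''Timer''/logging APIs from the unit test shown in the next section; the scheduler class name and constructor argument are my assumption):

<code cpp>
// Schedule 10000 independent dummy tasks and measure the wall time:
auto sch = create_ref_object<MultithreadScheduler>(0.0F);
auto tg = create_ref_object<TaskGraph>();

std::atomic<U32> counter{0};
for (U32 i = 0; i < 10000; ++i) {
    tg->add_task([&counter, i]() {
        logNOTE("Executing dummy task {}", i);
        counter.fetch_add(1, std::memory_order_release);
    });
}

Timer timer;
timer.start();
sch->run(tg);
logWARN("Executed {} dummy tasks in {:.6f} secs",
        counter.load(std::memory_order_acquire), timer.stop() / 1e6);
</code>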
===== ✅ Multi-layers TaskGraph unit test =====

I then continued with the unit testing of the Scheduler/TaskGraph classes and prepared the following unit test:

<code cpp>
BOOST_AUTO_TEST_CASE(test_simple_taskgraph_deps) {
    // Run a taskgraph with multiple layers of dependencies:
    auto sch = create_ref_object<MultithreadScheduler>(0.0F);

    std::atomic<U32> global_counter{0};

    // Number of tasks per layer:
    U32 tnum = 20;
    // Number of layers:
    U32 lnum = 10;
    U32 tcount = 0;

    Timer timer;

    auto tg = create_ref_object<TaskGraph>();

    RefTaskSet prevLayer;
    RefTaskSet curLayer;

    // We prepare a vector of atomic int values:
    Vector<std::atomic<U32>> counters(lnum);

    // Those counters should start at 0:
    for (const auto& it : counters) {
        BOOST_CHECK_EQUAL(it.load(std::memory_order_acquire), 0);
    }

    SpinLock sp;

    for (int l = 0; l < lnum; ++l) {
        for (int i = 0; i < tnum; ++i) {
            // Add the task to the task graph:
            auto t = tg->add_task(
                [&global_counter, &counters, &sp, tnum, lnum, i, l]() {
                    logNOTE("Executing task {}-{}", l, i);

                    // All the tasks from the previous layers should be
                    // completed at this point:
                    for (I32 l2 = 0; l2 < l; ++l2) {
                        U32 num = counters[l2].load(std::memory_order_acquire);
                        // NVCHK(num == tnum, "Invalid value: {}", num);
                        // using BOOST checks with multiple threads
                        {
                            WITH_SPINLOCK(sp);
                            BOOST_CHECK_EQUAL(num, tnum);
                        }
                    }

                    // All the tasks from successor layers should be untouched:
                    for (I32 l2 = l + 1; l2 < lnum; ++l2) {
                        U32 num = counters[l2].load(std::memory_order_acquire);
                        {
                            WITH_SPINLOCK(sp);
                            BOOST_CHECK_EQUAL(num, 0);
                        }
                    }

                    // Increment the counter for the current layer:
                    counters[l].fetch_add(1, std::memory_order_release);
                    global_counter.fetch_add(1, std::memory_order_release);
                });

            // Add this task to the current layer:
            curLayer.insert(t);

            // Add all the tasks from the previous layer as predecessors for
            // this task:
            for (const auto& pred : prevLayer) {
                tg->add_dependency(t, pred);
            }

            tcount++;
        }

        // Move the current layer as the pred layer:
        prevLayer.clear();
        prevLayer.insert(curLayer.begin(), curLayer.end());
        curLayer.clear();
    }

    // Task should not be done:
    BOOST_CHECK_EQUAL(tg->is_done(), false);

    timer.start();

    // Run all those tasks:
    sch->run(tg);

    logWARN("Taskgraph with {} tasks / {} layers done in {:.6f} secs", tnum,
            lnum, timer.stop() / 1e6);

    // All layer counters should be completed:
    for (const auto& it : counters) {
        BOOST_CHECK_EQUAL(it.load(std::memory_order_acquire), tnum);
    }

    // Taskgraph should be marked as done:
    BOOST_CHECK_EQUAL(tg->is_done(), true);
    BOOST_CHECK_EQUAL(global_counter.load(std::memory_order_acquire), tcount);

    sch.reset();
}
</code>

A few notes on the code above:
  * First, using **BOOST_CHECK_EQUAL** from multiple threads at the same time doesn't seem to be supported out of the box: this will trigger the display of strange/partial debug output on std::cout if done in a naive way. So I had to introduce a SpinLock to synchronize this part of the task functions.
  * In the test above I build a task graph with 10 consecutive layers, and 20 tasks per layer, where each task on a given layer depends directly on all the tasks from the previous layer. This seems to work just fine actually, but what I noticed is that the debug outputs themselves can sometimes get mixed up anyway, like for instance:

<code>
2023-08-29 21:14:31.661667 [NOTE] Executing task 1-7
2023-08-29 21:14:31.661123 [NOTE] Executing task 0-18
2023-08-29 21:14:31.661134 [NOTE] Executing task 0-5
2023-08-29 21:14:31.661666 [NOTE] Executing task 1-0
====> num queued strings: 60
====> Max dequeue bulk count: 53
====> Increasing buffer capacity to: 2835
2023-08-29 21:14:31.672053 [NOTE] Executing task 2-9
2023-08-29 21:14:31.672062 [NOTE] Executing task 2-17
2023-08-29 21:14:31.672741 [NOTE] Executing task 3-17
2023-08-29 21:14:31.672850 [NOTE] Executing task 4-10
2023-08-29 21:14:31.672053 [NOTE] Executing task 2-18
2023-08-29 21:14:31.672741 [NOTE] Executing task 3-19
2023-08-29 21:14:31.672745 [NOTE] Executing task 3-4
2023-08-29 21:14:31.672849 [NOTE] Executing task 4-6
2023-08-29 21:14:31.672743 [NOTE] Executing task 3-13
2023-08-29 21:14:31.672054 [NOTE] Executing task 2-12
2023-08-29 21:14:31.672741 [NOTE] Executing task 3-14
2023-08-29 21:14:31.672850 [NOTE] Executing task 4-14
2023-08-29 21:14:31.672054 [NOTE] Executing task 2-10
</code>

=> We can see from the timetags that the lower layers still seem to be executed first (as expected), but I think what is going wrong here is the dequeuing of the messages in bulk: according to the ConcurrentQueue documentation I think there is no guarantee that the messages will come out in a specific order. So I should introduce some kind of mechanism to also pass a timetag along with the strings, so that I can then reorder them before concatenating everything into a single buffer. Let's see...

This led me to introduce a dedicated **Message** structure in my LogManager:

<code cpp>
struct Message {
    U64 timetag{0};
    std::string str;

    Message() = default;

    // Disable copies:
    Message(const Message&) = delete;
    auto operator=(const Message&) = delete;

    // Enable Move:
    Message(Message&&) noexcept = default;
    auto operator=(Message&&) noexcept -> Message& = default;

    ~Message() = default;
};
</code>

And this is what I was then trying to pass around with my queues:

<code cpp>
/** Message queue */
using MessageQueue = moodycamel::BlockingConcurrentQueue<Message>;
MessageQueue _msgQueue;

/** Additional queue to recycle the memory allocated for the output strings */
using RecycleQueue = moodycamel::ConcurrentQueue<Message>;
RecycleQueue _recycleQueue;
</code>

Only to realize immediately after that the Message **copy constructor** seems to be required by **ConcurrentQueue** 😭, since I get this error message on compile:

<code>
D:\Projects\NervLand\sources\nvCore\src\external\concurrentqueue.h:2238:111: error: call to deleted constructor of 'nv::LogManager::Message'
    MOODYCAMEL_CONSTEXPR_IF (!MOODYCAMEL_NOEXCEPT_CTOR(T, decltype(*itemFirst), new (static_cast(nullptr)) T(details::deref_noexcept(itemFirst)))) {
                                                                                                              ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
D:\Projects\NervLand\sources\nvCore\src\external\concurrentqueue.h:208:66: note: expanded from macro 'MOODYCAMEL_NOEXCEPT_CTOR'
#define MOODYCAMEL_NOEXCEPT_CTOR(type, valueType, expr) noexcept(expr)
                                                                 ^~~~
D:/Projects/NervLand/sources/nvCore/src\log/LogManager.h:129:9: note: 'Message' has been explicitly marked deleted here
        Message(const Message&) = delete;
        ^
</code>

... and this is a pain, because it could actually indicate that I'm copying the data where I was expecting to get my data moved instead, too bad 🙃.
But let's just try to confirm that, to be safe. Hmmm 🤔... Actually it seems this might only ever be required when bulk enqueuing messages... which is not something I need here. So maybe I could provide a dummy throwing implementation here? Arrrggh... I tried with this:

<code cpp>
LogManager::Message::Message(const LogManager::Message& rhs) {
    // THROW_MSG("Message(const Message&) should never be called.");
    std::cout << "!!!!!!!!!!!!! Calling message copy constructor with size: "
              << rhs.str.size() << std::endl;
};
</code>

And I got this kind of output:

<code>
====> Increasing buffer capacity to: 9798
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
</code>

Really not too good... There might be ways to work around this (by moving the string member in the copy constructor maybe?), but actually I'm now thinking about a noticeably different path: why not simply keep a "static" vector of Message objects and only pass the relevant message slot indices in the queues?! Since we will not continuously reallocate this vector, it should be safe to write to "free slots" from multiple threads. Let's try to implement that.
**Update**: actually I found one location where I was using enqueue_bulk:

<code cpp>
// recycle the strings:
_recycleQueue.enqueue_bulk(_recycleProducerToken, msgs.data(), count);
/*//*/
</code>

===== ✅ Static LogManager message array =====

Working on this I eventually extracted the std::string member from my **Message** class above, and thus created a simpler **MsgTag** struct:

<code cpp>
struct MsgTag {
    U32 index{0};
    U64 timetag{0};
};
</code>

While keeping the message strings in a fixed array:

<code cpp>
/** Storage for the messages **/
std::array<std::string, NV_LOG_MSG_QUEUE_CAPACITY> _msgArray;
</code>

Then I could simply pass the **MsgTag** corresponding to "filled slots" on the queue to the logger thread, and retrieve those MsgTags once they are not needed anymore to recycle the corresponding slots:

<code cpp>
// If using the log thread we first try to recycle the memory from a
// previously used string:
MsgTag mtag{};

if (!_recycleQueue.try_dequeue_from_producer(_recycleProducerToken, mtag)) {
    // There is no MsgTag ready for dequeueing yet, so we check if we are
    // already at full capacity, or if we can start using a remaining unused
    // slot:
    if (_numQueuedStrings.load(std::memory_order_acquire) ==
        NV_LOG_MSG_QUEUE_CAPACITY) {
        // All slots are used so we must really wait until one becomes free:
        std::cout << "LogManager: =========> All string slots in use, waiting."
                  << std::endl;
        while (!_recycleQueue.try_dequeue_from_producer(_recycleProducerToken,
                                                        mtag))
            ;
    } else {
        // We can start using an unused slot increasing the count of "in
        // flight strings"
        mtag.index = _numQueuedStrings.fetch_add(1, std::memory_order_release);
        // std::cout << "==========> Num queued strings: "
        //           << _numQueuedStrings.load(std::memory_order_acquire)
        //           << std::endl;
        // std::cout.flush();
    }
}

auto& str = _msgArray[mtag.index];

// else {
//     std::cout << "==========> Dequeuing string of capacity: "
//               << str.capacity() << std::endl;
//     std::cout.flush();
// };

// Note: whether the dequeue operation was successful or not should not be
// relevant here: if there is nothing to dequeue, we just use the string as
// is (eg. allocating some additional memory for it)
U32 tsize = strlen(buf.data()) + size;
if (str.capacity() < tsize) {
    // std::cout << "==========> Increasing string capacity to " << tsize
    //           << std::endl;
    str.reserve(tsize);
}
// else {
//     std::cout << "==========> Reusing string with capacity "
//               << str.capacity() << std::endl;
// }

str.clear();
str += buf.data();
str.append(data, size);

// Update the timetag:
mtag.timetag = _timeTag.fetch_add(1, std::memory_order_release);

// Try to send that message to the logger thread:
while (!_msgQueue.enqueue(mtag))
    ;
</code>

=> This works just fine, and with this implementation I'm sure that there is no expensive continuous dynamic allocation occurring for the strings, so all good for me 👍!
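For completeness, on the logger thread side the loop is essentially the bulk-dequeue version shown earlier, except that it now receives **MsgTag** values, sorts them by timetag (since the bulk dequeue gives no ordering guarantee across producers) before concatenating the corresponding slots, and then recycles the tags. A simplified sketch only (the real loop keeps the timeout/stop handling and buffer reuse shown previously, and needs ''<algorithm>'' for std::sort):

<code cpp>
// Sketch of the consumer side with the MsgTag mechanism:
std::array<MsgTag, NV_LOG_MSG_MAX_BULK_COUNT> tags{};
U32 count = _msgQueue.wait_dequeue_bulk_timed(_msgConsumerToken, tags.data(),
                                              tags.size(), 200);
if (count > 0) {
    // Restore the chronological order using the timetags:
    std::sort(tags.begin(), tags.begin() + count,
              [](const MsgTag& a, const MsgTag& b) {
                  return a.timetag < b.timetag;
              });

    std::string buffer;
    for (U32 i = 0; i < count; ++i) {
        buffer += _msgArray[tags[i].index];
        if (i + 1 < count) {
            buffer += '\n';
        }
    }

    // Make the slots available again for the producer threads:
    _recycleQueue.enqueue_bulk(_recycleProducerToken, tags.data(), count);

    output_message(0, buffer);
}
</code>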
===== ✅ LogManager thread local storage buffer =====

Another interesting update I made on the log manager was with the **get_mem_buffer()** method:

<code cpp>
auto LogManager::get_mem_buffer() -> fmt::memory_buffer& {
#if 1
    return threadData.buffer;
#else
    auto tid = std::this_thread::get_id();
    auto it = _memBuffers.find(tid);
    if (it != _memBuffers.end()) {
        return it->second;
    }

    // We must add the mem buffer:
    fmt::memory_buffer* bufPtr = nullptr;
    {
        WITH_SPINLOCK(_logSP);
        bufPtr = &_memBuffers[tid];
    }

    // logNOTE("Adding logmanager mem buffer for thread {}",
    //         get_current_thread_id());
    return *bufPtr;
#endif
}
</code>

The previous implementation (disabled above) was relying on an unordered_map where I was storing a memory_buffer for each thread, retrieving that mem buffer without any lock... but in fact I think even this is incorrect, since we could have a case where we read and write to the map at the same time from 2 threads. So now I'm using **Thread Local Storage** to store the per-thread data, and I leave it to the compiler to sort out any synchronization issue ;-). The good thing on top of that is that it also seems to improve the performances a bit:

<code cpp>
struct ThreadData {
    fmt::memory_buffer buffer;
    std::string str;
};

thread_local ThreadData threadData;
</code>

===== ✅ Additional producer classes =====

Adding implementation for:
  - **TileStorage**
  - **GPUTileStorage**
  - **TileLayer**
  - **TileCache**
  - **TileProducer**

Interestingly, I just noticed that the max number of layers for a Texture2D array in WebGPU is still relatively low at a **value of 256**, according to the limits I can read from my engine.
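For reference, this is roughly how that limit can be queried from the WebGPU C++ API directly (sketch only: ''device'' is assumed to be a valid ''wgpu::Device'', and the exact structs may differ depending on the Dawn/webgpu headers revision; in practice I read this value through the engine wrappers):

<code cpp>
// Query the device limits and report the texture array layer cap:
wgpu::SupportedLimits limits{};
if (device.GetLimits(&limits)) {
    logNOTE("maxTextureArrayLayers: {}", limits.limits.maxTextureArrayLayers);
    // => reports 256 here, which matches the WebGPU default limit for
    //    2D texture arrays.
}
</code>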