NervLand DevLog #29: Introducing Proland core elements

In this new devlog session I'm mainly focusing on introducing some of the base elements required to integrate/refactor the Inria Proland project into NervLand: a pretty tricky task, so this will definitely take some time, but we have to start somewhere, right?

In this process we will also cover some additional steps, like some improvements on the Scene Management system, the introduction of the TaskGraph and MultithreadScheduler subsystem, extensions to the LogManager to support multithreading with adequate performance, and we will also review how to extract the frustum planes from an (almost) arbitrary transformation matrix! Let's get started 😎.

YouTube videos (2 parts) for this article available at:

References:

Adding minimal (skeleton) implementation for the following classes:

  1. TileStorage
  2. TileProducer
  3. TileLayer
  4. TileCache
  5. GPUTileStorage
  6. TerrainNode
  7. TerrainQuad
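
To keep the big picture in mind, here is a rough sketch of the responsibilities of these classes as they are organized in the original Proland design (the empty bodies below are just placeholders, not the actual NervLand declarations):

class TileStorage {
    // A pool of fixed-size tile slots (CPU memory or GPU resources).
};

class GPUTileStorage : public TileStorage {
    // Tile storage backed by a 2D texture array, one tile per layer.
};

class TileCache {
    // Keeps track of the tiles currently stored and reuses the least recently
    // used slots of its TileStorage when new tiles must be produced.
};

class TileProducer {
    // Produces the content of a given tile (as a Task) into a TileCache.
};

class TileLayer {
    // Adds or modifies data on the tiles created by a TileProducer.
};

class TerrainQuad {
    // One node of the terrain quadtree (level, tx, ty and physical extent).
};

class TerrainNode {
    // Root of the quadtree, with its Deformation; decides which quads to
    // subdivide based on the camera position and the frustum planes.
};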

Next, I'm thinking I could start with updating my implementation of the SceneManager and Scene classes: these could now be extended to support our WebGPU engine in fact 🤔?

Since I'm updating the Scene/SceneManager classes, I also need to rebuild the lua bindings, and strangely enough this led me to some binding conflicts between the “int” and “long” types. So I had to update the default type registration for “int” in app/NervBind.lua to fix this:

    luna.tm:getOrCreateType("std::int32_t", { primitive = Type.BasicType.INT32 }):addNames("__int32_t", "int32_t", "int",
        "long")

Added an initial implementation of the SceneNode class.

Implemented a first type of SceneNode to render a given WGPU RenderPass: the SkyBox:

class NVGPU_EXPORT SkyBox : public SceneNode {

  public:
    struct CreateInfo {
        wgpu::Texture envmap{nullptr};
        U32 width{0};
        U32 height{0};
        I32 swapchain_idx{-1};
        // Window* window{nullptr};
    };

    explicit SkyBox(const CreateInfo& infos);

    static auto create(const CreateInfo& infos) -> RefPtr<SkyBox> {
        return nv::create_ref_object<SkyBox>(infos);
    }

    /** Process this skybox */
    void process() override;

  protected:
    /** Render pass for this Skybox */
    WGPURenderPass _rpass;
};

Fully encapsulated the frame rendering process in the WGPUEngine::render_frame() method:

void WGPUEngine::render_frame() {
    auto* sman = SceneManager::instance();

    pre_render();

    auto& bld = build_commands();
    sman->process_all();
    bld.submit(true);

    present();

    post_render();
}

Next, I encapsulated the WGPUCamera instance into the Scene object, so now we can simplify the setup_app() function even further:

void ExampleApp::setup_app() {
    // Enable ImGui:
    useImgui = true;

    // Get the env cubemap:
    // const char* tname = "tests/cmaps/yokohama_rgba";
    const char* tname = "tests/cmaps/bridge2";
    _ibl = generate_pbr_ibl_context(
        {.envmapFile = tname,
         .usage = TextureUsage::CopyDst | TextureUsage::TextureBinding});

    auto* sman = SceneManager::instance();
    auto& scene = sman->get_or_create_scene();

    auto& spass = scene.add_pass({.with_depth = true, .swapchain_idx = 0});
    spass.add_skybox(_ibl.filteredEnv);
    spass.add_pbr({
        .ibl = _ibl,
        .transform = Mat4f::scale(2.0, 2.0, 2.0) *
                     Mat4f::rotate(toRad(90.0), VEC3F_YAXIS) *
                     Mat4f::rotate(toRad(-90.0), VEC3F_XAXIS),
        .modelName = "tests/cerberus",
        .albedoFile = "tests/cerberus/albedo.png",
        .roughnessFile = "tests/cerberus/roughness.png",
        .metallicFile = "tests/cerberus/metallic.png",
        .normalFile = "tests/cerberus/normal.png",
        .aoFile = "tests/cerberus/ao.png",
    });
}

In the original Proland code we do something like this (when updating the TerrainNode):

    deformedCameraPos = owner->getLocalToCamera().inverse() * vec3d::ZERO;
    SceneManager::getFrustumPlanes(owner->getLocalToScreen(), deformedFrustumPlanes);
    localCameraPos = deform->deformedToLocal(deformedCameraPos);

The first stop for me here is the computation of the frustum planes from the localToScreen matrix, which goes as follows:

void SceneManager::getFrustumPlanes(const mat4d &toScreen, vec4d *frustumPlanes)
{
    const double *m = toScreen.coefficients();
    // Extract the LEFT plane
    frustumPlanes[0].x = m[12] + m[0];
    frustumPlanes[0].y = m[13] + m[1];
    frustumPlanes[0].z = m[14] + m[2];
    frustumPlanes[0].w = m[15] + m[3];
    // Extract the RIGHT plane
    frustumPlanes[1].x = m[12] - m[0];
    frustumPlanes[1].y = m[13] - m[1];
    frustumPlanes[1].z = m[14] - m[2];
    frustumPlanes[1].w = m[15] - m[3];
    // Extract the BOTTOM plane
    frustumPlanes[2].x = m[12] + m[4];
    frustumPlanes[2].y = m[13] + m[5];
    frustumPlanes[2].z = m[14] + m[6];
    frustumPlanes[2].w = m[15] + m[7];
    // Extract the TOP plane
    frustumPlanes[3].x = m[12] - m[4];
    frustumPlanes[3].y = m[13] - m[5];
    frustumPlanes[3].z = m[14] - m[6];
    frustumPlanes[3].w = m[15] - m[7];
    // Extract the NEAR plane
    frustumPlanes[4].x = m[12] + m[8];
    frustumPlanes[4].y = m[13] + m[9];
    frustumPlanes[4].z = m[14] + m[10];
    frustumPlanes[4].w = m[15] + m[11];
    // Extract the FAR plane
    frustumPlanes[5].x = m[12] - m[8];
    frustumPlanes[5].y = m[13] - m[9];
    frustumPlanes[5].z = m[14] - m[10];
    frustumPlanes[5].w = m[15] - m[11];
}

Nice and all, but I have no idea what is happening here, so I'm not going to just drop this into my engine: I need to understand what this code is doing first 😁.

References:

⇒ So from what I read we need to use the “Gribb/Hartmann” algorithm here, but in WebGPU the clip space on the z axis is in the range \(0 \lt z^\prime \lt w^\prime\) (instead of \(-w^\prime \lt z^\prime \lt w^\prime\) as in OpenGL). So in fact we will use the Direct3D clipping space rules here:

\[ -w^\prime \lt x^\prime \lt w^\prime \\ -w^\prime \lt y^\prime \lt w^\prime \\ 0 \lt z^\prime \lt w^\prime \]

Where \( \mathbf{v^\prime} = \left(x^\prime,y^\prime,z^\prime,w^\prime\right) \) are the transformed coordinates of our point (x,y,z,w) in the clip space:

\[ \mathbf{v^\prime} = M \cdot \mathbf{v} \Leftrightarrow \begin{bmatrix}x^\prime\\y^\prime\\z^\prime\\w^\prime\end{bmatrix} = \begin{bmatrix}m_{11} & m_{12} & m_{13} & m_{14}\\m_{21} & m_{22} & m_{23} & m_{24}\\m_{31} & m_{32} & m_{33} & m_{34}\\m_{41} & m_{42} & m_{43} & m_{44}\end{bmatrix} \cdot \begin{bmatrix}x\\y\\z\\w\end{bmatrix} \]

So now, if we just consider the left clipping plane, we have the inequality \( -w^\prime < x^\prime \), which gives us (ignoring any transposition symbols here, as they just make everything more confusing from my perspective, and the meaning below is obvious):

\[ - \mathbf{v} \cdot \mathbf{\text{row}_4} \lt \mathbf{v} \cdot \mathbf{\text{row}_1} \Leftrightarrow 0 < \mathbf{v} \cdot \left( \mathbf{\text{row}_1} + \mathbf{\text{row}_4}\right) \]

So this gives us the final plane equation (corresponding to the formula \(ax+by+cz+d=0\), with \(w=1\)):

\[ 0 = x \cdot (m_{11} + m_{41}) + y \cdot (m_{12} + m_{42}) + z \cdot (m_{13}+m_{43}) + (m_{14} + m_{44}) \]

If we then normalize the plane normal \((a,b,c)\), the formula \( ax_0 + by_0 + cz_0 + d \) will give us the signed distance of a point \(\left(x_0,y_0,z_0\right)\) to the plane, and we will need to ensure this distance is positive to be on the correct side of the plane here.
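
The near and far planes are where the WebGPU/Direct3D convention actually changes something compared to the original OpenGL-based Proland code: from \(0 \lt z^\prime\) the near plane only involves the third row of the matrix, while the far plane \(z^\prime \lt w^\prime\) still combines the third and fourth rows:

\[ 0 \lt z^\prime \Leftrightarrow 0 \lt \mathbf{v} \cdot \mathbf{\text{row}_3} \qquad\qquad z^\prime \lt w^\prime \Leftrightarrow 0 \lt \mathbf{v} \cdot \left( \mathbf{\text{row}_4} - \mathbf{\text{row}_3} \right) \]

This is why the NEAR plane in the implementation below uses the third row alone, unlike the m[12]+m[8] combination of the original code.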

⇒ Okay, so now I think I understand where I am, and I can build my own function for this. Let's do it 👍!

So this gives me the following final implementation:

void SceneManager::get_frustum_planes(const Mat4d& toScreen,
                                      Vec4d* frustumPlanes) {
    const Mat4d& m = toScreen;

    Vec3d norm;

    // Extract the LEFT plane
    frustumPlanes[NV_FRUSTUM_LEFT].set(m(3, 0) + m(0, 0), m(3, 1) + m(0, 1),
                                       m(3, 2) + m(0, 2), m(3, 3) + m(0, 3));

    // Extract the RIGHT plane
    frustumPlanes[NV_FRUSTUM_RIGHT].set(m(3, 0) - m(0, 0), m(3, 1) - m(0, 1),
                                        m(3, 2) - m(0, 2), m(3, 3) - m(0, 3));

    // Extract the BOTTOM plane

    frustumPlanes[NV_FRUSTUM_BOTTOM].set(m(3, 0) + m(1, 0), m(3, 1) + m(1, 1),
                                         m(3, 2) + m(1, 2), m(3, 3) + m(1, 3));

    // Extract the TOP plane
    frustumPlanes[NV_FRUSTUM_TOP].set(m(3, 0) - m(1, 0), m(3, 1) - m(1, 1),
                                      m(3, 2) - m(1, 2), m(3, 3) - m(1, 3));

    // Extract the NEAR plane
    frustumPlanes[NV_FRUSTUM_NEAR].set(m(2, 0), m(2, 1), m(2, 2), m(2, 3));

    // Extract the FAR plane
    frustumPlanes[NV_FRUSTUM_FAR].set(m(3, 0) - m(2, 0), m(3, 1) - m(2, 1),
                                      m(3, 2) - m(2, 2), m(3, 3) - m(2, 3));
}

Now also using double precision in the Camera implementation.

Important note: Next we have the function Deformation::localToDeformedDifferential():

auto Deformation::localToDeformedDifferential(const Vec3d& localPt,
                                              bool /*clamp*/) const -> Mat4d {
    return Mat4d::translate(Vec3d(localPt.x(), localPt.y(), 0.0));
}

⇒ I'm not quite sure if I really need to differentiate on the X/Y axis… To be clarified.

In fact, thinking more about this point: in our engine we use the canonical world orientation as follows:

  1. X axis = Right
  2. Y axis = Up
  3. Z axis = Forward

This doesn't match the default terrain orientation in Proland, which uses the Z axis as up. We could probably work around this, but to avoid dealing with too many conventions I think it would be better to simply use the X/Z plane for the terrain and keep the Y axis as up.

Next:

  • Implemented support for TerrainNode::get_visibility(…) and the Box3d class (a rough sketch of the underlying test is shown after this list).
  • Implemented initial processing of TerrainQuad::update() (without the occluder system).
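
To give an idea of how the frustum planes from the previous section get used here, below is a rough sketch of the kind of box-vs-frustum classification that get_visibility() relies on. This is only an illustration: the enum, the free function and the Vec3d/Vec4d accessors are assumptions on my side, not the actual NervLand API (which works on a Box3d directly).

enum class Visibility { Invisible, Partial, Full };

// Classic "p-vertex / n-vertex" test of an axis-aligned box (given by its
// min/max corners) against the 6 inward-facing frustum planes:
auto classify_box(const Vec4d* planes, const Vec3d& bmin, const Vec3d& bmax)
    -> Visibility {
    bool intersecting = false;
    for (int i = 0; i < 6; ++i) {
        double a = planes[i].x();
        double b = planes[i].y();
        double c = planes[i].z();
        double d = planes[i].w();

        // Box corner furthest along the plane normal ("positive vertex"):
        double px = a >= 0.0 ? bmax.x() : bmin.x();
        double py = b >= 0.0 ? bmax.y() : bmin.y();
        double pz = c >= 0.0 ? bmax.z() : bmin.z();
        // If even that corner is behind the plane, the box is fully outside:
        if (a * px + b * py + c * pz + d < 0.0) {
            return Visibility::Invisible;
        }

        // Box corner furthest against the plane normal ("negative vertex"):
        double nx = a >= 0.0 ? bmin.x() : bmax.x();
        double ny = b >= 0.0 ? bmin.y() : bmax.y();
        double nz = c >= 0.0 ? bmin.z() : bmax.z();
        // If that corner is behind the plane, the box straddles this plane:
        if (a * nx + b * ny + c * nz + d < 0.0) {
            intersecting = true;
        }
    }
    return intersecting ? Visibility::Partial : Visibility::Full;
}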

Here is the initial display result I got, with a simple shader displaying each quad's UV coords as a color:

Shortly after this initial result I updated the shader and the terrain uniforms to display the sine wave output as in the official Proland core example:

⇒ Working on the implementation of the terrain sample1 from Proland: http://proland.imag.fr/doc/proland-4.0/terrain/html/page-examples.html

Added implementation for Task class.

Added implementation for TaskGraph class.

Side note: to build NervLand on a Linux system I now had to install libstdc++-10-dev (i.e. sudo apt install libstdc++-10-dev), otherwise I get an error on #include <compare>.

Now that I have introduced support for the Task and TaskGraph classes, it's time to move to the Scheduler implementation. Except that for this one, I think there are a few points I could improve on:

First, the reference ORK implementation makes very extensive use of task sets, and relies on a mutex and condition variables to synchronize the task execution across multiple threads. Instead, I think I could use the Moodycamel concurrent queue for a lock-free implementation.
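
To make that idea a bit more concrete, here is a minimal sketch of the kind of worker loop I have in mind, assuming a moodycamel::BlockingConcurrentQueue of ready tasks (the work-item type and the dependency bookkeeping are simplified placeholders here, only the queue API is the real one):

#include <atomic>
#include <functional>

#include "blockingconcurrentqueue.h" // moodycamel blocking queue

struct SchedulerSketch {
    // Tasks whose dependencies are all satisfied, ready to be executed:
    moodycamel::BlockingConcurrentQueue<std::function<void()>> readyQueue;
    std::atomic<bool> stop{false};

    // Each worker thread runs this loop: no mutex/condition variable to manage
    // on our side, the blocking queue handles the waiting.
    void worker_loop() {
        std::function<void()> task;
        while (!stop.load(std::memory_order_acquire)) {
            // Wait up to 200 microseconds for a ready task:
            if (!readyQueue.wait_dequeue_timed(task, 200)) {
                continue;
            }
            task();
            // In the real scheduler this is where we would decrement the
            // remaining-dependency counters of the successor tasks and enqueue
            // those that just became ready.
        }
    }
};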

Side note: I had to rebuild LuaJIT on Linux during this process, as compatibility with Lua 5.2 was not enabled yet, and thus I got errors in the nvCore test app with the Lua function unpack().

While doing some initial tests on the MultithreadScheduler I realized that using the LogManager in its current implementation can really reduce performance significantly, due to the time spent in the FileLogger/StdLogger actually writing to the streams:

If, for instance, I bypass writing the “NOTE” messages I'm outputting in a temporary test, like this:

void FileLogger::output(int level, const char* prefix, const char* msg,
                        size_t size) {
    if (level == LogManager::NOTE) {
        return;
    }

    _stream << prefix;
    _stream.write(msg, (std::streamsize)size) << std::endl;
    _stream.flush();
}

then I get 100 dummy tasks processed in ~0.000250 secs, which seems fair, but if I keep writing this simple message, I get the same 100 tasks executed in ~0.002737 secs instead 😲. So that's probably something that deserves at least a little bit of optimization, I think.

Note: I get even larger durations (~6600us) when I use the actual message I want to output instead of a simple one (but again, when bypassing the writing to the streams I only get ~250us!):

    // logNOTE("Hello world!");
    logNOTE("Scheduling {} immediate tasks, {} pending, {} + {} dependencies.",
            _bulkStagedTasks.size(), _pendingTasks.size(), _dependencies.size(),
            _inverseDependencies.size());

So, what I'm thinking here now is that maybe I could use a dedicated thread to write the log messages, and maybe I could use my new friend the Moodycamel::concurrentqueue here to avoid any lock 😉! Let's see…

As a first step, I try to use simple std::string objects to pass them to the logger thread and then get them back for recycling:

    auto& str = get_output_string();

#if NV_USE_LOG_THREAD
    // If using the log thread we first try to recycle the memory from a
    // previously used string:
    _recycleQueue.try_dequeue_from_producer(_recycleProducerToken, str);

    // Note: whether the dequeue operation was successful or not should not be
    // relevant here: if there is nothing to dequeue, we just use the string as
    // is (e.g. allocating some additional memory for it)
    str.reserve(strlen(buf.data()) + size);
    str.clear();
    str += buf.data();
    str.append(data, size);

    // send that message to the logger thread:
    _msgQueue.enqueue(std::move(str));
#else
    str.reserve(strlen(buf.data()) + size);
    str.clear();
    str += buf.data();
    str.append(data, size);

    // lock the logging spinlock:
    WITH_SPINLOCK(_logSP);
    output_message(lvl, str);
#endif

But to be honest, I'm not absolutely sure this will have the minimal possible dynamic memory allocation cost 🤔.

To clarify that, I added the following code:

    U32 tsize = strlen(buf.data()) + size;
    if (str.capacity() < tsize) {
        std::cout << "==========> Increasing string capacity to " << tsize
                  << std::endl;
        str.reserve(tsize);
    }

And unfortunately, I seem to get quite a lot of allocation messages with that:

==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
==========> Increasing string capacity to 94
2023-08-29 11:21:08.051274 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
==========> Increasing string capacity to 94
2023-08-29 11:21:08.051277 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
==========> Increasing string capacity to 94
2023-08-29 11:21:08.051279 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
==========> Increasing string capacity to 94
2023-08-29 11:21:08.051283 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
==========> Increasing string capacity to 94
2023-08-29 11:21:08.051286 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
==========> Increasing string capacity to 942023-08-29 11:21:08.051289 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.

2023-08-29 11:21:08.051292 [NOTE] Scheduling 1 immediate tasks, 0 pending, 0 + 0 dependencies.
==========> Increasing string capacity to ==========> Increasing string capacity to 57

Now, is this really incorrect? I mean, we could be in a situation where we are posting multiple new messages from the same thread before we get a chance to recycle any allocated memory…

Let's try with a large number of tasks, and check the reuse as follows:

    U32 tsize = strlen(buf.data()) + size;
    if (str.capacity() < tsize) {
        std::cout << "==========> Increasing string capacity to " << tsize
                  << std::endl;
        str.reserve(tsize);
    } else {
        std::cout << "==========> Reusing string with capacity "
                  << str.capacity() << std::endl;
    }

And then added even more debug outputs to check if we have actual dequeueing of strings:

    if (!_recycleQueue.try_dequeue_from_producer(_recycleProducerToken, str)) {
        // Increase the count of "in flight strings"
        _numQueuedStrings.fetch_add(1, std::memory_order_release);
        std::cout << "==========> Num queued strings: "
                  << _numQueuedStrings.load(std::memory_order_acquire)
                  << std::endl;
        std::cout.flush();
    } else {
        std::cout << "==========> Dequeuing string of capacity: "
                  << str.capacity() << std::endl;
        std::cout.flush();
    };

And indeed it seems we will queue some strings but then start re-using them: in my test I queue up to 217 strings before stabilization, but this is only because I slow down the task producer thread here with my access to std::cout. In a production scenario, I'm afraid this number could grow much larger (which is not so good 😬).

Okay, then I also started to use bulk dequeuing of the strings on the logger thread, and this seems to help a lot:

void LogManager::logger_thread() {

    std::string buffer;

    U32 maxNumStrings = NV_LOG_MSG_MAX_BULK_COUNT;
    std::vector<std::string> msgs(maxNumStrings);

    U32 lastNumQueuedStrings = 0;
    U32 lastMaxCount = 0;

    while (true) {
        U32 num = _numQueuedStrings.load(std::memory_order_acquire);
        if (num != lastNumQueuedStrings) {
            lastNumQueuedStrings = num;
            std::cout << "====> num queued strings: " << num << std::endl;
        }

        // if (!_msgQueue.wait_dequeue_timed(_msgConsumerToken, msg, 200)) {
        //     if (_stop) {
        //         break;
        //     }

        //     continue;
        // };

        U32 count = _msgQueue.wait_dequeue_bulk_timed(
            _msgConsumerToken, msgs.data(), maxNumStrings, 200);

        if (count == 0) {
            if (_stop) {
                break;
            }

            continue;
        };

        // Check if we have increased the count:
        if (count > lastMaxCount) {
            lastMaxCount = count;
            std::cout << "====> Max dequeue bulk count: " << count << std::endl;
        }

        // Concatenate all the strings in a single large buffer:
        U32 tsize = 0;
        for (U32 i = 0; i < count; ++i) {
            tsize += msgs[i].size();
        }

        // Add space for the newline character:
        tsize += count - 1;

        if (buffer.capacity() < tsize) {
            std::cout << "====> Increasing buffer capacity to: " << tsize
                      << std::endl;
            buffer.reserve(tsize);
        }

        // Copy the data:
        // char* ptr = buffer.data();
        buffer.clear();
        for (U32 i = 0; i < count; ++i) {
            auto& str = msgs[i];
            buffer += str;
            // memcpy(ptr, str.data(), str.size());
            // ptr += str.size();
            if (i < (count - 1)) {
                // Add the newline:
                // (*ptr++) = '\n';
                buffer += '\n';
            }
        }

        // recycle the strings:
        _recycleQueue.enqueue_bulk(_recycleProducerToken, msgs.data(), count);

        // We have a string to output:
        output_message(0, buffer);

        // // Recycle the message string:
        // _recycleQueue.enqueue(_recycleProducerToken, msg);
    }

    std::cout << "====> Exiting logger thread." << std::endl;
}

⇒ Running 10000 tasks takes around 1.434630 secs without the logger thread, while the “pure execution time” (i.e. basically postponing the writing of the log messages to the sinks in the logger thread) is only 0.029723 secs on my system, so that's definitely worth it.

I then continued with the unit testing of the Scheduler/TaskGraph classes and prepared the following unit test:

BOOST_AUTO_TEST_CASE(test_simple_taskgraph_deps) {
    // Run a taskgraph with multiple layers of dependencies:

    auto sch = create_ref_object<MultithreadScheduler>(0.0F);

    std::atomic<I32> global_counter{0};

    // Number of tasks per layer:
    U32 tnum = 20;

    // Number of layers:
    U32 lnum = 10;

    U32 tcount = 0;

    Timer timer;

    auto tg = create_ref_object<TaskGraph>();

    RefTaskSet prevLayer;
    RefTaskSet curLayer;

    // We prepare a vector of atomic int values:
    Vector<std::atomic<U32>> counters(lnum);

    // Those counters should start at 0:
    for (const auto& it : counters) {
        BOOST_CHECK_EQUAL(it.load(std::memory_order_acquire), 0);
    }

    SpinLock sp;

    for (int l = 0; l < lnum; ++l) {
        for (int i = 0; i < tnum; ++i) {

            // Add the task to the task graph:
            auto t = tg->add_task(
                [&global_counter, &counters, &sp, tnum, lnum, i, l]() {
                    logNOTE("Executing task {}-{}", l, i);

                    // All the tasks from the previous layers should be
                    // completed at this point:
                    for (I32 l2 = 0; l2 < l; ++l2) {
                        U32 num = counters[l2].load(std::memory_order_acquire);
                        // NVCHK(num == tnum, "Invalid value: {}", num);
                        // using BOOST checks with multiple threads
                        {
                            WITH_SPINLOCK(sp);
                            BOOST_CHECK_EQUAL(num, tnum);
                        }
                    }

                    // All the tasks from successor layers should be untouched:
                    for (I32 l2 = l + 1; l2 < lnum; ++l2) {
                        U32 num = counters[l2].load(std::memory_order_acquire);
                        {
                            WITH_SPINLOCK(sp);
                            BOOST_CHECK_EQUAL(num, 0);
                        }
                    }

                    // Increment the counter for the current layer:
                    counters[l].fetch_add(1, std::memory_order_release);

                    global_counter.fetch_add(1, std::memory_order_release);
                });

            // Add this task to the current layer:
            curLayer.insert(t);

            // Add all the tasks from the previous layer as predecessors for
            // this task:
            for (const auto& pred : prevLayer) {
                tg->add_dependency(t, pred);
            }

            tcount++;
        }

        // Move the current layer as the pred layer:
        prevLayer.clear();
        prevLayer.insert(curLayer.begin(), curLayer.end());
        curLayer.clear();
    }

    // Task should not be done:
    BOOST_CHECK_EQUAL(tg->is_done(), false);

    timer.start();

    // Run all those tasks:
    sch->run(tg);

    logWARN("Taskgraph with {} tasks / {} layers done in {:.6f} secs", tnum,
            lnum, timer.stop() / 1e6);

    // All layer counters should be completed:
    for (const auto& it : counters) {
        BOOST_CHECK_EQUAL(it.load(std::memory_order_acquire), tnum);
    }

    // Taskgraph should be marked as done:
    BOOST_CHECK_EQUAL(tg->is_done(), true);

    BOOST_CHECK_EQUAL(global_counter.load(std::memory_order_acquire), tcount);

    sch.reset();
}

A few notes on the code above:

* First, using BOOST_CHECK_EQUAL from multiple threads at the same time doesn't seem to be supported out of the box: this will trigger the display of strange/partial debug output on std::cout if done in a naive way. So I had to introduce a SpinLock to synchronize this part of the task functions.

* In the test above I build a task graph with 10 consecutive layers and 20 tasks per layer, where each task on a given layer depends directly on all the tasks from the previous layer. This seems to work just fine actually, but what I noticed is that the debug outputs themselves can sometimes get mixed up anyway, for instance:

2023-08-29 21:14:31.661667 [NOTE] Executing task 1-7
2023-08-29 21:14:31.661123 [NOTE] Executing task 0-18
2023-08-29 21:14:31.661134 [NOTE] Executing task 0-5
2023-08-29 21:14:31.661666 [NOTE] Executing task 1-0
====> num queued strings: 60
====> Max dequeue bulk count: 53
====> Increasing buffer capacity to: 2835
2023-08-29 21:14:31.672053 [NOTE] Executing task 2-9
2023-08-29 21:14:31.672062 [NOTE] Executing task 2-17
2023-08-29 21:14:31.672741 [NOTE] Executing task 3-17
2023-08-29 21:14:31.672850 [NOTE] Executing task 4-10
2023-08-29 21:14:31.672053 [NOTE] Executing task 2-18
2023-08-29 21:14:31.672741 [NOTE] Executing task 3-19
2023-08-29 21:14:31.672745 [NOTE] Executing task 3-4
2023-08-29 21:14:31.672849 [NOTE] Executing task 4-6
2023-08-29 21:14:31.672743 [NOTE] Executing task 3-13
2023-08-29 21:14:31.672054 [NOTE] Executing task 2-12
2023-08-29 21:14:31.672741 [NOTE] Executing task 3-14
2023-08-29 21:14:31.672850 [NOTE] Executing task 4-14
2023-08-29 21:14:31.672054 [NOTE] Executing task 2-10

⇒ We can see from the timetags that the lower level layers still seem to be executed first (as expected), but I think what is going wrong is the bulk dequeuing of the messages: according to the ConcurrentQueue documentation, there is no guarantee that the messages will come out in a specific order.

So I think I should introduce some kind of mechanism to also pass a timetag along with the strings, so that I can then reorder them before concatenating everything into a single buffer. Let's see…

This led me to introduce a dedicated Message structure in my LogManager:

    struct Message {
        U64 timetag{0};
        std::string str;

        Message() = default;
        // Disable copies:
        Message(const Message&) = delete;
        auto operator=(const Message&) = delete;

        // Enable Move:
        Message(Message&&) noexcept = default;
        auto operator=(Message&&) noexcept -> Message& = default;

        ~Message() = default;
    };

And this is what I was then trying to pass around with my queues:

    /** Message queue */
    using MessageQueue = moodycamel::BlockingConcurrentQueue<Message>;
    MessageQueue _msgQueue;

    /** Additional queue to recycle the memory allocated for the output strings
     */
    using RecycleQueue = moodycamel::ConcurrentQueue<Message>;
    RecycleQueue _recycleQueue;

Only to realize immediately after that the Message copy constructor seems to be required by ConcurrentQueue 😭, since I get this error message when compiling:

D:\Projects\NervLand\sources\nvCore\src\external\concurrentqueue.h:2238:111: error: call to deleted constructor of 'nv::LogManager::Message'
                        MOODYCAMEL_CONSTEXPR_IF (!MOODYCAMEL_NOEXCEPT_CTOR(T, decltype(*itemFirst), new (static_cast<T*>(nullptr)) T(details::deref_noexcept(itemFirst)))) {
                                                                                                                                   ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
D:\Projects\NervLand\sources\nvCore\src\external\concurrentqueue.h:208:66: note: expanded from macro 'MOODYCAMEL_NOEXCEPT_CTOR'
#define MOODYCAMEL_NOEXCEPT_CTOR(type, valueType, expr) noexcept(expr)
                                                                 ^~~~
D:/Projects/NervLand/sources/nvCore/src\log/LogManager.h:129:9: note: 'Message' has been explicitly marked deleted here
        Message(const Message&) = delete;
        ^

… and this is a pain, because it could actually indicate that I'm copying the data where I was expecting it to be moved instead, too bad 🙃. But let's just try to confirm that to be safe. Hmmm 🤔… Actually it seems this might only ever be required when bulk enqueuing messages… which is not something I need here. So maybe I could provide a dummy throwing implementation here?

arrrggh… I tried with this:

LogManager::Message::Message(const LogManager::Message& rhs) {
    // THROW_MSG("Message(const Message&) should never be called.");
    std::cout << "!!!!!!!!!!!!! Calling message copy constructor with size: "
              << rhs.str.size() << std::endl;
};

And I got this kind of output:

====> Increasing buffer capacity to: 9798
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 52
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53
!!!!!!!!!!!!! Calling message copy constructor with size: 53

Really not too good… There might be ways to work around this (by moving the string member in the copy constructor maybe?), but actually I'm now thinking about a noticeably different path: why not simply keep a “static” vector of Message objects and only pass the relevant message slot indices in the queues?! Since we will not continuously reallocate this vector, it should be safe to write to “free slots” from multiple threads. Let's try to implement that.

Update: actually I found one location where I was using enqueue_bulk:

        // recycle the strings:
        _recycleQueue.enqueue_bulk(_recycleProducerToken, msgs.data(), count);

Working on this I eventually extracted the std::string member from my Message class above, and thus created a simpler MsgTag struct:

    struct MsgTag {
        U32 index{0};
        U64 timetag{0};
    };

While keeping the message strings in a fixed array:

    /** Storage for the messages **/
    std::array<std::string, NV_LOG_MSG_QUEUE_CAPACITY> _msgArray;

Then I could simply pass the MsgTag corresponding to “filled slots” on the queue to the logger thread, and retrieve those MsgTags once they are not needed anymore to recycle the corresponding slots:

    // If using the log thread we first try to recycle the memory from a
    // previously used string:
    MsgTag mtag{};

    if (!_recycleQueue.try_dequeue_from_producer(_recycleProducerToken, mtag)) {
        // There is no MsgTag ready for dequeueing yet, so we check if we are
        // already at full capacity, or if we can start using a remaining unused
        // slot:
        if (_numQueuedStrings.load(std::memory_order_acquire) ==
            NV_LOG_MSG_QUEUE_CAPACITY) {
            // All slots are used so we must really wait until one becomes free:
            std::cout
                << "LogManager: =========> All string slots in use, waiting."
                << std::endl;
            while (!_recycleQueue.try_dequeue_from_producer(
                _recycleProducerToken, mtag))
                ;

        } else {
            // We can start using an unused slot increasing the count of "in
            // flight strings"
            mtag.index =
                _numQueuedStrings.fetch_add(1, std::memory_order_release);
            // std::cout << "==========> Num queued strings: "
            //           << _numQueuedStrings.load(std::memory_order_acquire)
            //           << std::endl;
            // std::cout.flush();
        }
    }

    auto& str = _msgArray[mtag.index];

    // else {
    //     std::cout << "==========> Dequeuing string of capacity: "
    //               << str.capacity() << std::endl;
    //     std::cout.flush();
    // };

    // Note: whether we recycled an existing slot or claimed a new one above,
    // the string may still need to grow to hold the full message:
    U32 tsize = strlen(buf.data()) + size;
    if (str.capacity() < tsize) {
        // std::cout << "==========> Increasing string capacity to " << tsize
        //           << std::endl;
        str.reserve(tsize);
    }
    // else {
    //     std::cout << "==========> Reusing string with capacity "
    //               << str.capacity() << std::endl;
    // }

    str.clear();
    str += buf.data();
    str.append(data, size);

    // Update the timetag:
    mtag.timetag = _timeTag.fetch_add(1, std::memory_order_release);

    // Try to send that message to the logger thread:
    while (!_msgQueue.enqueue(mtag))
        ;

⇒ This works just fine, and with this implementation I'm sure that there is no expensive continuous dynamic allocation occurring for the strings, so all good for me 👍!
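
For completeness, here is roughly what the consuming side looks like on the logger thread with these tags: dequeue them in bulk, restore the ordering with the timetag, concatenate the corresponding strings and recycle the tags. This is a simplified reconstruction of that loop, not the exact code:

    std::array<MsgTag, NV_LOG_MSG_MAX_BULK_COUNT> tags{};

    U32 count = _msgQueue.wait_dequeue_bulk_timed(_msgConsumerToken, tags.data(),
                                                  tags.size(), 200);
    if (count > 0) {
        // Bulk dequeue gives no ordering guarantee, so sort by timetag first:
        std::sort(tags.begin(), tags.begin() + count,
                  [](const MsgTag& a, const MsgTag& b) {
                      return a.timetag < b.timetag;
                  });

        // Concatenate the corresponding strings into a single buffer:
        buffer.clear();
        for (U32 i = 0; i < count; ++i) {
            buffer += _msgArray[tags[i].index];
            if (i + 1 < count) {
                buffer += '\n';
            }
        }

        // MsgTag is trivially copyable, so bulk enqueuing it for recycling is
        // fine (unlike the previous move-only Message struct):
        _recycleQueue.enqueue_bulk(_recycleProducerToken, tags.data(), count);

        output_message(0, buffer);
    }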

Another interesting update I made on the log manager was with the get_mem_buffer() method:

auto LogManager::get_mem_buffer() -> fmt::memory_buffer& {
#if 1
    return threadData.buffer;
#else
    auto tid = std::this_thread::get_id();
    auto it = _memBuffers.find(tid);
    if (it != _memBuffers.end()) {
        return it->second;
    }

    // We must add the mem buffer:
    fmt::memory_buffer* bufPtr = nullptr;
    {
        WITH_SPINLOCK(_logSP);
        bufPtr = &_memBuffers[tid];
    }

    // logNOTE("Adding logmanager mem buffer for thread {}",
    //         get_current_thread_id());

    return *bufPtr;
#endif
}

The previous implementation (disabled above) was relying on an unordered_map where I was storing a memory_buffer for each thread, retrieving that mem buffer without any lock… but in fact I think even this is incorrect, since we could have a case where we read and write to the map at the same time from 2 threads. So now I'm using thread-local storage to store the per-thread data and leave it to the compiler to sort out any synchronization issue ;-). The good thing on top of that is that it also seems to improve performance a bit:

struct ThreadData {
    fmt::memory_buffer buffer;
    std::string str;
};

thread_local ThreadData threadData;

Adding implementation for:

  1. TileStorage
  2. GPUTileStorage
  3. TileLayer
  4. TileCache
  5. TileProducer

Interestingly, I just noticed that the max number of layers for a Texture2D array in WebGPU is still relatively low, at a value of 256, according to the limits I can read from my engine.
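
For reference, this is roughly how that limit can be read back from the device; I'm assuming here the Dawn C++ bindings of that period (a wgpu::SupportedLimits struct filled by Device::GetLimits()), and the _device handle is just a placeholder:

    wgpu::SupportedLimits supported{};
    if (_device.GetLimits(&supported)) {
        // 256 by default in the WebGPU spec unless a higher limit is requested
        // at device creation: this bounds the number of tiles a GPUTileStorage
        // can keep in a single texture array.
        U32 maxLayers = supported.limits.maxTextureArrayLayers;
        logNOTE("maxTextureArrayLayers: {}", maxLayers);
    }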