Multi-threading and SIMD optimization for Unreal Editor

Created: Feb 8, 2024 · Tags: C++

Introduction

While developing plugins or tools, performance and optimization considerations are often overlooked, especially when targeting the editor. Meanwhile, artists expect a seamless real-time experience in which the editor does not stall as they adjust properties back and forth. In this article, I am going to showcase techniques and tools that address these performance challenges, focusing on multi-threading alongside the game thread and on harnessing SIMD instructions tailored to supported CPU types. All code examples are based on Unreal Engine 5.2. If you encounter any issues or wish to get in touch, please don't hesitate to reach out. Your feedback is greatly appreciated!

Async Basics in Unreal

I'll provide an overview of fundamental concepts along with some widely used built-in multi-threading tools. For a more in-depth understanding of certain concepts or classes, please refer to the link below.

Async Basics

There are two major approaches for running work on other threads:
FRunnable — Just like the runnable interface in other languages. It has Init, Run, Exit, Stop.
TaskGraph — Suitable for small async operations that do not need pausing or callbacks.

FRunnable

Just like the Runnable interface in other languages, it has Init, Run, Exit, and Stop methods; you create the object and hand it to FRunnableThread::Create to run it on a dedicated thread.
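A minimal sketch (the class and thread names are illustrative):

class FMyWorker : public FRunnable
{
public:
    virtual bool Init() override { return true; }           // one-time setup before Run
    virtual uint32 Run() override
    {
        while (!bStopRequested)
        {
            // ... do the actual work ...
            FPlatformProcess::Sleep(0.01f);
        }
        return 0;
    }
    virtual void Stop() override { bStopRequested = true; } // request a graceful exit
    virtual void Exit() override {}                         // cleanup after Run returns

private:
    FThreadSafeBool bStopRequested = false;
};

// Usage: run the runnable on its own dedicated thread.
FMyWorker* Worker = new FMyWorker();
FRunnableThread* Thread = FRunnableThread::Create(Worker, TEXT("MyWorkerThread"));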

AsyncTask

If you want to run a small async operation without creating a dedicated class or starting a new thread, and you do not need control logic for pausing or callbacks, you can put it inside an AsyncTask running on the TaskGraph.
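For example, a minimal sketch (DoExpensiveMath is a hypothetical helper):

AsyncTask(ENamedThreads::AnyBackgroundThreadNormalTask, []()
{
    const float Result = DoExpensiveMath(); // hypothetical heavy function
    // Hop back to the game thread before touching UObjects or UI.
    AsyncTask(ENamedThreads::GameThread, [Result]()
    {
        UE_LOG(LogTemp, Log, TEXT("Async result: %f"), Result);
    });
});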

ParallelFor

A fancier version of AsyncTask that splits a for loop into multiple Tasks running in the TaskGraph.
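A minimal sketch:

TArray<float> Values;
Values.SetNum(1024);
ParallelFor(Values.Num(), [&Values](int32 Index)
{
    // Each index is processed independently, possibly on different threads.
    Values[Index] = FMath::Sqrt((float)Index);
});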

FNonAbandonableTask

A way to declare your own async tasks, in a format that sits in between FRunnable and lambda-like AsyncTasks. You implement your own code in a standalone class to make it more reusable; it runs on the TaskGraph instead of a dedicated thread, though without FRunnable's initialization and stopping logic.
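A minimal sketch of such a task (names are illustrative):

class FMyFilterTask : public FNonAbandonableTask
{
    friend class FAsyncTask<FMyFilterTask>;

public:
    explicit FMyFilterTask(int32 InChunkIndex) : ChunkIndex(InChunkIndex) {}

protected:
    void DoWork()
    {
        // ... process chunk ChunkIndex ...
    }

    FORCEINLINE TStatId GetStatId() const
    {
        RETURN_QUICK_DECLARE_CYCLE_STAT(FMyFilterTask, STATGROUP_ThreadPoolAsyncTasks);
    }

private:
    int32 ChunkIndex;
};

// Usage:
FAsyncTask<FMyFilterTask>* Task = new FAsyncTask<FMyFilterTask>(0);
Task->StartBackgroundTask();  // queue on the pool
// ... later: Task->EnsureCompletion(); delete Task;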

FFunctionGraphTask and TGraphTask

FFunctionGraphTask is another way to queue tasks using lambdas, with a single prerequisite and a completion action as optional parameters.
TGraphTask takes an entire user-defined class and adds it to the queue, with an optional list of prerequisites. You need to implement your own DoTask() function within your class for it to work.
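A minimal sketch of chaining two tasks with FFunctionGraphTask:

FGraphEventRef Work = FFunctionGraphTask::CreateAndDispatchWhenReady([]()
{
    // heavy work on any task-graph thread
}, TStatId(), nullptr, ENamedThreads::AnyThread);

FFunctionGraphTask::CreateAndDispatchWhenReady([]()
{
    // runs on the game thread only after Work has completed
}, TStatId(), Work, ENamedThreads::GameThread);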
 

AsyncPool and AsyncThread

AsyncPool will execute a given function asynchronously on the specified thread pool.
AsyncThread executes a given function asynchronously using a separate thread.
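Minimal sketches for both (GThreadPool is the engine's global pool; see Async/Async.h for the full parameter lists):

// Run on an existing thread pool; the TFuture lets us fetch the result later.
TFuture<int32> PoolResult = AsyncPool(*GThreadPool, []() { return 42; });

// Run on a brand-new, short-lived thread.
TFuture<FString> ThreadResult = AsyncThread([]() { return FString(TEXT("done")); });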
 

TFuture and TPromise

I will use an analogy to demonstrate the relationship; skip ahead if you prefer the technical description.
The analogy
Assume you are at a restaurant, about to order. After you place your order with the waiter, the chef makes you a TPromise: the promise that, eventually, you will get your food. Each promise object has a SetValue function that can be called only ONCE, in our analogy by the chef, to "fulfill this promise".
Meanwhile, the waiter gives you a token, a TFuture, that can be used to query the result of the associated promise. So you can do something else in the meantime (async calls), or simply do nothing and query the result every tick.
To summarize:
Ordering (Promise): You make a promise to get a result (place an order).
Fulfillment (Promise.SetValue()): The entity producing the result fulfills the promise by setting the result (chef preparing the food).
Token (Future): You receive a token (Future) representing your order.
Retrieving (Future.Get()): You use the token (call get()) to retrieve the result, blocking if needed (waiting at the chef’s counter).
Promise:
  • Represents the commitment to produce a result in the future.
  • Created by the entity initiating the asynchronous task (task producer).
  • Allows the task producer to fulfill the promise by setting the result with SetValue() (the std equivalents are set_value() and set_exception()).
Future:
  • Represents the result of an asynchronous operation.
  • Created by the entity using or waiting for the result of the asynchronous task (task consumer).
  • Allows the task consumer to retrieve the result using Get().
Remember that a promise cannot be abandoned; it must be completed. Creating too many promises at once will therefore cause a performance drop. Also, neither the TFuture nor the TPromise holds the operation itself.
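A minimal sketch of the handshake in Unreal types:

TPromise<int32> Promise;
TFuture<int32> Future = Promise.GetFuture();

// The producer fulfills the promise exactly once, from another thread.
Async(EAsyncExecution::Thread, [Promise = MoveTemp(Promise)]() mutable
{
    Promise.SetValue(42);
});

// The consumer can poll without blocking, e.g. every tick...
if (Future.IsReady())
{
    const int32 Value = Future.Get();
}
// ...or block until the value arrives by calling Future.Get() directly.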
You can check an example of their use in the PCG plugin, at PCGAsync.cpp line 46 in UE 5.2:

// Setup [current, last, nb points] data per dispatch
TArray<TFuture<int32>> AsyncTasks;
AsyncTasks.Reserve(NumFutures);

Example — Thread pool and thread object

Now, let's venture into creating our own thread pool and thread objects. Given that this tutorial primarily focuses on illustrating how the async system works, I'll skip some of the basic setup. The system is used for updating textures in the editor with CPU image filters such as Gaussian blur. I built it on a UTickableWorldSubsystem, which is basically a tickable UObject with the same lifetime as the UWorld.
We need an Executor to issue our tasks. For the sake of simplicity, we won't be overly concerned about unfinished tasks; we'll assume they all involve small numerical operations that complete swiftly.
Let’s look at the tasks class first.
For the FAsyncTextureSrcUpdateTask to access shared objects within the executor, it holds a pointer to the executor. Additionally, a parameter class passes essential information such as the execution ID, execution rounds, and a task GUID. This design makes future extension easy.
The Abandon method is invoked when this task is canceled, providing a cleanup mechanism. The core functionality of the task is executed within the DoWork method. It's worth noting that DoThreadedWork() is a virtual function inherited from the parent class, serving as an alternative location to place the actual work.
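The original post shows this class as a screenshot; here is a hedged reconstruction based on the description above (the executor type and member names are illustrative):

class FAsyncTextureSrcUpdateTask : public IQueuedWork
{
public:
    FAsyncTextureSrcUpdateTask(FTextureTaskExecutor* InExecutor, const FAsyncTaskExecutorParam& InParam)
        : Executor(InExecutor), Param(InParam) {}

    // Entry point the pool calls on a worker thread.
    virtual void DoThreadedWork() override
    {
        DoWork();
        bCompleted = true;
    }

    // Called instead of DoThreadedWork if the task is canceled; clean up here.
    virtual void Abandon() override {}

    bool IsCompleted() const { return bCompleted; }
    const FAsyncTaskExecutorParam& GetParam() const { return Param; }

private:
    void DoWork()
    {
        // Run the image filter for the chunk described by Param,
        // reading shared data through Executor.
    }

    FTextureTaskExecutor* Executor = nullptr; // hypothetical executor type
    FAsyncTaskExecutorParam Param;
    FThreadSafeBool bCompleted = false;
};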
Here is the FAsyncTaskExecutorParam mentioned before.
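A hedged sketch of what it might contain, based on the fields mentioned above:

struct FAsyncTaskExecutorParam
{
    int32 ExecutionId = 0;     // which slice of the image this task covers
    int32 ExecutionRounds = 1; // how many rounds the execution is split into
    FGuid TaskGuid;            // identifies the task for bookkeeping
};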
Declaration of the thread pool objects is straightforward.
The initialization is trivial as well: you only need to allocate the pool and specify the thread count. The second parameter of Create is the stack size for each thread, in bytes.
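A hedged sketch of both the declaration and the initialization (pool size and names are illustrative):

FQueuedThreadPool* TexturePool = nullptr; // e.g. a member of the subsystem

void InitThreadPool()
{
    TexturePool = FQueuedThreadPool::Allocate();
    // 4 worker threads, 32 KB of stack each (the stack size is in bytes).
    TexturePool->Create(4, 32 * 1024, TPri_Normal, TEXT("TextureFilterPool"));
}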
For the executor, the only crucial part we need to care about is the Execute function; a consolidated sketch follows the steps below. In a real-world scenario, though, you might want a scheduler thread for initializing data and issuing tasks asynchronously.
We need an array to track the tasks that are running.
To prevent our thread pool from being overwhelmed, we keep track of the current execution ID. Initially we only add 512 tasks to the pool, and we incrementally add more as existing tasks complete.
Following that, we define our iterate body as a lambda expression and pass it to our task object.
Finally, we add the task to the pool, where it is automatically queued and executed.
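A hedged reconstruction of Execute folding the steps above together (RunningTasks, CurrentExecutionId, and TotalChunks are illustrative members):

void FTextureTaskExecutor::Execute()
{
    const int32 MaxQueuedTasks = 512; // don't overwhelm the pool up front

    while (RunningTasks.Num() < MaxQueuedTasks && CurrentExecutionId < TotalChunks)
    {
        FAsyncTaskExecutorParam Param;
        Param.ExecutionId = CurrentExecutionId++;
        Param.TaskGuid = FGuid::NewGuid();

        // The iterate body (the filter lambda) is bound inside the task.
        FAsyncTextureSrcUpdateTask* Task = new FAsyncTextureSrcUpdateTask(this, Param);

        // Queued tasks are dequeued and run as soon as a worker is free.
        TexturePool->AddQueuedWork(Task);
        RunningTasks.Add(Task);
    }
}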
In the subsystem's Tick(), we check for completed tasks after scheduling new ones, as sketched below.
Once again, I've set a maximum count to avoid a bulk update on the render thread. The goal is to distribute the workload evenly across multiple frames, so you should aim to spread texture updates over several frames whenever feasible.
Then I check each task's result; completed tasks are removed from the ready-tasks array and flushed into a texture-update array.
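A hedged sketch of that Tick-side bookkeeping (the per-tick cap and arrays are illustrative):

void UTextureFilterSubsystem::Tick(float DeltaTime)
{
    Executor.Execute(); // schedule new tasks first

    const int32 MaxFlushPerTick = 8; // cap render-thread work per frame
    int32 NumFlushed = 0;

    for (int32 i = RunningTasks.Num() - 1; i >= 0 && NumFlushed < MaxFlushPerTick; --i)
    {
        if (RunningTasks[i]->IsCompleted())
        {
            PendingTextureUpdates.Add(RunningTasks[i]->GetParam()); // flush to the update array
            delete RunningTasks[i];
            RunningTasks.RemoveAtSwap(i);
            ++NumFlushed;
        }
    }
}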
With all the steps above finished, I simply update the texture regions that are dirty.
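For that final step, a hedged sketch using UTexture2D::UpdateTextureRegions (region coordinates are illustrative, assuming 4 bytes per pixel):

FUpdateTextureRegion2D* Region = new FUpdateTextureRegion2D(
    DestX, DestY, /*SrcX*/ 0, /*SrcY*/ 0, ChunkWidth, ChunkHeight);

Texture->UpdateTextureRegions(
    /*MipIndex*/ 0, /*NumRegions*/ 1, Region,
    /*SrcPitch*/ ChunkWidth * 4, /*SrcBpp*/ 4, ChunkPixels,
    [](uint8* SrcData, const FUpdateTextureRegion2D* Regions)
    {
        delete Regions; // called on the render thread once the copy is done
    });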

Result Showcase

With some extra work, leveraging async threads for texture processing becomes highly valuable. This approach proves especially beneficial for editor tools relying on painted masks, such as biome scattering. Combined with GPU generation, it offers artists a real-time editing experience.
Video: thresholding using multi-threading and async render-resource updates. A 2K texture runs at a stable 60+ fps in the editor, and the mipmap level can be switched dynamically based on performance.

SIMD Instructions

While multi-threading tools are beneficial, they might fall short for CPU-bound tasks like CPU image processing. Dividing minor numerical tasks among multiple threads can lead to excessive context switches, potentially throttling the editor. To mitigate this, it's essential to leverage hardware structures for simultaneous processing of multiple data, known as SIMD (single-instruction-multiple-data) instructions. Although Unreal Engine often encapsulates these instructions in FMath, understanding them is valuable.
In the following section, I'll provide an overview of CPU structures. Feel free to skip it if you prefer to dive into the tool development details. We'll primarily focus on Intel and AMD CPUs, with code supporting the AVX set. For other CPU types, the code can be easily adapted. Ensure your CPU series supports SIMD instruction sets before proceeding with the following sections.
Different Instruction Sets
Advanced Vector Extensions (AVX) and Streaming SIMD Extensions (SSE) are both instruction set architectures designed to perform parallel processing tasks, particularly in the context of floating-point operations. Here's a brief overview of each:
  1. Streaming SIMD Extensions (SSE):
      • SSE: Introduced by Intel in 1999.
      • Versions: SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2.
      • Data Width: 128 bits.
      • Registers: XMM0 through XMM7 (XMM0 through XMM15 in 64-bit mode); 128-bit registers.
      • Operations: Primarily focused on single-precision (32-bit) floating-point operations, but SSE2 introduced support for double-precision (64-bit) as well.
      • Applications: Widely used for multimedia and graphics processing.
  2. Advanced Vector Extensions (AVX):
      • AVX: Introduced by Intel in 2011.
      • Versions: AVX, AVX2.
      • Data Width: 256 bits.
      • Registers: YMM0 through YMM15 (256-bit registers).
      • Operations: Extended support for floating-point operations, integer operations (AVX2), and additional features like FMA (Fused Multiply-Add).
      • Applications: Targets a broader range of applications, including scientific simulations, financial modeling, and more extensive parallelism.
  3. Other Related Extensions:
      • AVX-512: An extension of AVX introduced by Intel, featuring 512-bit registers and additional instructions for further parallelism. It's designed for high-performance computing (HPC) and demanding workloads.
      • NEON: ARM's SIMD (Single Instruction, Multiple Data) architecture, used in ARM processors for tasks similar to SSE and AVX.
      • MMX: An older SIMD instruction set by Intel, introduced in the late 1990s. It was superseded by SSE.
Differences:
  • SSE primarily operates on 128-bit data, while AVX introduced wider registers (256 and 512 bits).
  • AVX extends support to a broader range of operations and introduces more registers, allowing for greater parallelism.
  • AVX-512 further extends the register width and provides even more advanced instructions for highly parallel workloads.
The choice between SSE, AVX, or AVX-512 depends on factors such as the target architecture, performance requirements, and the specific operations performed in the application. Developers often need to consider backward compatibility and choose the appropriate instruction set based on the capabilities of the target CPUs.

Basics—Vectorization

When processing arrays using for-loops, we often apply the same operations to an entire set of values. While modern C++ compilers can optimize code in some straightforward situations, in complex logic, relying solely on the compiler might not be sufficient. This is where our intervention as programmers becomes crucial.
For instance, consider the task of adding two arrays together. The naive scalar approach would be:
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
As you can see, for an array of four elements the loop executes the same operation four times. We can present it from a top-down perspective:
Vectorization of addition
In hardware, values are stored in binary form, which lets us perform the same operation simultaneously across an array. Assuming the elements are all 32-bit floats, the smallest SIMD register in a modern CPU is 128 bits wide, which means we can store four 32-bit float values in a single register. For simplicity, let's consider an array with values 1, 2, 3, and 4.
Two 128-bit registers A and B
We can then apply an add to these two registers, which is a swift and fundamental operation when working directly with registers, and store the result in a third register.
Bit-wise demonstration of vector addition
After that, we read the values out of the result register into a float array for further processing.
In fact, this is how CPUs already process our code; we are just explicitly instructing the compiler how to execute our logic.
This process is called vectorization, and the key to deploying SIMD optimization is vectorizing the algorithm. However, some algorithms, particularly those with complex control flow like deeply nested loops, multiple branches, or intricate switch statements, can pose challenges for optimization.
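Expressed with SSE intrinsics, the walkthrough above looks roughly like this:

#include <xmmintrin.h> // SSE

void AddArrays(const float* A, const float* B, float* C, int N)
{
    for (int i = 0; i + 4 <= N; i += 4)
    {
        __m128 VecA = _mm_loadu_ps(A + i);    // load 4 floats from A
        __m128 VecB = _mm_loadu_ps(B + i);    // load 4 floats from B
        __m128 Sum  = _mm_add_ps(VecA, VecB); // one instruction, four additions
        _mm_storeu_ps(C + i, Sum);            // write 4 results back
    }
    // A scalar tail loop would handle the remaining N % 4 elements.
}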

Example — Thresholding

To illustrate writing SIMD code, let's consider the example of applying thresholding to an image. First, let's explore the available data types. There are three general register widths: __m128, __m256, and __m512. __m128 represents a 128-bit register, __m256 a 256-bit register, and __m512 a 512-bit register. Each has two variants for integer and double-precision floating-point operations, distinguished by the suffixes "i" for integers and "d" for doubles. This naming convention applies to all three widths: __m128i and __m128d for 128-bit, __m256i and __m256d for 256-bit, and likewise for 512-bit.
Here's a summary of the characters commonly found in Intel-style SIMD intrinsics.
  1. Prefix (_mm, _mm256, etc.):
      • Indicates that the function is part of the Intel SIMD (Single Instruction, Multiple Data) intrinsics.
  2. Operation (add, sub, mul, set, loadu, etc.):
      • Describes the type of operation the intrinsic performs (e.g., addition, subtraction, multiplication).
  3. Shift/Mask (srli, slli, sra, srl, etc.):
      • Describes bitwise operations or shifts (e.g., shift right logical immediate, shift left logical immediate).
  4. Data Type (si128, ps, epi16, pd, etc.):
      • Specifies the type of data the intrinsic operates on (e.g., 128-bit packed integers, single-precision floating-point, packed 16-bit integers, double-precision floating-point).
  5. Immediate (_mm_set1_epi8, _mm_set_epi32, etc.):
      • Indicates that the intrinsic takes an immediate value as an argument.
  6. Size (128, 256, 512, etc.):
      • Specifies the size of the vector in bits (e.g., 128 bits, 256 bits).
  7. Suffix (pd, ps, epi8, epi16, epi32, epi64, etc.):
      • Indicates the specific type of packed data (e.g., packed double-precision, packed single-precision, packed 8-bit integers, packed 16-bit integers, packed 32-bit integers, packed 64-bit integers).
 
Given that image data is primarily represented as uint8, we'll utilize __m128i in our tutorial. Now, let's briefly review the algorithm. For thresholding, we can employ Intel intrinsics straightforwardly: obtain the minimum of the threshold value and the pixel value, then compare this minimum with the threshold. If they are equal, the pixel value is greater than or equal to the threshold, and we generate an all-ones mask at that location.
 
Here is the implementation of the logic we described above. As you can see, it is very straightforward.
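The original post shows the code as a screenshot; the following is a hedged SSE2 reconstruction of the same idea (function and buffer names are illustrative):

#include <emmintrin.h> // SSE2

void ThresholdSSE(const uint8* Src, uint8* Dst, int32 Num, uint8 Threshold)
{
    const __m128i ThresholdVec = _mm_set1_epi8((char)Threshold);
    for (int32 i = 0; i + 16 <= Num; i += 16)
    {
        __m128i Pixels = _mm_loadu_si128((const __m128i*)(Src + i));
        // min(pixel, threshold), using the unsigned 8-bit minimum
        __m128i MinVal = _mm_min_epu8(Pixels, ThresholdVec);
        // min == threshold exactly when pixel >= threshold: all-ones mask there
        __m128i Mask = _mm_cmpeq_epi8(MinVal, ThresholdVec);
        _mm_storeu_si128((__m128i*)(Dst + i), Mask);
    }
}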
Demo code; please write proper code in real applications.
By integrating this approach with the previously implemented async system, we can achieve further optimization of the execution time.
Enabling SIMD instructions significantly reduces the time required for thresholding, and this is just a preliminary implementation.
 

Example — Gaussian Blur

Convolution algorithms are trickier to vectorize. There are two key points—convolution and fixed-point optimization.
In thresholding, we can simultaneously process 16 8-bit pixels. But with Gaussian blur, where many kernel values are fractional, using floating-point numbers would significantly reduce the number of pixels we can process per round. To address this limitation, fixed-point optimization becomes necessary. While it introduces some approximation error, that proves tolerable for many painting tasks.
To implement fixed-point optimization, we left-shift our 8-bit pixel to make it a 16-bit value: the lower 8 bits represent the fractional part, while the higher 8 bits represent the integer part. We then multiply each fractional kernel weight by 255 (max uint8), so that values like 0.5 become 127 and 0.25 becomes 64.
Then, we broadcast this value across the entire register, left-shift the kernel, and multiply the shifted pixel and kernel.
Afterward, we right-shift the result by 8 bits to extract the actual result. With a 256-bit register, this approach allows us to simultaneously process 16 pixels per run during convolution operations.
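One common way to realize this scheme, sketched with AVX2 intrinsics (widening the pixels to 16-bit lanes stands in for the left shift described above; Src is an illustrative pointer):

// 0.25 in 8.8 fixed point: 0.25 * 255 = 64 (rounded)
const __m256i Weight = _mm256_set1_epi16(64);

__m128i Pixels8  = _mm_loadu_si128((const __m128i*)Src); // 16 x uint8
__m256i Pixels16 = _mm256_cvtepu8_epi16(Pixels8);        // widen to 16 x uint16
__m256i Product  = _mm256_mullo_epi16(Pixels16, Weight); // pixel * weight
__m256i Result   = _mm256_srli_epi16(Product, 8);        // >> 8 back to pixel range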
To vectorize the convolution operation, we trade space for time by mirroring pixels on the edge to handle border conditions. I preprocess the texture to expand in four directions based on the kernel size. For instance, if the texture is initially 2048 * 2048 and the kernel is three elements, the texture expands to 2050 * 2050.
X0 is the edge of the original picture; k1, k2, k3 are the kernel elements.
In the vectorized convolution process, we sequentially load three groups of 16 pixels into three 256-bit registers (or 8 pixels each with 128-bit registers), offsetting the window by one pixel each time. This is the essence of vectorized convolution.
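A hedged sketch of one 3-tap pass over a row, combining the sliding window with the fixed-point weights (K1/K2/K3 hold the 8.8 weights; Row points into the border-expanded texture):

for (int32 x = 0; x < Width; x += 16)
{
    // Three windows, each offset by one pixel, widened to 16-bit lanes.
    __m256i Left   = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Row + x)));
    __m256i Center = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Row + x + 1)));
    __m256i Right  = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Row + x + 2)));

    // Weighted sum in 8.8 fixed point; the weights sum to ~255, so no overflow.
    __m256i Sum = _mm256_mullo_epi16(Left, K1);
    Sum = _mm256_add_epi16(Sum, _mm256_mullo_epi16(Center, K2));
    Sum = _mm256_add_epi16(Sum, _mm256_mullo_epi16(Right, K3));
    Sum = _mm256_srli_epi16(Sum, 8); // back from fixed point

    // Pack the 16 x uint16 results down to 16 x uint8 and store.
    __m128i Out = _mm_packus_epi16(_mm256_castsi256_si128(Sum),
                                   _mm256_extracti128_si256(Sum, 1));
    _mm_storeu_si128((__m128i*)(DstRow + x), Out);
}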
In the profiler, the longest execution time observed for this is 3.5 ms. We can optimize that as well by limiting how many chunks we update per tick. Each chunk takes around 130 microseconds on average to perform a 3-element kernel blur, allowing the editor to maintain 60 fps during the operation.

Summary

In this tutorial we learned how to use SIMD intrinsics and multi-threading to optimize code. The optimization process involves trial and error, requiring performance profiling and iterative adjustment. It is also a fun process, though, since you can always discover clever tricks to make your program run faster.
 
