Currently I’m implementing a HDR pipeline for environment map lighting in raPT, and I have found HDR image processing to be quite demanding computationally. In particular the generation of full resolution CDF tables for importance sampling can take a long time with a straightforward single-threaded implementation. For example, only building the CDF table for a 8k×4k HDR image (floats) already requires about 600ms on my Intel Xeon E5 v3. Since I want to use a slider to adjust exposure and rotation of the environment maps in my scenes, this is real annoyance instead of real time.

Obviously, implementing multi-threading support will already give a considerable boost on my dual socket 24 core machine, especially since the problem is embarrassingly parallel along one of its two dimensions. However as usual curiosity took over and I wanted to see how far I can push performance, so I optimized the memory access patterns and added AVX vectorization in addition to dynamically load-balanced multi-threading. The results are decent and, while not providing any spectacular new insights, highlight some important performance characteristics of modern CPU architectures.

First, let me state the problem more clearly. A u-v parametrized environment map with width w and height h in a 2:1 aspect ratio is mapped to a sphere by aligning u with the azimuthal angle φ and v with the polar angle θ. For importance sampling each image pixel is assigned a PDF value derived from its scalar intensity scaled by sin(θ), and the resulting two-dimensional discrete PDF is separated into one-dimensional marginal and conditional densities. The density values are accumulated along the image columns into monotone step functions corresponding to the conditional CDFs, and from the integrals of the conditional CDFs a single marginal CDF is calculated. Sampling is accomplished by drawing a pair of random numbers, which are used to index the marginal CDF and the resulting conditional CDF respectively by finding the function’s step interval that includes the value of the random number.

The performance critical part is the calculation of the conditional CDF for each of the pixel columns. Accumulation is performed along the v axis while the image is stored in u direction in memory. This guarantees 1-2 cache misses for every accessed pixel composed of 3 floats. The text book approach is to use some kind of blocking technique, which is implicit in the AVX implementation and load balancing described below.

The rgb AOS format of the pixel data is a less-than-ideal match for 8-wide SIMD instructions, unfortunately. The best approach here is to load 8 pixels at once (96 bytes) using 3 registers, and to perform a shuffle dance to get a single vector for each component. This is accomplished by 3 vperm2f128, 6 vblendps and 3 vpermilps instructions, which was the shortest sequence I could come up with, but I always welcome clever suggestions. Once everything is in place the PDF values for the 8 pixels are calculated in parallel and … another inconvenience is encountered.

Unpacking 3 registers of AOS pixel data into SOA format. Unpacking 3 registers of AOS pixel data into SOA format. Urgh!

Each of the calculated PDF values belong to a different conditional CDF so accumulation requires gathering the previous results from separate memory locations and scattering the new results back to separate memory locations again. Since AVX gather/scatter support is either notoriously slow or non-existing, a small accumulation buffer is used to store the results of 8 iterations along the v axis in SOA format. Once the buffer is full, the data is read back into the AVX registers and a full 8x8 transpose swaps rows and columns so that 8 results can be stored for each of the 8 conditional CDFs being processed. The last entry of the accumulation buffer is used to seed the calculations of the next 8 iterations. This way all the gather/scatter operations are avoided at the cost of several shuffle instructions.

The final optimization is to process 16 instead of 8 conditional CDFs in parallel, essentially duplicating the code described above. The motivation is twofold, first to fully utilize the cache lines being fetched (exactly 3 for 16 pixels), and second to provide two independent instruction streams to increase IPC.

Load balancing for multi-threading is then implemented with a simple atomic counter that represents the pixel columns of the HDR image, which is incremented by 16 every time a thread requires more work until the counter value is equal to the image width.

original AVX multi-threaded
8k×4k 619ms 166ms 14ms (44x)
4k×2k 116ms 41ms 3ms (39x)
1k×512 2.42ms 0.74ms 0.07ms (35x)

Results are listed in the table above for 3 different image sizes and 3 implementations (original, AVX, multi-threaded). Let’s compare the original and AVX implementations first. The speed-up is around 3-4x which is reasonable considering the ratio of shuffle to actual compute instructions. Next, let’s focus on multi-threading scalability. Activating all 24 cores/48 threads leads to a speed-up of 11-13x which is rather low. The reason is memory bandwidth saturation since the ratio of data movement to compute is quite high. Intel reports 118Gb/s for a synthetic benchmark on a dual socket system almost identical to mine, and since my memory buffers live all on the same NUMA node I can potentially reach half of this. The medium-sized benchmark achieves an effective bandwidth of 4k∗2k∗(12b+4b)/3ms≈45Gb/s which is about 70% of the maximum. Most of the remaining bandwidth is consumed by hardware prefetching cache lines along the image rows that are evicted before they are actually used and thus need to be reloaded.

This brings me to a point that probably stood out to you from the start: Swapping the axes for marginal and conditional CDFs would linearize memory access and align with the hardware prefetcher, and in addition improve load balancing because rows could be processed individually instead of in sets of 8 or 16. The simple justification for not doing this is that the format is decided by the sampling code, and for performance reasons I use the long axis for the marginal CDF. Just out of interest I tested swapping the axes for the CDF computation, and achieved 25-30% shorter run times and an effective bandwidth of 95% due to the more efficient memory access.

Concluding this rather technical (tedious?) post, my initial performance issue has been resolved and I can happily adjust exposure and rotation of my full resolution HDR environment maps in real time now. Considering the severe bandwidth bottleneck and the large number of threads in my test machine, the AVX implementation was not strictly necessary but a nice exercise nevertheless.

The next post will introduce the first version of the raPT path tracer featuring my fast ray tracing kernels and this HDR map processing implementation of course, so expect enjoyable pretty pictures!