In terms of ray tracing performance, the best BVH construction algorithm known to date is the BVH with spatial splits (SBVH). In a standard BVH, sibling nodes overlap in space whenever their primitives do. The SBVH, in contrast, allows overlapping primitives to be split, which results in spatially disjoint sibling nodes. Primitive splitting is a costly operation, and it considerably complicates multi-threaded BVH construction because the memory buffers grow recursively.
The raPT renderer contains a highly efficient multi-threaded and vectorized SBVH construction framework. It is described in the paper "Parallel Spatial Splits in Bounding Volume Hierarchies", published at this year's Eurographics Symposium on Parallel Graphics and Visualization (EGPGV16).
Bounding volume hierarchies (BVH) are essential for efficient ray tracing. In time-constrained situations such as real-time or large model visualization, fast construction of BVHs usually compromises hierarchy quality, resulting in reduced rendering speed. We propose a parallel framework for the state-of-the-art BVH construction algorithm with spatial splits (SBVH) that provides highest quality hierarchies within a time frame competitive with low quality builders optimized for construction speed. We leverage both data and task parallelism to employ threading and single instruction, multiple data (SIMD) capabilities of modern CPUs. Our key contribution is a lightweight memory management and load balancing scheme that maximizes parallel efficiency.
The paper comes with supplementary code fragments that demonstrate how data parallel AVX instructions can be used to accelerate various kernel operations of the SBVH algorithm, in particular triangle splitting.
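To make the triangle-splitting kernel concrete, here is a minimal scalar sketch in C++. It is not the AVX code from the paper's supplementary material; it only illustrates the underlying operation: a triangle is cut by an axis-aligned plane, and a bounding box is computed for the part on each side. The names `Vec3`, `AABB`, and `splitTriangle` are hypothetical and chosen for this illustration only.

```cpp
#include <algorithm>
#include <array>
#include <limits>

// Hypothetical helper types for this sketch (not taken from raPT or the paper).
struct Vec3 { float v[3]; };

struct AABB {
    float lo[3], hi[3];
    AABB() {
        // Start with an empty (inverted) box so the first grow() sets both corners.
        for (int i = 0; i < 3; ++i) {
            lo[i] = std::numeric_limits<float>::infinity();
            hi[i] = -std::numeric_limits<float>::infinity();
        }
    }
    void grow(const Vec3& p) {
        for (int i = 0; i < 3; ++i) {
            lo[i] = std::min(lo[i], p.v[i]);
            hi[i] = std::max(hi[i], p.v[i]);
        }
    }
};

// Cuts the triangle at the plane x[axis] == pos and grows `left` and `right`
// with the bounds of the parts below and above the plane, respectively.
void splitTriangle(const std::array<Vec3, 3>& tri, int axis, float pos,
                   AABB& left, AABB& right) {
    for (int i = 0; i < 3; ++i) {
        const Vec3& a = tri[i];
        const Vec3& b = tri[(i + 1) % 3];
        const float da = a.v[axis] - pos;  // signed distance of a to the plane
        const float db = b.v[axis] - pos;  // signed distance of b to the plane

        // A vertex belongs to the side it lies on (to both if it is on the plane).
        if (da <= 0.0f) left.grow(a);
        if (da >= 0.0f) right.grow(a);

        // If the edge a->b strictly crosses the plane, the intersection point
        // belongs to both parts.
        if ((da < 0.0f && db > 0.0f) || (da > 0.0f && db < 0.0f)) {
            const float t = da / (da - db);
            Vec3 p;
            for (int k = 0; k < 3; ++k)
                p.v[k] = a.v[k] + t * (b.v[k] - a.v[k]);
            p.v[axis] = pos;  // snap to the plane to avoid rounding drift
            left.grow(p);
            right.grow(p);
        }
    }
}
```

The two resulting boxes meet exactly at the split plane, which is what makes the sibling nodes spatially disjoint. A vectorized variant would presumably process the edges of several triangles at once, but that is an assumption about the general approach, not a description of the paper's code.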
Here, I provide extended results obtained on a more common Intel Core i7-6700 processor. Both Embree (v2.9.0) and raPT have received further updates, so the data for the Xeon differs slightly from the values presented in the paper. In addition, to show the impact of spatial splits on BVH construction times, the table below includes timings with spatial splits disabled for both raPT and Embree.
As you can see, reducing the number of threads from 48 (Xeon) to 8 (Core i7) narrows the gap between Embree and raPT somewhat. This demonstrates how much performance raPT gains from its highly efficient parallel framework when many threads are active.
Without spatial splits, raPT and Embree are closer, with raPT still clearly in the lead. This puts into perspective the speed-up obtained from the vectorized split operations and the corresponding implementation of recursively growing memory buffers.