[Images: front, bottom, and side views of a path-traced Boeing 777]

This is my first raPT blog post, so welcome and thanks for reading. A couple of days ago, JCGT published my paper "Efficient Ray Tracing Kernels for Modern CPU Architectures", which introduces refined algorithms for BVH4 traversal of coherent and incoherent rays, named CLPT and ORST respectively.

The results show that CLPT can outperform Embree by up to 4x for primary ray traversal! For incoherent rays, ORST can achieve up to 60% higher traversal speed than DRST, the previously fastest method, which makes ORST about twice as fast as Embree for secondary rays.

Unsurprisingly, the motivation of the paper was to push traversal speed beyond previous approaches, but also to provide a unified pair of algorithms with best-in-class performance on a single acceleration structure. Previously, in my experience, a BVH2 was best for primary rays and a BVH4 was best for secondary rays. With CLPT, a BVH4 is always the right choice. While primary rays are not such a big deal in a path tracer, CLPT still noticeably reduces overall frame time. More importantly, my ray tracing kernels are also used in scientific rendering applications where local shading is sufficient and performance is dictated by primary ray traversal.

The images above demonstrate both CLPT and ORST in action with diffuse path tracing of a Boeing 777 model with 300 million triangles. The high-quality BVH4 of the Boeing was constructed within 15 seconds using an optimized, parallel SBVH implementation, which will be the topic of a future blog post. For now, you can find the paper's abstract below and the link to the full paper (and other publications) in the About section.

The recent push for interactive global illumination (GI) has established the 4-ary bounding volume hierarchy (BVH4) as a highly efficient acceleration structure for incoherent ray queries with single rays. Ray stream techniques augment the fast single-ray traversal with increased utilization of CPU vector units and leverage memory bandwidth for batches of rays. Despite their success, the proposed implementations suffer from high bookkeeping cost and batch fragmentation, especially for small batch sizes. Furthermore, due to the focus on incoherent rays, optimization for highly coherent BVH4 ray queries, such as primary visibility, has received little attention. Our contribution is twofold: For coherent ray sets, we introduce a large packet traversal tailored to the BVH4 that is faster than the original BVH2 variant, and for incoherent ray batches we propose a novel implementation of ray streams which reduces the bookkeeping cost while strictly maintaining the preferred traversal order of individual rays. Both algorithms are designed around a fast traversal order look-up mechanism. We evaluate our work for primary visibility and diffuse GI and demonstrate significant performance gains over current state-of-the-art implementations.
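
To give a rough idea of what a traversal order look-up can look like, here is a minimal C++ sketch. It is only an illustration under my own assumptions, not the mechanism from the paper: each BVH4 node stores a small table with one child visiting order per ray direction octant, built here by sorting child centroids along a representative direction. All names (`Vec3`, `OrderTable`, `rayOctant`, `buildOrderTable`, `orderedHits`) are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical minimal vector type, for illustration only.
struct Vec3 { float x, y, z; };

// Octant index from the direction signs: bit 0 = x < 0, bit 1 = y < 0, bit 2 = z < 0.
inline int rayOctant(const Vec3& d) {
    return (d.x < 0.0f ? 1 : 0) | (d.y < 0.0f ? 2 : 0) | (d.z < 0.0f ? 4 : 0);
}

// One byte per octant: a permutation of child slots 0..3, two bits per slot,
// with the nearest child stored in the lowest two bits.
using OrderTable = std::array<std::uint8_t, 8>;

// Assumption: order children by their centroid projected onto a representative
// direction (+/-1 on each axis) of the octant. Built once per node, offline.
inline OrderTable buildOrderTable(const std::array<Vec3, 4>& centroids) {
    OrderTable table{};
    for (int oct = 0; oct < 8; ++oct) {
        const Vec3 d{(oct & 1) ? -1.0f : 1.0f,
                     (oct & 2) ? -1.0f : 1.0f,
                     (oct & 4) ? -1.0f : 1.0f};
        auto key = [&](int i) {
            return centroids[i].x * d.x + centroids[i].y * d.y + centroids[i].z * d.z;
        };
        std::array<int, 4> order{0, 1, 2, 3};
        std::stable_sort(order.begin(), order.end(),
                         [&](int a, int b) { return key(a) < key(b); });
        std::uint8_t packed = 0;
        for (int k = 0; k < 4; ++k)
            packed |= static_cast<std::uint8_t>(order[k] << (2 * k));
        table[oct] = packed;
    }
    return table;
}

// At traversal time: look up the order for the ray's octant and keep only the
// children whose bit is set in hitMask (bit i = child i passed the box test).
// Writes the hit children front to back into out and returns their count.
inline int orderedHits(const OrderTable& table, const Vec3& rayDir,
                       unsigned hitMask, std::array<int, 4>& out) {
    const std::uint8_t packed = table[rayOctant(rayDir)];
    int n = 0;
    for (int k = 0; k < 4; ++k) {
        const int child = (packed >> (2 * k)) & 3;
        if (hitMask & (1u << child)) out[n++] = child;
    }
    return n;
}
```

The point of such a table is that the per-ray work at each node shrinks to one indexed load plus a filter over the hit mask, with no distance sorting during traversal. The price is that the order is only approximate for rays that are not aligned with the representative direction of their octant.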