This is the first part of a series about the inner workings of the raPT renderer, uncovering techniques and implementation details. The topic for today is the acceleration structure implemented in raPT to speed up the calculation of ray-geometry intersections.
An acceleration structure reduces the algorithmic complexity of finding the closest triangle intersection for a given ray from linear to logarithmic in the number of triangles, decoupling the render performance from the scene size to some extent. Among the common acceleration structures for ray tracing are kd-tree, grid, and bounding volume hierarchy (BVH). The design of the acceleration structure is influenced by several factors, such as emphasis on dynamic or static geometry, flexibility, memory consumption, and by the capabilities of the target platform and the underlying microarchitecture.
The choice for raPT is a BVH with a branching factor of 4 (BVH4), meaning that a parent node references (up to) 4 child nodes. Increasing the branching factor increases the efficiency of data parallel processing and memory coherence, and reduces the depth of the hierarchy, which in turn reduces ray divergence and memory consumption. It also reduces culling efficiency leading to an increased number of ray-bounding box intersection tests and memory bandwidth, so there is a sweet spot which depends mainly on the traversal algorithm and the targeted micro architecture. Since the raPT traversal algorithm ORST processes 2 rays simultaneously, a branching factor of 4 is sufficient to saturate an AVX register.
A differentiating feature of my BVH4 structure is a fast masking mechanism, which allows compression of nodes and visibility editing of scene geometry. Each node contains two 4-bit masks, a static mask for compression and a dynamic mask for editing. Each mask bit maps to one of the child nodes, where a set bit enables the child node for traversal and a unset bit disables it. The dynamic mask is a subset of the static mask, so that the dynamic mask can only have a set bit where the static mask has a set bit as well. Initially both masks are equal, and during traversal only the dynamic mask is relevant.
Before continuing with the mask mechanism, let’s first take a look at the node data structure(s):
Compared to the original layout described in my ray traversal paper there have been some changes / improvements.
The number of
bvhNode4 elements after a
bvhNode4_box is no longer variable, but always 4.
This leads to a
bvhNode4_cluster which is always 128 bytes large and naturally aligned in memory.
child field contains the index for the child cluster or for the primitive cluster, depending on whether the node is an inner node or a leaf respectively.
This information is encoded in the
perm is the index into the traversal order look-up table and
mask contains the static mask in the upper bits and the dynamic mask in the lower bits.
Also the field for the primitive count has vanished from
bvhNode4 compared to the original layout.
In the case of a leaf node
child now references exactly one primitive cluster containing 4 triangles, which means that BVH construction needs to partition the primitives until only 4 or less remain in a single node, regardless of SAH cost.
This may seem like a drawback but in practice the number of SAH efficient leaves with primitive counts larger 4 is negligible to begin with, and the simpler intersection logic results in slightly improved traversal performance.
During construction of a BVH4 often nodes are created with 2 or 3 child nodes instead of 4. In the case of 2 child nodes, both are leaves, and in the case of 3 child nodes, 1 is a leaf and 2 are inner nodes. The distribution of the child count in a common scene is around 40%, 20%, and 40% for 2,3, and 4 children respectively.
Thus child clusters are not always fully occupied, resulting in wasted memory. Static masks allow to compress 2 unrelated pairs of child nodes into a single cluster, reducing the size of the hierarchy by around 20%. For the case of 3 child nodes, compression is not directly possible because there are no nodes with a single leaf. The strategy of ‘pulling up’ nodes can be applied here to fill the single empty slot with one of the two grand children. However, this can degrade traversal speed slightly, so it is a trade of between performance and (rather insignificant) memory savings.
The same mechanism is used to tightly pack primitive clusters for leaves with only 1, 2 or 3 triangles. Here leaves with 1 triangle can be paired with nodes containing 3 triangles. However, in practice 3 triangle leaves always outnumber single triangle leaves significantly, which prohibits perfectly dense packing.
When inspecting a model, you usually want to change the visibility of certain parts or delete unneeded geometry entirely. For this use case, dynamic masks are the ideal technique, see Figure 1. Selecting a material, object, mesh, or primitive, the corresponding triangles are disabled / enabled and a post-order traversal of the BVH propagates the visibility changes up the hierarchy. This way disabled subtrees are implicitly skipped for invisible geometry, accelerating ray traversal. Note that this feature does not add any additional instructions to the traversal routines. The original visibility of triangles / nodes can be reconstructed from the static masks.
Depending on the granularity of the visibility changes (e.g. material) it may be necessary to perform a post order traversal of the entire hierarchy to find all corresponding primitives. Thus the update is implemented with task-based parallelism utilizing the same parallelization framework powering the BVH builder. For common scenes a full update takes a few milliseconds at most, and for a huge data set like the Boeing the timing is around 700ms on a dual socket workstation.
The dynamic masks are also useful to implement a ‘delete with undo’ mechanism. By defining a special ‘delete’ material, assigning this material to a geometry selection will remove all the corresponding primitives from the scene. The deletions can be undone by toggling the visibility of the ‘delete’ material and changing the material for a selection of primitives. This is illustrated in Figure 2 where all the removed triangles appear in red.
For the efficient construction of high-quality split BVHs on many-core CPUs I have devised a fast parallelization framework, which is the topic of an upcoming paper already accepted for publication. I will highlight the work in a future blog post with some additional goodies, time permitting.