# Accelerated Single Ray Tracing for Wide Vector Units

The annual High Performance Graphics conference (HPG’17) is kicking off today in LA with a very promising line-up of papers. A side effect of the continued VR hype is an increased interest in ray tracing methods motivated by adaptive rendering algorithms. In addition, production rendering has seen widespread adoption of ray tracing. So these are good times for ray tracing research, and HPG offers two papers focused on accelerated ray traversal methods: one targeted at the GPU by Nvidia (view), and one targeted at the CPU by me. You can read my contribution here, and afterwards keep reading this blog post, which describes some further optimizations of the algorithm (named WiVe) introduced in the paper.

After some further experimenting with WiVe I’ve discovered a few tweaks that make the algorithm even more efficient. An issue with the original implementation is the dependency chain created by storing nodes to the stack and reading the top node back. On the KNL architecture there is no store forwarding between the vector and integer units, so the data has to go through the L1 cache. Ideally, the next node to be traversed would be transferred directly from vector to integer registers in order to move stack operations out of the critical path of execution. As it happens, there is a way to achieve this, even without increasing the instruction count.

Take a look at Figure 1.
The idea is to always have the next node in the low element of the vector register after compression (D), because the low element can be easily transferred to an integer register using *movd* (E) (instructions are described here).
This means that the order of nodes in the vector register has to be inverted.
In addition, if none of the child nodes is intersected during a traversal step the low vector element must be occupied by the current stack top (T) for the algorithm to work.

The *movd* instruction loads the stack top into the low element of a vector register and sets all other elements to zero (C).
Using the masked variant of *vpcompressq* with the stack top as the destination register overwrites the stack top node only if one or more child nodes are intersected.
Thus, in all cases the next valid node ends up in the correct order in the low element of the stack vector, and the traversal can continue without further delay.
The remaining stack push is removed from the immediate dependency chain.
Since the nodes are now in inverted order, we use an inverted stack as well, one that grows towards smaller memory addresses.
To avoid overwriting previous stack data, the store (F) requires a new mask that reflects the new register positions of the active nodes after compression.
This mask is easily obtained with a *vptestmq* instruction which sets a mask bit for all non-zero elements.
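To make the sequence (C)–(F) concrete, here is a scalar emulation of one traversal step. This is a sketch under my own assumptions, not the paper's actual vector code: a 4-wide node layout, 64-bit node references that are never zero, `children[]` already ordered so that the nearest hit child has the lowest index, and stack-pointer arithmetic that is my reconstruction rather than taken from the implementation.

```c
#include <stdint.h>

/* Scalar emulation of one WiVe traversal step, following steps (C)-(F).
 * Assumptions (mine): 4-wide nodes, non-zero 64-bit node references,
 * children[] sorted so the nearest hit child has the lowest index, and
 * a stack growing toward smaller addresses with *sp at the current top. */
uint64_t traversal_step(const uint64_t children[4], unsigned hit_mask,
                        uint64_t **sp)
{
    uint64_t v[4];

    /* (C) movd emulation: stack top into the low element, rest zeroed */
    v[0] = **sp;
    v[1] = v[2] = v[3] = 0;

    /* (D) masked vpcompressq emulation: hit children are packed into
     * the low elements; unwritten elements keep the destination
     * contents, so the stack top survives in v[0] only when no child
     * was hit */
    int k = 0;
    for (int i = 0; i < 4; ++i)
        if (hit_mask & (1u << i))
            v[k++] = children[i];

    /* (E) low element to integer register: the next node to traverse */
    uint64_t next = v[0];

    /* (F) vptestmq emulation: store only the non-zero elements onto the
     * downward-growing stack; v[0] is consumed as `next`, so the new
     * top lands at base + 1 (my reconstruction of the pointer math) */
    uint64_t *base = *sp - k;
    for (int i = 0; i < 4; ++i)
        if (v[i] != 0)
            base[i] = v[i];
    *sp = base + 1;

    return next;
}
```

In the actual vector code, `next` comes straight out of the register via *movd*, so the masked store in (F) drops out of the immediate dependency chain; the emulation only mirrors the register/stack bookkeeping.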

Now the algorithm is almost complete; the only remaining issue is termination, since we no longer check the stack size during traversal. Instead, a sentinel element is added to the bottom of the stack (S), which is zero everywhere except for a set leaf-node flag. If the traversal reaches the bottom of the stack, it finds the fake leaf node and jumps to leaf intersection, where a simple test detects the sentinel and terminates the algorithm.
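The sentinel test at the leaf stage can be sketched as follows. The flag encoding is a hypothetical choice of mine (the paper does not specify which bit marks a leaf); only the shape of the check matters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding (hypothetical): one bit of the node reference marks
 * a leaf; the sentinel is zero everywhere except for that flag. */
#define LEAF_FLAG 0x1ULL
#define SENTINEL  LEAF_FLAG

/* Called whenever traversal reaches a leaf reference. Returns true if
 * the reference is the sentinel, i.e. the stack ran empty and the
 * traversal is done; a real leaf would be intersected here instead. */
bool intersect_leaf_or_terminate(uint64_t node_ref)
{
    if (node_ref == SENTINEL)
        return true;              /* bottom of stack reached: terminate */
    /* ... intersect the primitives of the real leaf ... */
    return false;                 /* keep traversing */
}
```

Because the sentinel already carries the leaf flag, the main loop needs no extra branch: it funnels into the existing leaf-intersection path, and only there does the cheap equality test fire.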

| | Mazda | San Miguel | Art Deco | Powerplant | Villa |
|---|---|---|---|---|---|
| Updated | 131.9 | 77.3 | 171.4 | 88.9 | 92.6 |
| Original | 126.7 | 73.1 | 165.0 | 85.4 | 87.4 |
| Improvement [+%] | 4 | 6 | 4 | 4 | 6 |

So how does the updated implementation perform? Was it worth the effort? Table 1 lists the results; the experimental setup is the same as described in the paper (AVX-512, Intel Xeon Phi 7250). Performance improved by 4-6% over the original implementation, making the currently fastest CPU-based BVH traversal for single rays even faster!