To be sure, the above example is contrived for our purposes here and is unlikely to occur in practice, since few real 3D objects partition either perfectly or barely at all. But the fact remains: real disparities among objects in partitioning performance do exist, resulting in real disparities—often vast ones—in the amount of intersect testing work between rays. Of course, this has serious consequences for parallel systems. Indeed, in the absence of carefully designed schemes that dynamically balance workloads among the various parallel compute nodes at runtime, successful parallelization appears impossible.
The MPI/RT Solution
MPI/RT smoothes work disparities between compute nodes via a dynamic load-balancing scheme. The scheme allows nodes that have finished working to probe their still-working neighbors, asking for work. When probed, a still-working node responds by splitting its remaining work, keeping some for itself and bundling the rest for transmission to the probing node over the cluster interconnect fabric. Through successive exchanges involving different nodes, the unfinished work of a single, lagging node gradually diffuses over the entire cluster.
While compute nodes communicate directly with each other, in a peer-to-peer fashion, to exchange ray bundles, MPI/RT implements a manager-worker scheme for system startup and shutdown. This allows a distinguished manager node to generate the initial population of rays and divide them among the worker nodes at startup. Modularizing initial ray division and broadcast logic in the code for the manager node allowed us to experiment with different techniques for assigning rays to worker nodes on startup. Ultimately, we found the optimal strategy to be randomization. But this isn’t surprising: by randomizing rays into bundles to be transmitted to worker nodes for processing, we mitigate the spatial coherence inherent in the ray tracing problem. Spatial coherence means that if one ray strikes a poorly-partitioned object, then the ray next to it is likely to strike the same object. So a structured initial work assignment scheme that assigns all the rays in the same region to the same compute node simply has the effect of concentrating a lot of difficult-to-process rays on the same node, which is precisely what we’re trying to avoid. Because an unstructured, random assignment scheme works against spatial coherence, it results in a more equitable initial division of work among nodes.
In the end, the hybrid load-balancing strategy
employed by MPI/RT effected a successful parallelization, even
in the presence of bounding volume hierarchies, and our project
passed with flying colors.
< PREV | PAGE 4