LOCAL: Locality-Optimizing Caching Algorithms and Layouts

Peter Lindstrom (Principal Investigator), Jonathan Cohen, Mark Duchaineau, Martin Isenburg, Valerio Pascucci
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
Sung-Eui Yoon
Korea Advanced Institute of Science and Technology
Dinesh Manocha, Jack Snoeyink
University of North Carolina at Chapel Hill
Jarek Rossignac
Georgia Institute of Technology

Overview

Advances in simulation and remote sensing have led to a recent explosion in the size of geometric data. For example, a single time step from a scientific simulation may involve up to billions of elements, exceeding the amount of memory available even on small-scale parallel machines, not to mention end-user desktops. Visualization and analysis techniques are commonly used to explore and make sense of such data sets. However, the sheer size of data poses significant problems in all stages of the visualization pipeline, from offline pre-processing of simulation data (e.g. filtering, remeshing, simplification, compression, hierarchy construction), to interactive queries (e.g. slicing, isocontouring, adaptive view-dependent meshing), to real-time rendering (e.g. organization, shipping, and caching of graphics primitives). Moreover, visualization data is often unstructured in nature, which further complicates its management and representation.

Figure 1: Example stream processing pipeline. Each stream module uses a small, fixed-size in-core buffer that continuously slides over the mesh in increments of single triangles, thus providing a narrow window over the much larger on-disk data set. Mesh edges between buffered and not yet processed triangles form a stream input boundary, from which triangles are read. Similarly, triangles are written along an output boundary, which coincides with the input boundary for the downstream module in the pipeline.

Whereas CPU and GPU performance, as well as storage capacity, have by and large kept pace with the exponential growth in data, data access latency (e.g. disk seek time) and bandwidth (e.g. disk-to-memory transfer) have not. The large differences in performance between cache levels and relatively slow inter-cache data transfer cause major bottlenecks in data processing, especially for large data sets that span all levels of cache, but also for in-core processing and packaging of low-level graphics primitives. Furthermore, as the rapid growth in CPU and GPU performance is likely to be sustained for the foreseeable future, addressing the latency and bandwidth bottlenecks becomes increasingly important. Therefore, redesigning visualization and analysis algorithms to make better use of limited bandwidth but plentiful compute resources is critical to achieving efficiency and scalability.

The goal of this project is to develop techniques for reducing bandwidth requirements for large unstructured data, both explicitly, by making use of data compression, and implicitly, by optimizing the layout of the data for better locality and cache reuse. Many offline mesh processing tasks can be cast in terms of iterators over the data set, e.g. "for each mesh element, do some processing." Such tasks, including compression, filtering, analysis, and local operators such as derivatives, do not require visiting the data in any particular order; however, the data is often orders of magnitude larger than main memory and cannot be processed by conventional in-core algorithms. For such tasks, we propose restructuring the computation to perform windowed stream processing. Streaming supports fast sequential I/O in a single pass over the data by exposing only a subset of it through a sliding window (Figure 1), within which localized random access is possible. Furthermore, stream modules can work in parallel via pipelined processing, which obviates costly intermediate disk storage and amortizes latency over the entire stream. Because random access to disk is not needed, compressed I/O, which further reduces bandwidth requirements, can easily be realized and made transparent to the application. One of the main challenges of stream processing is achieving locality in the unstructured data stream in order to minimize the memory footprint and the time mesh elements need to be buffered (Figure 2). To this end, we are designing optimal data layouts (Figure 3).
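
To make the streaming model concrete, the C++ sketch below shows the skeleton of a single stream module. The record layout and the finalization convention (a triangle flags each vertex it references for the last time) are assumptions chosen for illustration, not the project's actual file format; the essential points are that only a small sliding window of vertices is ever resident, and that a vertex can be evicted as soon as the stream guarantees it will not be referenced again.

#include <array>
#include <cstddef>
#include <unordered_map>

// Hypothetical stream records, for illustration only (this is not the
// project's actual file format). A vertex record introduces coordinates; a
// triangle record references three previously introduced vertices and flags
// each one that it references for the last time, so it can be evicted.
struct Vertex   { float x, y, z; };
struct Triangle {
  std::array<int, 3>  v;         // global vertex indices
  std::array<bool, 3> last_ref;  // true if this is the vertex's final reference
};

class StreamModule {
public:
  // A vertex enters the sliding window when the stream introduces it.
  void add_vertex(int index, const Vertex& vtx) { window_[index] = vtx; }

  // A triangle is processed against buffered vertices only; vertices it
  // references for the last time are then evicted from the window.
  void add_triangle(const Triangle& t) {
    for (int i = 0; i < 3; i++) {
      const Vertex& vtx = window_.at(t.v[i]);
      (void)vtx;  // ... application-specific per-triangle processing ...
    }
    for (int i = 0; i < 3; i++)
      if (t.last_ref[i]) window_.erase(t.v[i]);
  }

  std::size_t footprint() const { return window_.size(); }  // current window size

private:
  std::unordered_map<int, Vertex> window_;  // small in-core buffer; the
                                            // on-disk mesh is never resident
};

In a pipeline, the triangles a module writes along its output boundary are consumed as the input boundary of the next module, so several such passes can run concurrently without intermediate disk storage.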

Figure 2: Snapshots of triangle sequences and diagrams for two different layouts of the same unstructured NIF triangle mesh. The diagrams show references between triangles and vertices for the order in which they appear in the mesh file, with colored bars connecting the first and last vertex referenced by a triangle, and vice versa. Optimizing the sequential layout of vertices and triangles reduces the footprint (minimal memory needed) and latency (maximum time a vertex is buffered) for streaming the mesh through memory (percentages are of total mesh size).

(a) original layout: footprint = 9.3%, latency = 69% (b) streamable layout: footprint = 0.78%, latency = 0.90%
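
The footprint and latency statistics above can be computed directly from the file order of the triangles. Below is a minimal sketch under our reading of the Figure 2 definitions (a vertex is buffered from the first to the last triangle that references it; the function and variable names are ours). Normalizing the results by the total mesh size yields percentages like those quoted above.

#include <algorithm>
#include <array>
#include <vector>

struct StreamStats { int footprint; int latency; };

// Compute the stream footprint (peak number of simultaneously buffered
// vertices) and latency (longest interval, measured in triangles, that any
// single vertex stays buffered) for triangles given in file order.
StreamStats measure_layout(const std::vector<std::array<int, 3>>& tris, int nverts) {
  std::vector<int> first(nverts, -1), last(nverts, -1);
  for (int t = 0; t < (int)tris.size(); t++)
    for (int v : tris[t]) {
      if (first[v] < 0) first[v] = t;  // first reference: vertex enters buffer
      last[v] = t;                     // last reference: vertex may be evicted
    }
  std::vector<int> delta(tris.size() + 1, 0);
  int latency = 0;
  for (int v = 0; v < nverts; v++) {
    if (first[v] < 0) continue;  // vertex never referenced
    delta[first[v]]++;           // enters the window at its first reference
    delta[last[v] + 1]--;        // leaves just after its last reference
    latency = std::max(latency, last[v] - first[v] + 1);
  }
  int footprint = 0, live = 0;
  for (int t = 0; t < (int)tris.size(); t++) {
    live += delta[t];
    footprint = std::max(footprint, live);
  }
  return {footprint, latency};
}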

Figure 3: Comparison of various layout strategies for an unstructured mesh. The stream footprint, measured in number of active vertices, is shown in green in the diagrams. (d) By adapting the layout to the irregular connectivity of the mesh, maximally connected and well-shaped pieces of the mesh are obtained, which facilitates stream processing.

(a) depth-first sort (b) breadth-first sort (c) spatial sort (d) spectral sequencing
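
Of these strategies, the breadth-first sort (b) is the simplest to state. One plausible realization, sketched below, orders vertices by a breadth-first traversal of the mesh's vertex adjacency graph; triangles would then be ordered by the ranks of the vertices they reference. The graph representation and seeding policy are illustrative assumptions, not necessarily the variant used in Figure 3.

#include <queue>
#include <vector>

// Breadth-first vertex ordering of a mesh given as a vertex adjacency list.
// Returns order[i] = rank of vertex i in the new layout.
std::vector<int> bfs_order(const std::vector<std::vector<int>>& adj) {
  const int n = (int)adj.size();
  std::vector<int> order(n, -1);
  int rank = 0;
  for (int seed = 0; seed < n; seed++) {  // handle disconnected pieces
    if (order[seed] >= 0) continue;
    std::queue<int> q;
    q.push(seed);
    order[seed] = rank++;
    while (!q.empty()) {
      int v = q.front(); q.pop();
      for (int w : adj[v])
        if (order[w] < 0) { order[w] = rank++; q.push(w); }
    }
  }
  return order;
}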

The streaming approach works well for applications that can adapt their traversals of the data to the given order in which the data enters memory. Many tasks, however, intrinsically require data-dependent or user-specified traversals that are not determined until run time, and for which visiting only a small subset of the data may be necessary. Examples include slicing, isocontouring, range queries, view-dependent refinement, and topological analysis. To accommodate and accelerate such queries, we are designing data layouts that are robust to a variety of access patterns and that have good average-case and worst-case performance. We are, for example, investigating extensions of structured data layouts based on space-filling curves to unstructured meshes (Figure 4). We are particularly focusing on cache-oblivious data layouts that perform well across all levels of cache. Finally, we are exploiting compression, cache-oblivious layouts, and streaming for interactive visualization, e.g. for reducing bandwidth requirements between the CPU and GPU and maximizing reuse in on-chip vertex caches.
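
As one concrete instance of a space-filling-curve layout, vertices can be ordered along the Lebesgue (Z-order) curve by quantizing their positions and interleaving the coordinate bits; the Hilbert and βΩ curves compared in Figure 4 refine this idea with better locality. Below is a minimal 2D sketch; the 16-bit quantization resolution and the bounding-box normalization (assumed non-degenerate) are arbitrary illustrative choices.

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Interleave the low 16 bits of x and y into a 32-bit Morton key, i.e., a
// position along the Lebesgue (Z-order) curve.
uint32_t morton2(uint16_t x, uint16_t y) {
  auto spread = [](uint32_t v) {  // insert a 0 bit between consecutive bits
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
  };
  return spread(x) | (spread(y) << 1);
}

// Order vertices along the curve: quantize each position to a 16-bit grid
// over the bounding box, then sort by Morton key. Returns order[i] = rank
// of vertex i in the new layout.
std::vector<int> zorder_layout(const std::vector<std::pair<float, float>>& pts,
                               float xmin, float ymin, float xmax, float ymax) {
  std::vector<std::pair<uint32_t, int>> keyed(pts.size());
  for (int i = 0; i < (int)pts.size(); i++) {
    uint16_t qx = (uint16_t)(65535.0f * (pts[i].first  - xmin) / (xmax - xmin));
    uint16_t qy = (uint16_t)(65535.0f * (pts[i].second - ymin) / (ymax - ymin));
    keyed[i] = { morton2(qx, qy), i };
  }
  std::sort(keyed.begin(), keyed.end());
  std::vector<int> order(pts.size());
  for (int r = 0; r < (int)keyed.size(); r++) order[keyed[r].second] = r;
  return order;
}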

Figure 4: (a-f) Our cache-oblivious locality measure, g, of a layout correlates well with the edge cut, c, over a wide range of cache block sizes (a block size of 27 vertices is illustrated here). The edge cut for a given block size or domain decomposition is an important performance measure. It captures, for example, the amount of communication needed in parallel applications, as well as the likelihood of a cache miss when traversing an edge in the grid. (g) Our measure is defined for any undirected graph, and thus applies equally to unstructured surface and volume meshes.

(a) row major: g = 4.00, c = 150 (b) MinLA: g = 3.45, c = 114 (c) Lebesgue: g = 3.35, c = 105 (d) H index: g = 2.87, c = 102 (e) Hilbert: g = 2.73, c = 100 (f) βΩ: g = 2.70, c = 95 (g) unstructured mesh layout optimized for our measure
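
The edge cut c for a given block size can be computed directly from a layout: group the ordered vertices into consecutive blocks and count the edges whose endpoints land in different blocks. The sketch below does exactly this; the locality measure g itself is defined in the project's publications and is not reproduced here.

#include <utility>
#include <vector>

// Edge cut of a layout for a given cache block size: vertices are stored in
// layout order and grouped into consecutive blocks of block_size vertices;
// an edge is "cut" if its endpoints fall in different blocks. order[v] is
// the rank of vertex v in the layout, and edges lists the graph's edges.
int edge_cut(const std::vector<std::pair<int, int>>& edges,
             const std::vector<int>& order, int block_size) {
  int cut = 0;
  for (auto [u, v] : edges)
    if (order[u] / block_size != order[v] / block_size)
      cut++;  // traversing this edge crosses a block boundary
  return cut;
}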

Contact

For more information about this project, please contact Peter Lindstrom.

UCRL-WEB-210653. Last Updated: January 7, 2010.