Raster Parallelization with OpenMP
Raster Modules with OpenMP support
The following modules have received OpenMP support for parallel processing since the GSoC 2021 project. For modules that received OpenMP support prior to this, see OpenMP support in GRASS 7.
- r.univar
- r.neighbors
- r.mfilter
- r.resamp.filter
- r.resamp.interp
- r.slope.aspect
- r.series
- r.patch
Implementation
Raster modules can be classified into two main categories: those that compute and output statistics about a raster map, and those that output a new raster map computed from a given input map.
For the first type of module, which does not output any raster map, a larger proportion of the work can be parallelized because no sequential disk writes of compressed raster maps are involved. The implementation is therefore straightforward: each thread reads the input raster maps concurrently, maintains its own private set of statistics, and the per-thread results are finally reduced to produce the output statistics. The OpenMP loop scheduling is currently set to static, so the map is split into equal-sized chunks, one per thread. This keeps the behavior deterministic: the result of each run stays the same for the same input parameters and the same number of threads.
However, because the order of computation differs between single-threaded and multi-threaded runs, there may be floating-point discrepancies when different numbers of threads are used. On larger raster maps, for example, the low-order bits are often lost in finite-precision floating-point arithmetic, so the final statistics depend on the order in which values are accumulated. This can potentially be mitigated by keeping a running compensation that adjusts for the lost low-order bits (see Kahan Summation Algorithm). While that reduces the error, it still cannot guarantee identical results across runs with different numbers of threads. Beyond this, no major computational overhead is introduced to support parallelization, and memory usage simply scales with the number of threads used for computation.
Of the modules listed above, r.univar falls into this category.
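Below is a minimal, self-contained sketch of this pattern, not the actual r.univar source: a static OpenMP schedule over rows, per-thread private statistics, a Kahan-style running compensation in each per-thread sum, and a sequential reduction at the end. The synthetic in-memory raster stands in for the GRASS raster I/O.

/* Illustrative sketch only: per-thread statistics with a static schedule
 * and a final sequential reduction. Compile with: gcc -fopenmp stats.c */
#include <float.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

struct stats {
    double sum, comp;   /* running sum and Kahan compensation */
    double min, max;
    long   n;
};

int main(void)
{
    const int rows = 4000, cols = 4000;
    double *map = malloc((size_t)rows * cols * sizeof(double));
    for (long i = 0; i < (long)rows * cols; i++)
        map[i] = (double)(i % 1000) / 7.0;    /* synthetic raster values */

    int nthreads = omp_get_max_threads();
    /* one private accumulator per thread (real code would pad these
     * to avoid false sharing) */
    struct stats *part = calloc(nthreads, sizeof(struct stats));
    for (int t = 0; t < nthreads; t++) {
        part[t].min = DBL_MAX;
        part[t].max = -DBL_MAX;
    }

#pragma omp parallel
    {
        struct stats *s = &part[omp_get_thread_num()];

        /* static schedule: rows are split into equal, deterministic chunks */
#pragma omp for schedule(static)
        for (int row = 0; row < rows; row++) {
            for (int col = 0; col < cols; col++) {
                double v = map[(long)row * cols + col];

                /* Kahan summation: carry the lost low-order bits forward */
                double y = v - s->comp;
                double t2 = s->sum + y;
                s->comp = (t2 - s->sum) - y;
                s->sum = t2;

                if (v < s->min) s->min = v;
                if (v > s->max) s->max = v;
                s->n++;
            }
        }
    }

    /* sequential reduction of the per-thread partial statistics */
    struct stats total = { 0.0, 0.0, DBL_MAX, -DBL_MAX, 0 };
    for (int t = 0; t < nthreads; t++) {
        total.sum += part[t].sum;
        if (part[t].min < total.min) total.min = part[t].min;
        if (part[t].max > total.max) total.max = part[t].max;
        total.n += part[t].n;
    }

    printf("n=%ld mean=%g min=%g max=%g\n",
           total.n, total.sum / total.n, total.min, total.max);
    free(part);
    free(map);
    return 0;
}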
The second type of module is trickier to parallelize, since sequential disk writes are needed to produce the compressed output raster map. There are three ways to tackle this: (1) allocate enough memory for the entire output map for the threads to work on, then write it to the output raster map sequentially; (2) store intermediate results in temporary files on secondary storage (HDD/SSD) before transferring them to the output raster map; or (3) split the processing of the raster map into chunks and keep only the current chunk's results in memory.
The first idea is straightforward, and some modules such as r.sun have been implemented this way. However, it can easily run out of memory on very large raster maps, and since we do not want to change the original behavior of the modules, it is not suitable for modules that otherwise have a very low memory footprint.
The second idea replaces the in-memory buffer with temporary files. This incurs additional disk reads and writes, which can be expensive on slower disk hardware, although the cost is amortized over the disk I/O already required to read the raster map. In a benchmark of an r.neighbors parallelization using this approach, there was no significant extra cost compared to the in-memory approach on a machine with an NVMe SSD.
While the second option is viable, the third approach can provide more predictable performance that is less dependent on disk hardware. An additional user parameter, memory, controls the size of the output buffer and thus the chunk size into which the raster map is split. Each chunk is computed and then written sequentially to the raster file, alternating until the whole input map has been processed. This staggered computation and writing does not affect the performance of single-threaded runs, and the computation of each chunk can easily be parallelized. Assuming a reasonably large chunk size, the threads can split the work efficiently and the overhead is insignificant compared to the total cost. The minimum chunk size is at least k rows, where k is the number of threads specified. The original behavior can be seen as a subset of the new implementation with a chunk size of a single row, so users can always opt for it by setting memory to 0 and nprocs to 1.
The rest of the modules listed above fall into the second category.
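The following is a minimal, self-contained sketch of the chunked (third) approach, not code taken from any of the modules: the output buffer is sized from a memory budget (a stand-in for the memory parameter, using an illustrative formula), each chunk of rows is computed in parallel, and the finished chunk is written out sequentially before the next one is processed. The toy 3x3 mean filter and the plain fwrite() stand in for the real per-module computation and the compressed raster writer.

/* Illustrative sketch only: alternating parallel computation and
 * sequential writes over row chunks. Compile with: gcc -fopenmp chunks.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int rows = 2000, cols = 2000;
    const double memory_mb = 500.0;            /* stand-in for the memory parameter */
    int nthreads = omp_get_max_threads();

    /* chunk size in rows derived from the memory budget,
     * bounded below by the number of threads */
    int chunk_rows = (int)(memory_mb * 1024 * 1024 / (cols * sizeof(double)));
    if (chunk_rows < nthreads)
        chunk_rows = nthreads;
    if (chunk_rows > rows)
        chunk_rows = rows;

    double *in = malloc((size_t)rows * cols * sizeof(double));
    double *out = malloc((size_t)chunk_rows * cols * sizeof(double));
    for (long i = 0; i < (long)rows * cols; i++)
        in[i] = (double)(i % 100);             /* synthetic input raster */

    FILE *fp = fopen("output.raw", "wb");
    if (!fp)
        return 1;

    for (int start = 0; start < rows; start += chunk_rows) {
        int end = start + chunk_rows < rows ? start + chunk_rows : rows;

        /* parallel computation over the rows of this chunk */
#pragma omp parallel for schedule(static)
        for (int row = start; row < end; row++) {
            for (int col = 0; col < cols; col++) {
                /* toy 3x3 mean as a stand-in for the module's real work */
                double sum = 0.0;
                int n = 0;
                for (int dr = -1; dr <= 1; dr++)
                    for (int dc = -1; dc <= 1; dc++) {
                        int r = row + dr, c = col + dc;
                        if (r >= 0 && r < rows && c >= 0 && c < cols) {
                            sum += in[(long)r * cols + c];
                            n++;
                        }
                    }
                out[(long)(row - start) * cols + col] = sum / n;
            }
        }

        /* sequential write of the finished chunk before the next one starts */
        fwrite(out, sizeof(double), (size_t)(end - start) * cols, fp);
    }

    fclose(fp);
    free(in);
    free(out);
    return 0;
}

Shrinking the memory budget until chunk_rows is a single row reproduces the original row-by-row behavior described above; a larger budget simply gives the threads more rows to share per chunk.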
Usage
To use these parallelized modules, GRASS needs to be compiled with OpenMP enabled:
./configure --with-openmp
After compilation and installation, you can run the modules above with the nprocs and memory parameters. For example:
r.univar map=INPUT zones=ZONE nprocs=12
r.neighbors input=INPUT output=OUTPUT size=9 method=average nprocs=12 memory=500
Benchmark scripts are also available for each of the modules listed above, so users can readily benchmark on their own machines. For module-specific behavior, see the Performance section in each module's manual page.