OpenMP

Revision as of 07:08, 7 December 2011

Multithreaded jobs in GRASS

OpenMP is an implementation of multithreading, a method of parallelization whereby the master "thread" (a series of instructions executed consecutively) "forks" a specified number of slave "threads" and a task is divided among them (from Wikipedia). The job is distributed over the available processor cores (2-core, 4-core, ...).

So far, the only parallelized library in GRASS >= 6.3 is the GRASS Partial Differential Equations Library (GPDE). The library design is thread-safe and supports threaded parallelism with OpenMP. The code is not yet widely used in GRASS. See the gpde programmer's manual for details.

How to activate it with GCC >= 4.2 (compiler flag '-fopenmp' as well as library '-lgomp' are needed):

# GPDE with openMP support:

cd lib/gpde/
vim Makefile
# uncomment the EXTRA_CFLAGS row and switch the two existing EXTRA_LIBS rows

Integrated system-wide support for OpenMP via a --with-openmp ./configure flag has been requested in trac #657.

OpenMP support in GRASS 7

In GRASS 7, the gpde and gmath libraries provide functions that are partly parallelized with OpenMP. All BLAS level 2 and 3 functions, as well as many linear equation solvers in the gmath library, are parallelized using OpenMP pragmas. Several numerical modules that use those functions can now benefit from multi-core systems (e.g. r.gwflow, r3.gwflow, r.solute.transport).
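To illustrate what "parallelized using OpenMP pragmas" means for a BLAS level 2 style operation, here is a minimal sketch of a dense matrix-vector product. It is not the gmath implementation: the function name dense_matvec and the row-major storage layout are assumptions made for this example only.

```c
#include <stddef.h>

/* Hypothetical helper, not the gmath code: y = A * x for a dense
 * row-major matrix A with `rows` rows and `cols` columns. */
void dense_matvec(const double *A, const double *x, double *y,
                  int rows, int cols)
{
    int i;

    /* Each output row is independent, so the rows can be divided among
     * the threads. Without -fopenmp the pragma is ignored and the loop
     * simply runs serially. */
#pragma omp parallel for schedule(static)
    for (i = 0; i < rows; i++) {
        double sum = 0.0;   /* declared inside the loop body: private */
        int j;

        for (j = 0; j < cols; j++)
            sum += A[(size_t)i * cols + j] * x[j];
        y[i] = sum;
    }
}
```

Because every thread writes to a different y[i], no locking or reduction is needed here, which is why matrix-vector products are among the easier loops to parallelize.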

OpenMP flags are compiler-dependent, so OpenMP support should be enabled by setting the C compiler and linker flags before calling configure. For example, for GCC >= 4.2:

CFLAGS="-O3 -Wall -Werror-implicit-function-declaration -fno-common -fopenmp"
LDFLAGS="-lgomp"


This should enable OpenMP support in the libraries and in all modules that depend on them.
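One way to check from C code whether a build really had OpenMP enabled is the standard _OPENMP preprocessor macro, which OpenMP compilers define (as a yyyymm date) only when OpenMP support is switched on. The following is a minimal sketch; the helper name openmp_spec_date is hypothetical and not part of GRASS.

```c
/* Hypothetical helper: returns the date (yyyymm) of the OpenMP
 * specification the compiler implements, or 0 if the code was
 * compiled without OpenMP support (e.g. without -fopenmp). */
long openmp_spec_date(void)
{
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0L;
#endif
}
```

A return value of 0 indicates that the OpenMP flags did not reach the compiler for that source file.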

You can test the OpenMP support by compiling the gpde and gmath tests by hand (switch into the test directories in the lib dirs and type make). The test library modules "test.gmath.lib" and "test.gpde.lib" should then be available in the path after starting GRASS.

The gmath lib test module "test.gmath.lib" additionally provides benchmarks for BLAS level 2 and 3 functions and for many solvers:

gmath/test> test.gmath.lib help

Description:
 Performs benchmarks, unit and integration tests for the gmath library

Usage:
 test.gmath.lib [-uia] [unit=string] [integration=string] [rows=value]
   [solverbench=string] [blasbench=string] [--verbose] [--quiet]

Flags:
  -u   Run all unit tests
  -i   Run all integration tests
  -a   Run all unit and integration tests
 --v   Verbose module output
 --q   Quiet module output

Parameters:
         unit   Choose the unit tests to run
                options: blas1,blas2,blas3,solver,ccmath,matconv
  integration   Choose the integration tests to run
                options:
         rows   The size of the matrices and vectors for benchmarking
                default: 1000
  solverbench   Choose solver benchmark
                options: krylov,direct
    blasbench   Choose blas benchmark
                options: blas2,blas3


For example, testing the speedup of the BLAS level 2 and 3 functions of the latest svn trunk of GRASS 7, compiled with the flags mentioned above, on an 8-core Intel Xeon system:

gmath/test> setenv OMP_NUM_THREADS 1
gmath/test> test.gmath.lib blasbench=blas2 rows=5000

++ Running blas level 2 benchmark ++
Computation time G_math_Ax_sparse: 0.244123
Computation time G_math_Ax_sband: 0.280636
Computation time G_math_d_Ax: 0.134494
Computation time G_math_d_Ax_by: 0.18556
Computation time G_math_d_x_dyad: 0.268684

-- gmath lib tests finished successfully --

gmath/test> setenv OMP_NUM_THREADS 4
gmath/test> test.gmath.lib blasbench=blas2 rows=5000 

++ Running blas level 2 benchmark ++
Computation time G_math_Ax_sparse: 0.072549
Computation time G_math_Ax_sband: 0.192712
Computation time G_math_d_Ax: 0.036652
Computation time G_math_d_Ax_by: 0.047904
Computation time G_math_d_x_dyad: 0.080534

-- gmath lib tests finished successfully --

gmath/test> setenv OMP_NUM_THREADS 1
gmath/test> test.gmath.lib blasbench=blas3 rows=1000

++ Running blas level 3 benchmark ++
Computation time G_math_d_aA_B: 0.013263
Computation time G_math_d_AB: 18.729

-- gmath lib tests finished successfully --

gmath/test> setenv OMP_NUM_THREADS 4
gmath/test> test.gmath.lib blasbench=blas3 rows=1000

++ Running blas level 3 benchmark ++
Computation time G_math_d_aA_B: 0.006946
Computation time G_math_d_AB: 4.80446

-- gmath lib tests finished successfully --
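The timings above can be turned into speedup (single-thread time divided by multi-thread time) and parallel efficiency (speedup divided by the number of threads). A minimal sketch, with hypothetical function names:

```c
/* Speedup of a parallel run relative to the single-thread run. */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

/* Parallel efficiency: speedup divided by the number of threads used. */
double efficiency(double t_serial, double t_parallel, int threads)
{
    return speedup(t_serial, t_parallel) / threads;
}
```

With the numbers listed above, speedup(18.729, 4.80446) gives roughly 3.9 for G_math_d_AB on 4 threads (efficiency about 0.97), whereas G_math_Ax_sband only reaches roughly 1.46 (0.280636 / 0.192712), so not every function scales equally well.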

General code structure

Example cited from "openMP tutorial" (see below):

    #include <omp.h>

    int main ()  {
        int var1, var2, var3;

        /* Some serial code */
        /* ... */

        /* Beginning of parallel section. Fork a team of threads. */
        /* Specify variable scoping */

        #pragma omp parallel private(var1, var2) shared(var3)
        {

            /* Parallel section executed by all threads */
            /* ... */

            /* All threads join master thread and disband */
        }  /* end of parallel region */

        /* Resume serial code */
        /* ... */

        return 0;
    }
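As a concrete instance of this fork/join structure, the following sketch sums an array with a work-sharing loop and a reduction clause. The function name array_sum is hypothetical and chosen for this example only.

```c
/* Hypothetical example: sum the elements of an array in parallel. */
double array_sum(const double *values, int n)
{
    double total = 0.0;   /* combined across threads by the reduction */
    int i;

    /* Fork: the iterations are divided among the threads. Each thread
     * accumulates into its own private copy of `total`; the copies are
     * added together when the threads join at the end of the loop. */
#pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; i++)
        total += values[i];

    /* Serial code resumes here */
    return total;
}
```

Without the reduction clause, all threads would update the shared variable at the same time and the result would be unreliable. Note that the order of the floating-point additions may differ between runs, so results can vary in the last digits.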

And in the Makefile, add something like this:

  #openMP support
  EXTRA_CFLAGS=-fopenmp
  EXTRA_LIBS=$(GISLIB) -lgomp $(MATHLIB)
  • Examples:
https://computing.llnl.gov/tutorials/openMP/exercise.html

Run time

The default is to create as many threads as the system has processors. If you don't want that, you can control the number with the OMP_NUM_THREADS environment variable. For example to request 3 threads from a Bourne shell:

OMP_NUM_THREADS=3
export OMP_NUM_THREADS

g.module ...
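From inside C code, the thread count that results from this setting can be queried with the standard OpenMP runtime function omp_get_max_threads(). A minimal sketch follows; the wrapper name max_parallel_threads is hypothetical, and the #ifdef guard is there so that the code still builds when OpenMP is disabled.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: upper bound on the number of threads the next
 * parallel region may use. Returns 1 when compiled without OpenMP. */
int max_parallel_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();   /* reflects OMP_NUM_THREADS, if set */
#else
    return 1;
#endif
}
```

Printing this value at the start of a long-running module is a simple way to confirm that the environment variable was picked up.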

Candidates

It is important to understand which modules are processor-bound and to concentrate on them; do not needlessly complicate the code of modules that are short-running or I/O-bound. Almost none of the GIS libraries are thread-safe. Regardless, they are typically I/O-bound rather than processor-bound, so parallelizing them is not critical. Most of the CPU-bound loops that would benefit from parallelization are expected to be found in the modules.

A good place to start is by running a profiling tool to find the worst offending functions and deal with them first. Blindly parallelizing every loop you can find has the potential to slow things down due to the overheads needed to create and destroy threads.
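One way to limit the thread overhead for small inputs is the OpenMP if() clause, which keeps a loop serial unless a condition holds. This is a sketch only: the function name scale_cells and the threshold value are hypothetical, and a sensible threshold has to be found by profiling.

```c
/* Hypothetical threshold: below this many cells the loop stays serial. */
#define PARALLEL_MIN_CELLS 10000

/* Hypothetical example: multiply every cell value by a constant factor. */
void scale_cells(double *cells, int n, double factor)
{
    int i;

    /* The if() clause avoids starting a team of threads when there is
     * too little work to amortise the start-up cost. */
#pragma omp parallel for if(n >= PARALLEL_MIN_CELLS)
    for (i = 0; i < n; i++)
        cells[i] *= factor;
}
```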

This would speed up the CPU-bound v.surf.bspline and v.lidar.edgedetection considerably.
Please contact and coordinate with Helena Mitasova before starting work on this.
Please contact and coordinate with Markus Metz before starting work on this.
Please contact and coordinate with Laura Toma before starting work on this.
Should fix bug described in trac #390 first and once that is done move module into the main repo.
Please contact and coordinate with Markus Neteler / Jaro Hofierka before starting work on this.

Complete

  • The GPDE library (lib/gpde/) has OpenMP support (disabled by default)
  • The gmath library (lib/gmath/) has OpenMP support for GRASS BLAS level 1, 2 and 3 algorithms as well as several iterative and direct linear equation solvers (disabled by default)
  • GRASS 7 has a ./configure switch for `--with-pthread`
  • Yann has added OpenMP support to i.atcorr. (not in SVN)

Alternatives

WARNING
Not all GRASS modules and scripts are safe to run while other things are happening in the same mapset. Try at your own risk after performing a suitable safety audit. For example, make sure that g.region is not run elsewhere in the meantime, which would (temporarily) change the region settings.

GNU Parallel

  • GNU Parallel is an advanced version of xargs which makes it easy to write parallel shell scripts.
 ### r.sun mode 1 loop ###
 SUNRISE=7.67
 SUNSET=16.33
 STEP=0.01
 # | wc -l   867
 
 DAY=355

 seq $SUNRISE $STEP $SUNSET | parallel -j+0 r.sun -s elevin=gauss day=$DAY \
       time={} beam_rad=rad1_test.${DAY}_{}_beam --quiet

GNU Parallel can also distribute work to other computers; see this video for how: http://www.youtube.com/watch?v=LlXDtd_pRaY

xargs

  • xargs can be told to limit itself to a certain number of processes at once. The r.sun example is almost exactly the same as with GNU Parallel, except that `-P $CORES -n 1` is used instead of `-j+0`.

For example, convert a large number of Raster3D maps into 2D rasters:

  NUM_CORES=6
  g.mlist rast3d | xargs -P $NUM_CORES -n 1 -I{} \
     r3.to.rast -r in={} out={} --quiet

For another example, here we split apart a PDF and convert each page to a PNG image:

  pdftk pdfmovie.pdf burst
  NUM_CORES=6
  ls -1 pg_*.pdf | xargs -P $NUM_CORES -n 1 -I{} \
     sh -c "pdftoppm {} | pnmcut -width 1280 -height 1024 -left 0 -top 0 | \
               pnmtopng > \`basename {} .pdf\`.png"

DIY scripting

  • Poor-man's multithreading using Bourne shell script & backgrounding. WARNING: not all GRASS modules and scripts are safe to run while other things are happening in the same mapset. Try at your own risk after performing a suitable safety audit. For example, make sure that g.region is not run elsewhere in the meantime, which would change the region settings.

Example:

 ### r.sun mode 1 loop ###
 SUNRISE=7.67
 SUNSET=16.33
 STEP=0.01
 # | wc -l   867
 CORES=4
 
 DAY=355
 for TIME in `seq $SUNRISE $STEP $SUNSET` ; do
    echo "time=$TIME"
    CMD="r.sun -s elevin=gauss day=$DAY time=$TIME \
          beam_rad=rad1_test.${DAY}_${TIME}_beam --quiet"
 
    # poor man's multi-threading for a multi-core CPU
    MODULUS=`echo "$TIME $STEP $CORES" | awk '{print $1 % ($2 * $3)}'`
    if [ "$MODULUS" = "$STEP" ] ; then
       # stall to let the background jobs finish
       $CMD
       sleep 2
       #while [ `pgrep -c r.sun` -ne 0 ] ; do
       #   sleep 5
       #done
    else
      $CMD &
    fi
 done
  • This approach has been used in the r3.in.xyz addon script.
  • Another example using r.sun Mode 2 can be found on the r.sun wiki page.

See also

  • idea: You might be able to run the mpd daemon and then launch jobs via mpirun -np 4 <command> in order to make your quad-core into a mini self-contained Beowulf cluster.