Parallel GRASS jobs

From GRASS-Wiki
Jump to navigation Jump to search

Parallel GRASS jobs

NOTE: GRASS 6 libraries are NOT thread safe (except for GPDE, see below).

GRASS doesn't perform any locking on the files within a GRASS database, so the user may end up with one process reading a file while another process is in the middle of writing it. The most problematic case is the WIND file, which contains the current region, although there are others.

If a user wants to run multiple commands concurrently, steps need to be taken to ensure that this type of conflict doesn't happen. For the current region, the user can use the WIND_OVERRIDE environment variable to specify a named region which should be used instead of the WIND file.

Or the user can use the GRASS_REGION environment variable to specify the region parameters (the syntax is the same as the WIND file, but with newlines replaced with semicolons). With this approach, the region can only be read, not modified.

Problems can also arise if the user reads files from another mapset while another session is modifying those files. The WIND file isn't an issue here, nor are the files containing raster data (which are updated atomically), but the various support files may be.

See below for ways around these limitations.

Background

This you should know about GRASS' behaviour concerning multiple jobs:

  • You can run multiple processes in multiple locations (what's that?). Peaceful coexistence.
  • You can run multiple processes in the same mapset, but only if the region is untouched (but it's really not recommended). Better launch each job in its own mapset within the location.
  • You can run multiple processes in the same location, but in different mapsets. Peaceful coexistence.

Approach

Essentially there are at least two approaches of "poor man" parallelization without modifying GRASS source code:

  • split map into spatial chunks (possibly with overlap to gain smooth results)
  • time series: run each map elaboration on a different node.

See the Parallelizing Scripts wiki page

Working with tiles

Huge map reprojection example:

Q: I'd like to try splitting a large raster into small chunks and then projecting each one separately, sending the project command to the background. The problem is that, if the GRASS command changes the region settings, things might not work.

A: r.proj doesn't change the region.

Processing the map in chunks requires setting a different region for each command. That can be done by creating named regions and using the WIND_OVERRIDE environment variable, e.g.:

       g.region ... save=region1
       g.region ... save=region2
       ...
       WIND_OVERRIDE=region1 r.proj ... &
       WIND_OVERRIDE=region2 r.proj ... &
       ...

(for python see the grass.use_temp_region() function)

The main factor which is likely to affect parallelism is the fact that the processes won't share their caches, so there'll be some degree of inefficiency if there's substantial overlap between the source areas for the processes.

If you have more than one such map to project, processing entire maps in parallel might be a better choice (so that you get N maps projected in 10 hours rather than 1 map in 10/N hours).

Parallelized code

OpenMP

GPDE using OpenMP

The only parallelized library in GRASS >=6.3 is GRASS Partial Differential Equations Library (GPDE) and the gmath library in GRASS 7. Read more in OpenMP.

pthreads

The r.mapcalc in GRASS 7 has been parallelized using GNU pthreads.

Bourne and Python Scripts

Often very easy & can be done without modification to the main source code.

OpenMPI

The GIPE i.vi.mpi addon module has been created as a MPI (Message_Passing_Interface) implementation of the GIPE i.vi addon module.

GPU Programming

There is a version of the r.sun module which has been modified to use OpenCL. (works; still experimental)

MPI Programming

There is a sample implementation at module level in GIPE: i.vi.mpi

Cluster and Grid computing

Grid Engine

General steps (for multiple serial jobs on many CPUs):

  • Job definition
    • Grid Engine setup (in the header): define calculation time, number of nodes, number of processors, amount of RAM for individual job;
    • data are stored in centralized directory which is seen by all nodes, often mounted via NFS;
  • Job execution (launch of jobs)
    • user launches all jobs ("qsub"), they are submitted to the queue.
    • the scheduler optimizes among all user the execution of the jobs according to available resources and requested resources;
    • for the user this means that 0..max jobs are executed in parallel (unless the administrators didn't define either priority or limits). The user can then observe the job queue ("qstat") to see other jobs ahead and scheduling of own jobs. Once a job is running, the cluster possibly sends a notification email to the user, the same again when a job is terminating or is aborted.
    • At the end of the worker script call a second batch job which only contains g.copy to copy the result into a common mapset.
  • Job planning
    • The challenging part for the user is to estimate the request for number of nodes and CPUs per node as well as the amount of needed RAM. Usually tests are needed to see the performance in order to impose the values correctly in the Grid Engine script (see below).

The Grid Engine script

Job launcher: 'launch_grassjob_on_GE.sh' to launch GRASS jobs with Grid Engine:

#!/bin/sh
 
# Serial job, MODIS LST elaboration
# Markus Neteler, 2008, 2011
 
## GE settings
# request Bourne shell as shell for job
#$ -S /bin/sh
# run in current working directory
#$ -cwd
 
# We want Grid Engine to send mail when the job begins (b), aborts (a) and when it ends (e)
#$ -M neteler@somewhere.it
#$ -m bae
 
# uncomment next line for bash debug output
#set -x
 
############
# SUBMIT from home dir to node:
#  cd $HOME
#  qsub -cwd -l mem_free=3000M -v MYMODIS=aqua_lst1km20020706.LST_Night_1km.filt \
#       -v MYTARGET=modis_lst_reconstructed launch_grassjob_on_GE.sh
#
# WATCH
#  watch 'qstat | grep "$USER\|job-ID"'
#    Under the state column you can see the status of your job. Some of the codes are
#    * r: the job is running
#    * t: the job is being transferred to a cluster node
#    * qw: the job is queued (and not running yet)
#    * Eqw: an error occurred with the job
#
###########

# better say where to find libs and bins:
export PATH=$PATH:$HOME/binaries/bin:/usr/local/bin:$HOME/sge_batch_jobs
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/:/usr/local/lib64/:$HOME/binaries/lib/

GRASSDBASE=/grassdata

# generate machine (blade) unique TMP string
UNIQUE=`mktemp`
MYTMP=`basename $UNIQUE`
rm -f $UNIQUE

# path to GRASS binaries and libraries:
export GISBASE=/usr/local/grass-6.4.2
export PATH=$PATH:$GISBASE/bin:$GISBASE/scripts
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$GISBASE/lib

# use process ID (PID) as GRASS lock file number:
export GIS_LOCK=$MYTMP
export GRASS_MESSAGE_FORMAT=plain
export TERM=linux

# use Grid Engine jobid + unique string as MAPSET to avoid GRASS lock
MYMAPSET=sge.$JOB_ID.$MYTMP
MYLOC=patUTM32
MYUSER=$MYMAPSET
TARGETMAPSET=$MYTARGET

# Note: remember to to use env vars for variable transfer
GRASS_BATCH_JOB=$HOME/binaries/bin/modis_interpolation_GRASS_RST.sh

# Note: remember to to use env vars for variable transfer (-v VAR=something)
export MODISMAP="$MYMODIS"

# print nice percentages:
export GRASS_MESSAGE_FORMAT=plain

################ nothing to change below ############
#execute_your_command

echo "************ Starting job at `date` on blade `hostname` *************"

# DEBUG turn on bash debugging, i.e. print all lines to 
# standard error before they are executed
# set -x

echo "SGE_ROOT $SGE_ROOT"
echo "The cell in which the job runs $SGE_CELL. Got $NSLOTS slots on $NHOSTS hosts"

# temporarily store *locally* the stuff to avoid NFS overflow/locking problem
## Remove XXX to enable
grep XXX/storage/local /etc/mtab >/dev/null && (
	if test "$MYMAPSET" = ""; then
	  echo "You are crazy to not define \$MYMAPSET. Emergency exit."
	  exit 1
	else
	  rm -rf /storage/local/$MYMAPSET
	fi
        # (create new mapset on the fly)
	mkdir /storage/local/$MYMAPSET
	ln -s /storage/local/$MYMAPSET $GRASSDBASE/$MYLOC/$MYMAPSET
	if [ $? -ne 0 ] ; then
	  echo "ERROR: Something went wrong creating the local storage link..."
	  exit 1
	fi
) || ( # in case that /storage/local is unmounted:
        # (create new mapset on the fly)
        mkdir $GRASSDBASE/$MYLOC/$MYMAPSET
)

# Set the global grassrc file to individual file name
MYGISRC="$HOME/.grassrc6.$MYUSER.`uname -n`.$MYTMP"

#generate GISRCRC
echo "GISDBASE: /grassdata" > "$MYGISRC"
echo "LOCATION_NAME: $MYLOC" >> "$MYGISRC"
echo "MAPSET: $MYMAPSET" >> "$MYGISRC"
echo "GRASS_GUI: text" >> "$MYGISRC"

# path to GRASS settings file
export GISRC=$MYGISRC

# fix WIND in the newly created mapset
cp "/grassdata/$MYLOC/PERMANENT/DEFAULT_WIND" "/grassdata/$MYLOC/$MYMAPSET/WIND"
db.connect -c --quiet

# run the GRASS job:
echo "Processing $MYMAP (target mapset: $TARGETMAPSET) ..."
. $GRASS_BATCH_JOB

# cleaning up temporary files
$GISBASE/etc/clean_temp > /dev/null

MYRASTERRESULTS=`g.mlist rast mapset=. sep=" "`

#regenerate GISRCRC for TARGETMAPSET and load it
echo "GISDBASE: $GRASSDBASE" > "$MYGISRC"
echo "LOCATION_NAME: $MYLOC" >> "$MYGISRC"
echo "MAPSET: $TARGETMAPSET" >> "$MYGISRC"
echo "GRASS_GUI: text" >> "$MYGISRC"
g.gisenv set=MAPSET=$TARGETMAPSET

# copy results to target mapset
for result in $MYRASTERRESULTS ; do
     g.copy rast=${result}@${MYMAPSET},${result}
done

# cleanup
rm -rf /tmp/grass6-$USER-$GIS_LOCK
rm -rf /storage/local/$MYMAPSET
rm -f ${MYGISRC}
if test -f $GRASSDBASE/$MYLOC/$MYMAPSET/WIND ; then
   rm -rf $GRASSDBASE/$MYLOC/$MYMAPSET
fi

echo "Hopefully successfully finished at `date` *************"
exit 0

The GRASS worker script

The real GRASS job is just the bare script (no #!/bin/sh nor exit 0 must be included) containing the commands to be executed. As defined in 'launch_grassjob_on_GE.sh', it must be stored as '$HOME/binaries/bin/modis_interpolation_GRASS_RST.sh':

#### To be submitted as Grid Engine job
# uncomment next line for debugging:
# set -x
# GRASS job for usage in Grid Engine
###################################
# MN 2008, 2011

# $MYMODIS is passed on via qsub job submission (-v VAR=something)

# add path to source data mapset if needed
g.mapsets add=modis_lst_reconstructed
g.region region=lst_europe_250m -p

export GRASS_OVERWRITE=1

r.info $MYMODIS
# do something more

# don't leave here with the 'exit' function!

Launching a single job

To submit a single job from home directory to cluster node, enter:

cd $HOME
qsub -cwd -l mem_free=3000M -v MYMODIS=aqua_lst1km20020706.LST_Night_1km.filt \
     -v MYTARGET=modis_lst_reconstructed  launch_grassjob_on_GE.sh

The result will be stored in the target location 'modis_lst_reconstructed'.

Launching many jobs

To submit a series of jobs, use a "for" loop as described in the PBS section above. In the HOME directory stdout and stderr logs will be stored for each job.

We do this by simply looping over all map names to elaborate:

      cd $HOME
      # loop and launch (we just pick the names from the GRASS DB itself; here: do all maps)
      # instead of launching immediately, we create a launch script:
      for i in `find /grassdata/myloc/modis_originals/cell/ -name '*'` ; do 
          NAME=`basename $i`
          echo qsub -v MYMODIS=$NAME -v MYTARGET=modis_lst_reconstructed  launch_grassjob_on_GE.sh
      done | sort > launch1.sh 

      # now really launch the jobs:
      sh launch1.sh

That's it! Emails will arrive to notify upon begin, abort (hopefully not!) and end of job execution.

After all jobs are completed, find the results in the target location 'modis_lst_reconstructed'.

Torque (PBS) Resource Manager

It works similar to Grid Engine, see above.

  • Often used with the Maui Cluster Scheduler

OpenMosix

NOTE: The openMosix Project has officially closed as of March 1, 2008.

If you want to launch several GRASS jobs in parallel, you have to launch each job in its own mapset. Be sure to indicate the mapset correctly in the GISRC file (see above). You can use the process ID (PID, get with $$ or use PBS jobname) to generate a almost unique number which you can add to the mapset name.


Now you could launch the jobs on an openMosix cluster (just install openMosix on your colleague's computers...).

Cloud computing

GRASS GIS 7 is running in the cloud as web processing service backend. Have a look at:


Provided ID could not be validated.

This Open Cloud GIS has been set up in a private Amazon compatible cloud environment using:

  • Ubuntu 10.04 LTS and 10.10 cloud server edition
  • Eucalyptus Cloud
  • GRASS GIS 7 latest svn
  • PyWPS latest svn
  • wps-grass-bridge latest svn
  • QGIS 1.7 with a modified QWPS plugin

Hints for NFS users

  • AVOID script forking on the cluster (but inclusion via ". script.sh" works ok). This means that the GRASS_BATCH_JOB approach is prone to fail. It is highly recommended to simply set a series of environmental variables to define the GRASS session, see here how to do that.
  • be careful with concurrent file writing (use "lockfile" locking, the lockfile command is provided by procmail);
  • store as much temporary data as possible (even the final maps) on the local blade disks if you have.
  • finally collect all results from local blade disks *after* the parallel job execution in a sequential collector job (I am writing that now) to not kill NFS. For example, Grid Engine offers a "hold" function to only execute the collector job after having done the rest.
  • If all else fails, and the I/O load is not too great, consider using sshfs with ssh passkeys instead of NFS.
  • In some situations it is necessary to preserve the same directory structure on all nodes, and symlinks are a nice way to do that, but some (closed source 3rd party which will remain nameless) software insists on expanding symlinks. In this situation the bindfs FUSE extension can help. It is safer to use than "mount" binds, and you don't have to be root to set them up. As with sshfs there is a performance penalty so it may not be appropriate in high I/O situations.

Misc Tips & Tricks

  • When working with long time series and r.series starts to complain that files are missing/not readable, check if you opened more than 1024 files in this process. this is the typical limit (check with ulimit -n). To overcome this problem, the admin has to add in /etc/security/limits.conf something like this:
 # Limit user nofile - max number of open files
 * soft  nofile 1500
 * hard  nofile 1800

For less invasive solution which still requires root access, see here.

See also