Parallel GRASS jobs: Difference between revisions

From GRASS-Wiki
Jump to navigation Jump to search
(complete overhaul with NFS lessions learned)
(cleanup fixed)
Line 178: Line 178:
=== Grid Engine ===
=== Grid Engine ===


* URL: ~[http://waybackmachine.org/*/gridengine.sunsource.net http://gridengine.sunsource.net/]~ (new site at http://sourceforge.net/projects/gridscheduler/)
* URL: <strike>[http://waybackmachine.org/*/gridengine.sunsource.net http://gridengine.sunsource.net/]</strike> (new site at http://sourceforge.net/projects/gridscheduler/)
* Lauching jobs: qsub
* Lauching jobs: qsub
* Navigating the Grid Engine System with GUI: qmon
* Navigating the Grid Engine System with GUI: qmon
Line 184: Line 184:
* User statstics: qacct -o
* User statstics: qacct -o


Example script to lauch Grid Engine jobs:
Example script to lauch GRASS jobs with Grid Engine:
<source lang="bash">
<source lang="bash">
       #!/bin/sh
       #!/bin/sh
Line 304: Line 304:
       . $GRASS_BATCH_JOB
       . $GRASS_BATCH_JOB
        
        
       # cleaning up temporary files ...
       # cleaning up temporary files
       $GISBASE/etc/clean_temp > /dev/null
       $GISBASE/etc/clean_temp > /dev/null
        
       rm -f ${MYGISRC}
       # move back from local disk to NFS
      rm -rf /tmp/grass6-$USER-$GIS_LOCK
       
       # move data back from local disk to NFS
       ## Remove XXX to enable
       ## Remove XXX to enable
       grep XXX/storage/local /etc/mtab >/dev/null && (
       grep XXX/storage/local /etc/mtab >/dev/null && (

Revision as of 18:21, 26 March 2011

Parallel GRASS jobs

NOTE: GRASS 6 libraries are NOT thread safe (except for GPDE, see below).

Background

This you should know about GRASS' behaviour concerning multiple jobs:

  • You can run multiple processes in multiple locations (what's that?). Peaceful coexistence.
  • You can run multiple processes in the same mapset, but only if the region is untouched (but it's really not recommended). Better launch each job in its own mapset within the location.
  • You can run multiple processes in the same location, but in different mapsets. Peaceful coexistence.

Approach

Essentially there are at least two approaches of "poor man" parallelization without modifying GRASS source code:

  • split map into spatial chunks (possibly with overlap to gain smooth results)
  • time series: run each map elaboration on a different node.

Parallelized code

GPDE using OpenMP

The only parallelized library in GRASS >=6.3 is GRASS Partial Differential Equations Library (GPDE). Read more in OpenMP.

GPU Programming

Cluster and Grid computing

PBS Scheduler

Note: For PBS details, read on here.

You need essentially two scripts:

  • GRASS job script (which takes the name(s) of map(s) to elaborate from environmental variables
  • script to launch this GRASS-script as job for each map to elaborate

General steps (for multiple serial jobs on many CPUs):

  • Job definition
    • PBS setup (in the header): define calculation time, number of nodes, number of processors, amount of RAM for individual job;
    • data are stored in centralized directory which is seen by all nodes;
  • Job execution (launch of jobs)
    • user launches all jobs ("qsub"), they are submitted to the queue. Use the GRASS_BATCH_JOB variable to define the name of the elaboration script.
    • the scheduler optimizes among all user the execution of the jobs according to available resources and requested resources;
    • for the user this means that 0..max jobs are executed in parallel (unless the administrators didn't define either priority or limits). The user can then observe the job queue ("showq") to see other jobs ahead and scheduling of own jobs. Once a job is running, the cluster possibly sends a notification email to the user, the same again when a job is terminating.
    • At the end of the elaboration call a second batch job which only contains g.copy to copy the result into a common mapset. There is a low risk of race condition here in case that two nodes finish at the same time but this could be even trapped in a loop which checks if the target mapset is locked and, if needed, launches g.copy again 'till success.
  • Job planning
    • The challenging part for the user is to estimate the execution time since PBS kills jobs which exceed the requested time. The same applies to the request for number of nodes and CPUs per node as well as the amount of needed RAM. Usually tests are needed to see the performance in order to impose the values correctly in the PBS script (see below).

How to write the scripts: To avoid race conditions, you can automatically generate multiple mapsets in a given location. When you start GRASS (in your script) with path to grassdata/location/mapset/ and the requested mapset does not yet exist, it will be automatically created. So, as first step in your job script, be sure to run

     g.mapsets add=mapset1_with_data[,mapset2_with_data]

to make the data which you want to elaborate accessible. You would then loop over many map names (e.g. "aqua_lst1km20020706.LST_Night_1km.filt") and launch the script with map name as first parameter:

      ------- snip (you need to update this for PBS stuff and save as 'launch_grassjob_55min.sh' -----------
      #!/bin/sh
      ### Project number, enter if applicable (needed to manage your CPU hours)
      #PBS -A HPC2N-2008-001
      #
      ### Job name - defaults to name of submit script
      #PBS -N modis_interpolation_GRASS_RST.sh
      #
      ### Output files - defaults to jobname.[eo]jobnumber
      #PBS -o modis_rst.$MYMODIS.out
      #PBS -e modis_rst.$MYMODIS.err
      #
      ### Mail on - a=abort, b=beginning, e=end - defaults to a
      #PBS -m abe
      ### Number of nodes - defaults to 1:1
      ### Requesting 1 nodes with 1 processor per node:
      #PBS -l nodes=1:ppn=1
      ### Requesting time - defaults to 30 minutes
      #PBS -l walltime=00:55:00
      ### amount of physical memory (in MB) each processor will use with a line:
      #PBS -l pvmem=3000m
 
      # we'll call this script below in a loop, giving the name of the map to elaborate as parameter
      MYMAPSET=$1
      MYLOC=nc_spm08
      TARGETMAPSET=results
      
      # define as env. variable the batch job which does the real GRASS elaboration (so, which contains the GRASS commands)
      # (job scripts need executable flag be set)
      GRASS_BATCH_JOB=/shareddisk/modis_job.sh
      grass64 -c -text /shareddisk/grassdata/$MYLOC/$MYMAPSET
      
      # since we write results to a temporary mapset, copy over result to target mapset
      # this we launch again as small GRASS job
      export INMAP=${CURRMAP}_rst
      export INMAPSET=$MYMAPSET
      export OUTMAP=$INMAP
      export GRASS_BATCH_JOB=/shareddisk/gcopyjob.sh
      grass64 -c -text  /shareddisk/grassdata/$MYLOC/$TARGETMAPSET
      if [ $? -ne 0 ] ; then
         # race condition with other copy job...
         echo "ERROR while running $GRASS_BATCH_JOB, retrying..."
         sleep 2
         $HOME/binaries/bin/grass64 -text grassdata/$MYLOC/$TARGETMAPSET
         if [ $? -ne 0 ] ; then
            echo "ERROR while running $GRASS_BATCH_JOB, retrying..."
            sleep 7
            $HOME/binaries/bin/grass64 -text grassdata/$MYLOC/$TARGETMAPSET
            if [ $? -ne 0 ] ; then
               echo "FINAL ERROR while running $GRASS_BATCH_JOB, giving up!"
               exit 1
            else
               echo "Done $GRASS_BATCH_JOB of <$INMAP>"
            fi
         else
            echo "Done $GRASS_BATCH_JOB of <$INMAP>"
         fi
      else
         echo "Done $GRASS_BATCH_JOB of <$INMAP>"
      fi


      exit 0
      ------- snap ----------

You see, that GRASS is run twice. Note that you need GRASS Version >=6.3 to make use of GRASS_BATCH_JOB (if variable is present, GRASS automatically executes that job instead of launching the normal interactive user interface).

The script 'gcopyjob.sh' simply contains:

      ------- snip -----------
      #!/bin/sh
      
      # copy files from one mapset to another avoiding race conditions on target mapset
      LIST=`g.mlist type=rast mapset=$INMAPSET`
      
      for map in $LIST ; do
        g.copy rast=$map@$INMAPSET,$map --o
        if [ $? -ne 0 ] ; then
                # maybe race condition with other copy job...
                g.message -e 'ERROR while <g.copy ...>, retrying...'
                sleep 2
                g.copy rast=$map@$INMAPSET,$map --o
                if [ $? -ne 0 ] ; then
                        g.message -e 'ERROR while <g.copy ...>, retrying...'
                        sleep 3
                        g.copy rast=$map@$INMAPSET,$map --o
                        if [ $? -ne 0 ] ; then
                                g.message -e 'FINAL ERROR while <g.copy ...>, giving up!'
                                exit 1
                        else
                                echo "Done g.copy of <$map>"
                        fi
                else
                        echo "Done g.copy of <$map>"
                fi
        else
                echo "Done g.copy of <$map>"
        fi
      done
      
      ------- snap ----------

Launching many jobs:

We do this by simply looping over all map names to elaborate:

      cd /shareddisk/
      # loop and launch (we just pick the names from the GRASS DB itself; here: do all maps)
      # instead of launching immediately, we create a launch script:
      for i in `find grassdata/myloc/modis_originals/cell/ -name '*'` ; do 
          NAME=`basename $i`
          echo qsub -v MYMODIS=$NAME ./launch_grassjob_55min.sh
      done | sort > launch1.sh 

      # now really launch the jobs:
      sh launch1.sh

That's it! Emails will arrive to notify upon begin, abort (hopefully not!) and end of job execution.

Grid Engine

Example script to lauch GRASS jobs with Grid Engine:

      #!/bin/sh
 
      # Serial job, MODIS LST elaboration
      # Markus Neteler, 2008, 2011
 
      ## GE settings
      # request Bourne shell as shell for job
      #$ -S /bin/sh
      # run in current working directory
      #$ -cwd
 
      # We want Grid Engine to send mail when the job begins and when it ends.
      #$ -M neteler@somewhere.it
      #$ -m abe
 
      # uncomment next line for bash debug output
      #set -x
 
      ############
      # SUBMIT from home dir to node:
      #  cd $HOME
      #  qsub -cwd -l mem_free=3000M -v MYMODIS=aqua_lst1km20020706.LST_Night_1km.filt launch_grassjob_on_GE.sh
      #
      # WATCH
      #  watch 'qstat | grep "$USER\|job-ID"'
      #    Under the state column you can see the status of your job. Some of the codes are
      #    * r: the job is running
      #    * t: the job is being transferred to a cluster node
      #    * qw: the job is queued (and not running yet)
      #    * Eqw: an error occurred with the job
      #
      ###########
      
      # better say where to find libs and bins:
      export PATH=$PATH:$HOME/binaries/bin:/usr/local/bin:$HOME/sge_batch_jobs
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/binaries/lib/
      
      # generate machine (blade) unique TMP string
      UNIQUE=`mktemp`
      MYTMP=`basename $UNIQUE`
      
      # path to GRASS binaries and libraries:
      export GISBASE=/usr/local/grass-6.4.1svn
      export PATH=$PATH:$GISBASE/bin:$GISBASE/scripts
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$GISBASE/lib

      # use process ID (PID) as GRASS lock file number:
      export GIS_LOCK=$$
      export GRASS_MESSAGE_FORMAT=plain
      
      # use Grid Engine jobid + unique string as MAPSET to avoid GRASS lock
      MYMAPSET=sge.$JOB_ID.$MYTMP
      MYLOC=patUTM32
      MYUSER=$MYMAPSET
      TARGETMAPSET=modisLSTinterpolation
      
      # Note: remember to to use env vars for variable transfer
      GRASS_BATCH_JOB=$HOME/binaries/bin/modis_interpolation_GRASS_RST.sh
      
      # Note: remember to to use env vars for variable transfer
      export MODISMAP="$MYMODIS"
      
      # print nice percentages:
      export GRASS_MESSAGE_FORMAT=plain

      ################ nothing to change below ############
      #execute_your_command
      
      echo "************ Starting job at `date` on blade `hostname` *************"
      
      # DEBUG turn on bash debugging, i.e. print all lines to 
      # standard error before they are executed
      # set -x
      
      echo "SGE_ROOT $SGE_ROOT"
      echo "The cell in which the job runs $SGE_CELL. Got $NSLOTS slots on $NHOSTS hosts"
      
      # temporarily store *locally* the stuff to avoid NFS overflow/locking problem
      ## Remove XXX to enable
      grep XXX/storage/local /etc/mtab >/dev/null && (
      	if test "$MYMAPSET" = ""; then
      	  echo "You are crazy to not define \$MYMAPSET. Emergency exit."
      	  exit 1
      	else
      	  rm -rf /storage/local/$MYMAPSET
      	fi
              # (create new mapset on the fly)
      	mkdir /storage/local/$MYMAPSET
      	ln -s /storage/local/$MYMAPSET /grassdata/$MYLOC/$MYMAPSET
      	if [ $? -ne 0 ] ; then
      	  echo "ERROR: Something went wrong creating the local storage link..."
      	  exit 1
      	fi
      ) || ( # in case that /storage/local is unmounted:
              # (create new mapset on the fly)
              mkdir /grassdata/$MYLOC/$MYMAPSET
      )
      
      
      # Set the global grassrc file to individual file name
      MYGISRC="$HOME/.grassrc6.$MYUSER.`uname -n`.$MYTMP"
      
      #generate GISRCRC
      echo "GISDBASE: /grassdata" > "$MYGISRC"
      echo "LOCATION_NAME: $MYLOC" >> "$MYGISRC"
      echo "MAPSET: $MYMAPSET" >> "$MYGISRC"
      echo "GRASS_GUI: text" >> "$MYGISRC"
      
      # path to GRASS settings file
      export GISRC=$MYGISRC
      
      # fix WIND in the newly created mapset
      cp "/grassdata/$MYLOC/PERMANENT/DEFAULT_WIND" "/grassdata/$MYLOC/$MYMAPSET/WIND"
      db.connect -c --quiet
      
      # run the GRASS job:
      . $GRASS_BATCH_JOB
      
      # cleaning up temporary files
      $GISBASE/etc/clean_temp > /dev/null
      rm -f ${MYGISRC}
      rm -rf /tmp/grass6-$USER-$GIS_LOCK
        
      # move data back from local disk to NFS
      ## Remove XXX to enable
      grep XXX/storage/local /etc/mtab >/dev/null && (
      	if test "$MYMAPSET" = ""; then
      	  echo "You are crazy to not define \$MYMAPSET. Emergency exit."
      	  exit 1
      	else
      	  # rm temporary link:
      	  rm -fr /grassdata/patUTM32/$MYMAPSET/.tmp
      	  rm -f /grassdata/$MYLOC/$MYMAPSET
      	  # copy stuff
      	  cp -rp /storage/local/$MYMAPSET /grassdata/$MYLOC/$MYMAPSET
      	  if [ $? -ne 0 ] ; then
      	     echo "ERROR: Something went wrong copying back from local storage to NFS directory..."
      	     exit 1
      	  fi
      	  rm -rf /storage/local/$MYMAPSET
      	fi
      )
      
      # save result for collector job
      lockfile -1 -l 30 /grassdata/$MYLOC/$COLLECTORLIST.lock
      echo "$MYMAPSET" >> /grassdata/$MYLOC/$COLLECTORLIST.csv
      rm -f /grassdata/$MYLOC/$COLLECTORLIST.lock
      # Alternative: touch file in mapset and run via "find" and cp/mv later
      ####
      
      rm -f ${MYGISRC}
      echo "************ Finished at `date` *************"
      exit 0

Launch many jobs through "for" loop as described in the PBS section above.

OpenMosix

NOTE: The openMosix Project has officially closed as of March 1, 2008.

If you want to launch several GRASS jobs in parallel, you have to launch each job in its own mapset. Be sure to indicate the mapset correctly in the GISRC file (see above). You can use the process ID (PID, get with $$ or use PBS jobname) to generate a almost unique number which you can add to the mapset name.


Now you could launch the jobs on an openMosix cluster (just install openMosix on your colleague's computers...).

Hints for NFS users

  • AVOID script forking on the cluster (but inclusion via ". script.sh" works ok). This means that the GRASS_BATCH_JOB approach is prone to fail. It is highly recommended to simply set a series of environmental variables to define the GRASS session, see here how to do that.
  • be careful with concurrent file writing (use "lockfile" locking, the lockfile command is provided by procmail);
  • store as much temporary data as possible (even the final maps) on the local blade disks if you have.
  • finally collect all results from local blade disks *after* the parallel job execution in a sequential collector jog (I am writing that now) to not kill NFS. For example, Grid Engine offers a "hold" function to only execute the collector job after having done the rest.

Misc Tips & Tricks

  • When working with long time series and r.series starts to complain that files are missing/not readable, check if you opened more than 1024 files in this process. this is the typical limit (check with ulimit -n). To overcome this problem, the admin has to add in /etc/security/limits.conf something like this:
 # Limit user nofile - max number of open files
 * soft  nofile 1500
 * hard  nofile 1800

For less invasive solution which still requires root access, see here.

  • See the poor-man's multi-processing script on the OpenMP wiki page.

See also