Large raster data processing: Difference between revisions

From GRASS-Wiki
Jump to navigation Jump to search
(+Large File Support (LFS) explanations)
m (minor cleanup)
(16 intermediate revisions by 3 users not shown)
Line 1: Line 1:
__TOC__
__TOC__


== Large File Support (LFS) ==
'''Introduction'''
 
Affects 32bit systems which are normally limited to 2GB (2^31) per file. As workaround, LFS can be enabled in GRASS GIS at compile time.


* GRASS GIS 6: much of the raster library and modules can be enabled for LFS during compilation
While GRASS GIS can handle [[GRASS GIS Performance|pretty large raster maps]], sometimes the user systems are hardware limited.
* GRASS GIS 7: raster and vector libraries are LFS enabled.


See also: [[Large File Support]]
This page collects a number of workarounds for such cases.


== Workaround: Select a smaller area within a very large dataset before importing in GRASS ==
== Workaround: Select a smaller area within a very large dataset before importing in GRASS ==
Line 32: Line 29:


At 100m this corresponds to 27.8 billion cells.
At 100m this corresponds to 27.8 billion cells.
== Raster map precision types ==
* '''CELL DATA TYPE''': a raster map from INTEGER type (4 bytes, whole numbers only)
** in GRASS GIS, CELL is a 32 bit integer with a range from -2,147,483,647 to +2,147,483,647. The value -2,147,483,648 is reserved for NODATA.
* '''FCELL DATA TYPE''': a raster map from FLOAT type (4 bytes, 7-9 digits precision)
* '''DCELL DATA TYPE''': a raster map from DOUBLE type (8 bytes, 15-17 digits precision)
* '''NULL''': represents "no data" in raster maps, to be distinguished from 0 (zero) data value
Aliases:
* '''INTEGER MAP''': see CELL DATA TYPE
* '''FLOAT MAP''': see FCELL DATA TYPE
* '''DOUBLE MAP''': see DCELL DATA TYPE
(reference in the [https://github.com/OSGeo/grass/blob/master/include/gis.h#L588 GRASS GIS source code])
See also [[GRASS raster semantics]]


== Create a mosaic ==
== Create a mosaic ==
Line 38: Line 52:
[http://www.gdal.org/gdal_vrttut.html virtual] mosaic. This will take only a few seconds to run.
[http://www.gdal.org/gdal_vrttut.html virtual] mosaic. This will take only a few seconds to run.


  gdalbuildvrt -input_file_list list.txt mosaic.vrt
  gdalbuildvrt -input_file_list list.csv mosaic.vrt


Also note that gdal can read Tifs within a zip/gz-file as well.  
Also note that gdal can read Tifs within a zip/gz-file as well.  
Line 44: Line 58:
To import the mosaic into GRASS, we first need to create a WGS84 location, then:
To import the mosaic into GRASS, we first need to create a WGS84 location, then:


  r.external -r input=/path/to/mosaic.vrt output=mosaic
  r.external input=/path/to/mosaic.vrt output=mosaic
 
GRASS GIS 7.8+ can internally generate VRT-style mosaics with {{cmd|r.buildvrt}}
 
== Split large maps into tiles ==
 
Sometimes it is possible to work with tiles, i.e. splitting a big map into chunks. These tiles can be easily generated with {{cmd|r.tile}}. It optionally allows for tiling with overlap.
 
== Raster map precision: reducing file size ==
 
Not always double precision is needed for raster map computations. Strategies:
* consider to run with FCELL rather than DCELL precision (see {{cmd|r.mapcalc}}'s float() function).
* you may also multiply with e.g. 100 and then round() to integer (used e.g. for BIOCLIM and WORLDCLIM).
 
Functions in {{cmd|r.mapcalc}} related to precision:
<pre>
double(x)              convert x to double-precision floating point (DCELL)
float(x)                convert x to single-precision floating point (FCELL)
round(x)                round x to nearest integer (INT)
round(x,y)              round x to nearest multiple of y (INT)
round(x,y,z)            round x to nearest y*i+z for some integer i (INT)
</pre>
 
See also [[GRASS raster semantics]]


== Troubleshooting ==
== Troubleshooting ==
Line 50: Line 87:
=== Number of open files limitation ===
=== Number of open files limitation ===


Linux (and other OSes, likewise) has a limit of opening 1024 files in parallel. So that you may experience the following error message:
When working with time series or multispectral data, you may experience the following error message (r.series, r.patch etc.):


  Too many open files
  Too many open files


However, this can be enlarged, see {{cmd|r.series}} manual page --> "Number of raster maps to be processed is given by the limit of the operating system."
or
 
WARNING: Unable to open raster map <somemap@mapset>
ERROR: One or more input raster maps not found
 
Many operating systems have the limitation of opening 1024 files at the same time. The limits are just to prevent a process from (accidentally or deliberately) consuming too much memory. The descriptor table is in kernel memory, and can't be swapped to disk. However, this can be enlarged. Note that, in GRASS GIS 7.0, each raster map requires two open files: one for the ''cell''/''fcell'' file, one for the ''null'' bitmap.


Resource limits are per-process, and inherited. The ''ulimit'' command (which is a shell built-in) changes the limits for the current shell process; the new limit will be inherited by any child processes.
Resource limits are per-process, and inherited. The ''ulimit'' command (which is a shell built-in) changes the limits for the current shell process; the new limit will be inherited by any child processes.


On most Linux systems, resource limits are set on login by the pam_limits module according to the settings contained in /etc/security/limits.conf /etc/security/limits.d/*.conf.
==== Solution for Linux: System-wise change ====


You should be able to edit those files if you have root privilege (also via sudo), but you will need to log in again before any changes take effect.
On most Linux systems, resource limits are set on login by the ''pam_limits'' module according to the settings contained in /etc/security/limits.conf /etc/security/limits.d/*.conf. You should be able to edit those files if you have root privilege (also via sudo), but you will need to log in again as a user before any changes take effect. A reboot shouldn't be necessary since changes to limits.conf should take effect for any subsequent logins. A hard limit can't be increased for any existing non-root processes, or those descended from them.


Increasing a hard limit requires root privilege (or at least the CAP_SYS_RESOURCE capability), so the limits have to be set by the
program which manages the login (login, xdm, sshd, etc) after the user has identified themself (so it knows which limits to set) but before the process changes its ownership from root to the logged-in user.
  sudo su
  vim /etc/security/limits.conf
Add in /etc/security/limits.conf something like this:
  # Limit user nofile - max number of open files
  * soft  nofile 1500
  * hard  nofile 1800
On Ubuntu machines, it may be necessary to edit your /etc/pam.d/common-session file and add the following line:
session required pam_limits.so
Restart the user session (i.e., logout, login).
==== Solution for Windows: System-wise change ====
From the Microsoft documentation it is not very clear how many files can be opened simulaneously (FIX IF POSSIBLE), here some hints:
* https://support.microsoft.com/en-us/kb/87690
* ...
==== Solution for Linux: Session only change ====
To change the limits for an existing session, you may be able to use something like:
To change the limits for an existing session, you may be able to use something like:


Line 69: Line 136:


However, sudo will probably reset certain environment variables, particularly LD_LIBRARY_PATH, so you may need to reset those in the new shell.
However, sudo will probably reset certain environment variables, particularly LD_LIBRARY_PATH, so you may need to reset those in the new shell.
=== ERROR: G_malloc: unable to allocate xxxx bytes of memory ===
Be sure to use GRASS GIS 7 which has way better support for large files.
If you happen to use the "memory" option requesting more than 2GB (e.g. <tt>memory=3000</tt>) check if you are using 64bit executables since 32bit applications can use only up to 2GB memory. In this case it does not matter if the operating system is 64bit or if there is more than 2GB of memory available. You need to use 64bit binaries of GRASS GIS 7.
== Large File Support (LFS) ==
Affects 32bit systems which are normally limited to 2GB (2^31) per file. As workaround, LFS can be enabled in GRASS GIS at compile time.
* GRASS GIS 6: much of the raster library and modules can be enabled for LFS during compilation
* GRASS GIS 7: raster and vector libraries are LFS enabled.
See also: [[Large File Support]]


== See also ==
== See also ==


* [[GRASS GIS Performance]]
* [[GRASS GIS Performance]]
* [[Large vector data processing]]


[[Category:Tutorial]]
[[Category:Tutorial]]
[[Category: massive data analysis]]
[[Category: massive data analysis]]
[[Category: raster]]

Revision as of 17:34, 28 December 2020

Introduction

While GRASS GIS can handle pretty large raster maps, sometimes the user systems are hardware limited.

This page collects a number of workarounds for such cases.

Workaround: Select a smaller area within a very large dataset before importing in GRASS

Suppose that we have all ASTER GDEM world coverage (>22000 files) and we aim to build a mosaic of Europe and import it in GRASS. A nice reference on how to deal with ASTER GDEM is here. The very first step is to select among the >22000 files those covering our area of interest (Europe). To this aim, we can use use gdaltindex to create an index of all the files:

gdaltindex an_index.shp *.tif

This command will create a polygon shapefile with the footprint of each raster as the polygon shape and the name of the image files represented in the attribute table. After that, we just need to do a spatial select on the shapefile and extract the filenames from the attributes from the selected polygons:

ogr2ogr -f CSV list.csv an_index.shp -spat xmin ymin xmax ymax

will produce a CSV file, from which we can easily copy the list of files in a list.txt.

Here, our area of interest (subset) has the following boundaries:

north: N71: ymax = 72.0001389 
south: N34: ymin = 33.9998611
west: W011: xmin = -11.0001389
east: E042: xmax = 43.0001389

At 100m this corresponds to 27.8 billion cells.

Raster map precision types

  • CELL DATA TYPE: a raster map from INTEGER type (4 bytes, whole numbers only)
    • in GRASS GIS, CELL is a 32 bit integer with a range from -2,147,483,647 to +2,147,483,647. The value -2,147,483,648 is reserved for NODATA.
  • FCELL DATA TYPE: a raster map from FLOAT type (4 bytes, 7-9 digits precision)
  • DCELL DATA TYPE: a raster map from DOUBLE type (8 bytes, 15-17 digits precision)
  • NULL: represents "no data" in raster maps, to be distinguished from 0 (zero) data value

Aliases:

  • INTEGER MAP: see CELL DATA TYPE
  • FLOAT MAP: see FCELL DATA TYPE
  • DOUBLE MAP: see DCELL DATA TYPE

(reference in the GRASS GIS source code)

See also GRASS raster semantics

Create a mosaic

Now that we have a list of the files interesting our area, and since our files are all in the same datum (WGS84), we can use gdalbuildvrt to create a virtual mosaic. This will take only a few seconds to run.

gdalbuildvrt -input_file_list list.csv mosaic.vrt

Also note that gdal can read Tifs within a zip/gz-file as well.

To import the mosaic into GRASS, we first need to create a WGS84 location, then:

r.external input=/path/to/mosaic.vrt output=mosaic

GRASS GIS 7.8+ can internally generate VRT-style mosaics with r.buildvrt

Split large maps into tiles

Sometimes it is possible to work with tiles, i.e. splitting a big map into chunks. These tiles can be easily generated with r.tile. It optionally allows for tiling with overlap.

Raster map precision: reducing file size

Not always double precision is needed for raster map computations. Strategies:

  • consider to run with FCELL rather than DCELL precision (see r.mapcalc's float() function).
  • you may also multiply with e.g. 100 and then round() to integer (used e.g. for BIOCLIM and WORLDCLIM).

Functions in r.mapcalc related to precision:

double(x)               convert x to double-precision floating point (DCELL)
float(x)                convert x to single-precision floating point (FCELL)
round(x)                round x to nearest integer (INT)
round(x,y)              round x to nearest multiple of y (INT)
round(x,y,z)            round x to nearest y*i+z for some integer i (INT)

See also GRASS raster semantics

Troubleshooting

Number of open files limitation

When working with time series or multispectral data, you may experience the following error message (r.series, r.patch etc.):

Too many open files

or

WARNING: Unable to open raster map <somemap@mapset>
ERROR: One or more input raster maps not found

Many operating systems have the limitation of opening 1024 files at the same time. The limits are just to prevent a process from (accidentally or deliberately) consuming too much memory. The descriptor table is in kernel memory, and can't be swapped to disk. However, this can be enlarged. Note that, in GRASS GIS 7.0, each raster map requires two open files: one for the cell/fcell file, one for the null bitmap.

Resource limits are per-process, and inherited. The ulimit command (which is a shell built-in) changes the limits for the current shell process; the new limit will be inherited by any child processes.

Solution for Linux: System-wise change

On most Linux systems, resource limits are set on login by the pam_limits module according to the settings contained in /etc/security/limits.conf /etc/security/limits.d/*.conf. You should be able to edit those files if you have root privilege (also via sudo), but you will need to log in again as a user before any changes take effect. A reboot shouldn't be necessary since changes to limits.conf should take effect for any subsequent logins. A hard limit can't be increased for any existing non-root processes, or those descended from them.

Increasing a hard limit requires root privilege (or at least the CAP_SYS_RESOURCE capability), so the limits have to be set by the program which manages the login (login, xdm, sshd, etc) after the user has identified themself (so it knows which limits to set) but before the process changes its ownership from root to the logged-in user.

 sudo su
 vim /etc/security/limits.conf

Add in /etc/security/limits.conf something like this:

 # Limit user nofile - max number of open files
 * soft  nofile 1500
 * hard  nofile 1800

On Ubuntu machines, it may be necessary to edit your /etc/pam.d/common-session file and add the following line:

session required pam_limits.so

Restart the user session (i.e., logout, login).

Solution for Windows: System-wise change

From the Microsoft documentation it is not very clear how many files can be opened simulaneously (FIX IF POSSIBLE), here some hints:

Solution for Linux: Session only change

To change the limits for an existing session, you may be able to use something like:

sudo bash
ulimit ...
su -c bash <username>

However, sudo will probably reset certain environment variables, particularly LD_LIBRARY_PATH, so you may need to reset those in the new shell.

ERROR: G_malloc: unable to allocate xxxx bytes of memory

Be sure to use GRASS GIS 7 which has way better support for large files.

If you happen to use the "memory" option requesting more than 2GB (e.g. memory=3000) check if you are using 64bit executables since 32bit applications can use only up to 2GB memory. In this case it does not matter if the operating system is 64bit or if there is more than 2GB of memory available. You need to use 64bit binaries of GRASS GIS 7.

Large File Support (LFS)

Affects 32bit systems which are normally limited to 2GB (2^31) per file. As workaround, LFS can be enabled in GRASS GIS at compile time.

  • GRASS GIS 6: much of the raster library and modules can be enabled for LFS during compilation
  • GRASS GIS 7: raster and vector libraries are LFS enabled.

See also: Large File Support

See also