GRASS GIS Performance: Difference between revisions
mNo edit summary |
⚠️El selvaje (talk | contribs) m (add Float32 range-bound) |
GRASS GIS is fully 32bit and 64bit compliant. See also the [[Software requirements specification]]. | GRASS GIS is fully 32bit and 64bit compliant. See also the [[Software requirements specification]]. | ||
=== Search strategies used in processing geodata === | |||
GRASS GIS makes heavy use of search trees in order to speed up computation: | |||
* segment lib: btree2 | |||
* 2D splines (RST): quadtree | |||
* 3D splines (RST): octree | |||
* vector lib topology: R*-tree | |||
See the [http://grass.osgeo.org/programming7/ Programmer's manual] for details. | |||
=== Number of opened input files === | === Number of opened input files === | ||
See also | See also | ||
* {{cmd|r.series}} | * {{cmd|r.series}} | ||
* {{cmd|r.series.accumulate}} | |||
=== Memory management === | === Memory management === | ||
Due to the modular architecture of GRASS GIS the overhead is minimal. | Due to the modular architecture of GRASS GIS the overhead of the software itself is minimal. | ||
'''Raster data operations''': where appropriate, modules offer a parameter to optimize caching ("memory" parameter). | |||
# Pixel based operations: they have very low impact on memory usage. | |||
# Moving window based operations: they have medium impact on memory usage. | |||
# Full map operations (watersheds, cost surfaces, etc.): they have high impact on memory usage. | |||
# Statistical operations: while univariate statistics have low impact on memory usage, quartiles and other aggregated statistics have medium impact on memory usage. | |||
'''Vector data operations''': | |||
# Vector point operations: memory consumption depends on the amount of points. LiDAR data processing is commonly demanding. For some operations the creation of topology can be skipped to reduce the memory footprint. | |||
# Vector line operations: they have low impact on memory usage (depends on the amount of data). | |||
# Vector area/faces operations: they have high impact on memory usage. | |||
# Topological versus non-topological operations: a subset of vector modules is able to operate on point vector maps without topology which saves notably RAM usage. | |||
'''Database operations''': | |||
* Most operations are simply SQL transactions with low impact on memory usage. | |||
See also | '''See also''' | ||
* [[Memory issues]] | * Solving [[Memory issues]] when dealing with large amounts of data | ||
=== Vector management === | === Vector management === | ||
A "vector map" is a data layer consisting of a number of sparse features in geographic space. These might be data points (drill sites), lines (roads), polygons (park boundary), volumes (3D CAD structure), or some combination of all these. Typically each feature in the map will be tied to a set of attribute layers stored in a database (road names, site ID, geologic type, etc.). As a general rule these can exist in 2D or 3D space and are independent of the GIS's computation region. | |||
In all GRASS versions, the limit | See also {{cmd|vectorintro}} in the manual. | ||
==== Vector geometry ==== | |||
In all GRASS GIS versions, | |||
* with topology the feature limit is currently 2^31 - 1 (about 2 billion) features per vector map. | |||
* ''TODO: add limit if topology creation is disabled at import for points (e.g., LiDAR points).'' | |||
==== Vector attribute management ==== | |||
Attributes are managed through a SQL interface (see also {{cmd|databaseintro}}) | |||
The default database backend is | |||
* DBF files (tend to be slow) in GRASS GIS 6 ({{cmd|grass-dbf|version=64}}) | |||
* SQLite file (very fast compared to DBF) in GRASS GIS 7 ({{cmd|grass-sqlite}}) | |||
Other SQL backends are offered as well including PostgreSQL, MySQL, etc.: see {{cmd|sql}} support in GRASS GIS. | |||
'''Speed of DBF versus SQLite drivers''': attribute operations which take hours using the DBF backend just take seconds using the SQLite backend. | |||
==== Maximum Number of Attribute Columns ==== | ==== Maximum Number of Attribute Columns ==== | ||
The maximum number of attribute columns of a table connected to a vector map is defined by the capabilities of the the selected database backend (set with {{cmd|db.connect}}). | The maximum number of attribute columns of a table connected to a vector map is defined by the capabilities of the the selected database backend (set with {{cmd|db.connect}}). | ||
'''DBF-Backend''': GRASS 4.x - 6.x use by default the DBF backend. While there is no explicitly stated maximum number of allowed attribute columns, Web sources report a maximum '''between 128 and 1023/24'''. Trials with GRASS 6.4.2 in 2012 result in write failure if > 2000 attribute columns are used. Export to DBF-based ESRI Shapefile provides a warning if more that '''255''' attributes are used: Other software tools may ignore all further attributes, hence a maximum of '''128''' columns may be prudent. | * '''DBF-Backend''': GRASS 4.x - 6.x use by default the DBF backend. While there is no explicitly stated maximum number of allowed attribute columns, Web sources report a maximum '''between 128 and 1023/24'''. Trials with GRASS 6.4.2 in 2012 result in write failure if > 2000 attribute columns are used. Export to DBF-based ESRI Shapefile provides a warning if more that '''255''' attributes are used: Other software tools may ignore all further attributes, hence a maximum of '''128''' columns may be prudent. | ||
* '''SQLite-Backend''': GRASS 7.x uses by default the SQLite backend. The default maximum number of attribute columns is '''2000''' according to the [http://www.sqlite.org/limits.html#max_column specifications]. This number can be increased by compiling SQLite with changed settings. | |||
* '''MySQL-Backend''': The default maximum number of attribute columns is '''4096''' according to the [http://dev.mysql.com/doc/refman/5.1/en/column-count-limit.html specifications]. | |||
* '''PostgreSQL-Backend''': The default maximum number of attribute columns is '''250-1600''' according to the [http://wiki.postgresql.org/wiki/FAQ#What_is_the_maximum_size_for_a_row.2C_a_table.2C_and_a_database.3F specifications] depending on column types. | |||
* '''Oracle-Backend''': The default maximum number of attribute columns is '''1000''' according to the [http://docs.oracle.com/cd/B19306_01/server.102/b14237/limits003.htm specifications]. | |||
==== Maximum file size of the attributes file ==== | |||
* '''DBF-Backend''' (in GRASS 6 the default DB backend): ''to be added'' (2Gb? in case of LFS enabled?) | |||
* '''SQLite-Backend''' (in GRASS 7 the default DB backend): The maximum file size of a SQLite db is '''140 TB''', independent of the architecture, i.e. Large File Support (LFS) is always there. Usually SQLite will hit the maximum file size limit of the underlying filesystem or disk hardware size limit long before it hits its own [http://www.sqlite.org/fileformat2.html internal size limit]. | |||
=== Raster management === | |||
A "raster map" is a data layer consisting of a gridded array of cells. It has a certain number of rows and columns, with a data point (or null value indicator) in each cell. These may exist as a 2D grid or as a 3D cube made up of many smaller cubes, i.e. a stack of 2D grids. | |||
See also {{cmd|rasterintro}} in the manual. | |||
==== Raster map precision types ==== | |||
''' | * '''CELL DATA TYPE''': a raster map from INTEGER type (4 bytes, whole numbers only). | ||
** In GRASS GIS, CELL is a 32 bit integer with a range from -2,147,483,647 to +2,147,483,647. The value -2,147,483,648 is reserved for NODATA. | |||
* '''FCELL DATA TYPE''': a raster map of FLOAT type (4 bytes, 7-9 digits precision). | |||
** In GRASS GIS, FCELL is a 32 bit float (Float32) with a range from -3.4E38 to 3.4E38. However, integers are represented exactly only between -16777216 and 16777216 (±2^24). If your raster values exceed this range, we strongly suggest using DCELL (Float64) instead. | |||
* '''DCELL DATA TYPE''': a raster map of DOUBLE type (8 bytes, 15-17 digits precision). | |||
** In GRASS GIS, DCELL is a 64 bit float (Float64) with a range from -1.79E308 to 1.79E308. | |||
* '''NULL''': represents "no data" in raster maps, to be distinguished from 0 (zero) data value. | |||
''' | Aliases: | ||
* '''INTEGER MAP''': see CELL DATA TYPE | |||
* '''FLOAT MAP''': see FCELL DATA TYPE | |||
* '''DOUBLE MAP''': see DCELL DATA TYPE | |||
(reference in the [https://github.com/OSGeo/grass/blob/master/include/gis.h#L588 GRASS GIS source code]) | |||
See also [[GRASS raster semantics]] | |||
== Large file support == | == Large file support == | ||
GRASS GIS 7 supports the off_t type, hence it can address an enormous amount of raster data. | GRASS GIS 7 supports the off_t type, hence it can address an enormous amount of raster data. | ||
See also | See also: | ||
* [[Large raster data processing]] | * [[Large raster data processing]] | ||
Some '''benchmarks''' | ==== Some '''benchmarks''' ==== | ||
* Import of [http://www.ecad.eu/ ECAD 6.0] Tmean dataset: 22650 layers in single netCDF file: import takes 300 Seconds while reading file via NFS (i.e. 75 maps per second) | * Import of [http://www.ecad.eu/ ECAD 6.0] Tmean dataset: 22650 layers in single netCDF file: import takes 300 Seconds while reading file via NFS (i.e. 75 maps per second) | ||
* Calculation of watersheds, half basins, flow accumulation, drainage directions, and streams with {{cmd|r.watershed}} for an area of 90,000 rows x 100,000 cols (9,000,000,000 cells, metric) successfully done in 77.2 hours (Intel Xeon X5670, 2.93GHz) | |||
* European DEM at 25m ([http://www.eea.europa.eu/data-and-maps/data/eu-dem#tab-original-data eudem_dem_3035_europe.tif], 24.1 GB GeoTIFF, 48 billion cells) processing: | |||
** Import of this GeoTIFF file with r.in.gdal on a blade via NFS: a) '''77h''' without memory option (hence 40MB = GDAL's default cache), b) '''1.5h''' with memory=300 (hence using 300MB GDAL cache), c) '''1.5h''' with memory=2000 (hence using 2GB GDAL cache) | |||
* r.neighbors with [https://trac.osgeo.org/grass/ticket/2676#comment:2 3.694261e+12 pixels] (rows: 440046 cols: 830958 cells) | |||
* Import of Global Forest Loss map with rows=560000 * cols=1440000 = 8.064e+11 pixels (see {{trac|3365}}); map can be easily shown in GRASS GIS monitor | |||
* r.stream.extract: the upper limit on the number of cells it can handle is about 1.15e+18 raster cells (1.15 "[https://en.wikipedia.org/wiki/Unit_prefix exa]"-cells). The number of detected stream segments must not be larger than 2,147,483,647. | |||
* ... add more | |||
=== Large vector data processing === | === Large vector data processing === | ||
GRASS GIS 7 supports the off_t type, hence it can address an enormous amount of vector data. | GRASS GIS 7 supports the off_t type, hence it can address an enormous amount of vector data. | ||
Currently multi-billion vector points have been managed ([http://lists.osgeo.org/pipermail/grass-dev/2011-January/052996.html citation]) without topology (since not needed). In all GRASS versions, the limit with topology is at time 2^31 - 1 (about 2 billion) features per vector map. | Currently multi-billion vector points have been managed ([http://lists.osgeo.org/pipermail/grass-dev/2011-January/052996.html citation]) without topology (since not needed). In all GRASS versions, the limit with topology is at time 2^31 - 1 (about 2 billion) features per vector map. | ||
See also: | |||
* [[Large vector data processing]] | |||
==== Some '''benchmarks''' ==== | |||
* ... | |||
* ... add more | |||
== Parallelization == | == Parallelization == | ||
In GRASS 7, a few modules have been parallelized with | In GRASS 7, a few modules have been experimentally parallelized with OpenMP. However, if data can be processed in chunks, GRASS GIS can be used on clusters. | ||
See also | * [[OpenMP/Benchmarks]] | ||
Parallelized modules: | |||
* v.surf.rst, r.sim.water, r.sun, ... | |||
Note: As of 2020, there are still issues with OpenMP (it may lead to unexpected results or slower performance). | |||
== Benchmarks == | |||
=== r.neighbors === | |||
$ g.region -pa n=228500 s=215000 w=630000 e=645000 res=0.5 | |||
$ r.random.surface output=random | |||
$ time r.neighbors input=random output=avg,min,max method=average,minimum,maximum size=5 | |||
real 6m58.801s | |||
user 6m45.132s | |||
sys 0m6.864s | |||
810,000,000 cells (27,000x30,000), 3 outputs (average, min, max), window size 5, one core, negligible use of RAM. | |||
== User reports == | |||
* [[Case studies]] | |||
== See also == | |||
* [[Parallel GRASS jobs]] | * [[Parallel GRASS jobs]] | ||
* [[Software requirements specification]] | |||
* [[Performance comparison GRASS vs. ArcGIS]] | |||
* [[OpenMP/Benchmarks|Benchmarks of OpenMP parallelized code]] | |||
* [http://wiki.osgeo.org/wiki/GIS_workstation_setup_tips GIS workstation setup tips] (OSGeo Wiki page) | |||
[[Category: FAQ]] | [[Category: FAQ]] | ||
[[Category: Hardware]] | |||
[[Category: massive data analysis]] | [[Category: massive data analysis]] | ||
[[Category: Raster]] | |||
[[Category: Vector]] |
Revision as of 14:00, 27 February 2021
GRASS GIS Performance
GRASS GIS is noted for being ready for massive data analysis. This page contains an as yet incomplete collection of performance indicators.
Architecture
GRASS GIS is fully 32bit and 64bit compliant. See also the Software requirements specification.
Search strategies used in processing geodata
GRASS GIS makes heavy use of search trees in order to speed up computation:
- segment lib: btree2
- 2D splines (RST): quadtree
- 3D splines (RST): octree
- vector lib topology: R*-tree
See the Programmer's manual for details.
Number of opened input files
The number of input files that can be opened simultaneously is limited only by the operating system. Commonly the limit is 1024 files. On operating systems like Linux this limit can be raised with the "ulimit" settings.
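On Unix-like systems the per-process limit can also be inspected from Python's standard `resource` module. The following is a minimal sketch; the helper names and the `reserved` headroom value are illustrative assumptions, not GRASS GIS settings.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def fits_limit(n_inputs, soft_limit, reserved=16):
    """Check whether n_inputs files can be open at once under soft_limit.

    `reserved` is an assumed headroom for descriptors that are already in use
    (stdin, stdout, stderr, output files, ...); adjust it to your setup.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return n_inputs + reserved <= soft_limit


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    print("1000 input maps fit:", fits_limit(1000, soft))
```

The soft limit reported here is the value that `ulimit -n` shows in the shell that started the process.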
See also
- r.series
- r.series.accumulate
Memory management
Due to the modular architecture of GRASS GIS the overhead of the software itself is minimal.
Raster data operations: where appropriate, modules offer a parameter to optimize caching ("memory" parameter).
- Pixel based operations: they have very low impact on memory usage.
- Moving window based operations: they have medium impact on memory usage.
- Full map operations (watersheds, cost surfaces, etc.): they have high impact on memory usage.
- Statistical operations: while univariate statistics have low impact on memory usage, quartiles and other aggregated statistics have medium impact on memory usage.
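The three raster categories above can be compared with some rough arithmetic. This sketch assumes a simplified model in which a pixel-based operation holds one row in memory, a moving window operation holds as many rows as the window is high, and a full map operation holds every row. Real modules keep additional buffers and may cache to disk, so the numbers are only an order-of-magnitude guide.

```python
# Bytes per cell for the three raster precision types.
BYTES_PER_CELL = {"CELL": 4, "FCELL": 4, "DCELL": 8}


def estimate_bytes(kind, rows, cols, cell_type="DCELL", window_size=1):
    """Estimate bytes held in memory under the simplified row model.

    kind: "pixel" (one row), "window" (window_size rows) or "full" (all rows).
    """
    if kind == "pixel":
        n_rows = 1
    elif kind == "window":
        n_rows = window_size
    elif kind == "full":
        n_rows = rows
    else:
        raise ValueError(f"unknown operation kind: {kind}")
    return n_rows * cols * BYTES_PER_CELL[cell_type]


if __name__ == "__main__":
    rows, cols = 27000, 30000
    for kind in ("pixel", "window", "full"):
        size = estimate_bytes(kind, rows, cols, window_size=5)
        print(f"{kind:>6}: {size / 1e6:.1f} MB")
```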
Vector data operations:
- Vector point operations: memory consumption depends on the number of points. LiDAR data processing is commonly demanding. For some operations the creation of topology can be skipped to reduce the memory footprint.
- Vector line operations: they have low impact on memory usage (depends on the amount of data).
- Vector area/faces operations: they have high impact on memory usage.
- Topological versus non-topological operations: a subset of vector modules is able to operate on point vector maps without topology, which notably reduces RAM usage.
Database operations:
- Most operations are simply SQL transactions with low impact on memory usage.
See also
- Solving Memory issues when dealing with large amounts of data
Vector management
A "vector map" is a data layer consisting of a number of sparse features in geographic space. These might be data points (drill sites), lines (roads), polygons (park boundary), volumes (3D CAD structure), or some combination of all these. Typically each feature in the map will be tied to a set of attribute layers stored in a database (road names, site ID, geologic type, etc.). As a general rule these can exist in 2D or 3D space and are independent of the GIS's computation region.
See also vectorintro in the manual.
Vector geometry
In all GRASS GIS versions,
- with topology the feature limit is currently 2^31 - 1 (about 2 billion) features per vector map.
- TODO: add limit if topology creation is disabled at import for points (e.g., LiDAR points).
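The value 2^31 - 1 is the largest number a signed 32-bit integer can hold. Assuming that this is where the topology limit comes from (the text above only states the number), a quick check of a planned import size could look like this sketch; the function name is hypothetical.

```python
import struct

# Largest value of a signed 32-bit integer: 2,147,483,647.
INT32_MAX = 2**31 - 1


def within_topology_limit(n_features):
    """Return True if n_features does not exceed the stated topology limit."""
    return 0 <= n_features <= INT32_MAX


def packs_as_int32(value):
    """Return True if value can be stored in a signed 32-bit integer."""
    try:
        struct.pack("<i", value)
        return True
    except struct.error:
        return False


if __name__ == "__main__":
    for n in (2_000_000_000, 3_000_000_000):
        print(n, "features within limit:", within_topology_limit(n))
```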
Vector attribute management
Attributes are managed through a SQL interface (see also databaseintro).
The default database backend is
- DBF files (tend to be slow) in GRASS GIS 6 (grass-dbf)
- SQLite file (very fast compared to DBF) in GRASS GIS 7 (grass-sqlite)
Other SQL backends are offered as well including PostgreSQL, MySQL, etc.: see sql support in GRASS GIS.
Speed of DBF versus SQLite drivers: attribute operations which take hours using the DBF backend just take seconds using the SQLite backend.
Maximum Number of Attribute Columns
The maximum number of attribute columns of a table connected to a vector map is defined by the capabilities of the selected database backend (set with db.connect).
- DBF-Backend: GRASS 4.x - 6.x use by default the DBF backend. While there is no explicitly stated maximum number of allowed attribute columns, Web sources report a maximum between 128 and 1023/24. Trials with GRASS 6.4.2 in 2012 resulted in write failure if > 2000 attribute columns were used. Export to DBF-based ESRI Shapefile provides a warning if more than 255 attributes are used. Other software tools may ignore all further attributes, hence a maximum of 128 columns may be prudent.
- SQLite-Backend: GRASS 7.x uses by default the SQLite backend. The default maximum number of attribute columns is 2000 according to the specifications. This number can be increased by compiling SQLite with changed settings.
- MySQL-Backend: The default maximum number of attribute columns is 4096 according to the specifications.
- PostgreSQL-Backend: The default maximum number of attribute columns is 250-1600 according to the specifications depending on column types.
- Oracle-Backend: The default maximum number of attribute columns is 1000 according to the specifications.
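The limits listed above can be collected in a small lookup table to check a planned table layout before import. This is only a sketch: the dictionary keys are illustrative labels rather than GRASS GIS driver names, the DBF entry uses the prudent value of 128 rather than a hard limit, and the PostgreSQL entry uses the lower end of the 250-1600 range.

```python
# Default column limits as quoted in the list above (conservative values).
COLUMN_LIMITS = {
    "dbf": 128,          # prudent value, not a hard limit
    "sqlite": 2000,
    "mysql": 4096,
    "postgresql": 250,   # lower end of the 250-1600 range
    "oracle": 1000,
}


def columns_fit(backend, n_columns):
    """Return True if n_columns stays within the quoted limit for backend."""
    try:
        limit = COLUMN_LIMITS[backend]
    except KeyError:
        raise ValueError(f"no limit recorded for backend: {backend}") from None
    return n_columns <= limit


if __name__ == "__main__":
    for backend in COLUMN_LIMITS:
        print(f"{backend:>10}: 300 columns fit -> {columns_fit(backend, 300)}")
```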
Maximum file size of the attributes file
- DBF-Backend (in GRASS 6 the default DB backend): to be added (2Gb? in case of LFS enabled?)
- SQLite-Backend (in GRASS 7 the default DB backend): The maximum file size of a SQLite db is 140 TB, independent of the architecture, i.e. Large File Support (LFS) is always there. Usually SQLite will hit the maximum file size limit of the underlying filesystem or disk hardware size limit long before it hits its own internal size limit.
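The 140 TB figure can be reproduced as page size times page count. The sketch below assumes the figure corresponds to the largest page size (65536 bytes) multiplied by a maximum of 2^31 - 2 pages; newer SQLite releases may document a larger page count. The two `PRAGMA` statements report the values of the SQLite library that Python is linked against, which can differ from the one GRASS GIS uses.

```python
import sqlite3

MAX_PAGE_SIZE = 65536          # bytes, largest SQLite page size
ASSUMED_MAX_PAGES = 2**31 - 2  # assumption behind the 140 TB figure


def max_db_bytes(page_size, page_count):
    """Upper bound of a database file: page size times number of pages."""
    return page_size * page_count


def current_limits():
    """Return (page_size, max_page_count) of the local SQLite library."""
    conn = sqlite3.connect(":memory:")
    try:
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        max_pages = conn.execute("PRAGMA max_page_count").fetchone()[0]
    finally:
        conn.close()
    return page_size, max_pages


if __name__ == "__main__":
    total = max_db_bytes(MAX_PAGE_SIZE, ASSUMED_MAX_PAGES)
    print(f"assumed maximum: {total / 1e12:.1f} TB")
    print("local (page_size, max_page_count):", current_limits())
```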
Raster management
A "raster map" is a data layer consisting of a gridded array of cells. It has a certain number of rows and columns, with a data point (or null value indicator) in each cell. These may exist as a 2D grid or as a 3D cube made up of many smaller cubes, i.e. a stack of 2D grids.
See also rasterintro in the manual.
Raster map precision types
- CELL DATA TYPE: a raster map of INTEGER type (4 bytes, whole numbers only).
  - In GRASS GIS, CELL is a 32 bit integer with a range from -2,147,483,647 to +2,147,483,647. The value -2,147,483,648 is reserved for NODATA.
- FCELL DATA TYPE: a raster map of FLOAT type (4 bytes, 7-9 digits precision).
  - In GRASS GIS, FCELL is a 32 bit float (Float32) with a range from -3.4E38 to 3.4E38. However, integers are represented exactly only between -16777216 and 16777216 (±2^24). If your raster values exceed this range, we strongly suggest using DCELL (Float64) instead.
- DCELL DATA TYPE: a raster map of DOUBLE type (8 bytes, 15-17 digits precision).
  - In GRASS GIS, DCELL is a 64 bit float (Float64) with a range from -1.79E308 to 1.79E308.
- NULL: represents "no data" in raster maps, to be distinguished from 0 (zero) data value.
Aliases:
- INTEGER MAP: see CELL DATA TYPE
- FLOAT MAP: see FCELL DATA TYPE
- DOUBLE MAP: see DCELL DATA TYPE
(reference in the GRASS GIS source code)
See also GRASS raster semantics
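The Float32 bound of 16777216 (2^24) for exact integers can be checked without GRASS GIS by round-tripping values through a 4-byte float. A minimal sketch using Python's `struct` module; the constant and function names are chosen for illustration only.

```python
import struct

# CELL range and NODATA value as described above.
CELL_MIN, CELL_MAX = -(2**31 - 1), 2**31 - 1
CELL_NULL = -(2**31)

# Largest magnitude up to which every integer is exact in Float32.
FLOAT32_EXACT_INT_MAX = 2**24  # 16,777,216


def to_float32(value):
    """Round value to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def survives_fcell(n):
    """Return True if integer n is unchanged after storage as Float32."""
    return to_float32(n) == n


if __name__ == "__main__":
    for n in (FLOAT32_EXACT_INT_MAX, FLOAT32_EXACT_INT_MAX + 1):
        print(n, "->", to_float32(n), "exact:", survives_fcell(n))
```

Values just beyond the bound are silently rounded to a neighbouring representable number, which is why integer-like data such as large IDs or counts are safer in CELL or DCELL maps.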
Large file support
Large raster data processing
GRASS GIS 7 supports the off_t type, hence it can address an enormous amount of raster data.
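Why the width of the file offset matters can be shown with a short calculation. The sketch assumes uncompressed storage at 4 bytes per cell and compares the result with the largest offset a signed 32-bit or 64-bit integer can address; actual raster files are usually compressed, and the width of `off_t` depends on the platform and build options.

```python
def raw_size_bytes(cells, bytes_per_cell=4):
    """Uncompressed size of a raster with the given number of cells."""
    return cells * bytes_per_cell


def max_offset(bits):
    """Largest file offset addressable with a signed integer of `bits` bits."""
    return 2 ** (bits - 1) - 1


def addressable(size_bytes, offset_bits):
    """Return True if a file of size_bytes fits within the offset range."""
    return size_bytes <= max_offset(offset_bits)


if __name__ == "__main__":
    cells = 48_000_000_000  # about the size of the 25 m European DEM
    size = raw_size_bytes(cells)
    print(f"raw size: {size / 1e9:.0f} GB")
    print("32-bit offsets suffice:", addressable(size, 32))
    print("64-bit offsets suffice:", addressable(size, 64))
```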
See also:
- the wiki page "Large raster data processing"
Some benchmarks
- Import of ECAD 6.0 Tmean dataset: 22650 layers in a single netCDF file: import takes 300 seconds while reading the file via NFS (i.e. 75 maps per second)
- Calculation of watersheds, half basins, flow accumulation, drainage directions, and streams with r.watershed for an area of 90,000 rows x 100,000 cols (9,000,000,000 cells, metric) successfully done in 77.2 hours (Intel Xeon X5670, 2.93GHz)
- European DEM at 25m (eudem_dem_3035_europe.tif, 24.1 GB GeoTIFF, 48 billion cells) processing:
  - Import of this GeoTIFF file with r.in.gdal on a blade via NFS: a) 77h without memory option (hence 40MB = GDAL's default cache), b) 1.5h with memory=300 (hence using 300MB GDAL cache), c) 1.5h with memory=2000 (hence using 2GB GDAL cache)
- r.neighbors with 3.694261e+12 pixels (rows: 440046 cols: 830958 cells)
- Import of Global Forest Loss map with rows=560000 * cols=1440000 = 8.064e+11 pixels (see trac #3365); map can be easily shown in GRASS GIS monitor
- r.stream.extract: the upper limit on the number of cells it can handle is about 1.15e+18 raster cells (1.15 "exa"-cells). The number of detected stream segments must not be larger than 2,147,483,647.
- ... add more
Large vector data processing
GRASS GIS 7 supports the off_t type, hence it can address an enormous amount of vector data. Currently multi-billion vector points have been managed (citation) without topology (since not needed). In all GRASS versions, the limit with topology is currently 2^31 - 1 (about 2 billion) features per vector map.
See also:
- the wiki page "Large vector data processing"
Some benchmarks
- ...
- ... add more
Parallelization
In GRASS 7, a few modules have been experimentally parallelized with OpenMP. Independently of that, if the data can be processed in chunks, GRASS GIS can be used on clusters.
Parallelized modules:
- v.surf.rst, r.sim.water, r.sun, ...
Note: As of 2020, there are still issues with OpenMP (it may lead to unexpected results or slower performance).
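Processing in chunks usually starts by cutting the computational region into bands that separate jobs can handle. The sketch below splits a region into horizontal bands aligned to the resolution and returns bounds in the same n/s/w/e terms that g.region uses. The function name is hypothetical, launching the jobs is outside the sketch, and moving window operations would additionally need overlapping bands.

```python
def split_region(north, south, west, east, res, n_chunks):
    """Split a region into up to n_chunks horizontal bands of whole rows.

    Returns a list of dicts with the keys n, s, w, e and rows, ordered
    from north to south. Leftover rows go to the first bands.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    total_rows = round((north - south) / res)
    base, extra = divmod(total_rows, n_chunks)
    bands = []
    top = north
    for i in range(n_chunks):
        rows = base + (1 if i < extra else 0)
        if rows == 0:
            continue  # more chunks requested than rows available
        bottom = top - rows * res
        bands.append({"n": top, "s": bottom, "w": west, "e": east, "rows": rows})
        top = bottom
    return bands


if __name__ == "__main__":
    for band in split_region(228500, 215000, 630000, 645000, 0.5, 4):
        print(band)
```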
Benchmarks
r.neighbors
$ g.region -pa n=228500 s=215000 w=630000 e=645000 res=0.5
$ r.random.surface output=random
$ time r.neighbors input=random output=avg,min,max method=average,minimum,maximum size=5
real 6m58.801s
user 6m45.132s
sys 0m6.864s
810,000,000 cells (27,000x30,000), 3 outputs (average, min, max), window size 5, one core, negligible use of RAM.
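The cell count follows directly from the region settings, and dividing it by the wall-clock time gives a throughput figure that is easier to compare across machines. A small sketch of that arithmetic, with hypothetical function names:

```python
def region_cells(north, south, west, east, res):
    """Return (rows, cols, cells) for a region with square cells of size res."""
    rows = round((north - south) / res)
    cols = round((east - west) / res)
    return rows, cols, rows * cols


def cells_per_second(cells, minutes, seconds):
    """Throughput from a cell count and a wall-clock time given as min + s."""
    return cells / (minutes * 60 + seconds)


if __name__ == "__main__":
    rows, cols, cells = region_cells(228500, 215000, 630000, 645000, 0.5)
    print(f"{rows} rows x {cols} cols = {cells} cells")
    rate = cells_per_second(cells, 6, 58.801)  # "real" time from the run above
    print(f"about {rate / 1e6:.2f} million cells per second")
```

For the run above this works out to roughly 1.9 million cells per second on one core.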