Vector aggregate values

From GRASS-Wiki
Jump to navigation Jump to search

Introduction

GIS users often wish to aggregate the values of one polygon layer into another. Although this is not integrated as a standard tool in GRASS, it is definitely possible. This article introduces two methods to do just that: A simple, quick operation will be described that can take place entirely in GRASS GIS and produces raster maps or, if desired, polygons with aggregated values from two maps and a second, more complex method that uses external editors and creates text files as outputs.

Method 1

written by Daniel Lee

In this example we assume that we have a vector map of catchment areas and a raster map of precipitation. We want to find out what the sum total of precipitation per catchment area is. This method would also work if the precipitation data were also stored as vector data, but the vector precipitation data would have to be turned into a raster map, just as the catchment areas are converted to rasters in the following steps. Remember to set the region, etc. in GRASS properly.

Step 1: Convert catchment area polygons to rasters

The polygons are converted to rasters using v.to.rast. The primary key for the polygons should be used as the values for the new rasters. If the polygons were imported from a shapefile, for example, the field FID could be used.

v.to.rast input=CatchmentAreas output=CatchmentAreas_Raster column=FID

Step 2: Convert precipitation raster to integer values

The tool that we will be using later, r.statistics, requires integer values. Because the precipitation raster is composed of floating point values, we have to convert it to an integer map first. In order to avoid rounding areas in the final results, we also multiply the raster by 100. This way we can divide it by 100 later in order to get the original float values back. If the raster that you want to aggregate into your vectors is already made of integer values or if rounding errors aren't a problem, this step can be changed or left out.

r.mapcalc 'Precipitation_Integer = round (Precipitation * 100)

Step 3: Aggregate the precipitation values by summing them in the catchment areas they fall in

Now we have two rasters that can be combined. In order to find the sum of the pixels of the precipitation map we use r.statistics.

r.statistics base=CatchmentAreas_Raster cover=Precipitation_Integer method=sum output=Precipitation_by_CatchmentArea

Step 4: Extract sums from the new map

Now the precipitation sums are aggregated by catchment area, but only as attributes and not as categories in the resulting raster. We now extract the aggregated sums, overwriting the previous raster, by using the "@" function in r.mapcalc. Additionally, the map is divided by 100 to scale the values back, reversing the change we made when we turned the original precipitation map into an integer map. Don't forget to not divide by 100 if you didn't multiply by 100 in step 2.

r.mapcalc 'Precipitation_by_CatchmentArea = @Precipitation_by_CatchmentArea / 100'

Step 5: Convert the aggregated precipitation raster to a vector map.

This procedure should already be familiar for those who work with vector maps. If vectors are not needed, it can be skipped.

r.to.vect input=Precipitation_by_CatchmentArea output=Precipitation_by_CatchmentArea_Vectors feature=area

In an optional last step one can combine the new vectors, which will probably not be exactly the same as the old vectors, with the old vector attribute table, either with v.db.join or v.distance using e.g. the centroids as spatial join criterium.

Method 2

written by Michael O'Donovan

GIS users accustomed to proprietary vector-based products (MapInfo, MapMaker, Atlas and some other ESRI) are often disappointed when the try migrating to OpenSource software. Many of the key functions that make vector-based systems so useful are unavailable in OpenSource software like QGIS, Thuban and GRASS. One such function is the ability to aggregate values from one polygon layer to another.

A typical example would be one of aggregating the population of census tracts (or municipalities) to derive the population of a suburb (or province). The aggregation is simply one of adding the tracts together only when the outer edge of some combination of polygons coincide. This happens when tracts are defined in such a way that they coincide with suburb perimeters. However when the boundaries do not coincide the situation usually calls for a vector-based system that will allocate populations according to the location of the tract centroid or the proportion of a tract falling into a particular suburb. The absence of this aggregation feature discourages potential users from adopting OpenSource software like GRASS. The section that follows shows how, with a little bit of preparation, aggregations can be easily performed by GRASS raster functions. It follows a hypothetical example inspired by the proposed redrawing of South Africa's nine provinces which currently dissect many of the 280+ municipalities. It shows how municipal populations can be re-combined to derive the provincial population by converting the vector features to rasters.

For simplicity it is assumed that the user has two shape files "Municipalities" and "Province". The "Municipalities" file is linked to a file "Municipalities.dbf" which contains numerous variables including "pop" (its population) and "SHP_FID" (a unique number which identifies each polygon). Both files can be easily imported into GRASS (using v.in.ogr) and displayed along with their variables. Before the vector files can be converted to rasters an unusual preparatory step is required. This entails converting the variables of interest from a value per vector to a value per raster cell. To make this conversion it is necessary to place into the appropriate .dbf file a count of the number of raster cells that will represent that vector. In this example this will enable the user to represent the population not as a number per municipality but as a number per cell. All subsequent aggregations need to be based on these cell values and thus on the cell counts.

The first step in preparation is to calculate the number of cells in each municipality. To do this convert the municipality vector into a raster file. Remember that, unlike vectors which can be linked to a large number of other variables, raster cells take on a single value only. For this step this value must be a number that uniquely identifies each municipality. If the municipality vector was obtained from an ESRI shapefile the field "SHP_FID" doesthe job. From the command line type:

v.to.rast input=Municipalities output=municipalities use=attr col=SHP_FID

This creates a raster file called "municipalities". As vectors and raster are of different types there is little prospect of their names being confused. (Nevertheless vectors are indicated by a starting capital while rasters are kept as lowercase words.) the new raster can be seen by typing...

d.rast municipalities

If is looks funny change the color scheme using something like...

r.colors map=municipalities color=random

and d.redraw it.

The next step requires extracting a count of the number of raster cells in each municipality. Once again this is a simple procedure. To save do this and save the output to a file called "munic_count" type...

r.stats input=municipalities output=munic_count -c

This will produce a text file with the SHP_FID and a cell count seperated by a space. The final preparatory step is to place the count corresponding to each municipality into the Municipalities.dbf file and to calculate the variables of interest as an amount per cell.


This is how you would link the files in OpenOffice if you have not yet set up your data sources and do not know how to use VLOOKUP etc.

  1. Open the file "munic_count" - it will default to a text page containing the SHP_FID, an open space and the cell count.
  2. Convert the open spaces to tabs using "Edit", "Find and Replace". Place a space in the "search for" box and \t in the "replace with" boxes. Activate "Regular expressions" and click on "Replace all".
  3. Select all of the data (Ctrl A) and copy it to a clipboard (Ctrl C)

You now have all the data ordered by SHP_FID ready to be pasted into the Municpalities.dbf file. So open the Municipalities.dbf file - it will default to a spreadsheet.

  1. Make sure that the spreadsheet data is in ascending ordered of SHP_FID. (Select all data other than the headers then go to "Data", "Sort" and select the appropriate column).
  2. Click on to cell at the top of the first unused column and click on "Edit", "Paste special" and "unformatted text". The cells values and SHP_FID data should now be placed in the corresponding rows. Check that this has been done.

Variables of interest

If you are interested in, say, aggregating populations then calculate the population per cell for every municipality. (If your data starts in row 1, population is in column C and the cell count in column K type "=C1/K1". Copy and then paste the formula into all other rows.) Give the new column an appropraite name, say, "cell_value" and save the file under its original name (i.e. as a .dbf). You may well return at a latter stage and calculate new variables of interest using the cell counts.


You can now proceed with the aggregations.

First convert the Municipalities vector into a raster (again). This time use the cell_value as the attribute of interest. At the command prompt type...

v.to.rast input=Municipalities output=municipalities use=attr col=cell_value

Note that I gave the output raster the same name as earlier. This will overwrite the original raster thereby minimizing clutter.

Now also convert the Provinces vector to a raster file

v.to.rast input=Provinces output=provinces use=attr col=SHP_FID

The relationship between the two raster files can now be explored using a number of statistics commands most of which are found under r.statistics. Perhaps the most important statistics are r.sum and r.average.

The command

r.sum municipalities

will give the total of all the cell values in the municipalities raster. This equals the total population of the region. However what is actually required in the sum of the population for each province. To get this we use r.statistics. Unfortunately this script does not work for floating point rasters (which is what you are most likely to have). To overcome this problem create a new raster reflecting the integer value of each cell. To create a file usable for r.statistics use..

r.mapcalc
(then enter line-by-line)
        int_municpal=round(municipalities)
        end

If rounding of the cell values to an integer results in a notable decline in accuracy then create a new raster file in which the values are multiplied by ten or a hundred and only then rounded. In latter analysis recall that the units have been dramatically (but precisely) inflated. To do this try "int_munic=round(100*municipalities)".

To finally aggregate the municipal population to provinces type...

r.statistics base=provinces cover=int_municipal method=sum output=prov_pop

This creates a new raster "prov_pop" which can be explored using

d.what.vect

(click on each region of interest)

You will see that the values for each province now equals the sum of all the cells that fall into that province. If you want to list the values so that you can put them into Provinces.dbf (i.e. the vector) use..

r.stats input=RSA_STATS -l

The output can then be placed into the corresponding data file using a method similar to that above.

r.statistics comes with a range of other functions other than "sum". These substantially improve the functionality of the method. Statistics include distribution, average, mode, median, average deviation, standard deviation, variance, skewness, kurtosis, minimum and maximum. For assistance type "man r.statistics" from the command prompt. Additional functionality is gained by r.mapcalc's ability to perform functions on multiple raster files. This, in itself, makes the move to raster-based mapping most promising.


Sometimes the above procedure results in a notable loss in accuracy. There are two main sources for this:

The rounding of cell values (to use r.statistics) may be problematic for areas with very small values. If this is the case consider inflating cell values by one of several orders of magnitude.

When the region was set up a resolution (E-W and N-S) was specified. If this resolution is not fine enough there may be some loss of accuracy through aggregation. Consider increasing the resolution.