Vector aggregate values

Q: How to aggregate values from one polygon layer to another

A: (written by Michael O'Donovan)

GIS users accustomed to commercial vector-based products (MapInfo, MapMaker, Atlas and some other ESRI) are often disappointed when the try migrating to OpenSource software. Many of the key functions that make vector-based systems so useful are unavailable in OpenSource software like QGIS, Thuban and GRASS. One such function is the ability to aggregate values from one polygon layer to another. A typical example would be one of aggregating the population of census tracts (or municipalities) to derive the population of a suburb (or province). The aggregation is simply one of adding the tracts together only when the outer edge of some combination of polygons coincide. This happens when tracts are defined in such a way that they coincide with suburb perimeters. However when the boundaries do not coincide the situation usually calls for a vector-based system that will allocate populations according to the location of the tract centroid or the proportion of a tract falling into a particular suburb. The absence of this aggregation feature discourages potential users from adopting OpenSource software like GRASS. The section that follows shows how, with a little bit of preparation, aggregations can be easily performed by GRASS raster functions. It follows a hypothetical example inspired by the proposed redrawing of South Africa's nine provinces which currently dissect many of the 280+ municipalities. It shows how municipal populations can be re-combined to derive the provincial population by converting the vector features to rasters.

For simplicity it is assumed that the user has two shape files "Municipalities" and "Province". The "Municipalities" file is linked to a file "Municipalities.dbf" which contains numerous variables including "pop" (its population) and "SHP_FID" (a unique number which identifies each polygon). Both files can be easily imported into GRASS (using v.in.ogr) and displayed along with their variables. Before the vector files can be converted to rasters an unusual preparatory step is required. This entails converting the variables of interest from a value per vector to a value per raster cell. To make this conversion it is necessary to place into the appropriate .dbf file a count of the number of raster cells that will represent that vector. In this example this will enable the user to represent the population not as a number per municipality but as a number per cell. All subsequent aggregations need to be based on these cell values and thus on the cell counts.

The first step in preparation is to calculate the number of cells in each municipality. To do this convert the municipality vector into a raster file. Remember that, unlike vectors which can be linked to a large number of other variables, raster cells take on a single value only. For this step this value must be a number that uniquely identifies each municipality. If the municipality vector was obtained from an ESRI shapefile the field "SHP_FID" doesthe job. From the command line type:

       v.to.rast input=Municipalities output=municipalities use=attr col=SHP_FID

This creates a raster file called "municipalities". As vectors and raster are of different types there is little prospect of their names being confused. (Nevertheless vectors are indicated by a starting capital while rasters are kept as lowercase words.) the new raster can be seen by typing...

       d.rast municipalities.

If is looks funny change the color scheme using something like...

       r.colors map=municipalities color=random

and d.rast it again.

The next step requires extracting a count of the number of raster cells in each municipality. Once again this is a simple procedure. To save do this and save the output to a file called "munic_count" type...

       r.stats input=municipalities output=munic_count -c

This will produce a text file with the SHP_FID and a cell count seperated by a space. The final preparatory step is to place the count corresponding to each municipality into the Municipalities.dbf file and to calculate the variables of interest as an amount per cell.

This is how you would link the files in OpenOffice if you have not yet set up your data sources and do not know how to use VLOOKUP etc.

Open the file "munic_count" - it will default to a text page containing the SHP_FID, an open space and the cell count.
Convert the open spaces to tabs using "Edit", "Find and Replace". Place a space in the "search for" box and \t in the "replace with" boxes. Activate "Regular expressions" and click on "Replace all".
Select all of the data (Ctrl A) and copy it to a clipboard (Ctrl C)

You now have all the data ordered by SHP_FID ready to be pasted into the Municpalities.dbf file. So open the Municipalities.dbf file - it will default to a spreadsheet.

Make sure that the spreadsheet data is in ascending ordered of SHP_FID. (Select all data other than the headers then go to "Data", "Sort" and select the appropriate column).
Click on to cell at the top of the first unused column and click on "Edit", "Paste special" and "unformatted text". The cells values and SHP_FID data should now be placed in the corresponding rows. Check that this has been done.

Variables of interest.

If you are interested in, say, aggregating populations then calculate the population per cell for every municipality. (If your data starts in row 1, population is in column C and the cell count in column K type "=C1/K1". Copy and then paste the formula into all other rows.) Give the new column an appropraite name, say, "cell_value" and save the file under its original name (i.e. as a .dbf). You may well return at a latter stage and calculate new variables of interest using the cell counts.

You can now proceed with the aggregations.

First convert the Municipalities vector into a raster (again). This time use the cell_value as the attribute of interest. At the command prompt type...

       v.to.rast input=Municipalities output=municipalities use=attr col=cell_value

Note that I gave the output raster the same name as earlier. This will overwrite the original raster thereby minimizing clutter.

Now also convert the Provinces vector to a raster file

       v.to.rast input=Provinces output=provinces use=attr col=SHP_FID

The relationship between the two raster files can now be explored using a number of statistics commands most of which are found under r.statistics. Perhapsthe most important statistics are r.sum and r.average.

The command

       r.sum municpalities

will give the total of all the cell values in the municipalities raster. This equals the total population of the region. However what is actually required in the sum of the population for each province. To get this we use r.statistics. Unfortunately this script does not work for floating point rasters (which is what you are most likely to have). To overcome this problem create a new raster reflecting the integer value of each cell. To create a file useable for r.statistics use..

       r.mapcalc

(then enter line-by-line)

       int_municpal=round(municipalities)
       end

If rounding of the cell values to an integer results in a notable decline in accuracy then create a new raster file in which the values are multiplied by ten or a hundred and only then rounded. In latter analysis recall that the units have been dramatically (but precisely) inflated. To do this try "int_munic=round(100*municipalities)".

To finally aggregate the municipal population to provinces type...

       r.statistics base=provinces cover=int_municipal method=sum output=prov_pop

This creates a new raster "prov_pop" which can be explored using

d.what.vect (click on each region of interest)

You will see that the values for each province now equals the sum of all the cells that fall into that province. If you want to list the values so that you can put them into Provinces.dbf (i.e. the vector) use..

       r.stats input=RSA_STATS -l

The output can then be placed into the corresponding data file using a method similar to that above.

r.statistics comes with a range of other functions other than "sum". These substantially improve the functionality of the method. Statistics include distribution, average, mode, median, average deviation, standard deviation, variance, skewness, kurtosis, minimum and maximum. For assistance type "man r.statistics" from the command prompt. Additional functionality is gained by r.mapcalc's ability to perform functions on multiple raster files. This, in itself, makes the move to raster-based mapping most promising.

Sometimes the above procedure results in a notable loss in accuracy. There are two main sources for this:

The rounding of cell values (to use r.statistics) may be problematic for areas with very small values. If this is the case consider inflating cell values by one of several orders of magnitude.

When the region was set up a resolution (E-W and N-S) was specified. If this resolution is not fine enough there may be some loss of accuracy through aggregation. Consider increasing the resolution.

Vector aggregate values

Navigation menu

Search