| {{MovedToTrac|LargeFileSupport}}
| (largely based on comments by Glynn Clements on the GRASS-dev mailing list)
| |
| | |
| == The need ==
| |
| | |
| The standard C <stdio.h> functions fseek() and ftell() represent file offsets as a long integer. On systems where long is 32 bits, this overflows at 2 GiB, so handling larger files requires Large File Support (LFS). In GRASS GIS 6, LFS is implemented only in libgis (i.e. raster maps can be read and written, but few import/export modules or vector functions support it). The situation is much better in GRASS GIS 7.
| |
| | |
| In GRASS, configure.in/configure have been updated to support --enable-largefile (based on code from "cdr-tools"). See [http://trac.osgeo.org/grass/browser/grass/trunk/configure.in configure.in] (around line 1610).
| |
| | |
| == The issues ==
| |
| | |
| The problem is that ftell() returns the result as a (signed) long. If
| |
| the result won't fit into a long, it returns -1 (and sets errno to
| |
| EOVERFLOW).
| |
| | |
| This can only happen if you also set _FILE_OFFSET_BITS to 64 so that
| |
| fopen() is redirected to fopen64(), otherwise fopen() will simply
| |
| refuse to open files larger than 2GiB (apparently, this isn't true on
| |
| some versions of MacOSX, which open the file anyhow then fail on
| |
| fseek/ftell once you've passed the 2GiB mark).
| |
| | |
| If you want to obtain the current offset for a file whose size exceeds
| |
| the range of a signed long, you instead have to use the (non-ANSI)
| |
| ftello() function, which returns the offset as an off_t.
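As a minimal sketch of the point above, the helper below measures an open stream with fseeko()/ftello() so that the result is an off_t rather than a long. The name <tt>stream_size</tt> is hypothetical and not part of GRASS; the two macros are assumed to be what the platform needs to expose a 64-bit off_t and the fseeko()/ftello() declarations.

```c
#define _LARGEFILE_SOURCE       /* assumed to expose fseeko()/ftello() */
#define _FILE_OFFSET_BITS 64    /* assumed to make off_t 64 bits wide */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper (not a GRASS function): return the size of an
 * open stream as an off_t, restoring the current position afterwards.
 * Returns -1 on error. */
static off_t stream_size(FILE *fp)
{
    off_t here, size;

    assert(fp != NULL);

    here = ftello(fp);                  /* remember where we are */
    if (here < 0)
        return -1;
    if (fseeko(fp, 0, SEEK_END) != 0)   /* jump to the end */
        return -1;
    size = ftello(fp);                  /* offset at the end = size */
    if (fseeko(fp, here, SEEK_SET) != 0)
        return -1;
    return size;
}
```

Both macros have to appear before the first system header for them to take effect.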
| |
| | |
| '''''TODO:''''' But before we do that, we would need to add configure checks so that we don't try to use ftello() on systems which don't provide it.
| |
| | |
| | |
| There isn't a truly portable solution. Some platforms might not even
| |
| have an integral type larger than 32 bits.
| |
| | |
| The most practical solution is to use ftello() if it's available.
| |
| | |
| This will require some configure checks. These are simple enough to
| |
| implement; it's the design which is problematic (as usual).
| |
| | |
| Unlike most HAVE_FOO checks, fseeko() isn't a simple have/don't-have
| |
| check. Rather, it's usually a case that the function is available only
| |
| when certain macros are defined (e.g. _LARGEFILE_SOURCE).
| |
| | |
| That gives rise to the question of what we check for, how we check for
| |
| it, how we pass that information to the code, and how we use it.
| |
| | |
| | |
| The trick is deciding what to test, and what to indicate. Do we want
| |
| to know:
| |
| | |
| 1. Whether ftello() exists with the default CFLAGS/CPPFLAGS?
| |
| | |
| 2. Whether ftello() exists with the default CFLAGS/CPPFLAGS plus a
| |
| fixed selection of additional switches (e.g. -D_LARGEFILE_SOURCE)?
| |
| | |
| 3. Whether ftello() exists with the default CFLAGS/CPPFLAGS plus a
| |
| variable selection of additional switches (e.g. -D_LARGEFILE_SOURCE)
| |
| with those switches being communicated via an additional variable?
| |
| | |
| The test would need to be:
| |
| #if defined(HAVE_FTELLO) && defined(HAVE_FSEEKO)
| |
| as you are using both of those.
| |
| | |
| Also, if the existence of FSEEKO/FTELLO is conditional upon certain
| |
| macros (e.g. _LARGEFILE_SOURCE), you have to ensure that those macros
| |
| are defined before the first inclusion of <stdio.h>, including any
| |
| "hidden" inclusion from other headers, which essentially means that
| |
| you have to ensure that the macros are defined before you include any
| |
| other headers (except for <grass/config.h>).
| |
| | |
| | |
| Rather than try to come up with some infrastructure which allows us to
| |
| use LFS in a piecemeal fashion, it would be preferable to clean up the
| |
| GRASS code so that we can enable LFS globally. Then, we can just add:
| |
| | |
| -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64
| |
| | |
| to CPPFLAGS, and not have to worry about adding the necessary macros
| |
| to individual files. Any HAVE_* checks then become simple
| |
| have/don't-have checks.
| |
| | |
| | |
| My inclination would be to add:
| |
| | |
| extern off_t G_ftell(FILE *fp);
| |
| extern int G_fseek(FILE *stream, off_t offset, int whence);
| |
| | |
| These would be implemented using fseeko/ftello where available, and
| |
| fseek/ftell otherwise. That eliminates the need to perform checks in
| |
| individual source files. However, we would need to take care not to
| |
| use them in code which can't handle large offsets (i.e. code which
| |
| will truncate off_t values to int/long).
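One possible shape for such wrappers is sketched below. It is an illustration, not the actual libgis code: the functions are named <tt>sketch_ftell</tt> and <tt>sketch_fseek</tt> to make that clear, and HAVE_FTELLO / HAVE_FSEEKO are assumed to be defined by a configure check. The fallback branch rejects offsets that do not fit into a long instead of silently truncating them.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>

/* Return the current offset of fp as an off_t, or -1 on error. */
off_t sketch_ftell(FILE *fp)
{
    assert(fp != NULL);
#if defined(HAVE_FTELLO) && defined(HAVE_FSEEKO)
    return ftello(fp);
#else
    return (off_t) ftell(fp);   /* limited to the range of long */
#endif
}

/* Seek to offset; returns 0 on success and -1 on failure. */
int sketch_fseek(FILE *fp, off_t offset, int whence)
{
    assert(fp != NULL);
#if defined(HAVE_FTELLO) && defined(HAVE_FSEEKO)
    return fseeko(fp, offset, whence);
#else
    /* Fail explicitly rather than truncate the offset to a long. */
    if (offset > LONG_MAX || offset < LONG_MIN) {
        errno = EOVERFLOW;
        return -1;
    }
    return fseek(fp, (long) offset, whence);
#endif
}
```

Callers keep offsets in off_t variables throughout, so the only place that knows which underlying functions are in use is this one file.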
| |
| | |
| | |
| If you use -D_FILE_OFFSET_BITS=64, fopen() should be redirected to
| |
| fopen64(). There are reports that the MacOSX fopen() is equivalent to
| |
| fopen64() even without additional switches.
| |
| | |
| == Coding LFS in GRASS GIS 6 ==
| |
| | |
| Currently the <tt>--enable-largefile</tt> switch only enables LFS in libgis, not anywhere else.
| |
| | |
| [Although config.h includes definitions to enable LFS automatically,
| |
| those definitions are currently inactive. This is probably a good
| |
| thing; a lot of GRASS' code isn't LFS-aware, and explicit failure is
| |
| preferable to silently corrupting data.]
| |
| | |
| To enable LFS elsewhere, you need to manually add
| |
| -D_FILE_OFFSET_BITS=64 to the compilation flags. The simplest approach
| |
| is to add to the module's Makefile:
| |
| | |
| ifneq ($(USE_LARGEFILES),)
| |
| EXTRA_CFLAGS = -D_FILE_OFFSET_BITS=64
| |
| endif
| |
| | |
| and include <tt>grass/config.h</tt> before '''all''' other header files in the code:
| |
| #include <grass/config.h>
| |
| #include <stdio.h>
| |
| #include <string.h>
| |
| #include <grass/gis.h>
| |
| ...
| |
| | |
| === int versus off_t ===
| |
| You may as well just use "<tt>off_t filesize</tt>" unconditionally. An
| |
| "off_t" will always be large enough to hold a "long".
| |
| | |
| If using "<tt>off_t</tt>", be sure to add:
| |
| #include <sys/types.h>
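One practical detail when switching to off_t is that printf() has no dedicated conversion for it. A common workaround, sketched here, is to cast to <tt>long long</tt> and print with <tt>%lld</tt>. The helper name <tt>format_offset</tt> is hypothetical, and the sketch assumes the compiler provides a <tt>long long</tt> at least as wide as off_t.

```c
#define _FILE_OFFSET_BITS 64    /* assumed to make off_t 64 bits wide */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: write offset as decimal text into buf.
 * Returns whatever snprintf() returns, i.e. the number of characters
 * the full text needs, excluding the terminating NUL. */
static int format_offset(off_t offset, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    /* Assumption: long long is at least as wide as off_t. */
    return snprintf(buf, len, "%lld", (long long) offset);
}
```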
| |
| | |
| == Issues related to import/export of maps ==
| |
| | |
| Some import/export formats have their own intrinsic limitations; see for instance
| |
| http://www.gdal.org/formats_list.html
| |
| | |
| == GRASS GIS 6: LFS-safe libs and module list ==
| |
| * libgis
| |
| | |
| * r.in.arc
| |
| * r.in.ascii
| |
| * r.out.arc
| |
| * r.out.ascii
| |
| * r.proj.seg
| |
| * r.terraflow
| |
| | |
| == GRASS GIS 6: LFS works in progress ==
| |
| | |
| * r.in.xyz
| |
| * r.terraflow (integrate current LFS support into GRASS's --enable-largefile ./configure switch)<BR>(r.terraflow creates huge temporary files which can easily go over 2GB)
| |
| | |
| == GRASS GIS 6: LFS wish list ==
| |
| '''High priority modules to get LFS'''
| |
| * r.in.*
| |
| * r.out.*
| |
| * GRASS GDAL plugin (??)
| |
| * v.surf.rst
| |
| * v.surf.idw(2)
| |
| * vector libs (limited by number of features)
| |
| * v.in.ascii -bt (without topology)
| |
| * DB libs
| |
| | |
| == Coding LFS in GRASS GIS 7 ==
| |
| | |
| * Already enabled for raster and vector libraries and modules, see http://trac.osgeo.org/grass/wiki/Grass7/NewFeatures
| |
| | |
| '''Q:''' Does module xyz (e.g. r.texture) support LFS?
| |
| | |
| '''A:''' Modules typically don't need to do anything regarding LFS; the support is in the libraries. The main issue which affects modules is that they shouldn't assume that cell counts will fit into an "int" or even a "long". But even a module which does make that assumption won't affect I/O.
| |
| | |
| '''Details:'''
| |
| | |
| The most common way for the issue to arise is multiplying the number
| |
| of rows by the number of columns to obtain the total number of cells.
| |
| Most modules have no need to do this, but it is occasionally done e.g.
| |
| when calculating statistics or storing the data in a temporary file.
| |
| | |
| The number of rows and columns can reasonably be assumed to fit into a
| |
| signed 32-bit integer, but their product cannot.
| |
| | |
| Even on a system with a 64-bit "long"[1], multiplying 2 "int"s will
| |
| produce an "int" result, and assigning the result to a "long" variable
| |
| doesn't change that. E.g.
| |
| | |
| int nrows = Rast_window_rows();
| |
| int ncols = Rast_window_cols();
| |
| long ncells = nrows * ncols;
| |
| | |
| will truncate the result of the multiplication to an "int" (which is
| |
| 32 bits on all mainstream platforms) then expand the truncated value
| |
| to 64 bits in the assignment. To perform the multiplication using
| |
| "long", one of the arguments must be converted, e.g.:
| |
| | |
| long ncells = (long) nrows * ncols;
| |
| | |
| [1] This doesn't include 64-bit versions of Windows, where "long" is only 32 bits for compatibility reasons.
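Given the Windows caveat in the footnote, one way to avoid depending on the width of "long" is to do the multiplication in a fixed-width 64-bit type. The sketch below assumes <tt><stdint.h></tt> is available; the helper name <tt>cell_count</tt> is hypothetical and not a GRASS function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total number of cells, computed in 64 bits.
 * Converting nrows before multiplying makes the whole product 64-bit,
 * instead of widening an already truncated "int" result. */
static int64_t cell_count(int nrows, int ncols)
{
    assert(nrows >= 0 && ncols >= 0);
    return (int64_t) nrows * ncols;
}
```

For a 50000 x 50000 region this yields 2500000000, which does not fit into a signed 32-bit integer.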
| |
| | |
| The issue isn't strictly related to LFS; due to compression, it's
| |
| possible for a raster with more than 2^31 cells to take up less than
| |
| 2 GiB on disk. LFS just makes the issue more likely to arise in practice.
| |
| | |
| == References ==
| |
| | |
| * [http://www.suse.de/~aj/linux_lfs.html LFS support in Linux]
| |
| * [http://opengroup.org/platform/lfs.html Adding Large File Support to the Single UNIX® Specification]
| |
| * [http://www.sun.com/software/whitepapers/wp-largefiles/largefiles.pdf Large Files in Solaris: A White Paper]
| |
|
| |
|
| [[Category: Development]]
| [[Category: massive data analysis]]