Difference between revisions of "Large File Support"

(added Glynn's comments from ML: http://lists.osgeo.org/pipermail/grass-dev/2014-June/069805.html)
(largely based on comments by Glynn Clements on the GRASS-dev mailing list)
== The need ==
Standard C <stdio.h> functions such as ftell() and fseek() represent file offsets as a long integer. On 32-bit systems a long is 32 bits wide, so offsets overflow at 2 GiB. Supporting files bigger than this requires Large File Support (LFS). In GRASS GIS 6, LFS is only implemented in libgis (i.e. there is support for reading and writing raster maps, but few import/export modules or vector functions have it). The situation is much better in GRASS GIS 7.
In GRASS, configure.in/configure have been updated to support --enable-largefile (based on code from "cdr-tools"). See [http://trac.osgeo.org/grass/browser/grass/trunk/configure.in configure.in] (around line 1610).
== The issues ==
The problem is that ftell() returns the result as a (signed) long. If
the result won't fit into a long, it returns -1 (and sets errno to
EOVERFLOW).
This can only happen if you also set _FILE_OFFSET_BITS to 64 so that
fopen() is redirected to fopen64(), otherwise fopen() will simply
refuse to open files larger than 2GiB (apparently, this isn't true on
some versions of MacOSX, which open the file anyhow then fail on
fseek/ftell once you've passed the 2GiB mark).
If you want to obtain the current offset for a file whose size exceeds
the range of a signed long, you instead have to use the (non-ANSI)
ftello() function, which returns the offset as an off_t.
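As a minimal sketch of the point above, the size of an open stream can be obtained with fseeko()/ftello() so that the result is an off_t rather than a long. The helper name <tt>stream_size</tt> is hypothetical (it is not part of GRASS), and the two macro definitions assume a glibc-like POSIX platform where they are what makes off_t 64 bits wide and exposes the functions.

```c
/* Assumption: these must precede every system header so that off_t is
 * 64 bits wide and fseeko()/ftello() are declared. */
#define _FILE_OFFSET_BITS 64
#define _LARGEFILE_SOURCE 1

#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: return the size of fp in bytes, or -1 on error.
 * The current position of the stream is restored before returning. */
static off_t stream_size(FILE *fp)
{
    off_t here = ftello(fp);    /* remember where we are */
    off_t size;

    if (here == (off_t) -1)
        return -1;
    if (fseeko(fp, 0, SEEK_END) != 0)
        return -1;
    size = ftello(fp);          /* offset of end-of-file */
    if (fseeko(fp, here, SEEK_SET) != 0)
        return -1;
    return size;
}
```

On a platform without ftello() this will not compile, which is why the configure checks discussed in this section are needed.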
'''''TODO:''''' But before we do that, we would need to add configure checks so that we don't try to use ftello() on systems which don't provide it.
There isn't a truly portable solution. Some platforms might not even
have an integral type larger than 32 bits.
The most practical solution is to use ftello() if it's available.
This will require some configure checks. These are simple enough to
implement; it's the design which is problematic (as usual).
Unlike most HAVE_FOO checks, fseeko() isn't a simple have/don't-have
check. Rather, it's usually a case that the function is available only
when certain macros are defined (e.g. _LARGEFILE_SOURCE).
That gives rise to the question of what we check for, how we check for
it, how we pass that information to the code, and how we use it.
The trick is deciding what to test, and what to indicate. Do we want
to know:
1. Whether ftello() exists with the default CFLAGS/CPPFLAGS?
2. Whether ftello() exists with the default CFLAGS/CPPFLAGS plus a
fixed selection of additional switches (e.g. -D_LARGEFILE_SOURCE)?
3. Whether ftello() exists with the default CFLAGS/CPPFLAGS plus a
variable selection of additional switches (e.g. -D_LARGEFILE_SOURCE)
with those switches being communicated via an additional variable?
The test would need to be:
#if defined(HAVE_FTELLO) && defined(HAVE_FSEEKO)
as you are using both of those.
Also, if the existence of FSEEKO/FTELLO is conditional upon certain
macros (e.g. _LARGEFILE_SOURCE), you have to ensure that those macros
are defined before the first inclusion of <stdio.h>, including any
"hidden" inclusion from other headers, which essentially means that
you have to ensure that the macros are defined before you include any
other headers (except for <grass/config.h>).
Rather than try to come up with some infrastructure which allows us to
use LFS in a piecemeal fashion, it would be preferable to clean up the
GRASS code so that we can enable LFS globally. Then, we can just add the
LFS switches to CPPFLAGS, and not have to worry about adding the necessary macros
to individual files. Any HAVE_* checks then become simple
have/don't-have checks.
My inclination would be to add:
extern off_t G_ftell(FILE *fp);
extern int G_fseek(FILE *stream, off_t offset, int whence);
These would be implemented using fseeko/ftello where available, and
fseek/ftell otherwise. That eliminates the need to perform checks in
individual source files. However, we would need to take care not to
use them in code which can't handle large offsets (i.e. code which
will truncate off_t values to int/long).
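The two wrappers proposed above could be sketched roughly as follows. This is an illustration of the idea, not the code in libgis: it assumes that configure defines HAVE_FSEEKO and HAVE_FTELLO (as in the test quoted earlier), and the choice to reject out-of-range offsets with EINVAL in the fallback branch is an assumption made for this example.

```c
#include <stdio.h>
#include <errno.h>
#include <limits.h>
#include <sys/types.h>

/* Sketch: report the current offset as an off_t. */
off_t G_ftell(FILE *fp)
{
#if defined(HAVE_FTELLO) && defined(HAVE_FSEEKO)
    return ftello(fp);
#else
    /* Fallback: limited to the range of a long. */
    return (off_t) ftell(fp);
#endif
}

/* Sketch: seek to an off_t offset; returns 0 on success, -1 on error. */
int G_fseek(FILE *fp, off_t offset, int whence)
{
#if defined(HAVE_FTELLO) && defined(HAVE_FSEEKO)
    return fseeko(fp, offset, whence);
#else
    /* Refuse offsets a long cannot represent rather than truncating them. */
    if (offset > (off_t) LONG_MAX || offset < (off_t) LONG_MIN) {
        errno = EINVAL;
        return -1;
    }
    return fseek(fp, (long) offset, whence);
#endif
}
```

Callers then only ever deal with off_t, and the have/don't-have decision is confined to one source file.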
If you use -D_FILE_OFFSET_BITS=64, fopen() should be redirected to
fopen64(). There are reports that the MacOSX fopen() is equivalent to
fopen64() even without additional switches.
== Coding LFS in GRASS GIS 6 ==
Currently the <tt>--enable-largefile</tt> switch only enables LFS in libgis, not anywhere else.
[Although config.h includes definitions to enable LFS automatically,
those definitions are currently inactive. This is probably a good
thing; a lot of GRASS' code isn't LFS-aware, and explicit failure is
preferable to silently corrupting data.]
To enable LFS elsewhere, you need to manually add
-D_FILE_OFFSET_BITS=64 to the compilation flags. The simplest approach
is to add to the module's Makefile:
ifneq ($(USE_LARGEFILES),)
Also include <tt>config.h</tt> before '''all''' other header files in the code:
#include <grass/config.h>
#include <stdio.h>
#include <string.h>
#include <grass/gis.h>
=== int versus off_t ===
You may as well just use "<tt>off_t filesize</tt>" unconditionally. An
"off_t" will always be large enough to hold a "long".
If using "<tt>off_t</tt>", be sure to add:
#include <sys/types.h>
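A small sketch of working with an off_t file size follows. The helper names <tt>file_size</tt> and <tt>format_size</tt> are hypothetical. It assumes a POSIX stat() and that the file is compiled with -D_FILE_OFFSET_BITS=64 where that is needed to make off_t 64 bits wide. Since printf() has no conversion specifier for off_t, the value is cast to long long before printing.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: size of the file at path in bytes, or -1 on error. */
static off_t file_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return st.st_size;      /* st_size is an off_t */
}

/* Hypothetical helper: write size into buf as decimal text.
 * Returns the value of snprintf(). */
static int format_size(char *buf, size_t len, off_t size)
{
    return snprintf(buf, len, "%lld", (long long) size);
}
```

Storing the result in an "int" or "long" instead would reintroduce the truncation that off_t is meant to avoid.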
== Issues related to import/export of maps ==
Some import/export formats have their own intrinsic limitations, see for instance
== GRASS GIS 6: LFS-safe libs and module list ==
* libgis
* r.in.arc
* r.in.ascii
* r.out.arc
* r.out.ascii
* r.proj.seg
* r.terraflow
== GRASS GIS 6: LFS works in progress ==
* r.in.xyz
* r.terraflow (integrate current LFS support into GRASS's --enable-largefile ./configure switch)<BR>(r.terraflow creates huge temporary files which can easily go over 2GB)
== GRASS GIS 6: LFS wish list ==
'''High priority modules to get LFS'''
* r.in.*
* r.out.*
* GRASS GDAL plugin (??)
* v.surf.rst
* v.surf.idw(2)
* vector libs (limited by number of features)
* v.in.ascii -bt  (without topology)
* DB libs
== Coding LFS in GRASS GIS 7 ==
* Already enabled for raster and vector libraries and modules, see http://trac.osgeo.org/grass/wiki/Grass7/NewFeatures
'''Q:''' Does module xyz (e.g. r.texture) support LFS?
'''A:''' Modules typically don't need to do anything regarding LFS; the support is in the libraries. The main issue which affects modules is that they shouldn't assume that cell counts will fit into an "int" or even a "long". But even failing to do so won't have any effect upon I/O.
The most common way for the issue to arise is multiplying the number
of rows by the number of columns to obtain the total number of cells.
Most modules have no need to do this, but it is occasionally done e.g.
when calculating statistics or storing the data in a temporary file.
The number of rows and columns can reasonably be assumed to fit into a
signed 32-bit integer, but their product cannot.
Even on a system with a 64-bit "long"[1], multiplying 2 "int"s will
produce an "int" result, and assigning the result to a "long" variable
doesn't change that. E.g.
        int nrows = Rast_window_rows();
        int ncols = Rast_window_cols();
        long ncells = nrows * ncols;
will truncate the result of the multiplication to an "int" (which is
32 bits on all mainstream platforms) then expand the truncated value
to 64 bits in the assignment. To perform the multiplication using
"long", one of the arguments must be converted, e.g.:
        long ncells = (long) nrows * ncols;
[1] This doesn't include 64-bit versions of Windows, where "long" is only 32 bits for compatibility reasons.
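Because "long" is not 64 bits everywhere, one option is to do the multiplication in a fixed-width 64-bit type. The sketch below assumes a C99 compiler providing <stdint.h>; the helper name <tt>cell_count</tt> is hypothetical, not a GRASS function.

```c
#include <stdint.h>

/* Hypothetical helper: total number of cells, computed in 64 bits.
 * The cast converts nrows before the multiplication, so the product
 * is not truncated to the width of "int". */
static int64_t cell_count(int nrows, int ncols)
{
    return (int64_t) nrows * ncols;
}
```

For example, a 50000 x 50000 region yields 2500000000 cells, which exceeds the range of a signed 32-bit integer.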
The issue isn't strictly related to LFS; due to compression, it's
possible for a raster with more than 2^31 cells to take up less than
2 GiB on disk. LFS just makes the issue more likely to arise in practice.
== References ==
* [http://www.suse.de/~aj/linux_lfs.html LFS support in Linux]
* [http://opengroup.org/platform/lfs.html Adding Large File Support to the Single UNIX® Specification]
* [http://www.sun.com/software/whitepapers/wp-largefiles/largefiles.pdf Large Files in Solaris: A White Paper]

[[Category: Development]]
[[Category: massive data analysis]]

Latest revision as of 05:22, 11 September 2016