Validating rasters

GPXZ stores over 10 million unique GeoTIFF rasters across our dataset and our sources. The number of actual files is much larger still, as those rasters are duplicated many times across our infrastructure.

When shepherding so many files, it's inevitable that some will be invalid. Sometimes a file arrives from the source with errors already present; other times files develop corruption due to software bugs or hardware crashes.

While the final GPXZ dataset goes through a robust QA process to ensure invalid data never reaches our customers, it’s still much easier to resolve data corruption that’s detected as soon as it occurs.

At GPXZ we maintain a validate-rasters script, which we run on any file that has been downloaded, written to, or copied. If you're interested in building a similar tool, some approaches are outlined below.

gdalinfo

The gdalinfo command gives information about a raster file.

gdalinfo S48E167.tif

# Driver: GTiff/GeoTIFF
# Files: S48E167.tif
# [there are lots more lines, which I've truncated here]

If you pass a file that isn't a geospatial raster, you get an error message

gdalinfo gpxz_logo.png

# ERROR 4: `gpxz_logo.png' not recognized as being in a supported file format.
# gdalinfo failed - unable to open 'gpxz_logo.png'.

and, more importantly, you get a non-zero exit status that will be interpreted as an error (e.g., a subprocess call from Python with check=True will raise an exception, and a command chain joined with && will stop short)

echo $?  # Display the most recent status code.

# 1
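
As a minimal sketch of the Python case (assuming gdalinfo is on the PATH), subprocess.run with check=True raises CalledProcessError on a non-zero exit

import subprocess

def check_raster(path: str) -> None:
    # Raises subprocess.CalledProcessError if gdalinfo exits non-zero.
    subprocess.run(
        ["gdalinfo", path],
        check=True,
        capture_output=True,  # We only care about the exit status, not the listing.
    )

check_raster("S48E167.tif")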

Because we don’t really care about the actual output for validation purposes, we can add a few flags that reduce verbosity

gdalinfo -nomd -norat -noct -nomask S48E167.tif

Used like this, gdalinfo will catch major problems like zero-length files. But because many formats (most notably GeoTIFF) encode all this metadata at the start of the file, simply running gdalinfo won't catch files that were corrupted by a crash partway through writing, which in our experience is one of the most common causes of corruption.

If we truncate a valid GeoTIFF and check it

head -c 100000 S48E167.tif > S48E167.truncated.tif
gdalinfo S48E167.truncated.tif

# Driver: GTiff/GeoTIFF
# Files: S48E167.truncated.tif
# ...

GDAL will happily rattle off the metadata from the header and exit with a successful return code!

gdalinfo -stats

To be assured that your data is good, there's no option but to read it all. Luckily, we can again [mis]use gdalinfo to do this by asking it to compute the summary statistics of our data with the -stats flag.

gdalinfo -stats S48E167.tif

# Driver: GTiff/GeoTIFF
# Files: S48E167.tif
# ...
#   Metadata:
#     STATISTICS_MINIMUM=-15
#     STATISTICS_MAXIMUM=746
#     STATISTICS_MEAN=17.086878887742
#     STATISTICS_STDDEV=67.97198232226
#     STATISTICS_VALID_PERCENT=100

At the end of our output we now have some summary statistics which require reading every pixel of the raster to compute. Repeating the command on our corrupted raster gives the error we’re expecting

gdalinfo -stats S48E167.truncated.tif

# ERROR 1: TIFFFillStrip:Read error at scanline 99; got 0 bytes, expected 678
# ERROR 1: TIFFReadEncodedStrip() failed.
# ERROR 1: S48E167.truncated.tif, band 1: IReadBlock failed at X offset 0, Y offset 100: TIFFReadEncodedStrip() failed.
# Driver: GTiff/GeoTIFF
# Files: S48E167.truncated.tif
# ...

echo $?

# 130

rasterio

The gdalinfo -stats command requires reading (and decompressing) the entire raster. If you're willing to go through all that effort anyway, there are even more categories of invalid rasters that can be detected! We'll need to switch over to a more expressive programming language, though.

Start by reading the data into memory. This step is enough to catch all the cases covered by gdalinfo -stats, though it will crash on larger-than-memory files (a windowed workaround is sketched after the snippet below).

import rasterio

with rasterio.open(raster_path) as f:
    a = f.read(masked=True)  # Read every band into a masked array.
    meta = f.meta
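
If larger-than-memory files are ever a concern, the full read can be swapped for a loop over the raster's internal blocks. A minimal sketch using rasterio's block_windows (an aside; the rest of this section doesn't rely on it)

import rasterio

def read_all_windows(raster_path):
    # Still decompresses every pixel (so truncation is caught),
    # but holds only one block in memory at a time.
    with rasterio.open(raster_path) as f:
        for _, window in f.block_windows(1):
            f.read(window=window, masked=True)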

Now we can add extra test cases.

Like checking if our GeoTIFF file is actually a GeoTIFF, not something else with a .tif filename extension

assert meta["driver"] == "GTiff"

and checking for common NODATA values that don’t represent actual data.

import numpy as np

# Replace masked NODATA pixels with NaN (assumes a float dtype).
a_filled = np.ma.filled(a, np.nan)
assert not np.any(a_filled == -32768)
assert not np.any(a_filled == -9999)

We can also add domain-specific checks that go beyond file integrity into data integrity.

For (earth-based) DEMs (in metres) we know the range of expected values

ELEVATION_MT_EVEREST = 8849  # Approximate bounds for elevations on Earth, in metres.
ELEVATION_MARIANA_TRENCH = -11_000
assert np.nanmax(a_filled) < ELEVATION_MT_EVEREST
assert ELEVATION_MARIANA_TRENCH < np.nanmin(a_filled)

and expect only a single band

assert meta["count"] == 1
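
Putting the pieces together, here's a minimal sketch of a validate-rasters-style check (validate_raster is a hypothetical name, and we're assuming single-band, float-typed DEM tiles; our production script does more)

import numpy as np
import rasterio

ELEVATION_MT_EVEREST = 8849  # Approximate bounds, in metres.
ELEVATION_MARIANA_TRENCH = -11_000

def validate_raster(raster_path):
    # Any problem surfaces as a rasterio error or an AssertionError.
    with rasterio.open(raster_path) as f:
        a = f.read(masked=True)  # Reads every pixel, so truncation is caught.
        meta = f.meta
    assert meta["driver"] == "GTiff"
    assert meta["count"] == 1
    a_filled = np.ma.filled(a, np.nan)
    assert not np.any(a_filled == -32768)  # Common NODATA sentinels.
    assert not np.any(a_filled == -9999)
    assert np.nanmax(a_filled) < ELEVATION_MT_EVEREST
    assert ELEVATION_MARIANA_TRENCH < np.nanmin(a_filled)

validate_raster("S48E167.tif")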

Elevation

GPXZ is an API for elevation data. Access the highest-quality DEMs globally with a free account.

We also do consulting work around DEMs and geospatial data management. If that sounds like something you could use, reach out at [email protected]!