John Payne edited this page Aug 11, 2017 · 13 revisions

Finding rain, snow and soil moisture data

John Payne, 18 July, 2017

These pages summarize what I learned recently while searching for publicly accessible long-term daily precipitation records. A great variety of weather data is freely available on the web, but it is not easy to figure out which product to use, or how to get the data in a useful format. With the caveat that I am not an expert on any of these sources, I hope that this may serve as a rough map of the landscape for other researchers who are just starting similar searches.

Precipitation data are available from two main sources: 1) local rain gauge records and 2) global satellite data products. Weather stations are scattered haphazardly across the globe and satellites swoop around the earth in obliquely-angled orbits that crisscross and encounter day/night transitions, clouds, oceans, and so on. Data providers like to tame this disorder by converting measurements into regular geographic grids. Such conversions always involve summarizing and interpolating the data (pixels in a satellite image won't line up perfectly with latitude/longitude lines, for example), but the gains in efficiency make up for the loss in accuracy. Sophisticated models underlie all satellite data sets. These models translate values read from pixels in satellite instruments into meaningful indices, correct for known biases, rectify spatial warping, transform coordinates, deal with errors, and so on. NASA and other providers go to great lengths to do quality control on their data. Unless you are in the business of providing data, you want a data product, not raw data.

Agencies like NASA often transform a single raw data set into a range of data products that have undergone varying degrees of quality control. Stringent quality control may add months or years to the lag time ("latency") before a product is publicly available. Some quality-control methods adjust the raw data significantly; for example, daily satellite-estimated rain totals may be adjusted so that the sum of daily values agrees with measured monthly ground-based rain gauge totals. Many new data products combine data from multiple satellites, and some gridded precipitation or soil moisture products employ both satellite and ground station data in models. All products have date gaps due to equipment failure or other problems, and the formats in which data are offered range from relatively straightforward (rasters), to intermediate difficulty (NetCDF and HDF files--see below), to truly awful (GHCN .dly files--see below). NASA data documentation is world-class, thankfully.
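To make the monthly adjustment described above concrete, here is a minimal Python sketch that rescales daily satellite estimates so that they sum to a gauge-based monthly total. The simple multiplicative scaling, the function name, and the numbers are all assumptions made for illustration; real products use more elaborate methods, so check each product's documentation.

```python
def scale_to_monthly_total(daily_estimates, gauge_monthly_total):
    """Rescale daily values so that they sum to the gauge-based monthly total."""
    satellite_total = sum(daily_estimates)
    if satellite_total == 0:
        # The satellite saw no rain, so there is no daily pattern to rescale.
        return list(daily_estimates)
    factor = gauge_monthly_total / satellite_total
    return [value * factor for value in daily_estimates]


# Made-up daily satellite estimates (mm) and a made-up gauge total (mm).
daily = [0.0, 2.0, 6.0, 0.0, 2.0]
adjusted = scale_to_monthly_total(daily, 15.0)
print(adjusted)
print(sum(adjusted))
```

Note that the relative day-to-day pattern is unchanged; only the magnitudes move, which is why an adjusted product can differ noticeably from the raw estimates.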

In general, I suggest the following order of operation:

  1. Explore product descriptions (see the non-exhaustive list below) and pick a few products that may match your needs, including temporal and spatial resolution and the product latency (latency can be a sticking point if you need recent data);
  2. Skim the product documentation to understand what data sources the product is based on, and what algorithms were used to create it. Don't worry too much yet about the nitty-gritty of satellite instruments and algorithms unless you really need that extra detail (you can assume that intelligent choices have been made), but do pay attention to things like how cloud cover is dealt with, to avoid making major mistakes in your analysis;
  3. Skim a few research papers to get an idea of what the product's drawbacks might be (it's tempting to skip this step, but it can save you a lot of time in the long run);
  4. Use browsers provided on the data websites to download a few samples and start exploring the data; and finally
  5. If you need a substantial amount of data, find software to automate the download and processing of the range of dates/places you need, and download your data. Most of the websites offer "data browsers," but they are usually clunky and it's faster to use other software to download and process data directly. I offer one solution in this repository.
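As a rough sketch of the automation in step 5, the Python snippet below builds one download URL per day for a date range. The URL pattern is hypothetical (every provider documents its own directory layout and file names), and the function names are made up for illustration.

```python
from datetime import date, timedelta

# Hypothetical URL pattern; replace it with the one your data provider documents.
URL_TEMPLATE = "https://example.org/data/{year}/{month:02d}/product_{year}{month:02d}{day:02d}.nc"


def daily_dates(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def build_urls(start, end, template=URL_TEMPLATE):
    """Return one URL per day in the requested date range."""
    return [
        template.format(year=d.year, month=d.month, day=d.day)
        for d in daily_dates(start, end)
    ]


urls = build_urls(date(2017, 2, 27), date(2017, 3, 2))
for url in urls:
    print(url)
```

The resulting list can be handed to whatever download tool you prefer, which makes it easy to retry only the dates that failed.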

A cautionary note on analysis of gridded data, from the Climate Data Guide: "Rain rates, and especially the distribution of rain rates at a particular location, depend strongly on spatial resolution of the dataset. So, for example, a measure of extreme rain like its 99th percentile is expected to be very different in a time series observed from a surface rain gauge and in a co-located pixel of a gridded precipitation dataset, even if the data have the same temporal resolution. The spatial gridding results in a smoother dataset. In order to compare two precipitation datasets on grids with different resolutions, one should typically regrid one or both of the datasets to a common grid using a regridding method that conserves the total amount of rain falling in an area. Often, the default interpolation in an analysis software package is bilinear interpolation (e.g., Matlab), which is not conservative." (See the Climate Data Guide for further recommendations).
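The point about conservative regridding can be illustrated with a small Python sketch that coarsens a grid by averaging square blocks of cells. It assumes all fine cells have equal area, which is only approximately true for latitude/longitude grids over small regions; the function names and rain values are made up.

```python
def coarsen_conservative(grid, factor):
    """Average factor x factor blocks of a 2-D grid of rain depths (mm)."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    coarse = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[i][j]
                     for i in range(r, r + factor)
                     for j in range(c, c + factor)]
            out_row.append(sum(block) / len(block))
        coarse.append(out_row)
    return coarse


def total_volume(grid, cell_area):
    """Total rain volume: depth summed over cells, times the area of one cell."""
    return sum(sum(row) for row in grid) * cell_area


fine = [[0, 0, 4, 8],
        [0, 0, 0, 4],
        [1, 1, 0, 0],
        [1, 1, 0, 0]]
coarse = coarsen_conservative(fine, 2)

print(coarse)
print(total_volume(fine, 1.0), total_volume(coarse, 4.0))
print(max(max(row) for row in fine), max(max(row) for row in coarse))
```

The total volume is the same on both grids, but the maximum value drops after coarsening, which is the smoothing of extremes that the quotation warns about.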

Data from weather stations on the ground

GHCN-D (https://www.ncdc.noaa.gov/ghcn-daily-description)

Grid Resolution: Not gridded
Temporal resolution: Daily
Domain: Global
Period of Record: Up to 175 years ago (varies by station) to the present
Latency: Updated daily and reprocessed monthly.

A global set of raw (i.e., not modeled) weather station data is available from the Global Historical Climatology Network Daily database, GHCN-D, which contains meteorological measurements from over 100,000 weather stations across the globe. In some senses, these data are the gold standard, and the data from some stations include a wide range of variables relating to rain, snow/ice, temperatures, wind, light levels, soil conditions, and so on. Unfortunately, rain gauges are likely to underestimate precipitation, and local data sets often suffer from inconsistent temporal data collection and patchy geographic coverage, as well as changes in practices and equipment that may introduce biases. On the positive side, the geographic coverage of this network is impressive and many stations have a very long history; for example, I found data from 41 weather stations in Mongolia, some dating back to the 1950s. GHCN "daily" files are in an atrociously difficult format that probably stems from the era of card readers and VAX mainframes; among its many quirks is that missing values may actually convey information. I couldn't get other people's functions to work for me (or didn't own the required software), so I wrote my own functions to parse the GHCN files in R; they are available at https://github.com/jcpayne/parse_GHCN_data.
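To give a feel for the format, here is a minimal Python sketch that parses one line of a .dly file. It is not the R code linked above. It assumes the fixed-width layout as I understand it from the GHCN-Daily readme: an 11-character station ID, a 4-digit year, a 2-digit month, a 4-character element code, and then 31 groups of a 5-character value followed by three 1-character flags, with -9999 marking a missing value. Check the readme before relying on it. The station ID in the sample is made up.

```python
MISSING = -9999


def parse_dly_line(line):
    """Parse one fixed-width record: one station, one month, one element."""
    line = line.rstrip("\n")
    record = {
        "station": line[0:11],
        "year": int(line[11:15]),
        "month": int(line[15:17]),
        "element": line[17:21],
        "days": [],
    }
    for day in range(31):
        start = 21 + day * 8  # each day occupies 8 characters
        value = int(line[start:start + 5])
        record["days"].append({
            "day": day + 1,
            "value": None if value == MISSING else value,
            "mflag": line[start + 5],  # measurement flag
            "qflag": line[start + 6],  # quality flag
            "sflag": line[start + 7],  # source flag
        })
    return record


# Made-up record: 25 (tenths of mm for PRCP) on day 1, all other days missing.
sample = "XX000000001" + "2017" + "07" + "PRCP" + "   25  S" + "-9999   " * 30
parsed = parse_dly_line(sample)
print(parsed["station"], parsed["year"], parsed["month"], parsed["element"])
print(parsed["days"][0])
```

Every record carries 31 day slots regardless of the month, so days that do not exist (such as 31 June) appear as missing values and have to be dropped when converting to real dates.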

GPCC (https://www.dwd.de/EN/ourservices/gpcc/gpcc.html) The Global Precipitation Climatology Centre (from the German Weather Service, Deutscher Wetterdienst) has one of the largest global databases of monthly rain gauge totals, from about 70,000 different stations in >170 countries. I do not know how much overlap there is between the GPCC stations and the GHCN stations. GPCC does not make the data from individual stations available. For daily data, they instead publish two gridded products, one near-real-time and the other lagging far behind but better quality-controlled:

  • GPCC First Guess Daily (ftp://ftp.dwd.de/pub/data/gpcc/html/gpcc_firstguess_daily_doi_download.html)
    Grid Resolution: 1 degree
    Temporal resolution: Daily
    Domain: Global
    Period of Record: 2009 to present
    Latency: 3-5 days
    The First Guess Daily data have minimal quality control (hence the name), and they represent spatially-averaged data from automatically-reporting weather stations. We found fairly good agreement between GPCC First Guess Daily data and local rain records in our study area.

  • GPCC (Global Precipitation Climatology Centre) Full Data Daily Version 1 (ftp://ftp.dwd.de/pub/data/gpcc/html/fulldata-daily_v1_doi_download.html)
    Grid Resolution: 1 degree
    Temporal resolution: Daily
    Domain: Global
    Period of Record: January 1988 to Dec 2013
    Latency: 3-4 years.
    Quality-controlled (includes kriging and comparison with other data).

Satellite-based precipitation estimates

Satellite instruments for estimating precipitation include infrared radiation (IR) sensors, which measure the temperatures of cloud tops, and microwave radiation sensors. Microwave instruments include active transceivers, which are similar to land-based weather radar, and passive microwave sensors. As NASA explains it, "Active radars transmit and receive signals reflected back to the radar. The signal returned to the radar receiver provides a measure of the size and number of rain/snow drops at multiple vertical layers in the cloud. Passive precipitation radiometers measure natural thermal radiation (called brightness temperatures) from the complete observational scene including snow, rain, clouds, and the Earth's surface." Microwave sensors tend to produce more accurate estimates of rainfall than IR sensors, but the microwave satellites are usually in lower orbits and therefore have much spottier coverage than IR satellites. A number of global, gridded data sets have been developed which integrate IR and microwave satellite methods, or integrate satellite methods with ground-based rain-gauge methods. They include:

PERSIANN (http://chrsdata.eng.uci.edu)

Grid Resolution: 0.25 degrees lat/lon
Temporal resolution: daily
Domain: Global for latitudes 60N to 60S
Period of Record: 1983 to Dec. 2016
Latency: 3 to 6 months

The Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks - Climate Data Record (PERSIANN-CDR) relies primarily on infrared data, but includes some microwave data. PERSIANN's most interesting feature is the use of neural networks to translate IR data to precipitation estimates. However, recent studies have found that PERSIANN estimates are not well correlated with land-based measurements (Hirpa et al. 2010; Gao and Liu 2013). In arid Inner Mongolia, PERSIANN significantly overestimates rainfall (Yang and Luo 2014).

CMORPH (http://www.cpc.ncep.noaa.gov/products/janowiak/cmorph_description.html)

Grid Resolution: 0.07277 degrees lat/lon (8 km at the equator)
Temporal resolution: 30 minutes (plus coarser products)
Domain: Global for latitudes 60N to 60S
Period of Record: December 3, 2002 to present
Latency: 18 hours

CMORPH models precipitation with estimates that are primarily derived from low-orbit satellite microwave observations. It fills in missing data using geostationary IR data on cloud position, and its most unusual feature is that it includes a model of the temporal cycle of cloud development, because a cloud's stage of development influences the amount of precipitation it produces. Like PERSIANN, CMORPH appears to be relatively inaccurate (Hirpa et al. 2010).
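Grid resolutions are quoted in degrees, so it helps to convert them to ground distance. The Python sketch below does this with a simple spherical approximation (an assumption; it ignores the earth's flattening), using the WGS84 equatorial radius. It shows why 0.07277 degrees corresponds to roughly 8 km at the equator, and how the east-west width of a cell shrinks with latitude. The function names are made up.

```python
import math

EARTH_RADIUS_KM = 6378.137  # WGS84 equatorial radius


def km_per_degree_longitude(latitude_deg):
    """East-west length of one degree of longitude at a given latitude."""
    return math.radians(1.0) * EARTH_RADIUS_KM * math.cos(math.radians(latitude_deg))


def cell_width_km(resolution_deg, latitude_deg=0.0):
    """East-west width of a grid cell of the given angular resolution."""
    return resolution_deg * km_per_degree_longitude(latitude_deg)


print(round(cell_width_km(0.07277, 0.0), 2))   # at the equator
print(round(cell_width_km(0.07277, 60.0), 2))  # at the poleward edge of the domain
```

The north-south height of a cell stays roughly constant, so cells on a regular latitude/longitude grid become narrower rectangles, not smaller squares, toward the poles.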

TRMM (https://pmm.nasa.gov/trmm)

Grid Resolution: 0.25 degrees
Temporal resolution: 3-hour (TMPA 3B42) and monthly (TMPA 3B43)
Domain: Global for latitudes 50N to 50S
Period of Record: 1997 to present (but see description below)
Latency:

The TRMM (Tropical Rainfall Measuring Mission) satellite collected both active and passive microwave data. TRMM data were included in several products, including the Multi-satellite Precipitation Analysis (TMPA), which uses IR data to fill gaps in microwave satellite data. TMPA 3B42 proved more accurate than PERSIANN or CMORPH for the Inner Mongolia area, but errors were still unacceptably high, with correlation between satellite and rain gauge measurements well below 0.5 (Yang and Luo 2014). Unfortunately, the TRMM satellite ended its mission on April 8, 2015 and shortly thereafter burned up in the atmosphere. The multi-satellite TMPA 3B42 data stream will continue until GPM comes online (see next item), but the loss of the TRMM satellite created calibration problems (see https://pmm.nasa.gov/sites/default/files/document_files/TMPA-to-IMERG_transition.pdf).
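If you want to run the same kind of check against your own rain gauge records, the correlation is easy to compute. The Python sketch below calculates a Pearson correlation coefficient between a gauge series and a co-located satellite series. The data are made up, and the cited studies may have used different statistics or aggregation periods.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up daily totals (mm) for the same dates and location.
gauge = [0.0, 5.0, 0.0, 12.0, 1.0, 0.0]
satellite = [1.0, 2.0, 0.0, 6.0, 4.0, 0.5]
print(round(pearson_r(gauge, satellite), 3))
```

Keep in mind the earlier caution about comparing a point measurement with a grid cell: a modest correlation partly reflects the difference in spatial scale, not only satellite error.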

GPM (https://pmm.nasa.gov/data-access/downloads/gpm)

Grid Resolution: 0.1 degree
Temporal resolution: 30 minutes to 1 month, depending on product
Domain: Global for latitudes 60N to 60S
Period of Record: March 2014 to present (but see description below)
Latency: 6 hours to 4 months, depending on product

A new set of products called Global Precipitation Measurement (GPM) will use an algorithm called IMERG to merge the strong points of PERSIANN, CMORPH and TRMM analyses, and then calibrate the results against monthly rain gauge totals. Unfortunately, the data are only available from March 2014 to the present, although backwards integration with older data sets is planned to continue into 2018.

GPCP (https://climatedataguide.ucar.edu/climate-data/gpcp-daily-global-precipitation-climatology-project)

Grid Resolution: 1 degree (daily data) and 2.5 degrees (monthly data)
Temporal resolution: Daily and monthly
Domain: Global
Period of Record: Oct 1 1996 to Oct 31 2015 for daily data, and Jan 1979 to present for monthly data (see description)
Latency: 3 months for monthly data. Daily?

Like GPM, the GPCP product integrates IR and microwave satellite data with ground rain gauge-based measurements to produce precipitation estimates. GPCP uses several different satellites including some in polar orbits, and for ground data it relies on GPCC. Its products have a long record of high reliability. However, "due to a hardware failure, GPCP SG V2.2 (monthly data) and 1DD V1.2 (daily, 1-degree data) ceased production in October 2015. The follow-on GPCP V2.3 and 1DD V1.3 will soon become available, reprocessed back to January 1979 and October 1996, respectively" (https://climatedataguide.ucar.edu/climate-data/gpcp-daily-global-precipitation-climatology-project). According to https://precip.gsfc.nasa.gov, 1DD V1.3 is still in beta testing (as of April 27, 2017).

Satellite-based soil moisture estimates

Estimates of soil moisture might prove more useful than precipitation estimates for some research questions involving plant growth, grazing, etc. Although the spatial resolution of satellite-based soil moisture data tends to be coarse, the products are targeted directly at information on root zone moisture, rather than at precipitation, which may be less tightly correlated with plant growth. Data sources include:

SMAP (https://nsidc.org/data/smap/smap-data.html)

Grid Resolution: From 1km to 49km, depending on product
Temporal resolution: From 49 minutes to 1 day, depending on product
Domain: Generally, global for latitudes 85N to 85S
Period of Record: March 2015 to the present
Latency: 7 days

This new soil moisture measuring mission uses a combination of active and passive microwave sensors. Unfortunately, the satellite suffered some problems with one of its instruments and data are only available from 2015.

ESA CCI Soil Moisture (ftp://anon-ftp.ceda.ac.uk/neodc/esacci/soil_moisture/)

Grid Resolution: 0.25 degrees
Temporal resolution: Monthly
Domain: Global
Period of Record: 1978 or 1991 to Dec 2014 (see below)
Latency: 7 days

CCI soil moisture combines active and passive microwave data from a number of satellites. Date ranges:

  • Active Product: 1991-08-05 to 2014-12-31 (8550 days = 8550 files = 3.93 GB)
  • Passive Product: 1978-11-01 to 2014-12-31 (13210 days = 13210 files = 5.41 GB)
  • Combined Product: 1978-11-01 to 2014-12-31 (13210 days = 13210 files = 6.19 GB)

GRACE (https://grace.jpl.nasa.gov/data/get-data/)

Grid Resolution: 0.5 to 1 degree
Temporal resolution: Monthly (although I think raw data may be collected every 3 hours?)
Domain: Global
Period of Record: From 2004 to present
Latency: ?

NASA's GRACE mission measures changes in the earth's gravity field. As a by-product, large-scale changes in soil water content can be deduced. GRACE data are primarily useful for large-scale, long-term trends (to see if a landscape has been slowly becoming more arid, for example), and the analysis seems to be fairly complex (see https://grace.jpl.nasa.gov/data/get-data/monthly-mass-grids-land/ for an example).

For my uses, rain gauge data from GHCN, spatially-averaged rain gauge data from GPCC, and MODIS daily snow cover data are currently the most useful, although in a year or so there should be better products as GPCP and GPM data become available.

Satellite-based snow cover estimates

Precipitation data from weather stations may include snow, but not all stations have proper equipment to record the amount of water in snow accurately and rain gauges are likely to misreport snowfall (for example, snow or ice builds up on the gauge and then melts when the weather warms, first underestimating and then overestimating precipitation). In addition, snow can be seen clearly on satellite images; therefore, snow cover is also a valuable satellite product.

MODIS/Terra Snow Cover Daily L3 Global 500m Grid, Version 6 (MOD10A1) (http://nsidc.org/data/mod10a1)

Grid Resolution: 500m
Temporal resolution: Daily (an 8-day product also available)
Domain: Global, but only tiles over land are produced. Tiles are 1200 km x 1200 km.
Period of Record: 24 February 2000 to present
Latency: 2 days

This product reports snow cover per pixel using an index called "NDSI" that ranges from 0 (no snow) to 100 (100% snow). Note that the previous version (Version 5) reported a binary snow/no-snow value for each pixel, but essentially used the same underlying algorithm. It has special flags for cloud cover, night, lake ice, and other conditions.
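Because flag codes share the same data layer as the snow values, they have to be separated before any arithmetic is done. The Python sketch below treats values from 0 to 100 as valid NDSI snow cover and everything else as a flag. That split is an assumption based on the description above; the user guide lists the exact flag codes and their meanings. The pixel values are made up.

```python
def summarize_ndsi(pixels):
    """Separate valid NDSI snow cover values (0-100) from flagged pixels."""
    snow_values = [p for p in pixels if 0 <= p <= 100]
    flagged = len(pixels) - len(snow_values)
    mean_ndsi = sum(snow_values) / len(snow_values) if snow_values else None
    return {"valid": len(snow_values), "flagged": flagged, "mean_ndsi": mean_ndsi}


# Made-up pixel values; the numbers above 100 stand in for flag codes.
pixels = [0, 40, 80, 250, 250, 211]
summary = summarize_ndsi(pixels)
print(summary)
```

Averaging the raw layer without this step would mix flag codes into the result and badly inflate the apparent snow cover.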

MODIS/Terra Snow Cover Eight-Day L3 Global 500m Grid, Version 6 (MOD10A2) (http://nsidc.org/data/mod10a2)

This product reports a binary value (snow/no-snow) for each pixel averaged over an 8-day period, which is "the period of near repeat ground track of the MODIS satellites." The eight-day product is designed to minimize the effects of cloudiness, and to give a more spatially consistent snow extent map than the daily product. It does not count NDSI values from 1-10 in the estimate of maximum snow cover, and only marks a pixel as "cloud" if all 8 days were cloud-covered; otherwise it chooses whatever observation was most common. "For example, if there were five snow-free land, and three cloud observations, the cell will be reported as snow-free land." (MODIS Snow Products Collection 6 User Guide). The product also includes an 8-bit binary integer which gives the snow/no-snow value for each of the 8 days (I once wrote a grisly script to extract and sum it along a track in a stack of images). The user's guide notes: "Typically the accuracy is similar to the MOD10A1 product, but may be lower because compositing of the daily snow commission errors over eight days can increase the percentage of error, spatially and temporally, despite the filter applied to reduce errors."
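Unpacking the per-day snow flags from the 8-bit integer is a small bit-manipulation exercise. The Python sketch below assumes that the least significant bit corresponds to the first day of the 8-day period; that bit order is an assumption, so confirm it in the user guide. It is not the script mentioned above.

```python
def decode_eight_day_bits(value):
    """Return a list of 8 snow flags (1 = snow, 0 = no snow), first day first."""
    if not 0 <= value <= 255:
        raise ValueError("expected an 8-bit integer between 0 and 255")
    return [(value >> bit) & 1 for bit in range(8)]


def snow_day_count(value):
    """Number of days in the 8-day period on which snow was observed."""
    return sum(decode_eight_day_bits(value))


print(decode_eight_day_bits(0b00000101))
print(snow_day_count(0b00000101))
```

Summing these counts through a stack of consecutive 8-day images gives an estimate of the number of snow days at a pixel over a season.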

VIIRS/NPP Snow Cover 6-Min L2 Swath 375m, Version 1 (https://nsidc.org/data/VNP10/)

Grid Resolution: 375m
Temporal resolution: Daily, or more frequently where swaths overlap.
Domain: Global
Period of Record: Currently available for limited dates in 2016 and 2017, but the archive will soon have a period of record dating back to January 1, 2012 and will continue forward in near-real time.
Latency: Near-real time

This is a new data source. Snow cover is estimated from radiance data acquired by the Visible Infrared Imaging Radiometer Suite (VIIRS) on board the Suomi National Polar-orbiting Partnership (NPP) satellite, which is in a sun-synchronous, near-circular polar orbit. The product is designed to be compatible with MODIS Terra and Aqua Version 6 snow cover data sets, and like MODIS, it reports snow cover per pixel using the "NDSI" index, which ranges from 0 (no snow) to 100 (100% snow).

Notes on file formats

If you are able to download data in a raster format like .tiff, .geotiff, or something similar, you're in luck, because most GIS and statistics programs can read those files without further fuss. Some data providers let you submit special requests for raster formats (the National Snow and Ice Data Center does so). Unfortunately, a lot of data are only available in more obscure formats.

Meet the NetCDF and HDF Formats

NetCDF (Network Common Data Form) is a data storage convention that is popular with climate and forecasting scientists. It is part of a much larger effort to make data shareable and understandable (see here for grisly details). The NetCDF file format (.nc) is flexible. A typical, simple NetCDF file might hold layers of gridded data, where each layer represents measurements of a variable such as temperature over a geographic grid, and each layer is a snapshot at a different time of the same grid. A NetCDF file is a bit like a mini-database, but the NetCDF format is designed to be more compact and lightweight than a database, to be portable across computer systems that may represent numbers differently, and to allow very fast random access to subsets of hierarchical data sets.

In addition to gridded data, NetCDF files contain enough metadata to be completely "self-describing," and there is a lot of specialized software that can read NetCDF files. However, in practice, self-description sometimes fails because of the format's extreme flexibility. A NetCDF file can be created with almost any structure that one could imagine. For example, some files I was analyzing had been created with latitude and longitude in reverse of the expected order, and the R package I was using failed to detect that, so all my data were sideways, so to speak (see "When the going gets hairy," below, for how to fix that).

The HDF (Hierarchical Data Format) family of formats includes HDF, HDF4, and HDF5, which are all different and not interchangeable. I'll refer to the family as "HDF*" (see https://www.cise.ufl.edu/~rms/HDF-NetCDF%20Report.pdf for a good overview). The most recent of these formats is HDF5. Like NetCDF, an HDF* file contains both data and descriptive metadata, and HDF* formats are particularly useful for storing complex multidimensional, hierarchical data; in addition, they have advanced compression and storage options. I have read that all this complexity can result in somewhat slower data access than NetCDF.

US government agencies use both HDF5 and HDF4 for satellite data. Like NetCDF, HDF* files require special software to open them. One subtle problem is that some software opens HDF4 files but not HDF5 files, and vice versa. It is sometimes surprisingly difficult to figure out whether your data are in an HDF4 or HDF5 format; for example, some of the MODIS data websites don't mention (at least not prominently) that they use the older HDF4 format.
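When the documentation is silent, one way to tell the formats apart is to look at the first few bytes of the file. The Python sketch below checks for the signatures that, to my knowledge, mark HDF5, HDF4, and classic NetCDF files. It only looks at the very start of the file, so an HDF5 file with a user block in front of its signature would be reported as unknown. Note also that NetCDF-4 files are stored as HDF5, so they show the HDF5 signature. The function names are made up.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"
HDF4_SIGNATURE = b"\x0e\x03\x13\x01"
NETCDF_CLASSIC_PREFIX = b"CDF"


def sniff_format(header):
    """Guess the file format from the first bytes of a file."""
    if header.startswith(HDF5_SIGNATURE):
        return "HDF5 (possibly NetCDF-4)"
    if header.startswith(HDF4_SIGNATURE):
        return "HDF4"
    if header.startswith(NETCDF_CLASSIC_PREFIX):
        return "NetCDF classic"
    return "unknown"


def sniff_file(path):
    """Read the first 8 bytes of a file and guess its format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(8))


print(sniff_format(b"\x0e\x03\x13\x01\x00\x00\x00\x00"))
```

Calling sniff_file on a downloaded file before choosing a library saves a round of confusing "cannot open file" errors.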

Reading and Manipulating NetCDF and HDF files

There are a variety of standalone programs for inspecting NetCDF and HDF files. I found them all to be of very limited usefulness--most are small utility programs with poor UIs and few options. The best of the ones I tried were:

I use R (an open-source statistics program) to organize a lot of my work, and there are various R libraries that can manipulate NetCDF and HDF* files. These are somewhat more powerful than the standalone programs. The libraries include:

  • ncdf4 : a reasonably good library for netcdf files (including creating them from scratch)
  • rhdf5 : reportedly, a good library for hdf5 files.
  • gdalUtils : a library with wrappers for GDAL functions. GDAL is the ultimate Swiss-army knife that can handle anything. It tends to be very slow.
  • Other libraries may open netcdf files if you're lucky.

There were some MODIS-specific R libraries that I ended up not using for one reason or another, but they might be worth checking out:

  • MODIS
  • MODISTools
  • pyMODIS (a collection of Python scripts)

I ended up writing my own R script to use NASA's MODIS Reprojection Tool (a command-line program which must be installed separately), to download, re-project, stitch, and crop images for one date at a time, which saved me from having to download terabytes of data before processing it. The NASA tool turned out to be waaaaaay faster than any of the other options. My script is available in this repository.

When the going gets hairy

NetCDF Operators (NCO)
If you need to modify or create NetCDF or HDF* files, or do something complicated like taking longitudinal slices of a file ("hyperslabs"), you can mess around with user-friendly browsers, R packages and so on, but when that fails or slows to a crawl and you start tearing your hair out, there is a collection of very powerful and fast command-line-only functions that the professionals use. It's called NCO (NetCDF Operators), and is maintained by Charlie Zender, one of those wizards who make it possible for the rest of us to muddle through our work. It's available on GitHub (https://github.com/nco/nco). The documentation (http://nco.sourceforge.net/nco.html) is very dense, but NCO is not difficult to use once you have installed it and figured out which function to use. Once you've banged your head on other non-solutions, you'll be grateful for NCO's power, even if it takes you a few tries to get the result you want.

There are useful examples and hints here:
http://nelson.wisc.edu/ccr/resources/nco/ncea.php

and here:
http://research.jisao.washington.edu/data_sets/nco/

Switching the lat/lon order in my NetCDF file only took a single, short line of code:

ncpdq -a lat,lon original_file.nc new_file.nc

Ain't that sweet!
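For intuition about what reordering dimensions does to the stored values, the Python sketch below transposes a tiny 2-D grid. It is only an illustration with made-up numbers: it works on a plain list of lists and does not touch a NetCDF file or its metadata, which is the part that NCO handles for you.

```python
def swap_dimensions(grid):
    """Return a copy of a 2-D grid with its two dimensions swapped (a transpose)."""
    return [list(column) for column in zip(*grid)]


# Made-up grid stored as [lon][lat]: 3 longitudes by 2 latitudes.
lon_by_lat = [[1, 2],
              [3, 4],
              [5, 6]]
lat_by_lon = swap_dimensions(lon_by_lat)
print(lat_by_lon)
```

Software that assumes one order but receives the other reads the same numbers into the wrong cells, which is how a map ends up sideways.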

Climate Data Operators (CDO)
There is another big set of command-line operators used by the pros, called CDO, that is available from the Max Planck Institute here: https://code.mpimet.mpg.de/projects/cdo/files. I installed it with MacPorts and found it useful as well. For example, I needed to split a pile of NetCDF files that each contained a month of data into individual files containing one day of data, because a program I was using to load the NetCDF files into a PostgreSQL database couldn't handle multiple dates. NCO didn't seem to have any simple way to do that--the hard part is naming the daily files--but CDO had a one-line command that was built for exactly that and solved my problem in an instant:

cdo splitday $inputfile $outfile
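To apply that command to a whole pile of monthly files, the calls can be generated in a loop. The Python sketch below only builds the command lines; it does not run them. It assumes that the second argument to splitday is an output name prefix to which CDO appends the day number and file suffix (my understanding of the CDO documentation). The file names, the "_day" prefix convention, and the function name are made up.

```python
from pathlib import Path


def build_splitday_commands(monthly_files, output_dir):
    """Build one 'cdo splitday' command (as an argument list) per monthly file."""
    commands = []
    for name in monthly_files:
        source = Path(name)
        # Output prefix, e.g. daily/precip_201707_day
        prefix = Path(output_dir) / (source.stem + "_day")
        commands.append(["cdo", "splitday", source.as_posix(), prefix.as_posix()])
    return commands


commands = build_splitday_commands(["precip_201707.nc", "precip_201708.nc"], "daily")
for command in commands:
    print(" ".join(command))
    # To execute, pass each list to subprocess.run(command, check=True);
    # this requires CDO to be installed and on the PATH.
```

Keeping the commands as argument lists avoids shell quoting problems with unusual file names.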

References

Gao, Y., and M. Liu. 2013. Evaluation of high-resolution satellite precipitation products using rain gauge observations over the Tibetan Plateau. Hydrology and Earth System Sciences 17:837.

Hirpa, F. A., M. Gebremichael, and T. Hopson. 2010. Evaluation of high-resolution satellite precipitation products over very complex terrain in Ethiopia. Journal of Applied Meteorology and Climatology 49:1044-1051.

Yang, Y., and Y. Luo. 2014. Evaluating the performance of remote sensing precipitation products CMORPH, PERSIANN, and TMPA, in the arid region of northwest China. Theoretical and Applied Climatology 118:429-445.