This page describes a flexible method for indexing datasets. Central to this theme is the construction of axes which may or may not be described. If an axis is described, the description may or may not be understood by the client software. Default behavior (for not-described or not-understood axes) is to index them with a dimensionless scalar.
There is also an optimized discrete index for gridded data. Transforms between discrete and continuous indices shall be provided.
A mechanism for slicing (subsetting) the data shall be developed.
We want Geotools to provide meaning to the axes it understands, and index those it does not understand.
Indexing data 101
Accessing data in a gridded dataset begins at the most basic level: indices. Indices are the most efficient means of traversing a grid when it is not necessary to interpret them. They are also the most efficient way to iterate over the grid, as the extent is well defined and the interval between adjacent values along any axis is 1.
The next level of data access involves the interpretation of values along the axes. If the axes are appropriately described, then meaning can be applied to a coordinate, which has values along each axis. An axis can be continuous (e.g., distance from some reference axis value) or discrete. In the parlance of the recent discussions, a "sample dimension" is an example of a discrete axis. If, for example, the sample axis has the set of ordered quantities ("wind speed", "temperature", "pressure"), the axis index may only take on values of "wind speed", "temperature", or "pressure".
An "index space" may be constructed from an ordered collection of axes regardless of whether or not all the axes are described. Further, an index space may still be constructed if the description of a particular axis (or set of axes) is not understood. For the case where the axis description is not known, it may be indexed by a dimensionless scalar. When the description is known, the axis may have a dimension (length) and units (meters) associated with it.
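The idea of an index space that tolerates undescribed axes can be sketched as follows. This is only an illustration: the `Axis` and `IndexSpace` classes are hypothetical names, not part of Geotools or GeoAPI, and a `null` unit is an assumed way of marking an axis as not described or not understood.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical axis: a name plus an optional unit. */
final class Axis {
    final String name;
    final String unit; // null when the axis is not described or not understood

    Axis(String name, String unit) {
        this.name = name;
        this.unit = unit;
    }

    boolean isDescribed() {
        return unit != null;
    }
}

/** Hypothetical index space: an ordered collection of axes. */
final class IndexSpace {
    private final List<Axis> axes;

    IndexSpace(Axis... axes) {
        this.axes = Arrays.asList(axes);
    }

    int dimension() {
        return axes.size();
    }

    /** Formats a coordinate, attaching units only where the axis is described. */
    String describe(double... coordinate) {
        if (coordinate.length != axes.size()) {
            throw new IllegalArgumentException("Expected " + axes.size() + " ordinates");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < coordinate.length; i++) {
            Axis axis = axes.get(i);
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(axis.name).append('=').append(coordinate[i]);
            if (axis.isDescribed()) {
                sb.append(' ').append(axis.unit);
            }
        }
        return sb.toString();
    }
}
```

An index space built from `new Axis("x", "m")` and `new Axis("axis2", null)` still accepts two-ordinate coordinates. The second ordinate is simply treated as a dimensionless scalar.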
Topic 2 (ISO 19111), in the Coordinate System package, expresses a special case of this general concept. ISO 19111 specifies completely defined and completely understood coordinate systems. There is little room for axes which are not defined or not understood. But we still have some room.
ISO 19111 defines two kinds of unknown coordinate systems:
UserDefinedCS. They may have been intended for spatiotemporal axes only, but nothing actually prevents us from using them with other kinds of axes. Because conversions from grid coordinates to CRS coordinates are performed by a MathTransform, and since math transforms can be set up independently of any CRS, a grid coverage can already work without full understanding of all axes. But trying to use ISO 19111 for non-spatiotemporal axes may be an abuse.
While the interfaces pre-defined by ISO 19111 may not be appropriate, I don't see any technical objection to creating our own axes to be used in an ISO 19111 framework. For example, nothing prevents us from creating a SpectralCS for referencing a wavelength in a spectrum. The referencing framework is extensible. The only downside is that such a SpectralCS would be non-standard.
The challenge is to describe axes in a general way such that spatiotemporal axes are easily detected and the responsibility for eliciting meaning from those axes is delegated to a GeoAPI-compatible implementation like Geotools, without delegating to it the responsibility for understanding all of the axes.
Current generic approaches
This section summarizes current self-describing data formats and packages which provide some semblance of a facility for doing this. Note, however, that each provides only a framework. The actual data access APIs do not attempt to find meaning in the axes.
The Multiarray2 package is released with the Java NetCDF toolkit. It appears very much to be an implementation of JSR 83. (JSR 83 appears to be a dead spec which fell through the cracks. It apparently finished its review ballot in October 2000, and nothing has happened since.) Multiarray2 provides a lean, efficient method for loading truly n-dimensional gridded datasets into memory and then accessing the data elements. It has a very simple implementation of an Index into the grid: the index is just an array of integers.
The main utility of this package is that the data is loaded into a one-dimensional array of the appropriate primitive type. When data elements are accessed, the appropriate offset into the array is calculated given the index.
The multiarray2 package concerns itself with data management. It provides an interface which permits the user to access data with an n-D integer index, encapsulating the 1D internal data representation. The user is responsible for understanding the axes and adding meaning where necessary.
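The offset calculation described above can be sketched as follows. This is not the Multiarray2 code: `FlatGrid` is a hypothetical class, and row-major ordering (last index varying fastest) is an assumption made for the illustration.

```java
/** Hypothetical n-D view over a flat double[] in row-major order (last index varies fastest). */
final class FlatGrid {
    private final int[] shape;
    private final double[] data;

    FlatGrid(int... shape) {
        this.shape = shape.clone();
        int size = 1;
        for (int n : shape) {
            size *= n;
        }
        this.data = new double[size];
    }

    /** Computes the offset into the 1D array for an n-D integer index. */
    int offset(int... index) {
        if (index.length != shape.length) {
            throw new IllegalArgumentException("rank mismatch");
        }
        int offset = 0;
        for (int d = 0; d < shape.length; d++) {
            if (index[d] < 0 || index[d] >= shape[d]) {
                throw new IndexOutOfBoundsException("index out of range on axis " + d);
            }
            // Horner-style accumulation: each step shifts by the extent of the current axis.
            offset = offset * shape[d] + index[d];
        }
        return offset;
    }

    double get(int... index) {
        return data[offset(index)];
    }

    void set(double value, int... index) {
        data[offset(index)] = value;
    }
}
```

For a grid of shape (2, 3, 4), the index (1, 2, 3) maps to offset ((1 * 3) + 2) * 4 + 3 = 23, the last element of the 24-element array.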
The NetCDF file format is a self-describing means of transporting binary data in the standard primitive types. At its simplest, a NetCDF file contains a few distinct components:
- Attributes (which can be attached to either the coordinates or variables).
(NOTE: UCAR is in the process of making NetCDF more complex.)
The NetCDF format requires that dimensions be defined before being used. Each dimension definition specifies the number of samples it contains. All variables are defined in terms of an ordered collection of dimensions. In this manner, the "shape" of the array is determined by the dimensions.
The Java NetCDF package (v2.2) attempts to gain information from the content of the NetCDF file. It examines the collection of dimension names, file attributes and variable attributes to determine whether the file conforms to a "convention". If so, the NetCDF package will designate that certain dimensions have certain meanings, like "geographic latitude" and "geographic longitude". If a variable is defined with a combination of dimensions which have known spatial meanings, that variable is designated a "GeoGrid". A GeoGrid can have a mixture of meaningful dimensions and dimensions which have no known meaning. This is very nearly what we want to accomplish with the MultiDimensional Georegistered Grid inside Geotools.
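The behavior described above, where some dimensions receive a meaning and others are left unknown, can be sketched as follows. This is not the Java NetCDF API: `DimensionClassifier` is a hypothetical class, and the dimension names "lat" and "lon" stand in for whatever a real convention would specify.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical classifier that assigns meanings to dimension names it recognizes. */
final class DimensionClassifier {
    static final String LATITUDE = "geographic latitude";
    static final String LONGITUDE = "geographic longitude";

    private final Map<String, String> knownMeanings = new HashMap<String, String>();

    DimensionClassifier() {
        // Assumed naming convention, for illustration only.
        knownMeanings.put("lat", LATITUDE);
        knownMeanings.put("lon", LONGITUDE);
    }

    /** Returns the meaning of a dimension name, or null when it is not understood. */
    String meaningOf(String dimensionName) {
        return knownMeanings.get(dimensionName);
    }

    /** A variable counts as a geo-grid when its dimensions include both latitude and longitude. */
    boolean isGeoGrid(List<String> dimensionNames) {
        boolean hasLatitude = false;
        boolean hasLongitude = false;
        for (String name : dimensionNames) {
            String meaning = meaningOf(name);
            if (LATITUDE.equals(meaning)) {
                hasLatitude = true;
            } else if (LONGITUDE.equals(meaning)) {
                hasLongitude = true;
            }
        }
        return hasLatitude && hasLongitude;
    }
}
```

A variable defined over ("time", "wavelength", "lat", "lon") would be classified as a geo-grid even though "time" and "wavelength" have no meaning to this classifier.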
The IO plugins should be capable of building up a set of axes from the contents of the data source even if Geotools is not expected to understand a few of them.
The dumber you are, the faster you are. Applications which merely iterate through the grid to perform some operation will always be the fastest. Applications which take the time to iterate over some abstract space not directly represented by reality will incur some additional overhead. It could very well be that an application is interested in applying meaning to the axes only to identify an interesting subset of data.
Theorem 1
It is not the place of a toolkit to impose an unwanted level of abstraction on an application.
Theorem 2
A toolkit must provide a service or otherwise add value.
Corollary to Theorem 1 and Theorem 2
A toolkit must allow the application to decide what level of abstraction to impose on the problem. If possible, the application writer should be allowed to access the same data using different levels of abstraction.
Therefore, we must...
The API we develop should allow users access to the geospatial smarts inside Geotools without forcing the user to access data via geographic location.
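A minimal sketch of such a layered API is shown below, assuming regularly spaced axes. `RegularAxis` and `LayeredGrid` are hypothetical names, not existing Geotools classes. The same data can be read by raw index (fast, no interpretation) or by coordinate (converted to the nearest index first).

```java
/** Hypothetical regularly spaced axis: coordinate = origin + index * step. */
final class RegularAxis {
    final double origin; // coordinate value at index 0
    final double step;   // coordinate increment between adjacent indices

    RegularAxis(double origin, double step) {
        if (step == 0) {
            throw new IllegalArgumentException("step must be non-zero");
        }
        this.origin = origin;
        this.step = step;
    }

    /** Discrete index to continuous coordinate. */
    double toCoordinate(int index) {
        return origin + index * step;
    }

    /** Continuous coordinate to the nearest discrete index. */
    int toIndex(double coordinate) {
        return (int) Math.round((coordinate - origin) / step);
    }
}

/** Hypothetical 2D grid offering both index-based and coordinate-based access. */
final class LayeredGrid {
    private final RegularAxis xAxis;
    private final RegularAxis yAxis;
    private final int width;
    private final int height;
    private final double[] data;

    LayeredGrid(RegularAxis xAxis, RegularAxis yAxis, int width, int height) {
        this.xAxis = xAxis;
        this.yAxis = yAxis;
        this.width = width;
        this.height = height;
        this.data = new double[width * height];
    }

    private int offset(int i, int j) {
        if (i < 0 || i >= width || j < 0 || j >= height) {
            throw new IndexOutOfBoundsException("(" + i + ", " + j + ")");
        }
        return j * width + i;
    }

    /** Lowest level of abstraction: raw grid indices. */
    double getByIndex(int i, int j) {
        return data[offset(i, j)];
    }

    void setByIndex(int i, int j, double value) {
        data[offset(i, j)] = value;
    }

    /** Higher level of abstraction: coordinates are converted to indices first. */
    double getByCoordinate(double x, double y) {
        return getByIndex(xAxis.toIndex(x), yAxis.toIndex(y));
    }
}
```

An application that only iterates over the grid never pays for the coordinate conversion, while one that needs geographic lookup can use `getByCoordinate` on the same object.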
Path to expansion
Two obvious alternatives present themselves immediately.
- Define our own axes description framework completely outside of the ISO 19111 framework, but using ISO 19111 for geospatial axes description...
- See if we can extend ISO 19111 in such a way that it can tolerate non-spatiotemporal axes. Can a coordinate system be defined to ignore axes it doesn't understand?
- Note from Martin: Yes, ISO 19111 is very easy to extend (for example with a SpectralCS). However, we have to decide whether it is the right thing to do from a conceptual point of view.
What issues arise from this? How do "filler" dimensions impact the concept of using a MathTransform to perform a coordinate conversion or transform? Is the solution straightforward?
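One way to explore these questions is a "pass-through" arrangement, in which a transform acts on the ordinates it understands and copies the filler ordinates unchanged. The sketch below is only an assumption about how this could work, not an answer from the referencing framework: `PassThroughSketch` is a hypothetical class, and a simple per-axis scale and offset stands in for a real two-dimensional MathTransform.

```java
/** Hypothetical pass-through: transforms two adjacent ordinates and leaves the rest untouched. */
final class PassThroughSketch {
    /**
     * Applies x' = scaleX * x + offsetX and y' = scaleY * y + offsetY to the ordinates
     * at positions first and first + 1. All other ordinates are copied unchanged.
     */
    static double[] transform(double[] source, int first,
                              double scaleX, double offsetX,
                              double scaleY, double offsetY) {
        if (first < 0 || first + 1 >= source.length) {
            throw new IllegalArgumentException("transformed ordinates fall outside the coordinate");
        }
        double[] target = source.clone(); // filler ordinates pass through as-is
        target[first] = scaleX * source[first] + offsetX;
        target[first + 1] = scaleY * source[first + 1] + offsetY;
        return target;
    }
}
```

With a four-ordinate coordinate such as (band, column, row, time), only the column and row would be converted, and the band and time ordinates would remain plain indices.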
Gridded data natively supports a very concise notion of rectangular data subsetting. Our subsetting method should permit access to this notion as well as allow the user to subset via our "value added" geospatial meaning.
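The two subsetting paths can be sketched as follows, assuming a regularly spaced axis with a positive step and row-major two-dimensional data. `IndexRange` and `Subsetter` are hypothetical names. A coordinate interval is first reduced to an index range, and the rectangular subset is then taken purely in index space.

```java
/** Hypothetical inclusive range of grid indices along one axis. */
final class IndexRange {
    final int min; // inclusive
    final int max; // inclusive

    IndexRange(int min, int max) {
        if (min > max) {
            throw new IllegalArgumentException("empty range");
        }
        this.min = min;
        this.max = max;
    }

    int length() {
        return max - min + 1;
    }
}

final class Subsetter {
    /**
     * Converts the coordinate interval [lo, hi] into the range of indices whose
     * coordinates fall inside it. Assumes coordinate = origin + index * step, step > 0.
     */
    static IndexRange fromCoordinates(double origin, double step, int size, double lo, double hi) {
        int min = Math.max((int) Math.ceil((lo - origin) / step), 0);
        int max = Math.min((int) Math.floor((hi - origin) / step), size - 1);
        return new IndexRange(min, max);
    }

    /** Copies the rectangle cols x rows out of row-major 2D data of the given width. */
    static double[] subset(double[] data, int width, IndexRange cols, IndexRange rows) {
        double[] out = new double[rows.length() * cols.length()];
        int k = 0;
        for (int j = rows.min; j <= rows.max; j++) {
            for (int i = cols.min; i <= cols.max; i++) {
                out[k++] = data[j * width + i];
            }
        }
        return out;
    }
}
```

A caller who already knows the indices constructs `IndexRange` directly. A caller working in geographic terms goes through `fromCoordinates` and then uses the same `subset` method.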