
Introduction

What follows is the result of my investigation of JSR-73 as a potential Operations API. I needed to implement buffering really quick and wanted to see what JSR-73 was all about since it was mentioned in one of the other attachments. Well, I went ahead and implemented a buffering operation anyway, because implementing an Operations API, then implementing Buffering would have taken too long. Let me know if any of the following makes sense. I am far from an expert on the Datamining API.

Summary

A preliminary survey of the JSR-73 datamining API reveals that the API is largely concerned with the approximation of datasets by models of the datasets. JSR-73 defines a framework for defining algorithms, functions, tasks, data, models of data, and executing algorithms on datasets. The API contains three hierarchical levels of "operation" groupings, from most abstract to least abstract:

  1. MiningFunctions, which are the names of a grouping of similar algorithms (e.g., the group named "clustering" contains the k-Means algorithm, and the group named "classification" contains the naiveBayes algorithm)
  2. MiningAlgorithms, which are the names attached to standard, well defined methods of accomplishing an operation. These are commonly defined as a collection of interfaces which define the ability to query whether a specific capability is supported.
  3. Actual implementations of algorithms, which are not included in the JSR, and which perform the actual work.

There are three central registries of defined MiningFunctions, MiningAlgorithms, and MiningTasks. These registries are lists of names only. They do not reference interfaces associated with specific functions or algorithms, nor do they group algorithms by similar functions. The registries merely accomplish a "typing" of the names they contain into the categories: Function, Algorithm, or Task. The hierarchical relationships between these items are maintained by a DataMiningEngine (DME), explained later.
There are six MiningTasks defined by JSR-73:

  1. applyTask : apply a previously constructed data model to an input dataset (inverse of buildTask). Has two child tasks: DataSetApplyTask and RecordApplyTask.
  2. BuildTask : construct a new data model which approximates the input data (inverse of applyTask)
  3. computeStatisticsTask: compute some statistics given the input data
  4. importTask: import data into the JSR-73 data description framework.
  5. ExportTask: extract data from the JSR-73 data description framework.
  6. TestTask: computes an error metric on a training data set and its corresponding model.

A MiningTask, much like MiningAlgorithms and MiningFunctions, is just a name used to identify currently available Tasks. The actual interfaces which describe the Task inherit from javax.datamining.base.Task.
The execution of a task is handled via a Connection, an interface that an implementation must provide. A Connection represents a connection to a DataMiningEngine (DME); once the connection is constructed, many tasks may be executed on it, either synchronously or asynchronously. The DME holds the key to the relationships between Functions, Algorithms, and Tasks. It reports which functions, algorithms, and tasks it supports, and it returns a factory given the fully qualified class name of the object the factory is supposed to produce.
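To make these roles concrete, here is a minimal single-JVM sketch in Java. Every type in it (SketchTask, SketchEngine, SketchConnection) is a hypothetical stand-in I made up for illustration. None of them is the javax.datamining API, and the real method names and signatures would have to be checked against the specification.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration only; these are NOT javax.datamining types.

/** A unit of work the engine can run. */
interface SketchTask {
    String run();
}

/** Plays the role of the DME: knows what is supported and hands out factories. */
class SketchEngine {
    // Function name -> names of the algorithms grouped under it.
    private final Map<String, Set<String>> algorithmsByFunction = new ConcurrentHashMap<>();
    // Fully qualified class name of the produced object -> factory for it.
    private final Map<String, Supplier<?>> factoriesByClassName = new ConcurrentHashMap<>();

    void support(String function, String algorithm) {
        algorithmsByFunction
            .computeIfAbsent(function, f -> ConcurrentHashMap.newKeySet())
            .add(algorithm);
    }

    <T> void registerFactory(Class<T> produced, Supplier<? extends T> factory) {
        factoriesByClassName.put(produced.getName(), factory);
    }

    Set<String> supportedFunctions() {
        return Set.copyOf(algorithmsByFunction.keySet());
    }

    Set<String> supportedAlgorithms(String function) {
        return Set.copyOf(algorithmsByFunction.getOrDefault(function, Set.of()));
    }

    /** Looks up a factory by the class name of the object it produces. */
    Supplier<?> getFactory(String producedClassName) {
        Supplier<?> factory = factoriesByClassName.get(producedClassName);
        if (factory == null) {
            throw new IllegalArgumentException("No factory for " + producedClassName);
        }
        return factory;
    }
}

/** Plays the role of a Connection: many tasks may be executed on one instance. */
class SketchConnection {
    private final SketchEngine engine;

    SketchConnection(SketchEngine engine) {
        this.engine = engine;
    }

    SketchEngine engine() {
        return engine;
    }

    /** Synchronous execution: blocks until the task finishes. */
    String execute(SketchTask task) {
        return task.run();
    }

    /** Asynchronous execution: returns at once with a handle to the eventual result. */
    CompletableFuture<String> executeAsync(SketchTask task) {
        return CompletableFuture.supplyAsync(task::run);
    }
}
```

A client would build one SketchConnection, ask its engine what is supported, fetch a factory by class name, and then call execute or executeAsync as often as needed. Because the engine lives in the same JVM as the caller, no separate server process is involved in this sketch.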
JSR-73 has an independent data representation and model representation, neither of which is necessarily tied to a specific MiningAlgorithm or MiningFunction. This enables many different algorithms (say a clustering algorithm and a classification algorithm) to produce the same kind of data model (say a tree, or an array of centroids with variances on all axes). The specification includes a means of specifying both of these quantities. I believe the implementations of algorithms are permitted to operate on whatever they want, so it may not be necessary to be concerned with importing a Geotools data representation into the datamining API or vice-versa.
The downside of JSR-73 is that no reference implementation is obviously available yet, it is uncertain whether or not a DME would be included in the reference implementation, and a significant amount of work would be required to implement the core DME, Tasks, Connections, and other objects required for minimal standalone operation. It is uncertain whether a limited implementation should be part of the Geotools distribution or should be separate. If a limited, single-JVM implementation of the DME is not included with Geotools, and if JSR-73 is adopted as the Geotools Operations API, then one would need to set up and run a DME separately to perform any operation like buffering, much as one needs to set up and run a database server in order to use the current implementation of CoordinateSystemEPSGFactory.
The upside of adopting JSR-73 is that the DME is a logical place to centralize the registration of all operations supported by the library at runtime. A discovery and hierarchical grouping mechanism is already in place. A mechanism for retrieving factories based on the produced object is already specified. In fact, implementing a DME and utilizing JSR-73 as an Operations API would position Geotools in a unique niche: geo-datamining. This could be a kind of geospatially aware data modelling framework for geographically approximating data (e.g., a set of hotspot detections from a set of satellite overpasses could be viewed as a sampling of fire progression over the landscape and approximated by some simplifying model: a growing polygon, a fractal IFS).

Survey of implementation requirements for Geotools/JSR-73 integration

This is a very quick summary of what I think would be required in order to geospatially enable JSR-73 with Geotools (conversely, to utilize JSR-73 as an Operations API). I then move on to postulate what would be required to add a buffering operation under this framework.

Decide on package names

For the purposes of this discussion, the following package names will be assumed.

  1. Implementation classes tied mainly to the implementation of JSR-73 shall reside in org.geotools.datamining.
  2. Implementation classes tied to buffering shall be located in org.geotools.algorithms.buffering.

Implement JSR-73 Core

A limited functionality core of the JSR-73 specification will be required:

  1. DataMiningEngine : the core class which maintains relationships between Functions, Algorithms, Tasks, Factories, and which can execute operations. This code should handle registration with the javax.datamining.Mining* classes when Geotools implementations register themselves. Ensure that MiningFunctions and MiningAlgorithms have unique names/don't attempt to register the same name twice.
  2. DataMiningEngineSpi : the service provider interface that implementing classes use to register an operation with the DME. This interface must contain enough information to permit the DME to manage all the relationships it must manage.
  3. Implement javax.datamining.resource.Connection with org.geotools.datamining.Connection. Users obtain the singleton instance of this class to begin their interaction with the DME. (there has to be a better way!)
  4. Subclass javax.datamining.base.AlgorithmSettings with org.geotools.datamining.interfaces.GeospatialAlgorithmSettings
  5. Subclass javax.datamining.Factory with org.geotools.datamining.interfaces.GeospatialAlgorithmSettingsFactory.
  6. Subclass javax.datamining.base.Task with (and provide implementations for):
    • org.geotools.datamining.interfaces.GeospatialTask
    • org.geotools.datamining.interfaces.GeometricTask
    • org.geotools.datamining.interfaces.CoverageTask
    • org.geotools.datamining.interfaces.GridCoverageTask
  7. Subclass javax.datamining.base.Model with (and provide implementing classes):
    • org.geotools.datamining.interfaces.GeospatialModel
    • org.geotools.datamining.interfaces.GeometricModel
    • org.geotools.datamining.interfaces.CoverageModel

Deeper thinking may result in more work which must be done in order to construct a working core.
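
The registration behaviour asked of the DataMiningEngine and DataMiningEngineSpi above (items 1 and 2) can be sketched as follows. This is only a rough illustration under my own assumptions: the type names, the two SPI methods, and the choice to key on the algorithm name are all hypothetical, not an existing Geotools or javax.datamining API. How providers get discovered is left out. The standard java.util.ServiceLoader would be one possible option.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are assumptions made for illustration,
// not an existing Geotools or javax.datamining API.

/** What a plugin tells the engine about one operation it provides. */
interface EngineSpiSketch {
    String functionName();   // the grouping this operation belongs to
    String algorithmName();  // must be unique across all plugins
}

/** Trivial provider used to exercise the registry. */
class NamedSpiSketch implements EngineSpiSketch {
    private final String function;
    private final String algorithm;

    NamedSpiSketch(String function, String algorithm) {
        this.function = function;
        this.algorithm = algorithm;
    }

    public String functionName() { return function; }
    public String algorithmName() { return algorithm; }
}

/** Registry half of the engine: keeps one provider per algorithm name. */
class EngineRegistrySketch {
    private final Map<String, EngineSpiSketch> byAlgorithm = new LinkedHashMap<>();

    /** Registers a provider; a second registration of the same name is rejected. */
    synchronized void register(EngineSpiSketch spi) {
        if (byAlgorithm.putIfAbsent(spi.algorithmName(), spi) != null) {
            throw new IllegalStateException(
                "Algorithm already registered: " + spi.algorithmName());
        }
    }

    /** Lists the algorithm names grouped under a function, in registration order. */
    synchronized List<String> algorithmsFor(String functionName) {
        List<String> names = new ArrayList<>();
        for (EngineSpiSketch spi : byAlgorithm.values()) {
            if (spi.functionName().equals(functionName)) {
                names.add(spi.algorithmName());
            }
        }
        return names;
    }
}
```

Rejecting the duplicate with an exception, rather than silently replacing the earlier provider, is a design choice made for the sketch. It surfaces plugin name clashes at registration time instead of at execution time.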

Extend JSR-73 with a buffering operation

  1. Extend org.geotools.datamining.interfaces.GeometricTask with org.geotools.algorithms.buffering.BufferingTask.
  2. The buffering operation will produce a GeometricModel, which will be a Feature, so no need to define a new model.
  3. Extend org.geotools.datamining.interfaces.GeospatialAlgorithmSettings with org.geotools.algorithms.buffering.BufferingAlgorithmSettings.
  4. Extend the factory for #3.
  5. Implement the processing class, a factory to create the class, and a DataMiningEngineSpi.
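
As a rough picture of how the settings and the task from the steps above might fit together, here is a toy sketch in Java. It is built on assumptions of my own: the class names are hypothetical, no Geotools or javax.datamining types are used, and the "geometry" is a single point buffered into a regular polygon. A real implementation would presumably delegate to a geometry library and wrap the result in a Feature.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the bodies are invented for illustration and use
// no Geotools or javax.datamining types.

/** Settings for the buffering algorithm (step 3 above). */
class BufferingSettingsSketch {
    final double distance;  // buffer radius, in the units of the input coordinates
    final int segments;     // number of vertices used to approximate the circle

    BufferingSettingsSketch(double distance, int segments) {
        if (distance <= 0 || segments < 3) {
            throw new IllegalArgumentException("distance must be > 0 and segments >= 3");
        }
        this.distance = distance;
        this.segments = segments;
    }
}

/** The task (step 1 above), reduced to buffering one point into a closed ring. */
class PointBufferingTaskSketch {
    private final double x;
    private final double y;
    private final BufferingSettingsSketch settings;

    PointBufferingTaskSketch(double x, double y, BufferingSettingsSketch settings) {
        this.x = x;
        this.y = y;
        this.settings = settings;
    }

    /** Returns the ring as [x, y] pairs; the first vertex is repeated at the end. */
    List<double[]> run() {
        List<double[]> ring = new ArrayList<>();
        for (int i = 0; i < settings.segments; i++) {
            double angle = 2 * Math.PI * i / settings.segments;
            ring.add(new double[] {
                x + settings.distance * Math.cos(angle),
                y + settings.distance * Math.sin(angle)
            });
        }
        ring.add(ring.get(0).clone());  // close the ring
        return ring;
    }
}
```

Under the framework described here, the factory from step 4 would create the settings object, and the DataMiningEngineSpi from step 5 would announce the task to the DME so that clients can discover it.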

Summary

The Datamining API intends to do for datamining (the process of examining a data set for a model which fits it well) what JDBC did for databases. I do not see, however, a reference to a common language (like SQL) for datamining. All work is accomplished by describing a processing chain to a datamining engine (analogous to a database server), which may be local or remote, and which is accessed via a Connection interface.

If Geotools were to implement a limited data mining engine, the DME would serve as a central registry of all operations and all implementations of operations. This meshes nicely with the current plugin model found throughout the Geotools API. It also would provide a discovery mechanism for clients wishing to run specific operations.

Using the analogy between JDBC and the JDMAPI, however, leads us quickly to the point where one must configure and run the data mining engine as a service before any processing can be done at all. This does not lend itself to the creation of lightweight applications which are easily deployable to nonexpert users. Any DME included with Geotools must at least have the option of executing within the same JVM as the client code.

Lastly, I did not notice anything analogous to JDBC's DriverManager, which manages connections, so I currently do not know how to start the whole process.
