This page documents my Google Summer of Code project "Caching Data in uDig". You can read my project proposal at http://rochecrg3.scg.ulaval.ca/crg/dev/soc07/udig.
Here I will try to keep track of ideas, development notes and work progress.
First I propose some user stories, in which I dumped my first ideas. I borrowed ideas from Andrea's Datastore caching.
I will update these stories as work goes on and I get a better idea of the needs.
Datastore caching should make it possible to transparently wrap any other
Datastore (i.e. the source Datastore) and leverage subsequent related queries.
The Datastore cache acts as a buffer between the client and the actual
source Datastore. It may be persistent across client sessions.
The maximum capacity can be set by the user.
Can we propose a general framework for arbitrary data caching?
The raster and vector data model APIs are not unified in GeoTools.
One issue is the way features are retrieved from a Datastore: a
Datastore usually receives a Query, and features are not retrieved by
id. Slightly different queries will yield a common set of features.
What we will store are features, rather than query results.
But we need to track queries, and some kind of relationship between
them and their result features. How do we want to remember previous
queries? What relationships with features do we keep track of?
First, we suggest keeping track of the spatial bounds of queries and
maintaining a spatial index of the cached features.
What types of queries should the cache leverage?
Other queries may require keeping other indices of features. This way
it will be possible to implement specialized caches for
performance-sensitive operations; this is an extension mechanism we
should take care to provide.
The cache should delegate unhandled queries to the source Datastore. A
two-step query of the datasource may be used:
1. get the fid list from the datasource
2. get, by fid, the features that the cache doesn't already hold.
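The two-step query above can be sketched in plain Java. This is only an illustration of the idea, not GeoTools API: the fid list, the cache map, and the `fetchByFid` callback are all hypothetical stand-ins for the source Datastore and the feature cache.

```java
import java.util.*;
import java.util.function.Function;

public class TwoStepQuery {
    /** Answers a query, fetching from the source only the features
     *  (here modeled as plain Strings) the cache does not already hold. */
    static Map<String, String> query(
            List<String> sourceFids,                 // step 1: fid list from the source
            Map<String, String> cache,               // fid -> cached feature
            Function<Set<String>, Map<String, String>> fetchByFid) { // step 2: remote fetch
        Set<String> missing = new LinkedHashSet<>(sourceFids);
        missing.removeAll(cache.keySet());
        cache.putAll(fetchByFid.apply(missing));     // only missing fids cross the wire
        Map<String, String> result = new LinkedHashMap<>();
        for (String fid : sourceFids) result.put(fid, cache.get(fid));
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>(Map.of("road.1", "cached-road-1"));
        Map<String, String> answer = query(
                List.of("road.1", "road.2"),
                cache,
                missing -> {
                    System.out.println("fetched from source: " + missing);
                    Map<String, String> fetched = new HashMap<>();
                    for (String fid : missing) fetched.put(fid, "remote-" + fid);
                    return fetched;
                });
        System.out.println(answer);
    }
}
```

Here only `road.2` is fetched from the source; `road.1` is answered from the cache, which is where the bandwidth saving comes from.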
We will have to check whether this is a performance win, but at least
it will save bandwidth when working with a remote datastore.
Feature is not Serializable.
We have to find some way to serialize the data represented by features.
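One possible approach, sketched below: copy a feature's data into a plain serializable value object. The `FeatureRecord` class, its fields, and the WKT-as-String geometry encoding are all assumptions for illustration (a binary encoding such as WKB would be more compact in practice).

```java
import java.io.*;
import java.util.*;

// Hypothetical serializable snapshot of a feature's data.
public class FeatureRecord implements Serializable {
    final String fid;
    final String geometryWkt;                 // e.g. produced by a WKT writer
    final HashMap<String, Object> attributes; // attribute values must be serializable too

    FeatureRecord(String fid, String wkt, HashMap<String, Object> attrs) {
        this.fid = fid; this.geometryWkt = wkt; this.attributes = attrs;
    }

    static byte[] toBytes(FeatureRecord r) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) { oos.writeObject(r); }
        return bos.toByteArray();
    }

    static FeatureRecord fromBytes(byte[] b) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(b))) {
            return (FeatureRecord) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, Object> attrs = new HashMap<>();
        attrs.put("name", "main street");
        FeatureRecord road = new FeatureRecord("road.1", "LINESTRING (0 0, 1 1)", attrs);
        // round-trip through bytes, as a disk-based cache would do
        FeatureRecord copy = fromBytes(toBytes(road));
        System.out.println(copy.fid + " / " + copy.geometryWkt + " / " + copy.attributes);
    }
}
```

This sidesteps the non-serializable Feature by serializing a copy of its state; rebuilding a real Feature from such a record would additionally require the schema, which ties into the schema-change question below.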
What if the Datastore schema changes?
What if the cache gets out of sync with the source datastore? How do we detect this?
When the cache is full, it must make room for new data. We will have
to define an eviction strategy for the feature cache and its index.
In the low bandwidth scenario, data may best be copied from the source server and distributed on CD-ROM or DVD-ROM. Clients would appreciate downloading only updates from the server, so they can still display up-to-date data.
We would like to set up a datastore that can connect to two (or more?) source datastores which store the same features, but possibly in different versions. For example, one datastore would be local storage, with fast access, but holding data frozen at time t. The other datastore would represent a remote server.
The multisource datastore would then retrieve features:
It should be able to ask the remote store for what features have been modified, and update the local store.
Since we have to know if the original data has changed since time t, this service does not seem possible without some kind of data versioning system.
This could involve a sequential checkout (based on spatial extent) from a versioned remote datastore.
This is the scenario where a client has a local copy of a data source and wants to download updates from the original source. The local copy is used to initialize the cache; the client asks the remote source for updates and updates the cache (or the local copy).
See also:
In the low or no bandwidth scenario, users may want to edit the data locally and commit their modifications when possible.
Data cache using persistent storage will allow clients to carry on work when disconnected.
At commit time, the original data may have been modified by another client. This is the well-known issue of concurrent modifications.
The initial goal is to allow a client application to cache data when querying remote sources, and to make a disconnected mode possible. Handling asynchronous updates is a plus.
In this section, I try to detail deliverables and scope of intended work.
1. in-memory data cache, read-only
The simplest scenario.
I will propose an extensible framework for DataCache.
I will propose a test framework to assess DataCache functionalities.
I will test different implementations of DataCache against MemoryDataStore, the reference implementation of DataStore. Different implementations include different types of spatial indices.
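The core idea of such a test framework can be sketched in a few lines: run the same queries against a reference store and a cached store, and require identical answers. The `Store` interface and the memoizing "cache" below are stand-ins for illustration, not the GeoTools DataStore API.

```java
import java.util.*;

public class CacheConformance {
    // Hypothetical minimal store interface: a query string maps to a set of fids.
    interface Store { Set<String> getFeatures(String query); }

    /** Fails if the cached store ever disagrees with the reference store. */
    static void assertSameAnswers(Store reference, Store cached, List<String> queries) {
        for (String q : queries) {
            Set<String> expected = reference.getFeatures(q);
            Set<String> actual = cached.getFeatures(q);
            if (!expected.equals(actual))
                throw new AssertionError("query '" + q + "': " + actual + " != " + expected);
        }
    }

    public static void main(String[] args) {
        Map<String, Set<String>> data = Map.of("roads", Set.of("road.1", "road.2"));
        Store reference = q -> data.getOrDefault(q, Set.of());
        Map<String, Set<String>> memo = new HashMap<>();   // trivial memoizing "cache"
        Store cached = q -> memo.computeIfAbsent(q, reference::getFeatures);
        // repeated and empty queries must behave exactly like the reference
        assertSameAnswers(reference, cached, List.of("roads", "roads", "rivers"));
        System.out.println("cache agrees with reference");
    }
}
```

Each real DataCache implementation (with its particular spatial index) would then be dropped in behind the same conformance check.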
2. client demo using cache
A demo would be nice to appreciate the actual benefits of DataCache.
First, I can think of a very simple client: use JMapPane with two panes (cached and control) to visually compare rendering times.
Second, I can try a demo using uDig (programmatically adding a cached DataStore to the catalog).
3. persistent data cache (disk storage), read-only
Extension of 1 to disk-based data cache.
I will consider using the Hsql DataStore as the cache backend, but I don't know if the plugin is mature enough to be used (ask Jody?). Think of other solutions?
I will also consider serializing features to byte arrays and using the http://research.att.com/~marioh/spatialindex/ spatial index framework.
Different types of DataCache might be chained, typically:
Memory Cache -> Disk Cache -> Source DataStore
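The chaining above can be sketched as a chain of levels, each answering from its own storage when it can and otherwise delegating to the next level, keeping a copy on the way back. The `Level` interface and the String-valued "features" are illustrative assumptions, not the project's actual API.

```java
import java.util.*;

public class ChainedCache {
    // Hypothetical cache level: resolves a fid to a feature (a String here).
    interface Level { String get(String fid); }

    /** Wraps a storage map around the next level in the chain. */
    static Level caching(Map<String, String> storage, Level next, String name) {
        return fid -> {
            String hit = storage.get(fid);
            if (hit != null) { System.out.println(name + " hit: " + fid); return hit; }
            String fetched = next.get(fid);      // miss: ask the next level
            storage.put(fid, fetched);           // keep a copy for later
            return fetched;
        };
    }

    public static void main(String[] args) {
        Level source = fid -> { System.out.println("source fetch: " + fid); return "feature-" + fid; };
        Level disk = caching(new HashMap<>(), source, "disk");
        Level memory = caching(new HashMap<>(), disk, "memory");

        memory.get("road.1"); // falls through both caches to the source
        memory.get("road.1"); // now answered by the memory cache
    }
}
```

Initializing the cache from its storage (the next story) amounts to pre-filling the disk level's map before the chain is queried.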
The cache must be initializable from its storage.
4. Write-through data cache
Simple handling of edits, where the cache behaves as if there were no cache on edits.
5. Write-back data cache, assuming the DataStore handles version conflicts nicely, i.e. the cache does not take care of versioning issues and commit conflicts
6. - OPTIONAL - sequential updates
See Multisource DataStore.
We assume client has some way to identify the version of the copy of the source it currently has.
Versioning is beyond the scope of the intended work, because it could take us really far.
But we propose to try some experiments.
This could involve setting up a web service (server-side), which:
I can also test ideas using the GeoServer versioned WFS (based on Andrea's PostGIS versioned DataStore).
7. - OPTIONAL - propose uDig implementation of connected/disconnected mode
I would be glad to have time to achieve this, but I can't figure out what work is required and how much time it will need.
Next things I am planning to work on, unordered for the time being:
The sequence to query the cache as I have actually implemented it is:
user query -> match query in tracker -> download missing data from source -> add new data to cache -> register query in tracker -> read cache -> answer to query
The user has to wait for all these operations to terminate, whereas he could get at least a first answer from the cache. Using threads, we can imagine this other sequence:
M  := main thread
T1 := thread 1, reads data from source datastore
T2 := thread 2, reads data from cache
T3 := thread 3, inserts new data into cache
T4 := thread 4 (lower priority), preselects unused features for eviction

M  QUERY--->|.....A------->E...QUERY...----->
T1          |-----|------->
T2          |---->|
T3                |----------->|
T4 --------->

QUERY : user query starts process
A     : partial answer
E     : cache notifies listeners that new data are available,
        after merging source and cache query sets
...   : thread is sleeping/waiting
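A minimal sketch of this threaded sequence, using plain `java.util.concurrent`: the cache answers immediately with what it holds (the partial answer A), while a background task fetches the missing data, inserts it into the cache, and notifies a listener once the merged result is ready (E). Thread structure and names are illustrative only.

```java
import java.util.*;
import java.util.concurrent.*;

public class AsyncCacheQuery {
    public static void main(String[] args) throws Exception {
        Set<String> cache = ConcurrentHashMap.newKeySet();
        cache.add("road.1");                          // already cached before the query
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch done = new CountDownLatch(1);

        // T2: read from cache -> partial answer A, available at once
        Future<Set<String>> partial = pool.submit(() -> Set.copyOf(cache));
        System.out.println("partial answer: " + partial.get());

        // T1 + T3: fetch from the source, insert into the cache, then notify (E)
        pool.submit(() -> {
            Set<String> fromSource = Set.of("road.2"); // stands in for a slow remote read
            cache.addAll(fromSource);
            System.out.println("listener notified, full answer: " + new TreeSet<>(cache));
            done.countDown();
        });

        done.await();
        pool.shutdown();
    }
}
```

The point of the design is visible in the output ordering: the partial answer is delivered before the source read completes.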
Most of the cache behavior should be under the control of the query tracker, which knows which parts of the universe are known and may record how often each part is accessed.
We can't remember every query, but we can idealize or normalize queries by expanding each to a grid. Typically, a tile will be a quadrant of a quadtree, and we approximate a query by its minimum containing quadrant.
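The normalization step can be sketched as follows: descend the quadtree from the root, stopping at the first quadrant whose children would split the query bbox. The unit-square universe and the path encoding are assumptions for illustration.

```java
public class QuadrantKey {
    /** Returns the quadtree path (one digit 0..3 per level) of the smallest
     *  quadrant containing [minX,maxX]x[minY,maxY], down to maxLevel,
     *  in a [0,1]x[0,1] universe. */
    static String minimumContainingQuadrant(double minX, double minY,
                                            double maxX, double maxY, int maxLevel) {
        double x0 = 0, y0 = 0, size = 1;      // current quadrant, starts at the root
        StringBuilder path = new StringBuilder();
        for (int level = 0; level < maxLevel; level++) {
            double half = size / 2;
            int qx = minX < x0 + half ? 0 : 1;
            int qy = minY < y0 + half ? 0 : 1;
            // stop if the query crosses a child boundary: this quadrant is minimal
            if (qx != (maxX < x0 + half ? 0 : 1) || qy != (maxY < y0 + half ? 0 : 1)) break;
            path.append(qy * 2 + qx);
            x0 += qx * half; y0 += qy * half; size = half;
        }
        return path.length() == 0 ? "root" : path.toString();
    }

    public static void main(String[] args) {
        // a small bbox in the lower-left corner descends a couple of levels
        System.out.println(minimumContainingQuadrant(0.1, 0.1, 0.2, 0.2, 8));
        // a bbox straddling the center can only be approximated by the root
        System.out.println(minimumContainingQuadrant(0.4, 0.4, 0.6, 0.6, 8));
    }
}
```

The second case shows the known weakness of this approximation: a small query straddling a high-level boundary normalizes to a much larger quadrant.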
We can attach to each quadrant:
On eviction, we pick a leaf quadrant in the tree, i.e. a quadrant that has no children, and make it invalid. Parent quadrants should be marked invalid or partial. When the four quadrants in a quadrant are invalid, they should be removed from that quadrant, which reverts to a leaf. A leaf is always marked partial, unless it is at the lowest allowed level.
To find the leaf to evict, we traverse the tree by level, looking for a quadrant that matches the criterion (for example, least recently used). If one is found, we iterate from that quadrant. If not, i.e. all quadrants at the given level are equal according to the criterion, we do an exhaustive lookup of the next level down, and so on. If we have to exhaustively traverse many levels this will be time-consuming, so we may fall back to random eviction after a given number of levels traversed with no match.
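The invalidate-and-collapse part of the eviction rule can be sketched with a tiny quadrant structure (the class and its fields are stand-ins, not the project's actual tracker):

```java
import java.util.*;

public class QuadrantEviction {
    static class Quadrant {
        boolean valid = true;
        Quadrant parent;
        Quadrant[] children;                  // null for a leaf
        Quadrant(Quadrant parent) { this.parent = parent; }

        void split() {
            children = new Quadrant[4];
            for (int i = 0; i < 4; i++) children[i] = new Quadrant(this);
        }

        /** Invalidates this leaf, marks ancestors partial/invalid, and collapses
         *  any quadrant whose four children have all become invalid leaves. */
        void evict() {
            valid = false;
            Quadrant q = parent;
            while (q != null) {
                q.valid = false;              // ancestor is now at best partial
                if (q.children != null
                        && Arrays.stream(q.children)
                                 .noneMatch(c -> c.valid || c.children != null)) {
                    q.children = null;        // all four children invalid: back to a leaf
                }
                q = q.parent;
            }
        }
    }

    public static void main(String[] args) {
        Quadrant root = new Quadrant(null);
        root.split();
        for (Quadrant child : root.children) child.evict();
        System.out.println("root is leaf again: " + (root.children == null));
    }
}
```

After all four children are evicted, the parent reverts to an (invalid) leaf, which is the collapse behavior described above.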
On look-up, given an input bbox query, we first search for the minimum containing quadrant. Then we test whether that quadrant is valid and entire. If it is not valid, we will have to fetch it from the source store. If it is not entire, we search for invalid inner quadrants. To minimize the number of quadrants to fetch, each quadrant should yield at most a maximum number of quadrants, possibly depending on its level.
Some notes on the work being done through this Summer of Code project, and a report on progress.
<build>
  <plugins>
    <!-- ==== Code Formatting ============================== -->
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>jalopy-maven-plugin</artifactId>
      <!-- next was the missing part, as I manage my project as standalone -->
      <version>1.0-SNAPSHOT</version>
      <configuration>
        <!-- copied from geotools-convetion.xml -->
        <convention>target/jalopy-geotools.xml</convention>
      </configuration>
      <executions>
        <execution>
          <goals>
            <!-- don't execute by default at build time; rather I want to
                 manually launch mvn jalopy:format -->
            <!-- <goal>format</goal> -->
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>