Skip to end of metadata
Go to start of metadata

This page documents my Google Summer of code project "Caching Data in uDig". You can read my projet proposal at http://rochecrg3.scg.ulaval.ca/crg/dev/soc07/udig.

Here I will try to keep track of ideas, development notes and work progress.

  1. User Stories
  2. Work extent, roadmap
  3. TODO
  4. Development notes

User stories

First I propose some user stories, in which I dumped first ideas. I borrowed ideas from Andrea's : Datastore caching
I will update these stories as work goes on, and I get a better idea of the needs.

  1. Datastore caching
  2. Multisource datastore
  3. Async updates

User story #1 : Datastore caching

Datastore caching should allow to wrap transparently any other
Datastore (ie. the source Datastore) and leverage related subsequent
queries.

Datastore cache acts as buffer between the client and the actual
source Datastore. It may be persistent over client sessions.

The maximum capacity can be set by the user.

General purpose framework

Can we propose a general framework for arbitrary data caching ?
Raster and vector data models API are not unified in Geotools.

Data to be stored and queries

One issue is the way features are retrieved from Datastore : a
Datastore usually receives a Query, and features are not retrieved by
id. Slightly different queries will yield a common set of features.

What we will store are features, rather than query results.

But we need to track queries, and some kind of relationships with
result features. How do we want to remember previous queries ? What
relationships with features do we keep track of ?

First, we suggest we keep track of queries spatial bounds and keep a
spatial index of the cached features.

What type of queries the cache should leverage ?

  • BBox Filter
  • FID Filter

Other queries may require to keep other indices of features (but this
way it will be possible to implement specialized cache for performance
sensitive operations. This is an extension mechanism we should take
care to provide.)

Cache should delegate unhandled queries to source Datastore. Two step
query of the datasource may be used :

1. get fid list from datasource
2. get features by fid that the cache doesn't already hold.

We will have to check this is performance-wise, but at least this will
save bandwidth when working with remote datastore.

Data persistence

Feature is not Serializable.

We have to find some way to serialize data represented by features.

What if Datastore schema changes ?

What if cache get out of sync with source datastore ? How do we detect this ?

Eviction strategy

When the cache is full, it must make room for new data. We will have
to define an eviction strategy for feature cache and index (or
indices).

Multisource datastore

In the low bandwidth scenario, data may be best copied from the source server and distributed using CD-ROM or DVD-ROM. Clients would appreciate to download only updates from the server, so they will still be able to display up-to-date data.

We would like to set up a datastore that can connect to two (or more ?) source datastores which store the same features, but possibly different versions. For example, one of the datastore would be a local storage, with good performance access, but holding data frozen at time t. The other datastore would represent a remote server.

The multisource datastore would then retrieve features :

  • from the local store, if the feature has not been modified
  • from the remote store, if the feature has been modified.

It should be able to ask the remote store for what features have been modified, and update the local store.

Since we have to know if the original data has changed since time t, this service seems not be possible without some kind of data versioning system.
This could involve a sequential checkout (based on spatial extent) from a versioned remote datastore.

This is the scenario where a client has a local copy of a data source and wants to download updates from the original source. The local copy is used to initialize the cache, and client asks for updates to the remote source, and updates cache (or local copy).

See also :

Asynchronous updates

In the low or no bandwidth scenario, users may want to edit locally the data, and commit modifications when possible.

Data cache using persistent storage will allow clients to carry on work when disconnected.

At commit time, the original data may have been modified by another client. This is a well-known issue of concurrent modifications

Scope of intended work, roadmap

The initial goal is to allow client|application to cache data when querying remote sources, and make disconnected mode possible. Handling asynchronous updtaes is a plus.

In this section, I try to detail deliverables and scope of intended work.

1. in-memory data cache, read-only

The most simple scenario.

I will propose an extensible framework for DataCache.
I will propose a test framework to assess DataCache functionalities.
I will test different implementation of DataCache against MemoryDataStore, the reference implementation of DataStores. Different implementations include testing different types of spatial indices.

2. client demo using cache

A demo would be nice to appreciate the actual benefits of DataCache.

First, I can think of a very simple client - use JMapPane, with two panes (cached, control) to visually compare rendering times ?

Second, I can try a demo using uDig (adding programmatically a cached DataStore to catalog).

3. persistent data cache (disk storage), read-only

Extension of 1 to disk-based data cache.

I will consider using Hsql DataStore as cache backend - but I don't know if plugin is mature enough to be used (ask Jody ?). Think of other solutions ?

I will also consider serializing features to byte arrays and using http://research.att.com/~marioh/spatialindex/ spatial index framework.

Different types of DataCache might be chained : typically

 Memory Cache -> Disk Cache -> Source DataStore

Cache must be initializable from its storage.

4. Write-through data cache

Simple handling of edits, where cache behaves as if there was no cache on edits.

5. Write-back data cache, assuming DataStore handles nicely version conflicts, ie the cache does not take care about version issues and commit conflicts

6. - OPTIONAL - sequential updates

See Multisource DataStore.

We assume client has some way to identify the version of the copy of the source it currently has.

Versioning is beyond the scope of the intended work, because it may take us real far.
But we propose to try some experiments

This could involve setting up a web service (server-side), which :

  • keeps track of versions signature (ie feature tree that make up a version)
  • is able to look up the original source for modification since given revision, and send updates to client.

I can also test ideas using Geoserver versioned WFS (based on Andrea postgis versioned DataStore)

7. - OPTIONAL - propose uDig implementation of connected/disconnected mode

I would be glad to have time to achieve this, but can't figure out what is the required work and how much time it will need.

TODO List

Next things I am planning to work on, unordered for the time being :

  • FeatureCache : move eviction mechanisms to QueryTracker
  • document FeatureCache
  • have an in-depth look at Transaction and how it works
  • implement cache listens to DataStore for FeatureEvents
  • post project javadoc somewhere
  • update UML diagram to reflect design changes

@ideas

The sequence to query the cache I have actually implemented is

user query
   -> match query in tracker
   -> dowload missing data from source
   -> add new data to cache
   -> register query in tracker
   -> read cache
-> anwser to query

User has to wait for all these operations to terminate, whereas he could get at least a first anwer from the cache. Using threads, we can imagine this other sequence :

M  := main thread
T1 := thread 1 reads data from source datastore
T2 := thread 2 reads data from cache
T3 := thread 3 inserts new data into cache
T4 := thread 4 (lower priority) preselects unused features for eviction

M  QUERY--->|.....A------->E...QUERY...----->
T1          |-----|------->            |
T2          |---->|       |            |
T3                        |----------->|
T4                              --------->

QUERY: user query starts process
A: partial answer
E: cache notifies listeners new data are available, after merging source and cache query sets
... thread is sleeping|waiting

Most of the cache behavior should be under control of the query tracker, which knows the known parts of the universe and may record how often each part of the universe is accessed.
We can't remember every query, but we can idealize or normalize them by expanding each to a grid. Typically, a tile will be a quadrant of a quadtree, and we approximate a query by the minimum containing quadrant.

We can attach to each quadrant :

  • access statistics
  • possibly a list of features

On eviction, we pick a leaf quadrant, ie a quadrant that has no child, in the tree and make it invalid. Parent quadrants should be marked invalid or partial. When four quadrants in a quadrant are invalid, they should be removed from that quadrant, which reverts to a leaf. A leaf is always marked partial, unless it has the lowest allowed level.

In order to find the leaf to evict, we traverse the tree by level, looking for a quadrant that matches the criterium (for example, least recently used). If found, we iterate from that quadrant. If not found, ie all quadrants at given level are equals according to the used criterium, we do an exhaustive lookup of the lower level, and so on. If we have to exhaustively traverse many levels, this will be time consuming, and we may revert to random eviction after a given number of levels traversed with no matches.

On look-up, given an input bbox query, we first search for the minimum containing quadrant. Then we test if that quadrant is valid and entire. If not valid, we'll have to fetch this quadrant from the source store. If not entire, we search for invalid inner quadrants. To minimize the number of quadrants to fetch, each quadrant should yield less than a maximum number of quadrants, possibly depending on the level of the quadrant.

Development notes and progress

Some notes of the work being done through this Summer of Code project, and report on progress.

Week 6

  • working on QueryTracker

Week 5

  • tried to have InMemoryDataCache pass AbstractDataStoreTest, then MemoryDataStoreTest (because MemoryDataStore does not pass AbstractDataStoreTest). can't pass testFeatureEvents.
  • change in design : feature cache should be an extension of FeatureStore rather than DataStore. So I designed a new interface FeatureCache that extends FeatureStore. A DataCache will then be no more than a collection of FeatureCache, one per FeatureType.
  • wrote FilterSplitter : a FilterVisitor that splits a filter into two filters, isolating spatial restrictions from other attribute restrictions.
  • tried another R-tree framework : http://research.att.com/~marioh/spatialindex ; this framework provides visitors and query strategies to query the tree, which is great, but tree seems to be slower than org.geotools.index.RTree
  • committed first implementation of FeatureCache : InMemoryFeatureCache, using LRU eviction. Does work, but not that fast.
  • submitted a presentation proposal for FOSS4G (wink)
  • planned to continue work on a demo using JMapPane, but task is delayed until I have a better implementation of FeatureCache

@task

  • move eviction mechanisms to QueryTracker
  • try to understrand better how transactions are working. I will need this when going from read-only cache to write-thru.
  • update UML diagram to reflect design changes

Weeks 3 & 4

  • updated wiki page
  • looked at JMapPane and tried a first demo ; well, this was a try (wink)
  • off to a seminar, so not much work done.

Week 2

  • worked on : in-memory read-only data cache (continued)
  • created module unsupported/caching under trunk, and configured my project build
  • documented first code
  • drawn UML class diagram of draft implementation
  • tried a FeatureMarshaller that will handler DefaultFeature
  • Reading : caching (JBoss Cache, Pojo Cache), OGC Reference Model, versionning (GIT, looked at Andrea's postgis versioned DataStore)
  • tried to detail scope/extent of the intended work of this Summer fo Code project
  • found how to run
    mvn jalopy:format
    , added this section to pom.xml :
      <build>
        <plugins>
          <!-- ====    Code Formatting  ============================== -->
          <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>jalopy-maven-plugin</artifactId>
            <!-- next was the missing part, as I manage my project as standalone -->
            <version>1.0-SNAPSHOT</version>
            <configuration>
                <!-- copied from geotools-convetion.xml -->
                <convention>target/jalopy-geotools.xml</convention>
            </configuration>
            <executions>
              <execution>
                <goals>
                  <!-- don't execute by default at build time,
                       rather I want to manually launch mvn jalopy:format -->
                  <!-- <goal>format</goal> -->
                </goals>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    

@task

  • think of an eviction strategy for flushing part of a spatial index. Idea would be to walk along the tree to find suitable nodes we can get rid of. But org.geotools.index.RTree implementation does not ease this. Tree could be parsed by recursively splitting top bound box into quadrants.
  • talk with Andrea about versioning and his idea of storing features by quadrants, ie tiles.
  • post project javadoc somewhere

Week 1

  • worked on : in-memory read-only data cache
  • readings : caching, spatial indexing
  • draft implementation of classes - see attached UML diagram
  • created wiki page with user stories

@task

  • implement cache size limit
  • handle better multiple types in a DataStore - for now I assumed there was one and only one FeatureType in the DataStore
  • No labels

1 Comment

  1. It's good that we can get the <a href="http://bestfinance-blog.com/topics/business-loans">business loans</a> moreover, it opens up new chances.