This page documents my Google Summer of Code project "Caching Data in uDig". You can read my project proposal at http://rochecrg3.scg.ulaval.ca/crg/dev/soc07/udig.
Here I will try to keep track of ideas, development notes and work progress.
First I propose some user stories, in which I dumped first ideas. I borrowed ideas from Andrea's:
I will update these stories as work goes on and I get a better idea of the needs.
Datastore caching should allow transparently wrapping any other Datastore (i.e. the source Datastore), so that subsequent queries can be served from already fetched data. The Datastore cache acts as a buffer between the client and the actual source Datastore. It may be persistent over client sessions.
The maximum capacity can be set by the user.
Can we propose a general framework for arbitrary data caching? The raster and vector data model APIs are not unified in GeoTools.
One issue is the way features are retrieved from a Datastore: a Datastore usually receives a Query, and features are not retrieved by id. Slightly different queries will yield a common set of features.
What we will store are features, rather than query results. But we need to track queries, and some kind of relationship with the resulting features. How do we want to remember previous queries? Which relationships with features do we keep track of?
First, we suggest keeping track of the spatial bounds of queries and maintaining a spatial index of the cached features.
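A minimal sketch of this idea, using a hypothetical `QueryTracker` that records the bounding boxes of past queries, so the cache can tell whether a new query is entirely covered by data it already holds. All names here are assumptions for illustration, not actual GeoTools API, and real coverage testing would have to consider unions of several past queries:

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Records the spatial bounds of past queries; a new query is a cache hit
// only if its bbox is fully contained in some previously seen bbox.
public class QueryTracker {
    private final List<Rectangle2D> knownBounds = new ArrayList<>();

    public void register(Rectangle2D queryBounds) {
        knownBounds.add(queryBounds);
    }

    // True if some past query's bounds fully contain the new query, so
    // all matching features are guaranteed to be in the cache already.
    public boolean isCovered(Rectangle2D queryBounds) {
        for (Rectangle2D known : knownBounds) {
            if (known.contains(queryBounds)) {
                return true;
            }
        }
        return false;
    }
}
```

A real implementation would back `knownBounds` with the same spatial index used for the cached features, rather than a linear scan.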
What types of queries should the cache leverage?
Other queries may require keeping other indices of features, but this way it will be possible to implement specialized caches for performance-sensitive operations. This is an extension mechanism we should take care to provide.
The cache should delegate unhandled queries to the source Datastore. A two-step query of the datasource may be used:
1. get the fid list from the datasource
2. get, by fid, the features that the cache doesn't already hold.
We will have to check how this performs, but at least it will save bandwidth when working with a remote datastore.
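The two steps above can be sketched as follows. The store and cache are stood in for by plain maps from fid to feature data; the real interfaces (DataStore, Query, Feature) are more involved, so treat this purely as an illustration of the fid-diffing logic:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Two-step query against a remote source: first fetch only the fid list,
// then fetch full features only for fids the cache does not already hold.
public class TwoStepQuery {

    // Step 1: the remote store answers the query with feature ids only.
    static Set<String> fidsForQuery(Map<String, String> remoteStore) {
        return remoteStore.keySet();
    }

    // Step 2: fetch from the remote only what the cache is missing.
    static Map<String, String> resolve(Map<String, String> remoteStore,
                                       Map<String, String> cache) {
        Map<String, String> result = new HashMap<>();
        List<String> missing = new ArrayList<>();
        for (String fid : fidsForQuery(remoteStore)) {
            if (cache.containsKey(fid)) {
                result.put(fid, cache.get(fid));   // served from cache
            } else {
                missing.add(fid);                  // must hit the source
            }
        }
        for (String fid : missing) {
            String feature = remoteStore.get(fid); // expensive remote fetch
            cache.put(fid, feature);               // warm the cache
            result.put(fid, feature);
        }
        return result;
    }
}
```

The bandwidth saving comes from step 2: only the `missing` list crosses the network with full feature payloads, while step 1 transfers fids only.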
Feature is not Serializable.
We have to find some way to serialize data represented by features.
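One possible workaround, sketched below: even though features themselves are not Serializable, their fid and attribute values usually are, so we can serialize those and rebuild the feature against the schema on read. The `FeatureCodec` name and the fid-plus-attribute-list encoding are assumptions of this sketch, not an existing GeoTools facility:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Serialize the fid and attribute values of a feature to a byte array;
// the feature itself is rebuilt from its schema plus these values.
public class FeatureCodec {

    static byte[] encode(String fid, List<Object> attributes) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(fid);
            out.writeObject(new ArrayList<>(attributes)); // ArrayList is Serializable
        }
        return bytes.toByteArray();
    }

    @SuppressWarnings("unchecked")
    static Map.Entry<String, List<Object>> decode(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(data))) {
            String fid = (String) in.readObject();
            List<Object> attrs = (List<Object>) in.readObject();
            return new AbstractMap.SimpleEntry<>(fid, attrs);
        }
    }
}
```

This assumes every attribute value (geometries included) is Serializable; geometries may instead need dedicated encoding such as WKB.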
What if the Datastore schema changes?
What if the cache gets out of sync with the source datastore? How do we detect this?
When the cache is full, it must make room for new data. We will have to define an eviction strategy for the feature cache and its index.
In the low bandwidth scenario, data may be best copied from the source server and distributed using CD-ROM or DVD-ROM. Clients would appreciate downloading only updates from the server, so they will still be able to display up-to-date data.
We would like to set up a datastore that can connect to two (or more?) source datastores which store the same features, but possibly different versions. For example, one of the datastores would be local storage, with good access performance, but holding data frozen at time t. The other datastore would represent a remote server.
The multisource datastore would then retrieve features:
It should be able to ask the remote store which features have been modified, and update the local store.
Since we have to know whether the original data has changed since time t, this service does not seem possible without some kind of data versioning system. This could involve a sequential checkout (based on spatial extent) from a versioned remote datastore.
This is the scenario where a client has a local copy of a data source and wants to download updates from the original source. The local copy is used to initialize the cache; the client then asks the remote source for updates and updates the cache (or local copy).
See also:
In the low or no bandwidth scenario, users may want to edit the data locally, and commit modifications when possible.
A data cache using persistent storage will allow clients to carry on working when disconnected.
At commit time, the original data may have been modified by another client. This is the well-known issue of concurrent modifications.
The initial goal is to allow a client application to cache data when querying remote sources, and make disconnected mode possible. Handling asynchronous updates is a plus.
In this section, I try to detail deliverables and the scope of intended work.
1. in-memory data cache, read-only
The most simple scenario.
I will propose an extensible framework for DataCache.
I will propose a test framework to assess DataCache functionalities.
I will test different implementations of DataCache against MemoryDataStore, the reference implementation of DataStore. Different implementations include different types of spatial indices.
2. client demo using cache
A demo would be nice to show the actual benefits of DataCache.
First, I can think of a very simple client: use JMapPane, with two panes (cached, control) to visually compare rendering times?
Second, I can try a demo using uDig (programmatically adding a cached DataStore to the catalog).
3. persistent data cache (disk storage), read-only
Extension of 1 to disk-based data cache.
I will consider using the Hsql DataStore as cache backend, but I don't know if the plugin is mature enough to be used (ask Jody?). Think of other solutions?
I will also consider serializing features to byte arrays and using the http://research.att.com/~marioh/spatialindex/ spatial index framework.
Different types of DataCache might be chained: typically
Cache must be initializable from its storage.
4. Write-through data cache
Simple handling of edits, where the cache behaves as if there were no cache on edits.
5. Write-back data cache, assuming the DataStore handles version conflicts nicely, i.e. the cache does not take care of version issues and commit conflicts
6. - OPTIONAL - sequential updates
See Multisource DataStore.
We assume the client has some way to identify the version of the copy of the source it currently has.
Versioning is beyond the scope of the intended work, because it may take us really far. But we propose to try some experiments.
This could involve setting up a web service (server-side), which:
I can also test ideas using the Geoserver versioned WFS (based on Andrea's postgis versioned DataStore).
7. - OPTIONAL - propose uDig implementation of connected/disconnected mode
I would be glad to have time to achieve this, but I can't figure out what the required work is and how much time it will need.
Next things I am planning to work on, unordered for the time being:
The sequence to query the cache that I have actually implemented is:
The user has to wait for all these operations to terminate, whereas he could get at least a first answer from the cache. Using threads, we can imagine this other sequence:
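The threaded idea can be sketched as follows: answer from the cache immediately, while a background thread fetches fresher data from the slow source store. The class and method names, and the string payloads, are illustrative stand-ins, not uDig/GeoTools API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// The caller gets a fast (possibly partial) answer from the cache, and a
// Future for the complete answer being fetched from the source in parallel.
public class AsyncCacheQuery {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Fast path: whatever the cache currently holds for the query.
    public String queryCache() {
        return "cached-features";
    }

    // Slow path: submit the remote fetch and return immediately.
    public Future<String> querySource() {
        return executor.submit(() -> {
            Thread.sleep(50); // simulate remote latency
            return "fresh-features";
        });
    }

    public void shutdown() {
        executor.shutdown();
    }
}
```

The client would render `queryCache()` right away, then refresh the display when the `Future` from `querySource()` completes.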
Most of the cache behavior should be under the control of the query tracker, which knows the known parts of the universe and may record how often each part of the universe is accessed.
We can't remember every query, but we can idealize or normalize them by expanding each to a grid. Typically, a tile will be a quadrant of a quadtree, and we approximate a query by the minimum containing quadrant.
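Finding the minimum containing quadrant can be sketched as below, over a unit-square universe for simplicity (a real implementation would work in the datastore's coordinate bounds). We descend the quadtree until the query bbox straddles a midline; the quadrant is returned as a path of child indices from the root, where a shorter path means a larger quadrant:

```java
public class QuadrantNormalizer {

    // Smallest quadtree quadrant of the unit square fully containing the
    // query bbox. Children are numbered 0=SW, 1=SE, 2=NW, 3=NE; the empty
    // string denotes the root quadrant.
    public static String minimumContainingQuadrant(
            double minX, double minY, double maxX, double maxY, int maxLevel) {
        StringBuilder path = new StringBuilder();
        double x0 = 0, y0 = 0, size = 1.0;
        for (int level = 0; level < maxLevel; level++) {
            double midX = x0 + size / 2, midY = y0 + size / 2;
            int child;
            if (maxX <= midX && maxY <= midY)      child = 0; // SW
            else if (minX >= midX && maxY <= midY) child = 1; // SE
            else if (maxX <= midX && minY >= midY) child = 2; // NW
            else if (minX >= midX && minY >= midY) child = 3; // NE
            else break; // bbox straddles a midline: current quadrant is minimal
            path.append(child);
            if (child == 1 || child == 3) x0 = midX;
            if (child == 2 || child == 3) y0 = midY;
            size /= 2;
        }
        return path.toString();
    }
}
```

Note the worst case: a tiny bbox sitting exactly on the universe's center lines normalizes to the root quadrant, which is the price of this approximation.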
We can attach to each quadrant:
On eviction, we pick a leaf quadrant, i.e. a quadrant that has no children, and make it invalid. Parent quadrants should be marked invalid or partial. When all four quadrants within a quadrant are invalid, they should be removed from that quadrant, which reverts to a leaf. A leaf is always marked partial, unless it is at the lowest allowed level.
In order to find the leaf to evict, we traverse the tree by level, looking for a quadrant that matches the criterion (for example, least recently used). If found, we iterate from that quadrant. If not found, i.e. all quadrants at a given level are equal according to the criterion, we do an exhaustive lookup of the lower level, and so on. If we have to exhaustively traverse many levels, this will be time-consuming, and we may revert to random eviction after a given number of levels traversed with no matches.
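A greedy simplification of this traversal can be sketched as below: at each level we follow the least recently used child, and the leaf we end on is the victim. This omits the tie-breaking and random-fallback refinements described above, and the `Quadrant` structure is a minimal stand-in:

```java
import java.util.ArrayList;
import java.util.List;

// Descend the quadtree following the least recently used child at each
// level; the leaf reached is the eviction victim.
public class LruLeafEviction {

    static class Quadrant {
        final long lastAccess;                 // timestamp of last query hit
        final List<Quadrant> children = new ArrayList<>();
        Quadrant(long lastAccess) { this.lastAccess = lastAccess; }
        boolean isLeaf() { return children.isEmpty(); }
    }

    static Quadrant pickVictim(Quadrant root) {
        Quadrant node = root;
        while (!node.isLeaf()) {
            Quadrant lru = node.children.get(0);
            for (Quadrant child : node.children) {
                if (child.lastAccess < lru.lastAccess) {
                    lru = child;
                }
            }
            node = lru; // follow the least recently used branch
        }
        return node;
    }
}
```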
On look-up, given an input bbox query, we first search for the minimum containing quadrant. Then we test whether that quadrant is valid and entire. If not valid, we'll have to fetch this quadrant from the source store. If not entire, we search for invalid inner quadrants. To minimize the number of quadrants to fetch, each quadrant should yield less than a maximum number of quadrants, possibly depending on the level of the quadrant.
Some notes on the work being done through this Summer of Code project, and a report on progress.