In a recent entry on his bliki, Martin Fowler talks about Data Models (http://www.martinfowler.com/bliki/DataModels.html). He mentions that many people, even with ADO.NET, don't want to work in a relational style. I now think that this is a good-thing.
When I designed Neo, originally for use on a specific client project, I spent some time considering exactly this issue. I knew our data was basically a hypercube and I therefore thought it would be good to have set-based operations. (After all, you want to intersect the dimensions to get to relevant data.) For this reason I added relation-style queries to Neo. In the first few months of the project I was surprised how wrong things can go with this model and later on we have considered it best-practice to avoid such usage. Why?
Anemic Data Models
It is best explained with an example. Consider the old trusty 'pubs' sample database in Sybase and SQL Server. If you have a publisher and you want to get the titles for one, in Neo you can write this as:
Publisher publisher = /* assume you have this */
TitleFactory titleFactory = new TitleFactory(context);
result = titleFactory.Find("Publisher = {0}", publisher);
People loved this. But it has a couple of problems: For a start this loses type-safety, but that's another issue. More importantly, though, it also leads to what Martin calls Anemic Domain Models. Assume you want to get all titles for a publisher with a minimum royalties value. You now write:
result = titleFactory.Find("Publisher = {0} and Royalties > {1}", publisher, 1000);
This puts your logic in the controller (service layer) rather than in the object. Knowing that Neo provides smart lists, which provide access to objects that are part of a relation, you can first rewrite it as:
result = publisher.Titles.Find("Royalties > {0}", 1000);
This doesn't improve much but it leads the way to the real improvement: You implement a method, say GetTitlesWithMinRoyalties, on Publisher. The code now looks like
result = publisher.GetTitlesWithMinRoyalties(1000);
which places the logic where it belongs: in the domain object. (Actually, the 1000 probably belongs in the domain layer, too.) This is a very simplistic example but I think it shows what it mean.
Performance
Another problem, which we experienced first hand, is performance. Say, in your user interface you have a table in which you want to display whether an author contributed to a title, the titles in rows, the authors in columns, and an 'X' in the cell at the intersection if the author was involved with the title. (Bad example, I know.) Every cell now does the following:
result = taFactory.Find("Title = {0} and Author = {1}", this.Title, this.Author);
this.SetText = (result.Count > 0) ? 'X' : ' ';
It's easy to see that this leads to a quadratic algorithm. If there are n authors, m titles and k title-author correlation objects, with k ~ n*m, there will be O( (n*m)^2 ) comparisons to draw the table. Sounds obvious but this example is not contrived; the first implementation of our spreadsheet cells in this client project did something very similar and we saw millions of object comparisions for a simple redraw.
So, why are databases so good at set-based queries and in-memory implementations are not? Indicies and the query optimizer of course. You could implement both, with a former being a prerequisite for the latter, in an in-memory layer but this again poses its own problems because databases are by nature persistent and in-memory storage is not. In the database you can update the index whenever you insert an object or change a value, in memory you have to rebuild the indexes when you first need them; and usually you have to have everything in memory before you can build it.
- What was the solution to the Performance problem listed in 2? I guess you created a genuine model to cache the values. Did you use the same query mechanism to generate the model or did you encapsulate it in the domain model? I think the solution would make the example more valuable.
- Following your argument about problems with in memory queries one might think that since the db itself has got query optimisers and indexes, making the spreadsheet do a query to the db, not the memory model, might make the spreadsheet more efficient. This would not solve the problem of course, due to the network latency amongst other things.
- Might be worth mentioning the prevayler benchmarks too (take them with a grain of salt of course) but note how
- some things are faster in memory - to quote "Each query is a simple attribute equality query (WHERE NAME='NAME1234') and returns a List of 10, 100 or 1000 Record Objects depending on the number of Record Objects initially created."
- some things are not - "Each transaction is composed of one Record Object deletion, one insertion and one update, all by ID"
The real point about the performance problem above is that a more object-oriented domain model (at least with Neo) actually improves performance. If you want to find all Titles by a particular Author, you could either scan the entire Title list for records with that Author's ID, or you could associate an Author with its Titles at start-up. Subsequent searches are then much narrower (find all Titles in my list of Titles with a TitleId of 'X'). What started as a desire to improve performance (by narrowing down the search data) led to a much more cohesive, and less anemic, data model. Most of our performance problems have been solved via a combination of data-narrowing and judicious caching – JimArnold
