This page discusses the options for elastic persistence and how they compare.

MongoDB

Main characteristics

  • BSON format (binary JSON), which makes data transfer and storage more efficient
  • C++
  • Apache license
  • Supports in-place updates: the existing document is modified rather than a new document being created, which improves write performance considerably (see the sketch after this list).
  • Ad-hoc queries: indexes are not mandatory because in many cases MongoDB can find an efficient query path on its own.
  • Language-specific drivers
  • Atomic operations on single documents only; no multi-document transactions
  • Master-Slave replication
  • Automatic sharding support
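
As a quick illustration of the in-place update model: a write modifies fields of an existing document with update operators instead of storing a new copy. A minimal sketch with the Python driver (pymongo); the server address, database and collection names are assumptions.

    # Sketch of an in-place update with pymongo (names are hypothetical).
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed local server
    orders = client["shop"]["orders"]                  # assumed database/collection

    orders.insert_one({"_id": 1, "status": "new", "items": 2})

    # $set/$inc modify the existing document in place; no new document is created.
    orders.update_one({"_id": 1}, {"$set": {"status": "paid"}, "$inc": {"items": 1}})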

Community

Core concepts

  • Replica set, http://www.mongodb.org/display/DOCS/Replica+Sets. Replica sets provide asynchronous master/slave replication to increase availability. A replica set consists of at least three servers: either three full servers, or two full servers and one arbiter.
    • Primary. A replica set always elects a primary (or master), which is the entry point for writing data. When the primary crashes, a new primary is elected automatically.
    • Secondary. A secondary receives replication messages from the primary to stay synchronized. When the primary crashes, one of the secondaries becomes the new primary. Most MongoDB drivers support an option to allow read operations on a secondary; this is not enabled by default. When you allow reads from a secondary, recently written data may not yet have been replicated from the primary, so reads can be stale.
    • Arbiter. An arbiter is a node in a replica set that only participates in elections (deciding who becomes the primary). An arbiter never becomes a primary and no data is replicated to it. An arbiter is needed when you have an even number of data-bearing members, to break ties in elections.
  • Sharding, http://www.mongodb.org/display/DOCS/Sharding+Introduction. Sharding is the partitioning of data among multiple servers in an order-preserving manner.

    With sharding you can divide collections into data chunks that are managed by different database servers. The chunks are defined by shard keys, which look like index definitions. When a shard reaches a specified maximum size limit (for example 1 GB), the data chunks it manages are automatically redistributed over the other shards. Typically a shard is itself a replica set of database servers.
  • Routing processes (mongos). Routing or coordination processes that make the cluster of servers look like a single system to a database driver. You can run multiple mongos instances in parallel; they get their data from the config server and have no relationship with each other. The overhead of a mongos is minimal.
  • getLastError, http://www.mongodb.org/display/DOCS/getLastError+Command, http://www.mongodb.org/display/DOCS/Verifying+Propagation+of+Writes+with+getLastError. By default MongoDB write operations are asynchronous and do not return a response code. To confirm that a write operation succeeded, or to retrieve its error, the getLastError command can be used. In addition, getLastError can ensure that the data has been replicated to a specified number of servers in the replica set (the w parameter). With the fsync parameter of getLastError you can also ensure that the data has been fsynced to the drive. A sketch follows below this list.
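
To illustrate the replica set and getLastError concepts, here is a minimal sketch using the Python driver (pymongo). It uses the modern write-concern API, which is what the getLastError parameters (w, fsync/journal) evolved into; the host names, replica set name, database and collection are assumptions.

    # Sketch: write to a replica set and require acknowledgement from 2 members.
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    # Assumed replica set "rs0" with three members.
    client = MongoClient("mongodb://db1:27017,db2:27017,db3:27017/?replicaSet=rs0")

    events = client["app"].get_collection(
        "events",
        # w=2: the write must be replicated to at least 2 members before it is
        # acknowledged; j=True: it must also be written to the on-disk journal.
        # This corresponds to getLastError with the w and journal/fsync options.
        write_concern=WriteConcern(w=2, j=True),
    )

    events.insert_one({"type": "order-created", "order_id": 42})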

Detailed information

  • MongoDB was designed with a multi-server architecture in mind. The idea is to increase availability and durability by adding more servers to a replica set and/or to a shard. Single-server durability was not a priority until journalling was added in a recent release. To ensure that important data is replicated to at least x servers in a replica set, you have to invoke the getLastError command with the w parameter.
  • When you kill -9 a MongoDB server, you have to run the repair command to get it started cleanly again. With journalling enabled, and with getLastError used for important write operations, this seems to be no problem. CouchDB has a crash-only design with no separate shutdown procedure, so you can kill -9 it without causing a problem; CouchDB will start up normally again.

Questions

  • How to work with update commands on the write side? Is our idea valid?
  • How to test scalability?
  • How to deal with relations?
  • How to handle schema updates in the code?
  • How to make a versioned update, like [update ... where oid = ... and version = 2], and then check how many documents were updated? (A sketch follows after this list.)
  • From version 1.7.5 onwards, journalling support has been added. We hear mixed feedback about the maturity and the functionality of this journalling support; could you share your standpoint with us?
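
As a sketch of what such a versioned (optimistic-locking) update could look like with the Python driver (pymongo); the database, collection and field names are assumptions for illustration.

    # Sketch: update the document only if it still has the expected version,
    # then check how many documents were modified (0 means a concurrent update won).
    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["app"]["aggregates"]  # assumed names
    some_oid = 42  # placeholder for the aggregate's object id

    result = coll.update_one(
        {"_id": some_oid, "version": 2},                         # match oid AND version
        {"$set": {"state": "shipped"}, "$inc": {"version": 1}},  # bump the version
    )
    if result.modified_count == 0:
        raise RuntimeError("Concurrent modification detected; reload and retry")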

Links

Pros

  • BSON (binary JSON) instead of JSON, which optimizes the storage and communication of documents
  • Native drivers for a lot of languages (including Java, C++, C#, PHP, JavaScript, Groovy, etc.)
  • Good support for queries
  • High read and write performance

Cons

  • Less durability (no crash-only design) because writes are not immediately synced to the file system. This makes write performance higher, but data can get lost after a crash. From version 1.7.5 journalling is supported, which provides more durability, but this still seems to be work in progress.

CouchDB

Main characteristics

  • JSON format
  • Erlang
  • Apache license
  • MVCC (Multi-Version Concurrency Control): every update results in a new document revision (version+1). This costs some write performance, but it contributes to CouchDB's crash-only design. It also enables versioned updates: an update only succeeds if you supply the current revision identifier, so you can detect whether another update happened in the meantime (see the sketch after this list).
  • No ad-hoc queries, so indexes have to be created up front
  • General REST interface that can be used for CRUD operations. Several driver projects make the REST interface easier to use.
  • Atomic operations on single documents only; no multi-document transactions
  • Master-master replication, which makes it very well suited for offline work that is later synced back to the database server.
  • No out-of-the-box sharding. Additional frameworks are available for CouchDB that do provide sharding support.
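
A minimal sketch of a versioned (MVCC) update through the REST interface, using Python's requests library; the server URL, database name and document contents are assumptions.

    # Sketch: CouchDB updates must include the current _rev; a stale _rev
    # yields HTTP 409 Conflict, which is how concurrent updates are detected.
    import requests

    base = "http://localhost:5984/mydb"        # assumed local CouchDB and database
    requests.put(base)                         # create the database (412 if it already exists)

    requests.put(base + "/order-1", json={"status": "new"})   # create the document

    doc = requests.get(base + "/order-1").json()              # fetch current _id/_rev
    doc["status"] = "paid"
    resp = requests.put(base + "/order-1", json=doc)          # update, _rev included

    if resp.status_code == 409:
        print("Conflict: the document was updated by someone else first")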

Community

Core concepts

  • Master/master replication
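
Replication between two CouchDB nodes is triggered over the same REST interface. A minimal sketch with Python's requests library; the node addresses and database name are assumptions.

    # Sketch: ask node A to replicate the "mydb" database to node B.
    # With "continuous" set, the replication keeps running and syncs new changes.
    import requests

    requests.post(
        "http://node-a:5984/_replicate",          # assumed local node
        json={
            "source": "mydb",
            "target": "http://node-b:5984/mydb",  # assumed remote node
            "continuous": True,
        },
    )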

Detailed information

  • BigCouch is a fork of the CouchDB project by Cloudant (https://github.com/cloudant/bigcouch), which provides out-of-the-box support for creating clusters without breaking the CouchDB REST API. It seems a lot like the replica set functionality of MongoDB, but with CouchDB all nodes are equal masters, so requests can be handled by any node of a cluster.
  • Sharding is not supported out-of-the-box, but there are frameworks like Lounge and Pillow that provide sharding support for CouchDB; see http://nosql.mypopescu.com/post/683838234/scaling-couchdb

Pros

  • Durable crash-only design, which makes sure no data gets lost after a crash
  • Versioning is natively supported with MVCC
  • Master-master replication for offline work
  • No additional drivers needed due to REST interface

Cons

  • The JSON format and REST interface make network traffic less efficient than a binary protocol
  • Query support is more limited than with MongoDB

Redis

Main characteristics

  • Binary format
  • C
  • BSD license
  • Writes (and also reads) are very fast, because Redis keeps the data set in memory. The in-memory data set is synced to disk at configurable intervals (for example every 60 seconds). A sketch follows after this list.
  • Key-value pair database
  • Native language drivers
  • Transaction support
  • Master-Slave replication
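
A minimal sketch of key-value access and a transaction with the Python client (redis-py); the host and key names are assumptions. The snapshot interval itself is configured on the server (the save directive in redis.conf), not in client code.

    # Sketch: basic key-value operations and a MULTI/EXEC transaction with redis-py.
    import redis

    r = redis.Redis(host="localhost", port=6379)  # assumed local Redis

    r.set("session:42", "alice")                  # simple key-value write (kept in memory)
    print(r.get("session:42"))

    # Transaction: the queued commands are executed atomically (MULTI/EXEC).
    pipe = r.pipeline(transaction=True)
    pipe.incr("orders:count")
    pipe.lpush("orders:recent", "order-42")
    pipe.execute()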

Community

Links

Cassandra

Main characteristics

  • Binary format
  • Java
  • Apache license
  • Cassandra is designed to run on multiple servers, so its write path is built around replicating write actions across the network of servers (see the sketch after this list).
  • Key-value pair database (column families); query options are limited
  • Thrift communication protocol (binary)
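
To illustrate the replicated write path, a minimal sketch with the later CQL-based DataStax Python driver (cassandra-driver); the Thrift API mentioned above is more verbose but follows the same model. The contact points, keyspace and table names are assumptions.

    # Sketch: create a keyspace with replication factor 3 and write at QUORUM,
    # i.e. the write is acknowledged by a majority of the replicas.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement
    from cassandra import ConsistencyLevel

    session = Cluster(["cass1", "cass2", "cass3"]).connect()   # assumed nodes

    session.execute(
        "CREATE KEYSPACE IF NOT EXISTS app "
        "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}"
    )
    session.execute("CREATE TABLE IF NOT EXISTS app.events (id int PRIMARY KEY, type text)")

    insert = SimpleStatement(
        "INSERT INTO app.events (id, type) VALUES (42, 'order-created')",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(insert)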

Community

  • Apache project; DataStax provides support for Cassandra. The PMC includes employees of Facebook, Twitter, LinkedIn and DataStax
  • Active community with around 20 posts per day on the user list
  • Main users: Facebook, Twitter (not for the tweets), Rackspace, Cisco WebEx

Links
