NoSQL in the Enterprise

.. or MongoDB for Architects

Is this a schema I see before me?

Welcome to part two. Last time we looked at the experience of getting a NoSQL product accepted in an enterprise environment. Assuming you got through that, the next step is to do something useful with it. Like any tool, you will only get good stuff out if you know how to make the best of it. In this case that means not treating it too much like a relational database and understanding the internal nuances.

For our particular set of requirements we chose MongoDB. We tried Oracle first but either the data model became too unwieldy, which slowed down development, or we were looking at blobs, which lost us query flexibility. We tried CouchDB and these problems went away, but managing constant changes to query semantics wasn’t quite as easy as we would have liked. However, it did appear we were on the right track - our business problem definitely had a NoSQL feel to it. Then, following a recommendation from Sean Reilly, who’d used it before, we gave MongoDB a whirl and everything fell into place. Ironically, from a communications perspective, that turned out to be harder to manage than I expected. MongoDB is becoming really popular and there’s plenty of skepticism around NoSQL anyway. I knew a technical justification would be needed, but it was strange having to constantly make clear that we didn’t make our choice because MongoDB is the new Lady Gaga.

I’m a little uncomfortable defending product choices too strongly. It can be perceived as outright promotion, which this isn’t intended to be. Apart from being a satisfied customer I have nothing to do with MongoDB or 10gen (the company behind MongoDB) and I have particular admiration for CouchDB and Neo4J too. Actually, since you ask, I quite like the whole range of key-value, document, graph, and big-data tools. Part of the attraction is that the communities around them are mostly filled with interesting, funny, and knowledgeable people and this comes through in the products and vendor presentations. NoSQL is not a world of bitter FUD-spreading, not by the vendors anyway: 10gen compares MongoDB with CouchDB and Basho compares Riak with MongoDB in fair, balanced, technical terms.

Healthy rivalry in NoSQL makes a lot of sense because, underneath the excitement and confusing terminology, these tools are quite different. Frankly, the term ‘NoSQL’ is the thing I have an issue with. It isn’t very helpful defining products by what they’re not. But, though it’s not always useful, there is some logic to the classification, as we saw last time when talking about the inside-out to outside-in switch. Emil Eifrem, founder of Neo4J, whose presentations tick all the boxes in the interesting, funny, and knowledgeable categories, said:

The problem in NoSQL is that it’s not clearly defined. Well, that’s not the problem, that’s one of the challenges with NoSQL. People have varying views of this. Another challenge is that NoSQL is extremely hyped right now, so pretty much anyone wants to attach themselves to that term and it’s also that it’s defined by what it’s not - it’s not SQL. You could say “Hey, is this room NoSQL? - It doesn’t support SQL”

Data persistence products are just tools. Relational databases are one category of them and I don’t see what the problem is in having multiple tools at your disposal. Too many is too many, obviously, but one is clearly too few. NoSQL products are really pleasant to work with, more so if you’ve chosen the one that fits your needs properly. Building on a tool that’s nice to use feels good. It eases the path to elegant and simple designs. It makes coming to work fun. What’s not to like about that?

MongoDB: The Basics

MongoDB is a document-based data store, which means its unit of currency is a document. In MongoDB’s case this means JSON documents (internally they are stored and transferred as BSON documents for efficiency), which look like this:

{
	"name"		: "Milton Waddams",
	"age"		: 42,
	"loves"		: [ "cake", "staplers", "fire" ]
}

That very basic structure tells you something about a person (and the kind of movies I like). It’s a form of the master-detail pattern, i.e. there’s a master ‘person’ and some ‘detail’ (about things they love) placed together. Things one person loves may be shared by other people. If you were to put that (very simple) example into a relational database you’d most likely use three tables:

[Table example: the person and their ‘loves’ split across three relational tables]

Retrieving a person and their interests from a relational database requires a join. Retrieving from MongoDB means fetching back the single document. In one scoop. What you appear to have lost at first glance is some control and integrity over how “loves” data is maintained. The relational model enforces standardisation, so that the next person who loves staplers will have their record point to the same row in the “loves” table. With MongoDB (and this is a pattern that comes up again and again), you have choices: you could maintain a list of de-duped ‘loves’ from all person documents in another JSON document and point to it by reference, or you could delegate that management to the application domain layer above. Initially that sounds like a pain, but it’s a pattern that you will only need to solve once, and the payback is that forever more you can interact with people documents in single-scoop (fast) atomic operations.
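As a rough sketch of the reference option (the names here are purely illustrative), each de-duped ‘love’ could live in its own small document and person documents would point at it by id:

{
	"_id"		: "love0001",
	"thing"		: "staplers"
}

{
	"name"		: "Milton Waddams",
	"age"		: 42,
	"loves"		: [ { "_id" : "love0001" }, { "_id" : "love0002" } ]
}

The next person who loves staplers points at the same “love0001” document; the cost is a second query when you want the names of those things back.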

Now, say a new person needs to be added but they have an attribute that doesn’t apply to anyone else (i.e. an extension to existing business functionality). Since there’s no table schema telling you what to do you can go ahead and do just that:

{
	"name"			:	"Bill Lumbergh",
	"age"			:	38,
	"loves"			:	[ "himself" ],
	"douchelevel"	:	9
}

This has no impact on existing documents and, if you want to search on common attributes like ‘name’, your existing query syntax is unaffected too.
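For example (assuming these documents sit in a collection called “people” - querying is covered properly further down):

// finds Bill by name; the query neither knows nor cares about the extra "douchelevel" field
db.people.find({ "name" : "Bill Lumbergh" });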

Understanding JSON documents is relatively trivial. Even die-hard relationalists find that within days it becomes second nature. Designing good documents is the next stage, so let’s take a look at that.

Documents and Collections

Except for a few special cases, MongoDB documents have unique ids. A typical id looks like this:

68cc67093575062e3d95369e

Default ids are 12 bytes long and are generated by your chosen client driver. They are made up of four parts: a timestamp (4 bytes), a machine id (3 bytes), a process id (2 bytes) and a counter (3 bytes). MongoDB ids are kept in the _id field as a BSON ObjectId. You can use other types (e.g. UUIDs) if you so choose, though there are a few rules to follow.
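A couple of illustrative shell snippets (the collection name is an assumption): the generated ObjectId carries its creation time, and you are free to supply your own unique value instead:

// let the driver generate the _id, then read the creation time back out of it
db.people.insert({ "name" : "Milton Waddams" });
db.people.findOne({ "name" : "Milton Waddams" })._id.getTimestamp();

// or provide your own unique _id (a UUID, a natural key, etc.)
db.people.insert({ "_id" : "milton.waddams@initech.com", "name" : "Milton Waddams" });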

The simplest way to link MongoDB documents to one another is using ids (as in the example further down). Whether or not to link documents, or store their combined data together, is an application specific concern.

Here are some factors to help decide:

  1. Only link documents if your clients query them separately more often than they do so combined. Most of the time you are trading off performance (getting all you need in one scoop) against generality (storing separate document types separately).

  2. Split into multiple documents if combined documents would be unusually large. MongoDB documents can be really big (16Mb in the current version), but big documents mean fewer of them fit in RAM at one time, so always look to keep them as small as possible. This includes key/field names: it sounds odd, but the size of large document collections can be reduced considerably by using very short field names, and small size generally improves memory management.

  3. Separate into multiple documents if you are going to change the size of those combined documents frequently. MongoDB pads documents with a little space to allow for growth. If that padding gets used up then it may have to move the document to a different place in the store, which has an overhead to it. Newly updated fields are then located at the end of the new document. Padding is optimised by MongoDB keeping track of how often it needs to move documents around. So rather than add more data to an existing document frequently it may be better to add the new data as a new document.

We kept master-detail data together until the point it clearly didn’t belong, which was indicated by our automated performance tests.

The canonical example of documents and links for MongoDB seems to have become the “blog post and comments” study.

For example, you could design one document structure for blog posts, and another for comments on those posts, meaning each blog post document would need to maintain an array of links to the comment documents relating to it (or comment documents maintain a link to the post they pertain to):

{
	"_id"		:	"0001",						// ids are simplified for clarity
	"type"		:	"blogpost",
	"author"	:	"Milton Waddams",
	"title"		:	"Freedom from the Tyranny of Schemas",
	"date"		:	"30th July 2011",
	"content"	:	"Time flies - it was nearly two years ago that I wrote ..",
	"tags"		:	[ "architecture", "business" ],
	"comments"	:	[
						{ "_id"	:	"1001" },
						{ "_id" :	"1002" }
					]
}

{
	"_id"		:	"1001",
	"type"		:	"comment",
	"author"	:	"Bill Lumbergh",
	"date"		:	"1st August 2011",
	"comment"	:	"Milt, we're gonna need to go ahead and move you downstairs."
}

{
	"_id"		:	"1002",
	"type"		:	"comment",
	"author"	:	"Milton Waddams",
	"date"		:	"2nd August 2011",
	"comment"	:	"Excuse me, I believe you have my stapler... "
}

Or, you could be more scoop-efficient and combine them:

{
	"_id"		:	"0001",
	"author"	:	"Milton Waddams",
	"title"		:	"Freedom from the Tyranny of Schemas",
	"date"		:	"30th July 2011",
	"content"	:	"Time flies - it was nearly two years ago that I wrote ..",
	"tags"		:	[ "architecture", "business" ],
	"comments"	:	[
						{
							"author"	:	"Bill Lumbergh",
							"date"		:	"1st August 2011",
							"comment"	:	"Milt, we're gonna need to go ahead and move you downstairs."
						},
						{
							"author"	:	"Milton Waddams",
							"date"		:	"2nd August 2011",
							"comment"	:	"Excuse me, I believe you have my stapler... "
						}
					]
}

The first example is what you might do to replicate something close to a relational database. It would work, but to build an HTML page containing a post and its comments you’re going to make multiple calls to the database, and it feels a bit joiny. The second gets you the page in one scoop and fits with the outside-in concept discussed last time. And notice that those embedded documents don’t have ids now. You could add them but they would just be arbitrary (and useless) fields to MongoDB.

If we had opted for option 1 above then it would make sense to separate posts from comments, so that when we searched for posts we didn’t have to trawl through the comments as well (we could assume that there will be many more comments than posts). MongoDB supports this through collections. Collections partition data within the database - in this case you would have a collection called “posts” and one called “comments”. When querying one you won’t (can’t) get access to the other. This makes collections a bit like tables. Be careful if using a lot of collections though as there’s a default namespace limit of 24,000 (collections and indexes both count towards this). The limit can be raised with the nssize command line option.
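As a sketch of what the two designs mean at read time (assuming collections called “posts” and “comments” for the linked version):

// option 1: linked documents - one call for the post, another for its comments
post     = db.posts.findOne({ "_id" : "0001" });
ids      = post.comments.map(function(c) { return c._id; });
comments = db.comments.find({ "_id" : { $in : ids } });

// option 2: embedded comments - the whole page arrives in one scoop
post = db.posts.findOne({ "_id" : "0001" });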

If all the documents in one collection adhered to the same schema then it would feel very like a table. Here’s a screen shot of the posts collection in MongoVue, a Windows MongoDB client, looking for all the world like a regular table:

[Screenshot: the posts collection viewed in MongoVue]

I used MongoVue quite a bit in early presentations. A lot of questions seemed to fall away once people saw a familiar-looking interface with familiar-looking data sitting in it. I guess there’s a common misconception out there that NoSQL databases munge up your data in arcane ways.

Sub collections are permitted too (e.g. posts.milton, posts.bill), though these are for syntactic convenience only.

Another collection feature is capped collections - special pre-allocated (by size and/or number of documents), fixed-size storage areas. Capped collections are convenient if you are happy with the constraints (can’t delete or grow the size of documents and eventually the earlier documents will be overwritten with newer ones). Capped collections have very stable write speeds because there’s no need for dynamic space allocation.
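Creating one is a single shell command (the name and sizes here are arbitrary):

// pre-allocate a 100MB capped collection holding at most 10,000 documents
db.createCollection("activity_log", { capped : true, size : 104857600, max : 10000 });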

Documents and collections are stored in databases (there’s a surprise). One MongoDB instance can support multiple databases simultaneously. By convention the currently selected database is denoted by “db” in shell commands. Databases are accessed and deleted using terms that will be familiar to most:

show dbs					// list available database names

use officespaceblog			// select the blog post database

db.dropDatabase()			// delete the blog post database

Before covering how to interact with documents and collections in more detail it’s worth taking a short detour. Prior to using MongoDB directly I’d heard a few myths about its reliability, particularly in respect of losing data. I’m not sure why confusion exists about how the MongoDB server works - the source is available to browse and it’s well commented.

Let’s take a quick peek under the bonnet.

The Server

MongoDB is written in C++. The core server daemon (mongod) is very small. With so much less to do than a conventional RDBMS this is something MongoDB has in common with many other NoSQL products. It’s a pleasant experience when you first download it and realise there isn’t some arcane landscape of files and directories to understand. There’s a bin directory and everything’s in there - the main daemon, the shard manager process (mongos), a monitoring process (mongostat), a command-line query shell (mongo) and some tools for managing data imports and the like.

Here’s the list of tools taken from the README on github:

mongodump       - MongoDB dump tool - for backups, snapshots, etc..
mongorestore    - MongoDB restore a dump
mongoexport     - Export a single collection to text (JSON, CSV)
mongoimport     - Import from JSON or CSV
mongofiles      - Utility for putting and getting files from MongoDB GridFS
mongostat       - Show performance statistics

It’s fair to say that the server works in an unusual way when compared to traditional databases. Some of its features have implications for your design.

Here are three parts to understand:

  • Memory Mapped Files

    When MongoDB starts up it expects to know about the location for its data files (the default is /data/db) and it maps these files into memory. This is the most important part of the product to get your head around. If you’ve spent much time dealing with operating systems, or been a C/C++ coder, then the leap is easy, but for those used to bespoke data management schemes with lots of configuration it can seem a bit odd at first. The consequence of memory mapping data files is that the operating system is in control of how they are managed (not entirely of course but certainly in ways that you need to know about).

    The good thing about this approach is that operating systems are optimised for working out which bits to have in physical memory and which bits to leave in virtual memory at any one moment. The other good thing is that MongoDB has a lot less to do, which means there’s less application code to go wrong.

    The interesting thing is that MongoDB can’t control the order in which data is persisted to disk. In earlier versions this meant that single-server MongoDB instances weren’t durable in the case of a crash, but since v1.7.5 it has had write-ahead journalling: write operations are recorded in a journal before the change is applied to the data files, and the journal files on disk allow fast crash recovery. Another interesting by-product is that you are limited in practice to about 1.5Gb of data on 32-bit machines. The way client connections, memory mapped files, syncing to disk, and journalling interrelate is important to understand so you make good choices in your app design.

    On a happy system most (or even all) of your data would be in physical memory. But because systems run other processes too, they’re going to fight for that RAM, with the OS acting as referee. A big data set and a lot of fighting for memory can create a lot of disk activity as least-recently used pages give way to other tasks. A simple rule of thumb here is don’t run other things on the same box as MongoDB. Give it plenty of memory (as much as you can afford) and use the fastest (local, not network mounted) disks available so when things do get busy it has the best chance of settling down again quickly. Other aspects of IO include the data files themselves, which are pre-allocated by default (initially you get a 16Mb file, then a 32Mb file, and so on up to 2Gb), and journalling.
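    For what it’s worth, a bare-bones single-instance start-up is as simple as this (the path is illustrative and defaults have shifted between versions, so treat it as a sketch):

    # run from the operating system shell: a dedicated fast local disk and journalling switched on
    mongod --dbpath /data/db --journal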

  • Locking

    Certain write operations, for obvious reasons, acquire a lock in order to proceed. Writes in MongoDB are fast but single-threaded (i.e. only one write at a time). This makes sense because we’re dealing with memory mapped files and coordinating multiple writers would be complex to manage. The write lock is global, as in across the instance. In practice locks last for such short periods of time that they have little consequence. The mongostat command shows the percentage of time this lock is active. If you need to support high write-rates then you may need to think about sharding your data across multiple instances (locks only affect one instance at a time).

    Having said that, we’ve yet to need to think about sharding, though it’s nice to know it’s available. I’d advise not sharding until you know you need to. Premature optimisation can lead to issues of its own:

    Programmers waste enormous amounts of time thinking about, or worrying about, the speed of non-critical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

    So sayeth the great Donald Knuth. Let your tests tell you that write speed and global locking are an issue, and then shard.

    Locking gets better in each release. It may be reduced to a per-collection thing soon - there’s a pending change due at some point, plus more and more operations yield locks when it’s responsible to do so.
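    As an aside on the lock statistics mentioned above: if you’d rather not run mongostat, similar numbers are available from the shell via serverStatus (exact field names vary a little between versions):

    // globalLock reports, among other things, total time and time spent holding the write lock
    db.serverStatus().globalLock;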

  • Persistence

    When data gets persisted is an intricate subject with MongoDB. I’ll try to explain it here in simple terms.

    The diagram below is a snapshot of a moment in time. There are three clients (applications) using connections to exchange data with the memory mapped files (which have been simplified for clarity) via the mongod process (not shown). Clients X and Y are sharing a pooled connection and client Z has its own. A point to note is that on some systems the memory allocated for connections can be quite large.

    There are three documents being updated: A (green), B (purple) and C (turquoise). Before the snapshot all documents had values denoted by V0. Documents are colored red where potential data loss may occur if the system crashed.

    [Diagram: MongoDB overview - clients, connections, the journal and the memory-mapped files at a moment in time]


First a happy path - Client X updated document A to V1, which is reflected in its connection. This was saved to the journal (which flushes all [group commits](http://www.mongodb.org/display/DOCS/Journaling#Journaling-GroupCommits) to disk approximately every 100ms) and added to the memory mapped file, which is in turn flushed to disk every 60 seconds ([configurable](https://docs.mongodb.com/manual/reference/program/mongod/#cmdoption-syncdelay)). The journal entry is crossed out because it's no longer needed if a server crash occurred.

Another happy path - Client Y updated document B to V1, which is reflected in the connection it shares with Client X. The update has also been written out to the journal but the memory mapped file has not yet been written out to disk. If there's a server crash we're OK because the journal will get played and all will be well.

Path with options - Client Z has updated document C to V1. Its connection has accepted the asynchronous write but it hasn't yet been persisted anywhere. If the server crashes, or someone kills the mongod process with a -9, that data is lost. The choices are, in increasing order of paranoia:
  1. Take the risk for this one transaction type. Not always a bad option if the data’s not valuable.

  2. Wait for the journal write using getLastError with the j parameter. Note that getLastError can also be used to make sure different connections see written data consistently.

  3. Use fsync to write all files to disk. When journalling is on this actually just waits until the next journal write.

  4. Where you have multiple nodes in a replica set you can insist that data is sent to more machines before the operation returns (using the w parameter), though you need to be careful that w isn’t greater than the number of nodes up at that moment otherwise the call will just wait.

Basically you have a lot of choices and you want to think through your requirements for each type of write operation. If you deploy a write-heavy application that saves business-critical data to a single node and you do not use journalling, then there’s a real risk (in the long term a certainty) that you will lose information in the case of a hard server crash. But if you run a production environment in that way then one might say you haven’t really thought much about anything.
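In shell terms, options 2 to 4 look roughly like this (the values are illustrative; the language drivers expose the same settings through their own write-concern APIs):

// option 2: wait until the write has reached the journal on this server
db.runCommand({ getlasterror : 1, j : true });

// option 3: ask for an fsync (with journalling on this just waits for the next journal write)
db.runCommand({ getlasterror : 1, fsync : true });

// option 4: wait until the write has reached at least two replica set members, give up after 5 seconds
db.runCommand({ getlasterror : 1, w : 2, wtimeout : 5000 });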

There are plenty of tuning options in MongoDB but how safe your data is and whether your application flies along, or stutters and chokes, is going to be down to application design, document design, operating system tweaks and how you deploy it. That’s not a good or a bad thing - it’s no different in principle than it would be with many databases, it just means you go looking for answers in different places.

Hard part over. Let’s relax for a moment and talk about the other parts.

Interacting with MongoDB

  • Getting Stuff Out

    One of the killer MongoDB features for me, and it seems for many people, is the query model. Most applications read data more than they write it, so having a convenient and powerful mechanism to get at your data is pretty important. In MongoDB you query documents using documents, so passing:

    // All docs with a key of "age" set to 42		
    { "age" : 42 }								
    

    to the find command retrieves all the documents that contain the “age” key with the value of 42. This isn’t hard to understand if you come from a SQL background because it’s a lot like:

    // The second * here refers to "the current collection", which could be 'people'		
    SELECT * FROM * WHERE age = 42;				
    

    To return only certain fields (rather than whole documents) you also use documents.

    With a 1 if it is to be returned:

    // returns the name, but nothing else, of all people aged 42		
    { "age" : 42 }, { "name" : 1 }				
    

    or a zero if you explicitly don’t want it:

    // returns everything except the private_data key		
    { "age" : 42 }, { "private_data" : 0 }		
    

    It’s also easy to sort results, limit result sets to a fixed size, skip documents (for paging), use regular expressions, conditional operators or aggregate results using count (number of documents that matched the query) or group results by fields that they contain. The full signature of the find command is:

    db.collection.find(query, fields, limit, skip);
    
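    A few of those chained together, as a sketch against the people documents used earlier:

    // the second page of ten people aged 42, names only, in reverse name order
    db.people.find({ "age" : 42 }, { "name" : 1 }).sort({ "name" : -1 }).skip(10).limit(10);

    // how many people aged 42 there are in total
    db.people.find({ "age" : 42 }).count();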

    There’s a useful side effect of queries being based on documents. If you had a document with a key like this:

    { ...
        "relatedDocs" : [
            { "id" : "0001" },
            { "id" : "0042" },
            { "id" : "0678" }
        ]
    }
    

    You can clearly use the contents of “relatedDocs” to submit subsequent queries to get the connected documents. Document ids are used to refer to individual documents, but you might want to refer to a whole group of documents:

    { ... 
        "people_aged" : { "age" : 42 }
    }
    

    By using the value of the “people_aged” key you can get all documents where “age” is 42. This is very like an RDF triple: with a subject (document you are looking at), a predicate (people aged 42, or “friends_with”, “related_to”, “sold_with”, etc) and an object (list of documents that match the relationship criteria).
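    As a sketch (the field names are assumed), the stored criteria document can be passed straight back to find:

    // follow the stored relationship by using the saved criteria as the query itself
    subject = db.people.findOne({ "name" : "Milton Waddams" });
    db.people.find(subject.people_aged);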

    Another neat feature is geospatial queries (get nearest X items to a location, get all items within a radius of Y, etc). Once you’ve set up the appropriate index, the query syntax is quite natural.
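    Creating the index is a one-liner (a sketch, assuming a people collection with a location field):

    // a basic 2d geospatial index on the location field
    db.people.ensureIndex({ "location" : "2d" });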

    // find the (100 by default) people nearest to the statue of liberty
    db.people.find({ location : { $near : [40.6900, -74.0444] }});	
    
  • Getting Stuff In

    The simplest way to add a new document to the database is just to build it up and pass it in to the save command:

    db.people.save({"name": "Milton Waddams", "age": 42, "loves": [ "cake", "staplers", "fire" ]});
    

    This is an example from the interactive shell; different language drivers may implement the syntax in slightly different ways. Mongoid, for example, a rather elegant Ruby object-mapper for MongoDB, is more model-based, e.g.:

    p = Person.new(name: "Milton Waddams", age: 42, loves: [ "cake", "staplers", "fire" ])
    p.save
    

    An interesting, but very useful, feature is the “upsert”. If the document saved above included an id field then MongoDB looks to see if the document already exists (because ids are unique) and updates it, or adds a new document if not.
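    A sketch of that in the shell:

    // the first save inserts; saving again with the same _id updates rather than duplicates
    milton = { "_id" : "person0001", "name" : "Milton Waddams", "age" : 42 };
    db.people.save(milton);

    milton.age = 43;
    db.people.save(milton);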

  • Changing Stuff

    Now that we can design documents, divide them up into collections, save them and query them (plus understand how these choices play out through the internal operations), the only thing left to do is manage our data by changing it and re-saving it. For this we need the update command, which can take four parameters:

  1. criteria - specifies the document we want to make a change to. This might just be the id but can in fact be anything. If multiple documents match, though, only the first will actually get modified unless you use the ‘multi’ option.

  2. changes - the new fields you want to add/change in the matched document(s) or a sub-command indicating the change you wish to make. Notable sub-commands include: $inc (increment a numeric field by a defined amount), $push (add a value to the end of an array, or start an array containing the value), and $rename (change a field name).

        // change age to 43 ($set changes just that field, leaving the rest of the document intact)
        db.people.update( { "name" : "Milton Waddams" }, { "$set" : { "age" : 43 } } );			
    
  3. upsert? - a true/false that makes the save upsert functionality discussed above more explicit

  4. multi? - allow multiple updates

    i.e. if multiple documents match the criteria then MongoDB will attempt to apply the change to all of them. Each operation will be atomic but, because each document change is distinct (or more accurately the lock may be yielded periodically), other concurrent write operations may be interwoven, which could affect one or more of the matched documents, with last-write-wins. Remember too that some changes may fail for certain reasons (e.g. an array push to a field that is not an array in one of the matching documents). You can make this a bit more isolated by using the $atomic operator:

        // all people aged 42 become 43
        db.people.update( { "age" : 42 , $atomic : 1 }, { "$set" : { "age" : 43 } }, false , true );
    

    this will make the update pseudo atomic - i.e. no concurrent/conflicting writes will be allowed while it’s happening (need to be careful using this with very many documents though because of the lock).

    For some writes (decrement stock level for instance) you really need to know that two writes didn’t both try to change a value such that invalid data remains in the document. If we have one copy of a book in stock and no back orders possible then two clients should not both be able to add it to their basket. For this there’s a pattern called ‘update if current’, which uses a neat trick with the update criteria: you fetch a document, change it (remove the last book from stock), and then write it back only if the fields you are interested in (stock level) haven’t been changed in the meantime:

    // find a book, assumes isbn is unique, with at least 1 in stock
    book = db.books.findOne({ "isbn": 12345, "in_stock" : { $gt: 0 } });

    // remember current stock level, e.g. 1
    book_stock = book.in_stock;

    // decrement stock level, e.g. to 0
    --book.in_stock;

    // matches (and so writes) nothing if the stock level has been changed in the meantime
    db.books.update({"isbn" : 12345, "in_stock" : book_stock}, book);
    

That about covers the basics. There’s a lot more to explore in the online manual.

Building with MongoDB

Here are a few things we learned about getting the best out of MongoDB in development.

MongoDB works really well if you are following agile practices. Playing stories in short sprints with incremental database changes can be an onerous job. MongoDB’s lazy creation of fields and collections (if you refer to one that’s not there it gets created) makes for easy testing because it puts the code in control of the data rather than the other way around. We followed quite strict TDD, CI and QA practices so were able to write minimum amounts of test-passing code, which we could then safely grow and refactor as we understood our architectural needs better. Because we also adopted automated performance tests (nightly runs that put MongoDB under load) we could tune and adapt as we went. I would strongly recommend this approach; up-front design with a tool like MongoDB is quite risky. Far better to keep it simple, test often, and change your design incrementally in a highly-controlled way. This makes for the best kind of emergent architecture and dramatically reduces the chances of nasty surprises in production. I would do this with any database but it’s particularly important in this case.

We made the choice to expose our application services via REST. It turned out that MongoDB supported this easily too. Collections and documents fit nicely with resources as do the semantics of interacting with them, and JSON on the inside plus JSON as the default content-type made it easier still.

Commercialising MongoDB

In an enterprise organisation the story doesn’t stop at deployment. For commercial support we engaged 10gen once development had got to the point where we knew we would be going live (I had thought about doing that immediately we started, but with MongoDB’s popularity I didn’t want to sound like another maybe proposition that would waste their time). 10gen develop MongoDB and provide a commercial wrapper in the shape of support, training and health checks so you can get your designs underwritten by the same team that creates the product. One of the best things about 10gen is you’re never too far from an engineer if you need help. Plus they’re a likeable lot and very easy to work with.

The MongoDB license structure is in two parts - the core server is covered by the GNU AGPL v3.0 and the drivers by the Apache License v2.0. Commercial support comes at a cost (can’t say what we paid obviously, but it’s very competitive indeed considering what you get).

Like most innovative tools these days MongoDB is open source. In enterprise IT open source is still a fairly novel concept and, to some, an untrusted concept. This goes back to what I said last time about the curious situation of having to promote a database tool to your IT department and sometimes to your business. I can understand the fear of moving away from the comforting feel that a well-heeled account manager from a commercial organisation brings. I know that comforting feeling is misplaced and that in reality you can make just as big a mess with an expensive product as a free one. But to be successful you can’t ignore the fear, even if you believe it to be unfounded.

Many years ago I wrote up some thoughts on what I called the Third Way of software sourcing - instead of two choices: buy expensive stuff and tailor the bejesus out of it, or write it all yourself, I suggested a third option for corporates is to engage with open source communities in a more respectful way. That is, don’t see open source as ‘free’ (and risky) but as an asset to use and contribute back to. You get a great head start and mitigate risks by helping resolve defects and improve the software. The community benefits by having a brand name to associate with and it keeps the OSS project alive. We’ve tried to follow this with MongoDB and have submitted a few fixes where we found the need. We couldn’t release our main application to the community for intellectual property reasons, and no doubt hard-core open sourcers would have an issue with that, but we did release back what we could - build tools such as our schema validator have no direct competitive advantage, so that’s going to be available for others to use and extend. In turn we can benefit from that work. It’s a slow process but one with good outcomes for everyone, and I hope it goes some way to changing attitudes, because it’s a much better way to work than handing over millions for a license fee and only then starting the torturous process of customising a product that was supposed to do everything ‘out of the box’.

Scale and Growth

This subject will be covered later. The primary tools for scaling in MongoDB - Sharding (to scale out writes) and Replica Sets (to scale out reads) - haven’t given us much to say yet. Given where we are with our projects right now I suspect the more interesting challenges around scale are yet to come and I’d rather talk about that having gone through it in production. There are plenty of references out there to cover this (some in the notes below) for now.

Summary

I hope that’s provided some food for thought. I’ve covered quite a bit of ground and also left out some big areas (I’ve not covered map-reduce, indexing, cursors, gridFS or authentication for example).

In closing I would say that there’s no doubt in my mind that NoSQL works in the enterprise if you follow good development practices. If you can work responsibly with the teams that build them then so much the better. MongoDB was a great choice for our requirement. It’s been easy to work on and our weekly demos to the product owner have gone down extremely well. Sometimes we’ve surprised even ourselves at how quickly we can deliver new features. There are many cases where I would not use a NoSQL solution, but plenty of others where I would now say it’s a perfectly valid and long-term approach.

Notes

  • For real-life experiences it’s hard to beat Boxed Ice posts and of course Foursquare who have probably done more than anyone to make MongoDB a contender in this space. Also check out Mathias Meyer’s write-ups.

  • A great source of quality information on MongoDB is Kristina Chodorow’s blog. Kristina is a 10gen engineer and works on the MongoDB core as well as the PHP and Perl drivers. She also co-wrote the Definitive Guide.

  • The IBM Developerworks site has quite a nice intro to MongoDB which also covers how to get started on Windows and Mac.

  • Ethan Gunderson has a good post on two of the gotchas with MongoDB. It’s a bit out of date now but still worth a read.

  • The picture, taken by me, is of a statue of George III, known as the “copper horse”, in Windsor Great Park, England. He was the first British monarch to embrace and understand science. It’s not known whether he had any strong views on relational databases.
