Pure bliss with MongoDB
I’ve been playing with MongoDB lately, and this morning I came across this blog post from Eliot Horowitz showing how you can stream Twitter into MongoDB with a single command. How cool is that?
What’s interesting here is that you get an awesome testbed for MongoDB, since the stream delivers around 30 highly structured items per second (this is the sample stream; to get the full Twitter stream, which is apparently 20 times faster, you need special credentials). A few minutes after your first download of MongoDB, you have enough data to experiment with various things.
The first thing I played with was replication. Eliot shows how to set up a master instance dedicated to soaking up the Twitter stream, while a slave instance asynchronously receives the tweets from the master and processes them.
Master-slave replication in MongoDB is very simple to set up, as demonstrated by Eliot: you start the master server with the --master command-line switch, and the slave server with --slave --source <master address>. That’s all; the slave will eventually replicate all the data from the master. For large databases, you can preload the slave with a dump of the master and it will only replicate what’s new since the dump (given the right command-line switch).
To use MongoDB, you can write code in your language of choice (provided a MongoDB driver exists for it). But you can also use the MongoDB JavaScript shell, which puts the CLIs of a lot of other databases to shame.
Next, of course, I played with queries and updates. Queries are pretty rich, with lots of useful JSON-aware operators (more on that later). MongoDB has some update facilities that look extremely useful, like atomically incrementing a number or adding an item into a set.
The only catch with queries and updates is that the query expressions may look and feel a little weird at first. But there is a solid and coherent design behind them, which ensures that once you’ve understood its principles, it all feels very simple and natural.
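To give a feel for the shape of these update expressions, here is a minimal in-memory sketch in plain JavaScript of what an increment and an add-to-set update do to a single document. The `applyUpdate` helper and the `retweets` and `tags` fields are hypothetical and exist only for this illustration. `$inc` and `$addToSet` are the MongoDB operator names, but the sketch never talks to a server and says nothing about how MongoDB makes these updates atomic.

```javascript
// Hypothetical helper: applies a tiny subset of MongoDB-style update
// operators to a plain object, in memory. Not MongoDB's implementation.
function applyUpdate(doc, update) {
  // $inc: add the given amount to a numeric field (a missing field starts at 0).
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    doc[field] = (doc[field] || 0) + amount;
  }
  // $addToSet: append the value to an array field only if it is not already there.
  for (const [field, value] of Object.entries(update.$addToSet || {})) {
    if (doc[field] === undefined) doc[field] = []; // missing field: start empty
    if (!doc[field].includes(value)) doc[field].push(value);
  }
  return doc;
}

// Assumed example document; the field names are made up.
const tweet = { text: "Playing with MongoDB", retweets: 2, tags: ["nosql"] };

// In the shell, an update document of this shape would be passed as the
// second argument of db.twitter.update(<query>, <update>).
applyUpdate(tweet, { $inc: { retweets: 1 }, $addToSet: { tags: "mongodb" } });
applyUpdate(tweet, { $addToSet: { tags: "mongodb" } }); // no duplicate is added
```

The point of the operator style is that the update document describes a change ("add one", "add if absent") rather than a new value, so the client does not have to read the document first.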
Things become very interesting when you delve into the indexing features. For example, after a few minutes of reading the documentation, I could build a geospatial index of the tweets and query them:
// ... the first time, wait ~6 seconds for 475,000 tweets ...
db.twitter.ensureIndex({"geo.coordinates":"2d"});
// get the 3 nearest tweets from my house
db.runCommand( { geoNear : "twitter" , near : [48.858009,2.451625], num : 3} );
MongoDB knows how to index structured values like arrays or maps. For instance, the index stores each value in an array so that you can query for the presence of a value, or any or all values from a list. This means that you can easily build your own full-text index:
// the tokenizer; in a real-world text index you would be much more clever
tokenize = function(text) {
    // match() returns null when nothing matches, so fall back to an empty array
    return text.toLowerCase().match(/\w+/g) || [];
};
// a query that gets all tweets with some text (yes, there are tweets without text)
tweets_with_text = function() {
    return db.twitter.find({text:{$exists:true}});
};
// this function stores all tokens extracted from the text into a document field
indexTweet = function(tweet) {
    tweet.tokens = tokenize(tweet.text);
    db.twitter.save(tweet);
};
// Let's go! This may take a little while...
tweets_with_text().forEach(indexTweet);
// We build the token index
db.twitter.ensureIndex({tokens:1});
// And now we can query the index!
// This returns the text from 5 tweets containing the word "nicolas"
db.twitter.find({tokens:"nicolas"},{text:1}).limit(5)
// This returns the full data from 5 tweets containing the words "mongodb" or "nosql"
db.twitter.find({tokens:{$in:["mongodb","nosql"]}}).limit(5)
// Returns tweets with "mongodb" AND "nosql"
db.twitter.find({tokens:{$all:["mongodb","nosql"]}}).limit(5)
// Later, we can incrementally update the index:
// we only need to index the documents without tokens.
db.twitter.find({text:{$exists:true},tokens:{$exists:false}}).forEach(indexTweet);
// Of course we can mix & match queries:
// Get all tweets containing "paris" near my home
db.twitter.find({tokens:"paris","geo.coordinates":{$near:[48.858009,2.451625]}},{text:1}).limit(5)
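The idea behind indexing an array field can be sketched in plain JavaScript, without a server: every value in the array gets its own index entry pointing back at the document. The `buildTokenIndex`, `findAny`, and `findAll` helpers and the sample tweets below are hypothetical and only mimic the behaviour of the `$in` and `$all` queries shown above; this is an illustration of the principle, not a description of how MongoDB stores its indexes internally.

```javascript
// Same naive tokenizer as in the shell session, returning [] when nothing matches.
function tokenize(text) {
  return text.toLowerCase().match(/\w+/g) || [];
}

// Hypothetical helper: map each token to the set of ids of the documents containing it.
function buildTokenIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const token of tokenize(doc.text)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(doc.id);
    }
  }
  return index;
}

// Ids of documents containing at least one of the tokens (like $in).
function findAny(index, tokens) {
  const result = new Set();
  for (const token of tokens) {
    for (const id of index.get(token) || []) result.add(id);
  }
  return [...result].sort((a, b) => a - b);
}

// Ids of documents containing every one of the tokens (like $all).
function findAll(index, tokens) {
  if (tokens.length === 0) return [];
  const sets = tokens.map((token) => index.get(token) || new Set());
  return [...sets[0]]
    .filter((id) => sets.every((s) => s.has(id)))
    .sort((a, b) => a - b);
}

// Made-up sample data standing in for the twitter collection.
const sampleTweets = [
  { id: 1, text: "MongoDB is a NoSQL database" },
  { id: 2, text: "I love NoSQL" },
  { id: 3, text: "Paris in the spring" },
];

const tokenIndex = buildTokenIndex(sampleTweets);
const anyIds = findAny(tokenIndex, ["mongodb", "nosql"]); // tweets 1 and 2
const allIds = findAll(tokenIndex, ["mongodb", "nosql"]); // tweet 1 only
```

Because each token is looked up directly, neither query has to scan the text of every tweet, which is what makes the tokens field worth indexing.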
But wait, there’s more! MongoDB implements MapReduce and supports sharding. So this could pave the way for a new world of scalable yet easy computing. I’ll get back to this in a future post.
A lot of features in MongoDB are still a work in progress. Lots of improvements are expected in concurrency, replication, features like full-text search (because the above hack is, well, a hack), and performance.
For instance, the embedded JavaScript interpreter is not thread-safe, which means concurrent server-side JavaScript execution is not yet possible (it seems there are plans to support V8, which would be awesome). As a consequence, the current MapReduce implementation doesn’t support parallelism on a single machine, which prevents the efficient use of multiple CPU cores. But the project is in active development, so watch the roadmap for future improvements.
To conclude, don’t be shy: download MongoDB now, use Eliot’s Twitter trick to fill a database, and have fun experimenting with one of the most interesting NoSQL databases out there!