Archive for the ‘Uncategorized’ category
Twitter, again
Back in 2009 I wrote an article against Twitter. Well, it’s time to admit I was wrong. OK, some of my points are still valid, but Twitter has improved a lot in two years:
- Retweet system integration (announced one day after my original post!)
- Integrated URL shortening showing original URLs on the Web interface
- New Twitter handling pictures and videos as first-rate content
- Better infrastructure, rarer fail whale sightings, Twitter API status transparency…
But the final argument for coming back to Twitter, for me, is this:
Google Reader lost its sharing features, promoting +1s. +1s are not social (try to find all your friends +1 in the same place). Twitter FTW.
— Nicolas Lehuen (@nlehuen) December 21, 2011
What’s new in Java 1.6.0_21?
Why, a new Oracle-branded Java Web Start splash screen, of course!
Before (up until 1.6.0_20):

After:
It takes time to get used to it…
To be fair, this update contains truly interesting things, like a new version of the HotSpot VM (17.0), a better-performing VisualVM and tons of bug fixes.
Check the release notes (notice the improvements in the Java web site since Oracle took over). You can compare them with the previous ones and draw whatever conclusions you want.
Pure bliss with MongoDB
I’ve been playing with MongoDB lately, and this morning I came across this blog post from Eliot Horowitz, showing how you can stream Twitter into MongoDB in a single command. How cool is that?
What’s interesting here is that you get an awesome testbed for MongoDB, since the stream delivers around 30 highly structured items per second (this is the sample stream; the full Twitter stream, which is apparently 20 times faster, requires special credentials). A few minutes after your first download of MongoDB, you have enough data to experiment with various things.
The first thing I played with was replication. Eliot shows how to set up a master instance dedicated to soaking up the Twitter stream, while a slave instance asynchronously receives the tweets from the master and processes them.
Master-slave replication in MongoDB is very simple to set up, as demonstrated by Eliot: you start the master server with the --master command-line switch, and the slave server with --slave --source <master address>. That’s all; the slave will eventually replicate all the data from the master. For large databases, you can preload the slave with a dump of the master and it will only replicate what’s new since the dump (given the right command-line switch).
To use MongoDB, you can write code in your language of choice (provided a MongoDB driver exists for it). But you can also use the MongoDB JavaScript shell, which puts a lot of other databases’ CLIs to shame.
Next, of course, I played with queries and updates. Queries are pretty rich, with lots of useful JSON-aware operators (more on that later). MongoDB has some update facilities that look extremely useful, like atomically incrementing a number or adding an item into a set.
The only thing with queries and updates is that the query expressions may look and feel a little weird at first. But there is a solid and coherent design which ensures that, once you’ve understood its principles, it all feels very simple and natural.
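To give a feel for what such update expressions look like, here is a minimal sketch in plain JavaScript. The update documents use the shape of MongoDB's $inc and $addToSet operators, but applyUpdate and the sample document are hypothetical: this is a toy in-memory emulation, not MongoDB code, and it does not model atomicity at all.

```javascript
// Toy in-memory emulation of two MongoDB update operators.
// applyUpdate is a hypothetical helper written for illustration only;
// it is not part of MongoDB and does not model atomicity.
function applyUpdate(doc, update) {
  // $inc: add a number to a field, creating the field if it is missing
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    doc[field] = (doc[field] || 0) + amount;
  }
  // $addToSet: append a value to an array only if it is not already present
  for (const [field, value] of Object.entries(update.$addToSet || {})) {
    if (!Array.isArray(doc[field])) doc[field] = [];
    if (!doc[field].includes(value)) doc[field].push(value);
  }
  return doc;
}

// Hypothetical document counting mentions of a keyword
const counter = { _id: "nosql", mentions: 1, users: ["alice"] };
applyUpdate(counter, { $inc: { mentions: 1 }, $addToSet: { users: "alice" } });
applyUpdate(counter, { $inc: { mentions: 1 }, $addToSet: { users: "bob" } });
console.log(counter);
```

In the real shell, the same update document would be passed as the second argument of an update call on a collection; the point of the sketch is only that the update is described as data rather than written as code.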
Things become very interesting when you delve into the indexing features. For example, after a few minutes of reading the documentation, I could build a geospatial index of the tweets and query them:
// ... the first time, wait ~6 seconds for 475,000 tweets ...
db.twitter.ensureIndex({"geo.coordinates":"2d"});
// get the 3 tweets nearest to my house
db.runCommand( { geoNear : "twitter" , near : [48.858009,2.451625], num : 3} );
MongoDB knows how to index structured values like arrays or maps. For instance, the index stores each value of an array separately, so that you can query for the presence of a single value, or of any or all values from a list. This means that you can easily build your own full-text index:
// the tokenizer; in a real-world text index you would be much more clever
tokenize = function(text) {
    // match() returns null when nothing matches, so fall back to an empty array
    return text.toLowerCase().match(/\w+/g) || [];
};
// a query that gets all tweets with some text (yes, there are tweets without text)
tweets_with_text = function() {
    return db.twitter.find({text:{$exists:true}});
};
// this function stores all tokens extracted from the text into a document field
indexTweet = function(tweet) {
    tweet.tokens = tokenize(tweet.text);
    db.twitter.save(tweet);
};
// Let's go! This may take a little while...
tweets_with_text().forEach(indexTweet);
// Build the token index
db.twitter.ensureIndex({tokens:1});
// And now we can query the index!
// This returns the text from 5 tweets containing the word "nicolas"
db.twitter.find({tokens:"nicolas"},{text:1}).limit(5)
// This returns the full data from 5 tweets containing the words "mongodb" or "nosql"
db.twitter.find({tokens:{$in:["mongodb","nosql"]}}).limit(5)
// Returns tweets with "mongodb" AND "nosql"
db.twitter.find({tokens:{$all:["mongodb","nosql"]}}).limit(5)
// Later, we can incrementally update the index:
// we only need to index the documents without tokens.
db.twitter.find({text:{$exists:true},tokens:{$exists:false}}).forEach(indexTweet);
// Of course we can mix & match queries:
// Get all tweets containing "paris" near my home
db.twitter.find({tokens:"paris","geo.coordinates":{$near:[48.858009,2.451625]}},{text:1}).limit(5)
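To see why indexing each array value separately makes the $in and $all queries above cheap, here is a small sketch in plain JavaScript (runnable outside the MongoDB shell). It is only a toy model of the idea, with made-up sample documents and hypothetical helper names (lookup, matchAny, matchAll); it says nothing about how MongoDB actually stores its indexes.

```javascript
// Toy model of an index over array values: token -> set of document ids.
// This only illustrates the query semantics; it is not how MongoDB stores indexes.
const tweets = [
  { _id: 1, tokens: ["mongodb", "is", "fun"] },
  { _id: 2, tokens: ["nosql", "and", "mongodb"] },
  { _id: 3, tokens: ["hello", "nosql"] },
];

// Build the index: one entry per distinct token, pointing at every document containing it
const index = new Map();
for (const tweet of tweets) {
  for (const token of tweet.tokens) {
    if (!index.has(token)) index.set(token, new Set());
    index.get(token).add(tweet._id);
  }
}

// Ids of the documents containing a given token (empty set if there are none)
const lookup = (token) => index.get(token) || new Set();

// $in semantics: union of the per-token sets
function matchAny(tokens) {
  const result = new Set();
  for (const token of tokens) lookup(token).forEach((id) => result.add(id));
  return [...result].sort((a, b) => a - b);
}

// $all semantics: intersection of the per-token sets
function matchAll(tokens) {
  const [first, ...rest] = tokens.map(lookup);
  if (!first) return [];
  return [...first]
    .filter((id) => rest.every((set) => set.has(id)))
    .sort((a, b) => a - b);
}

console.log(matchAny(["mongodb", "nosql"])); // ids of tweets with either word
console.log(matchAll(["mongodb", "nosql"])); // ids of tweets with both words
```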
But wait, there’s more! MongoDB implements MapReduce and supports sharding. So this could pave the way for a new world of scalable yet easy computing. I’ll get back to this in a future post.
A lot of features in MongoDB are still a work in progress. Lots of improvements are expected in concurrency, replication, features like full-text search (because the above hack is, well, a hack), and performance.
For instance, the embedded JavaScript interpreter is not thread-safe, which means concurrent server-side JavaScript execution is not possible yet (it seems there are plans to support V8, which would be awesome). As a consequence, the current MapReduce implementation doesn’t support parallelism on a single machine, which prevents the efficient use of multiple CPU cores. But the project is in active development, so watch the roadmap for future improvements.
To conclude, don’t be shy: download MongoDB now, use Eliot’s Twitter trick to fill a database, and have fun experimenting with one of the most interesting NoSQL databases out there!
FriendFeed might actually benefit from Google Wave
With the current buzz about Google Wave, it is difficult not to feel sorry for FriendFeed. The common ground between Wave and FriendFeed is indeed pretty rich: real-time communication, threaded conversations, lightweight social networking… Some people even suggest that there is some irony in seeing Google rip off good ideas from FriendFeed, the latter having been founded by ex-Googlers.
However, there is something peculiar about Wave: apart from the very Googly UI, the infrastructure is meant to be open and decentralized. Wave can be seen as a way to collaboratively edit an XML document (the wavelets, which are the basic communication nodes found in waves). The point is that this XML document can be replicated on many providers’ infrastructures, and updates are propagated through extensions of the XMPP protocol (XMPP, also known as Jabber, is an instant messaging protocol that Google Talk also uses, with extensions).
Check out this paper about Google Wave Federation Infrastructure.
So, we are not really in a winner-takes-all scenario. Now, if I were FriendFeed, I’d have two choices.
1) Decide this is a Fire and Motion move from Google, aimed at first at Microsoft (to take the wind out of Bing), then Facebook, then FriendFeed. Keep away from this game, and slowly become an awesome app no one uses, dedicated to geeks and power users, and watch people have all the fun out there on Wave or Facebook.
2) Embrace the Google Wave Federation Protocol and build gateways and proxies that enable FriendFeed users to participate in Google Wave discussions without leaving their favorite UI. Of course this means that FriendFeed might become a victim of the Fire and Motion scheme: Google adds new stuff to Google Wave, FriendFeed runs to support it, then Google adds new stuff, and on and on, which means that FriendFeed could lose its technical leadership in the market. However, there is certainly room for more than one Wave provider, and if this new way of communicating becomes mainstream, FriendFeed could at last find a way out of Geekland and reach a wider audience than it has now. It could then launch a feature war with Google on the social networking or real-time search front, both fronts on which they have already proven that they can do very well.
So, depending on the choice they make now, I think Google Wave could actually be good news for FriendFeed, turning their awesome power-app into the second player in a potentially huge new market.
A bright idea!
This morning in the metro, a magnificent marketing operation by Groupe France Mutuelle to promote their FC Santé product.
Everyone was handed a box made of 100% non-recycled cardboard, containing a generous dose of 100% genuine fair-trade petroleum foam, in which nestled a marvel of a little GiFiMi (the group’s mascot) in pure plastic of the same metal.
What a lovely thought for the holiday season! One can picture thousands of unloved little nephews or ungrateful daughters-in-law being palmed off with this ersatz Bibendum at Christmas. The scheme must be fearsomely effective: with the little not-made-of-foam man perched on top of the Christmas tree, who could hesitate to sign up for a health plan with a “franchise cautionnée”, a deposit-backed deductible (read: if you get sick, it’s your fault and you pay; otherwise we give you back a small part of your contributions)?
Of course, this gave the station a most fashionable facelift (Mairie de Montreuil in this case, but apparently other stations suffered the same fate):

These photos are only the tip of an iceberg of ripped-open boxes overflowing from every bin in the station, in the corridors, and so on.
Judging by the signature on the flyer found in the box, it is The Marketingroup and its sense of the relationship that engineered this success.
The best part was without question the event side of the operation: the moment when, at the height of rush hour, on a packed platform, the driver of a train that had stayed a few minutes too long at the station asked everyone to get off “because of a suspicious package on the train”. In the end it was a false alarm (and I don’t know whether it was related), but one thing is certain: you immediately look with more respect at these hundreds of cardboard boxes, each of which could hold a respectably sized block of plastic explosive.
Ecological, targeted, effective and civic-minded: what more could you ask for?
PullXML renamed to PushXML, hosted on Google Code
As the API for PushXML is definitely not a pull API (it is based on a callback instead), I’ve renamed PullXML to PushXML.
I have also decided to host this small project on Google Code, so that I can have a public Subversion repository, issue tracking and so on.
Here is the link to the project and here is a direct link to the latest version of PushXML.
PullXML: an XML Pull Parser for PHP 5
You’ve got a big, big XML file to parse. You need to do it in PHP. Well, you’re in deep trouble. Unless I’ve missed something, here are the different alternatives I’ve seen, none of which satisfied me:
- Use SimpleXML. Not possible here because your big, big XML file won’t fit into memory.
- Use DOM or DOM XML. Same problem, the file won’t fit into memory, PLUS this time you get a notoriously crappy API.
- Use XMLReader. No memory problem this time. However, the API is awkward; maybe it gets better using a combination of calls to XMLReader::expand() and XMLReader::next(), but then again it’s back to the crappy DOM API.
- Use the SAX-like streaming XML parser. This works, with no memory problem, but then again it’s pretty awkward: you have to implement a stack-based machine to do anything remotely useful if the XML document is a little bit complicated.
Well, after messing around with PHP for a few hours (man, this language is soooo weird! I miss Python…), I came up with PullXML, an XML Pull Parser. It works in PHP 5; it could be ported to PHP 4, but frankly this is not something I look forward to doing.
PullXML is implemented with the SAX-like streaming parser. It builds objects that look like SimpleXML objects, but instead of loading the whole document in memory, it builds them chunk by chunk and calls a callback you provide when a chunk is ready.
The chunks are delimited by the pivot, which is a simplified XPath expression that gives the path that each chunk must match. For example, if the pivot is /foo/bar, then PullXML will call your callback for each bar element that is in a foo element, including the content of the bar element, of course (otherwise this would be quite useless). But the best way to see how it works is to have a look at the source code and experiment with the example at the end.
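For readers who want the idea without opening the source, here is a rough sketch of pivot-based chunking, written in JavaScript rather than PHP. The event objects stand in for what a SAX-like parser would emit, and the names (pushChunks, the node fields) are assumptions made for this illustration; it is not the PushXML source code and ignores attributes, namespaces and anything beyond a plain absolute path.

```javascript
// Hypothetical sketch of pivot-based chunking on top of SAX-like events.
// The event shapes and the function name are assumptions for illustration;
// this is not the PushXML source code.
function pushChunks(events, pivot, callback) {
  const path = [];  // names of the currently open elements
  const stack = []; // nodes being built; empty while outside a chunk
  for (const event of events) {
    if (event.type === "open") {
      path.push(event.name);
      const inChunk = stack.length > 0;
      const atPivot = "/" + path.join("/") === pivot;
      if (inChunk || atPivot) {
        const node = { name: event.name, text: "", children: [] };
        if (inChunk) stack[stack.length - 1].children.push(node);
        stack.push(node);
      }
    } else if (event.type === "text") {
      // Text outside of a chunk is simply dropped
      if (stack.length > 0) stack[stack.length - 1].text += event.value;
    } else if (event.type === "close") {
      if (stack.length > 0) {
        const node = stack.pop();
        // The chunk is complete once its root element is closed
        if (stack.length === 0) callback(node);
      }
      path.pop();
    }
  }
}

// Events for <foo><bar><id>1</id></bar><baz/><bar>second</bar></foo>
const events = [
  { type: "open", name: "foo" },
  { type: "open", name: "bar" },
  { type: "open", name: "id" },
  { type: "text", value: "1" },
  { type: "close", name: "id" },
  { type: "close", name: "bar" },
  { type: "open", name: "baz" },
  { type: "close", name: "baz" },
  { type: "open", name: "bar" },
  { type: "text", value: "second" },
  { type: "close", name: "bar" },
  { type: "close", name: "foo" },
];

const chunks = [];
pushChunks(events, "/foo/bar", (chunk) => chunks.push(chunk));
console.log(chunks.map((chunk) => chunk.name)); // one entry per bar element
```

Only the nodes of the chunk currently being built are kept in memory, which is the whole point of the approach: the callback sees each bar element with its content, while foo and baz are never materialized.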
I’m no expert in PHP and I haven’t used SimpleXML much, so this is probably quite buggy for the moment. Yet, it already does the job as expected on one of my projects. If you have any remarks about the source code or any suggestions for better compatibility with SimpleXML, feel free to leave me a comment.
Ideas Worth Sharing and the Gapminder
This presentation from Hans Rosling is a definite must-see. The message about world health is enlightening, and the visualisation tools are impressive. This is one of the coolest things I’ve seen in a long time, I mean in a “we should have thought of it before” way.
The great news is that you can do your best Hans Rosling impersonation by using the tools his Gapminder venture developed. They are available here, here and here.
Switched to DotClear 2 beta 6
Hi! This is my first post under DotClear 2 beta 6. I’ve apparently succeeded in migrating from DotClear 1.2.5.
I only had a small problem with permalinks being moved from /index.php/2006/... to /index.php/post/2006/..., but I solved it by making a full export, tinkering with the exported file using regexps, then performing a full import. Problem solved!
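For the record, the kind of regexp rewrite involved can be sketched as follows. The permalink change is the one described above, but the exact pattern (a four-digit year right after /index.php/), the function name and the sample lines are assumptions for illustration; the real export file format is not shown here.

```javascript
// Rewrites old-style permalinks (/index.php/2006/...) to the new
// /index.php/post/2006/... form. The four-digit-year pattern and the sample
// lines are assumptions for illustration, not the actual export format.
function migratePermalinks(text) {
  return text.replace(/\/index\.php\/(\d{4})\//g, "/index.php/post/$1/");
}

const sample = [
  "http://example.org/index.php/2006/11/05/hello-world",
  "http://example.org/index.php/post/2006/11/05/already-migrated",
].join("\n");

console.log(migratePermalinks(sample));
```

Because the pattern requires the year to follow /index.php/ directly, links that already contain /post/ are left alone, so running the rewrite twice is harmless.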
Now I need to customize the templates and re-add all the annoying widgets that my former blog had (or not). But that will be for another day, because I’ve got some work to do right now.
In any case, kudos to the DotClear 2 development team for this great release!
