It’s been a few months since my last release of an official version of
pytst. Seeing that someone in Japan uses it made me have a look at the project once again. If I remember correctly, the latest evolutions where mostly refactorings in the hope that I could come up with on-disk storage of the tree, plus a few experiment on my full text indexing system based on the TST (which I won’t release yet). So, I’m releasing the latest version, having made sure that it compiled correctly for Python 2.3, 2.4 and 2.5 on Win32 and Ubuntu Linux (I only tested Python 2.4 on Linux)
I’ve been pretty busy lately, what with the fact that I now have TWO regular jobs (both theoretically being at 50%) : I’m still CTO at The CRM Company, but I am now VP Product & Marketing (and developer ) at Yoono on top of that. Interesting times…
My work on TSTs has been used at Yoono ; my friend (and now boss) Laurent Quérel and I have gone a bit further with TSTs and came up with a kind of TST which can store multiple characters per tree node. Any time some part of a prefix can be shared by multiple keys, the part is fully stored in the node and not duplicated, thus saving a lot of unnecessary nodes and navigation. Though it surely exists somewhere, we haven’t seen it anywhere else, so we decided to simply call this kind of TST a CompactTST. What’s interesting in this TST is that it has a very low memory overhead : the indexing structure only adds 10% to the bulk volume of the data (string key + integer value in our case) it contains.
As I’ve wrote previously, my work on Yoono made me switch back to writing Java. Hence, the CompactTST is written in Java, but with a low-level edge : like I’ve did in C++ for pytst, we didn’t want to build a Java instance per tree node. Our solution is a bit rough but it does work perfectly : we store nodes in a
byte, manually managing the layout of nodes and the storage of their data in the byte array. The result is pretty fast, very compact and the GC doesn’t work at all.
This gave me the idea of reusing this method to store a big number of objects in memory. Now that we have 64 bit servers with 8 Gb memory, why not make the most of it ? So currently I’m on my way to storing tens of millions of objects in memory through my own memory allocator system, implemented on top of multiple byte arrays, without any problem caused by memory or GC overhead. Of course, this means that I can’t rely on the GC anymore, but it’s the whole point anyway. We now have a way to store millions of
List<String> in an efficient way. I’m sure die hard Java developers would cringe at what we are doing, but it’s fun and worthwhile.
I’m still planning on writing
pytst 2.0, this time with a proper storage layer which will make me able to switch from memory-based to disk-based storage. Now that we have a neat CompactTST in Java, I’ll take the opportunity of porting the algorithms to C++ at the same time. It’s only a matter of finding the time to do it, which is easier said than done…