pytst 0.99

The new release is available from the usual place. What’s new? Well, above the surface, there is a new contains(string) method which tells you whether the tree contains a given string or not. Another new thing is a Win32 binary release of pytst built with the Boost.Python library, which runs 25% faster. If you’re interested in building a binary version of this for another platform, write to me. I’ve been told that the latest (CVS) SWIG version is much better optimised, but alas, there is no Win32 binary version as of today and I’m too busy to try to build one.
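To make the membership test concrete, here is a minimal sketch of what a contains-style lookup could look like on a toy ternary search tree. It assumes the tree behind pytst is a ternary search tree, as the name suggests; the `TinyTST` and `Node` classes and their `add` method are hypothetical stand-ins for illustration, not pytst's actual implementation or API.

```python
class Node:
    """One character of a key, with three children (hypothetical layout)."""
    __slots__ = ("ch", "lo", "eq", "hi", "is_end")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_end = False  # True when a stored key ends on this node


class TinyTST:
    """Toy ternary search tree; a sketch, not pytst itself."""

    def __init__(self):
        self.root = None

    def add(self, key):
        if not key:
            raise ValueError("empty keys are not supported in this sketch")
        self.root = self._add(self.root, key, 0)

    def _add(self, node, key, i):
        ch = key[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._add(node.lo, key, i)
        elif ch > node.ch:
            node.hi = self._add(node.hi, key, i)
        elif i + 1 < len(key):
            node.eq = self._add(node.eq, key, i + 1)
        else:
            node.is_end = True
        return node

    def contains(self, key):
        """Return True only if exactly this key was added before."""
        node, i = self.root, 0
        while node is not None and key:
            ch = key[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            elif i + 1 < len(key):
                node, i = node.eq, i + 1
            else:
                return node.is_end
        return False
```

Note that in this sketch a prefix of a stored key is not reported as present: after adding "hello", `contains("hel")` is false unless "hel" was added too.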

Under the hood, I’ve been doing a bit of source code refactoring, in preparation for a new on-disk storage implementation. To start with, I plan on using Berkeley DB in Recno mode, since it’s quite close to an on-disk vector. If everything goes well, I’ll then try to write the storage layer by myself (of course I don’t think I’ll achieve the same level of features and quality as the one provided by Sleepycat).
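The "on-disk vector" idea can be sketched without Berkeley DB at all: if every node is a fixed-size record, record number n simply lives at byte offset n times the record size. The following is only an illustration under that assumption; the record layout, the `NodeFile` class and the `NO_CHILD` marker are hypothetical, and this is neither Berkeley DB's Recno format nor pytst's storage code.

```python
import os
import struct

# Hypothetical record layout: character code, three child record numbers,
# and an end-of-key flag. "<" means little-endian with no padding.
NODE = struct.Struct("<IiiiB")
NO_CHILD = -1  # marker for "no child here" (an assumption of this sketch)


class NodeFile:
    """Toy vector of fixed-size node records stored in a plain file."""

    def __init__(self, path):
        # Open for reading and writing, creating the file if needed.
        mode = "r+b" if os.path.exists(path) else "w+b"
        self.f = open(path, mode)

    def __len__(self):
        # Number of records = file size / record size.
        self.f.seek(0, os.SEEK_END)
        return self.f.tell() // NODE.size

    def append(self, ch, lo=NO_CHILD, eq=NO_CHILD, hi=NO_CHILD, is_end=False):
        """Write a new record at the end and return its record number."""
        recno = len(self)  # also leaves the file position at the end
        self.f.write(NODE.pack(ord(ch), lo, eq, hi, int(is_end)))
        return recno

    def read(self, recno):
        """Fetch one record by number, touching only that part of the file."""
        self.f.seek(recno * NODE.size)
        code, lo, eq, hi, flag = NODE.unpack(self.f.read(NODE.size))
        return chr(code), lo, eq, hi, bool(flag)

    def close(self):
        self.f.close()
```

With such a layout, child pointers become record numbers instead of memory addresses, so a lookup only needs to read the handful of records along its path rather than the whole tree.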

Sadly, it looks like my design for the node storage manager interface is a bit clunky… To do it well, I’ll have to refactor so many things that I’m beginning to think about a full rewrite (yeah, I know, it’s bad). In that case, I’ll release the current version as 1.0 and start over on a 2.0 branch, fully API-compatible but with the new on-disk storage feature.

Note that there already is some support for disk storage of a tree, but it’s cold storage: you populate the tree in memory, save it to disk, then reload it from disk into memory later on. The problem is that the tree is entirely loaded into memory, which limits the usage to huge trees (because you can hold one heck of a huge tree in 1 GB of RAM), not letting us delve into the realm of mega huge trees :). A real on-disk storage would allow me to page parts of the tree in and out between the disk and RAM.

Another note: I’m currently using pytst to experiment with a kind of full text index that supports fast incremental search. So far I’ve got a working prototype implemented as a layer of Python code relying on pytst. My experiments with a set of files from Project Gutenberg are very satisfying. I can’t wait to have an on-disk storage to be able to compete with Google :).
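For readers wondering what "incremental search" means here, the rough idea is that every extra keystroke narrows a prefix query against the indexed words. The sketch below illustrates that idea only; it swaps in a sorted list and the standard `bisect` module where the prototype relies on pytst, and the `PrefixIndex` class and its method names are hypothetical.

```python
import bisect
from collections import defaultdict


class PrefixIndex:
    """Toy full text index answering 'which documents contain a word
    starting with this prefix?'. A stand-in sketch, not the pytst prototype."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of document ids
        self.words = []                   # sorted list of distinct words

    def add_document(self, doc_id, text):
        for word in text.lower().split():
            if word not in self.postings:
                bisect.insort(self.words, word)  # keep the word list sorted
            self.postings[word].add(doc_id)

    def search(self, prefix):
        """Return the ids of documents with at least one word matching prefix."""
        prefix = prefix.lower()
        hits = set()
        # All words sharing the prefix sit next to each other in sorted order.
        start = bisect.bisect_left(self.words, prefix)
        for i in range(start, len(self.words)):
            word = self.words[i]
            if not word.startswith(prefix):
                break
            hits |= self.postings[word]
        return hits
```

Typing "w", then "wh", then "wha" would call `search` three times, each call scanning a smaller run of neighbouring words.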

Oddly, when I tried to port my prototype to full C++ (with a Boost.Python interface), I ended up with a slower index… Even if Python’s dict is pretty fast, there must be something I don’t understand about the C++ STL, so I bought Scott Meyers’ Effective STL to help me there.