Nicolas Lehuen's Weblog

To content | To menu | To search

Sunday, April 19 2009

The difference between pytst and ctst

Putting away the difference in languages (C++ for pytst, pure C for ctst), the two projects are quite different with regards to the way data is stored in the tree.

pytst implements a text-book ternary search tree with an AVL balancing algorithm. While I was working on this project, I designed a scanning algorithm that I particularly proud of, until I found out that it was already known as the Aho-Corasick algorithm. This scanning algorithm is handy when you want to match an input text against a huge quantity of search strings (which was the initial use case that led me to develop this).

The problem with pytst is that it is a very straightforward ternary search tree implementation, in that it requires at least a tree node per character per unique prefix in the input strings. Given all the meta data that can reside in a tree node (at minimum, we need the character plus up to three pointers to other nodes, plus one pointer for Aho-Corasick, plus anything required by the memory allocator), there is a huge memory overhead. The algorithms pytst implements may be pretty efficient, but the trade off in memory is quite taxing.

While I was working for Yoono, we decided to store huge quantities of URLs in memory using a TST, since the hierarchical nature of URLs from the same domain was a good fit for this structure. This time, the implementation language was Java, but we developed some pretty nifty tricks to optimize memory usage ; basically I reimplemented malloc/free running on int[], char[] etc. This way we would reduce the load on the Java memory allocation, object header overhead, GC, etc. We could then store millions of node in memory. This was cool, and it worked well, but it was a bit stupid from me the optimize the allocation issues before optimizing what was allocated. Fortunately, a clever colleague from Yoono (Yann Landrin-Schweitzer) took over my work and went further, optimizing the data structure itself based on some discussion he, Laurent Quérel and I had together.

ctst started as a simple rewrite in pure C of pytst, motivated by the difficulties I encountered with compiling the C++ code of pytst with different compilers, and building bindings for Ruby (which I had started using at that time). I also wanted to separate as much as possible the code implementing the algorithms from the code implementing the code storage, which was something we managed to do quite cleanly in Java. The ideas about optimizing the structure came by during the rewrite, so I finally decided to jump ship and implement them in ctst as well.

So how does ctst store data in the tree ? We took some of the ideas from Patricia trees and mixed it with the B-Tree principles. The result is a pretty compact and efficient structure. To illustrate it, let's use a little example.

Suppose we want to store this list of words in ctst :

compacity
compact
compacted
community
commuter
commuters
compute

ctst build this tree (click on the image to zoom) :

Compacity test

( This drawing was done automagically using the dump method, which generate a Graphviz .dot file. )

This gives 11 nodes for 7 strings. I won't waste time to draw the ternary search tree equivalent, so you'll have to trust me, but you would need at least 25 nodes. Of course the node don't contain the same things, but the payload/overhead ratio is highly in favor of ctst.

Of course I haven't re-implemented all pytst algorithms in ctst, but now that I "finally" managed to correct a long-standing but in the codebase, I'll be able to get back at it. So stay tuned !

Thursday, April 2 2009

pytst 1.17

pytst 1.17 is out !

The source code can be fetched using git. If you prefer tarballs, head to the downloads page, where you'll also find Win32 binaries.

Included in this release :

  • Support for 64 bits architectures (tested under Linux only) - thanks to Thomas Brox Røst for the patch.
  • The test script (in python/test/test.py) is now self-contained, no more nasty references to a "tcc" module which was used at my company. Thanks to Paul Harrington for the prodding about this :), and for our first fork / merge test.

BTW, github is really, really cool. The only thing missing is an issue tracker. Maybe in a future release ?

Wednesday, April 1 2009

pytst now hosted on github

I've created an account on github and ported my private SVN repository to a public git repository. Now you can fork on the project as you want and eventually ask me to pull your modifications (using the "send pull request" functionality). The wiki will be a good place to document the library.

Hopefully this is the beginning of a new life cycle for pytst, I keep getting bugs/enhancement requests from time to time but I haven't enough spare time to address them.

Here is the repository home page : http://github.com/nlehuen/pytst/

Thursday, February 5 2009

pytst 1.16

I've just released pytst 1.16, you can download it there.

This release fixes an annoying bug that occurs if you used CallableAction or CallableFilter. If your Python callback functions raised a Python exception, the whole process crashed. This meant that you had to catch all Python exceptions in the callback, which was not always handy. The exceptions now behave as expected, that is to say they are passed from the callback to the Python calling code through the C++ layer.

Thanks to Keith Davidson who reported this bug. Keith also reported a problem with NULL characters inside the keys : keys seem to be handled as NULL-terminated strings. It looks like there is a problem in the SWIG layer, since the C++ code doesn't assume strings are NULL-terminated. I'll have a look at this problem ASAP.

Wednesday, April 2 2008

pytst 1.15

I've just released pytst 1.15, you can download it there.

This release fixes a quite important problem introduced on 2006/05/05... For a reason I don't quite remember now, I decided to make the TST C++ template specialization dedicated to the Python bindings privately inherit from the more generic tst<...> template. I guess this was related to some problems I had with SWIG.

The problem is that this broke something in the way SWIG handled inheritance between the two types, the net result being that the Python TST class lost its prefix_match methods and other methods that were declared in the parent class but not in the specialization.

I've reverted the inheritance between the two C++ classes back to public inheritance, and everything is back to normal now.

I really wished I could spend more time working on ctst, since it features theoretically better structures and algorithms, plus to be frank I'm quite fed up with C++, and I'd be more than happy to go back to C.

Also, I must say that the native API for Ruby is very impressive. I've been developing the Ruby bindings for ctst by hand, not using SWIG, and I've spent substantially less time than it took to have SWIG properly handle my C++ code from pytst...

- page 1 of 7