Putting away the difference in languages (C++ for pytst, pure C for ctst), the two projects are quite different with regards to the way data is stored in the tree.
pytst implements a text-book ternary search tree with an AVL balancing algorithm. While I was working on this project, I designed a scanning algorithm that I particularly proud of, until I found out that it was already known as the Aho-Corasick algorithm. This scanning algorithm is handy when you want to match an input text against a huge quantity of search strings (which was the initial use case that led me to develop this).
The problem with pytst is that it is a very straightforward ternary search tree implementation, in that it requires at least a tree node per character per unique prefix in the input strings. Given all the meta data that can reside in a tree node (at minimum, we need the character plus up to three pointers to other nodes, plus one pointer for Aho-Corasick, plus anything required by the memory allocator), there is a huge memory overhead. The algorithms pytst implements may be pretty efficient, but the trade off in memory is quite taxing.
While I was working for Yoono, we decided to store huge quantities of URLs in memory using a TST, since the hierarchical nature of URLs from the same domain was a good fit for this structure. This time, the implementation language was Java, but we developed some pretty nifty tricks to optimize memory usage ; basically I reimplemented malloc/free running on int[], char[] etc. This way we would reduce the load on the Java memory allocation, object header overhead, GC, etc. We could then store millions of node in memory. This was cool, and it worked well, but it was a bit stupid from me the optimize the allocation issues before optimizing what was allocated. Fortunately, a clever colleague from Yoono (Yann Landrin-Schweitzer) took over my work and went further, optimizing the data structure itself based on some discussion he, Laurent Quérel and I had together.
ctst started as a simple rewrite in pure C of pytst, motivated by the difficulties I encountered with compiling the C++ code of pytst with different compilers, and building bindings for Ruby (which I had started using at that time). I also wanted to separate as much as possible the code implementing the algorithms from the code implementing the code storage, which was something we managed to do quite cleanly in Java. The ideas about optimizing the structure came by during the rewrite, so I finally decided to jump ship and implement them in ctst as well.
So how does ctst store data in the tree ? We took some of the ideas from Patricia trees and mixed it with the B-Tree principles. The result is a pretty compact and efficient structure. To illustrate it, let's use a little example.
Suppose we want to store this list of words in ctst :
compacity compact compacted community commuter commuters compute
ctst build this tree (click on the image to zoom) :
( This drawing was done automagically using the dump method, which generate a Graphviz .dot file. )
This gives 11 nodes for 7 strings. I won't waste time to draw the ternary search tree equivalent, so you'll have to trust me, but you would need at least 25 nodes. Of course the node don't contain the same things, but the payload/overhead ratio is highly in favor of ctst.
Of course I haven't re-implemented all pytst algorithms in ctst, but now that I "finally" managed to correct a long-standing but in the codebase, I'll be able to get back at it. So stay tuned !


