Archive pour la catégorie ‘pytst’
pytst 1.18
I’ve just released pytst 1.18 which fixes a bug reported by Keith Davidson. The bug was in the scan method and appeared in some very specific yet totally mundane circumstances : if you scanned an input string of a very specific length with regard to the content of the tree, then the first scan would be ok, but the next ones not. This was something that was not in the unit tests and that managed to survive many years before being reported.
The code for the scan method (in tst.h) was one of the most complicated piece of code in this project, and I had quite a hard time figuring out what was happening. In the process, I began to realise that a rewrite would be necessary, so I cleaned the slate and entirely re-designed the algorithm, this time with careful planning.
While doing this, I realised that the old algorithm was much more wrong than I thought. There is a catch with the new algorithm, however : it has to do some backtracking. That is to say, in some condition, the algorithm has to go back and read again some part of the input string.
The old algorithm never had to look back – but the old algorithm would miss some matches, if you had dictionary entries that included multiple smaller dictionary entries. For example, in a dictionary with {'111','222','111222111'}, if you provided the input string '11122211' (notice the missing 1 at the end), then you would have a match on '111', then a non match on '22211'. The match on '222' would be missed.
For the moment, I prefer to release pytst 1.18 with an algorithm that is correct yet not optimal, and try to find a non-backtracking solution later. From my measurements, the impact on performance is around 10% but YMMV.
For the record, the new algorithm looks like this :
There is backtracking in to places : (J) and (K) where the input string index i is modified. To prevent backtracking, I have to find a way to analyse the tree so that I emit any match included in the current state then change to a proper state. It’s what Aho-Corasick is supposed to do, but I need to have a better look at it since I obviously got it wrong the first time.
You can download pytst 1.18 here.
The difference between pytst and ctst
Putting away the difference in languages (C++ for pytst, pure C for ctst), the two projects are quite different with regards to the way data is stored in the tree.
pytst implements a text-book ternary search tree with an AVL balancing algorithm. While I was working on this project, I designed a scanning algorithm that I particularly proud of, until I found out that it was already known as the Aho-Corasick algorithm. This scanning algorithm is handy when you want to match an input text against a huge quantity of search strings (which was the initial use case that led me to develop this).
The problem with pytst is that it is a very straightforward ternary search tree implementation, in that it requires at least a tree node per character per unique prefix in the input strings. Given all the meta data that can reside in a tree node (at minimum, we need the character plus up to three pointers to other nodes, plus one pointer for Aho-Corasick, plus anything required by the memory allocator), there is a huge memory overhead. The algorithms pytst implements may be pretty efficient, but the trade off in memory is quite taxing.
While I was working for Yoono, we decided to store huge quantities of URLs in memory using a TST, since the hierarchical nature of URLs from the same domain was a good fit for this structure. This time, the implementation language was Java, but we developed some pretty nifty tricks to optimize memory usage ; basically I reimplemented malloc/free running on int[], char[] etc. This way we would reduce the load on the Java memory allocation, object header overhead, GC, etc. We could then store millions of node in memory. This was cool, and it worked well, but it was a bit stupid from me the optimize the allocation issues before optimizing what was allocated. Fortunately, a clever colleague from Yoono (Yann Landrin-Schweitzer) took over my work and went further, optimizing the data structure itself based on some discussion he, Laurent Quérel and I had together.
ctst started as a simple rewrite in pure C of pytst, motivated by the difficulties I encountered with compiling the C++ code of pytst with different compilers, and building bindings for Ruby (which I had started using at that time). I also wanted to separate as much as possible the code implementing the algorithms from the code implementing the code storage, which was something we managed to do quite cleanly in Java. The ideas about optimizing the structure came by during the rewrite, so I finally decided to jump ship and implement them in ctst as well.
So how does ctst store data in the tree ? We took some of the ideas from Patricia trees and mixed it with the B-Tree principles. The result is a pretty compact and efficient structure. To illustrate it, let’s use a little example.
Suppose we want to store this list of words in ctst :
compacity compact compacted community commuter commuters compute
ctst build this tree (click on the image to zoom) :
( This drawing was done automagically using the dump method, which generate a Graphviz .dot file. )
This gives 11 nodes for 7 strings. I won’t waste time to draw the ternary search tree equivalent, so you’ll have to trust me, but you would need at least 25 nodes. Of course the node don’t contain the same things, but the payload/overhead ratio is highly in favor of ctst.
Of course I haven’t re-implemented all pytst algorithms in ctst, but now that I « finally » managed to correct a long-standing but in the codebase, I’ll be able to get back at it. So stay tuned !
pytst 1.17
pytst 1.17 is out !
The source code can be fetched using git. If you prefer tarballs, head to the downloads page, where you’ll also find Win32 binaries.
Included in this release :
- Support for 64 bits architectures (tested under Linux only) – thanks to Thomas Brox Røst for the patch.
- The test script (in python/test/test.py) is now self-contained, no more nasty references to a « tcc » module which was used at my company. Thanks to Paul Harrington for the prodding about this
, and for our first fork / merge test.
BTW, github is really, really cool. The only thing missing is an issue tracker. Maybe in a future release ?
pytst now hosted on github
I’ve created an account on github and ported my private SVN repository to a public git repository. Now you can fork on the project as you want and eventually ask me to pull your modifications (using the « send pull request » functionality). The wiki will be a good place to document the library.
Hopefully this is the beginning of a new life cycle for pytst, I keep getting bugs/enhancement requests from time to time but I haven’t enough spare time to address them.
Here is the repository home page : http://github.com/nlehuen/pytst/
pytst 1.16
I’ve just released pytst 1.16, you can download it there.
This release fixes an annoying bug that occurs if you used CallableAction or CallableFilter. If your Python callback functions raised a Python exception, the whole process crashed. This meant that you had to catch all Python exceptions in the callback, which was not always handy. The exceptions now behave as expected, that is to say they are passed from the callback to the Python calling code through the C++ layer.
Thanks to Keith Davidson who reported this bug. Keith also reported a problem with NULL characters inside the keys : keys seem to be handled as NULL-terminated strings. It looks like there is a problem in the SWIG layer, since the C++ code doesn’t assume strings are NULL-terminated. I’ll have a look at this problem ASAP.
pytst 1.15
I’ve just released pytst 1.15, you can download it there.
This release fixes a quite important problem introduced on 2006/05/05… For a reason I don’t quite remember now, I decided to make the TST C++ template specialization dedicated to the Python bindings privately inherit from the more generic tst<...> template. I guess this was related to some problems I had with SWIG.
The problem is that this broke something in the way SWIG handled inheritance between the two types, the net result being that the Python TST class lost its prefix_match methods and other methods that were declared in the parent class but not in the specialization.
I’ve reverted the inheritance between the two C++ classes back to public inheritance, and everything is back to normal now.
I really wished I could spend more time working on ctst, since it features theoretically better structures and algorithms, plus to be frank I’m quite fed up with C++, and I’d be more than happy to go back to C.
Also, I must say that the native API for Ruby is very impressive. I’ve been developing the Ruby bindings for ctst by hand, not using SWIG, and I’ve spent substantially less time than it took to have SWIG properly handle my C++ code from pytst…
I wish I learned Japanese
When I was young (I’m talking days of yore, here, about ten years ago) and innocent, I was an engineering student at the INSA (Institut National des Sciences Appliquées, meaning National Institute for Applied Sciences). I was there for 5 years, and two of the best teachers I had were not math or physics teachers, but English teachers. Their courses were very important to me because they not only taught English, but also how to make a clear presentation and manage meetings. I’m using this every day, now. If I compare this to the tons of math I’ve learned (sometimes at 30 hours a week doses, the rest being computer science), which I’m using at somewhere near 5% now, the yield of those English courses is clearly superior. Mr. and Mrs Souillard, thank you ! I hope my awkward writing won’t be a disgrace to your teaching standards.
Anyway, Mr. Souillard also taught optional Japanese lessons, but I’ve decided not to follow them, because, you know, I had much more interesting things to do, like more maths or physics (and, to be comprehensive, drink booze, play video games and make out with girls).
So, it looks like there is a justice, after all, because my pytst code has been noticed by some Japanese people, and I can’t understand what they write about it. Masaki Yatsu contacted me today to announce that he ported pytst to Ruby, apparently using SWIG. He also added support for multibyte character sets, which is obviously pretty important in Japanese. This is the part I’ve understood, because he wrote me in English. Unfortunately, his blog is in Japanese. Check out the Google Translation result, if you want to have some fun. I can barely understand 25% of what Masaki writes (and to be clear, it’s only my fault
.
As for odz’s blog entry, it was much worse, because I couldn’t get whether he had better or worse results using pytst versus Perl regexp (it turns out my code performs ten times better for his use case, yay!).
I’m not sure one or two years of learning Japanese would have helped me much in this situation, though. I guess in the best case I would have the same level of understanding that Google Translation gives me. But at least I wouldn’t feel such a dork, instead I would be bragging about my deep knowledge of kanji.
Anyway, it’s good to see one’s code reused by people from the other side of the planet, with different alphabets and character encodings, apparently without too many problems. I’m getting a warm and fuzzy feeling from that
.
Incidentally, since I’ve posted a comment on odz’s blog, some spam crawler has fetched my email address, and I now receive spam in japanese. Wonderful. My bayesian anti-spam filter is so happy to get all those new words to learn ! Actually it’s not.
pytst 1.14 for Python 2.3, 2.4 and 2.5
It’s been a few months since my last release of an official version of pytst. Seeing that someone in Japan uses it made me have a look at the project once again. If I remember correctly, the latest evolutions where mostly refactorings in the hope that I could come up with on-disk storage of the tree, plus a few experiment on my full text indexing system based on the TST (which I won’t release yet). So, I’m releasing the latest version, having made sure that it compiled correctly for Python 2.3, 2.4 and 2.5 on Win32 and Ubuntu Linux (I only tested Python 2.4 on Linux)
I’ve been pretty busy lately, what with the fact that I now have TWO regular jobs (both theoretically being at 50%) : I’m still CTO at The CRM Company, but I am now VP Product & Marketing (and developer
) at Yoono on top of that. Interesting times…
My work on TSTs has been used at Yoono ; my friend (and now boss) Laurent Quérel and I have gone a bit further with TSTs and came up with a kind of TST which can store multiple characters per tree node. Any time some part of a prefix can be shared by multiple keys, the part is fully stored in the node and not duplicated, thus saving a lot of unnecessary nodes and navigation. Though it surely exists somewhere, we haven’t seen it anywhere else, so we decided to simply call this kind of TST a CompactTST. What’s interesting in this TST is that it has a very low memory overhead : the indexing structure only adds 10% to the bulk volume of the data (string key + integer value in our case) it contains.
As I’ve wrote previously, my work on Yoono made me switch back to writing Java. Hence, the CompactTST is written in Java, but with a low-level edge : like I’ve did in C++ for pytst, we didn’t want to build a Java instance per tree node. Our solution is a bit rough but it does work perfectly : we store nodes in a byte[], manually managing the layout of nodes and the storage of their data in the byte array. The result is pretty fast, very compact and the GC doesn’t work at all.
This gave me the idea of reusing this method to store a big number of objects in memory. Now that we have 64 bit servers with 8 Gb memory, why not make the most of it ? So currently I’m on my way to storing tens of millions of objects in memory through my own memory allocator system, implemented on top of multiple byte arrays, without any problem caused by memory or GC overhead. Of course, this means that I can’t rely on the GC anymore, but it’s the whole point anyway. We now have a way to store millions of String, int, List<Integer> and List<String> in an efficient way. I’m sure die hard Java developers would cringe at what we are doing, but it’s fun and worthwhile.
I’m still planning on writing pytst 2.0, this time with a proper storage layer which will make me able to switch from memory-based to disk-based storage. Now that we have a neat CompactTST in Java, I’ll take the opportunity of porting the algorithms to C++ at the same time. It’s only a matter of finding the time to do it, which is easier said than done…
pytst 1.09
This release, available here as usual, adds the match method, which supports the glob-like wildcards ? and *. For instance, my_tree.match("a*?d",None,DictAction()) will give you matches for "added" and "abcd" and even "and" but not for "ad".
pytst 1.08 : SWIG is back in the game
This release uses SWIG 1.3.28. The latest version of SWIG brings huge improvements in performance, bringing it on par with Boost.Python according to my crude benchmarks. The advantage of SWIG is that it is much more easy to build, being supported by the distutils package (though I had problems with C++ support). You don’t have to learn a new set of building tricks like when using Boost. So, it seems that SWIG is back in the game !
Big thanks to the SWIG development team !
Update : Using the -O optimizing option in SWIG 1.3.28 yields even better performance, thus making the SWIG wrapper effectvely faster than the Boost.Python one. I don’t know what happens under the hood though, but at least I can read the generated code, contrary to what happens in Boost. My unit and torture tests behave as before, so it seems everything is OK ! Great !

