Archive pour la catégorie ‘pytst’
pytst-0.50
As promised, I’ve released a very preliminary version of pytst, a Ternary Search Tree (TST or trie) implementation in C++ with a Python interface (built with SWIG). Download it here if you dare !
Basically, it behaves like a dictionary, but the keys can only be plain strings (sorry, not unicode strings yet). So why bother ? Because TSTs are a lot smarter than dictionaries when it comes to :
- Prefix-matching : find the longest entry in the TST which is a prefix to a given string. Handy for things like scanners, url matchers and so on.
- Scanning (corolary of the previous one) : using the Aho-Corasick algorithm, you can implement pretty efficient scanners with a TST. The good thing it that it can scale up to tens of thousands of entries and still perform well.
- Spelling correctors : find a set of entries which spelling is close to a given string. The distance used is the Levenshtein distance.
I have been using this package in production for nearly a year now, without any problem (except a surprise due to SWIG directors when I switched to Python 2.4, but this is fixed now). I release this under the LGPL. Incidentally, this is my first release under a license, up until now I was pretty much releasing my dirty works in the public domain where it could safely be ignored
. I am NOT satisfied with the packaging (especially the tests), the code layout (it’s been so long I had not been writing C++ that I forgot a whole bunch of coding conventions) and so on, but I decided to release this as is, and see what happens.
Uh, yes, there is no documentation for now. It’s part of the « let’s see what happens » scheme, you see. If someone really is interested by this package, I’d be delighted to hear about you ! This would be a real motivation to write a documentation
.
James Tauber : Updated Python Trie Implementation
James Tauber : Updated Python Trie Implementation
I have written my own implementation of a Trie in Java, C++, D, and Python. This data structure is pretty efficient for prefix matching, but also for finding any key in the tree that is below a given Levenshtein distance from an arbitrary world. This means that you can quickly find all word which spelling is close to a given word, which is handy for name matching or spelling correctors.
Tries can also be used in text parsing to efficiently implement greedy tokenizers, tokenizers that extract the longest known sequences from a text, even when some known sequences are a prefix of other longer known sequences. Regular expression engines can do that, but tries can easily handle thousands or tens of thousands of known sequences of arbitrary length whereas some engines may choke with regular expressions of tens of kilobytes.
Both the (templatized) C++ and D version are very efficient, since I’ve written them at a pretty low level in order to use a few memory allocations (read : objects creation) as possible. For instance, the entire tree nodes are stored in a single array, and the node objects are preallocated in this array instead of being allocated from the heap.
The final step was to build a SWIG wrapper from the C++ version to a Python extension. I did this a while ago using SWIG 1.3.21 and I quite enjoyed the experience. The native extension to Python is way, way faster than the pure Python module ! Now, all this requires a bit of documentation, but I’ll eventually publish this code as it may prove useful.
James Trauber’s implementation is very neat, because he leverages the Python dictionary object. My trie are implemented as TSTs (Ternary Search Trees), which roughly gives a O(Nlog2(M)) lookup time, with N being the maximum word length and M being the maximum number of items per node (which depends on the variety of characters used, typically with real world words it’s bounded by 26*2 letters + 10 numbers + ponctuation ~ 64 => log2(M) <= 5 ). Using dictionaries at each node reduces the lookup time to O(N), which may be 5 times faster (warning, goofy maths here !) for the same implementation strategy (i.e. same language) at the cost of a bigger memory consumption. I’ll try to benchmark his implementation against mine one day…