Categories

James Tauber : Updated Python Trie Implementation

James Tauber : Updated Python Trie Implementation

I have written my own implementation of a Trie in Java, C++, D, and Python. This data structure is pretty efficient for prefix matching, but also for finding any key in the tree that is below a given Levenshtein distance from an arbitrary world. This means that you can quickly find all word which spelling is close to a given word, which is handy for name matching or spelling correctors.

Tries can also be used in text parsing to efficiently implement greedy tokenizers, tokenizers that extract the longest known sequences from a text, even when some known sequences are a prefix of other longer known sequences. Regular expression engines can do that, but tries can easily handle thousands or tens of thousands of known sequences of arbitrary length whereas some engines may choke with regular expressions of tens of kilobytes.

Both the (templatized) C++ and D version are very efficient, since I’ve written them at a pretty low level in order to use a few memory allocations (read : objects creation) as possible. For instance, the entire tree nodes are stored in a single array, and the node objects are preallocated in this array instead of being allocated from the heap.

The final step was to build a SWIG wrapper from the C++ version to a Python extension. I did this a while ago using SWIG 1.3.21 and I quite enjoyed the experience. The native extension to Python is way, way faster than the pure Python module ! Now, all this requires a bit of documentation, but I’ll eventually publish this code as it may prove useful.

James Trauber’s implementation is very neat, because he leverages the Python dictionary object. My trie are implemented as TSTs (Ternary Search Trees), which roughly gives a O(Nlog2(M)) lookup time, with N being the maximum word length and M being the maximum number of items per node (which depends on the variety of characters used, typically with real world words it’s bounded by 26*2 letters + 10 numbers + ponctuation ~ 64 => log2(M) <= 5 ). Using dictionaries at each node reduces the lookup time to O(N), which may be 5 times faster (warning, goofy maths here !) for the same implementation strategy (i.e. same language) at the cost of a bigger memory consumption. I’ll try to benchmark his implementation against mine one day…

  • http://http://rosh.wordpress.com Roshan Mathews

    Where is the code??? I’ve been looking for an efficient implementation and Tauber’s code (which I’m sure is good) isn’t good enough. :)