Categories

pytst 1.18

I’ve just released pytst 1.18 which fixes a bug reported by Keith Davidson. The bug was in the scan method and appeared in some very specific yet totally mundane circumstances : if you scanned an input string of a very specific length with regard to the content of the tree, then the first scan would be ok, but the next ones not. This was something that was not in the unit tests and that managed to survive many years before being reported.

The code for the scan method (in tst.h) was one of the most complicated piece of code in this project, and I had quite a hard time figuring out what was happening. In the process, I began to realise that a rewrite would be necessary, so I cleaned the slate and entirely re-designed the algorithm, this time with careful planning.

While doing this, I realised that the old algorithm was much more wrong than I thought. There is a catch with the new algorithm, however : it has to do some backtracking. That is to say, in some condition, the algorithm has to go back and read again some part of the input string.

The old algorithm never had to look back – but the old algorithm would miss some matches, if you had dictionary entries that included multiple smaller dictionary entries. For example, in a dictionary with {'111','222','111222111'}, if you provided the input string '11122211' (notice the missing 1 at the end), then you would have a match on '111',  then a non match on '22211'. The match on '222' would be missed.

For the moment, I prefer to release pytst 1.18 with an algorithm that is correct yet not optimal, and try to find a non-backtracking solution later. From my measurements, the impact on performance is around 10% but YMMV.

For the record, the new algorithm looks like this :

There is backtracking in to places : (J) and (K) where the input string index i is modified. To prevent backtracking, I have to find a way to analyse the tree so that I emit any match included in the current state then change to a proper state. It’s what Aho-Corasick is supposed to do, but I need to have a better look at it since I obviously got it wrong the first time.

You can download pytst 1.18 here.

  • http://twitter.com/ivan_bezdomny Nikolai Yakovenko

    Hi Nicolas,nnDo you have a version of PyTST 1.18 that I can download for Mac OS? I downloaded latest ZIP in the link you sent me… but I think it still has this problem.nnI am matching words in text preprocessed to make explicit word boundary. So I match strings like  » dog  » and  » cat « . If a text has  » dog cat « , then the second won’t match. This is the problem that you fixed, right?nnI used Aho-Corasick extensively when I worked for Google :-) . Trying to use it now too, on an independent project.nnThanks,nNikolai

  • http://twitter.com/nlehuen Nicolas Lehuen

    Hi Nikolai,nnCan you send me your test case (nicolas at lehuen dot com) ? It is very simple so it should work straight away… Unless I’ve missed something.nnRegards,nNicolas

  • dbg

    Hi Nicolas,nnThanks for this useful module! I was looking everywhere for a fast substring search implementation. However, I also seem to still have a problem with the scan function. For instance, if I have a dictionary like {’111′,’222′,’112′} and a string ’111222111′, I would expect it to find ’111′ at positions 0 and 6, ’222′ at position 3, and ’112′ at position 1. Currently it is not finding 112 at all, probably because that match begins inside the match ’111′. Is this the intended behavior, or was the 1.18 update supposed to fix this?u00a0nnMany thanks for your help, and your useful module.u00a0

  • dbg

    Here is some exact code:u00a0nn>>> t = tst.TST()>>> t['111'] = ‘first’>>> t['222'] = ‘second’>>> t['112'] = ‘third’>>> string = ’111222111′nn>>> class tst_findall(object):… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 def __init__(self):… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 self.pos = 0… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 self.scores = []… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 self.locations = []… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 self.nmers = []… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 def hit(self,key,length,obj):… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 if length > 0:… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 self.scores.append(obj)… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 self.nmers.append(key)… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 self.locations.append((self.pos,self.pos+length))… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 self.pos += abs(length)… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 def result(self):… u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 u00a0 return (self.nmers,self.locations,self.scores)nn>>> tstf = tst_findall()>>> call_tst_findall = tst.CallableAction(tstf.hit,tstf.result)nn>>> t.scan(string,call_tst_findall)(['111', '222', '111'], [(0, 3), (3, 6), (6, 9)], ['first', 'second', 'first'])nnI would expect:nn>>> t.scan(string,call_tst_findall)(['111','112','222', '111'], [(0, 3), (1, 4), (3, 6), (6, 9)], ['first', 'third', 'second', 'first'])

  • http://nicolas.lehuen.com/ Nicolas Lehuen

    Hi dbg,nnThe current scan implementation is greedy, it find longest matches and « consume » them. This is precisely what I needed for my projects (I even have a version of the scan algorithm that only accept matches on stop word boundaries), but of course it may not be what you need.nnWhat you need is an implementation of the Aho-Corasick algorithm (http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm). Initially I tried to implement a custom-made, greedy, non-backtracking algorithm that I thought was Aho-Corasick. But obviously, being greedy, it could not be A-C !nnThen I found the bug the blog entry mentioned, and I dumped the whole thing for a simpler, safer, greedy-but-backtracking algorithm.u00a0Truth is, I don’t use pytst anymore and I have no more time to maintain it. But I could not just let the project out there with a bug in the scan algorithm, so I did this last release.nnThe whole source code is under github (https://github.com/nlehuen/pytst) so don’t hesitate to fork the project if needed.nnRegards,nNicolasnnnnn