Archive pour la catégorie ‘pytst’
pytst : ouch !
I had no time to cover this previously (what with my new fatherhood and all), but I recently have been contacted on the 1st of september by Stani of SPE fame. He was interested in using pytst to implement a mass search/replace function. He had tried the regex way but to no avail. I gave him a bit of source code to do this with pytst, and he came back to me with his own version :
DELIMITERS = ' .,?!@():"\' '
reDELIMITERS = re.compile('([%s])'%DELIMITERS)
class MultiReplaceTST(tst.TST):
def __call__(self,input_string):
output = cStringIO.StringIO()
for source_string, status, replace_with in self.scan_with_stop_chars(input_string,DELIMITERS,tst.TupleListAction()):
if status>0:
output.write(replace_with)
else:
output.write(source_string)
return output.getvalue()
class MultiReplaceDict(dict):
def __call__(self,input_string):
input_string = reDELIMITERS.split(input_string)
output = cStringIO.StringIO()
for word in input_string:
try:
output.write(self[word])
except KeyError:
output.write(word)
return output.getvalue()
Well, it’s hard to admit, but the second version, using Python dict and re, is apparently ten times faster than the first version using pytst… Ouch !
I’ll have to test it myself, but I think one of the culprit (besides me
is SWIG. As I wrote in a previous post, profiling pytst showed that the hotspots were mainly all in the C/C++ wrapping code generated by SWIG. This is very irritating, so I now feel the urge of reimplementing pytst without using SWIG.
std::vector is very slow ?
I was trying to understand why my latest version of pytst (0.92) was 10% slower than the previous one (0.86). I have made three important changes between those two versions :
- I used my
PythonReferenceclass to ease the Python reference count management. I was fed up with having to managePy_INCREFandPy_DECREFmanually, so I decided to let the C++ compiler do the job itself, following the advice of Scott Meyers. - I made a complete overhaul of the
memory_storageclass, which is the class that, well, manages the storage of the TST nodes in memory (my long term plan is to provide afile_storageclass to be able to store and use the tree directly on the filesystem). I was feeling a bit silly to have implemented my own kind ofstd::vectorimplementation, so I wisely decided to usestd::vectorinstead. After all, why reinvent the wheel ? - The serialization system is now completely independent of the storage implementation. It’s like a generic dump of the tree. This way, it will be possible to serialize a
tstinstance backed by amemory_storageand load it into atstinstance backed by a (not yet available)file_storage. Of course, this point is not related to the performance loss at all.
It turns out that when you use a profiler (I used the 15-days free evaluation of AQtime 4, which has everything I could dream of for a profiler, the first thing being that « it just works »), a lot of methods from std::vector are hotspots. Granted, this is in debug mode, so some code that should be inlined is not, but if a function is an hotspot when called, it has good chances of remaining an hotspot when inlined, don’t you think ? Well, I’ve quickly reimplemented a std::vector-like behaviour right into memory_storage, and presto, I nearly get back my 10% percents of lost performance. Plus, the source code is not so much more complicated.
So, sometimes it’s worth reinventing the wheel, especially for simple cases like that.
Anyway, those results are not especially surprising : the Microsoft implementation of std::vector::at (which was the most method that my code used the most) instantiates an iterator and uses the overloaded + operator, only to access a vector element ! I can’t imagine how this could be more complicated. I don’t know if this kind of convoluted way to do things is also found in other STL implementations, but it is a bit creepy…
Now, after the removal of std::vector, what’s the top hotspot ? SWIG_Python_ConvertPtr ! Looks like SWIG will have to disappear… I’m really thinking about using PyCXX or Boost.Python.
It is very frustrating to have to fight against all those layers of foreign code, when all you want to do is optimize your own code ! Anyway, here comes pytst 0.93 !
Working on an Hash Array-Mapped Tree (HAMT)
I’m currently working on an implementation of an Hash Array-Mapped Tree (HAMT), a replacement for hashtables which is theoretically more efficient. Right now I have the C++ implementation, I still have to work on memory and CPU optimisations as well as Python and Java bindings. I’d like to try something else than SWIG, this time, because I’m not totally satisfied with my experience on pytst. Maybe Boost.Python ?
Yet another Python TST implementation
Here, from a French compatriot
. This one implements tail compression, which is an important space optimisation for TSTs. I have not implemented it until now since some of the algorithms implemented in my TST (namely the close match algorithm) would become quite complicated. I’m still thinking about a way to do it efficiently…
Update: this is a TST implementation all right, but without Python bindings.
pytst 0.85
The reference counting problem should be fixed in this version. I finally understood the various interactions between Python, SWIG and my code. Download it here.
To stop worrying about reference counting, I’m planning on implementing a kind of smart pointer, this way :
class PythonReference {
public:
inline PythonReference() {
this->ref=NULL;
}
inline PythonReference(PyObject *object) {
this->ref = object;
Py_XINCREF(object);
}
inline PythonReference(const PythonReference& that) {
this->ref = that.ref;
Py_XINCREF(this->ref);
}
inline PythonReference& operator= (const PythonReference& that) {
PyObject* old = this->ref;
if((this->ref = that.ref) != old) {
Py_XINCREF(this->ref);
Py_XDECREF(old);
}
return *this;
}
inline PyObject* object() {
return ref;
}
inline ~PythonReference() {
Py_XDECREF(ref);
}
private:
PyObject* ref;
};
That is to say a class encapsulating a Python object reference and automagically incrementing and decrementing the reference count as it is copied and passed in function calls.
pytst 0.83
Available from the usual source. This release tries to fix some problems with Python reference counting. I’m not quite sure whether it is really solved now, or whether my fix created a potential memory leak. At least the Python process doesn’t behave strangely now after a few operations on the tree. I’m looking for ways to debug this efficiently, so hold the line…
Another implementation of the Aho-Corasick algorithm with Python bindings
Here. I need to bench it more thoroughly, but from my first tests, it seems that pytst is faster and less memory intensive (my million keywords test applied to ahocorasick-0.8 took virtually forever with a lot of swap). pytst is also more rich in functionalities ; you can use pytst as a dictionary, ahocorasick-0.8 is just a scanner.
pytst 0.80
This release focuses on abstracting the node storage mechanism, so that I can eventually replace the memory-based one by a file-based one (ideally using a memory-mapped file).
The cool thing about C++ templates is that they allow a certain level of abstraction without much impact on the performance. Here, instead of defining a storage class with different derived class (memory_storage, memory_mapped_file_storage etc.), I chose to add a template parameter to the tst class which is the storage class you want to use. The difference between the two solutions is that there is no need to have virtual methods in the storage class since its type is known at compile time, so we save all the virtual method dispatching cost, which is critical since the storage class is used intensively. This, plus the fact that the two most used methods in the memory_storage class can be inlined, ensures that no performance loss is observed, even with this new abstraction layer.
pytst 0.64
A bunch of performance improvements : I’ve saved 4 bytes for every node without much impact on the performance level, thanks to two tricks.
- Each node had a height field to save some time when computing the balance of each node. I’ve decided to drop this field and always compute the height of each node dynamically. It turns out that it’s not so expensive because you always know the height of one of the children nodes for free from the recursion, so you only have to « pay » for the computation of the height of the other node.
- Each insert operation in an AVL tree can generate at most one balance operation (this is proven by a theorem, so we might as well use this !). So, once you’ve made a balance operation, you can skip all other balance checks, which means all height computations.
I guess all this only makes sense when reading the code
.
As an aside, I’ve built a SWIG wrapper to the library for Java. It performs quite well, though the Java overhead is somewhat bigger than the Python overhead. Plus, the packaging is currently very poor, you’ll only have access to the Win32 binaries. But some friends are interested, so, here it is…
pytst performance
I have tried this little stress test for pytst :
>>> import tst >>> t = tst.TST() >>> length = 0 >>> for i in range(int(1e6)): ... s = str(i) ... length += len(s) ... t[s]=1 ... >>> t.pack() >>> length, t.bytes_allocated() (5888890, 36000032)
So, storing 1 million strings totalling 5.9 Mb of data requires 36 Mb of memory and around 10 clock-wall seconds. Ouch. But all this memory overhead has a purpose : it contains some very precious index information which allows the TST to perform very well when used properly. For example, you can scan very efficiently a string for the longest matches it contains :
>>> t.scan('777656522*56s5',tst.TupleListAction())
[('777656', 6, 1), ('522', 3, 1), ('*', -1, None), ('56', 2, 1), ('s', -1, None), ('5', 1, 1)]
This example is stupid as hell, because using regular expression would have been much more intelligent, but stay with me : instead of dumb numbers, we could have stored, say, one million of known genetic sequences, and then scan a whole genome very efficiently. How efficiently ? Well, the use of the Aho-Corasick algorithm enables us to read each character from the scanned string only once, and never look back. The algorithm does this by carefully examining the strings stored in the tree and derive an optimized scanner. This scanner requires some data that are stored in the tree, hence the memory overhead we observed previously. For a bit of benchmark, on my computer, a 488 kb text resulting in 86570 tokens is parsed in about 300 ms (remember that we are looking for the longest matches amongst a million strings !).
OK, now let’s try to do this with a simple yet equivalent regular expression instead :
>>> import re
>>> r = re.compile(r'\d{1,6}')
>>> r.split('777656522*56s5')
['', '', '*', 's', '']
>>> r = re.compile(r'(\d{1,6})')
>>> r.split('777656522*56s5')
['', '777656', '', '522', '*', '56', 's', '5', '']
Well, first of all, we don’t get exactly what we want. The scanner built in pytst returns a list of tokens and non-recognized strings as a list of tuple. Each tuple contains first a string, then an positive integer if the string is contained in the tree or -1 if not, and the value associated to the string in the tree or None. Doing this with regular expressions is possible (I’ve done it to implement a poor man’s lexer) but quite tricky.
Another remark that can be done is that we have cheated by using \d{1,6} in the code snippet above. We have found a global regular expression to represent all number between 0 and 999999. Fine ! But this is just because my example was lousy. In real life, you want to store one million « random » strings such as genetic sequences, for which you can’t find a precise regular expression that can tell whether a given sequence is or is not in the tree. So, to simulate what it would be to compete against the TST with a regular expression engine, let’s do this :
>>> import re >>> s = '|'.join([str(i) for i in range(int(1e6))]) >>> s[:10] '0|1|2|3|4|' >>> len(s) 6888889
OK, let’s go, let’s compile this 68.9 Mb regular expression (WARNING : don’t do this at home !) :
>>> r = re.compile(s) >>>
Pfew ! It’s been a while, hasn’t it ? 6’54″ on my modest computer. I’m glad I have 1 Gb of memory, because the python process actually uses 573 Mb of memory after that. The memory overhead of the regular expression is gigantic ! Plus, I was nice, because I should have used groups delimitors around each number so that I could get them when calling split. The problem is that you cannot have more than 100 groups in a regular expression, so it will not fit here. Anyway, let’s see how the scanner performs :
>>> r.split('777656522*56s5')
Traceback (most recent call last):
File "<stdin>", line 1, in ? RuntimeError: internal error in regular expression engine
Bad luck. So what is this all about ? Am I bullying the regular expressions module ? Not at all ! It’s a great piece of software, which possibilites that go way beyond what the Aho-Corasick algorithm can do. But when you have thousands or millions of strings to search in a text, regular expressions engines cannot cope. If they cannot cope, then either you turn to the naive algorithm (say, using a dict to store your entries) and spend years looking for your strings, or you use Aho-Corasick. All your base are belong to us !
By the way, let’s try to fill a dict :
>>> d = {}
>>> for i in range(1000000):
... d[str(i)]=1
...
The memory used by the python process is actually 40% higher when loading a dict than when loading a TST ! I’ll have to check this more thoroughly, but it looks like the TST is also competitive to a dict, memory-wise (because the loading and reading time have nothing in common, dict is way faster)
Next time, we’ll see other uses of the TST.