Archive pour la catégorie ‘pytst’
pytst 1.07
The pytst 1.07 distribution improves vastly the packing algorithm. Turns out my previous so-called « clever algorithm » was running in O(n2), which is as bad as it can get. Well, no, it could have been O(2n), but still. The new algorithm runs in O(n (3+log n)). I’m thinking about a better one, but I just couldn’t leave things as is until I come up with the final algorithm.
Warning : I’ve just checked the Python 2.3 build, and GCC 3.4.2 (mingw-special) chokes on the most stupid piece of code ever :
#include <exception>
class TSTException : public exception {
public:
TSTException(const char* _message) : message(_message) { }
const char* what() {
return message;
}
private:
const char* message;
};
GCC complains that exception is not a known class name. Doh ! I have to get some sleep now, I’ll see this later.
pytst 1.06
|
I’ve made a lot of internal changes and improvements to the C++ part of pytst in order to implement an incremental full text index with both Python and COM bindings. The only visible change of the pytst 1.0.6 distribution is a more compact serialisation format. It’s not top notch compression, but it saves about 25% of disk space. For top notch compression, I’m currently reading Managing Gigabytes, an excellent book about compression and documents (text and images) indexation. It’s very interesting and pretty inspirative. It’s rather old, though, for example the index mentions Excite and Altavista but not the small but promising Google search engine. But it’s a theory book and most of the theory hasn’t changed in 10 years, of course. |
pytst 1.01
Download it here.
What’s new ?
- There is at last a nice test suite which I use to ensure that there are no regression and that both the SWIG version and the Boost.Python version are in sync. The test suite also records performance data in a CSV file which I can then analyse with Excel to check if there is a performance improvement or regression when I introduce something.
- A creeping bug in
close_matchis definitely fixed. I always knew it was here but could not properly reproduce it. Luckily with the new test suite I was able to track it. - WARNING : there is a change in the
DictActioncollector : it now retains the lowest distance whenever akey,valuepair is seen multiple times, for instance inclose_match. Theclose_matchalgorithm now stores distances (the lower the better) rather than remaining distances (the higher the better). For instance, if you were looking for all entries with a maximum distance of1from"123", you will now get"122"with a distance of1instead of a remaining distance of0. This is much more intuitive than before. - Speaking about collectors, you know, those
ListAction,TupleListAction,DictActionclasses ? Well, those are not the only way to collect data from the TST. There are now true Python iterators ! Instead of usingwalkorclose_match, you can use the default iterator, theiterator([root])method or theclose_match_iterator(string,max_distance)method.
Examples of iterator usage :
from tst import * t = tst.TST()
# ... filling t...
# BEFORE :
contents = t.walk(None,DictAction())
# NOW :
content = {}
for key, value in t:
content[key] = value
# BEFORE :
# All words beginning by "foo" :
from_foo = t.walk(None,DictAction(),"foo")
# NOW :
from_foo = {}
for key, value in r.iterator("foo"):
from_foo[key]=value
# BEFORE :
# All words close to "foo" :
close_to_foo = t.close_match("foo",1,None,DictAction())
# NOW :
close_to_foo = {}
for key, value in r.close_match_iterator("foo",1):
if key in close_to_foo:
# the close match iterator can return the same
# key multiple times
print "%s already inserted"%key
else:
close_to_foo[key] = value
What’s the point if the new code is bigger ? Well, if your tree is big, you may save some memory by iterating into the tree rather than by collecting all the data in a list / list of tuples / dictionary THEN processing it. That’s your choice ! Performance-wise, I’d say the collectors should be faster, but I still need to check this.
pytst 1.00 !
What’s new : after reading Scott Meyers’s Effective STL, I finally saw the light and was able to squeeze a little bit more of performance, eventually getting faster using std::vector than what I got with my hand-written vector-like class. This cancels what I’ve written about std::vector. This shows that STL is a very powerful library, but also that it makes it easy to shoot yourself in the foot if you don’t use it correctly. It also shows that Scott Meyers’s books are much needed if you want to write, well, effective C++ (wink, wink, nudge, nudge).
As an extra bonus, the new implementation of the pack() method now really packs the tree. Removed nodes were not, and still aren’t, removed from the tree. Instead, they are kept for later reuse, so that if a new node is later needed, a removed node can be used instead. This saves a lot of time and memory if you have a dirty phase in your computations where you have to add and remove a lot of nodes. Once the war is over, however, those empty nodes were left in memory, even if the chance of them being reused dropped to zero. Now, when you pack() the tree, those nodes are removed (the tree nodes get defragmented) and the final memory buffer gets shrinked down to the minimum.
It’s somewhat comparable to what you do when you « compress » a database (vacuum it in PostgreSQL, compress it in Access or Microsoft SQL Server, optimize it in MySQL, I guess). It has been quite interesting to find this algorithm as I guess it’ll be pretty useful when disk storage will come into play.
I’m leaving for Barcelona tomorrow night to spend New Year’s Eve with some friends over there. So I won’t update this blog till then… So let me give you readers (all 3 of them) my best wishes for 2006 !
pytst 0.99
The new release is available from the usual place. What’s new ? Well, above the surface, there is a new contains(string) method which tells you whether the tree contains a given string or not. Another new thing is a Win32 binary release of pytst using the Boost.Python library, which yields 25% more performance. If you’re interested by building a binary version of this for another platform, write me. I’ve been told that the latest (CVS) SWIG version was much more optimised, but alas, there is no Win32 binary version as of today and I’m too busy to try to build one.
Under the hood, I’ve been doing a bit of source code refactoring, in preparation of a new on-disk storage implementation. To start with, I plan on using Berkekey DB in Recno mode, since it’s quite close to an on-disk vector. If everything goes well, I’ll try to do it by myself (of course I don’t think I’ll achieve the same level of features and quality than the one provided by Sleepycat).
Sadly, it looks like my design for the node storage manager interface is a bit clunky… To do it well, I’ll have to refactor so many things that I’m beginning to think about a full rewrite (yeah, I know, it’s bad). In this case, I’ll release the current version as 1.0 and I’ll start over on a 2.0 branch, fully API compatible but with the new on-disk storage feature.
Note that there already is some support for disk storage of a tree ; but it’s a cold storage, you populate the tree in memory, then save it on disk, then you can reload it from disk to memory later on. The problem is that the tree is entirely loaded into memory, which limits the usage to huge trees (because you can hold one heck of a huge tree into 1 Gb of RAM), not letting us dwelve into the realm of mega huge trees
. A real on-disk storage would allow me to page in and out parts of the tree from the disk to the RAM.
|
Another note : I’m currently using Oddly, when I tried to port my prototype to full C++ (with a Boost.Python interface), I’ve ended up with a slower index… Even if Python’s |
Boost.Python rocks !
After a few hours of struggle, I managed to build a Python binding of tst 0.97 using Boost.Python. The results are great, my (unscientific) performance benchmarks show a 25% performance improvement over the version using SWIG. This is not surprising since SWIG takes your C++ class, builds a flat-view of your class as a set of C functions, then build a Python wrapper around this set of functions.
A few remarks, though :
- Boost and Boost.Python is a whole new world to discover. If you want things to remain a bit under your control, you HAVE to forget about make, nmake or Visual Studio and learn a new project building tool, bjam. It does not go without its surprises (for example, as a hint, don’t name the top-level directory of your project « boost ». It will confuse bjam.). Also, I’m a bit worried about the « buildability » of
pytst. Up until now, it could be built usingpython setup.py install, now you have to use a whole new toolset. - What’s cool is that the
.pydextension does not need a.pymodule. What’s less cool is that it need a runtime DLL namedboost_python.dll(but it is built along with your own extension). - Boost.Python is quite clever, but it does not understand the
char* string, int string_lengthidiom. SWIG does not, either, but at least you can teach it using SWIG templates. To work around this I had to usestd::basic_stringwhich is recognized by Boost and mapped to a Pythonstrtype. Anyway, I have to changetstso that it usesbasic_strings instead of this idiom which is hackish and not supported by Delphi, either. Because, you know, the Yoono client is developed in Delphi (hint, hint). - Boost.Python is very clean and cool, using hardcore C++ meta-programming constructs. The problem is that you don’t get to know what is the result of all this magic. There is no code generation, since everything is done through meta-programming. This is a bit disturbing because you have to have an absolute trust in Boost.Python and hope it always does The Right Thing™.
- Since it uses hardcore C++ meta-programming, I strongly advise you to refresh your knowledge on C++ templates. On the one hand, the mapping code is pretty simple and straightforward (see below), but on the other hand, as soon as you dwelve a bit further in the tricks and trades of overloading pure virtual functions in Python, things get quite hairy (see below, too).
To sum up : what’s cool with Boost.Python ? Performance, and code like that :
class_< TST >("TST")
.def("put",&TST::put)
.def("__setitem__",&TST::put)
.def("get",&TST::get)
.def("__getitem__",&TST::get)
.def("get_or_build",&TST::get_or_build)
.def("remove",&TST::remove)
.def("write",&TST::write)
.def("read",&TST::read)
.def("walk",&TST::walk)
.def("close_match",&TST::close_match)
.def("prefix_match",&TST::prefix_match)
.def("pack",&TST::pack);
What’s less cool is code like that :
template <class S, class T> class string_action_wrapper :public action<S,T>, public wrapper< action<S,T> > {
public:
void perform(S* string,int string_length,int remaining_distance,T data) {
return call<void,std::basic_string<S>,int,T>(
this->get_override("perform").ptr(),
std::basic_string<S>(string,string_length),
remaining_distance,
data
);
}
T result() {
return call<T>(this->get_override("result").ptr());
}
};
But who cares as long as we get some performance ! Mu-HAHAHAHA ! Err, sorry…
pytst 0.97
This release fixes a bug in the latest walk(filter,action,start=None) version : when start was provided, the results were sometimes wrong. I should never release things in a hurry
. Download it here.
pytst 0.96 : warning, API changes
OK, I should have done this waaaaaaay before. Anyway, I’ve made those changes to the API :
TST.almost(string,maximum_distance,filter,action)has been renamed toTST.close_match(string,maximum_distance,filter,action)TST.common_prefix(string,filter,action)has been renamed toTST.prefix_match(string,filter,action)walknow takes an optional parameter :TST.walk(filter,action,start=None). Ifstartis provided, the walk starts there, giving you all entries in the tree whose key begins with the given string.- The
NullFilterclass is removed, since it is useless. Just passNoneif you don’t want any filter.
I’m sorry if the migration poses any problems, all the more so that I’m the first who has to refactor some code
. The method names were very bad, common_prefix was especially confusing. This should be the last API change, the 1.0 version is close now.
Download pytst 0.96 here.
pytst 0.95
William Underwood found a bug in the common_prefix method. Bad results where returned… This method was largely unused by me and therefore untested, which proves that you should never publish untested parts of an API, because someone somewhere will use it even if you don’t yourself
.
pytst 0.95 is available in the best groceries and also on download here. You’ll also find a 0.95.noscan which does not have the Aho-Corasick algorithm but has smaller TST nodes (20 bytes vs 32 bytes, the extra 12 bytes are used to represent the backtracking data). This could interest you if your application is memory intensive.
pytst 0.94
Nothing much new this time, I’m now using setuptools to package the application into an egg file (see EasyInstall for the user’s point of view). I have also sneakily used my access on one of the Apache foundation’s FreeBSD servers to check whether pytst can be built and used on another platform than Win32. Everything seems OK
.
Download pytst 0.94 here.