10 November 2011

On Python, Cython and Pypy performance (with genetic algorithms)

As a part of my master's thesis I have to work on some optimization methods which in turn operate on a heavy mathematical model -- one evaluation may take about a second or more, depending on the implementation. Currently I'm working on a genetic algorithm optimizer based on the excellent PyEvolve module for Python.

The original optimizer was written by my tutor in pure C++, with GAlib taking care of the genetic computation. Originally it used a steady-state genetic algorithm engine, but I switched it to the simple GA engine that is compatible (although slower) with the one present in PyEvolve. That way it was easier (and fairer) to compare results and performance.


To interact with the core mathematical model I eventually wrote three different optimizer implementations (more in the next section). Since the evaluation is painfully slow, I tried different ways to make it faster. It's important to stop here and explain the situation. Evaluation is done with a not particularly optimized integration algorithm -- something along these lines:
def integration(*args):
    # Iterating over main blocks in X axis.
    for idx in xrange(0, num_dx):
        # ... same for Y axis.
        for idy in xrange(0, num_dy):
            # Numerical quadrature in [-1; 1] x [-1; 1] square.
            for ix in xrange(4):  # Integration along X axis.
                # ...
                for iy in xrange(4):  # Integration along Y axis.
                    # ...
                    # Integration along Z axis (analytical)

    # Result transformation to original area.
    # ...
Yes, this code is awfully inefficient, but in C it doesn't really matter that much (very unlike in Python). This function is also run tens or even hundreds of times for every GA solution. Let's just say that even in a very primitive GA run the innermost loop is executed 30,000-50,000 times. Of course, this function could probably be replaced with one of SciPy's quadrature routines -- that's the next exercise for me.
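For the record, a drop-in replacement for the quadrature part could look roughly like the sketch below. This is only a sketch: the integrand f stands in for the real model evaluation, and it assumes NumPy's Gauss-Legendre helper (leggauss) is available -- I haven't plugged it into the actual optimizer yet.

import numpy as np

def gauss_quad_2d(f, n=4):
    # n-point Gauss-Legendre nodes and weights on [-1, 1].
    nodes, weights = np.polynomial.legendre.leggauss(n)
    total = 0.0
    for x, wx in zip(nodes, weights):
        for y, wy in zip(nodes, weights):
            total += wx * wy * f(x, y)
    return total

# Quick sanity check: x**2 * y**2 over [-1, 1] x [-1, 1] integrates to 4/9.
print(gauss_quad_2d(lambda x, y: x ** 2 * y ** 2))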


Optimizer implementations
The original C++ optimizer used some custom genetic operators which turned out to be a bit worse than PyEvolve's default ones. I ported them anyway so my results wouldn't differ much between implementations. I put these operators' code up on GitHub; they're not really important to this story.
***
My first implementation of the optimizer was a thick Cython wrapper around the original C++ classes. I followed the official documentation and ran into oh so many problems. All in all, it usually went like this:
# Useful tweaks for problematic types.
cdef extern from *:
    ctypedef void* const_void_ptr "const void*"
    ctypedef char* const_char_ptr "const char*"
    ctypedef int bool


cdef extern from "someheader.h":
    cdef cppclass CppClass:
        int do_something()


cdef class PyCppClass:
    cdef CppClass *thisptr

    def __cinit__(self):
        # __cinit__ is guaranteed to run exactly once, so thisptr is always
        # valid by the time __dealloc__ fires.
        self.thisptr = new CppClass()

    def __dealloc__(self):
        del self.thisptr

    # Getters/setters/utilities doing work on the CppClass object.
    # ...
    def do_something(self):
        return self.thisptr.do_something()
In most cases this is all you need to do to wrap a C++ class (still not really user friendly). And this is only the input code -- you still have to cythonize it and compile it with proper linking against the Python libraries. Fortunately, you can take a shortcut with pyximport:
import pyximport
# Initialize pyximport so you can import cythonized modules.
pyximport.install()
# Import your module from .pyx file -- this will invoke background compilation.
from cy_module import some_func

# Code using some_func
# ...
Cython will try its best (using distutils) to prepare the module for you. Intermediate files are usually placed in your home directory under the .pyxbld/ path -- don't worry about them. You can also customize the way your module is compiled by creating a module_name.pyxbld file in the same directory as the .pyx file. This Stack Overflow answer will get you started -- the rest is just the distutils documentation on building extensions.
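For example (purely illustrative -- the C++ source and include paths below are made up), a cy_module.pyxbld placed next to cy_module.pyx could look like this:

# cy_module.pyxbld -- tells pyximport how to build cy_module.pyx.
def make_ext(modname, pyxfilename):
    from distutils.extension import Extension
    return Extension(name=modname,
                     sources=[pyxfilename, 'cpp/model.cpp'],
                     include_dirs=['cpp'],
                     language='c++')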
Cython is, after all, not that bad -- until you want to do something more advanced, like passing a Python list of floats to C++ code. The only way I found to solve this is by copying memory: you allocate a new array and fill it with the list's data in a for-in-range-len loop. Of course you have to manually free the memory taken by this array -- let's hope you know when you're done with it (or you'll get a segfault).
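Here's a minimal sketch of that dance; the process() function and its header are hypothetical, but the copy-call-free pattern is the real point:

from libc.stdlib cimport malloc, free

cdef extern from "someheader.h":
    double process(double *values, int length)

def call_process(values):
    cdef int n = len(values)
    cdef int i
    # Allocate a C array and copy the Python floats into it one by one.
    cdef double *buf = <double *> malloc(n * sizeof(double))
    if buf == NULL:
        raise MemoryError()
    for i in range(n):
        buf[i] = values[i]
    try:
        return process(buf, n)
    finally:
        # Manual cleanup: forget it and you leak, free too early and you segfault.
        free(buf)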
This thick wrapper did really well, sometimes even outperforming the original version. What pushed me to write the next implementations was a difference in the intermediate results (my bad, fixed by now). I dug really deep (even into the C++ layer) and couldn't find out why my cythonized classes at times produced distorted numerical results. So I ended up writing a pure Python implementation of the whole underlying model...
***
The second implementation was a complete rewrite of the optimizer code and the underlying model in pure Python using NumPy arrays. After getting around C++'s multiple function signatures, I finally got the same numerical results as the original optimizer. But it all came at a huge price. The aforementioned deeply nested loops were terribly, unacceptably slow. The execution time increased by three or four orders of magnitude.
Then it hit me -- I could use PyPy as well; at least there would be some JIT optimizations. Unfortunately we can't run NumPy under PyPy yet (although it is a work in progress), so I stripped all NumPy array and linalg calls from my code. Interestingly, even under CPython this version actually ran faster.
The first few tests under PyPy left me speechless. Now this really heavy, Python-unfriendly code executed only about ten times slower than the original written in C++ -- bravo, PyPy devs!
Some things I learned while making this implementation:

  • using NumPy (and SciPy, for that matter) is not always the best choice -- packing and unpacking arrays is slow. You're better off using it for really large matrices and avoiding storing Python objects in them,
  • use a profiler. cProfile and RunSnakeRun are a good start -- you can easily trace slow functions. line_profiler is a great tool to find the worst offenders at the single-line level,
  • funnily enough, functions from the math package can sometimes be slower than equivalent expressions, for example math.sqrt(x) vs. x**0.5 -- test this yourself (a quick timing sketch follows this list),
  • generally, calling many functions from a heavy (and nested) loop is not a good idea.
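If you want to check the math.sqrt claim on your own interpreter (numbers will differ between CPython versions and PyPy), a couple of timeit calls are enough:

import timeit

print(timeit.timeit("math.sqrt(x)", setup="import math; x = 123.456"))
print(timeit.timeit("x ** 0.5", setup="x = 123.456"))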
Performance issues have led me to write yet another implementation.


The last implementation is a really thin Cython wrapper around the original optimizer code. It doesn't directly interact with the underlying mathematical model. The only thing left written in Python is the code related to PyEvolve (and argument parsing, if you must know). I do all the initialization steps in the C++ layer by calling a single C-externed function wrapped in Cython, without any data conversion. Then I call a similar function to fetch the optimization constraints and the variable count. PyEvolve does all the genetic work and calls yet another wrapped function to do the evaluation in the C++ model.
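Schematically the wrapper boils down to something like the code below. The header name and the three C-externed functions are placeholders for the real interface, so treat this as a sketch of the idea, not the actual source:

from libc.stdlib cimport malloc, free

cdef extern from "model.h":
    void model_init()
    int model_variable_count()
    double model_evaluate(double *genes, int length)

def init():
    # All the heavy initialization stays in the C++ layer; nothing is converted.
    model_init()

def variable_count():
    return model_variable_count()

def evaluate(chromosome):
    # PyEvolve calls this for every genome; the chromosome crosses the boundary
    # through the same list-to-array copy shown earlier.
    cdef int n = len(chromosome)
    cdef int i
    cdef double *genes = <double *> malloc(n * sizeof(double))
    if genes == NULL:
        raise MemoryError()
    for i in range(n):
        genes[i] = chromosome[i]
    try:
        return model_evaluate(genes, n)
    finally:
        free(genes)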
The minimal Cython wrapper did really well. It performed even better than the original C++ optimizer! I blame this on the better genetic algorithm implementation in PyEvolve. Then I went even further -- why bother evaluating the same genome more than once? I added simple memoization to my evaluation function:
def memoized_eval(eval_fun):
    # Cache of already computed scores, keyed by the genome's gene values.
    memo = {}

    def inner_func(chromosome):
        key = tuple(chromosome)
        if key not in memo:
            memo[key] = eval_fun(chromosome)
        return memo[key]
    return inner_func
This one trick allowed me to cut down the execution time by thirty to fifty percent.
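For completeness, hooking the memoized evaluator into PyEvolve is a one-liner. In the hypothetical snippet below, evaluate and variable_count come from the thin-wrapper sketch above (imported here as cy_optimizer, a made-up module name):

from pyevolve import G1DList, GSimpleGA
from cy_optimizer import evaluate, variable_count

genome = G1DList.G1DList(variable_count())
genome.evaluator.set(memoized_eval(evaluate))
ga = GSimpleGA.GSimpleGA(genome)
ga.evolve()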


Results
This post is already too long, so let's see some stats ;). I ran four series of tests using all the optimizer implementations:
  • 10 generations with a population of 10 individuals,
  • 5 generations with a population of 50,
  • 20 generations with a population of 50,
  • 50 generations with a population of 100.

I set the initial random seed manually so the tests would be repeatable and (quite) fair. The testing machine is (just for reference) an AMD Athlon 64 X2 5600+ with 2GB of DDR3 RAM; software versions:
  • Python 2.7.2 (all implementations but one),
  • PyPy 1.6.0 (only pure Python version),
  • PyEvolve 0.6rc1,
  • Kubuntu 11.10 with Linux 3.0.0 kernel.
These are the execution times:
(Screenshot from Google Docs, sorry for that)
Or if you like graphs:
(Graph uses logarithmic scale. Minimal wrapper and Cython class wrapper lines are overlapping.)
(Example results for the simplest test run -- all other runs tend to give similar graphs.)
It's obvious that the pure Python version is the slowest one (not particularly Python's or PyPy's fault). On the other hand, wrapping C++ (and C) code with Cython adds no real performance penalty. It's also pretty straightforward -- the only annoying part is the frequent recompilation.


PS. Use that profiler for your high-performance applications.

1 comment:

  1. Hi

    You can try either using PyPy's numpy (although linalg won't work for now) or the list-strategies branch. It should be vastly faster, and will probably be merged soon (but it's not going into the 1.7 release).

    The main difference is that if you have a JIT, you *don't* unpack numpy arrays - they stay as optimized integers/floats.

    Cheers,
    fijal
