This weekend I was a bit tired of fixing Krita bugs and decided to work a bit on features again. So I went back to optimizing painting in Krita (though one could see performance issues as bugs).
Lukas had already optimized the brush masks before, mainly by improving the algorithm. Back then the goal was fast painting with a 70px brush on a 2500×2500 image; the new goal is a 500px brush on a 6000×6000 image. When I looked at the CPU utilisation of the stroke benchmark I noticed that only one thread was busy, so I wanted to try to parallelize it. I know that KSysguard might not be the most precise way to measure it, but it gives a nice indication:
The first thing I wanted to try was OpenMP, which I knew from my Algorithm Engineering course at university. I tried to use for-loop parallelization, but it didn't work out very well: it turned out even slower than the serial version. I'm not completely sure why that happened, but I assume that the loop wasn't well suited for OpenMP.
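For readers who haven't used it, this is roughly what OpenMP for-loop parallelization looks like. It is only a minimal sketch: `computeCircularMask` and its linear falloff are made up for illustration and are not Krita's mask code, and it doesn't reproduce the slowdown I saw. It assumes the compiler is run with OpenMP enabled (for example `-fopenmp` with GCC); without that the loop simply runs serially.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a brush mask: one 8-bit opacity value per pixel.
// This is NOT Krita's mask generator, just a loop with independent iterations.
std::vector<std::uint8_t> computeCircularMask(int size)
{
    std::vector<std::uint8_t> mask(static_cast<std::size_t>(size) * size);
    const double center = (size - 1) / 2.0;
    const double radius = size / 2.0;

    // Every row can be computed independently, so the outer loop is shared
    // among the OpenMP threads. The guard keeps the build warning-free when
    // OpenMP is not enabled; in that case the loop just runs on one thread.
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int y = 0; y < size; ++y) {
        for (int x = 0; x < size; ++x) {
            const double dist = std::hypot(x - center, y - center);
            // Linear falloff: opaque in the centre, transparent at the rim.
            const double value = dist >= radius ? 0.0 : 1.0 - dist / radius;
            mask[static_cast<std::size_t>(y) * size + x] =
                static_cast<std::uint8_t>(std::lround(value * 255.0));
        }
    }
    return mask;
}
```

Each thread writes to its own rows, so no locking is needed in this sketch.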
After that my next try was to use QtConcurrent on the problem. My idea was to split the mask into a list of separate rectangles that the threads could work on independently. QtConcurrent was surprisingly easy to use and I only needed to make very few changes to the old code. Here is the result of the benchmark with the QtConcurrent code:
The random lines benchmark with a 300px brush improved from 9359 msec to 5621 msec, which is a speedup of about 1.67 on my Core i5 430 (dual-core). That isn't too bad if you consider that there is also some serial code in there. For smaller brushes the speedup is much smaller. Unfortunately I don't have a quad-core system to test on; it would be interesting to see how it scales. I'm still wondering why the QtConcurrent code doesn't run with 100% CPU utilisation. The benchmark should be big enough to reach the maximum.
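To make the rectangle-splitting idea from above a bit more concrete, here is a rough sketch of the pattern. It is not the Krita code: `Rect`, `splitIntoStrips`, `fillRect` and `fillMaskInParallel` are hypothetical names, the per-pixel work is a placeholder, and `std::async` stands in for QtConcurrent so that the example builds without Qt (it still needs thread support, for example `-pthread` with GCC). `stripHeight` is assumed to be greater than zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <future>
#include <vector>

// Hypothetical rectangle type; with Qt one would simply use QRect.
struct Rect { int x, y, width, height; };

// Cut a width x height mask into horizontal strips of at most stripHeight rows.
// Assumes stripHeight > 0.
std::vector<Rect> splitIntoStrips(int width, int height, int stripHeight)
{
    std::vector<Rect> strips;
    for (int y = 0; y < height; y += stripHeight) {
        strips.push_back({0, y, width, std::min(stripHeight, height - y)});
    }
    return strips;
}

// Placeholder for the real per-pixel mask computation: fill with a constant.
void fillRect(std::vector<std::uint8_t>& mask, int maskWidth,
              const Rect& r, std::uint8_t value)
{
    for (int y = r.y; y < r.y + r.height; ++y) {
        for (int x = r.x; x < r.x + r.width; ++x) {
            mask[static_cast<std::size_t>(y) * maskWidth + x] = value;
        }
    }
}

// Start one task per strip and wait until all of them are finished.
std::vector<std::uint8_t> fillMaskInParallel(int width, int height, int stripHeight)
{
    std::vector<std::uint8_t> mask(static_cast<std::size_t>(width) * height, 0);
    std::vector<std::future<void>> tasks;
    for (const Rect& r : splitIntoStrips(width, height, stripHeight)) {
        // The strips never overlap, so no two tasks write to the same pixel.
        tasks.push_back(std::async(std::launch::async, fillRect,
                                   std::ref(mask), width, r, std::uint8_t{255}));
    }
    for (auto& task : tasks) {
        task.get();
    }
    return mask;
}
```

Because the rectangles don't overlap, the tasks need no synchronisation apart from the final wait. Qt offers comparable helpers for running a function over a list, such as `QtConcurrent::blockingMap`.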
Since Krita is currently in feature freeze, the code won't make it into Krita 2.3.