This weekend I was a bit tired of fixing Krita bugs and decided to work a bit on features again. So I started optimizing the painting in Krita again (though one could see performance issues as bugs).
Lukas had already optimized brush masks before, mainly by improving the algorithm. Back then the goal was fast painting with a 70px brush on a 2500×2500 image; the new goal is a 500px brush on a 6000×6000 image. When I looked at the CPU utilisation of the stroke benchmark I noticed that only one thread was busy, so I wanted to try to parallelize it. I know that KSysGuard might not be the most precise way to measure it, but it gives a nice indication.
The first thing I wanted to try was OpenMP, which I knew from my Algorithm Engineering course at university. I tried to use for-loop parallelization, but it didn’t work out very well, as it turned out even slower than the serial version. I’m not completely sure why that happened, but I assume that the loop wasn’t well suited for OpenMP.
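For readers who haven’t seen it, OpenMP for-loop parallelization looks roughly like the sketch below. This is not the Krita code: `computeMaskOpenMP` and the linear circular falloff are made-up placeholders, chosen only because every row of such a mask can be computed independently.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a brush mask: a size x size circular falloff.
// This is NOT Krita's mask algorithm, only something with independent rows.
std::vector<std::uint8_t> computeMaskOpenMP(int size) {
    std::vector<std::uint8_t> mask(static_cast<std::size_t>(size) * size);
    const double center = (size - 1) / 2.0;
    const double radius = size / 2.0;

    // Rows are distributed across threads. Build with -fopenmp (GCC/Clang);
    // without that flag the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < size; ++y) {
        for (int x = 0; x < size; ++x) {
            const double d = std::hypot(x - center, y - center) / radius;
            const double v = d < 1.0 ? 255.0 * (1.0 - d) : 0.0;
            mask[static_cast<std::size_t>(y) * size + x] =
                static_cast<std::uint8_t>(v);
        }
    }
    return mask;
}
```

A loop like this only pays off when each iteration does enough work to outweigh the cost of starting and synchronizing the threads, which may or may not be what went wrong in my case.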
After that, my next try was to use QtConcurrent on the problem. My idea was to split the mask into a list of separate rectangles that the threads could work on. QtConcurrent was surprisingly easy to use and I only needed to make very few changes to the old code. Here is the result of the benchmark with the QtConcurrent code:
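Before the numbers, a rough sketch of the rectangle-splitting idea may help. To keep it self-contained it uses `std::async` from the C++ standard library as a stand-in for QtConcurrent, and the `Rect` struct, `splitIntoStrips`, `maskValue`, `fillRect` and `computeMaskParallel` are all hypothetical names, not functions from Krita.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical rectangle type; Qt code would use QRect instead.
struct Rect { int x, y, width, height; };

// Split a width x height area into horizontal strips, one per task.
std::vector<Rect> splitIntoStrips(int width, int height, int strips) {
    std::vector<Rect> rects;
    strips = std::max(1, std::min(strips, height));
    const int base = height / strips;
    const int extra = height % strips;  // the first `extra` strips get one more row
    int y = 0;
    for (int i = 0; i < strips; ++i) {
        const int h = base + (i < extra ? 1 : 0);
        rects.push_back({0, y, width, h});
        y += h;
    }
    return rects;
}

// Placeholder mask function (circular falloff), not Krita's algorithm.
std::uint8_t maskValue(int x, int y, int size) {
    const double center = (size - 1) / 2.0;
    const double d = std::hypot(x - center, y - center) / (size / 2.0);
    return static_cast<std::uint8_t>(255.0 * std::max(0.0, 1.0 - d));
}

// Fill one rectangle of the mask; rectangles never overlap, so no locking.
void fillRect(std::vector<std::uint8_t>& mask, int size, Rect r) {
    for (int y = r.y; y < r.y + r.height; ++y)
        for (int x = r.x; x < r.x + r.width; ++x)
            mask[static_cast<std::size_t>(y) * size + x] = maskValue(x, y, size);
}

std::vector<std::uint8_t> computeMaskParallel(int size, int tasks) {
    assert(size > 0 && tasks > 0);
    std::vector<std::uint8_t> mask(static_cast<std::size_t>(size) * size);
    std::vector<std::future<void>> pending;
    for (const Rect& r : splitIntoStrips(size, size, tasks)) {
        // Each task writes only to its own strip of the shared buffer.
        pending.push_back(std::async(std::launch::async,
            [&mask, size, r] { fillRect(mask, size, r); }));
    }
    for (auto& f : pending) f.get();  // wait until every strip is done
    return mask;
}
```

Calling `computeMaskParallel(300, 2)` would compute a 300px placeholder mask in two strips. With QtConcurrent the same list of rectangles would be handed to the framework rather than to hand-rolled futures.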
The random-lines benchmark with a 300px brush improved from 9359 msec to 5621 msec, which is a speedup of about 1.66 on my Core i5 430 (dual-core). That isn’t too bad if you consider that there is also some serial code in there. For smaller brushes the speedup is much smaller. Unfortunately I don’t have a quad-core system to test on; it would be interesting to see how it scales. I’m still wondering why the QtConcurrent code doesn’t run with 100% CPU utilisation. The benchmark should be big enough to reach the maximum.
Since Krita is currently in feature freeze, the code won’t make it into Krita 2.3.
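As a back-of-the-envelope check on the serial part, one can plug the two timings into Amdahl's law. This assumes the run really used n = 2 cores and that the law's simple model applies; it ignores hyper-threading, memory bandwidth and scheduling overhead, so the four-core figure is only an estimate, not a measurement.

```latex
\begin{align*}
S_2 &= \frac{9359}{5621} \approx 1.665, \\
S_n &= \frac{1}{(1-p) + p/n}
  \;\Rightarrow\; p = \frac{n}{n-1}\left(1 - \frac{1}{S_n}\right), \\
p &\approx 2\left(1 - \frac{5621}{9359}\right) \approx 0.80, \\
S_4 &\approx \frac{1}{0.20 + 0.80/4} = 2.5.
\end{align*}
```

Under these assumptions about 80% of the benchmark time runs in parallel, and a quad-core would be expected to reach a speedup of roughly 2.5 at best.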


October 26, 2010 at 10:38 am
THANKS a lot indeed for your work on Krita.
Krita 2.3 is shaping up very well and it will be a great release
October 26, 2010 at 11:28 am
Sweet! I’m loving all the speed-ups Krita’s been getting. These are the kinds of things virtually *every* user will end up benefiting from. Great stuff, Sven!
By the way, I’ve got a quad core I can donate to the testing cause if you feel like telling me how to do the test.
October 26, 2010 at 11:49 am
While maybe a bit more work (not much imo), you could try out a plain QThread. QtConcurrent is said to have a big overhead for its convenience.
October 26, 2010 at 12:19 pm
That’s nice work, Sven!
Thank you
October 26, 2010 at 1:21 pm
Does the code use Eigen? I don’t know much about it, but there are #defines for turning on different varieties of parallelisation (e.g. OpenMP) for the vector operations. I don’t know whether that would be more of a win than parallelising the algorithms.
October 26, 2010 at 4:12 pm
@Milian Wolff: I just tried with QThread and it turned out a bit slower than the QtConcurrent version.
@maninalift: The code doesn’t use Eigen as there are no vector operations.
October 31, 2010 at 7:31 pm
>>I’m still wondering why the QtConcurrent code doesn’t run with 100% CPU utilisation. The benchmark should be big enough to reach the maximum.<<
Probably because of Hyper-threading: the cores are running at 100%, but that is divided between two threads per core, so each thread only reaches 50%.
October 31, 2010 at 7:35 pm
No, OpenMP manages to get 100% on all four threads so that should be possible.