Parallel brush mask processing

This weekend I was a bit tired of fixing Krita bugs and decided to work on features again for a change. So I went back to optimizing painting performance in Krita (though one could argue that performance issues are bugs too).

Lukas had already optimized the brush masks earlier, mainly by improving the algorithm. Back then the goal was fast painting with a 70px brush on a 2500×2500 image; the new goal is a 500px brush on a 6000×6000 image. When I looked at the CPU utilisation of the stroke benchmark, I noticed that only one thread was busy, so I wanted to try to parallelize it. I know that KSysguard might not be the most precise way to measure this, but it gives a good indication.

The first thing I wanted to try was OpenMP, which I knew from my Algorithm Engineering course at university. I tried for-loop parallelization, but it didn't work out very well: it turned out even slower than the serial version. I'm not completely sure why that happened, but I assume the loop wasn't well suited for OpenMP.
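For readers who haven't used OpenMP, here is a minimal sketch of what for-loop parallelization looks like on a mask loop. The circular mask with a linear falloff and the function name circleMaskOpenMP are hypothetical illustrations, not Krita's brush mask code, and the sketch does not reproduce or explain the slowdown described above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical example, not Krita's code: fill a size x size circular brush
// mask with a linear falloff. When compiled with OpenMP (e.g. -fopenmp), the
// pragma splits the rows into contiguous chunks, one per thread. Without
// OpenMP the pragma is ignored and the loop runs serially with the same result.
std::vector<std::uint8_t> circleMaskOpenMP(int size) {
    assert(size > 0);
    std::vector<std::uint8_t> mask(static_cast<std::size_t>(size) * size);
    const double radius = size / 2.0;
#pragma omp parallel for schedule(static)
    for (int y = 0; y < size; ++y) {
        // Each iteration writes only its own row, so threads never share pixels.
        for (int x = 0; x < size; ++x) {
            double dx = x + 0.5 - radius;
            double dy = y + 0.5 - radius;
            double d = std::sqrt(dx * dx + dy * dy) / radius;  // 0 at centre, 1 at edge
            double opacity = d < 1.0 ? 1.0 - d : 0.0;
            mask[static_cast<std::size_t>(y) * size + x] =
                static_cast<std::uint8_t>(std::lround(opacity * 255.0));
        }
    }
    return mask;
}
```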

My next attempt was to use QtConcurrent. The idea was to split the mask into a list of separate rectangles that the threads could process independently. QtConcurrent was surprisingly easy to use, and I only needed to make a few changes to the old code. I then reran the benchmark with the QtConcurrent version.
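The rectangle-splitting idea can be sketched in standard C++ under a few assumptions. The Rect type, processRect, splitIntoStrips and processParallel are hypothetical names, the mask is the same kind of made-up circular falloff, and the mask is cut into horizontal strips rather than arbitrary rectangles. In Qt code the spawn-and-wait step would typically be done by QtConcurrent::blockingMap over the list of rectangles; plain std::thread is used here only to keep the example self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical rectangle type describing one region of the mask.
struct Rect {
    int x, y, width, height;
};

// Hypothetical per-rectangle work: a circular brush mask with linear falloff.
// Only pixels inside `r` are written, so different rectangles can run concurrently.
void processRect(std::vector<std::uint8_t>& mask, int size, Rect r) {
    const double radius = size / 2.0;
    for (int y = r.y; y < r.y + r.height; ++y) {
        for (int x = r.x; x < r.x + r.width; ++x) {
            double dx = x + 0.5 - radius;
            double dy = y + 0.5 - radius;
            double d = std::sqrt(dx * dx + dy * dy) / radius;  // 0 at centre, 1 at edge
            double opacity = d < 1.0 ? 1.0 - d : 0.0;
            mask[static_cast<std::size_t>(y) * size + x] =
                static_cast<std::uint8_t>(std::lround(opacity * 255.0));
        }
    }
}

// Split a size x size mask into at most `count` non-overlapping horizontal strips.
std::vector<Rect> splitIntoStrips(int size, int count) {
    assert(size > 0 && count > 0);
    std::vector<Rect> rects;
    int stripHeight = (size + count - 1) / count;  // round up so every row is covered
    for (int y = 0; y < size; y += stripHeight) {
        rects.push_back({0, y, size, std::min(stripHeight, size - y)});
    }
    return rects;
}

// Run processRect on every strip in its own thread and wait for all of them
// (the role QtConcurrent::blockingMap would play in Qt code).
void processParallel(std::vector<std::uint8_t>& mask, int size) {
    int threads = static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
    std::vector<std::thread> workers;
    for (Rect r : splitIntoStrips(size, threads)) {
        workers.emplace_back(processRect, std::ref(mask), size, r);
    }
    for (std::thread& t : workers) {
        t.join();
    }
}
```

Because the strips don't overlap, each thread writes to its own part of the buffer and no locking is needed; the result is identical to processing the whole mask as one rectangle on a single thread.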

The "random lines with 300px brush" benchmark improved from 9359 msec to 5621 msec, a speedup of about 1.67 on my Core i5 430 (dual-core). That isn't too bad considering that part of the code is still serial. For smaller brushes the speedup is much smaller. Unfortunately I don't have a quad-core system to test on; it would be interesting to see how it scales. I'm still wondering why the QtConcurrent code doesn't reach 100% CPU utilisation. The benchmark should be big enough to reach the maximum.
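As a rough back-of-the-envelope estimate, assuming Amdahl's law applies and the parallel part scales perfectly across the two physical cores (ignoring Hyper-Threading and threading overhead), the measured times imply a parallel fraction p of about 80% and a projected quad-core speedup of about 2.5. This is an idealized projection under those assumptions, not a measurement.

```latex
S_2 = \frac{9359}{5621} \approx 1.67, \qquad
\frac{1}{S_N} = (1 - p) + \frac{p}{N}

p = 2\left(1 - \frac{1}{S_2}\right) = 2\left(1 - \frac{5621}{9359}\right) \approx 0.80

S_4 = \frac{1}{(1 - p) + p/4} \approx \frac{1}{0.20 + 0.20} \approx 2.5
```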

Since Krita is currently in feature freeze, the code won't make it into Krita 2.3.

8 Responses to “Parallel brush mask processing”

  1. Silvio Grosso Says:

    THANKS a lot indeed for your work on Krita.

    Krita 2.3 is shaping up very well and it will be a great release 🙂

  2. Kubuntiac Says:

    Sweet! I’m loving all the speed-ups Krita’s been getting. These are the kinds of things virtually *every* user will end up benefiting from. Great stuff, Sven!

    By the way, I’ve got a quad core I can donate to the testing cause if you feel like telling me how to do the test 🙂

  3. Milian Wolff Says:

    While maybe a bit more work (not much imo), you could try out a plain QThread. QtConcurrent is said to have a big overhead for its convenience.

  4. morice-net Says:

    That’s a nice work Sven !
    Thank you

  5. maninalift Says:

    Does the code use Eigen? I don’t know much about it, but there are #defines for turning on different kinds of parallelisation (e.g. OpenMP) for the vector operations. I don’t know whether that would be more of a win than parallelising the algorithms.

  6. slangkamp Says:

    @Milian Wolff: I just tried with QThread and it turned out a bit slower than the QtConcurrent version.

    @maninalift: The code doesn’t use Eigen, as there are no vector operations.

  7. adamce Says:

    >>I’m still wondering why the QtConcurrent code doesn’t run with 100% CPU utilisation. The benchmark should be big enough to reach the maximum.<<
    Probably because of Hyper-Threading: the cores are running at 100%, but each core is divided into two threads, so each thread only reaches 50%.

  8. slangkamp Says:

    No, OpenMP manages to get 100% on all four threads, so that should be possible.
