When last we looked at Bézier curve calculations, we were able to calculate five million points in about 0.6s (~8.3Mp/s or megapointspersecond). That’s 1000 points per curve, 100 curves, at 50fps. That was 5x faster than the original Os
optimized function. But we’re just getting warmed up. We haven’t yet gotten half of the performance available.
In this installment, we’ll look at improving our algorithm. The code is available on github.
We tried the Accelerate framework, but it didn’t help us. The cost of the function calls obliterated our gains. What can we do? First, let’s look at the code again, and see if we’re doing anything foolish.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 

Notice how we’re recalculating a lot of things. For example, we calculate (1t)*(1t)*(1t)
twice with the same t
. That can’t be good. What if we factor out the part that doesn’t change between x and y?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 

Hey, that gets us from 0.6s to 0.5s (10Mp/s). A 17% improvement’s pretty good. But let’s think about this some more. The values t
can take are exactly dependent on kSteps
, which is a constant for the program. And since the C
variables depend only on t
, that means they’re a fixed set as well. We should only have to calculate them once for the whole program. That seems a lot of work we don’t need to do. Let’s see how it turns out.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 

0.16s. 31Mp/s. That’s over 3x faster by just calculating the piece that changes.
Lesson 2: In most cases, your biggest improvements will come from changing your algorithm. Whenever possible, get expensive things out of loops. Don’t make a calculation fast if you can get rid of the calculation entirely. Remember that if you’re called many times, that’s the same as a loop.
The cost of this is 4 floats (16 bytes) per step to store the constants. So for a 1000 step curve, that’s less than 16kB. Not a bad investment on iOS. This cost is for as many curves as you want, as long as they all use the same step size. Of course, if you want different numbers of steps, you could just pass a scale variable to calculate every other point, every fourth point, etc. But by the time we’re done optimizing this (and there’s still plenty of performance left to unlock), you may find that it’s faster and easier just to calculate the same number of points for all curves.
There is another common way to speed up Bézier calculation. Hannu Kankaanpää wrote an excellent article explaining forward differencing using a Taylor series. His approach is fast. About 5060% faster than copyBezierXY()
. But copyBezierTable()
is about twice as fast as forward differencing if you calculate a lot of curves with the same step size. Forward differencing is fast if you have one incredibly expensive curve to calculate (say a large Bézier surface). But it loses when you need to calculate a lot of curves. Factoring out everything but the points themselves into a precalcuated table lets you skip almost all the work. And that’s the goal.
We still haven’t pulled out Instruments, and we’re still writing in portable C. I wonder what we might get if we go offroad and write directly for the NEON coprocessor. Yes, that means we’re moving onto ARM assembler in the next post. Think you can’t beat the compiler? Think it’s not worth it to try? Think again.