Introduction to Fast Bezier (and Trying the Accelerate.framework)

[If you want the answer to last time’s homework, skip to the end.]

So you want to hand-calculate Bézier curves. Good for you. It comes up more often then you’d think on iOS, even though UIBezierPath is supposed to do it all for you. The truth is, sometimes you need the numbers yourself. For instance if you want to calculate intersections, or you want to draw text along the curve (like in CurvyText from iOS:PTL chapter 18). Just as a refresher, here’s the cubic Bézier function in C, written in the simplest possible way (we’re only going to discuss the cubic Bézier function here):

static CGFloat Bezier(CGFloat t, CGFloat P0, CGFloat P1, CGFloat P2, CGFloat P3) {
  	  return 
	powf(1-t, 3) * P0
	+ 3 * powf(1-t, 2) * t * P1
	+ 3 * (1-t) * powf(t, 2) * P2
	+ powf(t, 3) * P3;
}

If you want to see this function in use, pull down CurvyText and look in CurvyTextView.m.

P0 through P3 are the control points. If you called addCurveToPoint:controlPoint1:controlPoint2:, then the starting point (wherever you “were”) would be P0. The end point would be P3 and the control points would be P1 and P2. t is a value that runs from 0 to 1; we’ll talk about it more in a moment. If you’re still really lost, you should take time out and read the Wikipedia article on Bézier curves.

The first thing to understand is that this is exactly the same function that UIBezierPath uses. (Well, exactly equivalent to the function they use. I’m certain they don’t use this one because it’s pretty inefficient as we’ll see.) So when we start calculating, we’re going to match UIBezierPath exactly. That’s great because it means you can mix-and-match your code and Core Graphics and everything will match-up.

The next thing to understand is that in this form, the Bézier curve function is effectively one-dimensional. We hand it an “x” value and we get back an “x” value. We hand it a “y” value and we get back a “y” value. So we have to call it twice per point. That’ll be important later for optimization purposes.

Finally, the strangest thing you need to understand is that this function is not linear for t. At t=0, the equation evaluates to P0. At t=1, the equation evaluates to P3. (The equation never evaluates to P1 or P2 since Bézier curves do not pass through these control points.) But between 0 and 1, things get much more complicated. t=0.5 does not represent “half-way along the curve.” Small changes in t can represent very small movement or very large movement along the curve depending on where you are on it. This fact is one of the headaches of Bézier curves.

Our goal is to figure out how to calculate Bézier very fast. So we’re not going to worry much about drawing. We’re just going to investigate ways to calculate this function quickly. Hopefully the process will help you when you’re investigating other functions.

First, let’s build a test project. You can get the full test code in BezierPerf. Starting with a basic project, you want this in didFinishLaunchingWithOptions: to simulate 100 curves drawn at 50 frames per second:

#define BEZIER_FUNC copyBezierSimple

const unsigned kFPS = 50;
const unsigned kNumCurves = 100;

CGPoint P0 = {50, 500};
CGPoint P1 = {300, 300};
CGPoint P2 = {400, 700};
CGPoint P3 = {650, 500};

CFAbsoluteTime start = CFAbsoluteTimeGetCurrent();
CGPoint lastPoint = {0};
for (unsigned i = 0; i < kFPS * kNumCurves; ++i) {
  CGPoint *points;
  unsigned count = BEZIER_FUNC(P0, P1, P2, P3, &points);
  lastPoint = points[count-1];
  free(points);
}
printf("Time: %f\n", CFAbsoluteTimeGetCurrent() - start);
printf("(%f, %f)\n", lastPoint.x, lastPoint.y );
exit(0);

Now we’ll start with the simplest approach (Bezier.c):

unsigned int copyBezierSimple(CGPoint P0, CGPoint P1, CGPoint P2, CGPoint P3, CGPoint **results) {
  *results = calloc(kSteps + 1, sizeof(struct CGPoint));

  for (unsigned step = 0; step <= kSteps; ++step) {
    CGFloat x = Bezier((CGFloat)step/(CGFloat)kSteps, P0.x, P1.x, P2.x, P3.x);
    CGFloat y = Bezier((CGFloat)step/(CGFloat)kSteps, P0.y, P1.y, P2.y, P3.y);
    (*results)[step] = CGPointMake(x, y);
  }
  return kSteps + 1;
}

This just calls Bezier() twice. On my iPhone 4, in Debug mode, with 1000 points, this takes over 20 seconds to run. In Release mode (with optimizations) it takes just over 3 seconds. Wow. 6x improvement. We’ll look into what it’s doing later, but 3s is still no good. We need these calculations every second, and we haven’t even drawn anything yet. Time to start optimizing.

It’s math, so Accelerate.framework is going to make it magically fast, right? I mean, everyone knows that Accelerate uses the fancy vector hardware and that means super-duper fast. So let’s do that. We’ll rewrite Bezier() to use Accelerate. This calculates P*B*T as we discussed in the last post.

static CGFloat BezierAccelerate(CGFloat t, CGFloat P0, CGFloat P1, CGFloat P2, CGFloat P3) {
  const CGFloat P[1][4] = {P0, P1, P2, P3};

  static const CGFloat B[4][4] = 
  { {-1,  3, -3, 1},
    { 3, -6,  3, 0},
    {-3,  3,  0, 0},
    { 1,  0,  0, 0}};
  
  const CGFloat T[4][1] = { powf(t, 3), powf(t, 2), t, 1 };
  
  CGFloat PB[1][4];
  vDSP_mmul((CGFloat*)P, 1, (CGFloat*)B, 1, (CGFloat*)PB, 1, 1, 4, 4);
  
  CGFloat result[1][1];
  vDSP_mmul((CGFloat*)PB, 1, (CGFloat *)T, 1, (CGFloat*)result, 1, 1, 1, 4);
  
  return result[0][0];
}

This is going to be great. We got rid of that long crazy function and replaced it with a couple of elegant matrix multiples running on the NEON vector processor. I can feel the speed already.

11s. In Release mode, with optimizations.

OK, so it’s twice as fast as the original Debug code. But it’s almost 4 times slower than just turning on -Os with the original function. What? Isn’t Accelerate supposed to make all things faster? We haven’t even tried to optimize the original Bezier() function, and still Accelerate is slower.

Lesson 1: Accelerate is not magic “go fast” fairy dust. It may even be slower than the simplest possible non-Accelerate implementation.

This is not to say that Accelerate isn’t awesome. It can be for certain problems. But day-to-day multiplication is often not one of them.

I promised we’d talk about what -Os is doing and why it’s giving us a 6x speedup. First, you can take a look at the code by editing the Scheme and setting the Build Configuration to Debug. Then use Product, Generate Output, Assembly File to see what happens. At the bottom of the assembler output window, you will see “Show Assembly Output For.” Switching between “Running” and “Archiving” will let you compare the debug and release output.

The easiest way to navigate this output is by searching for .thumb_func which will help you find your functions. You can then look for blx to find the places you make function calls. The .loc entries tell you where you are in the source file (filenumber, line, column). Handy, right? In debug mode, you can see the Bezier function and the four calls to powf(). But in release mode, the Bezier function is gone entirely. It’s been inlined. This is a big deal. Function calls can be expensive. This is one of the big reasons that hand-calculation can be so much faster than Accelerate. Accelerate is optimized for big matrices. For little tiny things, the cost of the function call dwarfs what you might save otherwise. Also, most Accelerate functions include fancy features like adjustable stride and parameter error checking. “Fancy features” translates into “probably slower.”

Back to the assembler, note that there are only two calls to powf(). Where did the other calls go? Well, if you dig a little, you’ll notice these lines:

vmul.f32	d16, d0, d0
...
vmul.f32	d17, d1, d1

These are squaring functions (d0*d0 -> d16). That is incredibly, insanely faster than pow(x,2). We got rid of a lot of function calls (both calls to Bezier and calls to powf), and we replaced a very expensive squaring function with a very cheap one using vmul.

Which brings us to the next point. vmul is a NEON instruction. You don’t have to use Accelerate to get NEON benefits. The compiler uses the NEON processor for all kinds of things.

So now that we know that the compiler will replace pow(x,2) with x*x, I wonder if cubing by hand is faster, too. Let’s try BezierNoPow():

static CGFloat BezierNoPow(CGFloat t, CGFloat P0, CGFloat P1, CGFloat P2, CGFloat P3) {
  return
    (1-t)*(1-t)*(1-t) * P0
    + 3 * (1-t)*(1-t) * t * P1
    + 3 * (1-t) * t*t * P2
    + t*t*t * P3;
}

Debug: 9.7s. Nice. 50% improvement over unoptimized powf().

Release: 0.6s. 5x speedup over optimized powf(). 30x speedup over the original function.

In fairness, we should try removing the powf() calls in the Accelerate function. That gets us to 6.5s. Still 10x slower then hand-coded.

Now 0.6s is still 60% of our available time, so it’s worth digging deeper to see what else we can do to speed things up. We’ll talk more about that later. We haven’t really gotten warmed up yet. We haven’t even launched Instruments. There is still plenty of speed we can pick up here. Read more at Even Faster Bezier.

For those of you who did your homework from last time, here’s the answer, this time in C, for the matrix you need to calculate the derivative (slope) of the Bézier function:

const float Bp[4][3] = {
  {-3, 6, -3},
  {9, -12, 3},
  {-9, 6, 0},
  {3, 0, 0}
};