Difference of two angles without branching

I am writing a simple mechanical simulation program in C(++). The problem is as follows: given two vectors, compute the angle between those vectors AND make sure it is within the allowed range of results: (M_PI; M_PI). For now, I am using a rather obvious solution:
angle = atan2(ay, ax)  atan2(by, bx); if(angle < M_PI) angle += 2*M_PI; if(angle > M_PI) angle = 2*M_PI;
However, I have to do this three times, which means six totally unpredictable branch instructions in the innermost loop of the program, that's a Bad Thing. I wonder if it is possible to somehow reduce the number of branches. I've seen an assembler trick that allows to store the bigger one of two integers without branching, but I don't know if a similar feat can be achieved using the FPU.

The basic idea here would be to do
angle = round(angle * 0.5*M_1_PI) * 2*M_PI;
This works as follows:
1) Divide by 2pi, this brings the range from [2pi,2pi] to [1,1] (actually, any range should work, but I'll use this one for explanitory purposes).
2) Round to the result to the nearest integer: if it was originally between pi and pi, step 1 will bring it between 0.5 and 0.5, so it will round to 0. If it was originally between 2pi and pi, it will go to between 1 and 0.5, and then be rounded to 1. If it was originally between pi and 2pi, it will go to between 0.5 and 1, and then be rounded to 1.
3) Multiply the result of the previous step by 2pi, and subtract it from the original angle.
The tricky part is getting that round() to be fast. If your compiler can inline it efficiently, you don't need to do anything more. Otherwise you may need some assembly....
I tested on an AMD64 machine, using GCC 4.1.1 and glibc 2.4, using "O3 march=athlon64" for optimisation flags, and with all 4 combinations of 32/64bit mode and 387/SSE floating point (aka mfpmath=387/mfpmath=sse). It seems that this combination can inline the round() call in 32bit mode, but it doesn't seem to have been implemented for 64bit yet.
Even with the inlining, however, a lot of time is spent saving and restoring the FPU's rounding mode. If you make sure it's set appropriately beforehand, you won't need to bother with that. The default (at least on Linux) is roundhalfeven is fine (and has the advantage over plain roundtonearest that the edge cases of +/0.5 round to 0, meaning that angles of +/pi won't be changed, the same as your original version)  if your system uses something else by default, the manual should tell you how to change it (and if you changed it deliberately, you should know how to change it back ). So, we just need to insert the rounding instruction itself. With GCC, this consists of
asm ("frndint" : "=t" (tmp) : "0" (tmp));
(again, see your manual if you're using something else) where tmp is a double containing the value to be rounded, and that will contain the rounded value after it completes. Note that this is a 387 instruction  I couldn't find the SSE equivalent with a quick Google  so it will be more efficient if the compiler uses 387 for everything else.
This results in no branching, as required. With the compiler set to 387 mode, it's approximately twice as fast as the version you posted; in SSE mode, it's still slightly faster than yours, but not by so much. Here's the full code I used to test/time it:
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <time.h>
inline double
clamp_angle_new(double angle)
{
double tmp = angle * 0.5*M_1_PI;// equivalent to tmp = round(tmp), requires that the FPU is in roundtonearest or roundhalfeven mode
asm ("frndint" : "=t" (tmp) : "0" (tmp));
angle = tmp * 2*M_PI;
return angle;
}inline double
clamp_angle_old(double angle)
{
if (angle < M_PI)
angle += 2M_PI;
if (angle > M_PI)
angle = 2M_PI;
return angle;
}#define ACC_THRESHHOLD 0
#define N_TESTS 80000000
double angles[N_TESTS];int
main(void)
{
printf("Choosing angles...\n");
for (int i = 0; i < N_TESTS; i++)
angles[i] = (drand48()  0.5) * 4M_PI; // [2M_PI, 2*M_PI)printf("Testing accuracy...\n");
for (int i = 0; i < N_TESTS; i++) {
double clamp_new = clamp_angle_new(angles[i]);
double clamp_old = clamp_angle_old(angles[i]);
if (fabs(clamp_new  clamp_old) > ACC_THRESHHOLD) {
fprintf(stderr, "clamp_angle_new(%f) == %f != %f == clamp_angle_old(%f)\n", angles[i], clamp_new, clamp_old, angles[i]);
return EXIT_FAILURE;
}
}clock_t start, end;
volatile double junk; // trick the optimizerprintf("Timing clamp_angle_new...\n");
start = clock();
for (int i = 0; i < N_TESTS; i++)
junk = clamp_angle_new(angles[i]);
end = clock();
printf("Took %f seconds\n", (double)(end  start) / CLOCKS_PER_SEC);printf("Timing clamp_angle_old...\n");
start = clock();
for (int i = 0; i < N_TESTS; i++)
junk = clamp_angle_old(angles[i]);
end = clock();
printf("Took %f seconds\n", (double)(end  start) / CLOCKS_PER_SEC);return EXIT_SUCCESS;
}And the (32bit) assembly output of clamp_angle_new(), to prove that it doesn't branch:
clamp_angle_new: pushl %ebp movl %esp, %ebp fldl 8(%ebp) flds .LC0 fmul %st(1), %st fmull .LC1 frndint fld %st(0) faddp %st, %st(1) fmull .LC2 leave faddp %st, %st(1) ret

[quote user="iwpg"]
The default (at least on Linux) is roundhalfeven is fine (and has the advantage over plain roundtonearest that the edge cases of +/0.5 round to 0, meaning that angles of +/pi won't be changed, the same as your original version)
[/quote]
Hmm, seems I was a little confused here. What I was calling roundtohalfeven is actually called roundtonearest (at least in the manual I'm reading) and there's no mode for what I was calling roundtonearest.

[quote user="iwpg"]
Even with the inlining, however, a lot of time is spent saving and restoring the FPU's rounding mode. If you make sure it's set appropriately beforehand, you won't need to bother with that. The default (at least on Linux) is roundhalfeven is fine (and has the advantage over plain roundtonearest that the edge cases of +/0.5 round to 0, meaning that angles of +/pi won't be changed, the same as your original version)  if your system uses something else by default, the manual should tell you how to change it (and if you changed it deliberately, you should know how to change it back ). So, we just need to insert the rounding instruction itself.
[/quote]
You could also try the rint() function, which does the same thing as the frndint instruction. You'd still be relying on the compiler's ability to inline it, but it's more portable than assembly, and might be faster than round() because it doesn't change the rounding mode.

(Woohoo, yet another post. I so hate the edit timeout.)
[quote user="iwpg"]double tmp = angle * 0.5*M_1_PI; angle = tmp * 2*M_PI;
Also, these two might be faster with brackets added:
double tmp = angle * (0.5*M_1_PI); angle = tmp * (2*M_PI);
because the compiler might not assume that multiplication is associative (it isn't really with floating point, but with some optimisation settings the compiler might pretend it is anyway).

[quote user="Tweenk"]
I am writing a simple mechanical simulation program in C(++). The problem is as follows: given two vectors, compute the angle between those vectors AND make sure it is within the allowed range of results: (M_PI; M_PI). For now, I am using a rather obvious solution:
angle = atan2(ay, ax)  atan2(by, bx);
if(angle < M_PI) angle += 2*M_PI;
if(angle > M_PI) angle = 2*M_PI;However, I have to do this three times, which means six totally unpredictable branch instructions in the innermost loop of the program, that's a Bad Thing. I wonder if it is possible to somehow reduce the number of branches. I've seen an assembler trick that allows to store the bigger one of two integers without branching, but I don't know if a similar feat can be achieved using the FPU.
[/quote]
Are you sure you need to optimize this? Profile before guessing. Guessing leads to wasted time and unnecessarily ugly code.
As a side offer, your going to waste more time with the atan2 calls than with the branches. Are you sure you need to compute the distance between the two angles? You might be better off using vector math instead. It depends on why you're calculating the difference between angles.
Alternatively, maybe you should keep track of the angles and magnitudes, rather than the coordinates, depending on your uses.
Well, good luck,
Daniel.

Vacuous requirement, as all angles are within π radians of each other. It looks like you're actually trying to map the infinite number of representations for any angle to a "canonical" representation within a given range... is the real problem testing whether two angles are sufficiently "close" to each other? Or, what is it you're doing with the angles that requires them to be in a specific range?

As has been previously mentioned, the trig functions will be your killers here far more than mispredicted branches.
If you want the cosine of the angle between them, you can take the dot product, which is ((axbx) + (ayby)), and divide it by the product of the magnitudes of the two vectors, which would be sqrt((axax + ayay) * (bxbx + byby)).
If you need the sine of the angle, you can take the magnitude of the cross product, divided by the product of the magnitudes of the two vectors, giving you ((axby)  (bxay)) / sqrt((axax + ayay) * (bxbx + byby)).
Of course, if your vectors are already normalized, you can skip the entire divisor in these calculations.
These won't give you the general broad range of results you were looking for. arcsin and arccos each have a total range of M_PI, and you are looking for a range of 2*M_PI.
The sign of the magnitude of the cross product will tell you which vector is to the right of the angle you're measuring. If all you really wanted to do is check that, you don't need to do the division at all (nor take arcsin), since the divisor will never be negative.
The real mantras here are:
1) Know what problem you're trying to solve.
2) Choose an algorithm that solves it sensibly.
3) If it's not running fast enough, profile it to see where to improve it.
4) Only after the first three steps, make that optimization change.

angle = atan2(ax * by  ay * bx, ax * bx + ay * by);

[quote user="cf3"]
[/quote]
angle = atan2(ax * by  ay * bx, ax * bx + ay * by);
I overlooked the sign convention, this should be
angle = atan2(ay * bx  ax * by, ax * bx + ay * by);

To clear up some questions that have arisen:
 I can't use vector math. My attempts at that have failed, so I tried this approach. Generally I am modelling a membrane that reacts to bending forces, not only stretching ones. The idea is to calculate the energy associated with the current angle between two segments of the membrane and then calculate the energies when the point in the middle is moved a tiny bit on the x axis and on the y axis (numeric differentiation). Surely this can be done in some other way, but I'll stick with this one for now.
 The value of atan2(...)  atan2(...) sometimes does come outside the range (M_PI;M_PI>. Consider the case when you are computing the difference between two angles when the first one is close to M_PI and the second to M_PI  you get a difference of about 2M_PI, which is obviously outside the range.
 I use DevC++ and mingw32 (so that's basically gcc).
 The assembler doesn't scare me I actually thought about deliberately including some for some wicked purposes ):>
 Even if optimizing this is superfluous, this trick is rather general, so I wanted to know it anyway.
 I can't use vector math. My attempts at that have failed, so I tried this approach. Generally I am modelling a membrane that reacts to bending forces, not only stretching ones. The idea is to calculate the energy associated with the current angle between two segments of the membrane and then calculate the energies when the point in the middle is moved a tiny bit on the x axis and on the y axis (numeric differentiation). Surely this can be done in some other way, but I'll stick with this one for now.

OK, you were right that the atan2 was taking more time than branching. By replacing the math.h atan2 with:
inline double asm_atan2(register double y, register double x) { asm( "fpatan" : "=t" (y) : "0" (x), "u" (y) : "st(1)"); return y; }
I roughly doubled the program's performance

@Tweenk said:
 I can't use vector math. My attempts at that have failed, so I tried this approach.
As I understand it, vector math will be a lot quicker (less CPU) than doing atan2, so if you're trying to optimize, you might want to try it again. However, as I am not so familiar with the gory innards of C++ I don't have as many suggestions as everyone else here. This link helped me understand using dot product to find the angle between two vectors, even if you don't end up going that route:
http://www.mvps.org/directx/articles/math/dot/index.htm  Understanding the Dot Product

I am writing a simple mechanical simulation program in C(++). The problem is as follows: given two vectors, compute the angle between those vectors AND make sure it is within the allowed range of results: (M_PI; M_PI). For now, I am using a rather obvious solution:
If you have two vectors already, then as someone else suggested use dot product of those two vectors. That can be a faster, and simpler method and you get none of this clamping issue.
First you have to normalize the vectors by dividing their components by their length.
v1 = v1 / v1.length()
v2 = v2 / v2.length()
Then when you get the dot product of them:
float dotproduct = v1.x * v2.x + v1.y * v2.y
dot product now holds cos A, where A is your angle. If you really need A in radians or degrees then you will have to use inverse cosine to get it. But for most purposes you should be able to just use that value (for example if you need the angle to test whether two vectors are within 45 degrees of one another then the test would be if (dotproduct < cos(45), or dotproduct < 0.707) with no need to convert to radians or degrees)
The only expense involved here is calculating the length of a vector which requires a square root, but I assume that's better than using an atan2.
Also if you don't like having to normalize the vectors I believe you can accomplish the same thing doing:
cos A = (v1 dot v2) / (v1.length()*v2.length())
If you have your lengths cached from elsewhere then you are laughing. You have an angle (in cos A form) using just one division, three multiplications and 2 additions.

Nope, that's not that. I have to implement something that simulates an angular spring  the force is proportional to the angle. To make matters worse, both arms of this angular spring are themselves linear springs, and the whole thing moves around. The task is to model a red blood cell as a circle consisting of material points linked together by springs, with an additional "angular spring" between two adjacent linear springs (this has to simulate the surface tension, I hope somebody understands what I mean). I ended up using numeric differentiation to calculate the forces from the angular spring.
I don't see an obvious way to use vector math here at the moment. Maybe I'll come up with something more spiffy later.