I have not explicitly obtained permission from the posters to reproduce their postings. Anyone who objects should mail me at the address given at the end of this web page.
I have felt free to edit the postings to avoid repetition, to change the formatting, and to cut out anything I deemed irrelevant. The order of the routines in this document is more or less the order in which they were posted in the thread. Some of the routines are better than others.
I have no intention of maintaining a FAQ for comp.sys.arm; however, I have also archived another thread on square root routines.
[2012-06-07: ManuTOO pointed out to me that the original article appears to contain an error. This is marked, and corrections suggested, in text highlighted in the same manner as this paragraph.]
When doing graphics, e.g. raytracing, it is often necessary to normalize vectors, i.e. given (x,y) to calculate (x/l, y/l) where l = sqrt(x^2+y^2), or similarly for (x,y,z) in 3D. This is quite expensive to calculate on the ARM, since there is no built-in divide or square root. The usual procedure is either to calculate l using a software square root and then use a software divide two or three times afterwards, or to compute 1/sqrt(...) once and multiply by this, which is faster when you have hardware multiply (especially on the StrongARM). However, 1/sqrt(k) is normally (to my knowledge) done using first a software square root and then a software divide. My idea is that you can do 1/sqrt(k) faster than this.
I assume that k is normalized so that 0.25<=k<1, as is usual for square root routines. This means that the result is 1<r<=2.
We must find r = 1/sqrt(k). This is the same as solving k*r^2 = 1. We now try to do this by a binary search:
r = 1.0;
for (i = 1; i <= precision; i++)
  if (k*(r+0.5^i)^2 <= 1.0) then r += 0.5^i;

Now, the squaring and the multiplication are expensive, so this is not a good way of doing it. We may, however, do these operations incrementally, using

  k*(r+0.5^i)^2 = k*r^2 + k*0.5^(2*i) + 2*k*r*0.5^i.
So we use variables kr2 to hold k*r^2 and kr to hold 2*k*r.
r = 1.0; kr2 = k; kr = 2*k;
for (i = 1; i <= precision; i++) {
  t = kr2 + k*0.5^(2*i) + kr*0.5^i;
  if (t <= 1.0) {
    r += 0.5^i;
    kr2 = t;
    kr += 2*k*0.5^i;
  }
}

Note that if the result should be exactly 2, the computed result is 2-0.5^precision. If you consider this a problem, you can add a test for k=0.25 before the loop.
We don't like to have numbers greater than 1, so we calculate q = 0.5*r instead of r. We use
k*(r+0.5^i)^2 = k*(2*q+0.5^i)^2
              = 4*k*q^2 + 4*k*q*0.5^i + k*0.5^(2*i)
              = 4*(k*q^2 + k*q*0.5^i + k*0.5^(2*i+2)).

We now use kq2 = k*q^2, kq = k*q.
[2012-06-07: Warning: the following code appears to contain an error. It is quoted exactly as it appears in the article; the problem is discussed after the code.]
q = 0.0; kq2 = 0.0; kq = 0.0;
for (i = 1; i <= precision; i++) {
  t = kq2 + kq*0.5^i + k*0.5^(2*i+2);
  if (t <= 0.25) {
    q += 0.5^(i+1);
    kq2 = t;
    kq += k*0.5^(i+1);
  }
}
[2012-06-07: If the code has been transformed using the substitution q = 0.5*r then, since the previous code was initialised with r = 1.0, we'd expect this code to be initialised with q = 0.5. The fix is simply to alter the initialisation to q = 0.5; kq2 = 0.25 * k; kq = 0.5 * k.]
When making this into ARM code, we assume numbers in 0<=x<1 are represented as integers holding x*2^32. We unroll the loop to avoid register-controlled shifts.
[2012-06-07: The article later warns that this ARM code is untested and so may contain errors. As well as the initialisation error already discussed, there appear to be three other issues: 1) the second ADD in the calculation of t is missing a parameter; it should be ADD R4,R4,R0,LSR#n. 2) Compared to the earlier code, the comparison has been inverted: <= has been encoded as condition code HS, but I think it should be LS. 3) The original code also uses 2^n for constants; I suspect most assemblers would prefer 1<<n. I'll quote the original article first and then my suggested, but untested, fix.]
\ At entry R0 holds k, 0.25<=k<1
\ At exit R1 holds q = r/2, where r = 1/sqrt(k).
\ kq2 is in R2, kq is in R3
\ R4 is not preserved
.rsqrt
MOV R1, #0                \ q = 0.0
MOV R2, #0                \ kq2 = 0.0
MOV R3, #0                \ kq = 0.0

ADD R4,R2,R3,LSR#1        \ t = kq2+kq*0.5^1
ADD R4,R0,LSR#4           \ t += k*0.5^4
CMP R4,#2^30              \ if t<=0.25
ADDHS R1,R1,#2^30         \ q += 0.5^2
MOVHS R2,R4               \ kq2 = t
ADDHS R3,R3,R0,LSR#2      \ kq += k*0.5^2

...

ADD R4,R2,R3,LSR#14       \ t = kq2+kq*0.5^14
ADD R4,R0,LSR#30          \ t += k*0.5^30
CMP R4,#2^30              \ if t<=0.25
ADDHS R1,R1,#2^17         \ q += 0.5^15
MOVHS R2,R4               \ kq2 = t
ADDHS R3,R3,R0,LSR#15     \ kq += k*0.5^15

ADD R4,R2,R3,LSR#15       \ t = kq2+kq*0.5^15
CMP R4,#2^30              \ if t<=0.25
ADDHS R1,R1,#2^16         \ q += 0.5^16
MOVHS R2,R4               \ kq2 = t
ADDHS R3,R3,R0,LSR#16     \ kq += k*0.5^16

...

ADD R4,R2,R3,LSR#30       \ t = kq2+kq*0.5^30
CMP R4,#2^30              \ if t<=0.25
ADDHS R1,R1,#2            \ q += 0.5^31
MOVHS R2,R4               \ kq2 = t
ADDHS R3,R3,R0,LSR#31     \ kq += k*0.5^31

ADD R4,R2,R3,LSR#31       \ t = kq2+kq*0.5^31
CMP R4,#2^30              \ if t<=0.25
ADDHS R1,R1,#1            \ q += 0.5^32

MOV PC,R14
[2012-06-07: Here's my proposed correction:
\ At entry R0 holds k, 0.25<=k<1
\ At exit R1 holds q = r/2, where r = 1/sqrt(k).
\ kq2 is in R2, kq is in R3
\ R4 is not preserved
.rsqrt
MOV R1, #1<<31            \ q = 0.5
MOV R2, R0,LSR#2          \ kq2 = k * 0.25
MOV R3, R0,LSR#1          \ kq = k * 0.5

ADD R4,R2,R3,LSR#1        \ t = kq2+kq*0.5^1
ADD R4,R4,R0,LSR#4        \ t += k*0.5^4
CMP R4,#1<<30             \ if t<=0.25
ADDLS R1,R1,#1<<30        \ q += 0.5^2
MOVLS R2,R4               \ kq2 = t
ADDLS R3,R3,R0,LSR#2      \ kq += k*0.5^2

...

ADD R4,R2,R3,LSR#14       \ t = kq2+kq*0.5^14
ADD R4,R4,R0,LSR#30       \ t += k*0.5^30
CMP R4,#1<<30             \ if t<=0.25
ADDLS R1,R1,#1<<17        \ q += 0.5^15
MOVLS R2,R4               \ kq2 = t
ADDLS R3,R3,R0,LSR#15     \ kq += k*0.5^15

ADD R4,R2,R3,LSR#15       \ t = kq2+kq*0.5^15
CMP R4,#1<<30             \ if t<=0.25
ADDLS R1,R1,#1<<16        \ q += 0.5^16
MOVLS R2,R4               \ kq2 = t
ADDLS R3,R3,R0,LSR#16     \ kq += k*0.5^16

...

ADD R4,R2,R3,LSR#30       \ t = kq2+kq*0.5^30
CMP R4,#1<<30             \ if t<=0.25
ADDLS R1,R1,#2            \ q += 0.5^31
MOVLS R2,R4               \ kq2 = t
ADDLS R3,R3,R0,LSR#31     \ kq += k*0.5^31

ADD R4,R2,R3,LSR#31       \ t = kq2+kq*0.5^31
CMP R4,#1<<30             \ if t<=0.25
ADDLS R1,R1,#1            \ q += 0.5^32

MOV PC,R14
]
Note that the component 0.5^(2*i+2) of the sum becomes less than the precision after the first 14 iterations, so we omit it. Hence, the time used is 3+14*6+16*5+3 = 170 cycles, not counting call/return. This is only slightly more than a square root, and certainly less than a square root followed by a divide. Furthermore, the code takes up less space (in the cache) than a fully unrolled divide plus a fully unrolled square root. I don't expect the precision to be any worse, as you don't accumulate errors.
Note that the code is not tested, so there might be small errors. I'm sure the general principle is fine, but I may have miscalculated some of the shift-counts in the assembly code etc.
Colin suggested using the iteration

  r[n+1] = r[n] * (1.5 - k/2 * r[n] * r[n])

which doubles the number of bits of accuracy each time.
Colin hadn't worked out the details of the time and memory usage of an ARM version of this method, but suggested that code and table space usage would be comparable to Torben's unrolled loop, and that this method should `yield a result more quickly'. Sadly, he left the implementation details to any interested reader.
Wilco Dijkstra also suggested the look-up table method might be useful when fewer than 32 bits of accuracy were required. He commented that he used 128-entry look-up tables with interpolation to calculate 16-bit fixed point sines, cosines and square roots. He said that this method gave him about 15.5-bit precision in 22 ARM710 cycles, which he reckoned would be about 13 StrongARM cycles including the return. He estimated that a 512 or 1024 entry table might give 20 bits of precision.
This looks much worse than the simultaneous approach for an ARM710. However, on an ARM7DM or a StrongARM, where full 32-bit multiplies are fast, the multiplies only take 6 cycles each. This reduces the time to 5*32+2*6=172 cycles [178 cycles], neglecting any fix-up code.
This method assumes a modified 5 cycle/bit 1/sqrt calculation which Wilco went on to describe.
It's clear that approach 4 wins for an ARM710 which is the only case that Wilco considered in detail. I think that on a chip with a fast long multiply instruction, the calculations would have to be redone.
Approach 4 relies on a simultaneous calculation described by Wilco. Wilco first modified Torben's routine to use 5 cycles/bit instead of the 5.5 Torben's original needed. He also altered the entry conditions: for the original routine, k had to satisfy 0.25<=k<1; Wilco suggested 0.25<k<=1, as that guarantees that 1/l is less than 2, allowing 31-bit fixed point calculations to be used instead of 30-bit.
Wilco's routine follows; it calculates both l and 1/l at 5 cycles/bit:
; BITS is the required precision, range 2..31.
; IN : n is the number from which the root and inverse root are calculated
; OUT: root0 = SQRT (n)
;      root1 = 1/SQRT (n)
; TMP: rrn, t

.root_inv
MOV rrn, #1 << (BITS - 1)
MOV root0, n
MOV root1, #1 << BITS

.root1
ADD t, root0, n, LSR #1
RSBS rrn, t, rrn
ADDLO rrn, rrn, t
ADDHS root0, root0, n, LSR #i
ADDHS root1, root1, #1 << (BITS - 1)

[ unroll for i = 2 .. BITS-1
ADD t, root0, n, LSR #i + 1
RSBS rrn, t, rrn, LSL #1
ADDLO rrn, rrn, t
ADDHS root0, root0, n, LSR #i
ADDHS root1, root1, #1 << (BITS - i)
]

.rootend
RSBS rrn, root0, rrn, LSL #1
ADDHS root1, root1, #1

Wilco noted that for BITS=31 this routine takes 155 cycles, whereas Torben's original took 176 for BITS=30; for comparison, the new routine takes 150 cycles for BITS=30.
Wilco also noted that the new routine is suitable for looping with the following skeleton:
..
ORR root1, root1, #1 << ...  ; for fast looping
[ .root1 ]
.loop
[ unroll K times ]
MOV n, n, LSR #K
MOVS root1, root1, LSL #K    ; make sure carry is set
BCC loop
[ .rootend ]

What value you choose for K depends on how much code you want to save. Wilco suggested that for a 27-bit version you could take K=8, whereas for a 31-bit version, take K=7 or 14. He added that a fully unrolled loop makes little sense on a cached system (640 bytes): for example, on an ARM7, one cache miss would add a 20-cycle penalty, while with K=14 the average penalty would be half as big.
Finally, Wilco delivered the modified version of the code that calculates x*1/l and y*1/l simultaneously (I've added code for z*1/l):
MOV xres, #0
MOV yres, #0
[ MOV zres, #0 ]

[
ADD t, root0, n, LSR #i + 1
RSBS rrn, t, rrn, LSL #1
ADDLO rrn, rrn, t
ADDHS root0, root0, n, LSR #i
; ADDHS root1, root1, #1 << (BITS - i)  -> REMOVE (if you don't want 1/l)
ADDHS xres, xres, x, LSR #i
ADDHS yres, yres, y, LSR #i
[ ADDHS zres, zres, z, LSR #i ]
]