I have not explicitly obtained permission from the posters to reproduce their postings. Anyone who objects should mail me at the address given at the end of this web page.
I have felt free to edit the the postings to avoid repetition, to change the formatting and to cut out anything I deemed irrelevant. The order of the routines in this document is more or less the order in which they were posted in this thread. Some of the routines are better than others.
I have no intention of maintaining a FAQ for comp.sys.arm, however, I also have archived another thread on square root routines.
I assume that k is normalized so. 0.25<=k<1, as is usual for square root routines. This means that the result is 1<r<=2.
We must find r = 1/sqrt(k). This is the same as solving k*r^2 = 1. We now try to do this by a binary search:
r = 1.0;
for (i = 1; i <= precision; i++)
if (k*(r+0.5^i)^2 <= 1.0) then r += 0.5^i;
Now, the squaring and the multiplication are expensive, so this is
not a good way of doing it. We may, however, do these operations
incrementally, using
k*(r+0.5^i)^2 = k*r^2+k*0.5^(2*i)+2*k*r*0.5^i.
So we use variables kr2 to hold k*r^2 and kr to hold 2*k*r.
r = 1.0; kr2 = k; kr = 2*k;
for (i = 1; i <= precision; i++) {
t = kr2+k*0.5^(2*i)+kr*0.5^i;
if (t <= 1.0) {
r += 0.5^i;
kr2 = t;
kr += 2*k*0.5^i;
}
}
Note that if the result should be exactly 2, the computed result is
2-0.5^precision. If you consider this a problem, you can add a test
for k=0.25 before the loop.
We don't like to have numbers greater than 1, so we calculate q = 0.5*r instead of r. We use
k*(r+0.5^i)^2 = k*(2*q+0.5^i)^2
= 4*k*q^2+4*k*q*0.5^i+k*0.5^(2*i)
= 4*(k*q2+k*q*0.5^i+k*0.5^(2*i+2)).
We now use kq2 = k*q^2, kq = k*q.
q = 0.0; kq2 = 0.0; kq = 0.0;
for (i = 1; i <= precision; i++) {
t = kq2+kq*0.5^i+k*0.5^(2*i+2);
if (t <= 0.25) {
q += 0.5^(i+1);
kq2 = t;
kq += k*0.5^(i+1);
}
}
When making this into ARM code, we assume numbers in 0<=x<1 are
represented as integers holding x*2^32. We unroll the loop to avoid
register-controlled shifts.
\ At entry R0 holds k, 0.25<=k<1
\ At exit R1 holds q = r/2, where r = 1/sqrt(k).
\ kq2 is in R2, kq is in R3
\ R4 is not preserved
.rsqrt
MOV R1, #0 \ q = 0.0
MOV R2, #0 \ kq2 = 0.0
MOV R3, #0 \ kq = 0.0
ADD R4,R2,R3,LSR#1 \ t = kq2+kq*0.5^1
ADD R4,R0,LSR#4 \ t += k*0.5^4
CMP R4,#2^30 \ if t<=0.25
ADDHS R1,R1,#2^30 \ q += 0.5^2
MOVHS R2,R4 \ kq2 = t
ADDHS R3,R3,R0,LSR#2 \ kq += k*0.5^2
...
ADD R4,R2,R3,LSR#14 \ t = kq2+kq*0.5^14
ADD R4,R0,LSR#30 \ t += k*0.5^30
CMP R4,#2^30 \ if t<=0.25
ADDHS R1,R1,#2^17 \ q += 0.5^15
MOVHS R2,R4 \ kq2 = t
ADDHS R3,R3,R0,LSR#15 \ kq += k*0.5^15
ADD R4,R2,R3,LSR#15 \ t = kq2+kq*0.5^15
CMP R4,#2^30 \ if t<=0.25
ADDHS R1,R1,#2^16 \ q += 0.5^16
MOVHS R2,R4 \ kq2 = t
ADDHS R3,R3,R0,LSR#16 \ kq += k*0.5^16
...
ADD R4,R2,R3,LSR#30 \ t = kq2+kq*0.5^30
CMP R4,#2^30 \ if t<=0.25
ADDHS R1,R1,#2 \ q += 0.5^31
MOVHS R2,R4 \ kq2 = t
ADDHS R3,R3,R0,LSR#31 \ kq += k*0.5^31
ADD R4,R2,R3,LSR#31 \ t = kq2+kq*0.5^31
CMP R4,#2^30 \ if t<=0.25
ADDHS R1,R1,#1 \ q += 0.5^32
MOV PC,R14
Note that the component 0.5^(2*i+2) of the sum becomes less than the
precision after the first 14 iterations, so we omit it. Hence, the
time used is 3+14*6+16*5+3 = 170 cycles, not counting call/return.
This is only slightly more than a square root, and certainly less
than a square root followed by a divide. Furthermore, the code takes
up less space (in the cache) than a fully unrolled divide plus a
fully unrolled square root. I don't expect the precision to be any
worse, as you don't accumulate errors.
Note that the code is not tested, so there might be small errors. I'm sure the general principle is fine, but I may have miscalculated some of the shift-counts in the assembly code etc.
r[n+1] = r[n] * (1.5 - k/2 * r[n] * r[n])
which doubles the number of bits of accuracy each time.
Colin hadn't worked out the details of the time and memory usage of an ARM version of this method, but suggested that code and table space usage would be comparable to Torben's unrolled loop, and that this method should `yield a result more quickly'. Sadly, he left the implementation details to any interested reader.
Wilco Dijkstra also suggested the look-up table method might be useful when fewer than 32 bits of accuracy were required. He commented that he used 128 entry look-up tables with interpolation to calculate 16 bit fixed point sines, cosines and square roots. He said that this method gave him about 15.5 bit precision in 22 ARM710 cycles, he reckoned this would be about 13 StrongARM cycles including the return. He estimated that a 512 or 1024 entry table might give 20 bits of precision.
This looks much worse than the simultaneous approach for an ARM 710. However, on an ARM7DM or a StrongARM where full 32 bit multiplies are fast, the multiplies only take 6 cycles each. This reduces the time to 5*32+2*6=172 cycles [178 cycles] neglecting any fix up code.
This method assumes a modified 5 cycle/bit 1/sqrt calculation which Wilco went on to describe.
It's clear that approach 4 wins for an ARM710 which is the only case that Wilco considered in detail. I think that on a chip with a fast long multiply instruction, the calculations would have to be redone.
Approach 4 relies on a simultaneous calculation described by Wilco. Wilco first modified Torben's routine to use 5 cycles/bit instead of the 5.5 Torben's original needed. He also altered the entry conditions. For the original routine, k had to satisfy 0.25<=k<1, Wilco suggested 0.25<k<=1 as that guarantees that 1/l is less than 2 allowing 31 bit fixed point calculations to be used instead 30 bit.
Wilco's routine follows, it calculates both l and 1/l at 5 cycles/bit
; BITS is the required precision, range 2..31.
; IN : n is the number from which the root and inverse root are calculated
; OUT: root0 = SQRT (n)
; root1 = 1/SQRT (n)
; TMP: rrn, t
.root_inv
MOV rrn, #1 << (BITS - 1)
MOV root0, n
MOV root1, #1 << BITS
.root1
ADD t, root0, n, LSR #1
RSBS rrn, t, rrn
ADDLO rrn, rrn, t
ADDHS root0, root0, n, LSR #i
ADDHS root1, root1, #1 << (BITS - 1)
[ unroll for i = 2 .. BITS-1
ADD t, root0, n, LSR #i + 1
RSBS rrn, t, rrn, LSL #1
ADDLO rrn, rrn, t
ADDHS root0, root0, n, LSR #i
ADDHS root1, root1, #1 << (BITS - i)
]
.rootend
RSBS rrn, root0, rrn, LSL #1
ADDHS root1, root1, #1
Wilco noted that for BITS=31, this routine takes 155 cycles, whereas Torben's
original took 176 for BITS=30, for comparison, the new routine takes 150
cycles for BITS=30.
Wilco also noted that the new routine is suitable for looping with the following skeleton:
..
ORR root1, root1, #1 << ... ; for fast looping
[ .root1 ]
loop
[ unroll K times ]
MOV n, n, LSR #K
MOVS root1, root1, LSL #K ; make sure carry is set
BCC loop
[ .rootend ]
What value you choose for K depends on how much code you want to save. Wilco
suggested that for a 27 bit version you could take K=8 whereas for a 31 bit
version, take K=7 or 14. He added that a fully unrolled loop makes little
sense on a cached system (640 bytes), for example, on an ARM7, one cache miss
would add a 20 cycle penalty, while with K=14 the average penalty would be
half as big.
Finally, Wilco delivered the modified version of the code that calculates x*1/l and y*1/l simultaneously (I've added code for z*1/l).
MOV xres, #0
MOV yres, #0
[ MOV zres, #0 ]
[
ADD t, root0, n, LSR #i + 1
RSBS rrn, t, rrn, LSL #1
ADDLO rrn, rrn, t
ADDHS root0, root0, n, LSR #i
; ADDHS root1, root1, #1 << (BITS -i) -> REMOVE (if you don't want 1/l)
ADDHS xres, xres, x, LSR #i
ADDHS yres, yres, y, LSR #i
[ ADDHS zres, zres, z, LSR #i ]
]