Implementation of the Discrete Fourier Transform - FFT - c++

I am working on a sound-processing project and need to transform the signal into the frequency domain. I tried to implement an FFT, but that didn't go well. I tried to understand the z-transform, and that didn't go too well either. After reading up, I found the DFT much simpler to understand, especially the algorithm. So I coded the algorithm using examples, but I do not know whether the output is right. (I don't have Matlab here and cannot find any resources to test against.) Am I going in the right direction? Here is my code so far:
#include <iostream>
#include <complex>
#include <vector>
using namespace std;
const double PI = 3.141592;
vector< complex<double> > DFT(vector< complex<double> >& theData)
{
    // Define the size of the read-in vector
    const int S = theData.size();
    // Initialise new vector with size of S
    vector< complex<double> > out(S, 0);
    for(unsigned i=0; (i < S); i++)
    {
        out[i] = complex<double>(0.0, 0.0);
        for(unsigned j=0; (j < S); j++)
        {
            out[i] += theData[j] * polar<double>(1.0, - 2 * PI * i * j / S);
        }
    }
    return out;
}
int main(int argc, char *argv[]) {
    vector< complex<double> > numbers;
    vector< complex<double> > testing = DFT(numbers);
    for(unsigned i=0; (i < testing.size()); i++)
        cout << testing[i] << endl;
}
The inputs are:
102023 102023
102023 102023
And the result:
(408092, 0)
(-0.0666812, -0.0666812)
(1.30764e-07, -0.133362)
(0.200044, -0.200043)
Any help or advice would be great, I'm not expecting a lot, but, anything would be great. Thank you :)

@Phorce is right here. I don't think there is any reason to reinvent the wheel. However, if you want to do this to understand the methodology and to have the joy of coding it yourself, I can provide a FORTRAN FFT code that I developed some years ago. Of course this is not C++ and will require a translation; this should not be too difficult and should enable you to learn a lot in doing so...
Below is a Radix 4 based algorithm; this radix-4 FFT recursively partitions a DFT into four quarter-length DFTs of groups of every fourth time sample. The outputs of these shorter FFTs are reused to compute many outputs, thus greatly reducing the total computational cost. The radix-4 decimation-in-frequency FFT groups every fourth output sample into shorter-length DFTs to save computations. The radix-4 FFTs require only 75% as many complex multiplies as the radix-2 FFTs. See here for more information.
! ===================================================================
! Description: Radix 4 is a discrete complex Fourier transform algorithm. It
! is to be supplied with two real arrays, one for the real parts of the function
! and one for the imaginary parts. It can also unscramble transformed arrays.
! Usage: calling FASTF(XREAL,XIMAG,ISIZE,ITYPE,IFAULT); we supply the
! following:
! XREAL - array containing real parts of transform sequence
! XIMAG - array containing imaginary parts of transform sequence
! ISIZE - size of transform (ISIZE = 4*2**M)
! ITYPE - +1 forward transform
! -1 reverse transform
! IFAULT - 1 if error
! - 0 otherwise
! ===================================================================
! Forward transform computes:
! X(k) = sum_{j=0}^{isize-1} x(j)*exp(-2ijk*pi/isize)
! Backward computes:
! x(j) = (1/isize) sum_{k=0}^{isize-1} X(k)*exp(2ijk*pi/isize)
! Forward followed by backward will result in the original sequence!
! ===================================================================
! Check for valid transform size up to 2**(max2):
print*,'FFT: Error: Data array < 4 - Too small!'
II = 4
IPOW = 2
! Prepare mod 2:
II = II*2
print*,'FFT: Error: FFT1!'
! Check for correct type:
print*,'FFT: Error: Wrong type of transformation!'
! No entry errors - continue:
! call FASTG to perform transformation:
! Due to Radix 4 factorisation results are not in the same order
! after transformation as they were when the data was submitted:
! We now call SCRAM, to unscramble the results:
! ===============================================================
! Description: This is the radix 4 complex discrete fast Fourier
! transform without unscrambling. Suitable for convolutions or other
! applications that do not require unscrambling. Designed for use
! with FASTF.FOR.
ZFLOAT(K) = FLOAT(K) ! Real equivalent of K.
PI = (4.0)*ZATAN(1.0)
! Forward transform:
! If this is for an inverse transform - conjugate the data:
DO 4, K = 1,N
! Perform appropriate transformations:
BCOS = -2.0*ZSIN(Z)**2
BSIN = ZSIN(2.0*Z)
CW1 = 1.0
SW1 = 0.0
! This is the main body of radix 4 calculations:
I1 = I0 + IFACA
I2 = I1 + IFACA
I3 = I2 + IFACA
XREAL(I0) = XS0 + XS2
XIMAG(I0) = YS0 + YS2
X1 = XS1 + YS3
Y1 = YS1 - XS3
X2 = XS0 - XS2
Y2 = YS0 - YS2
X3 = XS1 - YS3
Y3 = YS1 + XS3
XREAL(I2) = X1
XIMAG(I2) = Y1
XREAL(I1) = X2
XIMAG(I1) = Y2
XREAL(I3) = X3
XIMAG(I3) = Y3
! Now IF required - we multiply by twiddle factors:
7 XREAL(I2) = X1*CW1 + Y1*SW1
XIMAG(I2) = Y1*CW1 - X1*SW1
XREAL(I1) = X2*CW2 + Y2*SW2
XIMAG(I1) = Y2*CW2 - X2*SW2
XREAL(I3) = X3*CW3 + Y3*SW3
XIMAG(I3) = Y3*CW3 - X3*SW3
! Calculate a new set of twiddle factors:
TEMPR = 1.5 - 0.5*(Z*Z + SW1*SW1)
CW2 = CW1*CW1 - SW1*SW1
SW2 = 2.0*CW1*SW1
CW3 = CW1*CW2 - SW1*SW2
SW3 = CW1*SW2 + CW2*SW1
! Set up transform split for next stage:
! This is the calculation of a radix two-stage:
DO 13, K = 1,N,2
XREAL(K + 1) = XREAL(K) - XREAL(K + 1)
XIMAG(K + 1) = XIMAG(K) - XIMAG(K + 1)
! For the inverse case, conjugate and scale the transform:
Z = 1.0/ZFLOAT(N)
DO 16, K = 1,N
17 return
! ----------------------------------------------------------
!-END of subroutine FASTG.FOR.
! ----------------------------------------------------------
! ==========================================================
! Description: Subroutine for unscrambling FFT data:
! ==========================================================
INTEGER L(19),II,J1,J2,J3,J4,J5,J6,J7,J8,J9,J10,J11,J12
INTEGER J13,J14,J15,J16,J17,J18,J19,J20,ITOP,I
EQUIVALENCE (L1,L(1)),(L2,L(2)),(L3,L(3)),(L4,L(4))
EQUIVALENCE (L5,L(5)),(L6,L(6)),(L7,L(7)),(L8,L(8))
EQUIVALENCE (L9,L(9)),(L10,L(10)),(L11,L(11)),(L12,L(12))
EQUIVALENCE (L13,L(13)),(L14,L(14)),(L15,L(15)),(L16,L(16))
EQUIVALENCE (L17,L(17)),(L18,L(18)),(L19,L(19))
II = 1
ITOP = 2**(IPOW - 1)
I = 20 - IPOW
DO 5, K = 1,I
L(K) = II
L0 = II
I = I + 1
DO 6, K = I,19
II = II*2
L(K) = II
II = 0
DO 9, J1 = 1,L1,L0
DO 9, J2 = J1,L2,L1
DO 9, J3 = J2,L3,L2
DO 9, J4 = J3,L4,L3
DO 9, J5 = J4,L5,L4
DO 9, J6 = J5,L6,L5
DO 9, J7 = J6,L7,L6
DO 9, J8 = J7,L8,L7
DO 9, J9 = J8,L9,L8
DO 9, J10 = J9,L10,L9
DO 9, J11 = J10,L11,L10
DO 9, J12 = J11,L12,L11
DO 9, J13 = J12,L13,L12
DO 9, J14 = J13,L14,L13
DO 9, J15 = J14,L15,L14
DO 9, J16 = J15,L16,L15
DO 9, J17 = J16,L17,L16
DO 9, J18 = J17,L18,L17
DO 9, J19 = J18,L19,L18
J20 = J19
DO 9, I = 1,2
II = II +1
! J20 is the bit reverse of II!
! Pairwise exchange:
8 J20 = J20 + ITOP
! -------------------------------------------------------------------
! -------------------------------------------------------------------
Going through this and understanding it will take time! I wrote this using a CalTech paper I found years ago, I cannot recall the reference I am afraid. Good luck.
I hope this helps.

Your code works.
I would give more digits for PI ( 3.1415926535898 ).
Also, you have to divide the output of the DFT summation by S, the DFT size, if you want normalized amplitudes.
Since the input series in your test is constant, the DFT output should have only one non-zero coefficient.
And indeed all the output coefficients are very small relative to the first one.
But for a large input length, this is not an efficient way of implementing the DFT.
If timing is a concern, look into the Fast Fourier Transform for faster methods to calculate the DFT.

Your code looks right to me. I'm not sure what you were expecting for output but, given that your input is a constant value, the DFT of a constant is a DC term in bin 0 and zeroes in the remaining bins (or a close equivalent, which you have).
You might try testing your code with a longer sequence containing some type of waveform like a sine wave or a square wave. In general, however, you should consider using something like fftw in production code. It's been wrung out and highly optimized by many people for a long time. FFTs are optimized DFTs for special cases (e.g., lengths that are powers of 2).

Your code looks okay. out[0] should represent the "DC" component of your input waveform. In your case, it is 4 times bigger than the input value, because your normalization coefficient is 1.
The other coefficients represent the amplitude and phase of the frequency components of your input waveform. For real input the coefficients are mirrored as complex conjugates, i.e., out[i] == conj(out[N-i]). You can test this with the following code:
double frequency = 1; /* use other values like 2, 3, 4 etc. */
for (int i = 0; i < 16; i++)
numbers.push_back(sin((double)i / 16 * frequency * 2 * PI));
For frequency = 1, this gives:
which seems correct to me: negligible DC, amplitude 8 for 1st harmonics, negligible amplitudes for other harmonics.
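The check described above can be sketched in a self-contained form. This is only an illustration: the names `dft` and `sine_samples` are made up for the sketch, and the DFT is left unscaled, as in the question's code.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Direct O(N^2) DFT with no 1/N scaling (same convention as the question's code).
cvec dft(const cvec& in) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    cvec out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle =
                -2.0 * pi * static_cast<double>(k * j) / static_cast<double>(n);
            sum += in[j] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// n samples of a unit-amplitude sine that completes `cycles` periods in the window.
cvec sine_samples(std::size_t n, double cycles) {
    const double pi = std::acos(-1.0);
    cvec x(n);
    for (std::size_t i = 0; i < n; ++i)
        x[i] = std::sin(2.0 * pi * cycles * static_cast<double>(i) /
                        static_cast<double>(n));
    return x;
}
```

Calling `dft(sine_samples(16, 1.0))` should give a modulus of N/2 = 8 in bins 1 and 15 and values close to zero elsewhere, with bin 15 the complex conjugate of bin 1.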

MoonKnight has already provided a radix-4 Decimation In Frequency Cooley-Tukey scheme in Fortran. Below I provide a radix-2 Decimation In Frequency Cooley-Tukey scheme in Matlab.
The code is an iterative one and considers the scheme in the following figure:
A recursive approach is also possible.
As you will see, the implementation calculates also the number of performed multiplications and additions and compares it with the theoretical calculations reported in How many FLOPS for FFT?.
The code is obviously much slower than the highly optimized FFTW exploited by Matlab.
Note also that the twiddle factors omegaa^((2^(p - 1) * n)) can be calculated off-line and then restored from a lookup table, but this point is skipped in the code below.
For a Matlab implementation of an iterative radix-2 Decimation In Time Cooley-Tukey scheme, please see Implementing a Fast Fourier Transform for Option Pricing.
% --- Radix-2 Decimation In Frequency - Iterative approach
clear all
close all
N = 32;
x = randn(1, N);
xoriginal = x;
xhat = zeros(1, N);
numStages = log2(N);
omegaa = exp(-1i * 2 * pi / N);
mulCount = 0;
sumCount = 0;
tic
M = N / 2;
for p = 1 : numStages;
    for index = 0 : (N / (2^(p - 1))) : (N - 1);
        for n = 0 : M - 1;
            % Butterfly: sum goes to the upper half, twiddled difference to the lower half
            a = x(n + index + 1) + x(n + index + M + 1);
            b = (x(n + index + 1) - x(n + index + M + 1)) .* omegaa^((2^(p - 1) * n));
            x(n + 1 + index) = a;
            x(n + M + 1 + index) = b;
            mulCount = mulCount + 4;
            sumCount = sumCount + 6;
        end;
    end;
    M = M / 2;
end
% Decimation in frequency leaves the output in bit-reversed order
xhat = bitrevorder(x);
timeCooleyTukey = toc;
tic
xhatcheck = fft(xoriginal);
timeFFTW = toc;
rms = 100 * sqrt(sum(sum(abs(xhat - xhatcheck).^2)) / sum(sum(abs(xhat).^2)));
fprintf('Time Cooley-Tukey = %f; \t Time FFTW = %f\n\n', timeCooleyTukey, timeFFTW);
fprintf('Theoretical multiplications count \t = %i; \t Actual multiplications count \t = %i\n', ...
    2 * N * log2(N), mulCount);
fprintf('Theoretical additions count \t\t = %i; \t Actual additions count \t\t = %i\n\n', ...
    3 * N * log2(N), sumCount);
fprintf('Root mean square with FFTW implementation = %.10e\n', rms);
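Since the original question is in C++ and the answer mentions that a recursive approach is also possible, here is a minimal recursive sketch of the same radix-2 decimation-in-frequency idea. The function name `fft_dif` is an assumption for illustration; it expects a power-of-two length and makes no attempt at optimization such as precomputed twiddle factors.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Recursive radix-2 decimation-in-frequency FFT.
// x.size() must be a power of two; the result is returned in natural order.
cvec fft_dif(const cvec& x) {
    const std::size_t n = x.size();
    if (n == 1) return x;
    const double pi = std::acos(-1.0);
    const std::size_t m = n / 2;
    cvec a(m), b(m);
    for (std::size_t j = 0; j < m; ++j) {
        // Butterfly: the sum feeds the even outputs, the twiddled difference the odd ones.
        a[j] = x[j] + x[j + m];
        b[j] = (x[j] - x[j + m]) *
               std::polar(1.0, -2.0 * pi * static_cast<double>(j) / static_cast<double>(n));
    }
    const cvec even = fft_dif(a);   // X[0], X[2], X[4], ...
    const cvec odd = fft_dif(b);    // X[1], X[3], X[5], ...
    cvec out(n);
    for (std::size_t k = 0; k < m; ++k) {
        out[2 * k] = even[k];
        out[2 * k + 1] = odd[k];
    }
    return out;
}
```

Interleaving the two half-size results at the end plays the role that `bitrevorder` plays in the iterative Matlab version.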

Your code is correct to obtain the DFT.
The function you are testing is sin((double)i / points * frequency * 2 * PI), which corresponds to a sinusoid of amplitude 1, frequency 1 and sampling frequency Fs = number of points taken.
Operating with the obtained data we have:
As you can see, the DFT coefficients are symmetric about coefficient N / 2, so only the first N / 2 carry independent information. To reconstruct the amplitude of a sinusoid, take the modulus of its coefficient (computed from the real and imaginary parts), divide by N and multiply by 2. Coefficient k corresponds to the frequency k * Fs / N.
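As a rough sketch of that reconstruction rule, the helper below evaluates a single DFT coefficient of a real signal and rescales its modulus by 2/N. The names `amplitude_at_bin` and `bin_frequency` are hypothetical, and the 2/N factor applies only to bins strictly between 0 and N/2.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Amplitude of the sinusoid at DFT bin k (0 < k < N/2) of a real signal:
// evaluate that one coefficient directly, then scale its modulus by 2/N.
double amplitude_at_bin(const std::vector<double>& x, std::size_t k) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::complex<double> coeff(0.0, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        const double angle =
            -2.0 * pi * static_cast<double>(k * j) / static_cast<double>(n);
        coeff += x[j] * std::polar(1.0, angle);
    }
    return 2.0 * std::abs(coeff) / static_cast<double>(n);
}

// Frequency represented by bin k when n samples were taken at sampling rate fs.
double bin_frequency(std::size_t k, double fs, std::size_t n) {
    return static_cast<double>(k) * fs / static_cast<double>(n);
}
```

For a 16-point signal made of a sine of amplitude 1.3 at frequency 2 plus a sine of amplitude 1.7 at frequency 3, `amplitude_at_bin` should return about 1.3 for bin 2 and about 1.7 for bin 3.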
Now suppose we introduce two sinusoids, one of frequency 2 and amplitude 1.3 and another of frequency 3 and amplitude 1.7:
for (int i = 0; i < 16; i++)
numbers.push_back(1.3 *sin((double)i / 16 * frequency1 * 2 * PI)+ 1.7 *
sin((double)i / 16 * frequency2 * 2 * PI));
The obtained data are:
Good luck.


N-body simulation on Fortran leap frog algorithm

I am using a simple leapfrog algorithm to simulate the orbits of the Earth and Jupiter around the Sun. I am unable to get them to orbit, despite being fairly sure the maths is correct. It appears that gravity is acting too weakly and the planet merely floats away from the Sun. Interestingly, if I adjust the Newtonian gravitational acceleration term by multiplying it by rad2, the system does produce fairly stable orbits, but at much too large radii.
program physim
Implicit none
integer :: i,j,n,day ! Integer variables
doubleprecision :: G , r(1:3,1:10) , a(1:3, 1:10) , v(1:3, 1:10) , m(1:3), dt, Au, dr(1:3), &
                   rad2(1:3), t, tcount, tend, tout
! constants
day = 86400
tout = 10*day
tend = 20*day
Au = 15e11
n = 3
G = 6.67e-11
!n = 2
dt = 100
r(1,1) = 0.
r(2,1) = 0.
r(3,1) = 0.
v(1,1) = 0.
v(2,1) = 0.
v(3,1) = 0.
m(1) = 1.9898e30
r(1,2) = Au
r(2,2) = 0.
r(3,2) = 0.
v(1,2) = 0.
v(2,2) = 30000
v(3,2) = 0.
m(2) = 6e24
r(1,3) = 5.2*Au
r(2,3) = 0.
r(3,3) = 0.
v(1,3) = 0.
v(2,3) = 13070
v(3,3) = 0.
m(3) = 2e27
do
a = 0
tcount = 0
do i = 1, n
do j = 1, n
!calculating acceleration
if (i==j)cycle
dr(1:3) = r(1:3, j) - r(1:3, i)
rad2 = dr(1)**2 + dr(2)**2 + dr(3)**2
a(1:3, i) = a(1:3, i) + G*m(j)*dr(1:3)/(rad2*sqrt(rad2))
end do
end do
do i = 1, n
r(1:3 ,i) = r(1:3, i) + v(1:3, i)*dt
v(1:3, i) = v(1:3, i) + a(1:3, i)*dt
end do
t = t + dt
tcount = t + dt
if(tcount>tout) then
!write(6,*) a(1,2)
!write(6,*) rad2
write(6,*) a(1,1) , a(2,1), a(3, 2)
end if
end do
end program
Your most fundamental problem is that 1 A.U. = 1.5e11 m, not 15e11 m. You are also resetting tcount on every trip through the loop. Set it once before the start of the main loop, update it as tcount=tcount+dt, and reset it only when you print a line of output. You probably want to print r(1,2), r(2,2), r(1,3), r(2,3) so you can plot the positions of Earth and Jupiter. You should also run for a longer time so you can see a few full orbits of Earth, and put a test at the bottom of the loop so it exits when t>tend. Making these changes I got output that looked like this:
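For comparison, here is a minimal C++ sketch of a kick-drift-kick leapfrog step for the same kind of system. It is an illustration under stated assumptions rather than a translation of the Fortran program: the `Body` struct and the function names are made up, only the Sun and Earth are set up, and the masses, speeds and the corrected 1 AU = 1.5e11 m are taken from the question and the answer above.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body {
    double mass;                 // kg
    std::array<double, 3> r;     // position, m
    std::array<double, 3> v;     // velocity, m/s
    std::array<double, 3> a;     // acceleration, m/s^2
};

const double G = 6.67e-11;       // gravitational constant, SI units

// Recompute every body's acceleration from the current positions.
void accelerations(std::vector<Body>& bodies) {
    for (Body& b : bodies) b.a.fill(0.0);
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            if (i == j) continue;
            double dr[3], rad2 = 0.0;
            for (int c = 0; c < 3; ++c) {
                dr[c] = bodies[j].r[c] - bodies[i].r[c];
                rad2 += dr[c] * dr[c];
            }
            const double scale = G * bodies[j].mass / (rad2 * std::sqrt(rad2));
            for (int c = 0; c < 3; ++c) bodies[i].a[c] += scale * dr[c];
        }
    }
}

// One kick-drift-kick step. accelerations() must be called once before the first step.
void leapfrog_step(std::vector<Body>& bodies, double dt) {
    for (Body& b : bodies)
        for (int c = 0; c < 3; ++c) {
            b.v[c] += 0.5 * dt * b.a[c];   // half kick with the old acceleration
            b.r[c] += dt * b.v[c];         // drift with the half-step velocity
        }
    accelerations(bodies);                 // acceleration at the new positions
    for (Body& b : bodies)
        for (int c = 0; c < 3; ++c) b.v[c] += 0.5 * dt * b.a[c];   // second half kick
}

// Sun and Earth with the question's masses and speeds, and 1 AU = 1.5e11 m.
std::vector<Body> sun_and_earth() {
    const double au = 1.5e11;
    return {
        Body{1.9898e30, {0.0, 0.0, 0.0}, {0.0, 0.0, 0.0}, {0.0, 0.0, 0.0}},
        Body{6e24, {au, 0.0, 0.0}, {0.0, 30000.0, 0.0}, {0.0, 0.0, 0.0}},
    };
}
```

With a step of one hour, the Earth-Sun distance in this sketch should stay close to 1.5e11 m over a simulated year, which is a quick sanity check on the units.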

Using series to approximate log(2)

double k = 0;
int l = 1;
double digits = pow(0.1, 5);
do {
    k += (pow(-1, l - 1)/l);
    l++;
} while((log(2)-k)>=digits);
I'm trying to write a little program based on an example I saw that uses the series Σ_(l=1) (pow(-1, l - 1)/l) to estimate log(2).
It's supposed to be a guess-refinement scheme where each iteration gets closer and closer to the right value until a given number of digits match.
The above is what I tried, but it's not coming out right. After messing with it for quite a while I can't figure out where I'm going wrong.
I assume that you are trying to estimate the natural logarithm of 2 by its Taylor series expansion:
$$\ln(x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n}\,(x - 1)^{n}$$
One of the problems of your code is the condition chosen to stop the iterations at a specified precision:
do { ... } while((log(2)-k)>=digits);
Besides using log(2) directly (aren't you supposed to find it out instead of using a library function?), after the first iteration (and after every other odd iteration) the partial sum overshoots, so log(2) - k is negative (-0.3068... after the first term) and the loop ends immediately.
A possible (but not optimal) fix could be to use std::abs(log(2) - k) instead, or to end the loop when the absolute value of 1.0 / l (which is the difference between two consecutive iterations) is small enough.
Also, using pow(-1, l - 1) to calculate the sequence 1, -1, 1, -1, ... is really a waste, especially in a series with such a slow convergence rate.
A more efficient series (see here) is:
$$\ln(x) = 2 \sum_{n=0}^{\infty} \frac{1}{2n + 1} \left(\frac{x - 1}{x + 1}\right)^{2n + 1}$$
You can estimate it without using pow:
double x = 2.0; // I want to calculate ln(2)
int n = 1;
double eps = 0.00001,
       kpow = (x - 1.0) / (x + 1.0),
       kpow2 = kpow * kpow,
       dk,                      // current term of the series
       k = 2 * kpow;            // running sum, starting with the n = 0 term
do {
n += 2;
kpow *= kpow2;
dk = 2 * kpow / n;
k += dk;
} while ( std::abs(dk) >= eps );

Understanding DEL2 function in Matlab in order to code it in C++

In order to code the MATLAB DEL2 function in C++, I need to understand the algorithm. I've managed to code the function for elements of the matrix that are not on the borders or the edges.
I've seen several topics about it and read the MATLAB code by typing "edit del2" or "type del2" but I don't understand the calculations that are made to obtain the borders and the edges.
Any help would be appreciated, thanks.
You want to approximate u'' knowing only the value of u on the right (or the left) of a point.
In order to have a second order approximation, you need 3 equations (basic taylor expansion):
u(i+1) = u(i) + h u' + (1/2) h^2 u'' + (1/6) h^3 u''' + O(h^4)
u(i+2) = u(i) + 2 h u' + (4/2) h^2 u'' + (8/6) h^3 u''' + O(h^4)
u(i+3) = u(i) + 3 h u' + (9/2) h^2 u'' + (27/6) h^3 u''' + O(h^4)
Solving for u'' gives (1):
h^2 u'' = -5 u(i+1) + 4 u(i+2) - u(i+3) + 2 u(i) +O(h^4)
To get the laplacian you need to replace the traditional formula with this one on the borders.
For example where "i = 0" you'll have:
del2(u) (i=0,j) = [-5 u(i+1,j) + 4 u(i+2,j) - u(i+3,j) + 2 u(i,j) + u(i,j+1) + u(i,j-1) - 2u(i,j) ]/h^2
EDIT clarifications:
The laplacian is the sum of the 2nd derivatives in the x and in the y directions. You can calculate the second derivative with the formula (2)
u'' = (u(i+1) + u(i-1) - 2u(i))/h^2
if you have both u(i+1) and u(i-1). If i=0 or i=imax you can use the first formula I wrote to compute the derivatives (notice that, due to the symmetry of the 2nd derivative, if i = imax you can just replace "i+k" with "i-k"). The same applies for the y (j) direction:
On the edges you can mix up the formulas (1) and (2):
del2(u) (i=imax,j) = [-5 u(i-1,j) + 4 u(i-2,j) - u(i-3,j) + 2 u(i,j) + u(i,j+1) + u(i,j-1) - 2u(i,j) ]/h^2
del2(u) (i,j=0) = [-5 u(i,j+1) + 4 u(i,j+2) - u(i,j+3) + 2 u(i,j) + u(i+1,j) + u(i-1,j) - 2u(i,j) ]/h^2
del2(u) (i,j=jmax) = [-5 u(i,j-1) + 4 u(i,j-2) - u(i,j-3) + 2 u(i,j) + u(i+1,j) + u(i-1,j) - 2u(i,j) ]/h^2
And on the corners you'll just use (1) two times for both directions.
del2(u) (i=0,j=0) = [-5 u(i,j+1) + 4 u(i,j+2) - u(i,j+3) + 2 u(i,j) + -5 u(i,j+1) + 4 u(i+2,j) - u(i+3,j) + 2 u(i,j)]/h^2
Del2 is the 2nd order discrete laplacian, i.e. it permits to approximate the laplacian of a real continuous function given its values on a square cartesian grid NxN where the distance between two adjacent nodes is h.
h^2 is just a constant dimensional-factor, you can get the matlab implementation from these formulas by setting h^2 = 4.
For example, if you want to compute the real laplacian of u(x,y) on the (0,L) x (0,L) square, what you do is writing down the values of this function on an NxN cartesian grid, i.e. you calculate u(0,0), u(L/(N-1),0), u(2L/(N-1),0) ... u( (N-1)L/(N-1) =L,0) ... u(0,L/(N-1)), u(L/(N-1),L/(N-1)) etc. and you put down these N^2 values in a matrix A.
Then you'll have
ans = 4*del2(A)/h^2, where h = L/(N-1).
del2 will return the exact value of the continuous laplacian if your starting function is linear or quadratic (x^2+y^2 fine, x^3 + y^3 not fine). If the function is not linear nor quadratic, the result will be more accurate the more points you use (i.e. in the limit h -> 0)
I hope this is more clear, notice that i used 0-based indices for accessing matrix (C/C++ array style), while matlab uses 1-based.
DEL2 in MatLab represents Discrete Laplace operator, you can find some information about it here.
The main thing about the edges is that elements in the interior of the matrix have four neighbors, while elements on the edges and corners have three or two neighbors respectively. So you calculate the corners and edges the same way, but using fewer elements.
Here is a module I wrote in Fortran 90 that replicates the "del2()" operator in MATLAB, implementing the above ideas. It only works for arrays that are at least 4x4. It works successfully when I run it, so I thought I would post it so that other people don't have to waste time making their own.
module del2_mod
implicit none
real, private :: pi
integer, private :: nr, nc, i, j, k
! nr is number of rows in array, while nc is the number of columns in the array.
contains
subroutine del2(in, out)
real, dimension(:,:) :: in, out
real, dimension(nr,nc) :: interior, left, right, top, bottom, ul_corner, br_corner, disp
integer :: i, j
real :: h, ul, ur, bl, br
! Zero out internal arrays
out = 0.0; interior=0.0; left = 0.0; right = 0.0; top = 0.0; bottom = 0.0; ul_corner = 0.0; br_corner = 0.0;
! Interior Points
do j=1,nc
do i=1,nr
! Interior Point Calculations
if( j>1 .and. j<nc .and. i>1 .and. i<nr )then
interior(i,j) = ((in(i-1,j) + in(i+1,j) + in(i,j-1) + in(i,j+1)) - 4*in(i,j) )/(h**2)
end if
! Boundary Conditions for Left and Right edges
left(i,1) = (-5.0*in(i,2) + 4.0*in(i,3) - in(i,4) + 2.0*in(i,1) + in(i+1,1) + in(i-1,1) - 2.0*in(i,1) )/(h**2)
right(i,nc) = (-5.0*in(i,nc-1) + 4.0*in(i,nc-2) - in(i,nc-3) + 2.0*in(i,nc) + in(i+1,nc) + in(i-1,nc) - 2.0*in(i,nc) )/(h**2)
end do
! Boundary Conditions for Top and Bottom edges
top(1,j) = (-5.0*in(2,j) + 4.0*in(3,j) - in(4,j) + 2.0*in(1,j) + in(1,j+1) + in(1,j-1) - 2.0*in(1,j) )/(h**2)
bottom(nr,j) = (-5.0*in(nr-1,j) + 4.0*in(nr-2,j) - in(nr-3,j) + 2.0*in(nr,j) + in(nr,j+1) + in(nr,j-1) - 2.0*in(nr,j) )/(h**2)
end do
out = interior + left + right + top + bottom
! Calculate BC for the corners
ul = (-5.0*in(1,2) + 4.0*in(1,3) - in(1,4) + 2.0*in(1,1) - 5.0*in(2,1) + 4.0*in(3,1) - in(4,1) + 2.0*in(1,1))/(h**2)
br = (-5.0*in(nr,nc-1) + 4.0*in(nr,nc-2) - in(nr,nc-3) + 2.0*in(nr,nc) - 5.0*in(nr-1,nc) + 4.0*in(nr-2,nc) - in(nr-3,nc) + 2.0*in(nr,nc))/(h**2)
bl = (-5.0*in(nr,2) + 4.0*in(nr,3) - in(nr,4) + 2.0*in(nr,1) - 5.0*in(nr-1,1) + 4.0*in(nr-2,1) - in(nr-3,1) + 2.0*in(nr,1))/(h**2)
ur = (-5.0*in(1,nc-1) + 4.0*in(1,nc-2) - in(1,nc-3) + 2.0*in(1,nc) - 5.0*in(2,nc) + 4.0*in(3,nc) - in(4,nc) + 2.0*in(1,nc))/(h**2)
! Apply BC for the corners
out(1,1) = ul
out(1,nc) = ur
out(nr,1) = bl
out(nr,nc) = br
end subroutine
end module
It's so hard! I wasted a few hours to understand and implement it in Java.
Here is:
Tested and compared to the original function DEL2 (Matlab)
I've found a typo in sbabbi's response. The corner formula is given as:
del2(u) (i=0,j=0) = [-5 u(i,j+1) + 4 u(i,j+2) - u(i,j+3) + 2 u(i,j) + -5 u(i,j+1) + 4 u(i+2,j) - u(i+3,j) + 2 u(i,j)]/h^2
but the second group should start with -5 u(i+1,j):
del2(u) (i=0,j=0) = [-5 u(i,j+1) + 4 u(i,j+2) - u(i,j+3) + 2 u(i,j) + -5 u(i+1,j) + 4 u(i+2,j) - u(i+3,j) + 2 u(i,j)]/h^2

6 dimensional integral by Trapezoid in Fortran using Fortran 90

I need to calculate six dimensional integrals using Trapezoid in Fortran 90 in an efficient way. Here is an example of what I need to do:
Where F is a numerical (i.e. not analytical) function which is to be integrated over the variables x1 to x6. I have initially coded a one-dimensional subroutine:
SUBROUTINE trapzd(f,mass,x,nstep,deltam)
INTEGER nstep,i
DOUBLE PRECISION mass(nstep+1),f(nstep+1),x,deltam
do i=1,nstep
end do
Which seems to work fine with one dimension, however, I don't know how to scale this up to six dimensions. Can I re-use this six times, once for every dimension or shall I write a new subroutine?
If you have a fully coded (no library/API use) version of this in another language like Python, MATLAB or Java, I'd be very glad to have a look and get some ideas.
P.S. This is not school homework. I am a PhD student in Biomedicine and this is part of my research in modeling stem cell activities. I do not have a deep background of coding and mathematics.
Thank you in advance.
You could look at the Monte Carlo Integration chapter of the GNU Scientific Library (GSL), which is both a library and, since it is open source, source code that you can study.
Look at section 4.6 of Numerical Recipes in C.
Step one is to reduce the problem using symmetry and analytical dependencies.
Step two is to chain the solution like this:
f2(x2,x3,..,x6) = Integrate(f(x,x2,x3..,x6),x,1,x1end)
f3(x3,x4,..,x6) = Integrate(f2(x,x3,..,x6),x,1,x2end)
f4(x4,..,x6) = ...
f6(x6) = Integrate(f5(x,x6),x,1,x5end)
result = Integrate(f6(x),x,1,x6end)
Direct evaluation of multiple integrals is computationally challenging. It might be better to use Monte Carlo, perhaps using importance sampling. However brute force direct integration is sometimes of interest for validation of methods.
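To give an idea of the Monte Carlo alternative, here is a minimal sketch of plain (uniform) sampling over the unit hypercube. It does not implement importance sampling, and the name `monte_carlo_unit_cube` and the fixed seed are assumptions made for the example.

```cpp
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Plain Monte Carlo estimate of the integral of f over [0, 1]^dims:
// the average of f at uniformly random points (the cube has volume 1).
double monte_carlo_unit_cube(
    const std::function<double(const std::vector<double>&)>& f,
    std::size_t dims, std::size_t samples, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> x(dims);
    double sum = 0.0;
    for (std::size_t s = 0; s < samples; ++s) {
        for (double& xi : x) xi = uniform(gen);   // draw one random point
        sum += f(x);
    }
    return sum / static_cast<double>(samples);
}
```

The statistical error shrinks roughly as 1/sqrt(samples) regardless of the number of dimensions, which is why this approach becomes attractive in six dimensions. A sharply peaked integrand needs many more samples, or importance sampling, to reach the same accuracy.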
The integration routine I use is "QuadMo", written by Luke Mo about 1970. I made it recursive and put it in a module. QuadMo refines the mesh where needed to get the requested integration accuracy. Here is a program that does an n-dimensional integral using QuadMo.
Here is the validation of the program using a Gaussian centered at 0.5 with SD 0.1 in all dimensions for nDim up to 6, using a G95 compile. It runs in a couple of seconds.
nDim ans expected nlvl
1 0.249 0.251 2
2 6.185E-02 6.283E-02 2 2
3 1.538E-02 1.575E-02 2 2 2
4 3.826E-03 3.948E-03 2 2 2 2
5 9.514E-04 9.896E-04 2 2 2 2 2
6 2.366E-04 2.481E-04 2 2 2 2 2 2
Here is the code:
module QuadMo_MOD
implicit none
abstract interface
function QuadMoFunct_interface(thet,k)
end function
end interface
abstract interface
function MultIntFunc_interface(thet)
end function
end interface
procedure(MultIntFunc_interface),pointer :: stored_func => null()
recursive function quadMoMult(funct,lower,upper,k) result(ans)
! very powerful integration routine written by Luke Mo
! then at the Stanford Linear Accelerator Center circa 1970
! QuadMo_Eps is error tolerance
! QuadMo_MinLvl determines initial grid of 2**(MinLvl+1) + 1 points
! to avoid missing a narrow peak, this may need to be increased.
! QuadMo_Nlvl returns number of subinterval refinements required beyond
! QuadMo_MaxLvl
! Modified by making recursive and adding argument k
! for multiple integrals (
procedure(QuadMoFunct_interface) :: funct
& ,fml,fmr,rombrg,coef,estl,estr,estint,area,abarea
real*8::valint(50,2), Middlex(50), Rightx(50), fmx(50), frx(50)
& ,fmrx(50), estrx(50), epsx(50)
integer retrn(50),i,level
level = 0
QuadMo_nlvlk(k) = 0
abarea = 0
Left = lower
Right = upper
fLeft = funct(Left,k)
fMiddle = funct((Left+Right)/2,k)
fRight = funct(Right,k)
fLeft = funct(Left)
fMiddle = funct((Left+Right)/2)
fRight = funct(Right)
est = 0
eps = QuadMo_Tol
100 level = level+1
Middle = (Left+Right)/2
coef = Right-Left
if(coef.ne.0) go to 150
rombrg = est
go to 300
150 continue
fml = funct((Left+Middle)/2,k)
fmr = funct((Middle+Right)/2,k)
fml = funct((Left+Middle)/2)
fmr = funct((Middle+Right)/2)
estl = (fLeft+4*fml+fMiddle)*coef
estr = (fMiddle+4*fmr+fRight)*coef
estint = estl+estr
area= abs(estl)+ abs(estr)
abarea=area+abarea- abs(est)
if(level.ne.QuadMo_MaxLvl) go to 200
QuadMo_nlvlk(k) = QuadMo_nlvlk(k)+1
rombrg = estint
go to 300
200 if(( abs(est-estint).gt.(eps*abarea)).or.
1(level.lt.QuadMo_MinLvl)) go to 400
rombrg = (16*estint-est)/15
300 level = level-1
i = retrn(level)
valint(level, i) = rombrg
go to (500, 600), i
400 retrn(level) = 1
Middlex(level) = Middle
Rightx(level) = Right
fmx(level) = fMiddle
fmrx(level) = fmr
frx(level) = fRight
estrx(level) = estr
epsx(level) = eps
eps = eps/1.4d0
Right = Middle
fRight = fMiddle
fMiddle = fml
est = estl
go to 100
500 retrn(level) = 2
Left = Middlex(level)
Right = Rightx(level)
fLeft = fmx(level)
fMiddle = fmrx(level)
fRight = frx(level)
est = estrx(level)
eps = epsx(level)
go to 100
600 rombrg = valint(level,1)+valint(level,2)
if(level.gt.1) go to 300
ans = rombrg /12
end function quadMoMult
recursive function MultInt(k,func) result(ans)
! MultInt(nDim,func) returns multi-dimensional integral from 0 to 1
! in all dimensions of function func
! variable QuadMo_Mod: nDim needs to be set initially to number of dimensions
procedure(MultIntFunc_interface) :: func
stored_func => func
allocate (thet(nDim))
end function MultInt
recursive function MultIntegrand(thetARG,k) result(ans)
write(*,*)'MultIntegrand: not expected, k not present!'
end function MultIntegrand
end module QuadMo_MOD
module test_MOD
use QuadMo_MOD
implicit none
real*8 function func(thet) ! multidimensional function
! this is the function defined in nDim dimensions
! in this case a Gaussian centered at 0.5 with SD 0.1
& *((thet-5d-1)/1d-1))/2)
end function func
end module test_MOD
! test program to evaluate multiple integrals
use test_MOD
implicit none
! these values are set for speed, not accuracy
write(*,*)' nDim ans expected nlvl'
do nDim=1,6
! expected answer is (0.1 sqrt(2pi))**nDim
& ,QuadMo_nlvlk
#include <cstdio>
#include <cmath>
double MultInt(int k);
double MultIntegrand(double thetARG, int k);
double quadMoMult(double(*funct)(double, int), double lower, double upper, int k);
double funkn(double *thet);
int QuadMo_MinLvl = 2;
int QuadMo_MaxLvl = 3;
double QuadMo_Tol = 0.1;
int *QuadMo_nlvlk;
double *thet;
int nDim;
//double MultInt(int k, double(*func)(double *))
double MultInt(int k)
{
    //MultInt(nDim, func) returns multi - dimensional integral from 0 to 1
    //in all dimensions of function func
    double ans;
    if (k == 0)
        ans = funkn(thet);
    else
        ans = quadMoMult(MultIntegrand, 0.0, 1.0, k); //limits hardcoded here
    return ans;
}
double MultIntegrand(double thetARG, int k)
{
    double ans;
    if (k > 0)
        thet[k] = thetARG;
    else
        printf("\n***MultIntegrand: not expected, k not present!***\n");
    //Recursive call
    //ans = MultInt(k - 1, func);
    ans = MultInt(k - 1);
    return ans;
}
double quadMoMult(double(*funct)(double, int), double lower, double upper, int k)
{
    //Integration routine written by Luke Mo
    //Stanford Linear Accelerator Center circa 1970
    //QuadMo_Eps is error tolerance
    //QuadMo_MinLvl determines initial grid of 2**(MinLvl + 1) + 1 points
    //to avoid missing a narrow peak, this may need to be increased.
    //QuadMo_Nlvl returns number of subinterval refinements required beyond
    //QuadMo_MaxLvl
    //Modified by making recursive and adding argument k
    //for multiple integrals
    double ans;
    double Middle, Left, Right, eps, est, fLeft, fMiddle, fRight;
    double fml, fmr, rombrg, coef, estl, estr, estint, area, abarea;
    double valint[51][3], Middlex[51], Rightx[51], fmx[51], frx[51]; //Jack up arrays
    double fmrx[51], estrx[51], epsx[51];
    int retrn[51];
    int i, level;
    level = 0;
    QuadMo_nlvlk[k] = 0;
    abarea = 0.0;
    Left = lower;
    Right = upper;
    if (k > 0)
    {
        fLeft = funct(Left, k);
        fMiddle = funct((Left + Right) / 2, k);
        fRight = funct(Right, k);
    }
    else
    {
        fLeft = funct(Left,0);
        fMiddle = funct((Left + Right) / 2,0);
        fRight = funct(Right,0);
    }
    est = 0.0;
    eps = QuadMo_Tol;
l100:
    level = level + 1;
    Middle = (Left + Right) / 2;
    coef = Right - Left;
    if (coef != 0.0)
        goto l150;
    rombrg = est;
    goto l300;
l150:
    if (k > 0)
    {
        fml = funct((Left + Middle) / 2.0, k);
        fmr = funct((Middle + Right) / 2.0, k);
    }
    else
    {
        fml = funct((Left + Middle) / 2.0, 0);
        fmr = funct((Middle + Right) / 2.0, 0);
    }
    //Simpson estimates on each half (the common factor 1/12 is applied at the end)
    estl = (fLeft + 4 * fml + fMiddle)*coef;
    estr = (fMiddle + 4 * fmr + fRight)*coef;
    estint = estl + estr;
    //fabs, not abs: abs may resolve to the integer version
    area = fabs(estl) + fabs(estr);
    abarea = area + abarea - fabs(est);
    if (level != QuadMo_MaxLvl)
        goto l200;
    QuadMo_nlvlk[k] = QuadMo_nlvlk[k] + 1;
    rombrg = estint;
    goto l300;
l200:
    if ((fabs(est - estint) > (eps*abarea)) || (level < QuadMo_MinLvl))
        goto l400;
    rombrg = (16 * estint - est) / 15;
l300:
    level = level - 1;
    i = retrn[level];
    valint[level][i] = rombrg;
    if (i == 1)
        goto l500;
    if (i == 2)
        goto l600;
l400:
    //Save the right half and descend into the left half
    retrn[level] = 1;
    Middlex[level] = Middle;
    Rightx[level] = Right;
    fmx[level] = fMiddle;
    fmrx[level] = fmr;
    frx[level] = fRight;
    estrx[level] = estr;
    epsx[level] = eps;
    eps = eps / 1.4;
    Right = Middle;
    fRight = fMiddle;
    fMiddle = fml;
    est = estl;
    goto l100;
l500:
    //Left half done: restore the saved right half
    retrn[level] = 2;
    Left = Middlex[level];
    Right = Rightx[level];
    fLeft = fmx[level];
    fMiddle = fmrx[level];
    fRight = frx[level];
    est = estrx[level];
    eps = epsx[level];
    goto l100;
l600:
    rombrg = valint[level][1] + valint[level][2];
    if (level > 1)
        goto l300;
    ans = rombrg / 12.0;
    return ans;
}
double funkn(double *thet)
{
    //in this case a Gaussian centered at 0.5 with SD 0.1
    //thet is indexed from 1 to nDim
    double sum = 0.0;
    for (int i = 1; i <= nDim; i++)
    {
        double sm = (thet[i] - 0.5) / 0.1;
        sum = sum + sm * sm;
    }
    return exp(-sum / 2.0);
}
int main() {
    double ans;
    printf("\nnDim ans expected nlvl\n");
    for (nDim = 1; nDim <= 6; nDim++)
    {
        //expected answer is (0.1 sqrt(2pi))**nDim
        QuadMo_nlvlk = new int[nDim + 1]; //refinement counts, indexed 1..nDim
        thet = new double[nDim + 1]; //array for x values, indexed 1..nDim
        ans = MultInt(nDim);
        printf("\n %d %f %f ", nDim, ans, pow((0.250663),nDim));
        for (int j=1; j<=nDim; j++)
            printf(" %d ", QuadMo_nlvlk[j]);
        delete[] QuadMo_nlvlk;
        delete[] thet;
    }
    return 0;
}
Declare the relevant parameters globally:
int QuadMo_MinLvl = 2;
int QuadMo_MaxLvl = 3;
double QuadMo_Tol = 0.1;
int *QuadMo_nlvlk;
double *thet;
int nDim;
This code is much clearer than the obfuscated, antiquated Fortran version. With some tweaking, the integration limits and tolerances could be parameterised too.
There are better algorithms that use adaptive techniques and that handle singularities on the surfaces, and so on.

Range Reduction Poor Precision For Single Precision Floating Point

I am trying to implement range reduction as the first step of implementing the sine function.
I am following the method described in the paper "ARGUMENT REDUCTION FOR HUGE ARGUMENTS" by K.C. NG
I am getting an error as large as 0.002339146 when x ranges from 0 to 20000. The error obviously shouldn't be that large, and I'm not sure how to reduce it. I noticed that the error magnitude is associated with the magnitude of the input theta to cosine/sine.
I was able to obtain the nearpi.c code that the paper mentions, but I'm not sure how to utilize the code for single precision floating point. If anyone is interested, the nearpi.c file can be found at this link: nearpi.c
Here is my MATLAB code:
x = 0:0.1:20000;
% Perform range reduction
% Store constant 2/pi
twooverpi = single(2/pi);
% Compute y
y = (x.*twooverpi);
% Compute k (round to nearest integer)
k = round(y);
% Solve for f
f = single(y-k);
% Solve for r
r = single(f*single(pi/2));
% Find last two bits of k
n = bitand(fi(k,1,32,0),fi(3,1,32,0));
n = single(n);
% Preallocate for speed
z(length(x)) = 0;
for i = 1:length(x)
    switch n(i)
        case 0
            z(i) = single(sin(r(i)));
        case 1
            z(i) = single(cos(r(i)));
        case 2
            z(i) = -sin(r(i));
        case 3
            z(i) = single(-cos(r(i)));
    end
end
maxerror = max(abs(single(z - single(sin(single(x))))))
minerror = min(abs(single(z - single(sin(single(x))))))
I have edited the program nearpi.c so that it compiles. However, I am not sure how to interpret the output. The program also expects input, which I had to type in by hand, and I am not sure what that input signifies.
Here is the working nearpi.c:
Name : nearpi.c
Author :
Version :
Copyright : Your copyright notice
Description : Hello World in C, Ansi-style
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
* Global macro definitions.
# define hex( double ) *(1 + ((long *) &double)), *((long *) &double)
# define sgn(a) (a >= 0 ? 1 : -1)
# define MAX_k 2500
# define D 56
# define MAX_EXP 127
# define THRESHOLD 2.22e-16
* Global Variables
int CFlength, /* length of CF including terminator */
    binade; /* current binade, used by main and nearPiOver2 */
double e,
       f; /* [e,f] range of D-bit unsigned int of f;
             form 1X...X */
// Function Prototypes
int dbleCF (double i[], double j[]);
void input (double i[]);
void nearPiOver2 (double i[]);
* This is the start of the main program.
int main (void)
int k; /* subscript variable */
double i[MAX_k],
j[MAX_k]; /* i and j are continued fractions
(coeffs) */
// fp = fopen("/src/cfpi.txt", "r");
* Compute global variables e and f, where
* e = 2 ^ (D-1), i.e. the D bit number 10...0
* and
* f = 2 ^ D - 1, i.e. the D bit number 11...1 .
e = 1;
for (k = 2; k <= D; k = k + 1)
e = 2 * e;
f = 2 * e - 1;
* Compute the continued fraction for (2/e)/(pi/2) , i.e.
* q's starting value for the first binade, given the continued
* fraction for pi as input; set the global variable CFlength
* to the length of the resulting continued fraction (including
* its negative valued terminator). One should use as many
* partial coefficients of pi as necessary to resolve numbers
* of the width of the underflow plus the overflow threshold.
* A rule of thumb is 0.97 partial coefficients are generated
* for every decimal digit of pi .
* Note: for radix B machines, subroutine input should compute
* the continued fraction for (B/e)/(pi/2) where e = B ^ (D - 1).
input (i);
* Begin main loop over all binades:
* For each binade, find the nearest multiples of pi/2 in that binade.
* [ Note: for hexadecimal machines ( B = 16 ), the rest of the main
* program simplifies(!) to
* B_ade = 1;
* while (B_ade < MAX_EXP)
* {
* dbleCF (i, j);
* dbleCF (j, i);
* dbleCF (i, j);
* CFlength = dbleCF (j, i);
* B_ade = B_ade + 1;
* }
* }
* because the alternation of source & destination are no longer necessary. ]
binade = 1;
while (binade < MAX_EXP)
* For the current (odd) binade, find the nearest multiples of pi/2.
nearPiOver2 (i);
* Double the continued fraction to get to the next (even) binade.
* To save copying arrays, i and j will alternate as the source
* and destination for the continued fractions.
CFlength = dbleCF (i, j);
binade = binade + 1;
* Check for main loop termination again because of the
* alternation.
if (binade >= MAX_EXP)
    break;
* For the current (even) binade, find the nearest multiples of pi/2.
nearPiOver2 (j);
* Double the continued fraction to get to the next (odd) binade.
CFlength = dbleCF (j, i);
binade = binade + 1;
return 0;
} /* end of Main Program */
* Subroutine DbleCF doubles a continued fraction whose partial
* coefficients are i[] into a continued fraction j[], where both
* arrays are of a type sufficient to do D-bit integer arithmetic.
* In my case ( D = 56 ) , I am forced to treat integers as double
* precision reals because my machine does not have integers of
* sufficient width to handle D-bit integer arithmetic.
* Adapted from a Basic program written by W. Kahan.
* Algorithm based on Hurwitz's method of doubling continued
* fractions (see Knuth Vol. 3, p.360).
* A negative value terminates the last partial quotient.
* Note: for the non-C programmers, the statement break
* exits a loop and the statement continue skips to the next
* case in the same loop.
* The call modf ( l / 2, &l0 ) assigns the integer portion of
* half of L to L0.
int dbleCF (double i[], double j[])
{
    double k,
           l,
           l0,
           j0;
    int n,
        m;
    n = 1;
    m = 0;
    j0 = i[0] + i[0];
    l = i[n];
    while (1)
    {
        if (l < 0)
        {
            j[m] = j0;
            break;
        }
        modf (l / 2, &l0);
        l = l - l0 - l0;
        k = i[n + 1];
        if (l0 > 0)
        {
            j[m] = j0;
            j[m + 1] = l0;
            j0 = 0;
            m = m + 2;
        }
        if (l == 0) {
            /* Even case. */
            if (k < 0)
            {
                m = m - 1;
                break;
            }
            j0 = j0 + k + k;
            n = n + 2;
            l = i[n];
            continue;
        }
        /* Odd case. */
        if (k < 0)
        {
            j[m] = j0 + 2;
            break;
        }
        if (k == 0)
        {
            n = n + 2;
            l = l + i[n];
            continue;
        }
        j[m] = j0 + 1;
        m = m + 1;
        j0 = 1;
        l = k - 1;
        n = n + 1;
    }
    m = m + 1;
    j[m] = -99999;
    return (m);
}
* Subroutine input computes the continued fraction for
* (2/e) / (pi/2) , where e = 2 ^ (D-1) , given pi 's
* continued fraction as input. That is, double the continued
* fraction of pi D-3 times and place a zero at the front.
* One should use as many partial coefficients of pi as
* necessary to resolve numbers of the width of the underflow
* plus the overflow threshold. A rule of thumb is 0.97
* partial coefficients are generated for every decimal digit
* of pi . The last coefficient of pi is terminated by a
* negative number.
* I'll be happy to supply anyone with the partial coefficients
* of pi . My ARPA address is mcdonald@ucbdali.BERKELEY.ARPA .
* I computed the partial coefficients of pi using a method of
* Bill Gosper's. I need only compute with integers, albeit
* large ones. After writing the program in bc and Vaxima ,
* Prof. Fateman suggested FranzLisp . To my surprise, FranzLisp
* ran the fastest! the reason? FranzLisp's Bignum package is
* hand coded in assembler. Also, FranzLisp can be compiled.
* Note: for radix B machines, subroutine input should compute
* the continued fraction for (B/e)/(pi/2) where e = B ^ (D - 1).
* In the case of hexadecimal ( B = 16 ), this is done by repeated
* doubling the appropriate number of times.
void input (double i[])
int k;
double j[MAX_k];
* Read in the partial coefficients of pi from a precalculated file
* until a negative value is encountered.
k = -1;
do
{
    k = k + 1;
    scanf ("%lE", &i[k]);
    printf("%d", k);
} while (i[k] >= 0);
* Double the continued fraction for pi D-3 times using
* i and j alternately as source and destination. On my
* machine D = 56 so D-3 is odd; hence the following code:
* Double twice (D-3)/2 times,
for (k = 1; k <= (D - 3) / 2; k = k + 1)
dbleCF (i, j);
dbleCF (j, i);
* then double once more.
dbleCF (i, j);
* Now append a zero on the front (reciprocate the continued
* fraction) and the return the coefficients in i .
i[0] = 0;
k = -1;
do
{
    k = k + 1;
    i[k + 1] = j[k];
} while (j[k] >= 0);
* Return the length of the continued fraction, including its
* terminator and initial zero, in the global variable CFlength.
CFlength = k;
* Given a continued fraction's coefficients in an array i ,
* subroutine nearPiOver2 finds all machine representable
* values near a integer multiple of pi/2 in the current binade.
void nearPiOver2 (double i[])
int k, /* subscript for recurrences (see handout) */
    K; /* like k , but used during cancel. elim. */
double p[MAX_k], /* product of the q's (see handout) */
       q[MAX_k], /* successive tail evals of CF (see handout) */
       j[MAX_k], /* like convergent numerators (see handout) */
       tmp, /* temporary used during cancellation elim. */
       mk0, /* m[k - 1] (see handout) */
       mk, /* m[k] is one of the few ints (see handout) */
       mkAbs, /* absolute value of m sub k */
       mK0, /* like mk0 , but used during cancel. elim. */
       mK, /* like mk , but used during cancel. elim. */
       z, /* the object of our quest (the argument) */
       m0, /* the mantissa of z as a D-bit integer */
       x, /* the reduced argument (see handout) */
       ldexp (), /* sys routine to multiply by a power of two */
       fabs (), /* sys routine to compute FP absolute value */
       floor (), /* sys routine to compute greatest int <= value */
       ceil (); /* sys routine to compute least int >= value */
* Compute the q's by evaluating the continued fraction from
* bottom up.
* Start evaluation with a big number in the terminator position.
q[CFlength] = 1.0e+30;
for (k = CFlength - 1; k >= 0; k = k - 1)
q[k] = i[k] + 1 / q[k + 1];
* Let THRESHOLD be the biggest | x | that we are interesed in
* seeing.
* Compute the p's and j's by the recurrences from the top down.
* Stop when
* 1 1
* ----- >= THRESHOLD > ------ .
* 2 |j | 2 |j |
* k k+1
p[0] = 1;
j[0] = 0;
j[1] = 1;
k = 0;
do
{
    p[k + 1] = -q[k + 1] * p[k];
    if (k > 0)
        j[1 + k] = j[k - 1] - i[k] * j[k];
    k = k + 1;
} while (1 / (2 * fabs (j[k])) >= THRESHOLD);
* Then mk runs through the integers between
* k + k +
* (-1) e / p - 1/2 & (-1) f / p - 1/2 .
* k k
for (mkAbs = floor (e / fabs (p[k]));
mkAbs <= ceil (f / fabs (p[k])); mkAbs = mkAbs + 1)
mk = mkAbs * sgn (p[k]);
* For each mk , mk0 runs through integers between
* +
* m q - p THRESHOLD .
* k k k
for (mk0 = floor (mk * q[k] - fabs (p[k]) * THRESHOLD);
mk0 <= ceil (mk * q[k] + fabs (p[k]) * THRESHOLD);
mk0 = mk0 + 1)
* For each pair { mk , mk0 } , check that
* k
* m = (-1) ( j m - j m )
* 0 k-1 k k k-1
m0 = (k & 1 ? -1 : 1) * (j[k - 1] * mk - j[k] * mk0);
* lies between e and f .
if (e <= fabs (m0) && fabs (m0) <= f)
* If so, then we have found an
* k
* x = ((-1) m / p - m ) / j
* 0 k k k
* = ( m q - m ) / p .
* k k k-1 k
* But this later formula can suffer cancellation. Therefore,
* run the recurrence for the mk 's to get mK with minimal
* | mK | + | mK0 | in the hope mK is 0 .
K = k;
mK = mk;
mK0 = mk0;
while (fabs (mK) > 0)
{
    p[K + 1] = -q[K + 1] * p[K];
    tmp = mK0 - i[K] * mK;
    if (fabs (tmp) > fabs (mK0))
        break;
    mK0 = mK;
    mK = tmp;
    K = K + 1;
}
* Then
* x = ( m q - m ) / p
* K K K-1 K
* as accurately as one could hope.
x = (mK * q[K] - mK0) / p[K];
* To return z and m0 as positive numbers,
* x must take the sign of m0 .
x = x * sgn (m0);
m0 = fabs (m0);
* Set z = m0 * 2 ^ (binade+1-D) .
z = ldexp (m0, binade + 1 - D);
* Print z (hex), z (dec), m0 (dec), binade+1-D, x (hex), x (dec).
printf ("%08lx %08lx Z=%22.16E M=%17.17G L+1-%d=%3d %08lx %08lx x=%23.16E\n", hex (z), z, m0, D, binade + 1 - D, hex (x), x);
First let's note the difference using single-precision arithmetic makes.
[Equation 8] The minimal value of f can be larger. Since the double-precision numbers are a superset of the single-precision numbers, the closest single to a multiple of pi/2 can only be farther away than ~2.98e-19. The fixed-point representation of f therefore has at most 61 leading zeros (and probably fewer). Denote this quantity fdigits.
[Equation Before 9] Consequently, instead of 121 bits, y must be accurate to fdigits + 24 (non-zero significant bits in single-precision) + 7 (extra guard bits) = fdigits + 31, and at most 92.
[Equation 9] Therefore, together with the width of x's exponent, 2/pi must contain 127 (maximal exponent of single) + 31 + fdigits bits, that is, 158 + fdigits, and at most 219 bits.
[Subsection 2.5] The size of A is determined by the number of zeros in x before the binary point (and is unaffected by the move to single), while the size of C is determined by Equation Before 9.
For large x (x>=2^24), x looks like this: [24 bits, M zeros]. Multiplying it by A, whose size is the first M bits of 2/pi, will result in an integer (the zeros of x will just shift everything into the integers).
Choosing C to be starting from the M+d bit of 2/pi will result in the product x*C being of size at most d-24. In double precision, d is chosen to be 174 (and instead of 24, we have 53) so that the product will be of size at most 121. In single, it is enough to choose d such that d-24 <= 92, or more precisely, d-24 <= fdigits+31. That is, d can be chosen as fdigits+55, or at most 116.
As a result, B should be of size at most 116 bits.
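The bit counts above are just sums, so they can be collected in one place. This is only a bookkeeping sketch of the arithmetic in this answer; the struct and function names are made up, and fdigits = 61 is the upper bound discussed above, not a computed value.

```c
#include <assert.h>

struct bit_budget {
    int y_bits;       /* accuracy needed in y: fdigits + 24 + 7 */
    int two_over_pi;  /* bits of 2/pi needed: 127 + y_bits */
    int d;            /* where C starts past the M-th bit: y_bits + 24 */
};

/* Bit budget for single precision, given the number of leading zeros
 * fdigits assumed for the smallest possible f. */
struct bit_budget single_budget(int fdigits)
{
    struct bit_budget b;
    assert(fdigits >= 0);
    b.y_bits = fdigits + 24 + 7;
    b.two_over_pi = 127 + b.y_bits;
    b.d = b.y_bits + 24;
    return b;
}
```

With fdigits = 61 this gives 92, 219 and 116, the three upper bounds quoted above. A smaller fdigits shrinks all three by the same amount.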
We are therefore left with three problems:
Computing fdigits. This involves reading ref 6 from the linked paper and understanding it. Might not be that easy. :) As far as I can see, that's the only place where nearpi.c is used.
Computing B, the relevant bits of 2/pi. Since M is bounded below by 127, we can just compute the first 127+116 bits of 2/pi offline and store them in an array. See Wikipedia.
Computing y=x*B. This involves multiplying x by a 116-bit number. This is where Section 3 is used. The size of the blocks is chosen to be 24 because 2*24 + 2 (multiplying two 24-bit numbers, and adding 3 such numbers) is smaller than the precision of double, 53 (and because 24 divides 96). We can use blocks of size 11 bits for single arithmetic for similar reasons.
Note - the trick with B only applies to numbers whose exponents are positive (x>=2^24).
To summarize: first, you have to solve the problem in double precision. Your Matlab code doesn't work in double precision either (try removing single and computing sin(2^53)), because your twooverpi only has 53 significant bits, not 175 (and in any case you can't directly multiply such precise numbers in Matlab). Second, the scheme should be adapted to work with single, and again the key problem is representing 2/pi precisely enough and supporting multiplication of highly precise numbers. Last, when everything works, you can try to figure out a better fdigits to reduce the number of bits you have to store and multiply.
Hopefully I'm not completely off - comments and contradictions are welcome.
As an example, let us compute sin(x) where x = single(2^24-1), which has no zeros after the significant bits (M = 0). This simplifies finding B, as B consists of the first 116 bits of 2/pi. Since x has precision of 24 bits and B of 116 bits, the product
y = x * B
will have 92 bits of precision, as required.
Section 3 in the linked paper describes how to perform this product with enough precision; the same algorithm can be used with blocks of size 11 to compute y in our case. Since doing this by hand is drudgery, I hope I'm excused for not carrying it out explicitly and relying on Matlab's Symbolic Math Toolbox instead. The toolbox provides the vpa function, which lets us specify the precision of a number in decimal digits. So,
vpa('2/pi', ceil(116*log10(2)))
will produce an approximation of 2/pi of at least 116 bits of precision. Because vpa accepts only integers for its precision argument, we usually can't specify the binary precision of a number exactly, so we use the next-best.
The following code computes sin(x) according to the paper, in single precision :
x = single(2^24-1);
y = x * vpa('2/pi', ceil(116*log10(2))); % Precision = 103.075
k = round(y);
f = single(y - k);
r = f * single(pi) / 2;
switch mod(k, 4)
case 0
s = sin(r);
case 1
s = cos(r);
case 2
s = -sin(r);
case 3
s = -cos(r);
end
sin(x) - s % Expected value: exactly zero.
(The precision of y is obtained using Mathematica, which turned out to be a much better numerical tool than Matlab :) )
In libm
The other answer to this question (which has since been deleted) led me to an implementation in libm which, although it works on double-precision numbers, follows the linked paper very closely.
See file s_sin.c for the wrapper (Table 2 from the linked paper appears as a switch statement at the end of the file), and e_rem_pio2.c for the argument reduction code (of particular interest is an array containing the first 396 hex-digits of 2/pi, starting at line 69).