How to handle the indices of a 9-dimensional matrix - C++

I am a physicist currently writing a C++ program dealing with multidimensional integration; in particular, the functions I am considering can have up to D=9 dimensions.
From a mathematical perspective, I need to handle an NxNxN...xN (D times) matrix, but from a programming point of view, I was instructed to use an array of NxNxN...xN elements instead. From what I know, an array is better for the sake of generality and for all the ensuing calculations involving pointers.
However, now I am stuck with a problem I cannot solve.
I need to perform some calculations where a single index of my matrix is fixed and all the other ones take all their different values.
If it were a 3x3x3 matrix, the code would be something similar to the following:
double test[3][3][3];
for(int i = 0; i < 3; i++) {
    for(int j = 0; j < 3; j++) {
        test[0][i][j] = i * j;
    }
}
i.e. I could have an index fixed and cycle through the other ones.
The same process could be extended to the second and the third index as well.
How can I accomplish the same effect with a double test[3*3*3]? Please keep in mind that the three dimensional matrix is just an example; the real matrices I am dealing with are 9-dimensional, and so I need a general way to keep a single index of my matrix fixed and cycle through all the other ones.
TL;DR: I have an array which represents an NxNxN...xN (9 times) matrix.
I need to perform some calculations on the array as if a single index of my matrix were fixed and all the other ones were cycling through all their possible values.
I know there is a simple expression for the case where a 2-D matrix is mapped in a 1-D array; does something similar exist here?

Raster scan is the standard way of ordering elements for two dimensions.
If you have a 2-D array test[3][3], and you access it by test[i][j], the corresponding one-dimensional array would be
double raster[3 * 3];
and you would access it as follows:
raster[i * 3 + j];
This can be generalized to 3 dimensions:
double raster[3 * 3 * 3];
...
raster[a * 9 + b * 3 + c];
Or to 9 dimensions:
double raster[3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 * 3];
...
raster[a * 6561 + b * 2187 + c * 729 + d * 243 + e * 81 + f * 27 + g * 9 + h * 3 + i];
Holding any of the a ... i index variables constant and varying the rest in a loop will access an 8-D slice of your 9-D array.
You might want to define some struct to hold all these indices, for example:
struct Pos
{
    int a, b, c, d, e, f, g, h, i;
};
Then you can convert a position to a 1-D index easily:
int index(Pos p)
{
    return p.a * 6561 + p.b * 2187 + p.c * 729 + p.d * 243 + p.e * 81 + p.f * 27 + p.g * 9 + p.h * 3 + p.i;
}
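To cycle through the other eight indices while one is held fixed, you can enumerate all N^8 combinations with a single counter and decode it into the free indices. Here is a minimal sketch of that idea (my own illustration, not part of the answer above; raster, fixedDim and fixedVal are made-up names), using N = 3 and the same raster ordering as index(Pos):
#include <vector>

int main()
{
    const int N = 3;                       // extent of every dimension
    const int D = 9;                       // number of dimensions
    std::vector<double> raster(19683);     // 3^9 elements, raster ordering

    const int fixedDim = 2;                // which of the nine indices is held constant (hypothetical choice)
    const int fixedVal = 1;                // its constant value (hypothetical choice)

    long combos = 1;                       // N^(D-1) = 3^8 = 6561 combinations of the free indices
    for (int d = 0; d < D - 1; ++d) combos *= N;

    for (long c = 0; c < combos; ++c)
    {
        int idx[D];                        // the full 9-component index
        long rest = c;
        for (int d = D - 1; d >= 0; --d)
        {
            if (d == fixedDim) { idx[d] = fixedVal; continue; }
            idx[d] = static_cast<int>(rest % N);   // peel off one base-N digit
            rest /= N;
        }
        long flat = 0;                     // same computation as index(Pos) above
        for (int d = 0; d < D; ++d) flat = flat * N + idx[d];
        raster[flat] = 1.0;                // work on this element of the 8-D slice
    }
}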

Generally, a flattened array contains its elements in the following way: the last dimension varies fastest and forms contiguous groups, those groups repeat for each value of the second-to-last dimension, and so on outward:
values[x][y][z] => { x0 = { y0_0 = { z0_0_0, z0_0_1, ..., z0_0_N }, y0_1 = { z0_1_0, z0_1_1, ... }, ... y0_N }, x1 = ... }
values[x*y*z] => { z0_0_0, z0_0_1, ..., z0_0_N, z0_1_0, z0_1_1, ... }
I hope this makes sense outside my brain.
So, any element access will need to calculate, how many blocks of elements come before it:
Accessing [2][1][3] means: skip 2 blocks of x, each containing y blocks with z elements, then skip another 1 block of y containing z elements, and access element 3 of the next block:
values[2 * y * z + 1 * z + 3];
So more generally, for n dimensions with extents d1, d2, d3, ..., dn and an n-dimensional index i1, i2, ..., in to be accessed:
[i1 * d2 * ... * dn + i2 * d3 * ... * dn + ... + in]
Back to your example:
double test[3*3*3];
for(int i = 0; i < 3; i++)
{
    for(int j = 0; j < 3; j++)
    {
        // test[0*3*3 + i*3 + j] = i * j;
        test[i*3 + j] = i * j;
    }
}
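As a hedged sketch of that general formula (the helper name flatIndex and the example extents are mine, not from the original post), the sum can be evaluated in Horner form:
#include <cassert>
#include <cstddef>
#include <iostream>
#include <vector>

// Computes i1*d2*...*dn + i2*d3*...*dn + ... + in for row-major storage.
std::size_t flatIndex(const std::vector<std::size_t>& idx,
                      const std::vector<std::size_t>& dims)
{
    assert(idx.size() == dims.size());
    std::size_t offset = 0;
    for (std::size_t d = 0; d < dims.size(); ++d)
        offset = offset * dims[d] + idx[d];    // Horner form of the sum above
    return offset;
}

int main()
{
    // Element [2][1][3] of a 3 x 2 x 4 array: 2*2*4 + 1*4 + 3 = 23
    std::cout << flatIndex({2, 1, 3}, {3, 2, 4}) << '\n';
}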

If the matrix has the same size for all dimensions, then you can access them like this:
m[x + y*N + z*N*N + w*N*N*N ...]
In the case that the sizes are different, it is a little bit more complicated:
m[x + y*N1 + z*N1*N2 + w*N1*N2*N3 ...]

Related

How to calculate where an indexed value in a 3d array will be in memory? How to calculate where an indexed value in a char** will be in memory?

The problem states: Given the following array declarations and indexed accesses, compute the address where the indexed value will be in memory. Assume the array starts at location 200 on a 64-bit computer.
a. double d[3][4][4]; d[1][2][3] is at: _________
b. char *n[10]; n[3] is at: _________
I know the answers are 416 and 224 (respectively), but I do not understand how those numbers were reached.
For part a, I was told the equation:
address-in-3d-array= start-address + (p * numR * numC + (i * numC) + j) * size-of-type
(where start address = 200, the numR and numC come from the original array, and the i,j, and p come from the location you are trying to find).
Nothing I do makes this equation come to 416. I have been viewing the order of the array as d[row][column][plane]. Is that incorrect? I have also tried looking at it as d[plane][row][column], but that didn't seem to work either.
For part b, I'm not sure where to start, as I thought that since the array is an array of pointers, its location would be in the heap. I'm not sure how to get 224 from that.
I need to answer these questions by hand, not using code.
For this array declaration
double d[3][4][4];
to calculate the address of the expression
d[1][2][3]
You can use the following formula
reinterpret_cast<double *>( d ) + 1 * 16 + 2 * 4 + 3
that is the same (relative to the value of the expression) as
reinterpret_cast<char *>( d ) + 27 * sizeof( double )
So you can calculate the address as the address of the first element of the array plus 27 * sizeof( double ), where sizeof( double ) is equal to 8.
For this array
char *n[10];
the address of the expression
n[3]
is
reinterpret_cast<char *>( n ) + 3 * sizeof( char * )
In words:
Given a generic array d[s1][s2][s3] of elements of size S, the offset of the d[x][y][z] element is
[(x * s2 * s3) + (y * s3) + z] * S
In the array double d[3][4][4], with S = sizeof(double) = 8, the location of d[1][2][3] is at offset:
[(1 * 4 * 4) + (2 * 4) + 3] * 8 = 216
Sum the offset (216) to the start (200) to get 416.
Given a generic array n[s1] of elements of size S, the offset of the n[x] element is
x * S
In the array char * n[10], with S = 8 (the pointer size on 64-bit platforms), the location of n[3] is at offset
3 * 8 = 24
Sum the offset (24) to the start (200) to get 224.
In code:
#include <cstddef>
#include <iostream>

int main()
{
    double d[3][4][4];
    size_t start = 200;
    size_t offset =
        sizeof(d[0]) * 1
        + sizeof(d[0][0]) * 2
        + sizeof(d[0][0][0]) * 3;
    std::cout << start + offset << std::endl; // 416 on my machine
    char * n[10];
    offset = 3 * sizeof(char*);
    std::cout << start + offset << std::endl; // 224 on every 64-bit platform
}

Need help understanding how to work with 2D/3D glyphs

Here's the code snippet I'd like help understanding
for (i = 0; i < samplesX; i++)
for (j = 0; j < samplesY; j++)
{
newI = DIM * i / samplesX;
newJ = DIM * j / samplesY;
idx = (round(newJ) * DIM) + round(newI);
if (color_dir == 1 && draw_vecs == 1) {
direction_to_color(vx[idx], vy[idx], color_dir);
}
if (color_dir == 1 && draw_vecs == 2) {
direction_to_color(fx[idx], fy[idx], color_dir);
}
else if (color_dir == 2) {
scalar = rho[idx];
set_colormap(scalar, min, max, clampLow, clampHigh);
}
else if (color_dir == 3) {
scalar = sqrt(vx[idx] * vx[idx] + vy[idx] * vy[idx]);
set_colormap(scalar, min, max, clampLow, clampHigh);
}
else if (color_dir == 4) {
scalar = sqrt(fx[idx] * fx[idx] + fy[idx] * fy[idx]);
set_colormap(scalar, min, max, clampLow, clampHigh);
}
/*if (draw_vecs == 1) {
glVertex2f(wn + (fftw_real)newI * wn, hn + (fftw_real)newJ * hn);
glVertex2f((wn + (fftw_real)newI * wn) + vec_scale * vx[idx], (hn + (fftw_real)newJ * hn) + vec_scale * vy[idx]);
}
else if (draw_vecs == 2) {
glVertex2f(wn + (fftw_real)newI * wn, hn + (fftw_real)newJ * hn);
glVertex2f((wn + (fftw_real)newI * wn) + vec_scale * fx[idx], (hn + (fftw_real)newJ * hn) + vec_scale * fy[idx]);
}*/
if (draw_vecs == 1) {
glVertex2f(wn + (fftw_real)i * wn, hn + (fftw_real)j * hn);
glVertex2f((wn + (fftw_real)i * wn) + vec_scale * vx[idx], (hn + (fftw_real)j * hn) + vec_scale * vy[idx]);
}
else if (draw_vecs == 2) {
glVertex2f(wn + (fftw_real)i * wn, hn + (fftw_real)j * hn);
glVertex2f((wn + (fftw_real)i * wn) + vec_scale * fx[idx], (hn + (fftw_real)j * hn) + vec_scale * fy[idx]);
}
}
glEnd();
}
What this currently does, as far as my understanding goes, is display these two-dimensional lines/arrows (hedgehogs) that visualize force/velocity in 2D as can be seen in the picture below.
Sadly, my understanding of linear algebra, calculus and computer graphics in general only goes so far and I'm having trouble dissecting this piece.
Ideally I'd like to understand this, and also how I can take this pre-existing code and add functionality to display two other glyph types that show a vector and/or scalar field, such as
three-dimensional cones
three-dimensional ellipsoids
If I'm missing anything here, please let me know!
Some of the variables included in the above snippet:
const int DIM = 50; //size of simulation grid
int color_dir = 0; //use direction color-coding or not
float scalar;
int newI, newJ;
float temp;
float vec_scale = 1000; //scaling of hedgehogs
int draw_vecs = 1; //draw the vector field or not
The code snippet you have there could have been written more simply (it also takes some educated guessing to work out what some of the variables and functions mean).
Let's break it down.
The first two lines are easy to understand, they're the standard stanza to iterate over a 2D array
for (i = 0; i < samplesX; i++)
for (j = 0; j < samplesY; j++)
i and j are running indices that iterate over every discrete coordinate tuple (i, j) ∈ [0, samplesX) × [0, samplesY). The next two lines remap the 2D indices into a new value range, specifically [0, samplesX) × [0, samplesY) → [0, DIM) × [0, DIM). A missing piece of information is what type DIM is; it would make sense for it to be some floating point type.
newI = DIM * i / samplesX;
newJ = DIM * j / samplesY;
The next line is bug-prone. It translates newI and newJ into a running 1D index for a 1D array that represents the 2D simulation grid.
Why is this problematic? Because information may have been lost in the conversion to DIM-space. This kind of information loss can lead to security bugs(!). As a matter of fact, Skia, the rendering library used by Google Chrome, Android and other projects, had exactly this kind of bug recently; the write-up is a worthwhile read: https://googleprojectzero.blogspot.com/2019/02/the-curious-case-of-convexity-confusion.html
The correct way to implement this is to make DIM an integer and perform fixed-point arithmetic on it, eventually truncating the fractional digits. But I digress. The next block is essentially a poor man's lookup-table lookup. vx/vy and fx/fy are flattened 2D arrays accessed through a 1D index, and direction_to_color presumably maps a direction to a color, most likely via a call to glColor; the same probably goes for set_colormap. This is a bad use of OpenGL.
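If you do keep the CPU-side remap, here is a minimal sketch of what an all-integer version could look like (my own guess at the intent, using rounded integer division rather than an explicit fixed-point format; DIM, samplesX, samplesY, vx and vy are taken from the question, the sample counts are hypothetical):
#include <cstdio>
#include <vector>

int main()
{
    const int DIM = 50;                        // simulation grid size (from the question)
    const int samplesX = 20, samplesY = 20;    // hypothetical sample counts

    std::vector<float> vx(DIM * DIM, 0.0f), vy(DIM * DIM, 0.0f);

    for (int i = 0; i < samplesX; i++)
    {
        for (int j = 0; j < samplesY; j++)
        {
            // Round i*DIM/samplesX to the nearest integer using integer
            // arithmetic only: add half the divisor before dividing.
            int gridI = (i * DIM + samplesX / 2) / samplesX;
            int gridJ = (j * DIM + samplesY / 2) / samplesY;
            if (gridI >= DIM) gridI = DIM - 1; // clamp so idx stays in bounds
            if (gridJ >= DIM) gridJ = DIM - 1;

            int idx = gridJ * DIM + gridI;     // row-major index into vx/vy
            std::printf("sample (%d,%d) -> idx %d, v = (%g, %g)\n",
                        i, j, idx, vx[idx], vy[idx]);
        }
    }
}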
The whole remapping from i and j to DIM and then the lookups are just a poor implementation of a texture lookup. OpenGL already has textures. Just load the data as a texture, supply a texture coordinate array, and enable texturing.
Finally, for each spine, two calls to glVertex are made: one with the starting point, which lies on the grid position (wn·(1 + i), hn·(1 + j)), and one with that point offset by vec_scale·(vx[idx], vy[idx]) (or fx/fy, depending on draw_vecs).
My verdict of that code: Utter garbage! All of this could have been done far more elegantly, even back in 1994 with OpenGL-1.0, which this code seems to have been written for. If you want to implement your own vector field plot, don't use this as a starting point.
These days we have programmable GPUs with shaders. All of that bulk up there can be done in a few lines of shader code.

C++ Data Structure to Find Neighbouring Values in Multidimensional Array

I have a project where I read in an array that has 1 or more dimensions, and for this project I need to be able to determine a given element's neighbours quickly. I do not know the dimensionality ahead of time, and I likewise do not know the size of the dimensions ahead of time. What would be the best C++ data structure to store this data in? A colleague recommended a vector of vectors of vectors of . . ., but that seems incredibly unwieldy.
If you know the address of the element you need the neighbors for, you could just do pointer arithmetic to find the neighbors. For example, if p is the location of the element, then p - 1 is the left neighbor and p + 1 is the right neighbor.
Think of your multidimensional array as a 1D array. Let the dimensions of the array be d1 * d2 * ... * dn.
Then allocate memory for a 1D array, say A, of size d1 * d2 * ... * dn. For example,
int *A = new int[d1 * d2 * ....* dn];
If you need to store data at the [i1][i2]...[in]-th index, then store it at the following 1D index:
A[i1 * (d2*d3*d4.. *dn) + i2 * (d3*d4*....dn) + ..... + in]
Neighboring elements will be:
A[(i1 + 1) * (d2*d3*d4.. *dn) + i2 * (d3*d4*....dn) + ..... + in]
A[(i1 - 1) * (d2*d3*d4.. *dn) + i2 * (d3*d4*....dn) + ..... + in]
A[i1 * (d2*d3*d4.. *dn) + (i2 + 1) * (d3*d4*....dn) + ..... + in]
A[i1 * (d2*d3*d4.. *dn) + (i2 - 1) * (d3*d4*....dn) + ..... + in]
.............................
A[i1 * (d2*d3*d4.. *dn) + i2 * (d3*d4*....dn) + ..... + (in + 1)]
A[i1 * (d2*d3*d4.. *dn) + i2 * (d3*d4*....dn) + ..... + (in - 1)]
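A hedged sketch of that scheme (the names dims, strides and flat are mine; it assumes the row-major layout used in the answer): precompute the stride of each dimension once, then the neighbours along dimension d sit at ±strides[d] in the flat array, provided the coordinate in that dimension is checked against its bounds first.
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    // Extents d1..dn, known only at run time.
    std::vector<std::size_t> dims = {4, 3, 5};

    // strides[d] = product of the extents after dimension d, so that
    // flat = i1*strides[0] + i2*strides[1] + ... + in*strides[n-1].
    std::vector<std::size_t> strides(dims.size());
    std::size_t total = 1;
    for (std::size_t d = dims.size(); d-- > 0; )
    {
        strides[d] = total;
        total *= dims[d];
    }
    std::vector<int> A(total, 0);          // the flattened n-dimensional array

    // Flat index of element [2][1][3].
    std::vector<std::size_t> idx = {2, 1, 3};
    std::size_t flat = 0;
    for (std::size_t d = 0; d < dims.size(); ++d)
        flat += idx[d] * strides[d];
    A[flat] = 1;                           // mark the centre element

    // Neighbours along dimension 1, i.e. [2][0][3] and [2][2][3]; check the
    // coordinate against the bounds so you never step into a different row.
    std::size_t dim = 1;
    if (idx[dim] > 0)
        std::cout << "left neighbour at " << flat - strides[dim] << '\n';
    if (idx[dim] + 1 < dims[dim])
        std::cout << "right neighbour at " << flat + strides[dim] << '\n';
}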

Indexing irregular grid X,Y,Z coordinates in a 1D array

As in my previous question, I'm working on loading a 1D array with the volumetric data of a .raw file. The answer by Jonathan Leffler proved helpful, but now I'm working with a volume dataset of different dimensions (X, Y, Z aren't the same). How would the formula be generalized?
pVolume[((x * 256) + y) * 256 + z] // works when all dims are 256
int XDIM=256, YDIM=256, ZDIM=256; // I want this sizes to be arbitrary
const int size = XDIM*YDIM*ZDIM;
bool LoadVolumeFromFile(const char* fileName) {
FILE *pFile = fopen(fileName,"rb");
if(NULL == pFile) {
return false;
}
GLubyte* pVolume=new GLubyte[size]; //<- here pVolume is a 1D byte array
fread(pVolume,sizeof(GLubyte),size,pFile);
fclose(pFile);
Access in strides follows a simple principle:
A[i][j][k] = B[k + j * Dim3 + i * Dim3 * Dim2];
// k = 1..Dim3, (or 0 <= k < Dim3, as one does in C)
// j = 1..Dim2,
// i = 1..Dim1.
Here B is a 1D array of size Dim1 * Dim2 * Dim3. The formula obviously generalizes to arbitrarily many dimensions. If you want a mnemonic, start the sum with the fastest index, and in each summand multiply further by the extent of the previous dimension.
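Applied to the question's pVolume (my own sketch; voxelAt is a made-up helper name), the 256-only formula generalises by replacing the inner 256s with YDIM and ZDIM, keeping the same [x][y][z] ordering with x as the slowest index:
#include <cstddef>

typedef unsigned char GLubyte;   // stand-in so the sketch is self-contained

GLubyte voxelAt(const GLubyte* pVolume,
                std::size_t x, std::size_t y, std::size_t z,
                std::size_t YDIM, std::size_t ZDIM)
{
    // XDIM only bounds x; it does not appear in the offset itself.
    return pVolume[(x * YDIM + y) * ZDIM + z];
}

int main()
{
    const std::size_t XDIM = 2, YDIM = 3, ZDIM = 4;
    GLubyte volume[XDIM * YDIM * ZDIM] = {};
    return voxelAt(volume, 1, 2, 3, YDIM, ZDIM);   // offset (1*3 + 2)*4 + 3 = 23
}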

Range Reduction Poor Precision For Single Precision Floating Point

I am trying to implement range reduction as the first step of implementing the sine function.
I am following the method described in the paper "ARGUMENT REDUCTION FOR HUGE ARGUMENTS" by K.C. NG
I am getting errors as large as 0.002339146 when using an input range of x from 0 to 20000. My error obviously shouldn't be that large, and I'm not sure how I can reduce it. I noticed that the error magnitude is associated with the magnitude of the input theta to cosine/sine.
I was able to obtain the nearpi.c code that the paper mentions, but I'm not sure how to utilize the code for single precision floating point. If anyone is interested, the nearpi.c file can be found at this link: nearpi.c
Here is my MATLAB code:
x = 0:0.1:20000;
% Perform range reduction
% Store constant 2/pi
twooverpi = single(2/pi);
% Compute y
y = (x.*twooverpi);
% Compute k (round to nearest integer)
k = round(y);
% Solve for f
f = single(y-k);
% Solve for r
r = single(f*single(pi/2));
% Find last two bits of k
n = bitand(fi(k,1,32,0),fi(3,1,32,0));
n = single(n);
% Preallocate for speed
z(length(x)) = 0;
for i = 1:length(x)
switch(n(i))
case 0
z(i)=sin(r(i));
case 1
z(i) = single(cos(r(i)));
case 2
z(i) = -sin(r(i));
case 3
z(i) = single(-cos(r(i)));
otherwise
end
end
maxerror = max(abs(single(z - single(sin(single(x))))))
minerror = min(abs(single(z - single(sin(single(x))))))
I have edited the program nearpi.c so that it compiles. However, I am not sure how to interpret the output. Also, the program expects an input, which I had to enter by hand, and I am not sure of the significance of that input.
Here is the working nearpi.c:
/*
============================================================================
Name : nearpi.c
Author :
Version :
Copyright : Your copyright notice
Description : Hello World in C, Ansi-style
============================================================================
*/
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
/*
* Global macro definitions.
*/
# define hex( double ) *(1 + ((long *) &double)), *((long *) &double)
# define sgn(a) (a >= 0 ? 1 : -1)
# define MAX_k 2500
# define D 56
# define MAX_EXP 127
# define THRESHOLD 2.22e-16
/*
* Global Variables
*/
int CFlength, /* length of CF including terminator */
binade;
double e,
f; /* [e,f] range of D-bit unsigned int of f;
form 1X...X */
// Function Prototypes
int dbleCF (double i[], double j[]);
void input (double i[]);
void nearPiOver2 (double i[]);
/*
* This is the start of the main program.
*/
int main (void)
{
int k; /* subscript variable */
double i[MAX_k],
j[MAX_k]; /* i and j are continued fractions
(coeffs) */
// fp = fopen("/src/cfpi.txt", "r");
/*
* Compute global variables e and f, where
*
* e = 2 ^ (D-1), i.e. the D bit number 10...0
* and
* f = 2 ^ D - 1, i.e. the D bit number 11...1 .
*/
e = 1;
for (k = 2; k <= D; k = k + 1)
e = 2 * e;
f = 2 * e - 1;
/*
* Compute the continued fraction for (2/e)/(pi/2) , i.e.
* q's starting value for the first binade, given the continued
* fraction for pi as input; set the global variable CFlength
* to the length of the resulting continued fraction (including
* its negative valued terminator). One should use as many
* partial coefficients of pi as necessary to resolve numbers
* of the width of the underflow plus the overflow threshold.
* A rule of thumb is 0.97 partial coefficients are generated
* for every decimal digit of pi .
*
* Note: for radix B machines, subroutine input should compute
* the continued fraction for (B/e)/(pi/2) where e = B ^ (D - 1).
*/
input (i);
/*
* Begin main loop over all binades:
* For each binade, find the nearest multiples of pi/2 in that binade.
*
* [ Note: for hexadecimal machines ( B = 16 ), the rest of the main
* program simplifies(!) to
*
* B_ade = 1;
* while (B_ade < MAX_EXP)
* {
* dbleCF (i, j);
* dbleCF (j, i);
* dbleCF (i, j);
* CFlength = dbleCF (j, i);
* B_ade = B_ade + 1;
* }
* }
*
* because the alternation of source & destination are no longer necessary. ]
*/
binade = 1;
while (binade < MAX_EXP)
{
/*
* For the current (odd) binade, find the nearest multiples of pi/2.
*/
nearPiOver2 (i);
/*
* Double the continued fraction to get to the next (even) binade.
* To save copying arrays, i and j will alternate as the source
* and destination for the continued fractions.
*/
CFlength = dbleCF (i, j);
binade = binade + 1;
/*
* Check for main loop termination again because of the
* alternation.
*/
if (binade >= MAX_EXP)
break;
/*
* For the current (even) binade, find the nearest multiples of pi/2.
*/
nearPiOver2 (j);
/*
* Double the continued fraction to get to the next (odd) binade.
*/
CFlength = dbleCF (j, i);
binade = binade + 1;
}
return 0;
} /* end of Main Program */
/*
* Subroutine DbleCF doubles a continued fraction whose partial
* coefficients are i[] into a continued fraction j[], where both
* arrays are of a type sufficient to do D-bit integer arithmetic.
*
* In my case ( D = 56 ) , I am forced to treat integers as double
* precision reals because my machine does not have integers of
* sufficient width to handle D-bit integer arithmetic.
*
* Adapted from a Basic program written by W. Kahan.
*
* Algorithm based on Hurwitz's method of doubling continued
* fractions (see Knuth Vol. 3, p.360).
*
* A negative value terminates the last partial quotient.
*
* Note: for the non-C programmers, the statement break
* exits a loop and the statement continue skips to the next
* case in the same loop.
*
* The call modf ( l / 2, &l0 ) assigns the integer portion of
* half of L to L0.
*/
int dbleCF (double i[], double j[])
{
double k,
l,
l0,
j0;
int n,
m;
n = 1;
m = 0;
j0 = i[0] + i[0];
l = i[n];
while (1)
{
if (l < 0)
{
j[m] = j0;
break;
};
modf (l / 2, &l0);
l = l - l0 - l0;
k = i[n + 1];
if (l0 > 0)
{
j[m] = j0;
j[m + 1] = l0;
j0 = 0;
m = m + 2;
};
if (l == 0) {
/*
* Even case.
*/
if (k < 0)
{
m = m - 1;
break;
}
else
{
j0 = j0 + k + k;
n = n + 2;
l = i[n];
continue;
};
}
/*
* Odd case.
*/
if (k < 0)
{
j[m] = j0 + 2;
break;
};
if (k == 0)
{
n = n + 2;
l = l + i[n];
continue;
};
j[m] = j0 + 1;
m = m + 1;
j0 = 1;
l = k - 1;
n = n + 1;
continue;
};
m = m + 1;
j[m] = -99999;
return (m);
}
/*
* Subroutine input computes the continued fraction for
* (2/e) / (pi/2) , where e = 2 ^ (D-1) , given pi 's
* continued fraction as input. That is, double the continued
* fraction of pi D-3 times and place a zero at the front.
*
* One should use as many partial coefficients of pi as
* necessary to resolve numbers of the width of the underflow
* plus the overflow threshold. A rule of thumb is 0.97
* partial coefficients are generated for every decimal digit
* of pi . The last coefficient of pi is terminated by a
* negative number.
*
* I'll be happy to supply anyone with the partial coefficients
* of pi . My ARPA address is mcdonald#ucbdali.BERKELEY.ARPA .
*
* I computed the partial coefficients of pi using a method of
* Bill Gosper's. I need only compute with integers, albeit
* large ones. After writing the program in bc and Vaxima ,
* Prof. Fateman suggested FranzLisp . To my surprise, FranzLisp
* ran the fastest! the reason? FranzLisp's Bignum package is
* hand coded in assembler. Also, FranzLisp can be compiled.
*
*
* Note: for radix B machines, subroutine input should compute
* the continued fraction for (B/e)/(pi/2) where e = B ^ (D - 1).
* In the case of hexadecimal ( B = 16 ), this is done by repeated
* doubling the appropriate number of times.
*/
void input (double i[])
{
int k;
double j[MAX_k];
/*
* Read in the partial coefficients of pi from a precalculated file
* until a negative value is encountered.
*/
k = -1;
do
{
k = k + 1;
scanf ("%lE", &i[k]);
printf("hello\n");
printf("%d", k);
} while (i[k] >= 0);
/*
* Double the continued fraction for pi D-3 times using
* i and j alternately as source and destination. On my
* machine D = 56 so D-3 is odd; hence the following code:
*
* Double twice (D-3)/2 times,
*/
for (k = 1; k <= (D - 3) / 2; k = k + 1)
{
dbleCF (i, j);
dbleCF (j, i);
};
/*
* then double once more.
*/
dbleCF (i, j);
/*
* Now append a zero on the front (reciprocate the continued
* fraction) and the return the coefficients in i .
*/
i[0] = 0;
k = -1;
do
{
k = k + 1;
i[k + 1] = j[k];
} while (j[k] >= 0);
/*
* Return the length of the continued fraction, including its
* terminator and initial zero, in the global variable CFlength.
*/
CFlength = k;
}
/*
* Given a continued fraction's coefficients in an array i ,
* subroutine nearPiOver2 finds all machine representable
* values near a integer multiple of pi/2 in the current binade.
*/
void nearPiOver2 (double i[])
{
int k, /* subscript for recurrences (see
handout) */
K; /* like k , but used during cancel. elim.
*/
double p[MAX_k], /* product of the q's (see
handout) */
q[MAX_k], /* successive tail evals of CF (see
handout) */
j[MAX_k], /* like convergent numerators (see
handout) */
tmp, /* temporary used during cancellation
elim. */
mk0, /* m[k - 1] (see
handout) */
mk, /* m[k] is one of the few ints (see
handout) */
mkAbs, /* absolute value of m sub k
*/
mK0, /* like mk0 , but used during cancel.
elim. */
mK, /* like mk , but used during cancel.
elim. */
z, /* the object of our quest (the argument)
*/
m0, /* the mantissa of z as a D-bit integer
*/
x, /* the reduced argument (see
handout) */
ldexp (), /* sys routine to multiply by a power of
two */
fabs (), /* sys routine to compute FP absolute
value */
floor (), /* sys routine to compute greatest int <=
value */
ceil (); /* sys routine to compute least int >=
value */
/*
* Compute the q's by evaluating the continued fraction from
* bottom up.
*
* Start evaluation with a big number in the terminator position.
*/
q[CFlength] = 1.0e+30;
for (k = CFlength - 1; k >= 0; k = k - 1)
q[k] = i[k] + 1 / q[k + 1];
/*
* Let THRESHOLD be the biggest | x | that we are interesed in
* seeing.
*
* Compute the p's and j's by the recurrences from the top down.
*
* Stop when
*
* 1 1
* ----- >= THRESHOLD > ------ .
* 2 |j | 2 |j |
* k k+1
*/
p[0] = 1;
j[0] = 0;
j[1] = 1;
k = 0;
do
{
p[k + 1] = -q[k + 1] * p[k];
if (k > 0)
j[1 + k] = j[k - 1] - i[k] * j[k];
k = k + 1;
} while (1 / (2 * fabs (j[k])) >= THRESHOLD);
/*
* Then mk runs through the integers between
*
* k + k +
* (-1) e / p - 1/2 & (-1) f / p - 1/2 .
* k k
*/
for (mkAbs = floor (e / fabs (p[k]));
mkAbs <= ceil (f / fabs (p[k])); mkAbs = mkAbs + 1)
{
mk = mkAbs * sgn (p[k]);
/*
* For each mk , mk0 runs through integers between
*
* +
* m q - p THRESHOLD .
* k k k
*/
for (mk0 = floor (mk * q[k] - fabs (p[k]) * THRESHOLD);
mk0 <= ceil (mk * q[k] + fabs (p[k]) * THRESHOLD);
mk0 = mk0 + 1)
{
/*
* For each pair { mk , mk0 } , check that
*
* k
* m = (-1) ( j m - j m )
* 0 k-1 k k k-1
*/
m0 = (k & 1 ? -1 : 1) * (j[k - 1] * mk - j[k] * mk0);
/*
* lies between e and f .
*/
if (e <= fabs (m0) && fabs (m0) <= f)
{
/*
* If so, then we have found an
*
* k
* x = ((-1) m / p - m ) / j
* 0 k k k
*
* = ( m q - m ) / p .
* k k k-1 k
*
* But this later formula can suffer cancellation. Therefore,
* run the recurrence for the mk 's to get mK with minimal
* | mK | + | mK0 | in the hope mK is 0 .
*/
K = k;
mK = mk;
mK0 = mk0;
while (fabs (mK) > 0)
{
p[K + 1] = -q[K + 1] * p[K];
tmp = mK0 - i[K] * mK;
if (fabs (tmp) > fabs (mK0))
break;
mK0 = mK;
mK = tmp;
K = K + 1;
};
/*
* Then
* x = ( m q - m ) / p
* K K K-1 K
*
* as accurately as one could hope.
*/
x = (mK * q[K] - mK0) / p[K];
/*
* To return z and m0 as positive numbers,
* x must take the sign of m0 .
*/
x = x * sgn (m0);
m0 = fabs (m0);
/*
* Set z = m0 * 2 ^ (binade+1-D) .
*/
z = ldexp (m0, binade + 1 - D);
/*
* Print z (hex), z (dec), m0 (dec), binade+1-D, x (hex), x (dec).
*/
printf ("%08lx %08lx Z=%22.16E M=%17.17G L+1-%d=%3d %08lx %08lx x=%23.16E\n", hex (z), z, m0, D, binade + 1 - D, hex (x), x);
}
}
}
}
Theory
First let's note the difference using single-precision arithmetic makes.
[Equation 8] The minimal value of f can be larger. As the double-precision numbers are a superset of the single-precision numbers, the closest single to a multiple of 2/pi can only be farther away than ~2.98e-19; therefore the number of leading zeros in the fixed-arithmetic representation of f must be at most 61 leading zeros (but will probably be less). Denote this quantity fdigits.
[Equation Before 9] Consequently, instead of 121 bits, y must be accurate to fdigits + 24 (non-zero significant bits in single-precision) + 7 (extra guard bits) = fdigits + 31, and at most 92.
[Equation 9] "Therefore, together with the width of x's exponent, 2/pi must contain 127 (maximal exponent of single) + 31 + fdigits, or 158 + fdigits and at most 219 bits.
[Subsection 2.5] The size of A is determined by the number of zeros in x before the binary point (and is unaffected by the move to single), while the size of C is determined by Equation Before 9.
For large x (x >= 2^24), x looks like this: [24 bits, M zeros]. Multiplying it by A, which is made of the first M bits of 2/pi, will result in an integer (the zeros of x will just shift everything into the integer part).
Choosing C to start from the (M+d)-th bit of 2/pi will result in the product x*C being of size at most d-24. In double precision, d is chosen to be 174 (and instead of 24, we have 53) so that the product will be of size at most 121. In single, it is enough to choose d such that d-24 <= 92, or more precisely, d-24 <= fdigits+31. That is, d can be chosen as fdigits+55, or at most 116.
As a result, B should be of size at most 116 bits.
We are therefore left with three problems:
Computing fdigits. This involves reading ref 6 from the linked paper and understanding it. Might not be that easy. :) As far as I can see, that's the only place where nearpi.c is used.
Computing B, the relevant bits of 2/pi. Since M is bounded by 127, we can just compute the first 127 + 116 bits of 2/pi offline and store them in an array. See Wikipedia.
Computing y = x*B. This involves multiplying x by a 116-bit number. This is where Section 3 is used. The size of the blocks is chosen to be 24 because 2*24 + 2 (multiplying two 24-bit numbers and adding 3 such numbers) is smaller than the precision of double, 53 (and because 24 divides 96). We can use blocks of size 11 bits for single arithmetic for similar reasons.
Note - the trick with B only applies to numbers whose exponents are positive (x>=2^24).
To summarize - first, you have to solve the problem in double precision. Your Matlab code doesn't work in double precision either (try removing single and computing sin(2^53)), because your twooverpi only has 53 significant bits, not 175 (and anyway, you can't directly multiply such precise numbers in Matlab). Second, the scheme should be adapted to work with single, and again, the key problem is representing 2/pi precisely enough and supporting multiplication of highly precise numbers. Last, when everything works, you can try to figure out a better fdigits to reduce the number of bits you have to store and multiply.
Hopefully I'm not completely off - comments and contradictions are welcome.
Example
As an example, let us compute sin(x) where x = single(2^24-1), which has no zeros after the significant bits (M = 0). This simplifies finding B, as B consists of the first 116 bits of 2/pi. Since x has precision of 24 bits and B of 116 bits, the product
y = x * B
will have 92 bits of precision, as required.
Section 3 in the linked paper describes how to perform this product with enough precision; the same algorithm can be used with blocks of size 11 to compute y in our case. As this is drudgery, I hope I'm excused for not doing it explicitly and relying instead on Matlab's symbolic math toolbox. This toolbox provides us with the vpa function, which allows us to specify the precision of a number in decimal digits. So,
vpa('2/pi', ceil(116*log10(2)))
will produce an approximation of 2/pi of at least 116 bits of precision. Because vpa accepts only integers for its precision argument, we usually can't specify the binary precision of a number exactly, so we use the next-best.
The following code computes sin(x) according to the paper, in single precision :
x = single(2^24-1);
y = x * vpa('2/pi', ceil(116*log10(2))); % Precision = 103.075
k = round(y);
f = single(y - k);
r = f * single(pi) / 2;
switch mod(k, 4)
case 0
s = sin(r);
case 1
s = cos(r);
case 2
s = -sin(r);
case 3
s = -cos(r);
end
sin(x) - s % Expected value: exactly zero.
(The precision of y is obtained using Mathematica, which turned out to be a much better numerical tool than Matlab :) )
In libm
The other answer to this question (which has since been deleted) led me to an implementation in libm which, although it works on double-precision numbers, follows the linked paper very thoroughly.
See file s_sin.c for the wrapper (Table 2 from the linked paper appears as a switch statement at the end of the file), and e_rem_pio2.c for the argument reduction code (of particular interest is an array containing the first 396 hex-digits of 2/pi, starting at line 69).