This is an extension of my earlier question, but I am asking it separately because I am getting really frustrated, so please do not down-vote it!
Question: What could be the reason behind a cblas_sgemm call taking much less time for matrices with a large number of zeros as compared to the same cblas_sgemm call for dense matrices?
I know gemv is designed for matrix-vector multiplication but why can't I use gemm for vector-matrix multiplication if it takes less time, especially for sparse matrices
A short representative code is given below. It asks to enter a value, and then populates a vector with that value. It then replaces every 32nd value with its index. So, if we enter '0' then we get a sparse vector but for other values we get a dense vector.
#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>
using namespace std;
int main()
{
const int m = 5000;
timespec blas_start, blas_end;
long totalnsec; //total nano sec
double totalsec, totaltime;
int i, j;
float *A = new float[m]; // 1 x m
float *B = new float[m*m]; // m x m
float *C = new float[m]; // 1 x m
float input;
cout << "Enter a value to populate the vector (0 for sparse) ";
cin >> input; // enter 0 for sparse
// input martix A: every 32nd element is non-zero, rest of the values = input
for(i = 0; i < m; i++)
{
A[i] = input;
if( i % 32 == 0) //adjust for sparsity
A[i] = i;
}
// input matrix B: identity matrix
for(i = 0; i < m; i++)
for(j = 0; j < m; j++)
B[i*m + j] = (i==j);
clock_gettime(CLOCK_REALTIME, &blas_start);
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
clock_gettime(CLOCK_REALTIME, &blas_end);
/* for(i = 0; i < m; i++)
printf("%f ", C[i]);
printf("\n\n"); */
// Print time
totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
if(totalnsec < 0)
{
totalnsec += 1e9;
totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"Duration = "<< totaltime << "\n";
return 0;
}
I run it as follows in Ubuntu 14.04 with blas 3.0
erisp#ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp#ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 5
Duration = 0.0291558
erisp#ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 0
Duration = 0.000959521
EDIT
If I replace the gemm call with the following gemv call then matrix sparsity does not matter
//cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
cblas_sgemv(CblasRowMajor, CblasNoTrans, m, m, 1.0f, B, m, A, 1, 0.0f, C, 1);
Results
erisp#ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp#ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 5
Duration = 0.0301581
erisp#ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 0
Duration = 0.0299282
But the issue is that I am trying to optimize someone else's code using cublas and he is successfully and efficiently using gemm to perform this vector-matrix multiplication. So, I have to test against it or to categorically prove this call to be incorrect
EDIT
I have even updated my blas library today by using
sudo apt-get install libblas-dev liblapack-dev
EDIT: Executed the following commands as suggested by different contributors
erisp#ubuntu:~/uas/stackoverflow$ ll -d /usr/lib/libblas* /etc/alternatives/libblas.*
lrwxrwxrwx 1 root root 26 مارچ 13 2015 /etc/alternatives/libblas.a -> /usr/lib/libblas/libblas.a
lrwxrwxrwx 1 root root 27 مارچ 13 2015 /etc/alternatives/libblas.so -> /usr/lib/libblas/libblas.so
lrwxrwxrwx 1 root root 29 مارچ 13 2015 /etc/alternatives/libblas.so.3 -> /usr/lib/libblas/libblas.so.3
lrwxrwxrwx 1 root root 29 مارچ 13 2015 /etc/alternatives/libblas.so.3gf -> /usr/lib/libblas/libblas.so.3
drwxr-xr-x 2 root root 4096 مارچ 13 2015 /usr/lib/libblas/
lrwxrwxrwx 1 root root 27 مارچ 13 2015 /usr/lib/libblas.a -> /etc/alternatives/libblas.a
lrwxrwxrwx 1 root root 28 مارچ 13 2015 /usr/lib/libblas.so -> /etc/alternatives/libblas.so
lrwxrwxrwx 1 root root 30 مارچ 13 2015 /usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
lrwxrwxrwx 1 root root 32 مارچ 13 2015 /usr/lib/libblas.so.3gf -> /etc/alternatives/libblas.so.3gf
erisp#ubuntu:~/uas/stackoverflow$ ldd ./gemmcomp.o
linux-gate.so.1 => (0xb76f6000)
libblas.so.3 => /usr/lib/libblas.so.3 (0xb765e000)
libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb7576000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb73c7000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7381000)
/lib/ld-linux.so.2 (0xb76f7000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7364000)
Question: What could be the reason behind a cblas_sgemm call taking much less time for matrices with a large number of zeros as compared to the same cblas_sgemm call for dense matrices?
It seems that the BLAS implementation provided by the default libblas-dev package for Ubuntu 14.04 (and probably other Ubuntu distributions) includes an optimization for cases where certain matrix elements are zero.
For Ubuntu 14.04, the source code for the BLAS (and cblas) implementation/package can be downloaded from here.
After unpacking that archive, we have a cblas/src directory that contains the cblas API, and we have another src directory that contains F77 implementations of various blas routines.
In the case of cblas_sgemm, when the parameter CblasRowMajor is specified, the cblas/src/cblas_sgemm.c code will call the underlying fortran routine as follows:
void cblas_sgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
const int K, const float alpha, const float *A,
const int lda, const float *B, const int ldb,
const float beta, float *C, const int ldc)
{
...
} else if (Order == CblasRowMajor)
...
F77_sgemm(F77_TA, F77_TB, &F77_N, &F77_M, &F77_K, &alpha, B, &F77_ldb, A, &F77_lda, &beta, C, &F77_ldc);
Note that for this row major call, the order of the A and B matrices are reversed when passed to the F77_sgemm routine. This is sensible but I won't delve into why here. It's sufficient to note that A has become B in the fortran call/code, and B has become A.
When we inspect the corresponding fortran routine in src/sgemm.f, we see the following sequence of code:
*
* Start the operations.
*
IF (NOTB) THEN
IF (NOTA) THEN
*
* Form C := alpha*A*B + beta*C.
*
DO 90 J = 1,N
IF (BETA.EQ.ZERO) THEN
DO 50 I = 1,M
C(I,J) = ZERO
50 CONTINUE
ELSE IF (BETA.NE.ONE) THEN
DO 60 I = 1,M
C(I,J) = BETA*C(I,J)
60 CONTINUE
END IF
DO 80 L = 1,K
IF (B(L,J).NE.ZERO) THEN ***OPTIMIZATION
TEMP = ALPHA*B(L,J)
DO 70 I = 1,M
C(I,J) = C(I,J) + TEMP*A(I,L)
70 CONTINUE
END IF
80 CONTINUE
90 CONTINUE
The above is the section of code that handles the case where No transpose of A and No transpose of B are indicated (which is true for this cblas row-major test case). The matrix row/column multiply operation is handled at the loops beginning where I have added the note ***OPTIMIZATION. In particular, if the matrix element B(L,J) is zero, then the DO-loop closing at line 70 is skipped. But remember B here corresponds to the A matrix passed to the cblas_sgemm routine.
The skipping of this do-loop allows the sgemm function implemented this way to be substantially faster for the cases where there are a large number of zeroes in the A matrix passed to cblas_sgemm when row-major is specified.
Experimentally, not all blas implementation have this optimization. Testing on the exact same platform but using libopenblas-dev instead of libblas-dev provides no such speed-up, i.e. essentially no execution time difference when the A matrix is mostly zeroes, vs. the case when it is not.
Note that the fortran (77) code here appears to be similar to or identical to older published versions of the sgemm.f routine such as here. Newer published versions of this fortran routine that I could find do not contain this optimization, such as here.
Related
Preamble
Some time ago I asked a question about performance of Matlab vs Python (Performance: Matlab vs Python). I was surprised that Matlab is faster than Python, especially in meshgrid. In the discussion of that question, it was pointed to me that I should use a wrapper in Python to call my C++ code because C++ code is also available to me. I have the same code in C++, Matlab and Python.
While doing that, I was surprised once again to find that Matlab is faster than C++ in matrix assembly and computation.I have a slightly larger code, from which I am investigating a segment of matrix-vector multiplication. The larger code performs such multiplications at multiple instances. Overall the code in C++ is much much faster than Matlab (because function calling in Matlab has an overhead etc.), but Matlab seems to be outperforming C++ in the matrix-vector multiplication (code snippet at the bottom).
Results
The table below shows the comparison of time it takes to assemble the kernel matrix and the time it takes to multiply the matrix with the vector. The results are compiled for a matrix size NxN where N varies from 10,000 to 40,000. Which is not that large. But the interesting thing is that Matlab outperforms C++ the larger the N gets. Matlab is 3.8 - 5.8 times faster in total time. Moreover it is also faster in both matrix assembly and computation.
___________________________________________
|N=10,000 Assembly Computation Total |
|MATLAB 0.3387 0.031 0.3697 |
|C++ 1.15 0.24 1.4 |
|Times faster 3.8 |
___________________________________________
|N=20,000 Assembly Computation Total |
|MATLAB 1.089 0.0977 1.187 |
|C++ 5.1 1.03 6.13 |
|Times faster 5.2 |
___________________________________________
|N=40,000 Assembly Computation Total |
|MATLAB 4.31 0.348 4.655 |
|C++ 23.25 3.91 27.16 |
|Times faster 5.8 |
-------------------------------------------
Question
Is there a faster way of doing this in C++? Am I missing something? I understand that C++ is using for loops but my understanding is that Matlab will also be doing something similar in meshgrid.
Code Snippets
Matlab Code:
%% GET INPUT DATA FROM DATA FILES ------------------------------------------- %
% Read data from input file
Data = load('Input/input.txt');
location = Data(:,1:2);
charges = Data(:,3:end);
N = length(location);
m = size(charges,2);
%% EXACT MATRIX VECTOR PRODUCT ---------------------------------------------- %
kex1=ex1;
tic
Q = kex1.kernel_2D(location , location);
fprintf('\n Assembly time: %f ', toc);
tic
potential_exact = Q * charges;
fprintf('\n Computation time: %f \n', toc);
Class (Using meshgrid):
classdef ex1
methods
function [kernel] = kernel_2D(obj, x,y)
[i1,j1] = meshgrid(y(:,1),x(:,1));
[i2,j2] = meshgrid(y(:,2),x(:,2));
kernel = sqrt( (i1 - j1) .^ 2 + (i2 - j2) .^2 );
end
end
end
C++ Code:
EDIT
Compiled using a make file with following flags:
CC=g++
CFLAGS=-c -fopenmp -w -Wall -DNDEBUG -O3 -march=native -ffast-math -ffinite-math-only -I header/ -I /usr/include
LDFLAGS= -g -fopenmp
LIB_PATH=
SOURCESTEXT= src/read_Location_Charges.cpp
SOURCESF=examples/matvec.cpp
OBJECTSF= $(SOURCESF:.cpp=.o) $(SOURCESTEXT:.cpp=.o)
EXECUTABLEF=./exec/mykernel
mykernel: $(SOURCESF) $(SOURCESTEXT) $(EXECUTABLEF)
$(EXECUTABLEF): $(OBJECTSF)
$(CC) $(LDFLAGS) $(KERNEL) $(INDEX) $(OBJECTSF) -o $# $(LIB_PATH)
.cpp.o:
$(CC) $(CFLAGS) $(KERNEL) $(INDEX) $< -o $#
`
# include"environment.hpp"
using namespace std;
using namespace Eigen;
class ex1
{
public:
void kernel_2D(const unsigned long M, double*& x, const unsigned long N, double*& y, MatrixXd& kernel) {
kernel = MatrixXd::Zero(M,N);
for(unsigned long i=0;i<M;++i) {
for(unsigned long j=0;j<N;++j) {
double X = (x[0*N+i] - y[0*N+j]) ;
double Y = (x[1*N+i] - y[1*N+j]) ;
kernel(i,j) = sqrt((X*X) + (Y*Y));
}
}
}
};
int main()
{
/* Input ----------------------------------------------------------------------------- */
unsigned long N = 40000; unsigned m=1;
double* charges; double* location;
charges = new double[N * m](); location = new double[N * 2]();
clock_t start; clock_t end;
double exactAssemblyTime; double exactComputationTime;
read_Location_Charges ("input/test_input.txt", N, location, m, charges);
MatrixXd charges_ = Map<MatrixXd>(charges, N, m);
MatrixXd Q;
ex1 Kex1;
/* Process ------------------------------------------------------------------------ */
// Matrix assembly
start = clock();
Kex1.kernel_2D(N, location, N, location, Q);
end = clock();
exactAssemblyTime = double(end-start)/double(CLOCKS_PER_SEC);
//Computation
start = clock();
MatrixXd QH = Q * charges_;
end = clock();
exactComputationTime = double(end-start)/double(CLOCKS_PER_SEC);
cout << endl << "Assembly time: " << exactAssemblyTime << endl;
cout << endl << "Computation time: " << exactComputationTime << endl;
// Clean up
delete []charges;
delete []location;
return 0;
}
As said in the comments MatLab relies on Intel's MKL library for matrix products, which is the fastest library for such kind of operations. Nonetheless, Eigen alone should be able to deliver similar performance. To this end, make sure to use latest Eigen (e.g. 3.4), and proper compilation flags to enable AVX/FMA if available and multithreading:
-O3 -DNDEBUG -march=native
Since charges_ is a vector, better use a VectorXd to Eigen knows that you want a matrix-vector product and not a matrix-matrix one.
If you have Intel's MKL, then you can also let Eigen uses it to get exact same performance than MatLab for this precise operation.
Regarding the assembly, better inverse the two loops to enable vectorization, then enable multithreading with OpenMP (add -fopenmp as compiler flags) to make the outermost loop run in parallel, and finally you can simplify your code using Eigen:
void kernel_2D(const unsigned long M, double* x, const unsigned long N, double* y, MatrixXd& kernel) {
kernel.resize(M,N);
auto x0 = ArrayXd::Map(x,M);
auto x1 = ArrayXd::Map(x+M,M);
auto y0 = ArrayXd::Map(y,N);
auto y1 = ArrayXd::Map(y+N,N);
#pragma omp parallel for
for(unsigned long j=0;j<N;++j)
kernel.col(j) = sqrt((x0-y0(j)).abs2() + (x1-y1(j)).abs2());
}
With multi-threading you need to measure the wall clock time. Here (Haswell with 4 physical cores running at 2.6GHz) the assembly time drops to 0.36s for N=20000, and the matrix-vector products take 0.24s so 0.6s in total that is faster than MatLab whereas my CPU seems to be slower than yours.
You might be interested to look at the MATLAB Central contribution mtimesx.
Mtimesx is a mex function that optimizes matrix multiplications using the BLAS library, openMP and other methods. In my experience, when it was originally posted it could be beat stock MATLAB by 3 orders of magnitude in some cases. (Somewhat embarrassing for MATHWORKS, I presume.) These days MATLAB has improved its own methods (I suspect borrowing from this.) and the differences are less severe. MATLAB sometimes out-performs it.
I am trying to write a program for matrix calculations using C/CUDA.
I have the following program:
In main.cu
#include <cuda.h>
#include <iostream>
#include "teste.cuh"
using std::cout;
int main(void)
{
const int Ndofs = 2;
const int Nel = 4;
double *Gh = new double[Ndofs*Nel*Ndofs*Nel];
double *Gg;
cudaMalloc((void**)& Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel);
for (int ii = 0; ii < Ndofs*Nel*Ndofs*Nel; ii++)
Gh[ii] = 0.;
cudaMemcpy(Gh, Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyHostToDevice);
integraG<<<256, 256>>>(Nel, Gg);
cudaMemcpy(Gg, Gh, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyDeviceToHost);
for (int ii = 0; ii < Ndofs*Nel*Ndofs*Nel; ii++)
cout << ii + 1 << " " << Gh[ii] << "\n";
return 0;
}
In mtrx.cuh
#ifndef TESTE_CUH_
#define TESTE_CUH_
__global__ void integraG(const int N, double* G)
{
const int szmodel = 2*N;
int idx = threadIdx.x + blockIdx.x*blockDim.x;
int idy = threadIdx.y + blockIdx.y*blockDim.y;
int offset = idx + idy*blockDim.x*gridDim.x;
int posInit = szmodel*offset;
G[posInit + 0] = 1;
G[posInit + 1] = 1;
G[posInit + 2] = 1;
G[posInit + 3] = 1;
}
#endif
The result (which is supposed to be a matrix filled with 1's) is copied back to the host array; The problem is: nothing happens! Apparently, my program is not calling the gpu kernel, and I am still getting an array full of zeros.
I am very new to CUDA programming and I am using CUDA by example (Jason Sanders) as a reference book.
My questions are:
What is wrong with my code?
Is this the best way to deal with matrices using GPU, using matrices vectorized form?
Is there another reference that can provide more examples on matrices using GPU's?
These are your questions:
What is wrong with my code?
Is this the best way to deal with matrices using GPU, using matrices vectorized form?
Is there another reference that can provide more examples on matrices using GPU's?
For your first question. First of all, your problem should explicitly be defined. What do you want to do with this code? what sort of calculations do you want to do on the Matrix?
Try to check for errors properly THIS is a very good way to do so. There are some obvious bugs in your code as well. some of your bugs:
You're passing the wrong address pointers to the cudaMemcpy, the pointers that are passed to the source and the destination have to be swapped with each other, Check here
Change them to:
"NdofsNelNdofs*Nel" shows that you're interested in the value of the first 64 numbers of the array, so why calling 256 Threads and 256 blocks?
This part of your code:
int idx = threadIdx.x + blockIdx.xblockDim.x;
int idy = threadIdx.y + blockIdx.yblockDim.y;
shows that you want to use 2-Dim threads and blocks; to do that so, you need to use Dim type.
By making the following changes:
cudaMemcpy(Gg, Gh, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyHostToDevice); //HERE
dim3 block(2,2); //HERE
dim3 thread(4,4); //HERE
integraG<<<block, thread>>>(Nel, Gg); //HERE
cudaMemcpy(Gh, Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyDeviceToHost); //HERE
You'll get a result like the following:
1 1
2 1
3 1
4 1
5 0
6 0
7 0
8 0
9 1
10 1
11 1
12 1
.
.
.
57 1
58 1
59 1
60 1
61 0
62 0
63 0
64 0
Anyway, if you state your problem and goal more clearly, better suggestions can be provided for you.
Regarding to your last two questions:
In my opinion CUDA C PROGRAMMING GUIDE and CUDA C BEST PRACTICES GUIDE are the two must documents to read when starting with CUDA, and they include examples on Matrix calculations as well.
I need to optimize an expression of the form:
(a > b) || (a > c)
I tried several optimized forms one of which is as follows:
(a * 2) > (b + c)
Optimization is not from the compiler's point of view. I would like to reduce the two >s to one.
This is based on the assumption that 1 <= (a, b, c) <= 26
However, this works only for some cases. Is the optimization I am trying to do, really possible? If yes, a start would be really helpful.
The answer is probably: you do not want to optimize that. Moreover, I doubt that there's any way to write this more efficiently. If you say that a, b and c are values between 1 and 26, you shouldn't be using integers (you don't need that precision) if you wanted to be optimal (in size) anyway.
If a > b, the expression a > c will not be executed anyway. So you have at maximum 2 (and at minimum 1) conditional operations, which is really not worth an optimization.
I'm quite doubtful this is even an optimisation in most cases.
a > b || a > c
will evaluate to:
compare a b
jump not greater
compare a c
jump not greater
where
a * 2 > b + c
gives:
shift a left 1 (in temp1)
add b to c (in temp2)
compare temp1 temp2
jump if not greater
As always with performance, it's always much better to base your decision on actual performance measurements (preferably on a selection of processor architectures).
The best I can come up with is this
char a, b, c;
std::cin >> a >> b >> c;
if (((b-a) | (c-a)) & 0x80) {
// a > b || a > c
}
With gcc -O2 this generates only one conditional branch
40072e: 29 c8 sub %ecx,%eax
400730: 29 ca sub %ecx,%edx
400732: 09 d0 or %edx,%eax
400734: a8 80 test $0x80,%al
400736: 74 17 je 40074f <main+0x3f>
This leverages the constraints of the input values, since the values cannot be greater than 26 then subtracting a from b will give you a negative value when a > b, in two's complement you know bit 7 will be set in that case - the same applies to c. I then OR both so that bit 7 indicates whether a > b || a > c, lastly we inspect bit 7 by AND with 0x80 and branch on that.
Update: Out of curiosity I timed 4 different ways of coding this. To generate test data I used a simple linear congruential pseudo-random number generator. I timed it in a loop for 100 million iterations. I assumed for simplicity that if the condition is true we want to add 5 to a counter, do nothing otherwise. I timed it using g++ (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2) on an Intel Xeon X5570 # 2.93GHz using -O2 optimization level.
Here's the code (comment out all but one of the conditional variants):
#include <iostream>
unsigned myrand() {
static unsigned x = 1;
return (x = x * 1664525 + 1013904223);
}
int main() {
size_t count = 0;
for(size_t i=0; i<100000000; ++i ) {
int a = 1 + myrand() % 26;
int b = 1 + myrand() % 26;
int c = 1 + myrand() % 26;
count += 5 & (((b-a) | (c-a)) >> 31); // 0.635 sec
//if (((b-a) | (c-a)) & 0x80) count += 5; // 0.660 sec
//if (a > std::max(b,c)) count += 5; // 0.677 sec
//if ( a > b || a > c) count += 5; // 1.164 sec
}
std::cout << count << std::endl;
return 0;
}
The fastest is a modification on the suggestion in my answer, where we use sign extension to generate a mask that is either 32 1s or 32 0s depending on whether the condition is true of false, and use that to mask the 5 being added so that it either adds 5 or 0. This variation has no branches. The times are in a comment on each line. The slowest was the original expression ( a > b || a > c).
Is there a C++ library that can take nth roots of big numbers (numbers than can't fit in an unsigned long long)?
You can use GMP, a popular open source arbitrary-precision math library. It has C++ bindings.
If you want to code this yourself, check out the Wikipedia page on nth roots:
http://en.wikipedia.org/wiki/Nth_root
The iterative algorithm is quite simple:
The nth root of a number A can be computed by the nth root algorithm, a special case of Newton's method. Start with an initial guess x(0) and then iterate using the recurrence relation
x(k+1) = [(n - 1) * x(k) + A / x(k)^(n - 1)] / n
Stop once you've converged to the desired accuracy.
It depends how much bigger than 2^64 you want to go, I guess. Just using doubles is good to about 1 part in 10^9. I wrote a test program in C:
#include <stdio.h>
#include <math.h>
int main(int argc, char **argv)
{
unsigned long long x;
double dx;
int i;
//make x the max possible value
x = ~0ULL;
dx = (double)x;
printf("Starting with dx = %f\n", dx);
//print the 2th to 20th roots
for (i = 2; i < 21; i++)
{
printf("%dth root %.15f\n", i, pow(dx, 1.0/i));
}
return 0;
}
which produced the following output:
Starting with dx = 18446744073709551616.000000
2th root 4294967296.000000000000000
3th root 2642245.949629130773246
4th root 65536.000000000000000
5th root 7131.550214521852467
6th root 1625.498677215435691
7th root 565.293831000991759
8th root 256.000000000000000
9th root 138.247646578215154
10th root 84.448506289465257
11th root 56.421840319745364
12th root 40.317473596635935
13th root 30.338480458853493
14th root 23.775908626191171
15th root 19.248400577313866
16th root 16.000000000000000
17th root 13.592188707483222
18th root 11.757875938204789
19th root 10.327513583579238
20th root 9.189586839976281
Then I compared with Wolfram Alpha for each root to get the error I quoted above.
Depending on your application, perhaps this will be good enough.
Try also MAPM and qd.
MAPM is written in C but also has a C++ API. qd is written in C++ but also has a C API.
Heres a perfect loop for it. I get the exact answer everytime.
// n1 = <input>, n2 = <base limit>, nmrk = <number of cycles>
long double n3 = 0, n2 = 0, n1 = input_number, n4 = 0;
long double mk = 0, tptm = 0, mnk = 0, dad = 0;
for (n3 = 0; tptm != n1 || mnk > 65535 ; nmrk++) {
n4 += 0.19625;
n2 += 0.15625;
n3 += 0.015625;
mk += 0.0073125;
dad += 0.00390625;
mnk = pow(n1, 1.0/(n4+n2+mk+n3+dad));
tptm = pow((mnk), (n4+n2+mk+n3+dad));
}
if (tptm - n1 < 1)
{
uint64_t y = (tptm);
return make_tuple(nmrk, (n1 - y), mnk);
}
I found this to minutes faster
Long division method is the best method to compute the nth root of any positive real number. It gives the best precision of each digit computed. No initial guess and no iterative approximation is required.
My team need the "Sobol quasi-random number generator" - a common RNG which is famous for good quality results and speed of operation. I found what looks like a simple C implementation on the web. At home I was able to compile it almost instantaneously using my Linux GCC compiler.
The following day I tried it at work: If I compile in Visual Studio in debug mode it takes about 1 minute. If I were to compile it in release mode it takes about 40 minutes.
Why?
I know that "release" mode triggers some compiler optimization... but how on earth could a file this small take so long to optimize? It's mostly comments and static-data. There's hardly anything worth optimizing.
None of these PCs are particularly slow, and in any case I know that the compile time is consistent across a range of Windows computers. I've also heard that newer versions of Visual Studio have a faster compile time, however for now we are stuck with Visual Studio.Net 2003. Compiling on GCC (the one bundled with Ubuntu 8.04) always takes microseconds.
To be honest, I'm not really sure the codes that good. It's got a nasty smell in it. Namely, this function:
unsigned int i4_xor ( unsigned int i, unsigned int j )
//****************************************************************************80
//
// Purpose:
//
// I4_XOR calculates the exclusive OR of two integers.
//
// Modified:
//
// 16 February 2005
//
// Author:
//
// John Burkardt
//
// Parameters:
//
// Input, unsigned int I, J, two values whose exclusive OR is needed.
//
// Output, unsigned int I4_XOR, the exclusive OR of I and J.
//
{
unsigned int i2;
unsigned int j2;
unsigned int k;
unsigned int l;
k = 0;
l = 1;
while ( i != 0 || j != 0 )
{
i2 = i / 2;
j2 = j / 2;
if (
( ( i == 2 * i2 ) && ( j != 2 * j2 ) ) ||
( ( i != 2 * i2 ) && ( j == 2 * j2 ) ) )
{
k = k + l;
}
i = i2;
j = j2;
l = 2 * l;
}
return k;
}
There's an i8_xor too. And a couple of abs functions.
I think a post to the DailyWTF is in order.
EDIT: For the non-c programmers, here's a quick guide to what the above does:
function xor i:unsigned, j:unsigned
answer = 0
bit_position = 1
while i <> 0 or j <> 0
if least significant bit of i <> least significant bit of j
answer = answer + bit_position
end if
bit_position = bit_position * 2
i = i / 2
j = j / 2
end while
return answer
end function
To determine if the least significant bit is set or cleared, the following is used:
bit set if i <> (i / 2) * 2
bit clear if i == (i / 2) * 2
What makes the code extra WTFy is that C defines an XOR operator, namely '^'. So, instead of:
result = i4_xor (a, b);
you can have:
result = a ^ b; // no function call at all!
The original programmer really should have know about the xor operator. But even if they didn't (and granted, it's another obfuscated C symbol), their implementation of an XOR function is unbelievably poor.
I'm using VC++ 2003 and it compiled instantly in both debug/release modes.
Edit:
Do you have the latest service pack installed on your systems?
I would recommend you download a trial edition of Visual Studio 2008 and try the compile there, just to see if the problem is inherent. Also, if it does happen on a current version, you would be able to report the problem, and Microsoft might fix it.
On the other hand, there is no chance that Microsoft will fix whatever bug is in VS2003.