I am trying to measure the CPU time of the following code:
struct timespec time1, time2, temp_time;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time1);
long diff = 0;
for(int y=0; y<n; y++) {
for(int x=0; x<n; x++) {
float v = 0.0f;
for(int i=0; i<n; i++)
v += a[y * n + i] * b[i * n + x];
c[y * n + x] = v;
}
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time2);
temp_time.tv_sec = time2.tv_sec - time1.tv_sec;
temp_time.tv_nsec = time2.tv_nsec - time1.tv_nsec;
diff = temp_time.tv_sec * 1000000000 + temp_time.tv_nsec;
printf("finished calculations using CPU in %ld ms \n", (double) diff/1000000);
But the time value is negative when I increase the value of n.
The code prints the correct value for n = 500, but it prints a negative value for n = 700.
Any help would be appreciated.
Here is the full code structure -
void run(float A[], float B[], float C[], int nelements){
struct timespec time1, time2, temp_time;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time1);
long diff = 0;
for(int y=0; y<nelements; y++) {
for(int x=0; x<nelements; x++) {
float v = 0.0f;
for(int i=0; i<nelements; i++)
v += A[y * nelements + i] * B[i * nelements + x];
C[y * nelements + x] = v;
}
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time2);
temp_time.tv_sec = time2.tv_sec - time1.tv_sec;
temp_time.tv_nsec = time2.tv_nsec - time1.tv_nsec;
diff = temp_time.tv_sec * 1000000000 + temp_time.tv_nsec;
printf("finished calculations using CPU in %ld ms \n", (double) diff/1000000);
}
This function above is called from a different file as follows:
SIZE = 500;
a = (float*)malloc(SIZE * SIZE * sizeof(float));
b = (float*)malloc(SIZE * SIZE * sizeof(float));
c = (float*)malloc(SIZE * SIZE * sizeof(float));
//initialize a &b
run(&a[SIZE],&b[SIZE],&c[SIZE],SIZE);
Looks like an overflow; use unsigned long, or better, double for diff. The multiplication temp_time.tv_sec * 1000000000 is done in integer arithmetic before the result is stored, so if long is 32 bits on your platform it overflows once the elapsed time passes roughly 2.1 seconds, which would explain why n = 500 works while n = 700 comes out negative.
One possible cause of the problem is that the printf format specifier is for a signed long integer value (%ld), but the argument has type double. To fix the problem it is necessary to change %ld to %lf in the format string.
Look at your print statement:
printf("finished calculations using CPU in %ld ms \n", (double) diff/1000000);
The second parameter you pass is a double, but you are printing out this floating point value as a long (%ld). I suspect that's half your problem.
This may generate better results:
printf("finished calculations using CPU in %f ms \n", diff/1000000.0);
I also agree with keety: you likely should be using unsigned types. Or you could avoid the overflow issues altogether by staying in millisecond units instead of nanoseconds. That's why I stick with 64-bit unsigned integers and just stay in the millisecond realm.
unsigned long long diffMilliseconds;
diffMilliseconds = (time2.tv_sec * 1000LL + time2.tv_nsec/1000000) - (time1.tv_sec * 1000LL + time1.tv_nsec/1000000);
printf("finished calculations using CPU in %llu ms \n", diffMilliseconds);
The 'tv_nsec' field should never exceed 10^9 (1000000000), for obvious reasons:
if (time2.tv_nsec < time1.tv_nsec)
{
    /* borrow whole seconds so the nanosecond difference comes out non-negative */
    int adj = (time1.tv_nsec - time2.tv_nsec) / 1000000000 + 1;
    time1.tv_nsec -= 1000000000 * adj;
    time1.tv_sec += adj;
}
if (time2.tv_nsec - time1.tv_nsec > 1000000000)
{
    int adj = (time2.tv_nsec - time1.tv_nsec) / 1000000000;
    time1.tv_nsec += 1000000000 * adj;
    time1.tv_sec -= adj;
}
temp_time.tv_sec = time2.tv_sec - time1.tv_sec;
temp_time.tv_nsec = time2.tv_nsec - time1.tv_nsec;
diff = temp_time.tv_sec * 1000000000LL + temp_time.tv_nsec; /* 64-bit multiply avoids the overflow */
This code could be simplified, as it makes no assumptions about the sign of the 'tv_sec' field. Most Linux sys headers (and glibc?) provide macros to handle this sort of timespec arithmetic correctly, don't they?
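For what it's worth, here is a minimal sketch of such a helper (my own, not one of the system macros) that returns the elapsed nanoseconds, assuming the stop time is not earlier than the start time:
#include <time.h>

/* Sketch: elapsed nanoseconds from start to stop, assuming stop >= start. */
static long long timespec_diff_ns(struct timespec start, struct timespec stop)
{
    long long sec  = stop.tv_sec  - start.tv_sec;
    long long nsec = stop.tv_nsec - start.tv_nsec;
    if (nsec < 0) {               /* borrow one second */
        sec  -= 1;
        nsec += 1000000000LL;
    }
    return sec * 1000000000LL + nsec;
}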
For the function below, I would like to add the noise (Inoise in the code below) not in every time step but, for example, only in every second time step. So while dt = 0.0025 serves as the time step for the numerical integration, I would, for example, add Inoise only in every second time step (i.e. in steps of 0.005).
What is the best way to insert this into my existing function?
runs = 1000;
t_end = 5;
dt = 0.0025;
t_steps = t_end/dt;
for(int j=0; j<runs; j++){
double vT = v0;
double mT = m0;
double hT = h0;
double nT = n0;
for(int i=0; i<t_steps; i++){
double IStim = 0.0;
if ((delay / dt <= (double)i) && ((double)i <= (delay + duration) / dt))
IStim = I;
mT = (mT + dt * alphaM(vT)) / (1.0 + dt * (alphaM(vT) + betaM(vT)));
hT = (hT + dt * alphaH(vT)) / (1.0 + dt * (alphaH(vT) + betaH(vT)));
nT = (nT + dt * alphaN(vT)) / (1.0 + dt * (alphaN(vT) + betaN(vT)));
const double iNa = gNa * pow(mT, 3.0) * hT * (vT - vNa);
const double iK = gK * pow(nT, 4.0) * (vT - vK);
const double iL = gL * (vT-vL);
const double Inoise = (doubleRand() * knoise * sqrt(gNa * A));
const double IIon = ((iNa + iK + iL) * A) + Inoise;
vT += ((-IIon + IStim) / C) * dt;
voltage[i] = vT;
if(vT > 60.0) {
count++;
break;
}
}
}
return count;
}
You could accumulate the elapsed time and only add the noise once enough steps have passed:
double elapsedTime = 0;
double INoiseThreshold = 0.005;
for(int j=0; j<runs; j++){
//...
for(int i=0; i<t_steps; i++){
//...
double Inoise = 0;
elapsedTime += dt;
if(elapsedTime >= INoiseThreshold){
Inoise = (doubleRand() * knoise * sqrt(gNa * A));
elapsedTime = 0;
}
const double IIon = ((iNa + iK + iL) * A) + Inoise;
//...
}
}
return count;
Instead of comparing the floating point numbers directly, you could check whether their difference is within a small epsilon to allow for rounding errors.
Instead of making the value of Inoise dependent on the condition, you could make the presence in the IIon formula dependent e.g.:
const double IIon = ((iNa + iK + iL) * A) + ((elapsedTime >= INoiseThreshold) ? Inoise : 0.0);
Just remember to reset elapsedTime once it has surpassed the threshold.
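Since dt is fixed, another option (my own suggestion, not part of the answer above) is to count integration steps instead of accumulating floating-point time, which avoids the rounding issue entirely. A sketch, assuming doubleRand(), knoise, gNa and A from the original function, and noise every second step:
const int noiseEvery = 2;   // with dt = 0.0025 this corresponds to 0.005
for (int i = 0; i < t_steps; i++) {
    // ...
    double Inoise = 0.0;
    if ((i + 1) % noiseEvery == 0)                        // every second step
        Inoise = doubleRand() * knoise * sqrt(gNa * A);
    const double IIon = ((iNa + iK + iL) * A) + Inoise;
    // ...
}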
I have implemented a cascaded addition function for a large vector of float values on my GPU and my CPU. That simply means that all elements of this vector shall be summed up into one result. The CPU algorithm is quite trivial and works fine, but the GPU algorithm is always 35200 off the desired result.
The minimal working code for the algorithm and comparison to the CPU is below.
The output is always this:
CPU Time: 22.760059 ms, bandwidth: 3.514929 GB/s
GPU Time (improved): 12.077088 ms, bandwidth: 6.624114 GB/s
- CPU result does not match GPU result in improved atomic add.
CPU: 10000000.000000, GPU: 10035200.000000, diff:-35200.000000
I checked it with cuda-memcheck but no errors occurred in that run. I have tried many, many different things, but none of them worked. It is not due to the inaccuracy of the float datatype, because I changed all floats to ints and still got the exact same result.
This is my code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <chrono>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
void reductionWithCudaImproved(float *result, const float *input);
__global__ void reductionKernelImproved(float *result, const float *input);
void reductionCPU(float *result, const float *input);
#define SIZE 10000000
#define TILE 32
#define ILP 8
#define BLOCK_X_IMPR (TILE / ILP)
#define BLOCK_Y_IMPR 32
#define BLOCK_COUNT_X_IMPR 100
int main()
{
int i;
float *input;
float resultCPU, resultGPU;
double cpuTime, cpuBandwidth;
input = (float*)malloc(SIZE * sizeof(float));
resultCPU = 0.0;
resultGPU = 0.0;
srand((int)time(NULL));
auto start = std::chrono::high_resolution_clock::now();
auto end = std::chrono::high_resolution_clock::now();
for (i = 0; i < SIZE; i++)
input[i] = 1.0;
start = std::chrono::high_resolution_clock::now();
reductionCPU(&resultCPU, input);
end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end - start;
cpuTime = (diff.count() * 1000);
cpuBandwidth = (sizeof(float) * SIZE * 2) / (cpuTime * 1000000);
printf("CPU Time: %f ms, bandwidth: %f GB/s\n\n", cpuTime, cpuBandwidth);
reductionWithCudaImproved(&resultGPU, input);
if (resultCPU != resultGPU)
printf("- CPU result does not match GPU result in improved atomic add. CPU: %f, GPU: %f, diff:%f\n\n", resultCPU, resultGPU, (resultCPU - resultGPU));
else
printf("+ CPU result matches GPU result in improved atomic add. CPU: %f, GPU: %f\n\n", resultCPU, resultGPU);
return 0;
}
void reductionCPU(float *result, const float *input)
{
for (int i = 0; i < SIZE; i++)
*result += input[i];
}
__global__ void reductionKernelImproved(float *result, const float *input)
{
int i;
int col = (blockDim.x * blockIdx.x + threadIdx.x) * ILP;
int row = blockDim.y * blockIdx.y + threadIdx.y;
int index = row * blockDim.x * BLOCK_COUNT_X_IMPR + col;
__shared__ float interResult;
if (threadIdx.x == 0 && threadIdx.y == 0)
interResult = 0.0;
__syncthreads();
#pragma unroll ILP
for (i = 0; i < ILP; i++)
{
if (index < SIZE)
{
atomicAdd(&interResult, input[index]);
index++;
}
}
__syncthreads();
if (threadIdx.x == 0 && threadIdx.y == 0)
atomicAdd(result, interResult);
}
void reductionWithCudaImproved(float *result, const float *input)
{
dim3 dim_grid, dim_block;
float *dev_input = 0;
float *dev_result = 0;
cudaEvent_t start, stop;
float elapsed = 0;
double gpuBandwidth;
dim_block.x = BLOCK_X_IMPR;
dim_block.y = BLOCK_Y_IMPR;
dim_block.z = 1;
dim_grid.x = BLOCK_COUNT_X_IMPR;
dim_grid.y = (int)ceil((float)SIZE / (float)(TILE * dim_block.y* BLOCK_COUNT_X_IMPR));
dim_grid.z = 1;
cudaSetDevice(0);
cudaMalloc((void**)&dev_input, SIZE * sizeof(float));
cudaMalloc((void**)&dev_result, sizeof(float));
cudaMemcpy(dev_input, input, SIZE * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_result, result, sizeof(float), cudaMemcpyHostToDevice);
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
reductionKernelImproved << <dim_grid, dim_block >> >(dev_result, dev_input);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed, start, stop);
gpuBandwidth = (sizeof(float) * SIZE * 2) / (elapsed * 1000000);
printf("GPU Time (improved): %f ms, bandwidth: %f GB/s\n", elapsed, gpuBandwidth);
cudaDeviceSynchronize();
cudaMemcpy(result, dev_result, sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(dev_input);
cudaFree(dev_result);
return;
}
I think you have overlapping indices in your kernel call:
int col = (blockDim.x * blockIdx.x + threadIdx.x) * ILP;
int row = blockDim.y * blockIdx.y + threadIdx.y;
int index = row * blockDim.x * BLOCK_COUNT_X_IMPR + col;
If I am not mistaken, your blockDim.x = 4 and BLOCK_COUNT_X_IMPR = 100, so each row will jump 400 indices.
However, your col can go as high as 400 * 8.
Consider:
blockIdx = (12, 0)
threadIdx = (3, 0)
=> col = (12*4 + 3) * 8 = 408
row = 0
index = 408
blockIdx = (0, 0)
threadIdx = (1, 1)
=> col = (0*4 + 1) * 8 = 8
row = 1
index = 1 * 400 + 8 = 408
So I guess you should rewrite your index calculation:
// gridDim.x = BLOCK_COUNT_X_IMPR
int index = row * blockDim.x * gridDim.x * ILP + col;
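With that stride the row step becomes blockDim.x * gridDim.x * ILP = 4 * 100 * 8 = 3200, so the two threads from the example above no longer collide:
blockIdx = (12, 0)
threadIdx = (3, 0)
=> index = 0 * 3200 + 408 = 408
blockIdx = (0, 0)
threadIdx = (1, 1)
=> index = 1 * 3200 + 8 = 3208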
I'm working with the FFTW library in C++. I know that the calculation of the FFT is most efficient for powers of 2, but I created a minimal example of a two-dimensional FFT and I get a different result: the 2D FFT whose size is not a power of 2 is calculated much faster than the power-of-2 one. Here is my code:
int N = 2083;
int M = 2087;
int Npow2 = pow(2, ceil(log2(N)));
int Mpow2 = pow(2, ceil(log2(M)));
fftw_complex * signala = (fftw_complex *)fftw_malloc(sizeof(fftw_complex)* N * M);
for (int i = 0; i < N; i++)
{
for (int j = 0; j < M; j++)
{
signala[i*M + j][0] = rand();
signala[i*M + j][1] = 0;
}
}
fftw_complex * signala_ext = (fftw_complex *)fftw_malloc(sizeof(fftw_complex)* Npow2 * Mpow2);
fftw_complex * outa = (fftw_complex *)fftw_malloc(sizeof(fftw_complex)* N * M);
fftw_complex * outaext = (fftw_complex *)fftw_malloc(sizeof(fftw_complex)* Npow2 * Mpow2);
//Create Plans
fftw_plan pa = fftw_plan_dft_2d(N, M, signala, outa, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_plan paext = fftw_plan_dft_2d(Npow2, Mpow2, signala_ext, outaext, FFTW_FORWARD, FFTW_ESTIMATE);
//zeropadding
memset(signala_ext, 0, sizeof(fftw_complex)* Npow2 * Mpow2); //Null setzen
for (int i = 0; i < N; i++)
{
for (int j = 0; j < M; j++)
{
signala_ext[i*Mpow2 + j][0] = signala[i*M + j][0];
signala_ext[i*Mpow2 + j][1] = signala[i*M + j][1];
}
}
//Execute FFT
double tstart1 = clock();
fftw_execute(pa);
double time1 = (clock() - tstart1) / CLOCKS_PER_SEC;
printf("Time: %f sec\n", time1);
double tstart2 = clock();
fftw_execute(paext);
double time2 = (clock() - tstart2) / CLOCKS_PER_SEC;
printf("Time: %f sec\n", time2);
I chose prime numbers for N and M. My program returns:
For signala (non-power-of-2): 2.95 sec
For signala_ext (power-of-2): 5.232 sec
Why is the fft with power of 2 so much slower? What have I done wrong?
I will be thankful for any help!
FFTW likes dimensions which are products of powers of small primes. The nearest value above 2083 or 2087 which meets this criterion is 2100 (2100 = 2^2 * 3 * 5^2 * 7), so if you go for dimensions of 2100 x 2100 then you should see decent performance.
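If you would rather compute such a size than work it out by hand, a small helper along these lines (my own sketch, not part of the FFTW API) finds the next length whose only prime factors are 2, 3, 5 and 7:
/* Return the smallest value >= n whose prime factors are all in {2, 3, 5, 7}. */
int next_smooth_size(int n)
{
    const int primes[] = { 2, 3, 5, 7 };
    for (;; n++)
    {
        int m = n;
        for (int i = 0; i < 4; i++)
            while (m % primes[i] == 0)
                m /= primes[i];
        if (m == 1)      /* nothing left, so all factors were small */
            return n;
    }
}
For example, next_smooth_size(2083) and next_smooth_size(2087) both return 2100.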
I've been using OpenCV to do some block matching and I've noticed its sum of squared differences code is very fast compared to a straightforward for loop like this:
int SSD = 0;
for(int i =0; i < arraySize; i++)
SSD += (array1[i] - array2[i] )*(array1[i] - array2[i]);
If I look at the source code to see where the heavy lifting happens, the OpenCV folks have their for loops do four squared-difference calculations at a time in each iteration of the loop. The function that does the block matching looks like this:
int64
icvCmpBlocksL2_8u_C1( const uchar * vec1, const uchar * vec2, int len )
{
int i, s = 0;
int64 sum = 0;
for( i = 0; i <= len - 4; i += 4 )
{
int v = vec1[i] - vec2[i];
int e = v * v;
v = vec1[i + 1] - vec2[i + 1];
e += v * v;
v = vec1[i + 2] - vec2[i + 2];
e += v * v;
v = vec1[i + 3] - vec2[i + 3];
e += v * v;
sum += e;
}
for( ; i < len; i++ )
{
int v = vec1[i] - vec2[i];
s += v * v;
}
return sum + s;
}
This calculation is for unsigned 8 bit integers. They perform a similar calculation for 32-bit floats in this function:
double
icvCmpBlocksL2_32f_C1( const float *vec1, const float *vec2, int len )
{
double sum = 0;
int i;
for( i = 0; i <= len - 4; i += 4 )
{
double v0 = vec1[i] - vec2[i];
double v1 = vec1[i + 1] - vec2[i + 1];
double v2 = vec1[i + 2] - vec2[i + 2];
double v3 = vec1[i + 3] - vec2[i + 3];
sum += v0 * v0 + v1 * v1 + v2 * v2 + v3 * v3;
}
for( ; i < len; i++ )
{
double v = vec1[i] - vec2[i];
sum += v * v;
}
return sum;
}
I was wondering if anyone had any idea whether breaking a loop up into chunks of 4 like this might speed up code? I should add that there is no multithreading occurring in this code.
My guess is that this is just a simple implementation of unrolling the loop - it saves 3 additions and 3 compares on each pass of the loop, which can be a great savings if, for example, checking len involves a cache miss. The downside is that this optimization adds code complexity (e.g. the additional for loop at the end to finish the loop for the len % 4 items left if the length is not evenly divisible by 4) and, of course, it's an architecture-dependent optimization whose magnitude of improvement will vary by hardware/compiler/etc...
Still, it's straightforward to follow compared to most optimizations and will probably result in some sort of performance increase regardless of the architecture, so it's low risk to just throw it in there and hope for the best. Since OpenCV is such a well-supported chunk of code, I'm sure that someone instrumented these chunks of code and found them to be well worth it - as you yourself have done.
There is one obvious optimisation of your code, viz:
int SSD = 0;
for(int i = 0; i < arraySize; i++)
{
int v = array1[i] - array2[i];
SSD += v * v;
}
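If you want to try the unrolling idea on your own loop, a sketch combining it with the temporary-variable change above might look like this (the same pattern as the OpenCV code, with a cleanup loop for the remainder; whether it actually helps depends on your compiler and hardware, as discussed above):
int SSD = 0;
int i = 0;
for (; i <= arraySize - 4; i += 4)
{
    int v0 = array1[i]     - array2[i];
    int v1 = array1[i + 1] - array2[i + 1];
    int v2 = array1[i + 2] - array2[i + 2];
    int v3 = array1[i + 3] - array2[i + 3];
    SSD += v0 * v0 + v1 * v1 + v2 * v2 + v3 * v3;
}
for (; i < arraySize; i++)   /* handle the arraySize % 4 leftover elements */
{
    int v = array1[i] - array2[i];
    SSD += v * v;
}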
We need to change/reimplement the standard DFT implementation in GSL, which is:
int
FUNCTION(gsl_dft_complex,transform) (const BASE data[],
const size_t stride, const size_t n,
BASE result[],
const gsl_fft_direction sign)
{
size_t i, j, exponent;
const double d_theta = 2.0 * ((int) sign) * M_PI / (double) n;
/* FIXME: check that input length == output length and give error */
for (i = 0; i < n; i++)
{
ATOMIC sum_real = 0;
ATOMIC sum_imag = 0;
exponent = 0;
for (j = 0; j < n; j++)
{
double theta = d_theta * (double) exponent;
/* sum = exp(i theta) * data[j] */
ATOMIC w_real = (ATOMIC) cos (theta);
ATOMIC w_imag = (ATOMIC) sin (theta);
ATOMIC data_real = REAL(data,stride,j);
ATOMIC data_imag = IMAG(data,stride,j);
sum_real += w_real * data_real - w_imag * data_imag;
sum_imag += w_real * data_imag + w_imag * data_real;
exponent = (exponent + i) % n;
}
REAL(result,stride,i) = sum_real;
IMAG(result,stride,i) = sum_imag;
}
return 0;
}
In this implementation, GSL runs a doubly nested loop over the input size, i.e. O(n^2) work. However, we need to compute the transform only for a specific set of frequency bins. For instance, we have 4096 samples, but we need to calculate the DFT for only 128 different frequencies. Could you help me define or implement the required DFT behaviour? Thanks in advance.
EDIT: We are not looking for just the first m frequencies.
Actually, is the approach below correct for finding the DFT result for a given number of frequency bins?
N = sample size
B = number of frequency bins
k = 0, ..., 127
X[k] = SUM(i = 0 .. N-1) { x[i] * exp(-j * 2 * pi * k * i / B) }
EDIT: I may not have explained the DFT problem in enough detail; nevertheless, I am happy to provide the answer below:
void compute_dft(const std::vector<double>& signal,
const std::vector<double>& frequency_band,
std::vector<double>& result,
const double sampling_rate)
{
if(0 == result.size() || result.size() != (frequency_band.size() << 1)){
result.resize(frequency_band.size() << 1, 0.0);
}
//note complex signal assumption
const double d_theta = -2.0 * PI * sampling_rate;
for(size_t k = 0; k < frequency_band.size(); ++k){
const double f_k = frequency_band[k];
double real_sum = 0.0;
double imag_sum = 0.0;
for(size_t n = 0; n < (signal.size() >> 1); ++n){
double theta = d_theta * f_k * (n + 1);
double w_real = cos(theta);
double w_imag = sin(theta);
double d_real = signal[2*n];
double d_imag = signal[2*n + 1];
real_sum += w_real * d_real - w_imag * d_imag;
imag_sum += w_real * d_imag + w_imag * d_real;
}
result[2*k] = real_sum;
result[2*k + 1] = imag_sum;
}
}
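For reference, a minimal usage sketch of the function above. The sampling rate, signal contents and frequency values here are arbitrary placeholders, PI is assumed to be defined (e.g. as M_PI), and the signal is laid out as interleaved real/imaginary pairs, as the code expects:
#include <cstdio>
#include <vector>

// compute_dft() and PI as defined above.

int main()
{
    const double sampling_rate = 1.0 / 4096.0;   // placeholder value
    std::vector<double> signal(2 * 4096, 0.0);   // 4096 complex samples, interleaved re/im
    signal[0] = 1.0;                             // a single impulse, just for illustration

    std::vector<double> frequency_band;          // the 128 bins of interest
    for (int k = 0; k < 128; ++k)
        frequency_band.push_back(static_cast<double>(k));

    std::vector<double> result;                  // interleaved re/im output, resized inside
    compute_dft(signal, frequency_band, result, sampling_rate);

    std::printf("X[0] = %f + %f*i\n", result[0], result[1]);
    return 0;
}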
Assuming you just want the first m output frequencies:
int
FUNCTION(gsl_dft_complex,transform) (const BASE data[],
const size_t stride,
const size_t n, // input size
const size_t m, // output size (m <= n)
BASE result[],
const gsl_fft_direction sign)
{
size_t i, j, exponent;
const double d_theta = 2.0 * ((int) sign) * M_PI / (double) n;
/* FIXME: check that m <= n and give error */
for (i = 0; i < m; i++) // for each of m output bins
{
ATOMIC sum_real = 0;
ATOMIC sum_imag = 0;
exponent = 0;
for (j = 0; j < n; j++) // for each of n input points
{
double theta = d_theta * (double) exponent;
/* sum = exp(i theta) * data[j] */
ATOMIC w_real = (ATOMIC) cos (theta);
ATOMIC w_imag = (ATOMIC) sin (theta);
ATOMIC data_real = REAL(data,stride,j);
ATOMIC data_imag = IMAG(data,stride,j);
sum_real += w_real * data_real - w_imag * data_imag;
sum_imag += w_real * data_imag + w_imag * data_real;
exponent = (exponent + i) % n;
}
REAL(result,stride,i) = sum_real;
IMAG(result,stride,i) = sum_imag;
}
return 0;
}
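Worth noting: computing only the m requested bins costs m * n complex multiply-adds instead of n * n, so for the sizes mentioned in the question (m = 128, n = 4096) that is about 0.5 million operations instead of roughly 16.8 million, a factor of n / m = 32 less work.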