Multiply two complex numbers on GPU using OpenCL - C++

I'm trying to write OpenCL-based code to calculate exp() of some complex numbers on the GPU, using the following kernel function:
#include <complex.h>

inline float complex exp(float complex z) {
    return (exp(__real__(z)) * (cos(__imag__(z)) + sin(__imag__(z))*I ));
}

__kernel void
calculate(__global float * c)
{
    int nIndex = get_global_id(0);
    float complex rays = 1.0f + 1.0f * I;
    float complex ans = exp(rays);
    c[nIndex] = __real__(ans * ans);
}
But I get the following error:
ASSERTION FAILED: I.hasStructRetAttr() == false
The * operator works fine with other complex numbers, but it produces an error when multiplying the output of exp(). I can also use the + and - operators on exp()'s output without any problem; only the * and / operators cause trouble.
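OpenCL C has no built-in complex type, and including <complex.h> in kernel code is generally not supported, so a common workaround is to store a complex value in a float2 and write the multiplication and exponential by hand. Below is a rough, untested sketch of that approach (the helper names cmul and cexp are made up here):

inline float2 cmul(float2 a, float2 b) {
    // (a.x + i*a.y) * (b.x + i*b.y)
    return (float2)(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x);
}

inline float2 cexp(float2 z) {
    // exp(x + i*y) = exp(x) * (cos(y) + i*sin(y))
    return exp(z.x) * (float2)(cos(z.y), sin(z.y));
}

__kernel void
calculate(__global float * c)
{
    int nIndex = get_global_id(0);
    float2 rays = (float2)(1.0f, 1.0f);   // 1 + 1i
    float2 ans = cexp(rays);
    c[nIndex] = cmul(ans, ans).x;         // real part of ans * ans
}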

Armadillo - no member named i in matrix expression

According to Armadillo docs:
.i()
Member function of any matrix expression
Provides an inverse of the matrix expression
...
However, when I try to compile this snippet:
#include <armadillo>
#include <iostream>

arma::sp_mat linReg(arma::sp_mat X, arma::sp_mat Y) {
    return (X.t() * X).i() * X.t() * Y;
}

int main() {
    arma::sp_mat X = arma::sprandu(1000, 10, 0.3);
    arma::sp_mat y = arma::sprandu(1000, 10, 0.3);
    std::cout << linReg(X,y).t() << std::endl;
}
I get the following error:
lreg.cpp: In function ‘arma::sp_mat linReg(arma::sp_mat, arma::sp_mat)’:
lreg.cpp:6:24: error: ‘arma::enable_if2<true, const arma::SpGlue<arma::SpOp<arma::SpMat<double>, arma::spop_htrans>, arma::SpMat<double>, arma::spglue_times> >::result’ {aka ‘const class arma::SpGlue<arma::SpOp<arma::SpMat<double>, arma::spop_htrans>, arma::SpMat<double>, arma::spglue_times>’} has no member named ‘i’
    6 | return (X.t() * X).i() * X.t() * Y;
      |
I already tried this with mat and it works fine. Any clue why it's not working with sparse matrices? And how can we calculate the inverse of a sparse matrix?
Taking the inverse of a sparse matrix is often not desired as you end up with a dense matrix. Often the explicit inverse is not required.
Instead of taking the inverse here, maybe treat the problem as solving a system of linear equations, and reformulate it using solve() or spsolve(). Below is an untested example demonstrating the general approach:
arma::mat linReg(const arma::sp_mat& X, const arma::sp_mat& Y) {
    arma::sp_mat A = X.t() * X;
    arma::mat B = arma::mat(X.t() * Y); // convert to dense matrix

    arma::mat result;
    bool ok = arma::spsolve(result, A, B);

    if(ok == false) {
        // handle failure here
    }

    return result;
}

Fast C++ sine and cosine alternatives for real-time signal processing

I need to implement a real-time synchronous quadrature detector. The detector receives a stream of input data (from a PCI ADC) and returns the amplitude of the harmonic w. Here is simplified C++ code:
double LowFreqFilter::process(double in)
{
    avg = avg * a + in * (1 - a);
    return avg;
}

class QuadroDetect
{
    double wt;
    const double wdt;

    LowFreqFilter lf1;
    LowFreqFilter lf2;

    QuadroDetect(const double w, const double dt) : wt(0), wdt(w * dt)
    {}

    inline double process(const double in)
    {
        double f1 = lf1.process(in * sin(wt));
        double f2 = lf2.process(in * cos(wt));
        double out = sqrt(f1 * f1 + f2 * f2);
        wt += wdt;
        return out;
    }
};
My problem is that calculating sin and cos takes too much time. I was advised to use a pre-calculated sin and cos table, but the available ADC sampling frequencies are not multiples of w, so there is a fragment-stitching problem. Are there any fast alternatives for sin and cos calculations? I would be grateful for any advice on how to improve the performance of this code.
UPD
Unfortunately, I had an error in the code: with the filtering calls removed, it lost its meaning. Thanks, Eric Postpischil.
I know a solution that may suit you. Recall the school formulas for the sine and cosine of a sum of angles:
sin(a + b) = sin(a) * cos(b) + cos(a) * sin(b)
cos(a + b) = cos(a) * cos(b) - sin(a) * sin(b)
Suppose that wdt is a small increment of the angle wt; then we get recursive formulas for computing sin and cos at the next time step:
sin(wt + wdt) = sin(wt) * cos(wdt) + cos(wt) * sin(wdt)
cos(wt + wdt) = cos(wt) * cos(wdt) - sin(wt) * sin(wdt)
We need to calculate the sin(wdt) and cos(wdt) values only once. All other computations need only addition and multiplication operations. The recursion can be restarted from any moment in time, so from time to time we can replace the running values with exactly calculated ones to avoid unbounded error accumulation.
Here is the final code:
class QuadroDetect
{
    const double sinwdt;
    const double coswdt;
    const double wdt;

    double sinwt = 0;
    double coswt = 1;
    double wt = 0;

    QuadroDetect(double w, double dt) :
        sinwdt(sin(w * dt)),
        coswdt(cos(w * dt)),
        wdt(w * dt)
    {}

    inline double process(const double in)
    {
        double f1 = in * sinwt;
        double f2 = in * coswt;
        double out = sqrt(f1 * f1 + f2 * f2);

        double tmp = sinwt;
        sinwt = sinwt * coswdt + coswt * sinwdt;
        coswt = coswt * coswdt - tmp * sinwdt;

        // Periodically recalculate sinwt and coswt to avoid unbounded error accumulation
        if (wt > 2 * M_PI)
        {
            wt -= 2 * M_PI;
            sinwt = sin(wt);
            coswt = cos(wt);
        }

        wt += wdt;
        return out;
    }
};
Please note that such recursive calculation gives less accurate results than calling sin(wt) and cos(wt) directly, but I have used it and it worked well.
If you can use std::complex, the implementation becomes much simpler. Technically it is the same solution as Dmytro Dadyka's, since complex numbers work this way. If the optimiser does its job well, it should run in about the same time.
class QuadroDetect
{
public:
    std::complex<double> wt;
    std::complex<double> wdt;
    LowFreqFilter lf1;
    LowFreqFilter lf2;

    QuadroDetect(const double w, const double dt)
        : wt(1.0, 0.0)
        , wdt(std::polar(1.0, w * dt))
    {
    }

    inline double process(const double in)
    {
        auto f = in * wt;
        f.imag(lf1.process(f.imag()));
        f.real(lf2.process(f.real()));

        wt *= wdt;
        return std::abs(f);
    }
};
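For completeness, here is a rough, untested usage sketch of the std::complex version. The filter coefficient, the 48 kHz sampling rate and the 1 kHz test tone are made-up values, and the LowFreqFilter stand-in simply mirrors the one-pole filter from the question; M_PI is assumed to be available, as in the earlier answer.

#include <cmath>
#include <complex>
#include <cstdio>

// Stand-in for the question's LowFreqFilter (avg and a are assumed members).
struct LowFreqFilter {
    double avg = 0.0;
    double a = 0.99;
    double process(double in) { avg = avg * a + in * (1 - a); return avg; }
};

struct QuadroDetect {
    std::complex<double> wt;
    std::complex<double> wdt;
    LowFreqFilter lf1, lf2;

    QuadroDetect(double w, double dt) : wt(1.0, 0.0), wdt(std::polar(1.0, w * dt)) {}

    double process(double in) {
        auto f = in * wt;
        f.imag(lf1.process(f.imag()));
        f.real(lf2.process(f.real()));
        wt *= wdt;
        return std::abs(f);
    }
};

int main() {
    const double fs = 48000.0, freq = 1000.0, amplitude = 0.5;
    QuadroDetect detector(2 * M_PI * freq, 1.0 / fs);

    double out = 0.0;
    for (int i = 0; i < 48000; ++i)   // one second of a pure sine input
        out = detector.process(amplitude * std::sin(2 * M_PI * freq * i / fs));

    // For a pure sine, the detector settles near amplitude / 2 = 0.25.
    std::printf("detected level: %f\n", out);
}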

CUDA Math API: unable to use atan2 function [duplicate]

I'm new to CUDA and cannot understand what I'm doing wrong.
I'm trying to calculate the distance between objects (each object has an id, an x coordinate and a y coordinate stored in arrays) in order to find the neighbours of each object:
__global__
void dist(int *id_d, int *x_d, int *y_d,
          int *dist_dev, int dimBlock, int i)
{
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    while(idx < dimBlock){
        int i;
        for(i= 0; i< dimBlock; i++){
            if (idx == i)continue;
            dist_dev[idx] = pow(x_d[idx] - x_d[i], 2) + pow(y_d[idx] - y_d[i], 2); // error here
        }
    }
}
Is pow not defined in kernel code?
Your problem is that while pow is defined in the CUDA math API (see here), it is not template specialised for integer arguments, i.e. there is no version like this:
__device__ int pow(int x, int y)
This is why you are getting an error. You will need to explicitly cast the base argument to a floating point type like this:
dist_dev[idx] = pow((double)(x_d[idx] - x_d[i]), 2.0) +
                pow((double)(y_d[idx] - y_d[i]), 2.0);
Having said that, using double-precision floating-point exponentiation for an integer square in your example will be poor from an efficiency point of view. It would be preferable to perform the calculation using integer multiplication instead:
int dx = x_d[idx] - x_d[i];
int dy = y_d[idx] - y_d[i];
dist_dev[idx] = (dx * dx) + (dy * dy);

Improving computation speed for sine/cosine and large arrays

For signal processing I need to compute over relatively large C arrays, as shown in the code below. This works so far; unfortunately, the implementation is slow. The size of "calibdata" is around 150k, and the sum needs to be calculated for different frequencies/phases. Is there a way to improve the speed significantly? Doing the same with logical indexing in MATLAB is much faster.
What I have tried already:
using a Taylor approximation of sine: no significant improvement;
using std::vector: also no significant improvement.
Code:
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier){
    double result = 0;
    for (int i = 0; i < size; i++)
        result += calibdata[i] * cos((2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180) - (PI / 2)));
    result = fabs(result / size);
    return result;
}
Best regards,
Thomas
When optimizing code for speed, step 1 is to enable compiler optimizations. I hope you've done that already.
Step 2 is to profile the code and see exactly how the time is being spent. Without profiling, you're just guessing, and you could end up trying to optimize the wrong thing.
For example, your guess seems to be that the cos function is the bottleneck. But the other possibility is that the calculation of the angle is the bottleneck. Here's how I would refactor the code to reduce the time spent calculating the angle.
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier)
{
    double result = 0;
    double angle = phase * (PI / 180) - (PI / 2);
    double delta = 2 * PI * freqscale[currentcarrier] / fs;
    for (int i = 0; i < size; i++)
    {
        result += calibdata[i] * cos(angle);
        angle += delta;
    }
    return fabs(result / size);
}
Okay, I'm probably going to get flogged for this answer, but I would use the GPU for this. Because your array doesn't appear to be self-referential, the best speedup you're going to get for large arrays is through parallelization... by far. I don't use MATLAB, but I just did a quick search for GPU utilization on the MathWorks site:
http://www.mathworks.com/company/newsletters/articles/gpu-programming-in-matlab.html?requestedDomain=www.mathworks.com
Outside of MATLAB you could use OpenCL or CUDA yourself.
Your enemies in execution time are:
Division
Function calls (including implicit ones in loops)
Accessing data from different areas
Executing dissimilar instructions
You should research data-driven programming and using the data cache effectively.
Division
Whether it is done with hardware or software support, division takes a long time by its very nature. Eliminate it where possible by changing the numeric base or by factoring it out of the loop.
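For instance, in the question's loop the division by fs (and the constant phase offset) can be factored out and computed once; a small sketch of the idea:

// Sketch: one division up front instead of one division per iteration.
const double scale  = 2 * PI * freqscale[currentcarrier] / fs;
const double offset = phase * (PI / 180) - (PI / 2);
for (int i = 0; i < size; i++)
    result += calibdata[i] * cos(scale * i + offset);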
Function Calls
The most efficient method of execution is sequential; processors are optimized for this. A branch may require the processor to perform some additional calculation (branch prediction) or to reload the instruction cache / pipeline, a waste of time that could be spent executing data instructions.
The optimization for this is to use techniques like loop unrolling and inlining of small functions. Also reduce the quantity of branches by simplifying expressions and using Boolean algebra.
Accessing data from different areas
Modern processors are optimized to operate on local data (data in one area). One example is loading an internal cache with data. Specifically, loading a cache line with data. For example, if the data from your arrays is in one location and the cosine data in another, this may cause the data cache to be reloaded, again wasting time.
A better solution is to place all data contiguously or to contiguously access all the data. Rather than making many discontiguous accesses to the cosine table, look up a batch of cosine values sequentially (without any other data accesses between).
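As a rough sketch of that idea applied to the question's loop (phase_func_batched, angle_start and angle_delta are invented names here, and whether the two-pass version actually wins depends on the cache, so profile it):

#include <cmath>
#include <vector>

// Sketch: two contiguous passes instead of mixing data streams in one loop.
double phase_func_batched(const double* calibdata, long size,
                          double angle_start, double angle_delta)
{
    // pass 1: touch only the cosine data, writing it contiguously
    std::vector<double> cosines(size);
    for (long i = 0; i < size; ++i)
        cosines[i] = std::cos(angle_start + i * angle_delta);

    // pass 2: stream through calibdata and cosines, both contiguous
    double result = 0.0;
    for (long i = 0; i < size; ++i)
        result += calibdata[i] * cosines[i];
    return std::fabs(result / size);
}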
Dissimilar Instructions
Modern processors are more efficient at processing a batch of similar instructions. For example the pattern load, add, store is more efficient for blocks when all the loading is performed, then all adding, then all storing.
Summary
Here's an example:
register double result = 0.0;
register unsigned int i = 0U;

for (i = 0; i < size; i += 2)
{
    register double cos_angle1 = /* ... */;
    register double cos_angle2 = /* ... */;
    result += calibdata[i + 0] * cos_angle1;
    result += calibdata[i + 1] * cos_angle2;
}
The above loop is unrolled and like operations are performed in groups.
Although the keyword register may be deprecated, it is a suggestion to the compiler to use dedicated registers (if possible).
You can try to use the definition of cosine based on the complex exponential:
cos(x) = ( exp(j*x) + exp(-j*x) ) / 2
where j^2 = -1.
Store exp((2 * PI*freqscale[currentcarrier] / fs)*j) and exp(phase*j). Evaluating cos(...) then reduces to a couple of products and additions in the for loops, and sin(), cos() and exp() are only called a couple of times.
Here goes the implementation:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <complex.h>
#include <time.h>

#define PI 3.141592653589

typedef struct cos_plan{
    double complex* expo;
    int size;
}cos_plan;

double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier){
    double result=0; //initialization
    for (int i = 0; i < size; i++){
        result += calibdata[i] * cos ( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.) - (PI / 2.)) );
        //printf("i %d cos %g\n",i,cos ( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.) - (PI / 2.)) ));
    }
    result = fabs(result / size);
    return result;
}

double phase_func2(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier, cos_plan* plan){
    //first, let's compute the exponentials:
    //double complex phaseexp=cos(phase*(PI / 180.) - (PI / 2.))+sin(phase*(PI / 180.) - (PI / 2.))*I;
    //double complex phaseexpm=conj(phaseexp);
    double phasesin=sin(phase*(PI / 180.) - (PI / 2.));
    double phasecos=cos(phase*(PI / 180.) - (PI / 2.));
    if (plan->size<size){
        double complex *tmp=realloc(plan->expo,size*sizeof(double complex));
        if(tmp==NULL){fprintf(stderr,"realloc failed\n");exit(1);}
        plan->expo=tmp;
        plan->size=size;
    }
    plan->expo[0]=1;
    //plan->expo[1]=exp(2 *I* PI*freqscale[currentcarrier]/fs);
    plan->expo[1]=cos(2 * PI*freqscale[currentcarrier]/fs)+sin(2 * PI*freqscale[currentcarrier]/fs)*I;
    //printf("%g %g\n",creall(plan->expo[1]),cimagl(plan->expo[1]));
    for(int i=2;i<size;i++){
        if(i%2==0){
            plan->expo[i]=plan->expo[i/2]*plan->expo[i/2];
        }else{
            plan->expo[i]=plan->expo[i/2]*plan->expo[i/2+1];
        }
    }
    //computing the result
    double result=0; //initialization
    for(int i=0;i<size;i++){
        //double coss=0.5*creall(plan->expo[i]*phaseexp+conj(plan->expo[i])*phaseexpm);
        double coss=creall(plan->expo[i])*phasecos-cimagl(plan->expo[i])*phasesin;
        //printf("i %d cos %g\n",i,coss);
        result+=calibdata[i] *coss;
    }
    result = fabs(result / size);
    return result;
}

int main(){
    //the parameters
    long n=100000000;
    double* calibdata=malloc(n*sizeof(double));
    if(calibdata==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
    int freqnb=42;
    double* freqscale=malloc(freqnb*sizeof(double));
    if(freqscale==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
    for (int i = 0; i < freqnb; i++){
        freqscale[i]=i*i*0.007+i;
    }
    double fs=n;
    double phase=0.05;
    //populate calibdata
    for (int i = 0; i < n; i++){
        calibdata[i]=i/((double)n);
        calibdata[i]=calibdata[i]*calibdata[i]-calibdata[i]+0.007/(calibdata[i]+3.0);
    }
    //call to sample code
    clock_t t;
    t = clock();
    double res=phase_func(calibdata,n, freqscale, fs, phase, 13);
    t = clock() - t;
    printf("first call got %g in %g seconds.\n",res,((float)t)/CLOCKS_PER_SEC);
    //initialize
    cos_plan plan;
    plan.expo=malloc(n*sizeof(double complex));
    plan.size=n;
    t = clock();
    res=phase_func2(calibdata,n, freqscale, fs, phase, 13,&plan);
    t = clock() - t;
    printf("second call got %g in %g seconds.\n",res,((float)t)/CLOCKS_PER_SEC);
    //cleaning
    free(plan.expo);
    free(calibdata);
    free(freqscale);
    return 0;
}
Compile with gcc main.c -o main -std=c99 -lm -Wall -O3. Using the code you provided, it takes 8 seconds with size=100000000 on my computer, while the proposed solution takes 1.5 seconds... It is not so impressive, but it is not negligible.
The solution presented does not involve any call to cos or sin in the for loops: there are only multiplications and additions. The bottleneck is either the memory bandwidth or the tests and memory accesses in the exponentiation by squaring (most likely the first issue, since I had to use an additional array of complex values).
For complex numbers in C, see:
How to work with complex numbers in C?
Computing e^(-j) in C
If the problem is memory bandwidth, then parallelism is required... and directly computing cos would be easier. Additional simplifications could have been performed if freqscale[currentcarrier] / fs were an integer. Your problem is really close to the computation of a Discrete Cosine Transform; the present trick is close to the Discrete Fourier Transform, and the FFTW library is really good at computing these transforms.
Notice that the present code can produce inaccurate results due to loss of significance: result can be much larger than cos(...)*calibdata[i] when size is large. Using partial sums can resolve the issue.
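As an illustration of one way to implement such a scheme (a sketch, not part of the code above), compensated (Kahan) summation carries the rounding error along explicitly; splitting the sum into blocks and adding the block totals at the end also works:

/* Sketch: Kahan (compensated) summation to limit loss of significance
   when many small terms are accumulated into a large running sum. */
double kahan_sum(const double* terms, long n){
    double sum = 0.0, comp = 0.0;   /* comp holds the low-order bits lost so far */
    for (long i = 0; i < n; i++){
        double y = terms[i] - comp;
        double t = sum + y;
        comp = (t - sum) - y;       /* what was rounded away in this addition */
        sum = t;
    }
    return sum;
}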
Use a simple trig identity to eliminate the - (PI / 2). This is also more accurate than performing the subtraction, which uses a machine approximation of π. This matters when values are near π/2.
cos(x - π/2) == sin(x)
(The code below accumulates with -=, so the sign flips, which the final fabs() absorbs.)
Use of const and restrict: Good compilers can perform more optimizations with this knowledge. (See also #user3528438)
// double phase_func(double* calibdata, long size,
// double* freqscale, double fs, double phase, int currentcarrier) {
double phase_func(const double* restrict calibdata, long size,
const double* restrict freqscale, double fs, double phase, int currentcarrier) {
Some platforms perform faster calculations with float vs double with a tolerable loss of precision. YMMV. Profile code both ways.
// result += calibdata[i] * cos(...
result += calibdata[i] * cosf(...
Minimize recalculations.
double angle_delta = ...;
double angle_current = ...;
for (int i = 0; i < size; i++) {
    result += calibdata[i] * cos(angle_current);
    angle_current += angle_delta;
}
Unclear why the code uses long size and int currentcarrier. I'd expect the same type, and to use type size_t. This is idiomatic for array indexing. #Daniel Jour
Reversing loops can allow a compare against 0 rather than against a variable. Sometimes a modest performance gain.
Ensure compiler optimizations are well enabled.
All together
double phase_func2(const double* restrict calibdata, size_t size,
                   const double* restrict freqscale, double fs, double phase,
                   size_t currentcarrier) {
    double result = 0.0;
    double angle_delta = 2.0 * PI * freqscale[currentcarrier] / fs;
    double angle_current = angle_delta * (size - 1) + phase * (PI / 180);
    size_t i = size;
    while (i) {
        result -= calibdata[--i] * sinf(angle_current);
        angle_current -= angle_delta;
    }
    result = fabs(result / size);
    return result;
}
To leverage the cores you have without resorting to the GPU, use OpenMP. Testing with VS2015, the loop invariants are lifted out of the loop by the optimizer. AVX2 and OpenMP were enabled.
double phase_func3(double* calibdata, const int size, const double* freqscale,
                   const double fs, const double phase, const size_t currentcarrier)
{
    double result{};
    constexpr double PI = 3.141592653589;

    #pragma omp parallel
    #pragma omp for reduction(+: result)
    for (int i = 0; i < size; ++i) {
        result += calibdata[i] *
            cos( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.0) - (PI / 2.0)));
    }
    result = fabs(result / size);
    return result;
}
The original version with AVX enabled took: ~1.4 seconds
and adding OpenMP brought it down to: ~0.51 seconds.
Pretty nice return for two pragmas and a compiler switch.

Smoothstep function

I'm trying to get some results to plot on a graph by using the smoothstep function provided by AMD, which I found on the Wikipedia page Smoothstep, using this:
A C/C++ example implementation provided by AMD[4] follows.
float smoothstep(float edge0, float edge1, float x)
{
    // Scale, bias and saturate x to 0..1 range
    x = clamp((x - edge0) / (edge1 - edge0), 0.0, 1.0);
    // Evaluate polynomial
    return x*x*(3 - 2 * x);
}
The problem is that I am not able to use this function because clamp is not available.
I have included the following headers:
#include <math.h>
#include <cmath>
#include <algorithm>
Yet there is no clamp function defined.
My maths skills are not the best, but is there a way to implement the smoothstep function, just like there is a way to implement a LERP function?
float linearIntepolate(float currentLocation, float Goal, float time){
    return (1 - time) * currentLocation + time * Goal;
}
Perhaps it was just the namespace std that was missing; here is my code, which compiles:
#include <algorithm>

float smoothstep(float edge0, float edge1, float x) {
    // Scale, bias and saturate x to 0..1 range
    x = std::clamp((x - edge0) / (edge1 - edge0), 0.0f, 1.0f);
    // Evaluate polynomial
    return x * x * (3 - 2 * x);
}
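Note that std::clamp lives in <algorithm> and requires C++17. On an older standard, a minimal stand-in built from std::min and std::max works as well (a sketch, not part of the answer above):

#include <algorithm>  // std::min, std::max

// Minimal clamp replacement for pre-C++17 compilers.
inline float clampf(float v, float lo, float hi)
{
    return std::max(lo, std::min(v, hi));
}

float smoothstep(float edge0, float edge1, float x)
{
    // Scale, bias and saturate x to 0..1 range
    x = clampf((x - edge0) / (edge1 - edge0), 0.0f, 1.0f);
    // Evaluate polynomial
    return x * x * (3.0f - 2.0f * x);
}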