ArrayFire: Translate a batch of images at the same time - c++

I'm using ArrayFire and need to translate many images at once and store the results in a new array. The images are contained in a single array of size (w, h, c, b), and the amount by which each image needs to be translated is stored in a (2, 1, 1, b) array.
The sequential implementation is as follows:
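for (int i = 0; i < b; i++) {
    // read the per-image offsets back to the host
    float x = coords(0, 0, 0, i).scalar<float>();
    float y = coords(1, 0, 0, i).scalar<float>();
    // t_imgs is an existing array with the same dimensions as imgs
    t_imgs(af::span, af::span, af::span, i) =
        af::translate(imgs(af::span, af::span, af::span, i), x, y);
}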
How could I parallelize it? Translate doesn't accept arrays as arguments, so I can't do something like this:
gfor(af::seq i, b) {
    af::array x = coords(0, 0, 0, i);
    af::array y = coords(1, 0, 0, i);
    t_imgs(af::span, af::span, af::span, i) =
        af::translate(imgs(af::span, af::span, af::span, i), x, y);
}


include c++ libraries into openCL kernel?

Is it possible to use C++ libraries in an OpenCL kernel?
I'm trying to implement a kernel that performs the tasks seen in the following code. There are two things that could make this really difficult: 1. The fact that I'm using the GLM math library, and 2. That I'm using structs (land_map_t).
For example, if I wanted to use a kernel to loop through a large 3-dimensional array, is it possible to include the GLM math library inside of the kernel and utilize its functionalities such as glm::simplex? I've heard that modern C++ functionalities such as classes aren't compatible with kernels.
And if that's not possible, how would one pass a struct to the kernel? Should I define the same struct in both the kernel and my implementation? All the struct contains is a 3-dimensional array, so I could easily turn it into a plain C++ type if necessary.
land_map_t * Chunk::terrain_gen(glm::ivec3 pos)
{
    float frequency = 500;
    float noise_1;
    land_map_t* landmap = new land_map_t;
    for (int x = 0; x < chunkSize + 2; x++) {
        for (int y = 0; y < chunkSize + 2; y++) {
            for (int z = 0; z < chunkSize + 2; z++) {
                noise_1 = (glm::simplex(
                    glm::vec2(glm::ivec2(x, z) + glm::ivec2(pos.x, pos.z)) / frequency));
                landmap->i[x][y][z] = BLOCK::AIR;
                if (pow(noise_1, 2) * 40.0 + 6.0 > (y + pos.y))
                    landmap->i[x][y][z] = BLOCK::DIRT;
            }
        }
    }
    return landmap;
}
You cannot include C++ libraries in OpenCL C. OpenCL C is based on C99, not C++: there are no classes, and the buffers passed to a kernel are flat 1D arrays. Dynamic memory allocation with the new operator is also not possible within a kernel.
The best solution is to split the class components into separate arrays and use linear indexing within each array. In a rectangular box of size (Lx,Ly,Lz), the position (x, y, z) maps to the linear index n=x+(y+z*Ly)*Lx, and the index maps back via (x, y, z)=(n%(Lx*Ly)%Lx, n%(Lx*Ly)/Lx, n/(Lx*Ly)).
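As a host-side check of this index mapping, here is a minimal C++ sketch. The names Coords, to_linear, to_coords and round_trip_ok are hypothetical helpers introduced only for illustration; they are not part of OpenCL or GLM.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: a position inside a box of size (Lx, Ly, Lz).
struct Coords { std::uint32_t x, y, z; };

// (x, y, z) -> n, with x varying fastest: n = x + (y + z*Ly)*Lx.
inline std::uint32_t to_linear(Coords c, std::uint32_t Lx, std::uint32_t Ly) {
    return c.x + (c.y + c.z * Ly) * Lx;
}

// n -> (x, y, z), the inverse of to_linear.
inline Coords to_coords(std::uint32_t n, std::uint32_t Lx, std::uint32_t Ly) {
    const std::uint32_t t = n % (Lx * Ly); // offset inside one z-slice
    return { t % Lx, t / Lx, n / (Lx * Ly) };
}

// Returns true when every linear index in the box survives the round trip.
inline bool round_trip_ok(std::uint32_t Lx, std::uint32_t Ly, std::uint32_t Lz) {
    for (std::uint32_t n = 0; n < Lx * Ly * Lz; ++n) {
        if (to_linear(to_coords(n, Lx, Ly), Lx, Ly) != n) return false;
    }
    return true;
}
```

Running the same check on the host for a box of (chunkSize+2) cells per side is a cheap way to validate the index arithmetic before moving it into a kernel.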
Your code in OpenCL could look like this:
kernel void terrain_gen(global uchar* landmap_flags, global float3* pos)
{
    const uint n = get_global_id(0);
    const uint x = n%((chunkSize+2)*(chunkSize+2))%(chunkSize+2);
    const uint y = n%((chunkSize+2)*(chunkSize+2))/(chunkSize+2);
    const uint z = n/((chunkSize+2)*(chunkSize+2));
    // paste the SimplexNoise struct definition here
    SimplexNoise simplexnoise;
    simplexnoise.initialize(); // fill perm and perm12 before calling noise()
    const float frequency = 500;
    // same argument as glm::simplex in the question: (cell + chunk position) / frequency
    const float noise_1 = simplexnoise.noise((x+pos[n].x)/frequency, (z+pos[n].z)/frequency);
    landmap_flags[n] = (noise_1*noise_1*40.0f+6.0f>(y+pos[n].y)) ? BLOCK_DIRT : BLOCK_AIR;
}
Regarding GLM, you have to port over the required functions into OpenCL C. For simplex noise, you can use something like this:
struct SimplexNoise { // simplex noise in 2D
const float3 grad3[12] = {
(float3)( 1, 1, 0), (float3)(-1, 1, 0), (float3)( 1,-1, 0), (float3)(-1,-1, 0),
(float3)( 1, 0, 1), (float3)(-1, 0, 1), (float3)( 1, 0,-1), (float3)(-1, 0,-1),
(float3)( 0, 1, 1), (float3)( 0,-1, 1), (float3)( 0, 1,-1), (float3)( 0,-1,-1)
};
const uchar p[256] = {
151,160,137, 91, 90, 15,131, 13,201, 95, 96, 53,194,233, 7,225,140, 36,103, 30, 69,142, 8, 99, 37,240, 21, 10, 23,190, 6,148,
247,120,234, 75, 0, 26,197, 62, 94,252,219,203,117, 35, 11, 32, 57,177, 33, 88,237,149, 56, 87,174, 20,125,136,171,168, 68,175,
74,165, 71,134,139, 48, 27,166, 77,146,158,231, 83,111,229,122, 60,211,133,230,220,105, 92, 41, 55, 46,245, 40,244,102,143, 54,
65, 25, 63,161, 1,216, 80, 73,209, 76,132,187,208, 89, 18,169,200,196,135,130,116,188,159, 86,164,100,109,198,173,186, 3, 64,
52,217,226,250,124,123, 5,202, 38,147,118,126,255, 82, 85,212,207,206, 59,227, 47, 16, 58, 17,182,189, 28, 42,223,183,170,213,
119,248,152, 2, 44,154,163, 70,221,153,101,155,167, 43,172, 9,129, 22, 39,253, 19, 98,108,110,79,113,224,232,178,185, 112,104,
218,246, 97,228,251, 34,242,193,238,210,144, 12,191,179,162,241, 81, 51,145,235,249, 14,239,107, 49,192,214, 31,181,199,106,157,
184, 84,204,176,115,121, 50, 45,127, 4,150,254,138,236,205, 93,222,114, 67, 29, 24, 72,243,141,128,195, 78, 66,215, 61,156,180
};
const float F2=0.5f*(sqrt(3.0f)-1.0f), G2=(3.0f-sqrt(3.0f))/6.0f; // skewing and unskewing factors for 2, 3, and 4 dimensions
const float F3=1.0f/3.0f, G3=1.0f/6.0f;
const float F4=(sqrt(5.0f)-1.0f)*0.25f, G4=(5.0f-sqrt(5.0f))*0.05f;
uchar perm[512]; // to remove the need for index wrapping, double the permutation table length
uchar perm12[512];
//int floor(const float x) const { return (int)x-(x<=0.0f); }
float dot(const float3 g, const float x, const float y) const { return g.x*x+g.y*y; }
void initialize() {
for(int i=0; i<512; i++) {
perm[i] = p[i&255];
perm12[i] = (uchar)(perm[i]%12);
}
}
float noise(float x, float y) const { // 2D simplex noise
float n0, n1, n2; // noise contributions from the three corners, skew the input space to determine simplex cell
float s = (x+y)*F2; // hairy factor for 2D
int i=floor(x+s), j=floor(y+s);
float t = (i+j)*G2;
float X0=i-t, Y0=j-t; // unskew the cell origin back to (x,y) space
float x0=x-X0, y0=y-Y0; // the x,y distances from the cell origin
// for the 2D case, the simplex shape is an equilateral triangle, determine simplex
int i1, j1; // offsets for second (middle) corner of simplex in (i,j) coords
if(x0>y0) { i1=1; j1=0; } // lower triangle, XY order: (0,0)->(1,0)->(1,1)
else /**/ { i1=0; j1=1; } // upper triangle, YX order: (0,0)->(0,1)->(1,1)
float x1=x0- i1+ G2, y1=y0- j1+ G2; // offsets for middle corner in (x,y) unskewed coords
float x2=x0-1.0f+2.0f*G2, y2=y0-1.0f+2.0f*G2; // offsets for last corner in (x,y) unskewed coords
int ii=i&255, jj=j&255; // work out the hashed gradient indices of the three simplex corners
int gi0 = perm12[ii +perm[jj ]];
int gi1 = perm12[ii+i1+perm[jj+j1]];
int gi2 = perm12[ii+ 1+perm[jj+ 1]];
float t0 = 0.5f-x0*x0-y0*y0; // calculate the contribution from the three corners
if(t0<0) n0 = 0.0f; else { t0 *= t0; n0 = t0*t0*dot(grad3[gi0], x0, y0); } // (x,y) of grad3 used for 2D gradient
float t1 = 0.5f-x1*x1-y1*y1;
if(t1<0) n1 = 0.0f; else { t1 *= t1; n1 = t1*t1*dot(grad3[gi1], x1, y1); }
float t2 = 0.5f-x2*x2-y2*y2;
if(t2<0) n2 = 0.0f; else { t2 *= t2; n2 = t2*t2*dot(grad3[gi2], x2, y2); }
return 70.0f*(n0+n1+n2); // add contributions from each corner to get the final noise value, result is scaled to stay inside [-1,1]
}
};

Halide: Schedule multi-stage pipeline without inlining

I'm trying to write a modular multi-stage processing pipeline, but I'm having trouble scheduling it.
The code structure is as follows:
#include <halide/Halide.h>
Halide::Var x, y, c;
Halide::Func producer(Halide::Func in) {
    Halide::Func producer("producer");
    producer(x, y, c) = in(x, y, c);
    return producer;
}
Halide::Func rectification(Halide::Func in, const Halide::Image<float>& rectificationMapBuffer)
{
    // Fractional pixel positions according to rectification map
    Halide::Expr x_in_frac("x_in_frac");
    Halide::Expr y_in_frac("y_in_frac");
    x_in_frac = rectificationMapBuffer(x * 2 + 0, y);
    y_in_frac = rectificationMapBuffer(x * 2 + 1, y);
    // Cast the fractional positions down to integers. This makes it possible
    // to address the pixels surrounding the fractional position
    Halide::Expr x_in("x_in");
    Halide::Expr y_in("y_in");
    x_in = Halide::cast(Halide::Int(32), x_in_frac);
    y_in = Halide::cast(Halide::Int(32), y_in_frac);
    // Linearly interpolate pixel values
    // (interpolation weights assumed to be the fractional parts of the map coordinates)
    Halide::Func interpolate("interpolate");
    interpolate(x, y, c) =
        Halide::lerp(Halide::lerp(in(x_in + 0, y_in + 0, c), in(x_in + 1, y_in + 0, c),
                                  Halide::fract(x_in_frac)),
                     Halide::lerp(in(x_in + 0, y_in + 1, c), in(x_in + 1, y_in + 1, c),
                                  Halide::fract(x_in_frac)),
                     Halide::fract(y_in_frac));
    Halide::Func rectification("rectification");
    rectification(x, y, c) = Halide::cast(Halide::UInt(8), interpolate(x, y, c));
    return rectification;
}
Halide::Func schedule(const Halide::Image<uint8_t>& image, const Halide::Image<float>& rectificationMap) {
    Halide::Func clamped;
    clamped = Halide::BoundaryConditions::repeat_edge(image);
    Halide::Func producerFunc;
    producerFunc = producer(clamped);
    producerFunc.compute_root(); // assumed placement of the compute_root call discussed below
    Halide::Func consumerFunc;
    consumerFunc = rectification(producerFunc, rectificationMap);
    return consumerFunc;
}
int main(int argc, char *argv[])
{
    int width = 100;
    int height = 100;
    int channels = 3;
    Halide::Image<uint8_t> input(width, height, channels);
    Halide::Image<float> rectificationMap(width * 2, height);
    Halide::Buffer output(Halide::UInt(8), width, height, channels, 0);
    Halide::Func f = schedule(input, rectificationMap);
    return 0;
}
Now the problem is that compute_root statement. I get the error:
The pure definition of Function rectification calls function producer in an
unbounded way in dimension 0
When I remove the compute_root part, the producer function is inlined and I don't see any problems.
I tried adding .bound constraints in the schedule function, but that didn't help. Can someone help me figure out what this error means?

inverse fft of fft not returning expected data

I'm trying to make sure FFTW does what I think it should do, but am having problems. I'm using OpenCV's cv::Mat. I made a test program that, given a Mat f, computes ifft(fft(f)) and compares the result to f. I would expect the difference between the two to be negligible, but there's a strange pattern in the data.
In this case, f is initialized to be an 8x8 array of floats with positive values less than 1.
Here's my test program code:
Mat f = .. //populate f
if (f.type() != CV_32FC1)
    DLOG << "Bad f type";
const int y = f.rows;
const int x = f.cols;
double* input = fftw_alloc_real(y * 2*(x/2 + 1));
// forward fft
fftw_plan plan = fftw_plan_dft_r2c_2d(x, y, input, (fftw_complex*)input, FFTW_MEASURE);
// inverse fft
fftw_plan iplan = fftw_plan_dft_c2r_2d(x, y, (fftw_complex*)input, input, FFTW_MEASURE);
// populate fftw data from f
for (int yi = 0; yi < y; ++yi)
{
    const float* yptr = f.ptr<float>(yi);
    for (int xi = 0; xi < x; ++xi)
        input[yi*x + xi] = (double)yptr[xi];
}
// run the forward and inverse transforms to get ifft(fft(f))
fftw_execute(plan);
fftw_execute(iplan);
// put data into another cv::Mat for comparison
Mat check(y, x, f.type());
for (int yi = 0; yi < y; ++yi)
{
    float* yptr = check.ptr<float>(yi);
    for (int xi = 0; xi < x ; ++xi)
        yptr[xi] = (float)input[yi*x + xi];
}
DLOG << Util::summary(f, "f");
DLOG << f;
DLOG << Util::summary(check, "check");
DLOG << check;
Mat diff = f*x*y - check;
DLOG << Util::summary(diff, "diff");
DLOG << diff;
Where DLOG is my logger and Util::summary(cv::Mat m) just prints the passed string and the dimensions, channels, min, and max of the Mat.
Here's what the data looks like (output):
f: rows:8 cols:8 chans:1 min:0.00257996 max:0.4
[0.050668437, 0.04509116, 0.033668514, 0.10986148, 0.12855141, 0.048241843, 0.12613985, 0.09731093;
0.028602425, 0.0092236707, 0.037089188, 0.118964, 0.075040311, 0.40000001, 0.11959606, 0.071930833;
0.0025799556, 0.051522054, 0.22233701, 0.052993439, 0.032000393, 0.12673819, 0.015244827, 0.044803992;
0.13946071, 0.019708242, 0.0112687, 0.047459811, 0.019342113, 0.030085485, 0.018739942, 0.0098618753;
0.041809395, 0.029681522, 0.026837418, 0.16038358, 0.29034778, 0.17247421, 0.1789207, 0.042179305;
0.025630442, 0.017192598, 0.060540862, 0.1854037, 0.21287154, 0.04813192, 0.042614728, 0.034764063;
0.0030835248, 0.018511582, 0.0071733585, 0.017076733, 0.064545207, 0.0026390438, 0.088922881, 0.045725599;
0.12798512, 0.23215951, 0.027465452, 0.03174505, 0.04352935, 0.025079668, 0.044403922, 0.035459157]
check: rows:8 cols:8 chans:1 min:-3.26489 max:25.6
[3.24278, 2.8858342, 2.1547849, 7.0311346, 8.2272902, 3.0874779, 8.0729504, 6.2278996;
0.30818239, 0, 2.373708, 7.6136961, 4.8025799, 25.6, 7.6541481, 4.6035733;
0.16511716, 3.2974114, -3.2648909, 0, 2.0480251, 8.1112442, 0.97566891, 2.8674555;
8.9254856, 1.2613275, 0.72119683, 3.0374279, -0.32588482, 0, 1.1993563, 0.63116002;
2.6758013, 1.8996174, 1.7175947, 10.264549, 18.582258, 11.038349, 0.042666838, 0;
1.6403483, 1.1003263, 3.8746152, 11.865837, 13.623778, 3.0804429, 2.7273426, 2.2249;
0.44932228, 0, 0.45909494, 1.0929109, 4.1308932, 0.16889881, 5.6910644, 2.9264383;
8.1910477, 14.858209, -0.071794562, 0, 2.7858784, 1.6050987, 2.841851, 2.2693861]
diff: rows:8 cols:8 chans:1 min:-0.251977 max:17.4945
[0, 0, 0, 0, 0, 0, 0, 0;
1.5223728, 0.59031492, 0, 0, 0, 0, 0, 0;
0, 0, 17.494459, 3.3915801, 0, 0, 0, 0;
0, 0, 0, 0, 1.5637801, 1.9254711, 0, 0;
0, 0, 0, 0, 0, 0, 11.408258, 2.6994755;
0, 0, 0, 0, 0, 0, 0, 0;
-0.2519767, 1.1847413, 0, 0, 0, 0, 0, 0;
0, 0, 1.8295834, 2.0316832, 0, 0, 0, 0]
The difficult part for me is the nonzero entries in the diff matrix. I've accounted for the scaling FFTW does on the values and the padding needed to do an in-place fft on real data; what am I missing?
I find it surprising that the data could be off by a value of 17 (which is 66% of the max value), when there are so many zeros. Also, the data irregularities seem to form a diagonal pattern.
As you may have noticed when writing fftw_alloc_real(y * 2*(x/2 + 1));, FFTW needs extra space in the x direction to store complex data. In your case, as x=8, each row needs 2*(x/2+1)=10 reals. You should take this into account when you populate the input array and when you retrieve values from it.
You may change
input[yi*x + xi] = (double)yptr[xi];
to
int xfft=2*(x/2 + 1);
input[yi*xfft + xi] = (double)yptr[xi];
and
yptr[xi] = (float)input[yi*x + xi];
to
yptr[xi] = (float)input[yi*xfft + xi];
This should solve your problem, since the nonzero points in your diff correspond to the extra padding.
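The row-stride bookkeeping can be tried out without FFTW at all. The following is a minimal sketch in plain C++; padded_stride, pack_padded and unpack_padded are hypothetical helper names, and the buffers are std::vector stand-ins for the array returned by fftw_alloc_real.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row stride, in reals, of an in-place r2c/c2r buffer whose rows hold `cols` real samples.
inline std::size_t padded_stride(std::size_t cols) { return 2 * (cols / 2 + 1); }

// Copy a dense rows x cols image into a padded buffer; the padding stays at zero.
inline std::vector<double> pack_padded(const std::vector<float>& dense,
                                       std::size_t rows, std::size_t cols) {
    const std::size_t stride = padded_stride(cols);
    std::vector<double> padded(rows * stride, 0.0);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            padded[r * stride + c] = static_cast<double>(dense[r * cols + c]);
    return padded;
}

// Copy the real samples back out of a padded buffer, skipping the padding.
inline std::vector<float> unpack_padded(const std::vector<double>& padded,
                                        std::size_t rows, std::size_t cols) {
    const std::size_t stride = padded_stride(cols);
    std::vector<float> dense(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dense[r * cols + c] = static_cast<float>(padded[r * stride + c]);
    return dense;
}
```

With cols = 8 the stride is 10, so row 1 starts at offset 10 in the padded buffer rather than at offset 8, which is why indexing with yi*x shifts every row after the first.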

Segmentation Fault when loading image pixel by pixel using CImg

I am trying to compute the mean of an image by loading pixel by pixel.
My image has 6 channels, its height and width are 512, and its depth is 1.
It is stored at the first position of a CImgList containing 2 elements.
My code is as follows:
int main(){
    float mean = 0;
    CImgList<float> img;
    int c, x, y;
    for(c=0; c<6; ++c)
        for(x=0; x<512; ++x)
            for(y=0; y<512; ++y){
                img.load_cimg("test_images/Simul_PolSAR.cimg", 0, 0, x, y, 0, c, x, y, 0, c);
                mean += img(0)(0,0,0,0);
            }
    mean = mean/(6*512*512);
}
When I run it, everything works fine until the value of "c" changes from 0 to 1. Then, the line accessing img(0)(0,0,0,0) makes the program crash with a segmentation fault error.
Also if I check:
img.load_cimg("image.cimg", 0, 0, 0, 0, 0, 1, 0, 0, 0, 1);
The result is:
CImg<float>: this = 0x14330f8, size = (0,0,0,0) [0 b], data = (float*)(nil) (non-shared) = [ ].
I am very sure about the correctness of the code, and the integrity of the image (I tried with different ones). Any idea why it is happening?

Tensor Product Algorithm Optimization

double data[12] = {1, z, z^2, z^3, 1, y, y^2, y^3, 1, x, x^2, x^3};
double result[64] = {1, z, z^2, z^3, y, zy, (z^2)y, (z^3)y, y^2, z(y^2), (z^2)(y^2), (z^3)(y^2), y^3, z(y^3), (z^2)(y^3), (z^3)(y^3), x, zx, (z^2)x, (z^3)x, yx, zyx, (z^2)yx, (z^3)yx, (y^2)x, z(y^2)x, (z^2)(y^2)x, (z^3)(y^2)x, (y^3)x, z(y^3)x, (z^2)(y^3)x, (z^3)(y^3)x, x^2, z(x^2), (z^2)(x^2), (z^3)(x^2), y(x^2), zy(x^2), (z^2)y(x^2), (z^3)y(x^2), (y^2)(x^2), z(y^2)(x^2), (z^2)(y^2)(x^2), (z^3)(y^2)(x^2), (y^3)(x^2), z(y^3)(x^2), (z^2)(y^3)(x^2), (z^3)(y^3)(x^2), x^3, z(x^3), (z^2)(x^3), (z^3)(x^3), y(x^3), zy(x^3), (z^2)y(x^3), (z^3)y(x^3), (y^2)(x^3), z(y^2)(x^3), (z^2)(y^2)(x^3), (z^3)(y^2)(x^3), (y^3)(x^3), z(y^3)(x^3), (z^2)(y^3)(x^3), (z^3)(y^3)(x^3)};
What is the fastest way (fewest operations) to produce result given data? Assume that data is variable in size, but its size is always a multiple of 4 (e.g., 4, 8, 12, etc.).
No Boost. I am trying to keep my dependencies small. STL Algorithms are ok.
HINT: the result array size should always be 4^(data size / 4) (e.g., 4, 16, 64, etc.).
BONUS: If you can compute result just given x, y, z
Additional examples:
double data[4] = {1, z, z^2, z^3};
double result[4] = {1, z, z^2, z^3};
double data[8] = {1, z, z^2, z^3, 1, y, y^2, y^3};
double result[16] = { ... };
I chose the accepted answer after running a benchmark: basically, the top two codes were run and the one with the smallest execution time won.
void Tensor(std::vector<double>& result, double x, double y, double z) {
    result.resize(64); //almost noop if already right size
    double tz = z*z;
    double ty = y*y;
    double tx = x*x;
    std::array<double, 12> data = {0, 0, tz, tz*z, 1, y, ty, ty*y, 1, x, tx, tx*x};
    std::vector<double>::iterator iter = result.begin();
    int yi;
    double xy;
    for(int xi=0; xi<4; ++xi) {
        for(yi=0; yi<4; ++yi) {
            xy = data[4+yi]*data[8+xi];
            *iter = xy; //a smart compiler can do these four in parallel
            *(++iter) = z*xy;
            *(++iter) = data[2]*xy;
            *(++iter) = data[3]*xy;
            ++iter; //workaround for speed!
        }
    }
}
There's probably at least one bug in here somewhere, but it should be fast, with no dependencies (outside of std::vector/std::array), and it just takes x, y, z. I avoided recursion, though, so it only works for 3 in/64 out. The concept can be applied to any number of parameters; you just have to instantiate it yourself.
A good compiler will autovectorize this. I guess none of my compilers are good:
void tensor(const double *restrict data,
            int dimensions,
            double *restrict result) {
    result[0] = 1.0;
    for (int i = 0; i < dimensions; i++) {
        for (int j = (1 << (i * 2)) - 1; j > -1; j--) {
            double alpha = result[j];
            double *restrict dst = &result[j * 4];
            const double *restrict src = &data[(dimensions - 1 - i) * 4];
            for (int k = 0; k < 4; k++) dst[k] = alpha * src[k];
        }
    }
}
You should use a dynamic programming approach, that is, reuse previous results. For example, keep the y^2 result and reuse it when computing (y^2)z instead of computing it again.
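One way to read this suggestion is sketched below, under the assumption that the variables are passed slowest-varying first (for example {x, y, z}, so that z cycles fastest as in the question's result array). The function name tensor_powers is hypothetical. Every new entry is obtained from an already stored entry with a single multiplication.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: builds all products of powers 0..3 of the given variables.
// vars lists the variables slowest-varying first, e.g. {x, y, z}.
inline std::vector<double> tensor_powers(const std::vector<double>& vars) {
    std::vector<double> result{1.0};
    result.reserve(std::size_t{1} << (2 * vars.size())); // 4^n entries in total
    // Process the fastest-varying variable first.
    for (auto it = vars.rbegin(); it != vars.rend(); ++it) {
        const std::size_t block = result.size();
        // Append block*v, block*v^2, block*v^3; each copy reuses the previous one.
        for (int p = 1; p < 4; ++p) {
            const std::size_t prev = result.size() - block;
            for (std::size_t j = 0; j < block; ++j)
                result.push_back(result[prev + j] * *it);
        }
    }
    return result;
}
```

For three variables this performs 63 multiplications for the 64 outputs, since only the leading 1 is stored without one.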
#include <vector>
#include <cstddef>
#include <cmath>
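void Tensor(std::vector<double>& result, const std::vector<double>& variables, size_t index)
{
    double p1 = variables[index];
    double p2 = p1*p1;
    double p3 = p1*p2;
    if (index == variables.size() - 1) {
        // innermost variable: start the table with its powers
        result.push_back(1);
        result.push_back(p1);
        result.push_back(p2);
        result.push_back(p3);
    } else {
        Tensor(result, variables, index+1);
        ptrdiff_t size = result.size();
        // append the existing block scaled by p1, p2 and p3
        for(int j=0; j<size; ++j)
            result.push_back(result[j]*p1);
        for(int j=0; j<size; ++j)
            result.push_back(result[j]*p2);
        for(int j=0; j<size; ++j)
            result.push_back(result[j]*p3);
    }
}
std::vector<double> Tensor(const std::vector<double>& params) {
    std::vector<double> result;
    size_t rsize = size_t(1) << (2*params.size());
    result.reserve(rsize); // the single dynamic allocation
    Tensor(result, params, 0);
    return result;
}
int main() {
    std::vector<double> params = {2.0, 3.0, 5.0}; // example values for x, y, z
    std::vector<double> result = Tensor(params);
}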
I verified that this one compiles and runs. It runs fast, with no dependencies (outside of std::vector), and it takes any number of parameters. Since calling the recursive form is awkward, I made a wrapper. It makes one function call for each parameter, and one call to dynamic memory (in the wrapper).
You should look at Pascal's pyramid to get a fast solution.
One more thing: as I see it, this would be the basis of a finite element solver. Writing your own BLAS routines is usually not a good idea. Do not reinvent the wheel! I think you should use a BLAS library such as Intel MKL or a CUDA-based BLAS.