I am trying to write a function that transposes a bitmap in place, but so far the result is all messed up and I can't find what I'm doing wrong.
Source bitmaps are stored as a 1D pixel array in ARGB format.
void transpose(uint8_t* buffer, const uint32_t width, const uint32_t height)
{
const size_t stride = width * sizeof(uint32_t);
for (uint32_t i = 0; i < height; i++)
{
uint32_t* row = (uint32_t*)(buffer + (stride * i));
uint8_t* section = buffer + (i * sizeof(uint32_t));
for (uint32_t j = i + 1; j < height; j++)
{
const uint32_t tmp = row[j];
row[j] = *((uint32_t*)(section + (stride * j)));
*((uint32_t*)(section + (stride * j))) = tmp;
}
}
}
UPDATE:
To clarify, since some people seem to think this is just a rotate-image question: transposing an image is composed of two transformations: 1) flip horizontally, 2) rotate by 90° CCW. (As shown in the image example, see the arrow directions.)
I think the problem is more complex than you realise and is not simply a case of swapping the pixels at x, y with the pixels at y, x. If you consider a 3*7 pixel image in which I've labelled the pixels a-u:
abcdefg
hijklmn
opqrstu
Transposing this image gives:
aho
bip
cjq
dkr
els
fmt
gnu
Turning both images into a 1D array gives:
abcdefghijklmnopqrstu
ahobipcjqdkrelsfmtgnu
Notice that b has moved to the position of d but has been replaced by h.
Rethink your algorithm, draw it out for a small image and make sure it works before attempting to implement it.
Due to the complexity of the task it may actually end up being faster to create a temporary buffer, transpose into that buffer and then copy back, as this could need fewer copies (2 per pixel) than whatever in-place algorithm you come up with.
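As a minimal sketch of the temporary-buffer approach described above (the name transpose_via_copy and the use of std::vector<uint32_t> are illustrative assumptions, not part of the original code), something like this would do:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Transposes a width x height image of 32-bit pixels stored row-major in
// `pixels`. Afterwards the buffer holds a height x width image, i.e. the
// new row length is `height`. A temporary copy is used, so this is not
// strictly in-place.
void transpose_via_copy(std::vector<std::uint32_t>& pixels,
                        std::size_t width, std::size_t height)
{
    assert(pixels.size() == width * height);
    std::vector<std::uint32_t> out(pixels.size());
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            out[x * height + y] = pixels[y * width + x]; // (x, y) -> (y, x)
    pixels.swap(out);
}
```

For the 3*7 example, calling transpose_via_copy(img, 7, 3) on abcdefghijklmnopqrstu gives ahobipcjqdkrelsfmtgnu; the buffer must then be read as 3 pixels wide and 7 pixels high.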
Mostly equivalent code that should be easier to debug:
inline uint32_t * addr(uint8_t* buffer, const uint32_t width, uint32_t i, uint32_t j) {
// Reinterpret the byte buffer as 32-bit pixels; the cast must be explicit.
uint32_t * tmp = (uint32_t*)buffer;
return tmp+i*width+j;
}
void transpose(uint8_t* buffer, const uint32_t width, const uint32_t height) {
for (uint32_t i = 0; i < min(width,height); i++) {
for (uint32_t j = 0; j < i; j++) {
uint32_t * a = addr(buffer, width, i, j);
uint32_t * b = addr(buffer, width, j, i);
const uint32_t tmp = *a;
*a = *b;
*b = tmp;
}
}
}
If this doesn't work right, it may need to know not just the width of the picture but also the width (stride) of the underlying buffer. This only transposes the square portion at the top-left; more work would be needed for non-square bitmaps (or just pad everything to square before using it).
Note that transposing a matrix in place is not trivial when N!=M. See e.g. here for details.
The reason is that when N=M you can simply iterate through half of the matrix and swap elements. When N!=M this isn't the case.
For illustration, consider a simpler case:
First a 2d view on 1d data:
struct my2dview {
std::vector<int>& data;
int width,height;
my2dview(std::vector<int>& data,int width,int height):data(data),width(width),height(height){}
int operator()(int x,int y) const { return data[x*width + y]; }
int& operator()(int x,int y){ return data[x*width + y]; }
my2dview get_transposed() { return my2dview(data,height,width);}
};
std::ostream& operator<<(std::ostream& out, const my2dview& x){
for (int h=0;h<x.height;++h){
for (int w=0;w<x.width;++w){
out << x(h,w) << " ";
}
out << "\n";
}
return out;
}
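Now a naive transpose through the transposed view. Both views share the same data, so it overwrites elements it still needs to read (even for N=M it would have to swap over half of the matrix rather than assign):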
my2dview broken_transpose(my2dview x){
auto res = x.get_transposed();
for (int i=0;i<x.height;++i){
for (int j=0;j<x.width;++j){
res(j,i) = x(i,j);
}
}
return res;
}
Using it for some small matrix
int main() {
std::vector<int> x{1,2,3,4,5,6};
auto v = my2dview(x,2,3);
std::cout << v << '\n';
std::cout << v.get_transposed() << '\n';
auto v2 = broken_transpose(v);
std::cout << v2;
}
prints
1 2
3 4
5 6
1 2 3
4 5 6
1 3 2
2 2 6
Conclusion: The naive swapping elements approach does not work for non-square matrices.
Actually this answer just rephrases the one by @Alan Birtles. I felt challenged by his
Due to the complexity of the task it may actually end up being faster to create a temporary buffer [...]
just to come to the same conclusion ;).
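For completeness, here is a hedged sketch of one way to transpose a non-square row-major array without a full-size second buffer: follow the permutation cycles, using the fact that the element at index i (other than the last one) moves to index (i * rows) mod (n - 1). The name transpose_cycles and the std::vector<bool> bookkeeping are my own illustrative choices. It still needs one extra bit per element, so it is not strictly in-place.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Transposes a rows x cols row-major matrix stored in `a`.
// Afterwards `a` holds the cols x rows transpose, also row-major.
void transpose_cycles(std::vector<int>& a, std::size_t rows, std::size_t cols)
{
    assert(a.size() == rows * cols);
    const std::size_t n = a.size();
    if (n < 2) return;                      // nothing to move
    std::vector<bool> done(n, false);       // marks indices already placed
    // The first and last elements never move, so skip them.
    for (std::size_t start = 1; start + 1 < n; ++start) {
        if (done[start]) continue;
        std::size_t i = start;
        int carried = a[start];             // value waiting to be placed
        do {
            const std::size_t dest = (i * rows) % (n - 1);
            std::swap(carried, a[dest]);    // place it, pick up the evicted value
            done[dest] = true;
            i = dest;
        } while (i != start);
    }
}
```

With the data from above, transpose_cycles(x, 3, 2) on {1,2,3,4,5,6} (3 rows of 2) leaves {1,3,5,2,4,6} (2 rows of 3).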
I have a binary BMP image of size 284x1280. The image contains the digits 9 4 3 6. I want to perform component labelling on the image and mark the labels whenever the digits occur. Initially, it is a binary image with only 0 and 1 in the 2D array (0 marked as background and 1 marked as the digits)
I tried to write a component labelling function (checking 8 neighbours) and incrementing a counter whenever I find a component labelled 1:
#include<stdio.h>
#include<string.h>
#include<malloc.h>
#include<stdlib.h>
int func(int w, int h, int a[][1280], int i, int j, int c)
{
if(i==h||j==w)
{
return 0;
}
if(a[i][j+1]==1)
{
a[i][j+1]=c; return func(w,h,a,i,j+1,c);
}
if(a[i+1][j]==1)
{
a[i+1][j]=c; return func(w,h,a,i+1,j,c);
}
if(a[i+1][j+1]==1)
{
a[i+1][j+1]=c; return func(w,h,a,i+1,j+1,c);
}
else
{
return 0;
}
}
unsigned char* read_bmp(char *fname, int* _w, int* _h)
{
unsigned char head[54];
FILE *f=fopen(fname,"rb");
//BMP header is 54 bytes
fread(head,1,54,f);
int w=head[18]+(((int)head[19]) << 8)+(((int)head[20]) << 16)+
(((int)head[21]) << 24);
int h=head[22]+(((int)head[23]) << 8)+(((int)head[24]) << 16)+
(((int)head[25]) << 24);
//lines are aligned on 4-byte boundary
int lineSize = (w / 8 + (w / 8) % 4);
int fileSize=lineSize * h;
unsigned char *img, *data;
img =(unsigned char*)malloc(w * h), data =(unsigned
char*)malloc(fileSize);
//skip the header
fseek(f,54,SEEK_SET);
//skip palette - two rgb quads, 8 bytes
fseek(f,8,SEEK_CUR);
//read data
fread(data,1,fileSize,f);
//decode bits
int i, j, k, rev_j;
for(j=0, rev_j=h-1;j<h;j++,rev_j--)
{
for(i=0;i<w/8;i++)
{
int fpos= j * lineSize + i, pos = rev_j * w + i * 8;
for(k=0;k<8;k++)
{
img[pos+(7-k)]=(data[fpos] >> k) & 1;
}
}
}
free(data);
*_w = w; *_h = h;
return img;
}
int main()
{
int w, h, i, j, c1=0, c2=0, c3=0, c4=0, c5=0, c6=0;
unsigned char* img=read_bmp("binary.bmp",&w,&h);
int array[h][1280];
char ch;
for(j=0;j<h;j++)
{
for(i=0;i<1280;i++)
{
array[j][i]=(int(img[j * w + i])==0);
}
}
register int c=2;
for(i=0;i<h;i++)
{
for(j=0;j<1280;j++)
{
if(array[i][j]==1)
{
array[i][j]=c;
func(w,h,array,i,j,c);
}
}
}
for(i=0;i<h;i++)
{
for(j=0;j<w;j++)
{
printf("%d",array[i][j]);
}
printf("\n");
}
return 0;
}
I am getting an array of just 0 and 2, whereas it should contain 0,2,3,4,5 labels for other digits. How to fix it?
You never increment c, hence you get stuck at label 2.
Once you fix that, you'll notice single objects being broken up into many labels. This is because you check only 3 neighbors in your recursive function. You need to check all 8 (or 4 for a 4-connected neighborhood). Yes, your recursive function must also be able to travel left and up to follow complex shapes.
This recursive function is very inefficient, and with a large enough object it could cause a stack overflow. You could instead write a loop that propagates along the line within the object. The best algorithms for object labeling use the union-find algorithm; I encourage you to look that up.
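To illustrate the points above, here is a minimal sketch of a labeling pass that increments the label per object, checks all 8 neighbors, and replaces recursion with an explicit stack. It is a plain flood fill, not the union-find approach, and the function name label_components and the flat std::vector<int> image layout are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Labels 8-connected components of foreground pixels (value 1) in a
// row-major w x h image. Background stays 0; objects get labels 2, 3, ...
// Returns the number of objects found.
int label_components(std::vector<int>& img, int w, int h)
{
    assert(img.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    int next_label = 2;
    std::vector<std::pair<int, int> > stack;   // pending (x, y) positions
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            if (img[y * w + x] != 1) continue; // background or already labeled
            img[y * w + x] = next_label;
            stack.push_back(std::make_pair(x, y));
            while (!stack.empty()) {
                const std::pair<int, int> p = stack.back();
                stack.pop_back();
                // Visit all 8 neighbors, including left and up.
                for (int dy = -1; dy <= 1; ++dy) {
                    for (int dx = -1; dx <= 1; ++dx) {
                        const int nx = p.first + dx, ny = p.second + dy;
                        if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                        if (img[ny * w + nx] != 1) continue;
                        img[ny * w + nx] = next_label;
                        stack.push_back(std::make_pair(nx, ny));
                    }
                }
            }
            ++next_label;                      // next object gets a new label
        }
    }
    return next_label - 2;
}
```

The explicit stack lives on the heap, so large objects do not risk overflowing the call stack.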
I am trying to write an optimal algorithm to draw a rectangle onto a 1D array. I wrote this function:
/** Draws a rectangle in 1D array
* Arguments:
* pixmap - 1D array of Color
* color - rectangle color
* w - rectangle width
* h - rectangle height
* x - x position, negative coordinates are outside draw area
* y - y position, negative coordinates are outside draw area
* pixmapWidth - width of the image (height can be deduced from width if needed but is practically unnecessary) */
void rectangle(std::vector<int>& pixmap, const int& color, const int w, const int h, int x, const int y, const int pixmapWidth)
{
if(x>=pixmapWidth)
return;
if(x+w<0)
return;
if(y+h<0)
return;
// Width of one consistent line of color of the rectangle
// if the rectangle is partially out of pixmap area,
// the width is smaller than rectangle width
const int renderWidth = std::min(w, pixmapWidth-x);
// offset in the array where the rendering starts
// 0 would be for [0,0] coordinate
int tg_offset = y*pixmapWidth+x;
// maximum offset to ever render, which is the array size
const int tg_end = pixmap.size();
int lines = 0;
for(; tg_offset<tg_end && lines<h; tg_offset+=pixmapWidth) {
for(int cx=0; cx<renderWidth; ++cx) {
// This check keeps failing and my program crashes
if(tg_offset+cx >= pixmap.size())
throw "Oh no, what a bad thing to happen!";
pixmap[tg_offset+cx] = color;
}
lines++;
}
}
Note that I know there's a lot of picture drawing libraries, but I'm trying to learn by doing this. But now I'm stuck and I need help.
The problem is that in the inner loop, the check if(tg_offset+cx >= pixmap.size()) keeps triggering, meaning I am trying to render outside the array. I have no idea why this keeps happening.
Example problematic code:
const int pixmap_width = 20;
const int pixmap_height = 20;
std::vector<int> pixmap(pixmap_width*pixmap_height);
// tries to render outside the array
rectangle(pixmap, 0, 10, 10, -1, 18, pixmap_width);
Here is a testcase including ASCII output of the pixmap: http://ideone.com/SoJPFF
I don't know how could I improve the question any more...
Making no changes produces a quadrilateral. Is this not the desired functionality?
for(; tg_offset<tg_end && lines<h; tg_offset+=pixmapWidth) {
cout <<"" << endl;
for(int cx=0; cx<renderWidth; ++cx) {
cout << " " << pixmap[tg_offset+cx];
// This check keeps failing and my program crashes
if(tg_offset+cx >= pixmap.size())
throw "Oh no, what a bad thing to happen!";
pixmap[tg_offset+cx] = color;
}
lines++;
}
}
int main()
{
std::vector<int> pixmap(16);
pixmap = { 1,1,1,1,1,0,0,1,1,0,0,1,1,1,1,1 };
int color = 0;
int w = 4;
int h = 4;
int x = 0;
int y = 0;
int pixmapWidth = 4;
cout << "Hello World" << endl;
rectangle(pixmap, color, w, h, x, y, pixmapWidth);
return 0;
}
produces:
Hello World
1 1 1 1
1 0 0 1
1 0 0 1
1 1 1 1
I think a large part of the problem with your function is it being a lot more complex than it needs to be. Here's a much simpler version of your function, done by simply looping over x and y.
void rectangle(std::vector<int>& pixmap, const int& color, const int width, const int height,
int left, const int top, const int pixmapWidth)
{
for (int x = std::max(left, 0); x < left + width && x < pixmapWidth; x++)
for (int y = std::max(top, 0); y < top + height && y*pixmapWidth + x < pixmap.size(); y++)
pixmap[y*pixmapWidth + x] = color;
}
I'm not sure exactly what output you want when x or y is negative. In your actual algorithm things go wrong if x is negative because tg_offset then starts before the beginning of the row, so tg_offset + cx can run past the end of the array even though tg_offset itself is still in range.
To solve this you can limit the inner for loop, like this:
for(int cx=0; cx<std::min(renderWidth, tg_end - tg_offset); ++cx)
but I think that clamping x and y so they are never negative is more correct:
if ( x < 0 ) x = 0;
if ( y < 0 ) y = 0;
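Note that clamping x and y alone moves the rectangle without shrinking it. A hedged sketch of clipping all four edges against the pixmap first is shown below; the name fill_rect_clipped, the returned pixel count, and the assumption that pixmap.size() is a multiple of pixmapWidth are mine, not from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fills the part of the w x h rectangle at (x, y) that lies inside the
// pixmap and returns how many pixels were written.
int fill_rect_clipped(std::vector<int>& pixmap, int color,
                      int w, int h, int x, int y, int pixmapWidth)
{
    assert(pixmapWidth > 0);
    assert(pixmap.size() % static_cast<std::size_t>(pixmapWidth) == 0);
    const int pixmapHeight = static_cast<int>(pixmap.size()) / pixmapWidth;
    // Intersect [x, x+w) x [y, y+h) with the pixmap area.
    const int x0 = std::max(x, 0);
    const int y0 = std::max(y, 0);
    const int x1 = std::min(x + w, pixmapWidth);
    const int y1 = std::min(y + h, pixmapHeight);
    int written = 0;
    for (int row = y0; row < y1; ++row) {
        for (int col = x0; col < x1; ++col) {
            pixmap[row * pixmapWidth + col] = color;
            ++written;
        }
    }
    return written; // 0 if the rectangle is entirely outside
}
```

With the problematic call from the question (w = 10, h = 10, x = -1, y = 18 on a 20 x 20 pixmap), only columns 0 to 8 of rows 18 and 19 are written, and no index leaves the array.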
I'm writing a CUDA program to transpose a square matrix. The idea is to do it in two parts, depending on the size of the matrix: the largest part whose size is a multiple of the tile size is transposed tile by tile, and the remaining rectangular part is transposed separately. Example: for a 67 x 67 matrix with a tile size of 32, the first part is the 64x64 block, and the second part is 3x67.
My problem is in the rectangular part.
The code below shows the main code with the defined values:
const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;
const int NUM_REPS = 100;
const int Nx = 2024; //size of the matrix
const int Ny = 2024;
int main(int argc, char **argv)
{
const int nx = Nx;
const int ny = Ny; // Size of the Arrays
const int mem_size = nx*ny*sizeof(int);// Size of the Orig.Arr
int *h_idata = (int*)malloc(mem_size); // original Host Arr.
int *d_idata; //device Arr.
checkCuda(cudaMalloc(&d_idata, mem_size));
dim3 dimGridX(nx / TILE_DIM, 1, 1); //grid dimension used
dim3 dimBlockX(TILE_DIM, 1, 1); // number of threads used
// the Kernel Function for only the rectangle
EdgeTransposeX << < dimGrid, dimBlock >> >(d_idata);
cudaEventRecord(startEvent, 0);
cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);
cudaEventElapsedTime(&ms, startEvent, stopEvent);
cudaMemcpy(h_idata, d_idata, mem_size, cudaMemcpyDeviceToHost);
The kernel code (I was advised not to use shared memory, so below is how I've done it):
__global__ void EdgeTransposeX(int *idata)
{
int tile_C[Edge][Nx];
int tile_V[Nx][Edge];
int x = blockIdx.x * TILE_DIM + threadIdx.x;
if (x == (nEven - 1))
{
for (int j = 0; j < Nx; j++)
for (int i = 1; i <= Edge; i++)
{
tile_V[j][i - 1] = idata[j*Nx + (x + i)];
tile_C[i - 1][j] = idata[(x + i)*Nx + j];}
__syncthreads();
for (int j = 0; j < Nx; j++)
for (int i = 1; i <= Edge; i++)
{
idata[j*Nx + (x + i)] = tile_C[i - 1][j];
idata[(x + i)*Nx + j] = tile_V[j][i - 1];}
} }
The code works fine until the matrix size reaches 1025; after that it stops working. Any idea why? Am I missing something here?
Your two-dimensional arrays tile_C and tile_V are physically stored in the GPU's local memory. The amount of local memory per thread is 512 KB. Verify that you are not using more than 512 KB of local memory per thread.
An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ qualifiers described in this section generally resides in a register. However in some cases the compiler might choose to place it in local memory. This fragment was taken from the "CUDA C Programming Guide 2015", p. 89.
My suggestion is that you use the visual profiler to check the occupancy, register and local memory usage.
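Before profiling, a back-of-the-envelope calculation can show how large the two per-thread arrays are. The sketch below is plain host-side C++ and assumes that Edge (not shown in the question) is the leftover after tiling, i.e. Nx % TILE_DIM, which matches the 67 x 67 example (Edge = 3). The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed definition of Edge: columns left over after tiling.
inline std::size_t edge_width(std::size_t n, std::size_t tile)
{
    assert(tile > 0);
    return n % tile;
}

// Bytes needed per thread for tile_C[Edge][Nx] plus tile_V[Nx][Edge],
// both arrays of int.
inline std::size_t per_thread_tile_bytes(std::size_t n, std::size_t tile)
{
    return 2 * edge_width(n, tile) * n * sizeof(int);
}

// True if the two arrays alone stay within the 512 KB per-thread limit
// quoted above.
inline bool fits_local_memory(std::size_t n, std::size_t tile)
{
    return per_thread_tile_bytes(n, tile) <= 512u * 1024u;
}
```

This only counts the two arrays, not any registers the compiler spills, so the profiler remains the authoritative source.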
This link may be helpful for you: link.
I implemented the transpose of a square matrix using CUDA surfaces in 2D; it works fine for sizes from 2 to 16384 in power-of-two increments. If you don't mind implementing a non-tiled version, I recommend this approach.
EDIT: You can check out my implementation on GitHub: https://github.com/Sheljohn/WalshHadamard
I am looking for an implementation, or indications on how to implement, the sequency-ordered Fast Walsh Hadamard transform (see this and this).
I slightly adapted a very nice implementation found online:
// (a,b) -> (a+b,a-b) without overflow
void rotate( long& a, long& b )
{
static long t;
t = a;
a = a + b;
b = t - b;
}
// Number of significant bits in x, i.e. floor(log2(x)) + 1 for x > 0 (callers subtract 1)
long ilog2( long x )
{
long l2 = 0;
for (; x; x >>=1) ++l2;
return l2;
}
/**
* Fast Walsh-Hadamard transform
*/
void fwht( std::vector<long>& data )
{
const long l2 = ilog2(data.size()) - 1;
for (long i = 0; i < l2; ++i)
{
for (long j = 0; j < (1 << l2); j += 1 << (i+1))
for (long k = 0; k < (1 << i ); ++k)
rotate( data[j + k], data[j + k + (1<<i)] );
}
}
but it does not compute the WHT in sequency order (the natural Hadamard matrix is used implicitly). Note that in the code above (and if you try it), the size of data needs to be a power of 2.
My question is: is there a simple adaptation of this implementation that gives the sequency-ordered FWHT?
A possible solution would be to write a small function to compute dynamically the elements of Hn (the Hadamard matrix of order n), count the number of zero crossings, and create a ranking of the rows, but I am wondering whether there is a smarter way. Thanks in advance for any input! Cheers
As indicated here (linked from within your reference):
The sequency ordering of the rows of the Walsh matrix can be derived from the ordering of the Hadamard matrix by first applying the bit-reversal permutation and then the Gray code permutation.
There are various implementations of bit-reversal algorithm such as this:
// Bit-reversal
// adapted from http://www.idi.ntnu.no/~elster/pubs/elster-bit-rev-1989.pdf
void bitrev(int t, std::vector<long>& c)
{
long n = 1<<t;
long L = 1;
c[0] = 0;
for (int q=0; q<t; ++q)
{
n /= 2;
for (long j=0; j<L; ++j)
{
c[L+j] = c[j] + n;
}
L *= 2;
}
}
The gray code can be obtained from here:
/*
The purpose of this function is to convert an unsigned
binary number to reflected binary Gray code.
The operator >> is shift right. The operator ^ is exclusive or.
*/
unsigned int binaryToGray(unsigned int num)
{
return (num >> 1) ^ num;
}
These can be combined to yield the final permutation:
// Compute a permutation of size 2^order
// to reorder the Fast Walsh-Hadamard transform's output
// into the Walsh-ordered (sequency-ordered)
void sequency_permutation(long order, std::vector<long>& p)
{
long n = 1<<order;
std::vector<long> tmp(n);
bitrev(order, tmp);
p.resize(n);
for (long i=0; i<n; ++i)
{
p[i] = tmp[binaryToGray(i)];
}
}
All that's left to do is to apply the permutation to the normal Walsh-Hadamard Transform output.
void permuted_fwht(std::vector<long>& data, const std::vector<long>& permutation)
{
std::vector<long> tmp = data;
fwht(tmp);
for (long i=0; i<data.size(); ++i)
{
data[i] = tmp[permutation[i]];
}
}
Note that the permutation is fixed for a given data size, so it only needs to be computed once (assuming you are processing multiple blocks of data). So, putting it all together you would get something such as:
std::vector<long> p;
const long order = ilog2(data_block_size) - 1;
sequency_permutation(order, p);
permuted_fwht( data_block_1, p);
permuted_fwht( data_block_2, p);
//...
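As a sanity check of the permutation, one can count sign changes along the rows of the natural-ordered Hadamard matrix, whose entry (r, c) is (-1)^popcount(r & c), and confirm that row p[i] has exactly i sign changes. The sketch below is independent of the functions above; all names in it (hadamard_entry, sign_changes, reverse_bits, sequency_order) are illustrative assumptions.

```cpp
#include <cassert>
#include <vector>

// Entry (r, c) of the natural-ordered Hadamard matrix: (-1)^popcount(r & c).
int hadamard_entry(unsigned r, unsigned c)
{
    int sign = 1;
    for (unsigned v = r & c; v; v >>= 1)
        if (v & 1u) sign = -sign;
    return sign;
}

// Number of sign changes (the sequency) along row r of the n x n matrix.
unsigned sign_changes(unsigned r, unsigned n)
{
    unsigned changes = 0;
    for (unsigned c = 1; c < n; ++c)
        if (hadamard_entry(r, c) != hadamard_entry(r, c - 1)) ++changes;
    return changes;
}

// Reverses the lowest `bits` bits of x.
unsigned reverse_bits(unsigned x, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

// p[i] = index of the natural-ordered row that has sequency i:
// Gray code first, then bit reversal of the result.
std::vector<unsigned> sequency_order(unsigned bits)
{
    assert(bits < 32);
    const unsigned n = 1u << bits;
    std::vector<unsigned> p(n);
    for (unsigned i = 0; i < n; ++i)
        p[i] = reverse_bits(i ^ (i >> 1), bits);
    return p;
}
```

For order 2 this gives the permutation 0, 2, 3, 1, the same values the sequency_permutation function above produces for that size.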
From some comments that I have read in here, for some reason it is preferable to have Structure of Arrays (SoA) over Array of Structures (AoS) for parallel implementations like CUDA? If that is true, can anyone explain why?
Thanks in advance!
Choice of AoS versus SoA for optimum performance usually depends on access pattern. This is not just limited to CUDA however - similar considerations apply for any architecture where performance can be significantly affected by memory access pattern, e.g. where you have caches or where performance is better with contiguous memory access (e.g. coalesced memory accesses in CUDA).
E.g. for RGB pixels versus separate RGB planes:
struct {
uint8_t r, g, b;
} AoS[N];
struct {
uint8_t r[N];
uint8_t g[N];
uint8_t b[N];
} SoA;
If you are going to be accessing the R/G/B components of each pixel concurrently then AoS usually makes sense, since the successive reads of R, G, B components will be contiguous and usually contained within the same cache line. For CUDA this also means memory read/write coalescing.
However if you are going to process color planes separately then SoA might be preferred, e.g. if you want to scale all R values by some scale factor, then SoA means that all R components will be contiguous.
One further consideration is padding/alignment. For the RGB example above each element in an AoS layout is aligned to a multiple of 3 bytes, which may not be convenient for CUDA, SIMD, et al - in some cases perhaps even requiring padding within the struct to make alignment more convenient (e.g. add a dummy uint8_t element to ensure 4 byte alignment). In the SoA case however the planes are byte aligned which can be more convenient for certain algorithms/architectures.
For most image processing type applications the AoS scenario is much more common, but for other applications, or for specific image processing tasks this may not always be the case. When there is no obvious choice I would recommend AoS as the default choice.
See also this answer for more general discussion of AoS v SoA.
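To make the access-pattern argument concrete, here is a small host-side C++ sketch of the "scale all R values" case in both layouts. The type and function names (PixelAoS, ImageSoA, halve_red_aos, halve_red_soa) are made up for the example, and it makes no performance claim by itself; it only shows which bytes each loop touches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// AoS: one struct per pixel, so r, g, b of a pixel sit next to each other.
struct PixelAoS {
    std::uint8_t r, g, b;
};

// SoA: one contiguous plane per channel.
struct ImageSoA {
    std::vector<std::uint8_t> r, g, b;
};

// Touches every third byte: the g and b bytes in between are skipped
// but still share cache lines with the r values being read.
void halve_red_aos(std::vector<PixelAoS>& px)
{
    for (std::size_t i = 0; i < px.size(); ++i)
        px[i].r = static_cast<std::uint8_t>(px[i].r / 2);
}

// Touches one contiguous run of bytes: the whole r plane and nothing else.
void halve_red_soa(ImageSoA& img)
{
    assert(img.g.size() == img.r.size() && img.b.size() == img.r.size());
    for (std::size_t i = 0; i < img.r.size(); ++i)
        img.r[i] = static_cast<std::uint8_t>(img.r[i] / 2);
}
```

In a CUDA kernel with one thread per pixel, the same index arithmetic decides whether neighboring threads read neighboring addresses.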
I just want to provide a simple example showing how a Struct of Arrays (SoA) performs better than an Array of Structs (AoS).
In the example, I'm considering three different versions of the same code:
AoS (v1)
Straight arrays (v2)
SoA (v3)
In particular, version 2 uses straight arrays. The timings of versions 2 and 3 are about the same for this example and turn out to be better than that of version 1. I suspect that, in general, straight arrays could be preferable, although at the expense of readability, since, for example, loading from uniform cache could be enabled through const __restrict__ for this case.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
#define BLOCKSIZE 1024
/******************************************/
/* CELL STRUCT LEADING TO ARRAY OF STRUCT */
/******************************************/
struct cellAoS {
unsigned int x1;
unsigned int x2;
unsigned int code;
bool done;
};
/*******************************************/
/* CELL STRUCT LEADING TO STRUCT OF ARRAYS */
/*******************************************/
struct cellSoA {
unsigned int *x1;
unsigned int *x2;
unsigned int *code;
bool *done;
};
/*******************************************/
/* KERNEL MANIPULATING THE ARRAY OF STRUCT */
/*******************************************/
__global__ void AoSvsSoA_v1(cellAoS *d_cells, const int N) {
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < N) {
cellAoS tempCell = d_cells[tid];
tempCell.x1 = tempCell.x1 + 10;
tempCell.x2 = tempCell.x2 + 10;
d_cells[tid] = tempCell;
}
}
/******************************/
/* KERNEL MANIPULATING ARRAYS */
/******************************/
__global__ void AoSvsSoA_v2(unsigned int * __restrict__ d_x1, unsigned int * __restrict__ d_x2, const int N) {
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < N) {
d_x1[tid] = d_x1[tid] + 10;
d_x2[tid] = d_x2[tid] + 10;
}
}
/********************************************/
/* KERNEL MANIPULATING THE STRUCT OF ARRAYS */
/********************************************/
__global__ void AoSvsSoA_v3(cellSoA cell, const int N) {
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < N) {
cell.x1[tid] = cell.x1[tid] + 10;
cell.x2[tid] = cell.x2[tid] + 10;
}
}
/********/
/* MAIN */
/********/
int main() {
const int N = 2048 * 2048 * 4;
TimingGPU timerGPU;
thrust::host_vector<cellAoS> h_cells(N);
thrust::device_vector<cellAoS> d_cells(N);
thrust::host_vector<unsigned int> h_x1(N);
thrust::host_vector<unsigned int> h_x2(N);
thrust::device_vector<unsigned int> d_x1(N);
thrust::device_vector<unsigned int> d_x2(N);
for (int k = 0; k < N; k++) {
h_cells[k].x1 = k + 1;
h_cells[k].x2 = k + 2;
h_cells[k].code = k + 3;
h_cells[k].done = true;
h_x1[k] = k + 1;
h_x2[k] = k + 2;
}
d_cells = h_cells;
d_x1 = h_x1;
d_x2 = h_x2;
cellSoA cell;
cell.x1 = thrust::raw_pointer_cast(d_x1.data());
cell.x2 = thrust::raw_pointer_cast(d_x2.data());
cell.code = NULL;
cell.done = NULL;
timerGPU.StartCounter();
AoSvsSoA_v1 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(thrust::raw_pointer_cast(d_cells.data()), N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
printf("Timing AoSvsSoA_v1 = %f\n", timerGPU.GetCounter());
//timerGPU.StartCounter();
//AoSvsSoA_v2 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(thrust::raw_pointer_cast(d_x1.data()), thrust::raw_pointer_cast(d_x2.data()), N);
//gpuErrchk(cudaPeekAtLastError());
//gpuErrchk(cudaDeviceSynchronize());
//printf("Timing AoSvsSoA_v2 = %f\n", timerGPU.GetCounter());
timerGPU.StartCounter();
AoSvsSoA_v3 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(cell, N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
printf("Timing AoSvsSoA_v3 = %f\n", timerGPU.GetCounter());
h_cells = d_cells;
h_x1 = d_x1;
h_x2 = d_x2;
// --- Check results
for (int k = 0; k < N; k++) {
if (h_x1[k] != k + 11) {
printf("h_x1[%i] not equal to %i\n", k, k + 11);
break;
}
if (h_x2[k] != k + 12) {
printf("h_x2[%i] not equal to %i\n", k, k + 12);
break;
}
if (h_cells[k].x1 != k + 11) {
printf("h_cells[%i].x1 not equal to %i\n", k, k + 11);
break;
}
if (h_cells[k].x2 != k + 12) {
printf("h_cells[%i].x2 not equal to %i\n", k, k + 12);
break;
}
}
}
The following are the timings (runs performed on a GTX960):
Array of struct 9.1ms (v1 kernel)
Struct of arrays 3.3ms (v3 kernel)
Straight arrays 3.2ms (v2 kernel)
SoA is effectively good for SIMD processing.
There are several reasons, but basically it's more efficient to load 4 consecutive floats into a register, with something like:
alignas(16) float v[4] = {0}; // _mm_load_ps requires 16-byte alignment
__m128 reg = _mm_load_ps( v );
than to use:
struct vec { float x; float y; ....} ;
vec v = {0, 0, 0, 0};
and create an __m128 value by accessing every member:
__m128 reg = _mm_set_ps(v.x, ....);
If your arrays are 16-byte aligned, data loads/stores are faster and some operations can be performed directly on memory operands.