Modifying a large dynamically sized 3D array (MEX/C++)

Short Story: Trying to modify a large 3D array which gets allocated at run-time on the heap. I believe the function that modifies the array, vcross shown below, is creating memory that doesn't get destroyed.
Longer Story:
I have a large 3D double array (~126000x3x3, about 8.6 MB) that I need to run some operations on. I don't know how large the first dimension of this array is at compile time, so I allocate memory on the heap using new and delete.
When I try to store values to this array, I get a segmentation violation. This makes me think that while storing values to the array, I'm creating memory somewhere that goes to waste, and eventually fills up the heap.
The code compiles fine, but hits a seg violation when I run it.
static void inpolyh(
    double (*f)[3], // pointer to array[3], treated as a 2D array whose first dimension is unknown until run-time
    double (*v)[3],
    double (*p)[3],
    size_t numF,
    size_t numP)
{
    /* Calculate the baseNormals */
    // allocate memory on the heap
    double (*baseNormals)[3] = NULL; // pointer to an array[3]
    if ( !(baseNormals = new double[numF][3]) ) { out_of_memory(); }
    // store the vector cross products in each array[3] of baseNormals
    for (int i = 0; i < numF; i++) {
        vcross(baseNormals[i],
               v[(int)f[i][0]],
               v[(int)f[i][1]],
               v[(int)f[i][2]]);
        // THIS WORKS
    }

    /* Calculate face normals of tetrahedron */
    // allocate memory on the heap (THIS WORKS)
    double (*faceNormals)[3][3] = NULL; // pointer to an array[3] of arrays[3]
    if ( !(faceNormals = new double[numP][3][3]) ) { out_of_memory(); }
    // store vector cross products into each array[3] of faceNormals
    for (int i = 0; i < numP; i++) {
        for (int j = 0; j < 3; j++) {
            vcross(faceNormals[i][j],
                   p[i],
                   v[ (int) f[i][j] ],
                   v[ (int) f[i][ (j + 1) % 3 ] ]);
            // SEG VIOLATION at i=37560
        }
    }

    delete [] baseNormals;
    delete [] faceNormals;
}
This is where I believe the culprit to be. I think this function creates memory somewhere that never gets destroyed. The vector cross product function accepts four array[3] parameters and assigns values to the first parameter, which is passed by reference.
static void vcross(
    double (&n)[3],
    double a[3],
    double b[3],
    double c[3])
{
    n[0] = b[1] * c[2] - a[1] * c[2] + a[1] * b[2] - b[2] * c[1] + a[2] * c[1] - a[2] * b[1];
    n[1] = b[2] * c[0] - a[2] * c[0] + a[2] * b[0] - b[0] * c[2] + a[0] * c[2] - a[0] * b[2];
    n[2] = b[0] * c[1] - a[0] * c[1] + a[0] * b[1] - b[1] * c[0] + a[1] * c[0] - a[1] * b[0];
    return;
}
Other details that may matter:
This is intended to be a MEX function to run in MATLAB.
Default Encoding: windows-1252
MATLAB Root : C:\Program Files\MATLAB\R2012b
MATLAB Version : 8.0.0.783 (R2012b)
Operating System: Microsoft Windows 7
Processor ID : x86 Family 6 Model 58 Stepping 9, GenuineIntel
Virtual Machine : Java 1.6.0_17-b04 with Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM mixed mode
Window System : Version 6.1 (Build 7601: Service Pack 1)

Are you sure that (int)f[i][j] is always in the range [0, <dimension of v>)? The first thing I'd do is print out the values of (int)f[i][j] as the loop runs. Or just fire up a debugger to see the values at the moment of the crash.
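For example, a quick diagnostic along these lines would flag an out-of-range index before the crash. This is only a sketch: numV (the number of rows in v) is assumed to be available to the caller; it is not a parameter of the posted inpolyh.

#include "mex.h"   // for mexPrintf

// Hypothetical helper: checks that every vertex index stored in f lies in [0, numV).
// numV is an assumed extra parameter, not part of the posted signature.
static void checkIndices(double (*f)[3], size_t numF, size_t numV)
{
    for (size_t i = 0; i < numF; i++) {
        for (int j = 0; j < 3; j++) {
            int idx = (int) f[i][j];
            if (idx < 0 || (size_t) idx >= numV)
                mexPrintf("out-of-range index: f[%d][%d] = %d\n", (int) i, j, idx);
        }
    }
}

Calling something like checkIndices(f, numF, numV) at the top of inpolyh (or simply printing the indices, as suggested above) would make it obvious whether the crash comes from an out-of-range vertex index.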


C++ performance optimization for linear combination of large matrices?

I have a large tensor of floating point data with the dimensions 35k(rows) x 45(cols) x 150(slices) which I have stored in an armadillo cube container. I need to linearly combine all the 150 slices together in under 35 ms (a must for my application). The linear combination floating point weights are also stored in an armadillo container. My fastest implementation so far takes 70 ms, averaged over a window of 30 frames, and I don't seem to be able to beat that. Please note I'm allowed CPU parallel computations but not GPU.
I have tried multiple different ways of performing this linear combination but the following code seems to be the fastest I can get (70 ms) as I believe I'm maximizing the cache hit chances by fetching the largest possible contiguous memory chunk at each iteration.
Please note that Armadillo stores data in column major format. So in a tensor, it first stores the columns of the first channel, then the columns of the second channel, then third and so forth.
#include <armadillo>
#include <chrono>
#include <iostream>

typedef std::chrono::system_clock Timer;
typedef std::chrono::duration<double> Duration;

int main()
{
    int rows = 35000;
    int cols = 45;
    int slices = 150;
    arma::fcube tensor(rows, cols, slices, arma::fill::randu);
    arma::fvec w(slices, arma::fill::randu);
    double overallTime = 0;
    int window = 30;
    for (int n = 0; n < window; n++) {
        Timer::time_point start = Timer::now();
        arma::fmat result(rows, cols, arma::fill::zeros);
        for (int i = 0; i < slices; i++)
            result += tensor.slice(i) * w(i);
        Timer::time_point end = Timer::now();
        Duration span = end - start;
        double t = span.count();
        overallTime += t;
        std::cout << "n = " << n << " --> t = " << t * 1000.0 << " ms" << std::endl;
    }
    std::cout << std::endl << "average time = " << overallTime * 1000.0 / window << " ms" << std::endl;
}
I need to optimize this code by at least 2x and I would very much appreciate any suggestions.
First of all, I have to admit that I'm not familiar with the arma framework or its memory layout; least of all whether the syntax result += slice(i) * weight evaluates lazily.
The two primary problems, and their solutions, in any case lie in the memory layout and the memory-to-arithmetic ratio.
Computing a += b*c is problematic because it needs to read b and a, write a, and uses up to two arithmetic operations (two if the architecture does not combine multiplication and accumulation).
If the memory layout is of the form float tensor[rows][columns][channels], the problem is converted to making rows * columns dot products of length channels, and it should be expressed as such.
If it's float tensor[c][h][w], it's better to unroll the loop to result += slice(i)*w(i) + slice(i+1)*w(i+1) + .... Reading four slices at a time reduces the memory transfers by 50%.
It might even be better to process the results in chunks of 4*N results (reading from all 150 channels/slices), where N < 16, so that the accumulators can be allocated explicitly or implicitly by the compiler to SIMD registers.
There's a possibility of a minor improvement by padding the slice count to a multiple of 4 or 8, by compiling with -ffast-math to enable fused multiply-accumulate (if available), and with multithreading.
The constraints indicate the need to perform about 13.5 GFLOPS (35000 x 45 x 150 x 2 operations every 35 ms), which is a reasonable number in terms of arithmetic (for many modern architectures), but it also implies at least 54 GB/s of memory bandwidth, which could be relaxed with fp16 or 16-bit fixed-point arithmetic.
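For completeness, here is the back-of-the-envelope arithmetic behind those figures, using the dimensions and the 35 ms budget from the question (my own check, not part of the original answer):

#include <cstdio>

int main()
{
    // Dimensions and time budget taken from the question.
    double rows = 35000, cols = 45, slices = 150, budget = 0.035; // seconds
    double flops       = rows * cols * slices * 2;   // one multiply + one add per element
    double tensorBytes = rows * cols * slices * 4;   // each float of the tensor read once
    std::printf("%.1f GFLOP/s\n", flops / budget / 1e9);                 // ~13.5
    std::printf("%.1f GB/s for tensor reads alone\n",                    // ~27; roughly doubles
                tensorBytes / budget / 1e9);                             // once result traffic is counted
}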
EDIT
Knowing the memory order to be float tensor[150][45][35000], or float tensor[kSlices][kRows * kCols == kCols * kRows], suggests trying first to unroll the outer loop by 4 streams (or maybe even 5, since 150 is not divisible by 4, which would otherwise require a special case for the excess).
void blend(int kCols, int kRows, float const *tensor, float *result, float const *w) {
    // ensure that cols*rows is a multiple of 4 (pad if necessary)
    // - allows the auto vectorizer to skip handling the 'excess' code where the data
    //   length mod simd width != 0
    // one could try even a SIMD width of 16*4, as clang 14
    // can further unroll the inner loop to 4 ymm registers
    auto const stride = (kCols * kRows + 3) & ~3;

    // try also s+=6, s+=3, or s+=4, which would require a dedicated inner loop (for s+=2)
    for (int s = 0; s < 150; s += 5) {
        auto src0 = tensor + s * stride;
        auto src1 = src0 + stride;
        auto src2 = src1 + stride;
        auto src3 = src2 + stride;
        auto src4 = src3 + stride;
        auto dst = result;
        for (int x = 0; x < stride; x++) {
            // clang should be able to optimize caching the weights
            // to registers outside the inner loop
            auto add = src0[x] * w[s] +
                       src1[x] * w[s + 1] +
                       src2[x] * w[s + 2] +
                       src3[x] * w[s + 3] +
                       src4[x] * w[s + 4];
            // clang should be able to optimize this comparison
            // out of the loop, generating two inner kernels
            if (s == 0) {
                dst[x] = add;
            } else {
                dst[x] += add;
            }
        }
    }
}
EDIT 2
Another starting point (before adding multithreading) would be to consider changing the layout to
float tensor[kCols][kRows][kSlices + kPadding]; // padding is optional
The downside now is that kSlices = 150 can no longer fit all the weights in registers (and, second, kSlices is not a multiple of 4 or 8). Furthermore, the final reduction needs to be horizontal.
The upside is that the reduction no longer needs to go through memory, which is a big win once multithreading is added.
void blendHWC(float const *tensor, float const *w, float *dst, int n, int c) {
    // each thread will read from 4 positions in order
    // to share the weights -- finding the best distance
    // might need some iterations
    auto src0 = tensor;
    auto src1 = src0 + c;
    auto src2 = src1 + c;
    auto src3 = src2 + c;
    for (int i = 0; i < n / 4; i++) {
        vec8 acc0(0.0f), acc1(0.0f), acc2(0.0f), acc3(0.0f);
        // #pragma unroll?
        for (int j = 0; j < c; j += 8) {
            vec8 wj(w + j);
            acc0 += wj * vec8(src0 + j);
            acc1 += wj * vec8(src1 + j);
            acc2 += wj * vec8(src2 + j);
            acc3 += wj * vec8(src3 + j);
        }
        vec4 sum = horizontal_reduct(acc0, acc1, acc2, acc3);
        sum.store(dst); dst += 4;
        // advance to the next group of 4 positions
        src0 += 4 * c; src1 += 4 * c; src2 += 4 * c; src3 += 4 * c;
    }
}
These vec4 and vec8 are some custom SIMD classes, which map to SIMD instructions either through intrinsics, or by virtue of the compiler being able to compile using vec4 = float __attribute__((vector_size(16))); to efficient SIMD code.
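For reference, a minimal sketch of what such a wrapper might look like on top of that GCC/Clang vector extension (my own illustration; the actual vec4/vec8 classes used in blendHWC above are not shown and may differ, e.g. by using intrinsics directly):

#include <cstring>

// Hypothetical vec8 built on the vector_size extension mentioned above.
typedef float v8sf __attribute__((vector_size(32)));  // 8 packed floats

struct vec8 {
    v8sf v;
    explicit vec8(float x)        { for (int i = 0; i < 8; ++i) v[i] = x; }
    explicit vec8(float const *p) { std::memcpy(&v, p, sizeof v); }   // unaligned load
    void store(float *p) const    { std::memcpy(p, &v, sizeof v); }   // unaligned store
    vec8 &operator+=(vec8 o)      { v += o.v; return *this; }
};

inline vec8 operator*(vec8 a, vec8 b) { a.v = a.v * b.v; return a; }  // element-wise multiply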
As #hbrerkere suggested in the comment section, by using the -O3 flag and making the following changes, the performance improved by almost 65%. The code now runs at 45 ms as opposed to the initial 70 ms.
int lastStep = (slices / 4 - 1) * 4;
int i = 0;
while (i <= lastStep) {
    result += tensor.slice(i) * w_id(i) + tensor.slice(i + 1) * w_id(i + 1)
            + tensor.slice(i + 2) * w_id(i + 2) + tensor.slice(i + 3) * w_id(i + 3);
    i += 4;
}
while (i < slices) {
    result += tensor.slice(i) * w_id(i);
    i++;
}
Without having the actual code, I'm guessing that
+= tensor.slice(i) * w_id(i)
creates a temporary object and then adds it to the lhs. Yes, overloaded operators look nice, but I would write a function
addto( lhs, slice1, w1, slice2, w2, ....unroll to 4... )
which translates to pure loops over the elements:
for (i = ...)
    for (j = ...)
        lhs[i][j] += slice1[i][j]*w1 + slice2[i][j]*w2 + ...
It would surprise me if that doesn't buy you an extra factor.
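A minimal sketch of such an addto over raw, contiguous row-major float buffers might look like this (my own illustration, assuming plain contiguous storage; the Armadillo-specific glue is omitted):

// Hypothetical addto(): accumulates four weighted slices into lhs in a single pass,
// so lhs is read and written once per group of four slices instead of once per slice.
static void addto(float *lhs,
                  const float *s1, float w1,
                  const float *s2, float w2,
                  const float *s3, float w3,
                  const float *s4, float w4,
                  int elems)   // elems = rows * cols
{
    for (int k = 0; k < elems; ++k)
        lhs[k] += s1[k] * w1 + s2[k] * w2 + s3[k] * w3 + s4[k] * w4;
}

In the question's setting, elems would be rows*cols, and the raw pointers could come from result.memptr() and tensor.slice(i).memptr(), since Armadillo exposes the underlying storage via memptr().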

How to calculate where an indexed value in a 3d array will be in memory? How to calculate where an indexed value in a char** will be in memory?

The problem states: Given the following array declarations and indexed accesses, compute the address where the indexed value will be in memory. Assume the array starts at location 200 on a 64-bit computer.
a. double d[3][4][4]; d[1][2][3] is at: _________
b. char *n[10]; n[3] is at: _________
I know the answers are 416 and 224 (respectively), but I do not understand how those numbers were reached.
For part a, I was told the equation:
address-in-3d-array = start-address + (p * numR * numC + (i * numC) + j) * size-of-type
(where start address = 200, the numR and numC come from the original array, and the i,j, and p come from the location you are trying to find).
Nothing I do makes this equation come to 416. I have been viewing the order of the array as d[row][column][plane]. Is that incorrect? I have also tried looking at it as d[plane][row][column], but that didn't seem to work either.
For part b, I'm not sure where to start, as I thought that since the array is an array of pointers, its location would be in the heap. I'm not sure how to get 224 from that.
I need to answer these questions by hand, not using code.
For this array declaration
double d[3][4][4];
to calculate the address of the expression
d[1][2][3]
You can use the following formula
reinterpret_cast<double *>( d ) + 1 * 16 + 2 * 4 + 3
that is the same (relative to the value of the expression) as
reinterpret_cast<char *>( d ) + 27 * sizeof( double )
So you can calculate the address as the address of the first element of the array plus 27 * sizeof( double ), where sizeof( double ) equals 8.
For this array
char *n[10];
the address of the expression
n[3]
is
reinterpret_cast<char *>( n ) + 3 * sizeof( char * )
In words:
Given a generic array d[s1][s2][s3] of elements of size S, the offset of the d[x][y][z] element is
[(x * s2 * s3) + (y * s3) + z] * S
In the array double d[3][4][4], with S = sizeof(double) = 8, the location of d[1][2][3] is at offset:
[(1 * 4 * 4) + (2 * 4) + 3] * 8 = 216
Add the offset (216) to the start address (200) to get 416.
Given a generic array n[s1] of elements of size S, the offset of the n[x] element is
x * S
In the array char * n[10], with S = 8 (the pointer size on 64-bit platforms), the location of n[3] is at offset
3 * 8 = 24
Add the offset (24) to the start address (200) to get 224.
In code:
#include <iostream>

int main()
{
    double d[3][4][4];
    size_t start = 200;
    size_t offset =
          sizeof(d[0])       * 1
        + sizeof(d[0][0])    * 2
        + sizeof(d[0][0][0]) * 3;
    std::cout << start + offset << std::endl; // 416 on my machine
    char * n[10];
    offset = 3 * sizeof(char*);
    std::cout << start + offset << std::endl; // 224 on every 64-bit platform
}

What happens when we calculate this-(object of current class)

I have a class named DPPoint. What happens when we calculate this - (object of DPPoint) inside the same class and assign the value to a variable of type int?
If I understand correctly the question is about this:
DPPoint* p1 = new DPPoint;
DPPoint* p2 = new DPPoint;
int k = p1 - p2; // what is k?
That's perfectly valid code. It's called "pointer arithmetic". It will give you the distance between p1 and p2 measured in units of sizeof(DPPoint).
DPPoint array [10];
int k = &array[5] - &array[3]; // k = 2
int n = (int)&array[5] - (int)&array[3]; // n = 2 * sizeof (DPPoint)
(&array[5] == &array[3] + 2); // true
(&array[5] == array + 5); // also true
The pointers don't have to be in the same array; they can be two random addresses in memory (skipping alignment issues for simplicity). Strictly speaking, the standard only defines pointer subtraction within the same array, but on typical flat-memory platforms you simply get the address difference divided by sizeof(DPPoint).
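A self-contained sketch that prints the quantities discussed above (DPPoint here is just a hypothetical stand-in with two members):

#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for the DPPoint class from the question.
struct DPPoint { double x; double y; };

int main()
{
    DPPoint array[10];
    std::ptrdiff_t k = &array[5] - &array[3];                        // element distance: 2
    std::ptrdiff_t bytes = reinterpret_cast<const char*>(&array[5])
                         - reinterpret_cast<const char*>(&array[3]); // byte distance: 2 * sizeof(DPPoint)
    std::printf("k = %td, bytes = %td, sizeof(DPPoint) = %zu\n", k, bytes, sizeof(DPPoint));
}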

How __shared__ is working in the following code?

I don't understand lines 9 and 10: the index being used is calculated via the formula Col + (m*TILE_WIDTH + ty)*Width.
Can someone help me understand this code, i.e. the use of __shared__?
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
1.    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
2.    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
3.    int bx = blockIdx.x;  int by = blockIdx.y;
4.    int tx = threadIdx.x; int ty = threadIdx.y;
      // Identify the row and column of the Pd element to work on
5.    int Row = by * TILE_WIDTH + ty;
6.    int Col = bx * TILE_WIDTH + tx;
7.    float Pvalue = 0;
      // Loop over the Md and Nd tiles required to compute the Pd element
8.    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
          // Collaborative loading of Md and Nd tiles into shared memory
9.        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
10.       Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
11.       __syncthreads();
12.       for (int k = 0; k < TILE_WIDTH; ++k)
13.           Pvalue += Mds[ty][k] * Nds[k][tx];
14.       __syncthreads();
      }
15.   Pd[Row*Width+Col] = Pvalue;
}
__shared__ memory is a fast (but small) on-chip resource for the GPU.
The matrices to be multiplied start out in global memory (Md and Nd). Lines 9 and 10:
Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)]; // line 9
Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width]; // line 10
each load a "tile" (square sub-section) of the matrix to be multiplied (either Md or Nd) into a shared-memory copy (Mds or Nds). The reason a single line of code can load a whole tile is that all threads of the threadblock execute that one line of code. As a result, a threadblock-sized "chunk" or "tile" of data is moved from global to shared memory.
Once it is in shared memory, the actual multiplication is done in line 13. Since line 13 operates out of shared memory instead of global memory, and because there is data reuse amongst adjacent threads in the block, the overall multiplication runs more quickly: shared memory can be accessed more rapidly than global memory.
A similar code and description is provided in the programming guide.
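Regarding the index arithmetic the question asks about: Col + (m*TILE_WIDTH + ty)*Width is just the row-major linearization of element (m*TILE_WIDTH + ty, Col) of Nd, and Row*Width + (m*TILE_WIDTH + tx) is element (Row, m*TILE_WIDTH + tx) of Md. A small host-side sketch in plain C++ (with hypothetical small sizes, my own illustration) showing the same address calculation:

#include <cassert>

// Row-major linear index of element (row, col) in a Width x Width matrix,
// the same formula the kernel uses for Md and Nd.
int linearIndex(int row, int col, int Width) { return row * Width + col; }

int main()
{
    const int TILE_WIDTH = 2, Width = 4;          // hypothetical sizes
    int m = 1, ty = 1, tx = 0, Row = 3, Col = 2;  // hypothetical thread coordinates

    // Nd index from line 10 is element (m*TILE_WIDTH + ty, Col):
    assert(Col + (m * TILE_WIDTH + ty) * Width == linearIndex(m * TILE_WIDTH + ty, Col, Width));
    // Md index from line 9 is element (Row, m*TILE_WIDTH + tx):
    assert(Row * Width + (m * TILE_WIDTH + tx) == linearIndex(Row, m * TILE_WIDTH + tx, Width));
}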

Location of Intel's __assume affects performance

I am using an 8-th order finite difference time stepping function (for 2D acoustic wave equation) shown below.
I am observing a substantial (up to 25%) performance increase from placing Intel's __assume statement inside the inner loop, compared to placing it at the beginning of the function body. (This happens regardless of the number of OpenMP threads.)
The code is compiled with the Intel 2016 update 1 compiler on Linux, with the -O3 optimization option, for an AVX-capable architecture (Xeon E5-2695 v2).
Is it a compiler problem?
/* Finite difference, 8-th order scheme for acoustic 2D equation.
   p - current pressure
   q - previous and next pressure
   c - velocity
   n0 x n1 - problem size
   p1 - stride
*/
void fdtd_2d( float const* const __restrict__ p,
              float       * const __restrict__ q,
              float const* const __restrict__ c,
              int const n0,
              int const n1,
              int const p1 )
{
    // Stencil coefficients.
    static const float C[5] = { -5.6944444e+0f, 1.6000000e+0f, -2.0000000e-1f, 2.5396825e-2f, -1.7857143e-3f };

    // INTEL OPTIMIZER PROBLEM?
    // PLACING THE FOLLOWING LINE INSIDE THE LOOP BELOW
    // INSTEAD OF HERE SPEEDS UP THE CODE!
    // __assume( p1 % 16 == 0 );

    #pragma omp parallel for default(none)
    for ( int i1 = 0; i1 < n1; ++i1 )
    {
        float const* const __restrict__ ps = p + i1 * p1;
        float       * const __restrict__ qs = q + i1 * p1;
        float const* const __restrict__ cs = c + i1 * p1;

        #pragma omp simd aligned( ps, qs, cs : 64 )
        for ( int i0 = 0; i0 < n0; ++i0 )
        {
            // INTEL OPTIMIZER PROBLEM?
            // PLACING THE FOLLOWING LINE HERE
            // INSTEAD OF THE ABOVE SPEEDS UP THE CODE!
            __assume( p1 % 16 == 0 );

            // Laplacian cross stencil:
            // center and 4 points up, down, left and right from the center
            auto lap = C[0] * ps[i0];
            for ( int r = 1; r <= 4; ++r )
                lap += C[r] * ( ps[i0 + r] + ps[i0 - r] + ps[i0 + r * p1] + ps[i0 - r * p1] );
            qs[i0] = 2.0f * ps[i0] - qs[i0] + cs[i0] * lap;
        }
    }
}
I was pointed to the following on Intel website:
Clauses such as __assume_aligned and __assume tell the compiler that the property holds at the particular point in the program where the clause appears. So the statement __assume_aligned(a, 64); means the pointer a is aligned at 64 bytes whenever program execution reaches this point. Compiler may propagate that property to other points in the program (such as a later loop), but this behavior is not guaranteed (it is possible that compiler has to make conservative assumptions and cannot apply the property safely for a later loop in the same function).
So when I place __assume at the beginning of the function body, the assumption is not propagated into the inner loops, which results in less optimal code.
Still, my expectation was reasonable: since p1 is declared const, the compiler could have propagated the assumption.