SSE _mm_store_ps Segmentation fault issues - c++

I am having trouble with the _mm_store_ps intrinsic. I get a segmentation fault when I use it (and I know that is the problem, because when I comment out that line the segmentation fault goes away). It is strange, though, because I am using a static array that I manually ask the compiler to align, and using _mm_storeu_ps does not make the problem go away. Here is the relevant section of code:
//Directly access array instead of using Boost interface
boost::numeric::ublas::matrix<float>::iterator2 it = result.begin2();
float temp[4] __attribute__((aligned(16))), temp2 = 0;
//Use SSE
__m128 m1, sse_right1, sse_left1, store_sse __attribute__((aligned(16))) = _mm_set_ps1(0);
unsigned k = 0;
//Iterate over the dimensions of the matrices
for (unsigned i = 0; i < ls1; i++)
{
    for (unsigned j = 0; j < rs2; j++)
    {
        while (k + 3 < ls2)
        {
            sse_right1 = _mm_load_ps(arr + k + j * rs1);
            sse_left1 = _mm_load_ps(left_arr + k + i * ls2);
            m1 = _mm_mul_ps(sse_right1, sse_left1);
            store_sse = _mm_add_ps(store_sse, m1);
            k += 4;
        }
        //If ls2 isn't divisible by 4
        while (k < ls2)
        {
            temp2 += left_arr[i * ls2 + k] * arr[k + j * rs1];
            k++;
        }
        if (ls2 >= 4)
        {
            _mm_store_ps(temp, store_sse);
            for (unsigned l = 0; l < 4; l++)
            {
                temp2 += temp[l];
            }
        }
        *it = temp2;
        store_sse = _mm_set_ps1(0);
        temp2 = 0;
        k = 0;
        it++;
    }
}
The segmentation fault isn't a problem with the array bounds because the execution makes it down to the _mm_store_ps line. Any help would be appreciated, thanks!
Edit: The problem is with _mm_load_ps, when I use _mm_loadu_ps it runs fine. I am using static arrays as the arguments to _mm_load_ps, so I don't know why I am having problems.

SSE aligned loads and stores require 16-byte aligned addresses. Your arrays themselves may be aligned, but the addresses you load from are not: arr + k + j * rs1 is only 16-byte aligned when k + j * rs1 is a multiple of 4 floats, which fails for most values of j (and the same goes for left_arr + k + i * ls2). If you're not reading outside the array bounds, misalignment is your problem.
Try using _mm_storeu_ps and _mm_loadu_ps, which are the unaligned versions. They will run a little slower, but they will work. After you've verified that's the problem, arrange the memory accesses to be aligned in the first place for maximum performance.
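For example, here is a minimal sketch of one way to get aligned loads back, by padding each row's stride to a multiple of 4 floats so every row starts on a 16-byte boundary (alloc_matrix is my own hypothetical helper, not code from the question):
#include <xmmintrin.h>

// Hypothetical helper: allocates rows * stride floats, 16-byte aligned,
// with the stride rounded up so arr + j * stride is aligned for every row j.
float* alloc_matrix(unsigned rows, unsigned cols, unsigned &stride)
{
    stride = (cols + 3) & ~3u; // round cols up to a multiple of 4
    return static_cast<float*>(_mm_malloc(rows * stride * sizeof(float), 16));
}
// Loads like _mm_load_ps(arr + k + j * stride) are then aligned whenever
// k is a multiple of 4. Release the buffer with _mm_free(arr).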

Related

OpenMP Segmentation Fault in C++

I have a very straightforward function that counts how many inner entries of an N by N 2D matrix (represented by a pointer arr) is below a certain threshold, and updates a counter below_threshold that is passed by reference:
void count(float *arr, const int N, const float threshold, int &below_threshold) {
    below_threshold = 0; // make sure it is reset
    bool comparison;
    float temp;
    #pragma omp parallel for shared(arr, N, threshold) private(temp, comparison) reduction(+:below_threshold)
    for (int i = 1; i < N-1; i++) // count only the inner N-2 rows
    {
        for (int j = 1; j < N-1; j++) // count only the inner N-2 columns
        {
            temp = *(arr + i*N + j);
            comparison = (temp < threshold);
            below_threshold += comparison;
        }
    }
}
When I do not use OpenMP, it runs fine (thus, the allocation and initialization were done correctly already).
When I use OpenMP with an N that is less than around 40000, it runs fine.
However, once I start using a larger N with OpenMP, it keeps giving me a segmentation fault (I am currently testing with N = 50000 and would like to eventually get it up to ~100000).
Is there something wrong with this at a software level?
P.S. The allocation was done dynamically ( float *arr = new float [N*N] ), and here is the code used to randomly initialize the entire matrix, which didn't have any issues with OpenMP with large N:
void initialize(float *arr, const int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            *(arr + i*N + j) = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
        }
    }
}
UPDATE:
I have tried changing i, j, and N to long long int, and it still has not fixed my segmentation fault. If this was the issue, why has it already worked without OpenMP? It is only once I add #pragma omp ... that it fails.
I think it is because your index value (50000 * 50000 = 2,500,000,000) exceeds INT_MAX (2,147,483,647), so the int expression i*N + j overflows, and the resulting array access is undefined behaviour. As for why it appeared to work without OpenMP: signed overflow is undefined behaviour, so the serial version may simply have been compiled into pointer increments where the overflowing 32-bit multiply never actually happens, whereas with #pragma omp each thread computes i*N + j directly.
So you should do the index arithmetic in a 64-bit type such as std::size_t or long long, making sure the multiplication itself is performed in that type.
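A minimal sketch of that fix, keeping the same function shape as in the question but doing the index arithmetic in 64 bits (the local total is my own addition, since the count itself can exceed INT_MAX for N = 50000):
#include <cstddef>

void count(const float *arr, const int N, const float threshold, int &below_threshold) {
    long long total = 0; // local accumulator; also avoids reducing over a reference
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i < N-1; i++) {
        for (int j = 1; j < N-1; j++) {
            // compute the index in 64-bit so i*N + j cannot wrap
            std::size_t idx = static_cast<std::size_t>(i) * N + j;
            total += (*(arr + idx) < threshold);
        }
    }
    below_threshold = static_cast<int>(total); // note: an int cannot hold counts above INT_MAX
}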

Optimization through loop unrolling and blocking

I'm not sure how else I can optimize this piece of code to make it efficient. So far I've unrolled the inner for loop by 16 with respect to j, and it is producing a mean CPE of 1.4. I need to get a mean CPE of around 2.5 through optimization techniques. I've read the other questions available on this, but they're a bit different from the code my question involves. The first part of the code shows what I'm given, followed by my attempt at unrolling the loop. The given code scans the rows of the source image matrix and copies them to the flipped row of the destination image matrix. Any help would be greatly appreciated!
RIDX Macro:
#define RIDX(i,j,n) ((i)*(n)+(j))
Given:
void naive_rotate(int dim, struct pixel_t *src, struct pixel_t *dst)
{
    int i, j;
    for(i = 0; i < dim; i++)
    {
        for(j = 0; j < dim; j++)
        {
            dst[RIDX(dim-1-i, j, dim)] = src[RIDX(i, j, dim)];
        }
    }
}
My attempt: This does optimize it, but only a bit, as the mean CPE goes up from 1.0 to 1.4. I'd like it to be around 2.5, and I've tried various kinds of blocking and other things I've read about online, but have not managed to optimize it further.
for(i = 0; i < dim; i++){
    for(j = 0; j < dim; j+=16){
        dst[RIDX(dim-1-i,j, dim)] = src[RIDX(i,j,dim)];
        dst[RIDX(dim-1-i,j+1, dim)] = src[RIDX(i,j+1,dim)];
        dst[RIDX(dim-1-i,j+2, dim)] = src[RIDX(i,j+2,dim)];
        dst[RIDX(dim-1-i,j+3, dim)] = src[RIDX(i,j+3,dim)];
        dst[RIDX(dim-1-i,j+4, dim)] = src[RIDX(i,j+4,dim)];
        dst[RIDX(dim-1-i,j+5, dim)] = src[RIDX(i,j+5,dim)];
        dst[RIDX(dim-1-i,j+6, dim)] = src[RIDX(i,j+6,dim)];
        dst[RIDX(dim-1-i,j+7, dim)] = src[RIDX(i,j+7,dim)];
        dst[RIDX(dim-1-i,j+8, dim)] = src[RIDX(i,j+8,dim)];
        dst[RIDX(dim-1-i,j+9, dim)] = src[RIDX(i,j+9,dim)];
        dst[RIDX(dim-1-i,j+10, dim)] = src[RIDX(i,j+10,dim)];
        dst[RIDX(dim-1-i,j+11, dim)] = src[RIDX(i,j+11,dim)];
        dst[RIDX(dim-1-i,j+12, dim)] = src[RIDX(i,j+12,dim)];
        dst[RIDX(dim-1-i,j+13, dim)] = src[RIDX(i,j+13,dim)];
        dst[RIDX(dim-1-i,j+14, dim)] = src[RIDX(i,j+14,dim)];
        dst[RIDX(dim-1-i,j+15, dim)] = src[RIDX(i,j+15,dim)];
    }
}
Here's a quick old-school memcpy optimization. This is usually very efficient and should not need unrolling.
From RIDX:
#define RIDX(i,j,n) ((i)*(n)+(j))
We know that incrementing the 'j' component translates to a simple pointer increment.
struct pixel_t* s = &src[RIDX(0, 0, dim)];
struct pixel_t* d = &dst[RIDX(dim - 1, 0, dim)];
for (int i = 0; i < dim; ++i, d -= (2 * dim))
{
    for (int j = 0; j < dim; ++j, ++s, ++d)
    {
        //dst[RIDX(dim-1-i, j, dim)] = src[RIDX(i, j, dim)];
        *d = *s;
        // you could do it the hard way and start loop unrolling from here
    }
}
In the inner loop in the code above, ++s, ++d give a hint that a memcpy optimization is possible. Note that a memcpy optimization is only possible if the type we're copying can be moved safely. Most types are, but it's something that has to be taken into account. Using memcpy does bend the strict rules of C++ a bit, but memcpy is fast.
The loops then become:
struct pixel_t* s = &src[RIDX(0, 0, dim)];
struct pixel_t* d = &dst[RIDX(dim - 1, 0, dim)];
for (int i = 0; i < dim; ++i, d -= dim, s += dim)
{
    memcpy(d, s, dim * sizeof(pixel_t));
    // or...
    std::copy(s, s + dim, d); // which is 'safer' but could be slower...
}
In most modern STL implementations, std::copy will translate to a memcpy when the element type is trivially copyable. memcpy uses all the tricks in the book to make the copy faster: loop unrolling, cache look-ahead, and so on.

C++ heap corruption on new

I'm writing a simple ANN (neural network) for function approximation. I get a crash with the message: "Heap corrupted". I found a few suggestions on how to resolve it, but nothing helped.
I get the error at the first line of this function:
void LU(double** A, double** &L, double** &U, int s){
    U = new double*[s];
    L = new double*[s];
    for (int i = 0; i < s; i++){
        U[i] = new double[s];
        L[i] = new double[s];
        for (int j = 0; j < s; j++)
            U[i][j] = A[i][j];
    }
    for (int i = 0, j = 0; i < s; i = ++j){
        L[i][j] = 1;
        for (int k = i + 1; k < s - 1; k++){
            L[k][j] = U[k][j] / U[i][j];
            double* vec_t = mul(U[i], L[k][j], s);
            for (int z = 0; z < s; z++)
                U[k][z] = U[k][z] - vec_t[z];
            delete[] vec_t;
        }
    }
};
As I understood from the debugger's information: the two arrays (U and L) were passed to the function with some memory addresses already set. That's quite strange, because I didn't initialize them. I call this function two times; the first time it works nicely (ok, at least it works), but on the second call it crashes. I have no idea how to resolve it.
There is a link to the whole project: CLICK
I'm working in MS Visual Studio 2013 under Windows 7 x64.
UPDATE
According to some commentaries below I should provide some additive information.
First of all, sorry for the quality of the code. I wrote it only for myself, in 2 days.
Second, when I said "at the second call", I mean that first I call LU when I need the determinant of S (I use LU decomposition for this), and it works without any crashes. The second call is when I try to get the inverse of the matrix (the same S). When I call detLU at the [0, 0] point of the matrix (to get the cofactor), I get this crash.
Third, if I read the debugger's information correctly, the arrays L and U are passed into the function on the second call with already-defined memory addresses. I can't understand why, because before the LU call I had just written "double** L; double** U;" without any initialization.
I can provide some additional debug information or run some tests, if somebody explains exactly what I have to do.
The point at which you get a heap corruption error/crash is typically just the symptom of an actual heap overflow/underflow or other memory error that happened at some earlier point. This is why heap corruption can be difficult to track down.
You have a lot of code and all the double-pointers are difficult to track but I did notice one potential issue:
double** initInWeights(double f, int h, int w) {
    double** W = new double*[h];
    for (int i = 0; i < 10; i++) {
        W[i] = new double[w];
The loop will overflow W[] if h is less than 10. Chances are that somewhere in your code you have a buffer overflow/underflow or are using memory after it is freed. The complexity and design of your code makes it difficult to pinpoint at a glance.
Is there a reason you are using raw double-pointers instead of simply std::vector<std::vector<double>>? This would remove all your manual memory management code, making your code shorter, simpler, and more importantly remove the heap corruption issue.
Barring that you should double-check that all manually allocated memory is the correct size and access loops can never go out-of-bounds.
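For illustration, here is a minimal sketch of what the allocation side of LU looks like with vectors (an assumption about how you'd restructure it, not a drop-in patch for your project):
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

void LU(const Matrix& A, Matrix& L, Matrix& U) {
    const std::size_t s = A.size();
    L.assign(s, std::vector<double>(s, 0.0)); // zero-initialized s x s
    U = A;                                    // deep copy, no manual loops
    // ... the factorization itself stays the same; while debugging, use
    // L.at(i).at(j) so an out-of-bounds index throws instead of silently
    // corrupting the heap.
}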
Update -- I think your problem may lie with a buffer overflow in the extract() function in matrix.cpp:
double** extract(double** mat, int s, int col, int row)
{
    double** ext = new double*[s - 1];
    for (int i = 0; i < s - 1; i++)
    {
        ext[i] = new double[s - 1];
    }
    int ext_c = 0, ext_r = 0;
    for (int i = 0; i < s; i++)
    {
        if (i != row)
        {
            for (int j = 0; j < s; j++)
            { // Overflow on ext_c here
                if (j != col) ext[ext_r][ext_c++] = mat[i][j];
            }
            ext_r++;
        }
    }
    return ext;
};
You never reset ext_c so it simply keeps increasing in size up to (s-1)*(s-1) which obviously overflows the ext[] array. To fix this you simply need to change the inner loop definition to:
for (int j = 0, ext_c = 0; j < s; j++)
At least that one change lets me run your project without any heap corruption errors.

I want to optimize this short loop

I would like to optimize this simple loop:
unsigned int i;
while(j-- != 0){ // j is an unsigned int with a start value of about N = 36,000,000
    float sub = 0;
    i=1;
    unsigned int c = j+s[1];
    while(c < N) {
        sub += d[i][j]*x[c]; // d[][] and x[] are arrays of float
        i++;
        c = j+s[i]; // s[] is an array of unsigned int with 6 entries.
    }
    x[j] -= sub; // only one memory-write per j
}
The loop has an execution time of about one second with a 4000 MHz AMD Bulldozer. I thought about SIMD and OpenMP (which I normally use to get more speed), but this loop is recursive.
Any suggestions?
I think you may want to transpose the matrix d, that is, store it in such a way that you can exchange the indices and make i the outer index:
sub += d[j][i]*x[c];
instead of
sub += d[i][j]*x[c];
This should result in better cache performance.
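If the source can't easily be changed to store d transposed, a rough sketch of building a transposed copy up front (my own illustration; M is the number of diagonals, about 6 here, and N their length):
#include <cstddef>

// Flatten the transpose into one buffer so row j holds all M diagonal
// entries for index j, contiguously.
float* transpose_diagonals(float** d, std::size_t M, std::size_t N) {
    float* dT = new float[N * M];
    for (std::size_t i = 0; i < M; i++)
        for (std::size_t j = 0; j < N; j++)
            dT[j * M + i] = d[i][j];
    return dT;
}
// the inner loop then reads contiguously: sub += dT[j * M + i] * x[c];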
I agree with transposing for better caching (but see my comments on that at the end), and there's more to do, so let's see what we can do with the full function...
Original function, for reference (with some tidying for my sanity):
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *x, float *b){
    //We want to solve L D Lt x = b where D is a diagonal matrix described by Diagonals[0] and L is a unit lower triangular matrix described by the rest of the diagonals.
    //Let D Lt x = y. Then, first solve L y = b.
    float *y = new float[n];
    float **d = IncompleteCholeskyFactorization->Diagonals;
    unsigned int *s = IncompleteCholeskyFactorization->StartRows;
    unsigned int M = IncompleteCholeskyFactorization->m;
    unsigned int N = IncompleteCholeskyFactorization->n;
    unsigned int i, j;
    for(j = 0; j != N; j++){
        float sub = 0;
        for(i = 1; i != M; i++){
            int c = (int)j - (int)s[i];
            if(c < 0) break;
            if(c==j) {
                sub += d[i][c]*b[c];
            } else {
                sub += d[i][c]*y[c];
            }
        }
        y[j] = b[j] - sub;
    }
    //Now, solve x from D Lt x = y -> Lt x = D^-1 y
    // Took this one out of the while, so it can be parallelized now, which speeds up, because division is expensive
    #pragma omp parallel for
    for(j = 0; j < N; j++){
        x[j] = y[j]/d[0][j];
    }
    while(j-- != 0){
        float sub = 0;
        for(i = 1; i != M; i++){
            if(j + s[i] >= N) break;
            sub += d[i][j]*x[j + s[i]];
        }
        x[j] -= sub;
    }
    delete[] y;
}
Because of the comment about parallel divide giving a speed boost (despite being only O(N)), I'm assuming the function itself gets called a lot. So why allocate memory? Just mark x as __restrict__ and change y to x everywhere (__restrict__ is a GCC extension, taken from C99. You might want to use a define for it. Maybe the library already has one).
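For instance, a guarded define along those lines might look like this (the macro name is my own):
#if defined(__GNUC__)
#  define RESTRICT __restrict__
#else
#  define RESTRICT
#endif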
Similarly, though I guess you can't change the signature, you can make the function take only a single parameter and modify it. b is never used when x or y have been set. That would also mean you can get rid of the branch in the first loop which runs ~N*M times. Use memcpy at the start if you must have 2 parameters.
And why is d an array of pointers? Must it be? This seems too deep in the original code, so I won't touch it, but if there's any possibility of flattening the stored array, it will be a speed boost even if you can't transpose it (multiply, add, dereference is faster than dereference, add, dereference).
So, new code:
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *__restrict__ x){
    // comments removed so that suggestions are more visible. Don't remove them in the real code!
    // these definitions got long. Feel free to remove const; it does nothing for the optimiser
    const float *const __restrict__ *const __restrict__ d = IncompleteCholeskyFactorization->Diagonals;
    const unsigned int *const __restrict__ s = IncompleteCholeskyFactorization->StartRows;
    const unsigned int M = IncompleteCholeskyFactorization->m;
    const unsigned int N = IncompleteCholeskyFactorization->n;
    unsigned int i;
    unsigned int j;
    for(j = 0; j < N; j++){ // don't use != as an optimisation; compilers can do more with <
        float sub = 0;
        for(i = 1; i < M && j >= s[i]; i++){
            const unsigned int c = j - s[i];
            sub += d[i][c]*x[c];
        }
        x[j] -= sub;
    }
    // Consider using processor-specific optimisations for this
    #pragma omp parallel for
    for(j = 0; j < N; j++){
        x[j] /= d[0][j];
    }
    for( j = N; (j --) > 0; ){ // changed for clarity
        float sub = 0;
        for(i = 1; i < M && j + s[i] < N; i++){
            sub += d[i][j]*x[j + s[i]];
        }
        x[j] -= sub;
    }
}
Well it's looking tidier, and the lack of memory allocation and reduced branching, if nothing else, is a boost. If you can change s to include an extra UINT_MAX value at the end, you can remove more branches (both the i<M checks, which again run ~N*M times).
Now we can't make any more loops parallel, and we can't combine loops. The boost now will be, as suggested in the other answer, to rearrange d. Except… the work required to rearrange d has exactly the same cache issues as the work to do the loop. And it would need memory allocated. Not good. The only options to optimise further are: change the structure of IncompleteCholeskyFactorization->Diagonals itself, which will probably mean a lot of changes, or find a different algorithm which works better with data in this order.
If you want to go further, your optimisations will need to impact quite a lot of the code (not a bad thing; unless there's a good reason for Diagonals being an array of pointers, it seems like it could do with a refactor).
I want to give an answer to my own question: the bad performance was caused by cache conflict misses, due to the fact that (at least) Win7 aligns big memory blocks to the same boundary. In my case, for all buffers, the addresses had the same alignment (bufferaddress % 4096 was the same for all buffers), so they fell into the same cache set of the L1 cache. I changed the memory allocation to align the buffers to different boundaries to avoid cache conflict misses and got a speedup of a factor of 2. Thanks for all the answers, especially the answers from Dave!
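For concreteness, a minimal sketch of that kind of staggered allocation (my own illustration, not the poster's actual code); each buffer gets a different cache-line offset so the buffers stop mapping to the same L1 cache set:
#include <cstddef>

// Over-allocate and offset buffer k by k cache lines (64 bytes).
// The caller must keep 'raw' and delete[] it, not the returned pointer.
float* alloc_staggered(std::size_t n_floats, int k, char*& raw)
{
    const std::size_t offset = static_cast<std::size_t>(k) * 64;
    raw = new char[n_floats * sizeof(float) + offset];
    return reinterpret_cast<float*>(raw + offset);
}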

Segmentation fault when implementing insertion sort

#include <iostream>
using namespace std;
int main(){
    int a[6] = {5, 2, 4, 6, 1, 3}; // create an array of size 6
    int j, key = 0;
    for (int i = 1; i < 6; i++) {
        key = a[i];
        j = i - 1;
        while ((j >= 0) && (a[j] > key)) {
            a[j + 1] = a[j];
            j -= 1;
        }
        a[j + 1] = key;
    }
    for (int l = 0; l < 6; l++) {
        cout << a[l];
    }
    return 0;
}
I'm trying to test my insertion sort code using an array. The code compiles, but when I try to execute the a.out file it gives me "Segmentation fault". I looked up what a segmentation fault is: it's basically an error where we try to access a forbidden memory location. I'm wondering where exactly the error is in my code. Also, if I get rid of the
for (int l = 0; l < 6; l++) {
    cout << a[l];
}
no error is found.
Your variable j is not initialized and so may be anything when you first access a[j]. That causes the segmentation fault. (int j, key = 0; only sets key to 0, but not j.)
Always compile your code with -Wall; this would have told you about the use of the uninitialized variable. (Correction: my gcc 4.7 doesn't catch it. How lame.)
(The reason why the error goes away when you remove the printing is that you have compiler optimizations turned on: The compiler then notices that you never do anything practical with the computed values and arrays and just throws everything into the bin and gives you an empty program.)
Sorting is one of the algorithms in the STL. You should really be using std::sort, like
std::sort( a, a+6 );
PS: j is initialized before use in the line
j = i - 1;
so that is not the cause of the crash.