Getting User store segfault error - c++

I am receiving the error "User store segfault # 0x000000007feff598" for a large convolution operation.
I have defined the resultant array as
int t3_isize = 0;
int t3_irowcount = 0;
t3_irowcount=atoi(argv[2]);
t3_isize = atoi(argv[3]);
int iarray_size = t3_isize*t3_irowcount;
uint64_t t_result[iarray_size];
I noticed that if the array size is less than 2^16 - 1, the operation doesn't fail, but for the array size 2^16 or higher, I get the segfault error.
Any idea why this is happening, and how can I fix it?

“I noticed that if the array size is greater than 2^16 - 1, the operation doesn't fail, but for the array size 2^16 or higher, I get the segfault error”
↑ Seems a bit self-contradictory.
But you're probably just allocating too large an array on the stack. Using dynamic memory allocation (e.g., simply switching to std::vector) avoids that problem. For example:
std::vector<uint64_t> t_result(iarray_size);
In passing, I would ditch the Hungarian-notation-like prefixes. For example, t_ reads as if this were a type. The time for Hungarian notation was the late 1980s, and its purpose was to support Microsoft's Programmer's Workbench, a product discontinued long ago.

You're probably declaring too large an array for the stack. 2^16 elements of 8 bytes each is quite a lot (512 KiB).
If you just need static allocation and the size is known at compile time, move the array to file scope.
Otherwise, consider using std::vector, which will allocate storage from the heap and manage it for you.
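A minimal sketch of the std::vector approach, assuming the two sizes arrive as C strings like argv[2] and argv[3]. make_result and its parameter names are hypothetical (the prefixes are dropped, following the naming advice in the first answer). The product is computed in std::size_t so that large inputs do not overflow int:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: builds the result buffer on the heap instead of the stack.
// row_count_arg and row_size_arg play the role of argv[2] and argv[3].
std::vector<std::uint64_t> make_result(const char* row_count_arg, const char* row_size_arg) {
    int row_count = std::atoi(row_count_arg);
    int row_size = std::atoi(row_size_arg);
    assert(row_count >= 0 && row_size >= 0);   // debug-only sanity check
    // Multiply in std::size_t so large inputs do not overflow int.
    std::size_t count =
        static_cast<std::size_t>(row_count) * static_cast<std::size_t>(row_size);
    return std::vector<std::uint64_t>(count);  // zero-filled, freed automatically
}
```

For 256 rows of 256 elements this yields 2^16 zeroed uint64_t values (512 KiB) on the heap, the size at which the stack version started failing.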

Using malloc() solved the issue.
uint64_t* t_result = (uint64_t*) malloc(sizeof(uint64_t)*iarray_size);
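The malloc() fix works, but it brings two obligations the snippet doesn't show: checking for NULL and calling free(). A minimal sketch, where run_with_heap_buffer is a hypothetical wrapper and the fill loop is only a placeholder for the real convolution:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical wrapper around the malloc() fix. Returns false if allocation failed;
// otherwise stores a checksum of the buffer in *checksum.
bool run_with_heap_buffer(std::size_t iarray_size, std::uint64_t* checksum) {
    assert(checksum != nullptr);
    std::uint64_t* t_result =
        static_cast<std::uint64_t*>(std::malloc(sizeof(std::uint64_t) * iarray_size));
    if (t_result == nullptr)
        return false;                    // malloc() signals failure by returning NULL

    for (std::size_t i = 0; i < iarray_size; ++i)
        t_result[i] = i;                 // placeholder for the real computation

    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < iarray_size; ++i)
        sum += t_result[i];
    *checksum = sum;

    std::free(t_result);                 // memory from malloc() must go back via free()
    return true;
}
```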

Related

Aligning buffer to an N-byte boundary but not a 2N-byte one?

I would like to allocate some char buffers[0], to be passed to an external non-C++ function, that have a specific alignment requirement.
The requirement is that the buffer be aligned to an N-byte[1] boundary, but not to a 2N boundary. For example, if N is 64, then the pointer to this buffer p should satisfy ((uintptr_t)p) % 64 == 0 and ((uintptr_t)p) % 128 != 0, at least on platforms where pointers have the usual interpretation as a plain address when cast to uintptr_t.
Is there a reasonable way to do this with the standard facilities of C++11?
If not, is there a reasonable way to do this outside the standard facilities[2] which works in practice for modern compilers and platforms?
The buffer will be passed to an outside routine (adhering to the C ABI but written in asm). The required alignment will usually be greater than 16, but less than 8192.
Over-allocation or any other minor wasted-resource issues are totally fine. I'm more interested in correctness and portability than wasting a few bytes or milliseconds.
Something that works on both the heap and stack is ideal, but anything that works on either is still pretty good (with a preference towards heap allocation).
[0] This could be with operator new[] or malloc or perhaps some other method that is alignment-aware: whatever makes sense.
[1] As usual, N is a power of two.
[2] Yes, I understand an answer of this type causes language lawyers to become apoplectic, so if that's you, just ignore this part.
Logically, to satisfy "aligned to N, but not 2N", we align to 2N then add N to the pointer. Note that this will over-allocate N bytes.
So, assuming we want to allocate B bytes, if you just want stack space, alignas would work, perhaps.
alignas(N*2) char buffer[B+N];
char *p = buffer + N;
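A small sketch that checks the stack variant at run time, assuming (as the question does) that converting a pointer to uintptr_t yields its plain address. aligned_to_n_not_2n and stack_buffer_is_correct are hypothetical helpers; very large N may run into limits on how far a given compiler will realign the stack:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p is a multiple of n but not of 2n (n must be a power of two).
// Relies on uintptr_t reflecting the plain address, as the question assumes.
bool aligned_to_n_not_2n(const void* p, std::size_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return addr % n == 0 && addr % (2 * n) != 0;
}

// The stack version from the answer: align the array to 2N, then skip N bytes.
// B is the usable size, so the array itself is B + N bytes long.
template <std::size_t N, std::size_t B>
bool stack_buffer_is_correct() {
    alignas(N * 2) char buffer[B + N];
    char* p = buffer + N;
    p[B - 1] = 0;                 // the last usable byte is still inside buffer
    return aligned_to_n_not_2n(p, N);
}
```

When such a call appears inside assert(), it needs an extra pair of parentheses, because the comma in the template argument list would otherwise split the macro arguments.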
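std::aligned_storage might also do. As written, buffer is still a stack variable; for heap space you would allocate an ALIGNED_CHAR dynamically, and operator new is only guaranteed to honor over-aligned types from C++17 on: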
typedef std::aligned_storage<B+N,N*2>::type ALIGNED_CHAR;
ALIGNED_CHAR buffer;
char *p = reinterpret_cast<char *>(&buffer) + N;
I've not tested either out, but the documentation suggests it should be OK.
You can use _aligned_malloc(nbytes,alignment) (in MSVC) or _mm_malloc(nbytes,alignment) (on other compilers) to allocate (on the heap) nbytes of memory aligned to alignment bytes, which must be an integer power of two.
Then you can use the trick from Ken's answer to avoid alignment to 2N:
void* ptr_alloc = _mm_malloc(nbytes + N, 2 * N);
void* ptr = static_cast<void*>(static_cast<char*>(ptr_alloc) + N);
/* do your number crunching */
_mm_free(ptr_alloc);
We must keep the pointer returned by _mm_malloc() for later deallocation, which must be done via _mm_free().
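If neither _aligned_malloc nor _mm_malloc is available, the same align-to-2N-then-add-N trick can be done with plain std::malloc from C++11, at the cost of over-allocating 3N bytes instead of N. A sketch under the question's assumption that uintptr_t reflects the address; malloc_aligned_n_not_2n is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: returns nbytes of storage aligned to n but not to 2n,
// or nullptr on failure. *raw_out receives the pointer to pass to std::free().
char* malloc_aligned_n_not_2n(std::size_t nbytes, std::size_t n, void** raw_out) {
    assert(n != 0 && (n & (n - 1)) == 0);          // n must be a power of two
    void* raw = std::malloc(nbytes + 3 * n);       // up to 2n-1 bytes padding + n offset
    if (raw == nullptr)
        return nullptr;
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(raw);
    // Distance from raw to the next multiple of 2n (0 if already on one).
    std::size_t pad = static_cast<std::size_t>((2 * n - addr % (2 * n)) % (2 * n));
    *raw_out = raw;
    return static_cast<char*>(raw) + pad + n;      // a 2n boundary plus n
}
```

The pointer stored through raw_out, not the returned one, is what goes back to std::free(), the same rule as keeping ptr_alloc for _mm_free().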

What is QList's maximum size?

Has anyone encountered a maximum size for QList?
I have a QList of pointers to my objects and have found that it silently throws an error when it reaches the 268,435,455th item, which is exactly 28 bits. I would have expected it to have at least a 31bit maximum size (minus one bit because size() returns a signed integer), or a 63bit maximum size on my 64bit computer, but this doesn't appear to be the case. I have confirmed this in a minimal example by executing QList<void*> mylist; mylist.append(0); in a counting loop.
To restate the question, what is the actual maximum size of QList? If it's not actually 2^32-1 then why? Is there a workaround?
I'm running a Windows 64bit build of Qt 4.8.5 for MSVC2010.
While the other answers make a useful attempt at explaining the problem, none of them actually answers the question; they miss the point. Thanks to everyone for helping me track down the issue.
As Ali Mofrad mentioned, the error thrown is a std::bad_alloc error when the QList fails to allocate additional space in my QList::append(MyObject*) call. Here's where that happens in the Qt source code:
qlist.cpp: line 62:
static int grow(int size) //size = 268435456
{
    //this is the problem line
    volatile int x = qAllocMore(size * sizeof(void *), QListData::DataHeaderSize) / sizeof(void *);
    return x; //x = -2147483648
}
qlist.cpp: line 231:
void **QListData::append(int n) //n = 1
{
    Q_ASSERT(d->ref == 1);
    int e = d->end;
    if (e + n > d->alloc) {
        int b = d->begin;
        if (b - n >= 2 * d->alloc / 3) {
            //...
        } else {
            realloc(grow(d->alloc + n)); //<-- grow() is called here
        }
    }
    d->end = e + n;
    return d->array + e;
}
In grow(), the new size requested (268,435,456) is multiplied by sizeof(void*) (8) to compute the size of the new block of memory to accommodate the growing QList. The problem is, 268435456*8 equals +2,147,483,648 if it's an unsigned int32, or -2,147,483,648 for a signed int32, which is what's getting returned from grow() on my OS. Therefore, when std::realloc() is called in QListData::realloc(int), we're trying to grow to a negative size.
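The arithmetic can be checked in isolation. This sketch redoes the multiplication in 64 bits, where it is exact, and compares the result with INT_MAX, since grow() keeps its result in an int. The helper names are hypothetical, not Qt functions:

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// The multiplication from grow(): element count times sizeof(void*) (8 on a 64-bit build).
// Done in 64 bits here, where the product is exact; a 32-bit int cannot hold it.
std::int64_t requested_bytes(std::int64_t count, std::int64_t pointer_size) {
    assert(count >= 0 && pointer_size > 0);
    return count * pointer_size;
}

// grow() stores its result in an int, so any byte count above INT_MAX is out of reach.
bool fits_in_int(std::int64_t bytes) {
    return bytes <= INT_MAX;
}
```

Ignoring the header size that qAllocMore() also adds, 268,435,455 pointers still fit (2,147,483,640 bytes), while 268,435,456 pointers push the request past INT_MAX.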
The workaround here, as ddriver suggested, is to use QList::reserve() to pre-allocate the space, preventing my QList from ever having to grow.
In short, the maximum size for QList is 2^28-1 items unless you pre-allocate, in which case the maximum size truly is 2^31-1 as expected.
Update (Jan 2020): This appears to have changed in Qt 5.5, such that 2^28-1 is now the maximum size allowed for QList and QVector, regardless of whether or not you reserve in advance. A shame.
The theoretical maximum positive number stored in an int is 2^31 - 1. The size of a pointer is 4 bytes (on a 32-bit machine), so the maximum possible number of them is 2^29 - 1. Appending data to the container increases heap fragmentation, so it is possible that you can allocate only half of the possible memory. Try using reserve() or resize() instead.
Moreover, Win32 has some limits on memory allocation: an application compiled without special options cannot allocate more than this limit (1 GB or 2 GB).
Are you sure you need such huge containers? Would it be better to optimize the application?
QList stores its elements in a void * array.
Hence, a list with 2^28 items, each of which is a void *, will be 2^30 bytes long on a 32-bit machine, and 2^31 bytes on a 64-bit machine. I doubt you can request such a big chunk of contiguous memory.
And why allocate such a huge list anyhow? Are you sure you really need it?
The idea of being backed by an array of void * elements is that several operations on the list can be moved into non-templated code, reducing the amount of generated code.
QList stores items straight in the void * array if the type is small enough (i.e. sizeof(T) <= sizeof(void*)), and if the type can be moved in memory via memmove. Otherwise, each item will be allocated on the heap via new, and the array will store the pointers to those items. A set of type traits is used to figure out how to handle each type, see Q_DECLARE_TYPEINFO.
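To make the storage scheme concrete, here is a simplified, hypothetical analogue built on std::vector<void*>; it is not Qt's actual implementation. std::is_trivially_copyable stands in for Qt's Q_DECLARE_TYPEINFO traits, and the sketch needs C++17 for if constexpr:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>
#include <vector>

// Hypothetical analogue of the QList storage scheme: small, trivially copyable
// T lives inside a void* slot; anything else is heap-allocated and the slot
// holds a pointer to it.
template <typename T>
class SlotList {
public:
    static constexpr bool stored_inline =
        sizeof(T) <= sizeof(void*) && std::is_trivially_copyable<T>::value;

    SlotList() = default;
    SlotList(const SlotList&) = delete;             // copying omitted to keep the sketch short
    SlotList& operator=(const SlotList&) = delete;

    ~SlotList() {
        if constexpr (!stored_inline)
            for (void* slot : slots_)
                delete static_cast<T*>(slot);
    }

    void append(const T& value) {
        if constexpr (stored_inline) {
            void* slot = nullptr;
            std::memcpy(&slot, &value, sizeof(T));  // the value lives in the slot itself
            slots_.push_back(slot);
        } else {
            std::unique_ptr<T> owned(new T(value)); // one extra heap allocation per item
            slots_.push_back(owned.get());
            owned.release();                        // the slot now owns the object
        }
    }

    T at(std::size_t i) const {
        assert(i < slots_.size());
        if constexpr (stored_inline) {
            T value;
            std::memcpy(&value, &slots_[i], sizeof(T));
            return value;
        } else {
            return *static_cast<const T*>(slots_[i]);
        }
    }

    std::size_t size() const { return slots_.size(); }

private:
    std::vector<void*> slots_;  // one void* per element, whatever T is
};
```

A SlotList<char> spends sizeof(void*) bytes per one-byte element, and a SlotList of a 32-byte struct performs one new per append: the same two costs the answer goes on to list.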
While in theory this approach may sound attractive, in practice:
For all primitive types smaller than void * (char; int and float on 64 bit; etc.) you waste from 50 to 75% of the allocated space in the array
For all movable types bigger than void * (double on 32bit, QVariant, ...), you pay a heap allocation per each item in the list (plus the array itself)
QList code is generally less optimized than QVector one
Compilers these days do a pretty good job at merging template instantiations, hence the original reason for this design gets lost.
Today it's a much better idea to stick with QVector. Unfortunately, the Qt APIs expose QList everywhere and can't be changed (and we would need C++11 to define QList as a template alias for QVector...)
I tested this on Ubuntu 32-bit with 4 GB RAM using Qt 4.8.6. The maximum size for me is 268,435,450.
I tested this on Windows 7 32-bit with 4 GB RAM using Qt 4.8.4. The maximum size for me is 134,217,722.
This error happened: 'std::bad_alloc'
#include <QCoreApplication>
#include <QDebug>
int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    QList<bool> li;
    for (int i = 0; ; i++)
    {
        li.append(true);
        if (i > 268435449)
            qDebug() << i;
    }
    return a.exec();
}
Output is :
268435450
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

2D array access time comparison

I have two ways of constructing a 2D array:
int arr[NUM_ROWS][NUM_COLS];
//...
tmp = arr[i][j];
and a flattened array:
int arr[NUM_ROWS*NUM_COLS];
//...
tmp = arr[i*NUM_COLS+j];
I am doing image processing, so even a small improvement in access time matters. Which one is faster? I am thinking the first one, since the second one needs a calculation, but then the first one requires two levels of indexing, so I am not sure.
I don't think there is any performance difference. The system allocates the same amount of contiguous memory in both cases. The i*NUM_COLS+j calculation happens either way: you write it yourself for the 1D array, or the compiler generates it for the 2D array. The only concern is ease of use.
You should trust your compiler's ability to optimize standard code.
You should also trust modern CPUs to have fast integer multiplication instructions.
Don't worry about using one or the other!
Decades ago, I greatly optimized some code by using pointers instead of 2D array index calculations, but this is a) only useful if you can keep the pointer around, e.g. in a loop, and b) likely to have little impact, since I guess modern CPUs can do 2D array access in a single cycle? Worth measuring! It may also depend on the array size.
In any case, walking with pointers (ptr++ or ptr += NUM_COLS) will likely be a little faster, where applicable!
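A sketch showing that the two layouts address the same elements, plus the pointer-walking variant. The dimensions are arbitrary assumptions, and which loop is fastest on a given machine is something to measure; this code only shows that all three compute the same thing:

```cpp
#include <cassert>
#include <cstddef>

// Arbitrary example dimensions.
constexpr std::size_t NUM_ROWS = 4;
constexpr std::size_t NUM_COLS = 5;

// 2D indexing: the compiler computes the i*NUM_COLS + j offset itself.
long sum_2d(const int arr[][NUM_COLS]) {
    long total = 0;
    for (std::size_t i = 0; i < NUM_ROWS; ++i)
        for (std::size_t j = 0; j < NUM_COLS; ++j)
            total += arr[i][j];
    return total;
}

// Flattened indexing: the same offset, written out by hand.
long sum_flat(const int* flat) {
    assert(flat != nullptr);
    long total = 0;
    for (std::size_t i = 0; i < NUM_ROWS; ++i)
        for (std::size_t j = 0; j < NUM_COLS; ++j)
            total += flat[i * NUM_COLS + j];
    return total;
}

// Pointer walking: one increment per element and no multiplication in the loop.
long sum_ptr(const int* flat) {
    assert(flat != nullptr);
    long total = 0;
    for (const int *p = flat, *end = flat + NUM_ROWS * NUM_COLS; p != end; ++p)
        total += *p;
    return total;
}
```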
The first method will almost always be faster. IN GENERAL (because there are always corner cases) processor and memory architecture as well as compilers may have optimizations built in to aid with 2d arrays or other similar data structures. For example, GPUs are optimized for matrix (2d array) math.
So, again in general, I would allow the compiler and hardware to optimize your memory and address arithmetic if possible.
...also I agree with @Paul R: there are much bigger considerations when it comes to performance than your array allocation and address arithmetic.
There are two cases to consider: compile-time and run-time definition of the array size. There is a big difference in allocation performance.
Static allocation, global or file scope, fixed size array:
The compiler knows the size of the array and tells the linker to allocate space in the data / memory section. This is the fastest method.
Example:
#define ROWS 5
#define COLUMNS 6
int array[ROWS][COLUMNS];
int buffer[ROWS * COLUMNS];
Run time allocation, function local scope, fixed size array:
The compiler knows the size of the array, and tells the code to allocate space in the local memory (a.k.a. stack) for the array. In general, this means adding a value to a stack register. Usually one or two instructions.
Example:
void my_function(void)
{
    unsigned short my_array[ROWS][COLUMNS];
    unsigned short buffer[ROWS * COLUMNS];
}
Run Time allocation, dynamic memory, fixed size array:
Again, the compiler has already calculated the amount of memory required for the array since it was declared with fixed size. The compiler emits code to call the memory allocation function with the required amount (usually passed as a parameter). A little slower because of the function call and the overhead required to find some dynamic memory (and maybe garbage collection).
Example:
void another_function(void)
{
    unsigned char * array = new unsigned char [ROWS * COLUMNS];
    //...
    delete[] array;
}
Run Time allocation, dynamic memory, variable size:
Regardless of the dimensions of the array, the compiler must emit code to calculate the amount of memory to allocate. This quantity is then passed to the memory allocation function. A little slower than above because of the code required to calculate the size.
Example:
int * create_board(unsigned int rows, unsigned int columns)
{
    int * board = new int [rows * columns];
    return board;
}
Since your goal is image processing, I would assume your images are too large for static arrays. The question you should really be asking is about dynamically allocated arrays.
In C/C++ there are multiple ways to allocate a dynamic 2D array (see "How do I work with dynamic multi-dimensional arrays in C?"). To make this work in both C and C++ we can use malloc with casting (in C++ only, you can use new instead).
Method 1:
int** arr1 = (int**)malloc(NUM_ROWS * sizeof(int*));
for(int i=0; i<NUM_ROWS; i++)
    arr1[i] = (int*)malloc(NUM_COLS * sizeof(int));
Method 2:
int** arr2 = (int**)malloc(NUM_ROWS * sizeof(int*));
int* arrflat = (int*)malloc(NUM_ROWS * NUM_COLS * sizeof(int));
for (int i = 0; i < NUM_ROWS; i++)
    arr2[i] = arrflat + (i*NUM_COLS);
Method 2 essentially creates a contiguous 2D array, i.e. arrflat[NUM_COLS*i+j] and arr2[i][j] should have identical performance. However, arrflat[NUM_COLS*i+j] and arr1[i][j] from method 1 should not be expected to have identical performance, since arr1 is not contiguous. Method 1, however, seems to be the most commonly used method for dynamic arrays.
In general, I use arrflat[NUM_COLS*i+j] so I don't have to think about how to allocate dynamic 2D arrays.
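A C++ sketch of method 2 with the cleanup it needs. alloc_2d and free_2d are hypothetical names; because every row pointer points into one block, freeing takes exactly two calls:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Method 2: one contiguous block of ints plus a table of row pointers into it.
// Returns nullptr on allocation failure; release the result with free_2d().
int** alloc_2d(std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    int** row_ptrs = static_cast<int**>(std::malloc(rows * sizeof(int*)));
    int* flat = static_cast<int*>(std::malloc(rows * cols * sizeof(int)));
    if (row_ptrs == nullptr || flat == nullptr) {
        std::free(row_ptrs);             // free(nullptr) is a no-op, so this is safe either way
        std::free(flat);
        return nullptr;
    }
    for (std::size_t i = 0; i < rows; ++i)
        row_ptrs[i] = flat + i * cols;   // row i starts i*cols ints into the block
    return row_ptrs;
}

// Row 0 points at the start of the flat block, so two free() calls release everything.
void free_2d(int** row_ptrs) {
    if (row_ptrs != nullptr) {
        std::free(row_ptrs[0]);
        std::free(row_ptrs);
    }
}
```

Both a[i][j] and a[0][i * cols + j] then reach the same element, which is why method 2 behaves like the flat array.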

Reading different data types in shared memory

I want to share some memory between different processes running a DLL. Therefore I create a memory-mapped file with HANDLE hSharedFile = CreateFileMapping(...), then LPBYTE hSharedView = MapViewOfFile(...) and LPBYTE aux = hSharedView.
Now I want to read a bool, an int, a float and a char from the aux array. Reading a bool and a char is easy. But how would I go about reading an int or a float? Note that the int or float could start at position 9, i.e. a position that is not divisible by 4.
I know I can read a char[4] and then memcpy it into a float or int. But I really need this to be very fast. I am wondering if it is possible to do something with pointers?
Thanks in advance
If you know, for instance, that array elements aux[13..16] contain a float, then you can access this float in several ways:
float f = *(float*)&aux[13] ; // Makes a copy. The simplest solution.
float* pf = (float*)&aux[13] ; // Here you have to use *pf to access the float.
float& rf = *(float*)&aux[13] ; // Doesn't make a copy, and is probably what you want.
// (Just use rf to access the float.)
On x86/x64, at least, there is nothing wrong with grabbing an int at offset 9:
int* intptr = (int*) &aux[9];
int mynumber = *intptr;
There might be a really tiny performance penalty for this "unaligned" access, but it will still work correctly, and the chances of you noticing any differences are slim.
First of all, I think you should measure. There are three options you can go with that I can think of:
with unaligned memory
with memcpy into buffers
with custom-aligned memory
Unaligned memory will work fine; it will just be slower than aligned. How much slower is it, and does that matter to you? Measure to find out.
Copying into a buffer will trade off the slower unaligned accesses for additional copy operations. Measuring will tell you if it's worth it.
If using unaligned memory is too slow for you and you don't want to copy data around (perhaps because of the performance cost), then you can possibly do faster by wasting some memory space and increasing your program complexity. Don't use the mapped memory blindly: round your "base" pointer upwards to a suitable value (e.g. 8 bytes) and only do reads/writes at 8-byte increments of this "base" value. This will ensure that all your accesses will be aligned.
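Rounding the base pointer up can be done with a small helper. This is a sketch: align_up is a hypothetical name, the alignment is assumed to be a power of two, and up to alignment - 1 bytes at the start are skipped:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds p up to the next multiple of `alignment` (a power of two).
// Bytes between p and the returned pointer are simply left unused.
unsigned char* align_up(unsigned char* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    std::size_t pad = static_cast<std::size_t>((alignment - addr % alignment) % alignment);
    return p + pad;
}
```

With base = align_up(aux, 8), every value stored at base + 8 * k is then 8-byte aligned.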
But do measure before you go into all this trouble.
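For the memcpy option, a fixed-size std::memcpy into a local variable is the standard-conforming way to read an unaligned value, and with optimization enabled mainstream compilers typically turn it into a single load, so it is often no slower than the pointer cast (measuring still applies). A sketch; read_at and write_at are hypothetical helpers, and base corresponds to the question's LPBYTE aux (an unsigned char pointer):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Reads a T stored at base + offset, whether or not that address is aligned for T.
template <typename T>
T read_at(const unsigned char* base, std::size_t offset) {
    assert(base != nullptr);
    T value;
    std::memcpy(&value, base + offset, sizeof(T));   // byte copy: no alignment requirement
    return value;
}

// Writes value's bytes to base + offset, again without any alignment requirement.
template <typename T>
void write_at(unsigned char* base, std::size_t offset, const T& value) {
    assert(base != nullptr);
    std::memcpy(base + offset, &value, sizeof(T));
}
```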

Dynamically allocating and setting to zero an array of floats

How do I automatically set a dynamically allocated array of floats to zero (0.0) during allocation?
Is this OK?
float* delay_line = new float[filter_len];
//THIS
memset(delay_line, 0.0, filter_len); //can I do this for a float??
//OR THIS
for (int i = 0; i < filter_len; i++)
delay_line[i] = 0.0;
Which is the most efficient way?
Thanks
Use sizeof(float) * filter_len unless you are working in some odd implementation where sizeof(float) == sizeof(char).
memset(delay_line, 0, sizeof(float) * filter_len);
Edit: As Stephan202 points out in the comments, 0.0 is a particularly easy floating point value to code for memset since the IEEE standard representation for 0.0 is all zero bits.
memset operates in the realm of memory, not the realm of numbers. The second parameter, declared as an int, is converted to an unsigned char. If your implementation of C++ uses four bytes per float, the following relationships hold:
If you memset the float with 0, the value will be 0.0.
If you memset the float with 1, the value will be 2.36943e-38.
If you memset the float with 42, the value will be 1.51137e-13.
If you memset the float with 64, the value will be 3.00392.
So zero is a special case.
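The values above can be reproduced with a few lines. This sketch assumes, like the answer, a 4-byte IEEE 754 float; memset_float is a hypothetical helper:

```cpp
#include <cassert>
#include <cstring>

// Fills every byte of a float with `byte` via memset and returns the resulting value.
float memset_float(int byte) {
    assert(byte >= 0 && byte <= 255);
    float f;
    std::memset(&f, byte, sizeof f);   // e.g. 64 gives the bit pattern 0x40404040
    return f;
}
```

memset_float(0) gives 0.0f and memset_float(64) gives roughly 3.00392f, matching the list.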
If this seems peculiar, recall that memset is declared in <cstring> or <string.h>, and is often used for making things like "***************" or "------------------". That it can also be used to zero memory is a nifty side-effect.
As Milan Babuškov points out in the comments, there is a function bzero (nonstandard and deprecated), available for the moment on Mac and Linux but not Microsoft, which, because it is specially tailored to setting memory to zero, safely omits a few instructions. If you use it, and a puritanical future release of your compiler omits it, it is trivial to implement bzero yourself in a local compatibility patch, unless the future release has re-used the name for some other purpose.
Use:
#include <algorithm>
...
std::fill_n(delay_line, filter_len, 0.0f);
The elements of a dynamically allocated array can be initialized to the default value of the element type by following the array size by an empty pair of parentheses:
float* delay_line = new float[filter_len]();
Use a std::vector instead:
std::vector<float> delay_line( filter_len );
The vector will be zero initialised.
While we're at it: even better would be to use the vector class.
std::vector< float > delay_line( filter_len, 0.0 );
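Side by side, the standard options from these answers look like this. A sketch: the zero_with_* wrappers and all_zero are hypothetical, and each wrapper returns whether the buffer really came out as all 0.0f:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// True if all n floats starting at p compare equal to 0.0f.
bool all_zero(const float* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    return std::all_of(p, p + n, [](float x) { return x == 0.0f; });
}

bool zero_with_value_init(std::size_t filter_len) {
    float* delay_line = new float[filter_len]();   // "()" value-initializes: every element is 0.0f
    bool ok = all_zero(delay_line, filter_len);
    delete[] delay_line;
    return ok;
}

bool zero_with_fill_n(std::size_t filter_len) {
    float* delay_line = new float[filter_len];     // elements start out indeterminate
    std::fill_n(delay_line, filter_len, 0.0f);     // then each one is assigned 0.0f
    bool ok = all_zero(delay_line, filter_len);
    delete[] delay_line;
    return ok;
}

bool zero_with_vector(std::size_t filter_len) {
    std::vector<float> delay_line(filter_len);     // value-initialized and freed automatically
    return all_zero(delay_line.data(), delay_line.size());
}
```

Unlike memset, none of these depends on the bit pattern of 0.0f, since each sets the elements as float values.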
Another option is to use calloc to allocate and zero at the same time:
float *delay_line = (float *)calloc(filter_len, sizeof(float));
The advantage here is that, depending on your malloc implementation, it may be possible to avoid zeroing the array if it's known to be allocated from memory that's already zeroed (as pages allocated from the operating system often are).
Keep in mind that you must use free() rather than delete [] on such an array.
Which is the most efficient way
memset may be a tad faster, BUT WHO CARES!?!? Micro-optimization down to this level is a total waste of time, unless you're programming a calculator, and probably not even then.
I think the memset way is clearer, BUT I think you had really better check your man pages for memset... I'd be surprised if your version of the standard library has a memset function that takes a float as the second argument.
PS: The bit pattern representing zero is the same for both integers and floats... this is by design, not just good luck.
Good Luck ;-)
Cheers. Keith.