I have previously used SIMD operators to improve the efficiency of my code, but I am now facing a new error which I cannot resolve. For this task, speed is paramount.
The size of the array will not be known until the data is imported, and it may be very small (100 values) or enormous (10 million values). In the latter case the code works fine; however, I encounter an error when I use fewer than 130036 array values.
Does anyone know what is causing this issue and how to resolve it?
I have attached the (tested) code involved, which will be used later in a more complicated function. The error occurs at "arg1List[i] = ..."
#include <iostream>
#include <cstdio>
#include <xmmintrin.h>
#include <emmintrin.h>

int main()
{
    int j;
    const int loop = 130036;
    const int SIMDloop = loop / 4;

    __m128 *arg1List = new __m128[SIMDloop];
    printf("sizeof(arg1List)= %zu, alignof(arg1List)= %zu, pointer= %p",
           sizeof(arg1List), __alignof(arg1List), (void*)arg1List);
    std::cout << std::endl;

    for (int i = 0; i < SIMDloop; i++)
    {
        j = 4 * i;
        arg1List[i] = _mm_set_ps((j+1)/100.0f, (j+2)/100.0f, (j+3)/100.0f, (j+4)/100.0f);
    }
}
Alignment is the reason.
MOVAPS--Move Aligned Packed Single-Precision Floating-Point Values
[...] The operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated.
You can see the issue is gone as soon as you align your pointer:
__m128 *arg1List = new __m128[SIMDloop + 1];
arg1List = (__m128*) (((uintptr_t) arg1List + 15) & ~(uintptr_t)15);
I have the following C++ code. It should print 1. There is only one illegal memory access, and it is a read: for i=5 in the for loop we evaluate c[5]=cdata[5], and cdata[5] does not exist. All the writes are legal, i.e., the program never tries to write to a non-allocated array cell. If I compile it with g++ -O3, it prints 0. If I compile without -O3, it prints 1. It's the strangest error I've ever seen with g++.
Later edit to address some answers. The undefined behaviour should be limited to reading cdata[5] and writing the result to c[5]. The assignment c[5]=cdata[5] may read some unknown garbage from cdata[5], but that should only copy the garbage into c[5] without writing anything else anywhere!
#include <iostream>
#include <cstdlib>
using namespace std;

int n, d;
int *f, *c;

void loadData(){
    int fdata[] = {7, 2, 2, 7, 7, 1};
    int cdata[] = {66, 5, 4, 3, 2};
    n = 6;
    d = 3;
    f = new int[n+1];
    c = new int[n];
    f[0] = fdata[0];
    c[0] = cdata[0];
    for (int i = 1; i < n; i++){
        f[i] = fdata[i];
        c[i] = cdata[i];
    }
    cout << f[5] << endl;
}

int main(){
    loadData();
}
It will be very hard to find out exactly what is happening in your code before and after the optimisation. As you already knew and pointed out yourself, you are trying to go out of bounds of an array, which leads to undefined behaviour.
However, you are curious about why it is (apparently) the -O3 flag which "causes" the issue!
Let's start by saying that it is actually -fpeel-loops (enabled on top of -O) which causes your code to be re-organised in a way that makes your error apparent. -O3 enables several optimisation flags, among which -fpeel-loops.
You can read here about what the compiler flags are at which stage of optimisation.
In a nutshell, -fpeel-loops will re-organise the loop so that the first and last couple of iterations are taken out of the loop itself, and some variables may be optimised away entirely. Small loops may even be taken apart completely!
With this said and considered, try running this piece of code, with -O -fpeel-loops or even with -O3:
#include <iostream>
#include <cstdlib>
using namespace std;

int n, d;
int *f, *c;

void loadData(){
    int fdata[] = {7, 2, 2, 7, 7, 1};
    int cdata[] = {66, 5, 4, 3, 2};
    n = 6;
    d = 3;
    f = new int[n+1];
    c = new int[n];
    f[0] = fdata[0];
    c[0] = cdata[0];
    for (int i = 1; i < n; i++){
        cout << f[i];
        f[i] = fdata[i];
        c[i] = cdata[i];
    }
    cout << "\nFINAL F[5]:" << f[5] << endl;
}

int main(){
    loadData();
}
You will see that it will print 1 regardless of your optimisation flags.
That is because of the statement cout << f[i], which changes the way -fpeel-loops operates.
Also, experiment with this block of code:
f[0] = fdata[0];
c[0] = cdata[0];
c[1] = cdata[1];
c[2] = cdata[2];
c[3] = cdata[3];
c[4] = cdata[4];
for (int i = 1; i < n; i++) {
    f[i] = fdata[i];
    c[5] = cdata[5];
}
cout << "\nFINAL F[5]:" << f[5] << endl;
You will see that even in this case, with all your optimisation flags, the output is 1 and not 0. Even in this case:
for (int i = 1; i < n; i++) {
    c[i] = cdata[i];
}
for (int i = 1; i < n; i++) {
    f[i] = fdata[i];
}
The produced output is again 1. This is, once more, because we have changed the structure of the loop, and -fpeel-loops can no longer reorganise it in the way that produced the error. The same happens if you wrap it into a while loop:
int i = 1;
while (i < 6) {
    f[i] = fdata[i];
    c[i] = cdata[i];
    i++;
}
(Although with a while loop, -O3 will prevent compilation here because of -Waggressive-loop-optimizations, so you should test it with -O -fpeel-loops.)
So, we can't really know for sure how your compiler is reorganising that loop for you; however, it is using the so-called as-if rule to do so, and you can read more about it here.
Of course, your compiler takes the liberty of refactoring that code for you based on the assumption that you abide by the rules. Under the as-if rule the compiler will always produce the desired output, provided that the programmer has not caused undefined behaviour. In our case we have indeed broken the rules by going out of bounds of that array; the strategy with which the loop was refactored was built on the premise that this could not happen, but we made it happen.
So no, it is not as simple and straightforward as saying that all your problems are created by reading an unallocated memory address. That is of course ultimately why the problem manifested itself, but you were reading cdata out of bounds, which is a different array! So it is not the mere out-of-bounds read that is causing the issue; if you do it like this:
int i = 0;
f[i] = fdata[i];
c[i] = cdata[i];
i++;
f[i] = fdata[i];
c[i] = cdata[i];
i++;
f[i] = fdata[i];
c[i] = cdata[i];
i++;
f[i] = fdata[i];
c[i] = cdata[i];
i++;
f[i] = fdata[i];
c[i] = cdata[i];
i++;
f[i] = fdata[i];
c[i] = cdata[i];
i++;
It will work and print 1, even though we definitely go out of bounds of that cdata array! So it is not the mere fact of reading an unallocated memory address that is causing the issue.
Once again, it is actually the loop-refactoring strategy of -fpeel-loops, which assumed that you would not go out of bounds of that array and changed your loop accordingly.
Lastly, the fact that optimisation flags lead you to an output of 0 is strictly tied to your machine. That 0 has no meaning; it is not the product of an actual logical operation, but a junk value found at a non-allocated memory address, or the product of an operation performed on such a junk value.
If you want to discover why this happens, you can examine the generated assembly code. Here it is in Compiler Explorer: https://godbolt.org/z/bs7ej7
Note that with -O3, the generated assembly code is not exactly easy to read.
As you mentioned, you have a memory problem in your code.
cdata has 5 members, so its maximum valid index is 4.
But in the for loop, when i = 5, invalid data is read by this statement, which causes undefined behaviour:
c[i] = cdata[i];
Please modify your code like this and try again:
for (int i = 1; i < n; i++)
    f[i] = fdata[i];

for (int i = 1; i < 5; i++)
    c[i] = cdata[i];
The answer is simple - you are accessing memory out of bounds of the array. Who knows what is there? It could be some other variable in your program, it could be random garbage, it is completely undefined.
So, when you run the program, sometimes you just so happen to have valid data at that address, and other times you do not.
It is purely coincidental whether the data you expect is there or not; what the compiler does determines what data is there, i.e., it is completely unpredictable.
As for memory access, reading invalid memory is still a big no-no. So is reading potentially uninitialized values. You are doing both. They may not cause a segmentation fault, but they are still big no-no's.
My recommendation is use valgrind or a similar memory checking tool. If you run this code through valgrind, it will yell at you. With this, you can catch memory errors in your code that the compiler might not catch. This memory error is obvious, but some are hard to track down and almost all lead to undefined behavior, which makes you want to pull your teeth out with a rusty pair of pliers.
In short, don't access out of bounds elements and pray that the answer you are looking for just so happens to exist at that memory address.
I'm trying to initialize an integer array and set all elements to 1. I need the array to have an upper bound of 4294967295, or the maximum number possible for a 32-bit unsigned int.
This seems like a trivial task to me, and it should be, but I am running into segfault. I can run the for loop empty and it seems to work fine (albeit slowly, but it's processing nearly 4.3 billion numbers so I won't complain). The problem seems to show up when I try to perform any kind of action within the loop. The instruction I have below - primeArray[i] = 1; - causes the segfault error. As best as I can tell, this shouldn't cause me to overrun the array. If I comment out that line, no segfault.
It's late and my tired eyes are probably just missing something simple, but I could use another pair.
Here's what I've got:
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <stdint.h>

#define LIMIT 0xFFFFFFFF

int main(int argc, char const *argv[])
{
    uint32_t i;
    uint32_t numberOfPrimes = LIMIT;  // hardcoded for debugging

    int *primeArray = (int*) malloc(numberOfPrimes * sizeof(int));

    for (i = 0; i < numberOfPrimes; ++i) {
        primeArray[i] = 1;
    }
}
Check the return value from malloc() to make sure the array was actually allocated. I suspect that the following test would fail:
int *primeArray = (int*) malloc(numberOfPrimes * sizeof(int));

if (primeArray != NULL) {  /* check that the array was allocated */
    for (i = 0; i < numberOfPrimes; ++i) {
        primeArray[i] = 1;
    }
}
Your malloc call requests 16 gigabytes of memory from the system. If you don't have that much free virtual memory, or if you are running on any 32-bit system, the call will fail. If you don't check for the failure of malloc, as your code doesn't, the array will be NULL and any subsequent access to its elements will cause a segmentation fault.
If you really need to work with an array that large, you will either need to get a 64-bit system with a lot of memory, or rewrite your program to work with a smaller working set, and persist the rest to disk.
Does anyone see anything obvious about the loop code below that I'm not seeing as to why this cannot be auto-vectorized by VS2012's C++ compiler?
All the compiler gives me is info C5002: loop not vectorized due to reason '1200' when I use the /Qvec-report:2 command-line switch.
Reason 1200 is documented in MSDN as:
Loop contains loop-carried data dependences that prevent
vectorization. Different iterations of the loop interfere with each
other such that vectorizing the loop would produce wrong answers, and
the auto-vectorizer cannot prove to itself that there are no such data
dependences.
I know (or I'm pretty sure that) there aren't any loop-carried data dependencies but I'm not sure what's preventing the compiler from realizing this.
These source and dest pointers do not ever overlap nor alias the same memory and I'm trying to provide the compiler with that hint via __restrict.
pitch is always a positive integer value, something like 4096, depending on the screen resolution, since this is a 8bpp->32bpp rendering/conversion function, operating column-by-column.
byte * __restrict source;
DWORD * __restrict dest;
int pitch;
for (int i = 0; i < count; ++i) {
    dest[(i*2*pitch)+0] = (source[(i*8)+0]);
    dest[(i*2*pitch)+1] = (source[(i*8)+1]);
    dest[(i*2*pitch)+2] = (source[(i*8)+2]);
    dest[(i*2*pitch)+3] = (source[(i*8)+3]);
    dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
    dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
    dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
    dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
The parens around each source[] are remnants of a function call which I have elided here, because the loop still won't be auto-vectorized without the function call, even in its simplest form.
EDIT:
I've simplified the loop to its most trivial form that I can:
for (int i = 0; i < 200; ++i) {
    dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
This still produces the same 1200 reason code.
EDIT (2):
This minimal test case with local allocations and identical pointer types still fails to auto-vectorize. I'm simply baffled at this point.
const byte * __restrict source;
byte * __restrict dest;

source = (const byte * __restrict) new byte[1600];
dest   = (byte * __restrict) new byte[1600];

for (int i = 0; i < 200; ++i) {
    dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
Let's just say there's more than just a couple of things preventing this loop from vectorizing...
Consider this:
int main(){
    byte *source = new byte[1000];
    DWORD *dest = new DWORD[1000];

    for (int i = 0; i < 200; ++i) {
        dest[(i*2*4096)+0] = (source[(i*8)+0]);
    }

    for (int i = 0; i < 200; ++i) {
        dest[i*2*4096] = source[i*8];
    }

    for (int i = 0; i < 200; ++i) {
        dest[i*8192] = source[i*8];
    }

    for (int i = 0; i < 200; ++i) {
        dest[i] = source[i];
    }
}
Compiler Output:
main.cpp(10) : info C5002: loop not vectorized due to reason '1200'
main.cpp(13) : info C5002: loop not vectorized due to reason '1200'
main.cpp(16) : info C5002: loop not vectorized due to reason '1203'
main.cpp(19) : info C5002: loop not vectorized due to reason '1101'
Let's break this down:
The first two loops are the same. So they give the original reason 1200 which is the loop-carried dependency.
The 3rd loop is the same as the 2nd loop. Yet the compiler gives a different reason 1203:
Loop body includes non-contiguous accesses into an array
Okay... Why a different reason? I dunno. But this time, the reason is correct.
The 4th loop gives 1101:
Loop contains a non-vectorizable conversion operation (may be implicit)
So VC++ isn't smart enough to issue the SSE4.1 pmovzxbd instruction.
That's a pretty niche case, I wouldn't have expected any modern compiler to be able to do this. And if it could, you'd need to specify SSE4.1.
So the only thing that's out of the ordinary is why the initial loop reports a loop-carried dependency. Well, that's a tough call... I would go so far as to say that the compiler just isn't issuing the correct reason. (When it really should be non-contiguous access.)
Getting back to the point, I wouldn't expect MSVC or any compiler to be able to vectorize your original loop. Your original loop has accesses grouped in chunks of 4 - which makes it contiguous enough to vectorize. But it's a long-shot to expect the compiler to be able to recognize that.
So if it matters, I suggest manually vectorizing this loop. The intrinsic that you will need is _mm_cvtepu8_epi32().
Your original loop:
for (int i = 0; i < count; ++i) {
    dest[(i*2*pitch)+0] = (source[(i*8)+0]);
    dest[(i*2*pitch)+1] = (source[(i*8)+1]);
    dest[(i*2*pitch)+2] = (source[(i*8)+2]);
    dest[(i*2*pitch)+3] = (source[(i*8)+3]);
    dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
    dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
    dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
    dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
vectorizes as follows:
for (int i = 0; i < count; ++i) {
    __m128i s0 = _mm_loadl_epi64((__m128i*)(source + i*8));
    __m128i s1 = _mm_unpackhi_epi64(s0, s0);
    *(__m128i*)(dest + (i*2 + 0)*pitch) = _mm_cvtepu8_epi32(s0);
    *(__m128i*)(dest + (i*2 + 1)*pitch) = _mm_cvtepu8_epi32(s1);
}
Disclaimer: This is untested and ignores alignment.
From the MSDN documentation, a case in which an error 1203 would be reported
void code_1203(int *A)
{
    // Code 1203 is emitted when non-vectorizable memory references
    // are present in the loop body. Vectorization of some non-contiguous
    // memory access is supported - for example, the gather/scatter pattern.
    for (int i = 0; i < 1000; ++i)
    {
        A[i] += A[0] + 1;     // constant memory access not vectorized
        A[i] += A[i*2+2] + 2; // non-contiguous memory access not vectorized
    }
}
It really could be the computations at the indexes that mess with the auto-vectorizer. Funny the error code shown is not 1203, though.
MSDN Parallelizer and Vectorizer Messages
Consider the following code snippet
double *x, *id;
int i, n;  // n = vector size

// allocate and zero x
// set id to 0:n-1

for(i=0; i<n; i++) {
    long iid = (long)id[i];
    if(iid>=0 && iid<n && (double)iid==id[i]){
        x[iid] = 1;
    } else break;
}
The code uses values in the vector id, of type double, as indices into the vector x. For the indices to be valid I verify that they are greater than or equal to 0, less than the vector size n, and that the doubles stored in id are in fact integers. In this example id stores the integers from 0 to n-1, so all vectors are accessed linearly and branch prediction of the if statement should always work.
For n=1e8 the code takes 0.21s on my computer. Since this seems to be a computationally lightweight loop, I expect it to be memory bandwidth bound. Based on the benchmarked memory bandwidth I expect it to run in 0.15s. I calculate the memory footprint as 8 bytes per id value, and 16 bytes per x value (it needs to be both written and read from memory, since I assume SSE streaming is not used), so a total of 24 bytes per vector entry.
The questions:
Am I wrong in saying that this code should be memory bandwidth bound, and that it can be improved?
If not, do you know a way in which I could improve the performance so that it works with the speed of the memory?
Or maybe everything is fine and I cannot easily improve it other than by running it in parallel?
Changing the type of id is not an option - it must be double. Also, in the general case id and x have different sizes and must be kept as separate arrays - they come from different parts of the program. In short, I wonder if it is possible to write the bound checks and the type cast/integer validation in a more efficient manner.
For convenience, the entire code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static struct timeval tb, te;

void tic()
{
    gettimeofday(&tb, NULL);
}

void toc(const char *idtxt)
{
    long s, u;
    gettimeofday(&te, NULL);
    s = te.tv_sec  - tb.tv_sec;
    u = te.tv_usec - tb.tv_usec;
    printf("%-30s%10li.%.6li\n", idtxt,
           (s*1000000+u)/1000000, (s*1000000+u)%1000000);
}
int main(int argc, char *argv[])
{
    double *x  = NULL;
    double *id = NULL;
    int i, n;

    // vector size is a command line parameter
    n = atoi(argv[1]);
    printf("x size %i\n", n);

    // not included in timing in MATLAB
    x = calloc(sizeof(double), n);
    memset(x, 0, sizeof(double)*n);

    // create index vector
    tic();
    id = malloc(sizeof(double)*n);
    for(i=0; i<n; i++) id[i] = i;
    toc("id = 1:n");

    // use id to index x and set all entries to 1
    tic();
    for(i=0; i<n; i++) {
        long iid = (long)id[i];
        if(iid>=0 && iid<n && (double)iid==id[i]){
            x[iid] = 1;
        } else break;
    }
    toc("x(id) = 1");
}
EDIT: Disregard if you can't split the arrays!
I think it can be improved by taking advantage of a common cache concept. You can either make data accesses close in time or location. With tight for-loops, you can achieve a better data hit-rate by shaping your data structures like your for-loop. In this case, you access two different arrays, usually the same indices in each array. Your machine is loading chunks of both arrays each iteration through that loop. To increase the use of each load, create a structure to hold an element of each array, and create a single array with that struct:
struct my_arrays
{
    double x;
    int id;
};

struct my_arrays* arr = malloc(sizeof(struct my_arrays)*n);
Now, each time you load data into cache, you'll hit everything you load because the arrays are close together.
EDIT: Since your intent is to check for an integer value, and you make the explicit assumption that the values are small enough to be represented precisely in a double with no loss of precision, then I think your comparison is fine.
My previous answer contained a warning about comparing large doubles after implicit casting, for which I referenced this:
What is the most effective way for float and double comparison?
It might be worth considering examination of double type representation.
For example, the following code shows how to compare a double number greater than 1 to 999:
#include <cstdint>

bool check(double x)
{
    union
    {
        double d;
        uint32_t y[2];
    };
    d = x;

    bool answer;
    uint32_t exp = (y[1] >> 20) & 0x3ff;
    uint32_t fraction1 = y[1] << (13 + exp); // upper bits of fractional part
    uint32_t fraction2 = y[0];               // lower 32 bits of fractional part

    if (fraction2 != 0 || fraction1 != 0)
        answer = false;
    else if (exp > 8)
        answer = false;
    else if (exp == 8)
        answer = (y[1] < 0x408f3800);        // this is the representation of 999
    else
        answer = true;

    return answer;
}
This looks like much code, but it might be vectorized easily (using e.g. SSE), and if your bound is a power of 2, it might simplify the code further.
I'm trying to sum the elements of an array indexed by another array using the Thrust library, but I couldn't find an example. In other words, I want to implement Matlab's syntax
sum(x(indices))
Here is a guideline code trying to point out what do I like to achieve:
#define N 65536

// device array copied using cudaMemcpyToSymbol
__device__ int global_array[N];

// function to implement with thrust
__device__ int support(unsigned short* _memory, int _memSizeShort)
{
    int support = 0;
    for(int i = 0; i < _memSizeShort; i++)
        support += global_array[_memory[i]];
    return support;
}
Also, from the host code, can I use the global_array[N] without copying it back with cudaMemcpyFromSymbol ?
Every comment/answer is appreciated :)
Thanks
This is a very late answer, provided here to remove this question from the unanswered list. I'm sure that the OP has already found a solution (since May 2012 :-)), but I believe that the following could be useful to other users.
As pointed out by @talonmies, the problem can be solved by a fused gather-reduction. The solution is an application of Thrust's permutation_iterator and reduce. The permutation_iterator allows one to (implicitly) reorder the target array x according to the indices in the indices array; reduce then sums the (implicitly) reordered array.
This application is part of Thrust's documentation, reported below for convenience.
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <iostream>

// this example fuses a gather operation with a reduction for
// greater efficiency than separate gather() and reduce() calls
int main(void)
{
    // gather locations
    thrust::device_vector<int> map(4);
    map[0] = 3;
    map[1] = 1;
    map[2] = 0;
    map[3] = 5;

    // array to gather from
    thrust::device_vector<int> source(6);
    source[0] = 10;
    source[1] = 20;
    source[2] = 30;
    source[3] = 40;
    source[4] = 50;
    source[5] = 60;

    // fuse gather with reduction:
    // sum = source[map[0]] + source[map[1]] + ...
    int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()),
                             thrust::make_permutation_iterator(source.begin(), map.end()));

    // print sum
    std::cout << "sum is " << sum << std::endl;

    return 0;
}
In the above example, map plays the role of indices, while source plays the role of x.
Concerning the additional question in your comment (iterating over a reduced number of terms), it will be sufficient to change the following line
int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()),
                         thrust::make_permutation_iterator(source.begin(), map.end()));
to
int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()),
                         thrust::make_permutation_iterator(source.begin(), map.begin()+N));
if you want to iterate only over the first N terms of the indexing array map.
Finally, concerning the possibility of using global_array from the host: you should notice that this is an array residing on the device, so you do need a cudaMemcpyFromSymbol to move it to the host first.