Visual Studio 2019 C6385 / C6386 (buffer overrun warning) on __m256 array - c++

I'm allocating an array as follows:
__m256 *v256f_valid_mask = (__m256*)malloc(sizeof(__m256) * p_ranks);
The compiler shows warning C6385 / C6386 (depending on the exact context) on every line where I access this array, except at index [0], indicating that 64 bytes may be read. The definition clearly states it's an array of 32-byte values.
Using _aligned_malloc() doesn't help.
Sample code to reproduce the warning:
void func(const size_t p_ranks)
{
__m256 v256f_x = _mm256_set1_ps(1.0f);
__m256* v256f_valid_mask = (__m256*)malloc(sizeof(__m256) * p_ranks);
for (size_t rank = 1; rank < p_ranks; rank++)
{
v256f_valid_mask[rank] = _mm256_cmp_ps(v256f_x, _mm256_setzero_ps(), _CMP_GT_OQ); // <<
}
}
I fixed the C6011 warning with a null check.
Is there an error in my code or is this a false positive?

It is a false positive, but the code analyser doesn't know it (probably because it doesn't 'trust' the malloc() call). Using 'new' instead clears the warning (at least in my VS2019 solution):
void func(const size_t p_ranks)
{
__m256 v256f_x = _mm256_set1_ps(1.0f);
// __m256* v256f_valid_mask = (__m256*)malloc(sizeof(__m256) * p_ranks);
#if defined(__cplusplus)
__m256* v256f_valid_mask = new __m256[p_ranks];
#else
#define MAXRANKS 100 // Would probably be defined elsewhere!
__m256 v256f_valid_mask[MAXRANKS];
#endif
for (size_t rank = 1; rank < p_ranks; rank++)
{
v256f_valid_mask[rank] = _mm256_cmp_ps(v256f_x, _mm256_setzero_ps(), _CMP_GT_OQ); // <<
}
}
Please try and see!

Related

OpenCL result changes with arbitrary code alterations that are not related

This is a very strange issue. I'm working on a GPU-based crypto miner and I have an issue with a SHA hash function.
1 - The initial function calls a SHA256 routine and then prints the results. I'm comparing those results to a CPU-based SHA256 to make sure I get the same thing.
2 - Later on in the function, other operations occur, such as addition, XOR, and additional SHA rounds.
As part of the miner kernel, I wrote an auxiliary function to decompose an array of 8 uints into an array of 32 unsigned chars, using an AND mask and bit shifts.
I'm calling the kernel with a global/local work size of 1.
So here's where things get really strange. The part I am comparing is the very first SHA. I get a buffer of 80 bytes in, SHA it, and then print the result. It matches under certain conditions. However, if I make changes to the code that executes AFTER that SHA, then it doesn't match. This is what I've been able to narrow down:
1 - If I put a printf debug in the decomposition auxiliary function, the results match. Just removing that printf causes it to mismatch.
2 - There are 4 operations I use to decompose the uint into char. I tried lots of different ways to do this with the same result. However, if I remove any 1 of the 4 "for" loops in the routine, it matches. Simply removing a for loop in code that gets executed -after- the initial code, changes the result of the initial SHA.
3 - If I change my while loop to never execute then it matches. Again, this is all -after- the initial SHA comparison.
4 - If I remove all the calls to the auxiliary function, then it matches. Simply calling the function after the initial SHA causes a mismatch.
I've tried adding memory guards everywhere; however, given that it's 1 global and 1 local work unit, I don't see how that could apply.
I'd love to debug this, but apparently OpenCL cannot be debugged in VS 2019 (really?).
Any thoughts, guesses, insight would be appreciated.
Thanks!
inline void loadUintHash ( __global unsigned char* dest, __global uint* src) {
//**********if I remove this it doesn't work
printf ("src1 %08x%08x%08x%08x%08x%08x%08x%08x",
src[0],
src[1],
src[2],
src[3],
src[4],
src[5],
src[6],
src[7]
);
//**********if I take away any one of these 4 for loops, then it works
for ( int i = 0; i < 8; i++)
dest[i*4+3] = (src[i] & 0xFF000000) >> 24;
for ( int i = 0; i < 8; i++)
dest[i*4+2] = (src[i] & 0x00FF0000) >> 16;
for ( int i = 0; i < 8; i++)
dest[i*4+1] = (src[i] & 0x0000FF00) >> 8;
for ( int i = 0; i < 8; i++)
dest[i*4] = (src[i] & 0x000000FF);
//**********if I remove this it doesn't work
printf ("src2 %08x%08x%08x%08x%08x%08x%08x%08x",
src[0],
src[1],
src[2],
src[3],
src[4],
src[5],
src[6],
src[7]
);
}
#define HASHOP_ADD 0
#define HASHOP_XOR 1
#define HASHOP_SHA_SINGLE 2
#define HASHOP_SHA_LOOP 3
#define HASHOP_MEMGEN 4
#define HASHOP_MEMADD 5
#define HASHOP_MEMXOR 6
#define HASHOP_MEM_SELECT 7
#define HASHOP_END 8
__kernel void dyn_hash (__global uint* byteCode, __global uint* memGenBuffer, int memGenSize, __global uint* hashResult, __global char* foundFlag, __global unsigned char* header, __global unsigned char* shaScratch) {
int computeUnitID = get_global_id(0);
__global uint* myMemGen = &memGenBuffer[computeUnitID * memGenSize * 8]; //each memGen unit is 256 bits, i.e. 8 uints (32 bytes)
__global uint* myHashResult = &hashResult[computeUnitID * 8];
__global char* myFoundFlag = foundFlag + computeUnitID;
__global unsigned char* myHeader = header + (computeUnitID * 80);
__global unsigned char* myScratch = shaScratch + (computeUnitID * 32);
sha256 ( computeUnitID, 80, myHeader, myHashResult );
//**********this is the result I am comparing
if (computeUnitID == 0) {
printf ("gpu first sha uint %08x%08x%08x%08x%08x%08x%08x%08x",
myHashResult[0],
myHashResult[1],
myHashResult[2],
myHashResult[3],
myHashResult[4],
myHashResult[5],
myHashResult[6],
myHashResult[7]
);
}
uint linePtr = 0;
uint done = 0;
uint currentMemSize = 0;
uint instruction = 0;
//**********if I change this to done == 1, then it works
while (done == 0) {
if (byteCode[linePtr] == HASHOP_ADD) {
linePtr++;
uint arg1[8];
for ( int i = 0; i < 8; i++)
arg1[i] = byteCode[linePtr+i];
linePtr += 8;
}
else if (byteCode[linePtr] == HASHOP_XOR) {
linePtr++;
uint arg1[8];
for ( int i = 0; i < 8; i++)
arg1[i] = byteCode[linePtr+i];
linePtr += 8;
}
else if (byteCode[linePtr] == HASHOP_SHA_SINGLE) {
linePtr++;
}
else if (byteCode[linePtr] == HASHOP_SHA_LOOP) {
printf ("HASHOP_SHA_LOOP");
linePtr++;
uint loopCount = byteCode[linePtr];
for ( int i = 0; i < loopCount; i++) {
loadUintHash(myScratch, myHashResult);
sha256 ( computeUnitID, 32, myScratch, myHashResult );
if (computeUnitID == 1) {
loadUintHash(myScratch, myHashResult);
... more irrelevant code...
This is how the kernel is being called:
size_t globalWorkSize = 1;// computeUnits;
size_t localWorkSize = 1;
returnVal = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, NULL);
The issue ended up being multiple things. 1 - The CPU SHA had a bug that was causing an incorrect result in some cases. 2 - There was a very strange syntax error which seems to have broken the compiler in a weird way:
void otherproc () {
...do stuff...
}
if (something) {/
...other code
}
That forward slash after the opening curly brace was messing up "otherproc" in a weird way, and the compiler did not throw an error. After staring at the code line by line I found that slash, removed it, and everything started working.
If anyone is interested, the working implementation of a GPU miner can be found here:
https://github.com/dynamofoundation/dyn_miner

Vectorizing sparse matrix vector product with Compressed Sparse Row SegFault [duplicate]

I have the following function:
template <typename T>
void SSE_vectormult(T * A, T * B, int size)
{
__m128d a;
__m128d b;
__m128d c;
double A2[2], B2[2], C[2];
const double * A2ptr, * B2ptr;
A2ptr = &A2[0];
B2ptr = &B2[0];
a = _mm_load_pd(A);
for(int i = 0; i < size; i+=2)
{
std::cout << "In SSE_vectormult: i is: " << i << '\n';
A2[0] = A[i];
B2[0] = B[i];
A2[1] = A[i+1];
B2[1] = B[i+1];
std::cout << "Values from A and B written to A2 and B2\n";
a = _mm_load_pd(A2ptr);
b = _mm_load_pd(B2ptr);
std::cout << "Values converted to a and b\n";
c = _mm_mul_pd(a,b);
_mm_store_pd(C, c);
A[i] = C[0];
A[i+1] = C[1];
};
// const int mask = 0xf1;
// __m128d res = _mm_dp_pd(a,b,mask);
// r1 = _mm_mul_pd(a, b);
// r2 = _mm_hadd_pd(r1, r1);
// c = _mm_hadd_pd(r2, r2);
// c = _mm_scale_pd(a, b);
// _mm_store_pd(A, c);
}
When I am calling it on Linux, everything is fine, but when I am calling it on a windows OS, my program crashes with "program is not working anymore". What am I doing wrong, and how can I determine my error?
Your data is not guaranteed to be 16 byte aligned as required by SSE loads. Either use _mm_loadu_pd:
a = _mm_loadu_pd(A);
...
a = _mm_loadu_pd(A2ptr);
b = _mm_loadu_pd(B2ptr);
or make sure that your data is correctly aligned where possible, e.g. for static or locals:
alignas(16) double A2[2], B2[2], C[2]; // C++11, or C11 with <stdalign.h>
or without C++11, using compiler-specific language extensions:
__attribute__ ((aligned(16))) double A2[2], B2[2], C[2]; // gcc/clang/ICC/et al
__declspec (align(16)) double A2[2], B2[2], C[2]; // MSVC
You could use #ifdef to #define an ALIGN(x) macro that works on the target compiler.
Let me try to answer why your code works on Linux and not Windows. Code compiled in 64-bit mode has the stack aligned to 16 bytes. However, code compiled in 32-bit mode is only 4-byte aligned on Windows and is not guaranteed to be 16-byte aligned on Linux.
GCC defaults to 64-bit mode on 64-bit systems, whereas MSVC defaults to 32-bit mode even on 64-bit systems. So I'm going to guess that you did not compile your code in 64-bit mode on Windows; _mm_load_pd and _mm_store_pd both need 16-byte-aligned addresses, so the code crashes.
You have at least three different solutions to get your code working in Windows as well.
Compile your code in 64 bit mode.
Use unaligned loads and stores (e.g. _mm_storeu_pd)
Align the data yourself as Paul R suggested.
The best solution is the third solution since then your code will work on 32 bit systems and on older systems where unaligned loads/stores are much slower.
If you look at http://msdn.microsoft.com/en-us/library/cww3b12t(v=vs.90).aspx you can see that the function _mm_load_pd is defined as:
__m128d _mm_load_pd (double *p);
So in your code A should be of type double, but A is of type T, a template parameter. Make sure that you are calling your SSE_vectormult function with the right template parameters, or just remove the template and use the double type instead.

Whats the correct replacement for posix_memalign in Windows?

I'm currently trying to build word2vec on Windows, but there are problems with the posix_memalign() function. Everyone suggests using _aligned_malloc(), but the number of parameters is different. So what's the best equivalent for posix_memalign() on Windows?
Thanks everyone. Based on code I found in some repository and your advice, I built the EXE successfully. Here is the code I used:
#ifdef _WIN32
static int check_align(size_t align)
{
for (size_t i = sizeof(void *); i != 0; i *= 2)
if (align == i)
return 0;
return EINVAL;
}
int posix_memalign(void **ptr, size_t align, size_t size)
{
if (check_align(align))
return EINVAL;
int saved_errno = errno;
void *p = _aligned_malloc(size, align);
if (p == NULL)
{
errno = saved_errno;
return ENOMEM;
}
*ptr = p;
return 0;
}
#endif
UPDATE:
Looks like @alk suggested the best solution for this problem:
#define posix_memalign(p, a, s) (((*(p)) = _aligned_malloc((s), (a))), *(p) ?0 :errno)
_aligned_malloc() should be a decent replacement for posix_memalign(). The arguments differ because posix_memalign() returns an error code rather than setting errno on failure; otherwise they are the same:
void* ptr = NULL;
int error = posix_memalign(&ptr, 16, 1024);
if (error != 0) {
// OMG: it failed!, error is either EINVAL or ENOMEM, errno is indeterminate
}
Becomes:
void* ptr = _aligned_malloc(1024, 16);
if (!ptr) {
// OMG: it failed! error is stored in errno.
}
Be careful that memory obtained from _aligned_malloc() must be freed with _aligned_free(), while posix_memalign() just uses regular free(). So you'd want to add something like:
#ifdef _WIN32
#define posix_memalign_free _aligned_free
#else
#define posix_memalign_free free
#endif
If you compare the posix_memalign declaration:
int posix_memalign(void **memptr, size_t alignment, size_t size);
with _aligned_malloc declaration:
void * _aligned_malloc(size_t size, size_t alignment);
you see that _aligned_malloc is missing the void **memptr param, but returns void * instead.
If your code was something like this:
void * mem;
posix_memalign(&mem, x, y);
now it will be (take notice that x, y is now y, x):
void * mem;
mem = _aligned_malloc(y, x);
Since C11, there is aligned_alloc in the C standard library.
The memory can be freed with a regular free, which is the main advantage of this function, compared to _aligned_malloc/_aligned_free.
int *p2 = aligned_alloc(1024, 1024*sizeof *p2);
printf("1024-byte aligned addr: %p\n", (void*)p2);
free(p2);
However, the ease of use with free is precisely the reason why Visual Studio is unlikely to ever implement it; see the question "std::aligned_alloc() missing from Visual Studio 2019?".

SIMD alignment issue with PPL Combinable

I'm trying to sum the elements of an array in parallel with SIMD. To avoid locking, I'm using a combinable thread-local object, which is not always aligned on 16 bytes; because of that, _mm_add_epi32 throws an exception.
concurrency::combinable<__m128i> sum_combine;
int length = 40; // multiple of 8
concurrency::parallel_for(0, length , 8, [&](int it)
{
__m128i v1 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it));
__m128i v2 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it + sizeof(uint32_t)));
auto temp = _mm_add_epi32(v1, v2);
auto &sum = sum_combine.local(); // here is the problem
TRACE(L"%d\n", it);
TRACE(L"add %x\n", &sum);
ASSERT(((unsigned long)&sum & 15) == 0);
sum = _mm_add_epi32(temp, sum);
}
);
Here is the definition of combinable from ppl.h:
template<typename _Ty>
class combinable
{
private:
// Disable warning C4324: structure was padded due to __declspec(align())
// This padding is expected and necessary.
#pragma warning(push)
#pragma warning(disable: 4324)
__declspec(align(64))
struct _Node
{
unsigned long _M_key;
_Ty _M_value; // this might not be aligned on 16 bytes
_Node* _M_chain;
_Node(unsigned long _Key, _Ty _InitialValue)
: _M_key(_Key), _M_value(_InitialValue), _M_chain(NULL)
{
}
};
Sometimes the alignment is OK and the code works fine, but most of the time it isn't.
I have tried the following, but it doesn't compile:
union combine
{
unsigned short x[sizeof(__m128i) / sizeof(unsigned int)];
__m128i y;
};
concurrency::combinable<combine> sum_combine;
and then: auto &sum = sum_combine.local().y;
Any suggestions for correcting the alignment issue while still using combinable?
On x64 it works fine because of the default 16-byte alignment. On x86, alignment problems sometimes occur.
I just loaded sum using an unaligned load:
auto &sum = sum_combine.local();
#if !defined(_M_X64)
if (((unsigned long)&sum & 15) != 0)
{
// just for breakpoint means, sum is unaligned.
int a = 5;
}
auto sum_temp = _mm_loadu_si128(&sum);
sum = _mm_add_epi32(temp, sum_temp);
#else
sum = _mm_add_epi32(temp, sum);
#endif
Since the sum variable being used with _mm_add_epi32 is not aligned, you need to explicitly load/store sum using unaligned loads/stores (_mm_loadu_si128/_mm_storeu_si128). Change:
sum = _mm_add_epi32(temp, sum);
to:
__m128i v2 = _mm_loadu_si128((__m128i *)&sum);
v2 = _mm_add_epi32(v2, temp);
_mm_storeu_si128((__m128i *)&sum, v2);

Array Error - Access violation reading location 0xffffffff

I have previously used SIMD operators to improve the efficiency of my code, however I am now facing a new error which I cannot resolve. For this task, speed is paramount.
The size of the array will not be known until the data is imported, and may be very small (100 values) or enormous (10 million values). For the latter case, the code works fine, however I am encountering an error when I use fewer than 130036 array values.
Does anyone know what is causing this issue and how to resolve it?
I have attached the (tested) code involved, which will be used later in a more complicated function. The error occurs at "arg1List[i] = ..."
#include <iostream>
#include <xmmintrin.h>
#include <emmintrin.h>
int main()
{
int j;
const int loop = 130036;
const int SIMDloop = (int)(loop/4);
__m128 *arg1List = new __m128[SIMDloop];
printf("sizeof(arg1List) = %zu, __alignof(arg1List) = %zu, pointer = %p", sizeof(arg1List), __alignof(arg1List), (void*)arg1List);
std::cout << std::endl;
for (int i = 0; i < SIMDloop; i++)
{
j = 4*i;
arg1List[i] = _mm_set_ps((j+1)/100.0f, (j+2)/100.0f, (j+3)/100.0f, (j+4)/100.0f);
}
}
Alignment is the reason.
MOVAPS--Move Aligned Packed Single-Precision Floating-Point Values
[...] The operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated.
You can see the issue is gone as soon as you align your pointer:
__m128 *arg1List = new __m128[SIMDloop + 1];
arg1List = (__m128*) (((uintptr_t) arg1List + 15) & ~(uintptr_t)15); // uintptr_t, so the cast is safe in 64-bit builds too