OS portable memcpy optimized for SSE2 & SSE3 - c++

If I were to write an OS-portable memcpy optimized for SSE2/SSE3, what would it look like? I want to support both the GCC and ICC compilers. The reason I ask is that glibc's memcpy is written in assembler and not optimized for SSE2/SSE3, and other generic memcpy implementations may not fully take advantage of the system's capabilities regarding data alignment, size, etc.
Here is my current memcpy, which takes data alignment into consideration and is (I think) optimized for SSE2, but not for SSE3:
#ifdef __SSE2__
// SSE2 optimized memcpy() -- a plain byte copy that relies on the compiler
// to auto-vectorize it; __restrict tells GCC/ICC the buffers do not overlap.
void *CMemUtils::MemCpy(void *__restrict b, const void *__restrict a, size_t n)
{
    char *s1 = static_cast<char *>(b);
    const char *s2 = static_cast<const char *>(a);
    for (; 0 < n; --n) *s1++ = *s2++;
    return b;
}
#else
// Generic memcpy() implementation
void *CMemUtils::MemCpy(void *dest, const void *source, size_t count) const
{
#ifdef _USE_SYSTEM_MEMCPY
    // Use system memcpy()
    return memcpy(dest, source, count);
#else
    size_t blockIdx;
    size_t blocks    = count >> 3;
    size_t bytesLeft = count - (blocks << 3);

    // Copy 64-bit blocks first
    _UINT64 *sourcePtr8 = (_UINT64*)source;
    _UINT64 *destPtr8   = (_UINT64*)dest;
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = sourcePtr8[blockIdx];
    if (!bytesLeft) return dest;

    blocks    = bytesLeft >> 2;
    bytesLeft = bytesLeft - (blocks << 2);

    // Copy 32-bit blocks
    _UINT32 *sourcePtr4 = (_UINT32*)&sourcePtr8[blockIdx];
    _UINT32 *destPtr4   = (_UINT32*)&destPtr8[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = sourcePtr4[blockIdx];
    if (!bytesLeft) return dest;

    blocks    = bytesLeft >> 1;
    bytesLeft = bytesLeft - (blocks << 1);

    // Copy 16-bit blocks
    _UINT16 *sourcePtr2 = (_UINT16*)&sourcePtr4[blockIdx];
    _UINT16 *destPtr2   = (_UINT16*)&destPtr4[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = sourcePtr2[blockIdx];
    if (!bytesLeft) return dest;

    // Copy byte blocks
    _UINT8 *sourcePtr1 = (_UINT8*)&sourcePtr2[blockIdx];
    _UINT8 *destPtr1   = (_UINT8*)&destPtr2[blockIdx];
    for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = sourcePtr1[blockIdx];
    return dest;
#endif
}
#endif
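For comparison, here is a rough sketch of the kind of explicit SSE2 loop I have in mind (untested; it byte-copies until the destination is 16-byte aligned, moves 16-byte blocks with unaligned loads and aligned stores, then copies the tail):
#include <emmintrin.h>  // SSE2
#include <cstdint>
#include <cstddef>

// Sketch only: align the destination, copy 16-byte blocks, then the tail.
void *MemCpySSE2(void *dst, const void *src, size_t n)
{
    unsigned char *d = static_cast<unsigned char *>(dst);
    const unsigned char *s = static_cast<const unsigned char *>(src);

    // Head: byte-copy until the destination is 16-byte aligned.
    while (n && (reinterpret_cast<uintptr_t>(d) & 15)) {
        *d++ = *s++;
        --n;
    }

    // Body: 16 bytes per iteration (unaligned load, aligned store).
    for (; n >= 16; n -= 16, d += 16, s += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(s));
        _mm_store_si128(reinterpret_cast<__m128i *>(d), v);
    }

    // Tail: remaining bytes.
    for (; n; --n) *d++ = *s++;

    return dst;
}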
Not all memcpy implementations are thread-safe, which is just another reason to make our own version. All this leads me to conclude I should at least try to make a thread-safe OS portable memcpy that is optimized for SSE2/SSE3 where available.
I've also read that GCC supports aggressive unrolling with the -funroll-loops compiler option; could this improve performance with SSE2 and/or SSE3 if there are no significant cache misses?
Is there a performance gain in making different memcpy versions for 32- and 64-bit architectures?
Is there any performance gain in pre-aligning internal memory buffers before copying?
How do I use #pragma loop to control how loop code is treated by the SSE2/SSE3 auto-parallelizer? Supposedly one could use #pragma loop where contiguous data regions are moved by a for() loop.
Do I need to use the GCC compiler option -fno-builtin-memcpy, even with -O3, to keep the compiler from inlining its builtin memcpy when adding my own? Or is simply overriding memcpy in my code enough?
Update:
After some tests, it seems to me that an SSE2-optimized memcpy() is not enough faster to be worth the effort. I've asked a question to that effect on the Intel C/C++ Compiler forums.

Related

How do I get clang/gcc to vectorize looped array comparisons?

bool equal(uint8_t * b1, uint8_t * b2){
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 64);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 64);
    for (int ii = 0; ii < 64; ++ii) {
        if (b1[ii] != b2[ii]) {
            return false;
        }
    }
    return true;
}
Looking at the assembly, clang and gcc don't seem to apply any optimizations (with flags -O3 -mavx512f -msse4.2) apart from loop unrolling. I would think it's pretty easy to just put both memory regions in 512-bit registers and compare them. Even more surprisingly, both compilers also fail to optimize this (ideally only a single 64-bit compare is required, and no special large registers):
bool equal(uint8_t * b1, uint8_t * b2){
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 8);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 8);
    for (int ii = 0; ii < 8; ++ii) {
        if (b1[ii] != b2[ii]) {
            return false;
        }
    }
    return true;
}
So are both compilers just dumb or is there a reason that this code isn't vectorized? And is there any way to force vectorization short of writing inline assembly?
"I assume" the following is most efficient
memcmp(b1, b2, any_size_you_need);
especially for huge arrays!
(For small arrays, there is not a lot to gain anyway!)
Otherwise, you would need to vectorize manually using Intel intrinsics. (Also mentioned by chtz.) I started to look at that until I thought about memcmp.
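Applied to the 64-byte case from the question, that is just (a trivial sketch):
#include <cstring>
#include <cstdint>

// equal() from the question, expressed with memcmp; library implementations
// are typically already vectorized for larger sizes.
bool equal(const uint8_t *b1, const uint8_t *b2) {
    return std::memcmp(b1, b2, 64) == 0;
}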
The compiler must assume that once the function returns (or the loop exits early), it must not have read any bytes beyond the current index -- for example, if one of the pointers happens to point somewhere near the boundary of invalid memory.
You can give the compiler a chance to optimize this by using (non-lazy) bitwise &/| operators to combine the results of the individual comparisons:
bool equal(uint8_t * b1, uint8_t * b2){
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 64);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 64);
    bool ret = true;
    for (int ii = 0; ii < 64; ++ii) {
        ret &= (b1[ii] == b2[ii]);
    }
    return ret;
}
Godbolt demo: https://godbolt.org/z/3ePh7q5rM
gcc still fails to vectorize this, though. So you may need to write manually vectorized versions of this function if it is performance critical.
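If it comes to that, a rough manually vectorized sketch (SSE2 only, assuming the 64-byte size and 16-byte alignment from the question) could look like this:
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Compare 16 bytes at a time and AND the per-byte comparison masks together;
// the accumulated mask is all-ones only if every byte matched.
bool equal_sse2(const uint8_t *b1, const uint8_t *b2) {
    __m128i acc = _mm_set1_epi8((char)0xFF);
    for (int i = 0; i < 64; i += 16) {
        __m128i a = _mm_load_si128(reinterpret_cast<const __m128i *>(b1 + i));
        __m128i b = _mm_load_si128(reinterpret_cast<const __m128i *>(b2 + i));
        acc = _mm_and_si128(acc, _mm_cmpeq_epi8(a, b));
    }
    return _mm_movemask_epi8(acc) == 0xFFFF;
}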

Gcc misoptimises sse function

I'm converting a project to compile with gcc instead of clang, and I've run into an issue with a function that uses SSE intrinsics:
void dodgy_function(
    const short* lows,
    const short* highs,
    short* mins,
    short* maxs,
    int its
)
{
    __m128i v00[2] = { _mm_setzero_si128(), _mm_setzero_si128() };
    __m128i v10[2] = { _mm_setzero_si128(), _mm_setzero_si128() };

    for (int i = 0; i < its; ++i) {
        reinterpret_cast<short*>(v00)[i] = lows[i];
        reinterpret_cast<short*>(v10)[i] = highs[i];
    }
    reinterpret_cast<short*>(v00)[its] = reinterpret_cast<short*>(v00)[its - 1];
    reinterpret_cast<short*>(v10)[its] = reinterpret_cast<short*>(v10)[its - 1];

    __m128i v01[2] = { _mm_setzero_si128(), _mm_setzero_si128() };
    __m128i v11[2] = { _mm_setzero_si128(), _mm_setzero_si128() };

    __m128i min[2];
    __m128i max[2];

    min[0] = _mm_min_epi16(_mm_max_epi16(v11[0], v01[0]), _mm_min_epi16(v10[0], v00[0]));
    max[0] = _mm_max_epi16(_mm_max_epi16(v11[0], v01[0]), _mm_max_epi16(v10[0], v00[0]));
    min[1] = _mm_min_epi16(_mm_min_epi16(v11[1], v01[1]), _mm_min_epi16(v10[1], v00[1]));
    max[1] = _mm_max_epi16(_mm_max_epi16(v11[1], v01[1]), _mm_max_epi16(v10[1], v00[1]));

    reinterpret_cast<__m128i*>(mins)[0] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[0], min[0]);
    reinterpret_cast<__m128i*>(maxs)[0] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[0], max[0]);
    reinterpret_cast<__m128i*>(mins)[1] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[1], min[1]);
    reinterpret_cast<__m128i*>(maxs)[1] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[1], max[1]);
}
Now clang gives me the expected output, but gcc prints all zeros: godbolt link
Playing around I discovered that gcc gives me the right results when I compile with -O1 but goes wrong with -O2 and -O3, suggesting the optimiser is going awry. Is there something particularly wrong I'm doing that would cause this behavior?
As a workaround I can wrap things up in a union and gcc will then give me the right result, but that feels a little icky: godbolt link 2
Any ideas?
The problem is that you're using short* to access the elements of a __m128i object. That violates the strict-aliasing rule. It's only safe to go the other way, using a __m128i* dereference or, more normally, _mm_load_si128( (const __m128i*)ptr ).
__m128i* is exactly like char* - you can point it at anything, but not vice versa: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
The only standard blessed way to do type punning is with memcpy:
memcpy(v00, lows, its * sizeof(short));
memcpy(v10, highs, its * sizeof(short));
memcpy(reinterpret_cast<short*>(v00) + its, lows + its - 1, sizeof(short));
memcpy(reinterpret_cast<short*>(v10) + its, highs + its - 1, sizeof(short));
https://godbolt.org/z/f63q7x
I prefer just using aligned memory of the correct type directly:
alignas(16) short v00[16];
alignas(16) short v10[16];
auto mv00 = reinterpret_cast<__m128i*>(v00);
auto mv10 = reinterpret_cast<__m128i*>(v10);
_mm_store_si128(mv00, _mm_setzero_si128());
_mm_store_si128(mv10, _mm_setzero_si128());
_mm_store_si128(mv00 + 1, _mm_setzero_si128());
_mm_store_si128(mv10 + 1, _mm_setzero_si128());

for (int i = 0; i < its; ++i) {
    v00[i] = lows[i];
    v10[i] = highs[i];
}
v00[its] = v00[its - 1];
v10[its] = v10[its - 1];
https://godbolt.org/z/bfanne
I'm not positive that this setup is actually standard-blessed (it definitely is for _mm_load_ps since you can do it without type punning at all) but it does seem to also fix the issue. I'd guess that any reasonable implementation of the load/store intrinsics is going to have to provide the same sort of aliasing guarantees that memcpy does since it's more or less the kosher way to go from straight line to vectorized code in x86.
As you mentioned in your question, you can also force the alignment with a union, and I've used that too in pre-C++11 contexts. Even in that case, though, I still personally always write the loads and stores explicitly (even if they're just going to/from aligned memory), because issues like this tend to pop up if you don't.
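For reference, a rough sketch of that union workaround (hypothetical names; reading the other member is a widely supported compiler extension in C++ rather than something the standard guarantees):
#include <emmintrin.h>

// The union forces 16-byte alignment and gives both a short view and a
// __m128i view of the same storage.
union Vec16x2 {
    __m128i v[2];
    short   s[16];
};

__m128i load_lows(const short *lows, int its) {  // its is assumed to be < 16
    Vec16x2 v00 = {};                 // zero-initialized
    for (int i = 0; i < its; ++i)
        v00.s[i] = lows[i];           // fill through the short view
    v00.s[its] = v00.s[its - 1];      // duplicate the last element, as in the question
    return v00.v[0];                  // read back through the vector view
}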

Process unaligned part of a double array, vectorize the rest

I am generating SSE/AVX instructions, and currently I have to use unaligned loads and stores. I operate on a float/double array and I will never know whether it will be aligned or not. So, before vectorizing it, I would like to have a pre-loop and possibly a post-loop that take care of the unaligned part. The main vectorized loop then operates on the aligned part.
But how do I determine when an array is aligned? Can I check the pointer value? When should the pre-loop stop and the post-loop start?
Here is my simple code example:
void func(double * in, double * out, unsigned int size){
    for( as long as in unaligned part ){
        out[i] = do_something_with_array(in[i])
    }
    for( as long as aligned ){
        awesome avx code that loads operates and stores 4 doubles
    }
    for( remaining part of array ){
        out[i] = do_something_with_array(in[i])
    }
}
Edit:
I have been thinking about it. Theoretically, the address of the i-th element should be divisible (something like &a[i] % 16 == 0) by 2, 4, 16, or 32 (depending on whether it is float or double and whether it is SSE or AVX). So the first loop should cover the elements whose addresses are not divisible.
In practice, I will try out the compiler pragmas and flags to see what the compiler produces. If no one gives a good answer, I will post my solution (if any) over the weekend.
Here is some example C code that does what you want
#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>

#define ALIGN 32
#define SIMD_WIDTH (ALIGN/sizeof(double))

int main(void) {
    int n = 17;
    int c = 1;

    double* p  = _mm_malloc((n+c) * sizeof *p, ALIGN);
    double* p1 = p+c;
    for(int i=0; i<n; i++) p1[i] = 1.0*i;

    double* p2 = (double*)((uintptr_t)(p1+SIMD_WIDTH-1)&-ALIGN);
    double* p3 = (double*)((uintptr_t)(p1+n)&-ALIGN);
    if(p2>p3) p2 = p3;
    printf("%p %p %p %p\n", p1, p2, p3, p1+n);

    double *t;
    for(t=p1; t<p2; t+=1) {
        printf("a %p %f\n", t, *t);
    }
    puts("");
    for(;t<p3; t+=SIMD_WIDTH) {
        printf("b %p ", t);
        for(int i=0; i<SIMD_WIDTH; i++) printf("%f ", *(t+i));
        puts("");
    }
    puts("");
    for(;t<p1+n; t+=1) {
        printf("c %p %f\n", t, *t);
    }
}
This generates a 32-byte aligned buffer but then offsets it by one double so it's no longer 32-byte aligned. It loops over scalar values until reaching 32-byte alignment, loops over the 32-byte aligned values, and then finishes with another scalar loop for any remaining values which are not a multiple of the SIMD width.
I would argue that this kind of optimization only really makes a lot of sense for Intel x86 processors before Nehalem. Since Nehalem, the latency and throughput of unaligned loads and stores are the same as for aligned loads and stores. Additionally, since Nehalem, the cost of cache-line splits is small.
There is one subtle point with SSE since Nehalem in that unaligned loads and stores cannot fold with other operations. Therefore, aligned loads and stores are not obsolete with SSE since Nehalem. So in principle this optimization could make a difference even with Nehalem but in practice I think there are few cases where it will.
However, with AVX unaligned loads and stores can fold so the aligned loads and store instructions are obsolete.
I looked into this with GCC, MSVC, and Clang. If GCC cannot assume a pointer is aligned to, e.g., 16 bytes with SSE, then it will generate code similar to the code above to reach 16-byte alignment and avoid the cache-line splits when vectorizing.
Clang and MSVC don't do this, so they would suffer from the cache-line splits. However, the cost of the additional code roughly offsets the cost of the cache-line splits, which probably explains why Clang and MSVC don't worry about it.
The only exception is before Nehalem. In that case GCC is much faster than Clang and MSVC when the pointer is not aligned. If the pointer is aligned and Clang knows it, then it will use aligned loads and stores and be as fast as GCC. MSVC's vectorization still uses unaligned loads and stores, and is therefore slow pre-Nehalem even when a pointer is 16-byte aligned.
Here is a version which I think is a bit clearer using pointer differences
#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>

#define ALIGN 32
#define SIMD_WIDTH (ALIGN/sizeof(double))

int main(void) {
    int n = 17, c = 1;

    double* p  = _mm_malloc((n+c) * sizeof *p, ALIGN);
    double* p1 = p+c;
    for(int i=0; i<n; i++) p1[i] = 1.0*i;

    double* p2 = (double*)((uintptr_t)(p1+SIMD_WIDTH-1)&-ALIGN);
    double* p3 = (double*)((uintptr_t)(p1+n)&-ALIGN);
    int n1 = p2-p1, n2 = p3-p2;
    if(n1>n2) n1 = n2;
    printf("%d %d %d\n", n1, n2, n);

    int i;
    for(i=0; i<n1; i++) {
        printf("a %p %f\n", &p1[i], p1[i]);
    }
    puts("");
    for(;i<n2; i+=SIMD_WIDTH) {
        printf("b %p ", &p1[i]);
        for(int j=0; j<SIMD_WIDTH; j++) printf("%f ", p1[i+j]);
        puts("");
    }
    puts("");
    for(;i<n; i++) {
        printf("c %p %f\n", &p1[i], p1[i]);
    }
}
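To connect this back to the pseudocode in the question, a rough AVX sketch of func() could look like the following; do_something_with_array is a made-up stand-in for the real per-element operation, in is assumed to be at least 8-byte aligned, and the stores to out stay unaligned because only one of the two pointers can be aligned to:
#include <immintrin.h>  // AVX (compile with -mavx or /arch:AVX)
#include <stdint.h>

// Made-up stand-in for the question's per-element operation.
static double do_something_with_array(double x) { return 2.0 * x + 1.0; }

void func(double *in, double *out, unsigned int size) {
    double *end = in + size;

    // First 32-byte-aligned position inside 'in'.
    double *aligned = (double *)(((uintptr_t)in + 31) & ~(uintptr_t)31);
    if (aligned > end) aligned = end;

    // Pre-loop: scalar until 'in' reaches 32-byte alignment.
    for (; in < aligned; ++in, ++out)
        *out = do_something_with_array(*in);

    // Main loop: 4 doubles per iteration, aligned loads from 'in',
    // unaligned stores to 'out' (its alignment is unknown).
    for (; end - in >= 4; in += 4, out += 4) {
        __m256d v = _mm256_load_pd(in);
        v = _mm256_add_pd(_mm256_mul_pd(v, _mm256_set1_pd(2.0)),
                          _mm256_set1_pd(1.0));
        _mm256_storeu_pd(out, v);
    }

    // Post-loop: whatever is left after the vector body.
    for (; in < end; ++in, ++out)
        *out = do_something_with_array(*in);
}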

Setting a buffer of char* with intermediate casting to int*

I could not fully understand the consequences of what I read here: Casting an int pointer to a char ptr and vice versa
In short, would this work?
void set4Bytes(unsigned char* buffer) {
    const uint32_t MASK = 0xffffffff;
    if ((uintmax_t)buffer % 4) {  // misaligned
        for (int i = 0; i < 4; i++) {
            buffer[i] = 0xff;
        }
    } else {  // 4-byte alignment
        *((uint32_t*) buffer) = MASK;
    }
}
Edit
There was a long discussion (it was in the comments, which mysteriously got deleted) about what type the pointer should be cast to in order to check the alignment. The subject is now addressed here.
This conversion is safe if you are filling the same value into all 4 bytes. If byte order matters, then this conversion is not safe, because when you use an integer to fill 4 bytes at a time, the order in which those bytes land depends on the endianness of the machine.
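A minimal illustration of that point (hypothetical example value; the output depends on the machine):
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t x = 0x01020304;
    unsigned char *b = (unsigned char *)&x;
    /* Prints "04 03 02 01" on a little-endian machine and "01 02 03 04" on a
       big-endian one. With 0xffffffff every byte is the same, so the order
       cannot be observed. */
    printf("%02x %02x %02x %02x\n", b[0], b[1], b[2], b[3]);
    return 0;
}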
No, it won't work in every case. Aside from endianness, which may or may not be an issue, you assume that the alignment of uint32_t is 4. But this quantity is implementation-defined (C11 Draft N1570 Section 6.2.8). You can use the _Alignof operator to get the alignment in a portable way.
Second, the effective type (ibid. Sec. 6.5) of the location pointed to by buffer may not be compatible to uint32_t (e.g. if buffer points to an unsigned char array). In that case you break strict aliasing rules once you try reading through the array itself or through a pointer of different type.
Assuming that the pointer actually points to an array of unsigned char, the following code will work
typedef union { unsigned char chr[sizeof(uint32_t)]; uint32_t u32; } conv_t;

void set4Bytes(unsigned char* buffer) {
    const uint32_t MASK = 0xffffffffU;
    if ((uintptr_t)buffer % _Alignof(uint32_t)) {  // misaligned
        for (size_t i = 0; i < sizeof(uint32_t); i++) {
            buffer[i] = 0xffU;
        }
    } else {  // correct alignment
        conv_t *cnv = (conv_t *) buffer;
        cnv->u32 = MASK;
    }
}
This code might be of help to you. It shows a 32-bit number being built by assigning its contents a byte at a time, forcing misalignment. It compiles and works on my machine.
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>

int main () {
    uint32_t *data = (uint32_t*)malloc(sizeof(uint32_t)*2);
    char *buf = (char*)data;
    uintptr_t addr = (uintptr_t)buf;
    int i,j;
    i = !(addr%4) ? 1 : 0;              /* force the write to start one byte off a 4-byte boundary */
    uint32_t x = (1<<6)-1;
    for( j=0;j<4;j++ ) buf[i+j] = ((char*)&x)[j];
    printf("%" PRIu32 "\n", *((uint32_t*) (addr+i)) );
    free(data);
    return 0;
}
As mentioned by #Learner, endianness must be obeyed. The code above is not portable and would break on a big endian machine.
Note that my compiler throws the error "cast from ‘char*’ to ‘unsigned int’ loses precision [-fpermissive]" when trying to cast a char* to an unsigned int, as done in the original post. This post explains that uintptr_t should be used instead.
In addition to the endian issue, which has already been mentioned here:
CHAR_BIT - the number of bits per char - should also be considered.
It is 8 on most platforms, where for (int i=0; i<4; i++) should work fine.
A safer way of doing it would be for (int i=0; i<sizeof(uint32_t); i++).
Alternatively, you can include <limits.h> and use for (int i=0; i<32/CHAR_BIT; i++).
Use reinterpret_cast<>() if you want to ensure the underlying data does not "change shape".
As Learner has mentioned, when you store data in machine memory endianess becomes a factor. If you know how the data is stored correctly in memory (correct endianess) and you are specifically testing its layout as an alternate representation, then you would want to use reinterpret_cast<>() to test that memory, as a specific type, without modifying the original storage.
Below, I've modified your example to use reinterpret_cast<>():
void set4Bytes(unsigned char* buffer) {
    const uint32_t MASK = 0xffffffff;
    if (reinterpret_cast<uintptr_t>(buffer) % 4) {  // misaligned
        for (int i = 0; i < 4; i++) {
            buffer[i] = 0xff;
        }
    } else {  // 4-byte alignment
        *reinterpret_cast<uint32_t *>(buffer) = MASK;
    }
}
It should also be noted that your function appears to set the buffer (4 bytes, i.e. 32 bits, of contiguous memory) to 0xFFFFFFFF, regardless of which branch it takes.
Your code is fine for any 32-bit-and-up architecture. There is no issue with byte ordering, since all your source bytes are 0xFF.
On x86 or x64 machines, the extra work necessary to deal with potentially unaligned access to RAM is handled by the CPU and is transparent to the programmer (since the Pentium II), at some performance cost for each access. So, if you are just setting the first four bytes of a buffer a few times, you can simplify your function:
void set4Bytes(unsigned char* buffer) {
    const uint32_t MASK = 0xffffffff;
    *((uint32_t *)buffer) = MASK;
}
Some readings:
A Linux kernel doc about UNALIGNED MEMORY ACCESSES
Intel Architecture Optimization Manual, section 3.4
Windows Data Alignment on IPF, x86, and x64
A Practical 'Aligned vs. unaligned memory access', by Alexander Sandler

How can I speedup data transfer from memory to CPU?

I am trying to speedup a popcount function. Here is the code:
#include <assert.h>

typedef long long ll;
typedef unsigned char* pUChar;

extern ll LUT16[];

ll LUT16Word32Monobit(pUChar buf, int size) {
    assert(buf != NULL);
    assert(size > 0);
    assert(size % sizeof(unsigned) == 0);

    int n = size / sizeof(unsigned);
    unsigned* p = (unsigned*)buf;
    ll numberOfOneBits = 0;
    for(int i = 0; i < n; i++) {
        unsigned int val1 = p[i];
        numberOfOneBits += LUT16[val1 >> 16] + LUT16[val1 & 0xFFFF];
    }
    return numberOfOneBits;
}
Here are a few details:
buf contains 1 GB of data
LUT16[i] contains the number of one bits in the binary representation of i, for all 0 <= i < 2^16
I tried to use OpenMP to speed things up, but it doesn't work. I must add that I am using MS Visual Studio 2010 and that I have enabled the OpenMP directives. I believe that one of the reasons OpenMP doesn't speed things up is memory access time. Is there any way I could make use of DMA (direct memory access)?
Also, I should warn you that my OpenMP skills are lacking; that being said, here is the OpenMP part (roughly the same code as above):
#pragma omp for schedule(dynamic,CHUNKSIZE)
for(int i = 0; i < n; i++) {
    unsigned int val1 = p[i];
    numberOfOneBits += LUT16[val1 >> 16] + LUT16[val1 & 0xFFFF];
}
CHUNKSIZE is set to 64. If I set it lower, the results are worse than the serial version; if I set it higher, it doesn't do any good.
Also, I don't want to use the popcount instruction that processors provide, nor the SSE instructions.
Your LUT16 array is 512kB (assuming a long long is 64-bit), which will completely destroy your L1/L2 cache performance for arbitrary/random data (L1 is typically 32kB, L2 is typically 256kB).
Firstly, you don't need long long for this. Secondly, try LUT8 instead. Thirdly, just use the builtin __popcnt intrinsic.
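One more note: as written, the #pragma omp for in the question is not enclosed in a parallel region, so it effectively runs on a single thread, and the shared numberOfOneBits would need a reduction clause in any case. A sketch of the corrected loop (still LUT-based, no popcount or SSE as requested, and still ultimately bound by memory bandwidth on a 1 GB buffer):
#include <assert.h>

extern long long LUT16[];  // as in the question: LUT16[i] = number of one bits in i

// The question's loop wrapped in a parallel region with a reduction, so each
// thread accumulates a private partial sum (no race on the total).
long long LUT16Word32MonobitOMP(const unsigned char *buf, int size) {
    assert(size % sizeof(unsigned) == 0);
    int n = size / (int)sizeof(unsigned);
    const unsigned *p = (const unsigned *)buf;
    long long numberOfOneBits = 0;
    #pragma omp parallel for schedule(static) reduction(+:numberOfOneBits)
    for (int i = 0; i < n; i++) {
        unsigned val1 = p[i];
        numberOfOneBits += LUT16[val1 >> 16] + LUT16[val1 & 0xFFFF];
    }
    return numberOfOneBits;
}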