Process unaligned part of a double array, vectorize the rest

I am generating SSE/AVX instructions and currently have to use unaligned loads and stores. I operate on a float/double array and never know in advance whether it will be aligned. So before vectorizing, I would like to have a pre-loop and possibly a post-loop that take care of the unaligned part. The main vectorized loop then operates on the aligned part.
But how do I determine whether an array is aligned? Can I check the pointer value? When should the pre-loop stop and the post-loop start?
Here is my simple code example:
void func(double * in, double * out, unsigned int size){
    for( as long as in unaligned part ){
        out[i] = do_something_with_array(in[i])
    }
    for( as long as aligned ){
        awesome avx code that loads operates and stores 4 doubles
    }
    for( remaining part of array ){
        out[i] = do_something_with_array(in[i])
    }
}
Edit:
I have been thinking about it. In theory, the address of the i-th element should be divisible by the vector size in bytes, 16 for SSE or 32 for AVX (something like (uintptr_t)&a[i] % 16 == 0). So the first loop should cover the elements up to the first address for which that holds.
In practice I will try out the compiler pragmas and flags to see what the compiler produces. If no one gives a good answer, I will post my solution (if any) on the weekend.

Here is some example C code that does what you want
#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>
#define ALIGN 32
#define SIMD_WIDTH (ALIGN/sizeof(double))
int main(void) {
    int n = 17;
    int c = 1;
    // aligned allocation, then offset by c doubles to force misalignment
    double* p = _mm_malloc((n+c) * sizeof *p, ALIGN);
    double* p1 = p+c;
    for(int i=0; i<n; i++) p1[i] = 1.0*i;
    // p2: first ALIGN-byte boundary at or after p1
    double* p2 = (double*)((uintptr_t)(p1+SIMD_WIDTH-1)&-ALIGN);
    // p3: last ALIGN-byte boundary at or before p1+n
    double* p3 = (double*)((uintptr_t)(p1+n)&-ALIGN);
    if(p2>p3) p2 = p3;
    printf("%p %p %p %p\n", p1, p2, p3, p1+n);
    double *t;
    // scalar loop up to the first aligned address
    for(t=p1; t<p2; t+=1) {
        printf("a %p %f\n", t, *t);
    }
    puts("");
    // aligned loop, SIMD_WIDTH doubles per iteration
    for(;t<p3; t+=SIMD_WIDTH) {
        printf("b %p ", t);
        for(int i=0; i<SIMD_WIDTH; i++) printf("%f ", *(t+i));
        puts("");
    }
    puts("");
    // scalar loop for the remainder
    for(;t<p1+n; t+=1) {
        printf("c %p %f\n", t, *t);
    }
    _mm_free(p);
}
This generates a 32-byte aligned buffer but then offsets it by one double so that it is no longer 32-byte aligned. It loops over scalar values until reaching 32-byte alignment, loops over the 32-byte aligned values, and finally finishes with another scalar loop for any remaining values that do not fill a full SIMD width.
I would argue that this kind of optimization only really makes a lot of sense for Intel x86 processors before Nehalem. Since Nehalem, the latency and throughput of unaligned loads and stores are the same as for aligned loads and stores. Additionally, since Nehalem the cost of cache-line splits is small.
There is one subtle point with SSE since Nehalem: unaligned loads and stores cannot be folded into other operations as memory operands. Therefore, aligned loads and stores are not obsolete with SSE since Nehalem. So in principle this optimization could make a difference even with Nehalem, but in practice I think there are few cases where it will.
However, with AVX unaligned loads and stores can fold, so the aligned load and store instructions are obsolete.
I looked into this with GCC, MSVC, and Clang. If GCC cannot assume that a pointer is aligned to, e.g., 16 bytes with SSE, it generates code similar to the code above to reach 16-byte alignment and avoid cache-line splits when vectorizing.
Clang and MSVC don't do this, so they would suffer from the cache-line splits. However, the cost of the additional alignment code roughly cancels out the cost of the cache-line splits, which probably explains why Clang and MSVC don't worry about it.
The only exception is before Nehalem. In this case GCC is much faster than Clang and MSVC when the pointer is not aligned. If the pointer is aligned and Clang knows it, then it will use aligned loads and stores and be fast like GCC. MSVC vectorization still uses unaligned stores and loads and is therefore slow pre-Nehalem even when a pointer is 16-byte aligned.
Here is a version which I think is a bit clearer, using pointer differences:
#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>
#define ALIGN 32
#define SIMD_WIDTH (ALIGN/sizeof(double))
int main(void) {
    int n = 17, c =1;
    double* p = _mm_malloc((n+c) * sizeof *p, ALIGN);
    double* p1 = p+c;
    for(int i=0; i<n; i++) p1[i] = 1.0*i;
    double* p2 = (double*)((uintptr_t)(p1+SIMD_WIDTH-1)&-ALIGN);
    double* p3 = (double*)((uintptr_t)(p1+n)&-ALIGN);
    // n1: scalar elements before the first boundary, n2: elements in the aligned region
    int n1 = p2-p1, n2 = p3-p2;
    if(n1>n2) n1=n2;
    printf("%d %d %d\n", n1, n2, n);
    int i;
    for(i=0; i<n1; i++) {
        printf("a %p %f\n", &p1[i], p1[i]);
    }
    puts("");
    // i starts at n1 < SIMD_WIDTH, so this runs exactly n2/SIMD_WIDTH times
    for(;i<n2; i+=SIMD_WIDTH) {
        printf("b %p ", &p1[i]);
        for(int j=0; j<SIMD_WIDTH; j++) printf("%f ", p1[i+j]);
        puts("");
    }
    puts("");
    for(;i<n; i++) {
        printf("c %p %f\n", &p1[i], p1[i]);
    }
    _mm_free(p);
}
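
The same split can be packaged as a small function that only computes the three loop lengths. This is a minimal sketch, not part of the answer above: the names split_t and split_for_alignment are made up for illustration, and it assumes the pointer is at least aligned to sizeof(double) and that the alignment is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: how n doubles starting at p split up. */
typedef struct {
    size_t head;  /* scalar elements before the first aligned address */
    size_t body;  /* elements in the aligned region, a multiple of the SIMD width */
    size_t tail;  /* scalar elements left over at the end */
} split_t;

/* align is in bytes and must be a power of two (16 for SSE, 32 for AVX).
   Assumes p is aligned to sizeof(double), as any valid double* is. */
static split_t split_for_alignment(const double *p, size_t n, size_t align)
{
    size_t width = align / sizeof(double);
    size_t misalign = (size_t)((uintptr_t)p & (uintptr_t)(align - 1));
    /* bytes needed to reach the next multiple of align (0 if already aligned) */
    size_t pad = (align - misalign) & (align - 1);
    split_t s;
    s.head = pad / sizeof(double);
    if (s.head > n) s.head = n;           /* array ends before the boundary */
    s.body = ((n - s.head) / width) * width;
    s.tail = n - s.head - s.body;
    return s;
}
```

For the example above (17 doubles starting one double past a 32-byte boundary), this gives head = 3, body = 12, tail = 2, which corresponds to the a, b and c loops.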

Related

Setting a buffer of char* with intermediate casting to int*

I could not fully understand the consequences of what I read here: Casting an int pointer to a char ptr and vice versa
In short, would this work?
void set4Bytes(unsigned char* buffer) {
    const uint32_t MASK = 0xffffffff;
    if ((uintmax_t)buffer % 4) {//misaligned
        for (int i = 0; i < 4; i++) {
            buffer[i] = 0xff;
        }
    } else {//4-byte alignment
        *((uint32_t*) buffer) = MASK;
    }
}
Edit
There was a long discussion (it was in the comments, which mysteriously got deleted) about what type the pointer should be cast to in order to check the alignment. The subject is now addressed here.

This conversion is safe if you are filling the same value into all 4 bytes. If byte order matters, it is not safe:
when you use an integer to fill 4 bytes at a time, all 4 bytes are written, but their order depends on the endianness.
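
To make the byte-order point concrete, here is a small sketch that stores a 32-bit value the way a plain integer store would and lets you look at the resulting bytes. The helper names are invented for this illustration, and it assumes 8-bit bytes and a machine that is either little or big endian.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a little-endian machine, 0 otherwise. */
static int is_little_endian(void)
{
    const uint32_t probe = 1u;
    unsigned char first;
    memcpy(&first, &probe, 1);   /* look at the lowest-addressed byte */
    return first == 1;
}

/* Writes value into out[0..3] in the machine's native byte order. */
static void store_u32_native(unsigned char out[4], uint32_t value)
{
    memcpy(out, &value, sizeof value);
}
```

Storing 0xffffffff gives four 0xff bytes on any machine, whereas storing 0x01020304 puts 0x04 first on a little-endian machine and 0x01 first on a big-endian one.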

No, it won't work in every case. Aside from endianness, which may or may not be an issue, you assume that the alignment of uint32_t is 4. But this quantity is implementation-defined (C11 Draft N1570 Section 6.2.8). You can use the _Alignof operator to get the alignment in a portable way.
Second, the effective type (ibid. Sec. 6.5) of the location pointed to by buffer may not be compatible with uint32_t (e.g. if buffer points to an unsigned char array). In that case you break strict aliasing rules once you try reading through the array itself or through a pointer of a different type.
Assuming that the pointer actually points to an array of unsigned char, the following code will work:
#include <stddef.h>
#include <stdint.h>
typedef union { unsigned char chr[sizeof(uint32_t)]; uint32_t u32; } conv_t;
void set4Bytes(unsigned char* buffer) {
    const uint32_t MASK = 0xffffffffU;
    if ((uintptr_t)buffer % _Alignof(uint32_t)) {// misaligned
        for (size_t i = 0; i < sizeof(uint32_t); i++) {
            buffer[i] = 0xffU;
        }
    } else { // correct alignment
        conv_t *cnv = (conv_t *) buffer;
        cnv->u32 = MASK;
    }
}
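
A commonly used alternative, not taken from the answer above, is to let memcpy do the store. It is valid for any alignment and does not depend on the effective type of the destination; compilers typically turn a fixed-size memcpy like this into a single store when optimizing. The function name below is made up for the sketch.

```c
#include <stdint.h>
#include <string.h>

/* Sets the first sizeof(uint32_t) bytes of buffer to 0xff,
   regardless of how buffer is aligned. */
static void set4Bytes_memcpy(unsigned char *buffer)
{
    const uint32_t mask = 0xffffffffu;
    memcpy(buffer, &mask, sizeof mask);
}
```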

This code might be of help to you. It shows a 32-bit number being built by assigning its contents a byte at a time, forcing misalignment. It compiles and works on my machine.
#include<stdint.h>
#include<stdio.h>
#include<inttypes.h>
#include<stdlib.h>
int main () {
    uint32_t *data = (uint32_t*)malloc(sizeof(uint32_t)*2);
    char *buf = (char*)data;
    uintptr_t addr = (uintptr_t)buf;
    int i,j;
    // if the buffer is 4-byte aligned, offset by one byte to force misalignment
    i = !(addr%4) ? 1 : 0;
    uint32_t x = (1<<6)-1;
    // build the number one byte at a time at the misaligned address
    for( j=0;j<4;j++ ) buf[i+j] = ((char*)&x)[j];
    printf("%" PRIu32 "\n",*((uint32_t*) (addr+i)) );
    free(data);
}
As mentioned by @Learner, endianness must be obeyed. The code above is not portable and would break on a big endian machine.
Note that my compiler throws the error "cast from ‘char*’ to ‘unsigned int’ loses precision [-fpermissive]" when trying to cast a char* to an unsigned int, as done in the original post. This post explains that uintptr_t should be used instead.

In addition to the endian issue, which has already been mentioned here:
CHAR_BIT - the number of bits per char - should also be considered.
It is 8 on most platforms, where for (int i=0; i<4; i++) should work fine.
A safer way of doing it would be for (int i=0; i<sizeof(uint32_t); i++).
Alternatively, you can include <limits.h> and use for (int i=0; i<32/CHAR_BIT; i++).

Use reinterpret_cast<>() if you want to ensure the underlying data does not "change shape".
As Learner has mentioned, when you store data in machine memory, endianness becomes a factor. If you know how the data is stored correctly in memory (correct endianness) and you are specifically testing its layout as an alternate representation, then you would want to use reinterpret_cast<>() to test that memory, as a specific type, without modifying the original storage.
Below, I've modified your example to use reinterpret_cast<>():
void set4Bytes(unsigned char* buffer) {
    const uint32_t MASK = 0xffffffff;
    // test the address itself, not the value stored at it
    if (reinterpret_cast<uintptr_t>(buffer) % 4) {//misaligned
        for (int i = 0; i < 4; i++) {
            buffer[i] = 0xff;
        }
    } else {//4-byte alignment
        *reinterpret_cast<uint32_t *>(buffer) = MASK;
    }
}
It should also be noted that your function appears to set the buffer (32 bits, i.e. 4 bytes, of contiguous memory) to 0xFFFFFFFF regardless of which branch it takes.

Your code works on any architecture that is 32-bit or wider. There is no issue with byte ordering since all your source bytes are 0xFF.
On x86 or x64 machines, the extra work needed to deal with a possibly unaligned access to RAM is handled by the CPU and is transparent to the programmer (since Pentium II), at some performance cost per access. So if you are just setting the first four bytes of a buffer a few times, you can simplify your function:
void set4Bytes(unsigned char* buffer) {
    const uint32_t MASK = 0xffffffff;
    *((uint32_t *)buffer) = MASK;
}
Some readings:
A Linux kernel doc about UNALIGNED MEMORY ACCESSES
Intel Architecture Optimization Manual, section 3.4
Windows Data Alignment on IPF, x86, and x64
A Practical 'Aligned vs. unaligned memory access', by Alexander Sandler

OS portable memcpy optimized for SSE2 & SSE3

If I were to write an OS-portable memcpy optimized for SSE2/SSE3, what would it look like? I want to support both the GCC and ICC compilers. The reason I ask is that memcpy is written in assembler code in glibc and not optimized for SSE2/SSE3, and other generic memcpy implementations may not fully take advantage of the system's capabilities with respect to data alignment, size, etc.
Here is my current memcpy, which takes data alignment into consideration and is optimized for SSE2 (I think) but not for SSE3:
#ifdef __SSE2__
// SSE2 optimized memcpy()
void *CMemUtils::MemCpy(void *restrict b, const void *restrict a, size_t n)
{
    // explicit casts: C++ does not convert void* to char* implicitly
    char *s1 = (char*)b;
    const char *s2 = (const char*)a;
    for(; 0<n; --n) *s1++ = *s2++;
    return b;
}
#else
// Generic memcpy() implementation
void *CMemUtils::MemCpy(void *dest, const void *source, size_t count) const
{
#ifdef _USE_SYSTEM_MEMCPY
    // Use system memcpy()
    return memcpy(dest, source, count);
#else
    size_t blockIdx;
    size_t blocks = count >> 3;
    size_t bytesLeft = count - (blocks << 3);
    // Copy 64-bit blocks first
    _UINT64 *sourcePtr8 = (_UINT64*)source;
    _UINT64 *destPtr8 = (_UINT64*)dest;
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = sourcePtr8[blockIdx];
    if (!bytesLeft) return dest;
    blocks = bytesLeft >> 2;
    bytesLeft = bytesLeft - (blocks << 2);
    // Copy 32-bit blocks
    _UINT32 *sourcePtr4 = (_UINT32*)&sourcePtr8[blockIdx];
    _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = sourcePtr4[blockIdx];
    if (!bytesLeft) return dest;
    blocks = bytesLeft >> 1;
    bytesLeft = bytesLeft - (blocks << 1);
    // Copy 16-bit blocks
    _UINT16 *sourcePtr2 = (_UINT16*)&sourcePtr4[blockIdx];
    _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = sourcePtr2[blockIdx];
    if (!bytesLeft) return dest;
    // Copy byte blocks
    _UINT8 *sourcePtr1 = (_UINT8*)&sourcePtr2[blockIdx];
    _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
    for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = sourcePtr1[blockIdx];
    return dest;
#endif
}
#endif
Not all memcpy implementations are thread-safe, which is just another reason to make our own version. All this leads me to conclude that I should at least try to make a thread-safe, OS-portable memcpy that is optimized for SSE2/SSE3 where available.
I've also read that GCC supports aggressive unrolling with the -funroll-loops compiler option. Could this improve performance with SSE2 and/or SSE3 if there are no significant cache misses?
Is there a performance gain from making different memcpy versions for 32- and 64-bit architectures?
Is there any performance gain from pre-aligning internal memory buffers before copying?
How do I use #pragma loop to control how loop code is considered by the SSE2/SSE3 auto-parallelizer? Supposedly one could use #pragma loop where contiguous data regions are moved by a for() loop.
Do I need to use the GCC compiler option -fno-builtin-memcpy even with -O3 to prevent the compiler from inlining the GCC memcpy when adding my own memcpy? Or is just overriding memcpy in my code enough?
Update:
After some tests it seems to me that an SSE2-optimized memcpy() is not so much faster that it is worth the effort. I've asked a question in that regard on the Intel C/C++ Compiler forums.

How can I speedup data transfer from memory to CPU?

I am trying to speed up a popcount function. Here is the code:
typedef long long ll;
typedef unsigned char* pUChar;
extern ll LUT16[];
ll LUT16Word32Monobit(pUChar buf, int size) {
    assert(buf != NULL);
    assert(size > 0);
    assert(size % sizeof(unsigned) == 0);
    int n = size / sizeof(unsigned);
    unsigned* p = (unsigned*)buf;
    ll numberOfOneBits = 0;
    for(int i = 0; i < n; i++) {
        unsigned int val1 = p[i];
        numberOfOneBits += LUT16[val1 >> 16] + LUT16[val1 & 0xFFFF];
    }
    return numberOfOneBits;
}
Here are a few details:
buf contains 1 GB of data
LUT16[i] contains the number of one bits in the binary representation of i, for all 0 <= i < 2^16
I tried to use OpenMP to speed things up, but it doesn't help. I must add that I am using MS Visual Studio 2010 and that I have enabled the OpenMP directives. I believe that one of the reasons OpenMP doesn't speed things up is memory access time. Is there any way I could make use of DMA (direct memory access)?
Also, I should warn you that my OpenMP skills are limited; that being said, here is the OpenMP part (more or less the same code as above):
#pragma omp for schedule(dynamic,CHUNKSIZE)
for(int i = 0; i < n; i++) {
    unsigned int val1 = p[i];
    numberOfOneBits += LUT16[val1 >> 16] + LUT16[val1 & 0xFFFF];
}
CHUNKSIZE is set to 64. If I set it lower, the results are worse than in the serial version; if I set it higher, it doesn't do any good.
Also, I don't want to use the popcount instruction that processors provide, nor the SSE instructions.

Your LUT16 array is 512kB (assuming a long long is 64-bit), which will completely destroy your L1/L2 cache performance for arbitrary/random data (L1 is typically 32kB, L2 is typically 256kB).
Firstly, you don't need long long for this. Secondly, try LUT8 instead. Thirdly, just use the builtin __popcnt intrinsic.
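
As a rough illustration of the LUT8 suggestion, here is a minimal sketch with a 256-entry table of unsigned char, which takes 256 bytes and fits comfortably in L1. The names LUT8, init_lut8 and count_ones_lut8 are invented for this example and are not from the question.

```c
#include <stddef.h>
#include <stdint.h>

static unsigned char LUT8[256];

/* Fills LUT8[i] with the number of one bits in i. Call once before use. */
static void init_lut8(void)
{
    LUT8[0] = 0;
    for (int i = 1; i < 256; i++)
        LUT8[i] = (unsigned char)((i & 1) + LUT8[i >> 1]);
}

/* Counts the one bits in size bytes starting at buf, one byte per lookup. */
static uint64_t count_ones_lut8(const unsigned char *buf, size_t size)
{
    uint64_t total = 0;
    for (size_t i = 0; i < size; i++)
        total += LUT8[buf[i]];
    return total;
}
```

Whether this beats the 16-bit table on a given machine is something to measure; the sketch only shows the smaller table.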

Array Error - Access violation reading location 0xffffffff

I have previously used SIMD operators to improve the efficiency of my code, however I am now facing a new error which I cannot resolve. For this task, speed is paramount.
The size of the array will not be known until the data is imported, and may be very small (100 values) or enormous (10 million values). For the latter case, the code works fine, however I am encountering an error when I use fewer than 130036 array values.
Does anyone know what is causing this issue and how to resolve it?
I have attached the (tested) code involved, which will be used later in a more complicated function. The error occurs at "arg1List[i] = ..."
#include <iostream>
#include <cstdio>
#include <xmmintrin.h>
#include <emmintrin.h>
int main()
{
    int j;
    const int loop = 130036;
    const int SIMDloop = (int)(loop/4);
    __m128 *arg1List = new __m128[SIMDloop];
    // note: sizeof/__alignof here describe the pointer, not the __m128 elements
    printf("sizeof(arg1List)= %d, alignof(Arg1List)= %d, pointer= %p", (int)sizeof(arg1List), (int)__alignof(arg1List), (void*)arg1List);
    std::cout << std::endl;
    for (int i = 0; i < SIMDloop; i++)
    {
        j = 4*i;
        arg1List[i] = _mm_set_ps((j+1)/100.0f, (j+2)/100.0f, (j+3)/100.0f, (j+4)/100.0f);
    }
}

Alignment is the reason.
MOVAPS--Move Aligned Packed Single-Precision Floating-Point Values
[...] The operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated.
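You can see that the issue is gone as soon as you align your pointer:
// keep rawList for delete[]; uintptr_t (from <stdint.h>) avoids truncating the pointer on 64-bit
__m128 *rawList = new __m128[SIMDloop + 1];
__m128 *arg1List = (__m128*) (((uintptr_t) rawList + 15) & ~(uintptr_t) 15);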
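
The same round-up trick can be written as a small plain-C helper that over-allocates and hands back both pointers. This is only a sketch: alloc_aligned is a made-up name, the alignment is assumed to be a power of two, and on x86 compilers _mm_malloc and _mm_free are a ready-made alternative.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns a pointer aligned to align bytes (a power
   of two) with room for size bytes. *raw receives the pointer that must
   later be passed to free(). */
static void *alloc_aligned(size_t size, size_t align, void **raw)
{
    *raw = malloc(size + align - 1);   /* worst case needs align-1 bytes of padding */
    if (*raw == NULL)
        return NULL;
    /* round the address up to the next multiple of align */
    uintptr_t addr = ((uintptr_t)(*raw) + (align - 1)) & ~(uintptr_t)(align - 1);
    return (void *)addr;
}
```

The aligned pointer is used for the data, and the raw pointer is the one to free.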

Why is memset() incorrectly initializing int?

Why is the output of the following program 84215045?
#include <stdio.h>
#include <string.h>
int grid[110];
int main()
{
    memset(grid, 5, 100 * sizeof(int));
    printf("%d", grid[0]);
    return 0;
}

memset sets each byte of the destination buffer to the specified value. On your system, an int is four bytes, each of which is 5 after the call to memset. Thus, grid[0] has the value 0x05050505 (hexadecimal), which is 84215045 in decimal.
Some platforms provide alternative APIs to memset that write wider patterns to the destination buffer; for example, on OS X or iOS, you could use:
int pattern = 5;
memset_pattern4(grid, &pattern, sizeof grid);
to get the behavior that you seem to expect. What platform are you targeting?
In C++, you should just use std::fill_n:
std::fill_n(grid, 100, 5);

memset(grid, 5, 100 * sizeof(int));
You are setting 400 bytes, starting at (char*)grid and ending at (char*)grid + (100 * sizeof(int)), to the value 5 (the casts are necessary here because memset deals in bytes, whereas pointer arithmetic deals in objects).
84215045 in hex is 0x05050505; since int (on your platform/compiler/etc.) is represented by four bytes, when you print it, you get "four fives."
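
A short sketch of the difference, with helper names made up for illustration: the first function reproduces what memset does to a single int, the second fills element by element. The value 84215045 assumes a 4-byte int, as in the question.

```c
#include <string.h>

/* Value an int takes when every one of its bytes is set to byte. */
static int int_from_repeated_byte(unsigned char byte)
{
    int value;
    memset(&value, byte, sizeof value);
    return value;
}

/* Element-wise fill: sets each of the first n ints to value. */
static void fill_int(int *dst, size_t n, int value)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}
```

With a 4-byte int, int_from_repeated_byte(5) is 0x05050505, while fill_int(grid, 100, 5) stores the value 5 in each element.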

memset is about setting bytes, not values. One of the many ways to set array values in C++ is std::fill_n:
std::fill_n(grid, 100, 5);

Don't use memset.
You set each byte [] of the memory to the value of 5. Each int is 4 bytes long [5][5][5][5], which the compiler correctly interprets as 5*256*256*256 + 5*256*256 + 5*256 + 5 = 84215045. Instead, use a for loop, which also doesn't require sizeof(). In general, sizeof() means you're doing something the hard way.
for(int i=0; i<110; ++i)
    grid[i] = 5;

Well, the memset writes bytes, with the selected value. Therefore an int will look something like this:
00000101 00000101 00000101 00000101
Which is then interpreted as 84215045.

You haven't actually said what you want your program to do.
Assuming that you want to set each of the first 100 elements of grid to 5 (and ignoring the 100 vs. 110 discrepancy), just do this:
for (int i = 0; i < 100; i++) {
    grid[i] = 5;
}
I understand that you're concerned about speed, but your concern is probably misplaced. On the one hand, memset() is likely to be optimized and therefore faster than a simple loop. On the other hand, the optimization is likely to consist of writing more than one byte at a time, which is what this loop does. On the other other hand, memset() is a loop anyway; writing the loop explicitly rather than burying it in a function call doesn't change that. On the other other other hand, even if the loop is slow, it's not likely to matter; concentrate on writing clear code, and think about optimizing it if actual measurements indicate that there's a significant performance issue.
You've spent many orders of magnitude more time writing the question than your computer will spend setting grid.
Finally, before I run out of hands (too late!), it doesn't matter how fast memset() is if it doesn't do what you want. (Not setting grid at all is even faster!)

If you type man memset on your shell, it tells you that
void * memset(void *b, int c, size_t len)
In plain English: it fills the len bytes starting at b, setting each byte to the value c.
For your case,
memset(grid, 5, 100 * sizeof(int));
Since sizeof(int)==4, the above call is equivalent to:
for (int i=0; i<100; i++)
    grid[i]=0x05050505;
or
char *grid2 = (char*)grid;
for (int i=0; i<100*sizeof(int); i++)
    grid2[i]=0x05;
Either way, it prints 84215045.
But in most C code, we want to initialize a block of memory to zero:
char type --> \0 or NUL
int type --> 0
float type --> 0.0f
double type --> 0.0
pointer type --> NULL
With modern compilers such as gcc or clang, zeroing memory with memset gives exactly these values:
// variable length array (VLA), introduced in C99
int len = 20;
char carr[len];
int iarr[len];
float farr[len];
double darr[len];
memset(carr, 0, sizeof(char)*len);
memset(iarr, 0, sizeof(int)*len);
memset(farr, 0, sizeof(float)*len);
memset(darr, 0, sizeof(double)*len);
for (int i=0; i<len; i++)
{
    printf("%2d: %c\n", i, carr[i]);
    printf("%2d: %i\n", i, iarr[i]);
    printf("%2d: %f\n", i, farr[i]);
    printf("%2d: %lf\n", i, darr[i]);
}
But be aware that the ISO C standard does not guarantee this for every type: all-bits-zero is only guaranteed to mean zero for integer types, while for floating-point and pointer types it is implementation-specific.

Since memset writes bytes, I usually use it to set an int array to zero, like:
int a[100];
memset(a,0,sizeof(a));
or you can use it to set a char array, since a char is exactly one byte:
char a[100];
memset(a,'*',sizeof(a));
What's more, an int array can also be set to -1 by memset:
memset(a,-1,sizeof(a));
This is because -1 is 0xffffffff as an int and 0xff as a char (a byte).

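This code has been tested. Here is a way to memset a buffer the size of an "Integer" array to a byte value between 0 and 255. Note that memset writes the value into every byte, so if the buffer is read back as int, each element is 0x05050505 after the first call and 0xffffffff (-1) after the second.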
MinColCost=new unsigned char[(Len+1) * sizeof(int)];
memset(MinColCost,0x5,(Len+1)*sizeof(int));
memset(MinColCost,0xff,(Len+1)*sizeof(int));
