I'm having trouble understanding an SSE2 intrinsic. According to the Microsoft documentation, _mm_load_si128 requires a 16-byte-aligned address as its parameter. In the code I'm trying to understand, this does not seem to be the case:
void f(uchar* buf0, const int n)
{
ushort* buf = (ushort*)alignPtr(buf0, 16);
for(int i = 0; i < n; i += 16)
{
__m128i v0 = _mm_load_si128((__m128i*)(buf+i)); // 16-byte-aligned, since buf is 16-byte-aligned and i is divisible by 16.
__m128i v1 = _mm_load_si128((__m128i*)(buf+i+8)); // If buf+i is 16-byte-aligned, then buf+i+8 cannot be 16-byte-aligned.
}
}
I reduced the code to the relevant part and renamed some variables. The original code is from the OpenCV implementation of Konolige's block-matching algorithm (stereobm.cpp, especially line 313). My question is: why is the code correct, and what is written into v1?
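One detail that may resolve the apparent contradiction: pointer arithmetic on a ushort* advances in units of sizeof(ushort), not in bytes. Assuming ushort is a 2-byte unsigned short (as it is in OpenCV on common platforms), buf+i+8 lies 16 bytes past buf+i, not 8. A minimal sketch to check this; the helper name byte_offset is made up for illustration:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: distance in bytes between two pointers into the same array.
inline std::ptrdiff_t byte_offset(const unsigned short* from, const unsigned short* to) {
    assert(from != nullptr && to != nullptr);
    return reinterpret_cast<const char*>(to) - reinterpret_cast<const char*>(from);
}
```

For an array `unsigned short buf[32]`, `byte_offset(buf, buf + 8)` equals `8 * sizeof(unsigned short)`. Where that is 16, both loads in the loop hit 16-byte-aligned addresses, and v1 would hold elements i+8 through i+15.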
Related
I'm converting a project to compile with gcc instead of clang and I've run into an issue with a function that uses SSE intrinsics:
void dodgy_function(
const short* lows,
const short* highs,
short* mins,
short* maxs,
int its
)
{
__m128i v00[2] = { _mm_setzero_si128(), _mm_setzero_si128() };
__m128i v10[2] = { _mm_setzero_si128(), _mm_setzero_si128() };
for (int i = 0; i < its; ++i) {
reinterpret_cast<short*>(v00)[i] = lows[i];
reinterpret_cast<short*>(v10)[i] = highs[i];
}
reinterpret_cast<short*>(v00)[its] = reinterpret_cast<short*>(v00)[its - 1];
reinterpret_cast<short*>(v10)[its] = reinterpret_cast<short*>(v10)[its - 1];
__m128i v01[2] = {_mm_setzero_si128(), _mm_setzero_si128()};
__m128i v11[2] = {_mm_setzero_si128(), _mm_setzero_si128()};
__m128i min[2];
__m128i max[2];
min[0] = _mm_min_epi16(_mm_max_epi16(v11[0], v01[0]), _mm_min_epi16(v10[0], v00[0]));
max[0] = _mm_max_epi16(_mm_max_epi16(v11[0], v01[0]), _mm_max_epi16(v10[0], v00[0]));
min[1] = _mm_min_epi16(_mm_min_epi16(v11[1], v01[1]), _mm_min_epi16(v10[1], v00[1]));
max[1] = _mm_max_epi16(_mm_max_epi16(v11[1], v01[1]), _mm_max_epi16(v10[1], v00[1]));
reinterpret_cast<__m128i*>(mins)[0] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[0], min[0]);
reinterpret_cast<__m128i*>(maxs)[0] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[0], max[0]);
reinterpret_cast<__m128i*>(mins)[1] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[1], min[1]);
reinterpret_cast<__m128i*>(maxs)[1] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[1], max[1]);
}
With clang it gives me the expected output, but with gcc it prints all zeros: godbolt link
Playing around I discovered that gcc gives me the right results when I compile with -O1 but goes wrong with -O2 and -O3, suggesting the optimiser is going awry. Is there something particularly wrong I'm doing that would cause this behavior?
As a workaround I can wrap things up in a union and gcc will then give me the right result, but that feels a little icky: godbolt link 2
Any ideas?
The problem is that you're using short* to access the elements of a __m128i object. That violates the strict-aliasing rule. It's only safe to go the other way, dereferencing a __m128i* or, more normally, using _mm_load_si128( (const __m128i*)ptr ).
__m128i* is exactly like char* - you can point it at anything, but not vice versa: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
The only standard-blessed way to do type punning is with memcpy:
memcpy(v00, lows, its * sizeof(short));
memcpy(v10, highs, its * sizeof(short));
memcpy(reinterpret_cast<short*>(v00) + its, lows + its - 1, sizeof(short));
memcpy(reinterpret_cast<short*>(v10) + its, highs + its - 1, sizeof(short));
https://godbolt.org/z/f63q7x
I prefer just using aligned memory of the correct type directly:
alignas(16) short v00[16];
alignas(16) short v10[16];
auto mv00 = reinterpret_cast<__m128i*>(v00);
auto mv10 = reinterpret_cast<__m128i*>(v10);
_mm_store_si128(mv00, _mm_setzero_si128());
_mm_store_si128(mv10, _mm_setzero_si128());
_mm_store_si128(mv00 + 1, _mm_setzero_si128());
_mm_store_si128(mv10 + 1, _mm_setzero_si128());
for (int i = 0; i < its; ++i) {
v00[i] = lows[i];
v10[i] = highs[i];
}
v00[its] = v00[its - 1];
v10[its] = v10[its - 1];
https://godbolt.org/z/bfanne
I'm not positive that this setup is actually standard-blessed (it definitely is for _mm_load_ps since you can do it without type punning at all) but it does seem to also fix the issue. I'd guess that any reasonable implementation of the load/store intrinsics is going to have to provide the same sort of aliasing guarantees that memcpy does since it's more or less the kosher way to go from straight line to vectorized code in x86.
As you mentioned in your question, you can also force the alignment with a union, and I've used that too in pre-C++11 contexts. Even in that case, though, I still personally always write the loads and stores explicitly (even if they're just going to/from aligned memory), because issues like this tend to pop up if you don't.
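As a rough sketch of the "explicit loads and stores" style described above (this is not the code behind either godbolt link): the data stays in plain short arrays and only crosses into __m128i through load/store intrinsics. The function name min8_epi16 is invented for illustration, and the unaligned load/store variants are used so the caller needs no alignment guarantee.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: out[k] = min(a[k], b[k]) for 8 signed 16-bit lanes.
// No short* is ever used to poke inside a __m128i object.
inline void min8_epi16(const short* a, const short* b, short* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_min_epi16(va, vb));
}
```

If the arrays are known to be 16-byte aligned (for example via alignas(16)), _mm_load_si128 and _mm_store_si128 could be substituted.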
I need an implementation of upper_bound as described in the STL for my Metal compute kernel. Since there is nothing like it in the Metal standard library, I essentially copied it from <algorithm> into my shader file like so:
static device float* upper_bound( device float* first, device float* last, float val)
{
ptrdiff_t count = last - first;
while( count > 0){
device float* it = first;
ptrdiff_t step = count/2;
it += step;
if( !(val < *it)){
first = ++it;
count -= step + 1;
}else count = step;
}
return first;
}
I created a simple kernel to test it like so:
kernel void upper_bound_test(
device float* input [[buffer(0)]],
device uint* output [[buffer(1)]]
)
{
device float* where = upper_bound( input, input + 5, 3.1);
output[0] = where - input;
}
Which for this test has a hardcoded input size and search value. I also hardcoded a 5 element input buffer on the framework side as you'll see below. This kernel I expect to return the index of the first input greater than 3.1
It doesn't work. In fact output[0] is never written--as I preloaded the buffer with a magic number to see if it gets over-written. It doesn't. In fact after waitUntilCompleted, commandBuffer.error looks like this:
Error Domain = MTLCommandBufferErrorDomain
Code = 1
NSLocalizedDescription = "IOAcceleratorFamily returned error code 3"
What does error code 3 mean? Did my kernel get killed before it had a chance to finish?
Further, I tried just a linear search version of upper_bound like so:
static device float* upper_bound2( device float* first, device float* last, float val)
{
while( first < last && *first <= val)
++first;
return first;
}
This one works (sort-of). I have the same problem with a binary-search lower_bound from <algorithm>, yet a naive linear version works (sort-of). BTW, I tested my STL-copied versions from plain C code (with device removed, obviously) and they work fine outside of shader-land. Please tell me I'm doing something wrong and this is not a Metal compiler bug.
Now about that "sort-of" above: the linear search versions work on a 5s and mini-2 (A7s) (returning index 3 in the example above), but on a 6+ (A8) they give the right answer + 2^31. What the heck! Same exact code. Note that on the framework side I use uint32_t and on the shader side I use uint, which are the same thing. Note also that every pointer subtraction (ptrdiff_t is a signed 8-byte type) yields a small non-negative value. Why is the 6+ setting that high-order bit? And of course, why don't my real binary search versions work?
Here is the framework side stuff:
id<MTLFunction> upperBoundTestKernel = [_library newFunctionWithName: @"upper_bound_test"];
id <MTLComputePipelineState> upperBoundTestPipelineState = [_device
newComputePipelineStateWithFunction: upperBoundTestKernel
error: &err];
float sortedNumbers[] = {1., 2., 3., 4., 5.};
id<MTLBuffer> testInputBuffer = [_device
newBufferWithBytes:(const void *)sortedNumbers
length: sizeof(sortedNumbers)
options: MTLResourceCPUCacheModeDefaultCache];
id<MTLBuffer> testOutputBuffer = [_device
newBufferWithLength: sizeof(uint32_t)
options: MTLResourceCPUCacheModeDefaultCache];
*(uint32_t*)testOutputBuffer.contents = 42;//magic number better get clobbered
id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];
id<MTLComputeCommandEncoder> commandEncoder = [commandBuffer computeCommandEncoder];
[commandEncoder setComputePipelineState: upperBoundTestPipelineState];
[commandEncoder setBuffer: testInputBuffer offset: 0 atIndex: 0];
[commandEncoder setBuffer: testOutputBuffer offset: 0 atIndex: 1];
[commandEncoder
dispatchThreadgroups: MTLSizeMake( 1, 1, 1)
threadsPerThreadgroup: MTLSizeMake( 1, 1, 1)];
[commandEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
uint32_t answer = *(uint32_t*)testOutputBuffer.contents;
Well, I've found a solution/work-around. I guessed it was a pointer-aliasing problem since first and last pointed into the same buffer. So I changed them to offsets from a single pointer variable. Here's a re-written upper_bound2:
static uint upper_bound2( device float* input, uint first, uint last, float val)
{
while( first < last && input[first] <= val)
++first;
return first;
}
And a re-written test kernel:
kernel void upper_bound_test(
device float* input [[buffer(0)]],
device uint* output [[buffer(1)]]
)
{
output[0] = upper_bound2( input, 0, 5, 3.1);
}
This worked--completely. That is, not only did it fix the "sort-of" problem for the linear search, but a similarly re-written binary search worked too. I don't want to believe this though. The Metal shading language is supposed to be a subset of C++, yet standard pointer semantics don't work? Can I really not compare or subtract pointers?
Anyway, I don't recall seeing any docs saying there can be no pointer aliasing or what declaration incantation would help me here. Any more help?
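The re-written binary search is not shown above. The following is only a sketch of what an index-based version might look like, written as plain C++ so it can be checked on the host. The name upper_bound_index is an assumption; in a shader the input pointer would presumably carry the device qualifier, which has not been tested here.

```cpp
#include <cassert>

// Hypothetical index-based binary search: returns the index of the first
// element in input[first, last) that is greater than val, or last if none is.
// Only indices are compared and subtracted, never pointers.
inline unsigned upper_bound_index(const float* input, unsigned first, unsigned last, float val) {
    assert(first <= last);
    unsigned count = last - first;
    while (count > 0) {
        const unsigned step = count / 2;
        const unsigned mid = first + step;
        if (!(val < input[mid])) {
            first = mid + 1;      // answer lies strictly to the right of mid
            count -= step + 1;
        } else {
            count = step;         // answer is mid or to its left
        }
    }
    return first;
}
```

With the five sorted inputs from the test kernel and a search value of 3.1f, this returns index 3 on the host, matching what the linear version gave on the A7 devices.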
[UPDATE]
For the record, as pointed out by "slime" on Apple's dev forum:
https://developer.apple.com/library/ios/documentation/Metal/Reference/MetalShadingLanguageGuide/func-var-qual/func-var-qual.html#//apple_ref/doc/uid/TP40014364-CH4-SW3
"Buffers (device and constant) specified as argument values to a graphics or kernel function cannot be aliased—that is, a buffer passed as an argument value cannot overlap another buffer passed to a separate argument of the same graphics or kernel function."
But it's also worth noting that upper_bound() is not a kernel function and upper_bound_test() is not passed aliased arguments. What upper_bound_test() does do is create a local temporary that points into the same buffer as one of its arguments. Perhaps the docs should say what it means, something like: "No pointer aliasing to device and constant memory in any function is allowed including rvalues." I don't actually know if this is too strong.
I could not fully understand the consequences of what I read here: Casting an int pointer to a char ptr and vice versa
In short, would this work?
void set4Bytes(unsigned char* buffer) {
const uint32_t MASK = 0xffffffff;
if ((uintmax_t)buffer % 4) {//misaligned
for (int i = 0; i < 4; i++) {
buffer[i] = 0xff;
}
} else {//4-byte alignment
*((uint32_t*) buffer) = MASK;
}
}
Edit
There was a long discussion (it was in the comments, which mysteriously got deleted) about what type the pointer should be cast to in order to check the alignment. The subject is now addressed here.
This conversion is safe if you fill all 4 bytes with the same value. If byte order matters, it is not safe.
When you store an integer to fill 4 bytes at once, all 4 bytes are written, but their order in memory depends on the endianness.
No, it won't work in every case. Aside from endianness, which may or may not be an issue, you assume that the alignment of uint32_t is 4. But this quantity is implementation-defined (C11 Draft N1570 Section 6.2.8). You can use the _Alignof operator to get the alignment in a portable way.
Second, the effective type (ibid. Sec. 6.5) of the location pointed to by buffer may not be compatible with uint32_t (e.g. if buffer points to an unsigned char array). In that case you break strict aliasing rules once you try reading through the array itself or through a pointer of a different type.
Assuming that the pointer actually points to an array of unsigned char, the following code will work:
typedef union { unsigned char chr[sizeof(uint32_t)]; uint32_t u32; } conv_t;
void set4Bytes(unsigned char* buffer) {
const uint32_t MASK = 0xffffffffU;
if ((uintptr_t)buffer % _Alignof(uint32_t)) {// misaligned
for (size_t i = 0; i < sizeof(uint32_t); i++) {
buffer[i] = 0xffU;
}
} else { // correct alignment
conv_t *cnv = (conv_t *) buffer;
cnv->u32 = MASK;
}
}
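Another option, shown here only as a sketch, is to copy the bytes with memcpy, which needs neither an alignment check nor a cast to a wider type. The name set4BytesMemcpy is made up for illustration; compilers commonly turn a fixed-size memcpy like this into a single store where the target allows it, though that is not guaranteed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical alternative: writes sizeof(uint32_t) bytes of 0xff starting at
// buffer, for any alignment of buffer and without aliasing concerns.
inline void set4BytesMemcpy(unsigned char* buffer) {
    assert(buffer != nullptr);
    const std::uint32_t mask = 0xffffffffu;
    std::memcpy(buffer, &mask, sizeof mask);
}
```

Because every byte of the mask is 0xff, the result does not depend on endianness; for other values the byte order in memory would still follow the host.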
This code might be of help to you. It shows a 32-bit number being built by assigning its contents a byte at a time, forcing misalignment. It compiles and works on my machine.
#include<stdint.h>
#include<stdio.h>
#include<inttypes.h>
#include<stdlib.h>
int main () {
uint32_t *data = (uint32_t*)malloc(sizeof(uint32_t)*2);
char *buf = (char*)data;
uintptr_t addr = (uintptr_t)buf;
int i,j;
i = !(addr%4) ? 1 : 0;
uint32_t x = (1<<6)-1;
for( j=0;j<4;j++ ) buf[i+j] = ((char*)&x)[j];
printf("%" PRIu32 "\n",*((uint32_t*) (addr+i)) );
}
As mentioned by @Learner, endianness must be obeyed. The code above is not portable and would break on a big-endian machine.
Note that my compiler throws the error "cast from ‘char*’ to ‘unsigned int’ loses precision [-fpermissive]" when trying to cast a char* to an unsigned int, as done in the original post. This post explains that uintptr_t should be used instead.
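A small sketch of an alignment check built on uintptr_t, as suggested above. The helper name is_aligned is an assumption, and the check relies on the usual (implementation-defined) mapping in which the integer value of a pointer reflects its address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of alignment (a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

A call such as `is_aligned(buffer, alignof(std::uint32_t))` would replace the hand-written `% 4` test and avoids hard-coding the alignment.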
In addition to the endian issue, which has already been mentioned here:
CHAR_BIT - the number of bits per char - should also be considered.
It is 8 on most platforms, where for (int i=0; i<4; i++) should work fine.
A safer way of doing it would be for (int i=0; i<sizeof(uint32_t); i++).
Alternatively, you can include <limits.h> and use for (int i=0; i<32/CHAR_BIT; i++).
Use reinterpret_cast<>() if you want to ensure the underlying data does not "change shape".
As Learner has mentioned, when you store data in machine memory endianness becomes a factor. If you know how the data is stored in memory (with the correct endianness) and you are specifically testing its layout as an alternate representation, then you would want to use reinterpret_cast<>() to test that memory, as a specific type, without modifying the original storage.
Below, I've modified your example to use reinterpret_cast<>():
void set4Bytes(unsigned char* buffer) {
const uint32_t MASK = 0xffffffff;
if (reinterpret_cast<uintptr_t>(buffer) % 4) {//misaligned (test the address itself, not the value it points to)
for (int i = 0; i < 4; i++) {
buffer[i] = 0xff;
}
} else {//4-byte alignment
*reinterpret_cast<unsigned int *>(buffer) = MASK;
}
}
It should also be noted that your function appears to set the first 4 bytes (32 bits of contiguous memory) of the buffer to 0xFFFFFFFF, regardless of which branch it takes.
Your code is perfect for working with any architecture with 32bit and up. There is no issue with byte ordering since all your source bytes are 0xFF.
On x86 or x64 machines, the extra work needed to handle a possibly unaligned access to RAM is done by the CPU and is transparent to the programmer (since the Pentium II), at some performance cost per access. So, if you are just setting the first four bytes of a buffer a few times, you can simplify your function:
void set4Bytes(unsigned char* buffer) {
const uint32_t MASK = 0xffffffff;
*((uint32_t *)buffer) = MASK;
}
Some readings:
A Linux kernel doc about UNALIGNED MEMORY ACCESSES
Intel Architecture Optimization Manual, section 3.4
Windows Data Alignment on IPF, x86, and x64
A Practical 'Aligned vs. unaligned memory access', by Alexander Sandler
In my program I read in a file (here only a test file of about 200k data points; later there will be millions). What I do now is:
for (int i=0;i<n;i++) {
fid.seekg(4,ios_base::cur);
fid.read((char*) &x[i],8);
fid.seekg(8,ios_base::cur);
fid.read((char*) &y[i],8);
fid.seekg(8,ios_base::cur);
fid.read((char*) &z[i],8);
fid.read((char*) &d[i],8);
d[i] = (d[i] - p)/p;
z[i] *= cc;
}
Whereby n denotes the number of points to read in.
Afterwards I write them again with
for(int i=0;i<n;i++){
fid.write((char*) &d[i],8);
fid.write((char*) &z[i],8);
temp = (d[i] + 1) * p;
fid.write((char*) &temp,8);
}
The writing is faster than the reading (time measured with clock_t).
My question now is: have I made some rather stupid mistake in the reading, or is this behavior to be expected?
I'm using Win XP with a magnetic drive.
yours magu_
You're using seekg too often. I see that you're using it to skip bytes, but you could as well read the complete buffer and then skip the bytes in the buffer:
char buffer[52];
for (int i=0;i<n;i++) {
fid.read(buffer, sizeof(buffer));
memcpy(&x[i], &buffer[4], sizeof(x[i]));
memcpy(&y[i], &buffer[20], sizeof(y[i]));
// etc
}
However, you can define a struct that represents the data in your file:
#pragma pack(push, 1)
struct Item
{
char dummy1[4]; // skip 4 bytes
__int64 x;
char dummy2[8]; // skip 8 bytes
__int64 y;
char dummy3[8]; // skip 8 bytes
__int64 z;
__int64 d;
};
#pragma pack(pop)
then declare an array of those structs and read all data at once:
Item* items = new Item[n];
fid.read(reinterpret_cast<char*>(items), n * sizeof(Item)); // read() takes a char*; reading all data at once will be amazingly fast
(remark: I don't know the types of x, y, z and d, so I assume __int64 here)
I personally would (at least) do this:
for (int i=0;i<n;i++) {
char dummy[8];
fid.read(dummy,4);
fid.read((char*) &x[i],8);
fid.read(dummy,8);
fid.read((char*) &y[i],8);
fid.read(dummy,8);
fid.read((char*) &z[i],8);
fid.read((char*) &d[i],8);
d[i] = (d[i] - p)/p;
z[i] *= cc;
}
Using a struct, or reading large amounts of data in one go (say, adding a second layer where you read 4KB at a time and then use a pair of functions that "skip" and "fetch" the different fields), would be a bit more work, but likely much faster.
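A minimal sketch of the "read in one go, then fetch fields from memory" idea. It assumes the 52-byte record layout implied by the question's seek/read sequence and assumes the fields are 8-byte doubles (the question does not state their type). The names Points, pack_record and read_points are hypothetical, and pack_record exists only to build sample input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Assumed layout: 4 skipped, x, 8 skipped, y, 8 skipped, z, d (8 bytes each).
constexpr std::size_t kRecordSize = 52;

struct Points {
    std::vector<double> x, y, z, d;
};

// Builds one record with zeroed padding (only used to create sample input).
inline std::string pack_record(double x, double y, double z, double d) {
    static_assert(sizeof(double) == 8, "layout assumes 8-byte doubles");
    std::string rec(kRecordSize, '\0');
    std::memcpy(&rec[4], &x, 8);
    std::memcpy(&rec[20], &y, 8);
    std::memcpy(&rec[36], &z, 8);
    std::memcpy(&rec[44], &d, 8);
    return rec;
}

// Reads up to n records with a single read() call, then unpacks from memory.
inline Points read_points(std::istream& in, std::size_t n) {
    std::vector<char> raw(n * kRecordSize);
    in.read(raw.data(), static_cast<std::streamsize>(raw.size()));
    const std::size_t got = static_cast<std::size_t>(in.gcount()) / kRecordSize;
    assert(got <= n);
    Points p;
    p.x.resize(got); p.y.resize(got); p.z.resize(got); p.d.resize(got);
    for (std::size_t i = 0; i < got; ++i) {
        const char* rec = raw.data() + i * kRecordSize;
        std::memcpy(&p.x[i], rec + 4, 8);
        std::memcpy(&p.y[i], rec + 20, 8);
        std::memcpy(&p.z[i], rec + 36, 8);
        std::memcpy(&p.d[i], rec + 44, 8);
    }
    return p;
}
```

For millions of points a single buffer of n * 52 bytes may be larger than desired; the same unpacking loop could instead be run over fixed-size chunks. The scaling of d and z from the question would be applied after unpacking.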
Another option is to use mmap in Linux or MapViewOfFile in Windows. This method reduces the overhead in reading a file by a small portion, since there is one less copy required to transfer the data to the application.
Edit: I should add "Make sure you make comparative measurements", and if your application is meant to run on many machines, make sure you make measurements on more than one type of machine, with different alternatives of disk drive, processor and memory. You don't really want to tweak the code so that it runs 50% faster on your machine, but 25% slower on another machine.
The assert() statements are the most important part of this code: if your platform ever changes and the width of your native types changes, the assertions will fail. Instead of seeking, I would read into a dummy area. The p* variables make the code easier to read, IMO.
assert(sizeof x[0] == 8);
assert(sizeof y[0] == 8);
assert(sizeof z[0] == 8);
assert(sizeof d[0] == 8);
for (int i=0;i<n;i++) {
char unused[8];
char * px = (char *) &x[i];
char * py = (char *) &y[i];
char * pz = (char *) &z[i];
char * pd = (char *) &d[i];
fid.read(unused, 4);
fid.read(px, 8);
fid.read(unused, 8);
fid.read(py, 8);
fid.read(unused, 8);
fid.read(pz, 8);
fid.read(pd, 8);
d[i] = (d[i] - p)/p;
z[i] *= cc;
}
I have previously used SIMD operators to improve the efficiency of my code; however, I am now facing a new error which I cannot resolve. For this task, speed is paramount.
The size of the array will not be known until the data is imported, and may be very small (100 values) or enormous (10 million values). For the latter case the code works fine, but I encounter an error when I use fewer than 130036 array values.
Does anyone know what is causing this issue and how to resolve it?
I have attached the (tested) code involved, which will be used later in a more complicated function. The error occurs at "arg1List[i] = ..."
#include <iostream>
#include <xmmintrin.h>
#include <emmintrin.h>
void main()
{
int j;
const int loop = 130036;
const int SIMDloop = (int)(loop/4);
__m128 *arg1List = new __m128[SIMDloop];
printf("sizeof(arg1List)= %d, alignof(Arg1List)= %d, pointer= %p", sizeof(arg1List), __alignof(arg1List), arg1List);
std::cout << std::endl;
for (int i = 0; i < SIMDloop; i++)
{
j = 4*i;
arg1List[i] = _mm_set_ps((j+1)/100.0f, (j+2)/100.0f, (j+3)/100.0f, (j+4)/100.0f);
}
}
Alignment is the reason.
MOVAPS--Move Aligned Packed Single-Precision Floating-Point Values
[...] The operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated.
You can see the issue is gone as soon as you align your pointer:
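__m128 *raw = new __m128[SIMDloop + 1]; // keep the original pointer so it can still be passed to delete[]
__m128 *arg1List = (__m128*) (((uintptr_t) raw + 15) & ~(uintptr_t)15); // round up to a 16-byte boundary; uintptr_t (from <cstdint>) avoids truncating 64-bit pointers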
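An alternative sketch that asks the allocator for aligned memory directly, using _mm_malloc and _mm_free (provided alongside the SSE intrinsics by MSVC, GCC and Clang), so no manual pointer rounding is needed. The function name make_arg_list is invented for illustration; the fill pattern mirrors the loop in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: __m128, _mm_set_ps, _mm_malloc, _mm_free

// Hypothetical helper: allocates count __m128 values on a 16-byte boundary
// and fills them as in the question. Release the result with _mm_free.
inline __m128* make_arg_list(std::size_t count) {
    assert(count > 0);
    __m128* list = static_cast<__m128*>(_mm_malloc(count * sizeof(__m128), 16));
    if (list == nullptr) return nullptr;
    for (std::size_t i = 0; i < count; ++i) {
        const float j = static_cast<float>(4 * i);
        list[i] = _mm_set_ps((j + 1) / 100.0f, (j + 2) / 100.0f,
                             (j + 3) / 100.0f, (j + 4) / 100.0f);
    }
    return list;
}
```

Memory obtained this way must be released with _mm_free rather than delete[].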