C++ SSE Optimisation with multiple functions

I have some code that is structurally similar to the below: a bunch of small SSE helper functions, a larger one that does most of the work, and the public function that organises data, runs the large function in a loop, and deals with any leftover data.
This gave about a 2x speed boost over the scalar implementation; however, I would like to obtain more if possible. Besides some conceptual issues, there were some things in the disassembly I did not like (I only looked at x86 VC++ 2010 in detail, but I support GCC as well).
For at least some targets I can only use SSE and SSE2 here, but if it is worth a separate build I could possibly use newer instruction sets as well.
Problem 1:
All the small helpers got inlined into the large helper nicely, but the large helper itself did not get inlined into foo.
However, even though it is only referenced by one function in one source file, and there are plenty of registers (looking at the algorithm, I am pretty sure it needs at most 12 XMM registers, apart from loading the data arrays), the compiler seems to want to follow the normal calling conventions for fooHelper.
So after putting data into XMM registers in foo, it spills them back onto the stack and passes pointers; then after the loop and the tidy-up work, it loads that stack data back into XMM registers so I can unload it again...
I guess I could force it to inline fooHelper (see the macro sketch below), but that means a very large number of duplicated instructions, because it wouldn't just use 4 XMM registers to do the job. I could also avoid SSE in foo itself, which would remove the load/store issue there, but fooHelper would still be doing completely unrequired loads and stores on those 4 state variables...
Ideally, since this is a private function, a way to ignore the normal calling conventions would be nice, and I am sure this will come up in lots of other, larger pieces of SSE code where I don't really want everything fully inlined.
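For reference, forcing the inline portably on the two compilers I care about could be done with a macro like this (FORCE_INLINE is just a name I made up):
#ifdef _MSC_VER
  #define FORCE_INLINE __forceinline
#else
  #define FORCE_INLINE inline __attribute__((always_inline))
#endif
//usage: static FORCE_INLINE void fooHelperA(__m128i &a, __m128i b, __m128i x, int s);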
Problem 2:
The implementation basically works on 4 state vectors organised as AAAA, BBBB, CCCC, DDDD, so that the code can simply be written as if it were working with A, B, C and D as separate variables, while processing all 4 data streams at once.
However, the output itself is in the form ABCD, ABCD, ABCD, ABCD, and the input is also 4 separate buffers, requiring _mm_set_epi32 to load it.
Is there a better way to deal with these inputs and outputs (the format of which cannot practically be changed)?
namespace
{
void fooHelperA(__m128i &a, __m128i b, __m128i x, int s)
{
...small function (<5 sse operations)...
}
...bunch of other small functions...
//
void fooHelper(
const int *data1, const int *data2, const int *data3, const int *data4,
__m128i &a, __m128i &b, __m128i &c, __m128i &d)
{
//Get the current piece of data
__m128i x = _mm_set_epi32(data1[0], data2[0], data3[0], data4[0]);
...do stuff with data...
fooHelperA(a, b, x, 5);
...
x = _mm_set_epi32(data1[1], data2[1], data3[1], data4[1]);
...
fooHelperA(b, a, x, 7);
... lots more code ...
x = _mm_set_epi32(data1[3], data2[3], data3[3], data4[3]);
...
}
}
void foo(
const char *data1, const char *data2, const char *data3, const char *data4,
int *out1, int *out2, int *out3, int *out4,
size_t len)
{
__m128i a = _mm_setzero_si128();
__m128i b = _mm_setzero_si128();
__m128i c = _mm_setzero_si128();
__m128i d = _mm_setzero_si128();
while (len >= 16) //expected to loop <25 times for datasets in question
{
fooHelper((const int*)data1, (const int*)data2, (const int*)data3, (const int*)data4, a,b,c,d);
data1 += 16;
data2 += 16;
data3 += 16;
data4 += 16;
len -= 16;
}
if (len)
{
int buffer[4][4];
...pad data into buffer...
fooHelper(buffer[0], buffer[1], buffer[2], buffer[3], a,b,c,d);
}
ALIGNED(16, int[4][4]) tmp;
_mm_store_si128((__m128i*)tmp[0], a);
_mm_store_si128((__m128i*)tmp[1], b);
_mm_store_si128((__m128i*)tmp[2], c);
_mm_store_si128((__m128i*)tmp[3], d);
out1[0] = tmp[0][0];
out2[0] = tmp[0][1];
out3[0] = tmp[0][2];
out4[0] = tmp[0][3];
out1[1] = tmp[1][0];
out2[1] = tmp[1][1];
out3[1] = tmp[1][2];
out4[1] = tmp[1][3];
out1[2] = tmp[2][0];
out2[2] = tmp[2][1];
out3[2] = tmp[2][2];
out4[2] = tmp[2][3];
out1[3] = tmp[3][0];
out2[3] = tmp[3][1];
out3[3] = tmp[3][2];
out4[3] = tmp[3][3];
}

Some advice,
1) Looking at your code and data description, it seems you can gain a lot by moving your data organisation from SoA (struct of arrays, your AAAA vectors) to AoS (array of structs), where your input data is already organised as ABCD: you get one big input vector (4x bigger)!
2) Take care with your data alignment. For now it doesn't matter much, as you already pay a penalty through the _mm_set_epi32 loads, but if you switch to AoS you should be able to use a fast aligned load (memory to XMM).
3) The end of the function is a bit strange (I cannot test it right now); I really don't understand why you need the tmp 2D array.
4) Interleaving (and the inverse operation) can be done with the usual SoA/AoS conversion patterns; Intel wrote a lot of papers on this topic when promoting their SIMD instruction sets. A sketch follows below.
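For illustration, here is a minimal SSE2 sketch of such a conversion: a 4x4 transpose of 32-bit lanes built from the unpack intrinsics (the helper name is mine; nothing beyond SSE2 is assumed):
#include <emmintrin.h> // SSE2

// Transpose four __m128i rows in place: (AAAA, BBBB, CCCC, DDDD)
// becomes (ABCD, ABCD, ABCD, ABCD), one register per data stream.
static inline void transpose4x4_epi32(__m128i &r0, __m128i &r1,
                                      __m128i &r2, __m128i &r3)
{
    __m128i t0 = _mm_unpacklo_epi32(r0, r1); // A0 B0 A1 B1
    __m128i t1 = _mm_unpacklo_epi32(r2, r3); // C0 D0 C1 D1
    __m128i t2 = _mm_unpackhi_epi32(r0, r1); // A2 B2 A3 B3
    __m128i t3 = _mm_unpackhi_epi32(r2, r3); // C2 D2 C3 D3
    r0 = _mm_unpacklo_epi64(t0, t1);         // A0 B0 C0 D0
    r1 = _mm_unpackhi_epi64(t0, t1);         // A1 B1 C1 D1
    r2 = _mm_unpacklo_epi64(t2, t3);         // A2 B2 C2 D2
    r3 = _mm_unpackhi_epi64(t2, t3);         // A3 B3 C3 D3
}
Applied to a, b, c, d at the end of foo, this replaces the tmp scatter with one _mm_storeu_si128 per output buffer; applied to four plain loads on input, it replaces the four _mm_set_epi32 gathers.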
good luck,
alex

Related

What is the "correct" way to go from avx/sse masks to avx512 masks?

I have some existing avx/sse masks that I got the old way:
auto mask_sse = _mm_cmplt_ps(a, b);
auto mask_avx = _mm_cmp_ps(a, b, 17);
In some circumstances when mixing old avx code with new avx512 code, I want to convert these old style masks into the new avx512 __mmask4 or __mmask8 types.
I tried this:
auto mask_avx512 = _mm_cmp_ps_mask(sse_mask, _mm_setzero_ps(), 25/*nge unordered quiet*/);
and it seems to work for plain old outputs of comparisons, but I don't think it would capture positive NaNs correctly that could have been used with an SSE4.1 _mm_blendv_ps.
There also is good old _mm_movemask_ps but that looks like it puts the mask all the way out in a general purpose register, and I would need to chain it with a _cvtu32_mask8 to pull it back into one of the dedicated mask registers.
Is there a cleaner way to just directly pull the sign bit out of an old style mask into one of the k registers?
Example Code:
Here's an example program doing the sort of mask conversion the first way I mentioned above
#include "x86intrin.h"
#include <cassert>
#include <cstdio>
int main()
{
auto a = _mm_set_ps(-1, 0, 1, 2);
auto c = _mm_set_ps(3, 4, 5, 6);
auto sse_mask = _mm_cmplt_ps(a, _mm_setzero_ps());
auto avx512_mask = _mm_cmp_ps_mask(sse_mask, _mm_setzero_ps(), 25);
alignas(16) float v1[4];
alignas(16) float v2[4];
_mm_store_ps(v1, _mm_blendv_ps(a, c, sse_mask));
_mm_store_ps(v2, _mm_mask_blend_ps(avx512_mask, a, c));
assert(v1[0] == v2[0]);
assert(v1[1] == v2[1]);
assert(v1[2] == v2[2]);
assert(v1[3] == v2[3]);
return 0;
}
Use an AVX-512 compare intrinsic to get an AVX-512 mask in the first place (like _mm_cmp_ps_mask); that's going to be significantly more efficient than comparing into a vector and then converting it, unless the compiler optimizes away this inefficiency for you. (Consider using a wrapper library like Agner Fog's VCL to try to abstract away the difference. The VCL licence changed recently from GPL to Apache.)
But if you really need this (e.g. as a stop-gap before you finish optimizing), you don't need an FP compare. _mm_cmp_ps in C produces a __m128 result, but it's not really a vector of floats¹. It's all-one-bits / all-zero-bits. You just want the bits, so you're looking for the AVX-512 equivalent of vmovmskps, but into a k register instead of GP integer. i.e. VPMOVD2M k, x/y/zmm for 32-bit source elements.
__m128 cmpvec = _mm_cmplt_ps(v, _mm_setzero_ps() );
__mmask8 cmpmask = _mm_movepi32_mask( _mm_castps_si128(cmpvec) ); // <----
// equivalent to comparing into a mask in the first place:
__mmask8 cmpmask = _mm_cmp_ps_mask(v, _mm_setzero_ps(), _CMP_LT_OQ);
// equivalent to (if I got this right)
__mmask8 cmpmask = _mm_fpclass_ps_mask(v, 0x40 | 0x10); // negative | negative_inf
https://uops.info/ is down right now, otherwise I'd check latency and execution ports of VPMOVD2M vs. VCMPPS into mask (for an UNORD predicate) vs. VFPCLASSPS.
Footnote 1: You could use AVX-512 vfpclassps into a mask, or even compare against itself with a vcmpps predicate like UNORD to detect NaN or not. But I think those are slower.
I would need to chain it with a _cvtu32_mask8 to pull it back into one of the dedicated mask registers.
The way compilers currently do things, __mmask8 is just a typedef for unsigned char, and __mmask16 is unsigned short. They're freely convertible without intrinsics, for good or ill. But in asm, it takes a kmovb k1, eax instruction to get the data from a GP reg to a k mask reg, and that instruction can only run on port 5 in current CPUs.
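For completeness, a sketch of the movemask route discussed above (the function name is mine; _cvtu32_mask8 needs AVX512DQ):
#include <immintrin.h>

// Conversion via a GP register: vmovmskps then kmovb.
// One instruction longer than vpmovd2m, and the kmov needs port 5.
__mmask8 mask_from_vector(__m128 cmpvec)
{
    unsigned bits = (unsigned)_mm_movemask_ps(cmpvec); // sign bits -> GP reg
    return _cvtu32_mask8(bits);                        // GP reg -> k reg
}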

Unaligned load versus unaligned store

The short question: if I have a function that takes two vectors, one input and the other output (no aliasing), and I can only align one of them, which one should I choose?
The longer version: consider a function,
void func(size_t n, void *in, void *out)
{
__m256i *in256 = reinterpret_cast<__m256i *>(in);
__m256i *out256 = reinterpret_cast<__m256i *>(out);
while (n >= 32) {
__m256i data = _mm256_loadu_si256(in256++);
// process data
_mm256_storeu_si256(out256++, data);
n -= 32;
}
// process the remaining n % 32 bytes;
}
If in and out are both 32-byte aligned, then there's no penalty for using vmovdqu instead of vmovdqa. The worst-case scenario is that both are unaligned, and one in four loads/stores will cross the cache-line boundary.
In this case, I can align one of them to the cache line boundary by processing a few elements first before entering the loop. However, the question is which should I choose? Between unaligned load and store, which one is worse?
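For concreteness, the peeling I have in mind looks roughly like this (a sketch; the per-byte step is a placeholder for real processing):
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void func_store_aligned(size_t n, const uint8_t *in, uint8_t *out)
{
    // Peel scalar iterations until the output pointer is 32-byte aligned.
    while (n > 0 && (reinterpret_cast<uintptr_t>(out) & 31) != 0) {
        *out++ = *in++; // placeholder per-byte processing
        --n;
    }
    while (n >= 32) {
        __m256i data = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(in));
        // process data
        _mm256_store_si256(reinterpret_cast<__m256i *>(out), data); // aligned
        in += 32;
        out += 32;
        n -= 32;
    }
    // process the remaining n % 32 bytes
}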
At the risk of stating the obvious here: there is no "right answer" except "you need to benchmark both with actual code and actual data". Whichever variant is faster depends strongly on the CPU you are using, the amount of calculation you are doing on each packet, and many other things.
As noted in the comments, you should also try non-temporal stores. What can also sometimes help is to load the input of the next data packet inside the current loop, i.e.:
__m256i next = _mm256_loadu_si256(in256++);
for(...){
__m256i data = next; // usually 0 cost
next = _mm256_loadu_si256(in256++);
// do computations and store data
}
If the calculations you are doing have unavoidable data latencies, you should also consider processing two packets interleaved (this uses twice as many registers, though).
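A sketch of that interleaving, with a hypothetical process() standing in for the per-packet computation:
#include <immintrin.h>
#include <cstddef>

// Placeholder for the real per-vector computation.
static inline __m256i process(__m256i v) { return v; }

void func_interleaved(size_t n, const void *in, void *out)
{
    const __m256i *in256 = static_cast<const __m256i *>(in);
    __m256i *out256 = static_cast<__m256i *>(out);
    // Two independent dependency chains per iteration help hide latency.
    while (n >= 64) {
        __m256i d0 = _mm256_loadu_si256(in256++);
        __m256i d1 = _mm256_loadu_si256(in256++);
        _mm256_storeu_si256(out256++, process(d0));
        _mm256_storeu_si256(out256++, process(d1));
        n -= 64;
    }
    // handle the remaining bytes with the single-vector loop
}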

Swap two variables with XOR

With the following method we can swap two variables A and B:
A = A XOR B
B = A XOR B
A = A XOR B
I want to implement such a method in C++ that operates on all types (int, float, char, ...) as well as structures. As we know, all types of data, including structures, occupy a specific amount of memory, for example 4 bytes or 8 bytes.
In my opinion this method of swapping must work with all types excluding pointer-based types. It should swap the memory contents, that is, the bits, of the two variables.
My Question
I have no idea how to implement such a method in C++ so that it works with structures (those that do not contain any pointers). Can anyone please help me?
Your problem is easily reduced to XOR-swapping buffers of raw memory, something like this:
void xorswap(void *a, void *b, size_t size);
That can be implemented in terms of xorswaps of primitive types. For example:
#include <cstdint> // uint64_t, uint8_t
#include <cstddef> // size_t

void xorswap(void *a, void *b, size_t size)
{
if (a == b)
return; //nothing to do
size_t qwords = size / 8;
size_t rest = size % 8;
uint64_t *a64 = (uint64_t *)a;
uint64_t *b64 = (uint64_t *)b;
for (size_t i = 0; i < qwords; ++i)
xorswap64(a64++, b64++);
uint8_t *a8 = (uint8_t*)a64;
uint8_t *b8 = (uint8_t*)b64;
for (size_t i = 0; i < rest; ++i)
xorswap8(a8++, b8++);
}
I leave the implementation of xorswap64() and xorswap8() as an exercise to the reader.
Also note that to be efficient, the original buffers should be 8-byte aligned. If that's not the case, depending on the architecture, the code may work suboptimally or not work at all (again, an exercise to the reader ;-).
Other optimizations are possible. You can even use Duff's device to unroll the last loop, but I don't know if it is worth it. You'll have to profile it to know for sure.
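For reference, the two helpers could be as simple as this (a sketch; note that XOR swap zeroes the data if both pointers refer to the same object, which the a == b check above guards against for whole buffers):
#include <cstdint>

// XOR-swap one element; a and b must not alias.
static void xorswap64(uint64_t *a, uint64_t *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}

static void xorswap8(uint8_t *a, uint8_t *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}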
You can use the bitwise XOR operator "^" in C to XOR two values. To XOR 'a' and 'b', start XORing from the least significant bit towards the most significant bit.

Memory Access Violations When Using SSE Operations

I've been trying to re-implement some existing vector and matrix classes to use SSE3 commands, and I seem to be running into these "memory access violation" errors whenever I perform a series of operations on an array of vectors. I'm relatively new to SSE, so I've been starting off simple. Here's the entirety of my vector class:
class SSEVector3D
{
public:
SSEVector3D();
SSEVector3D(float x, float y, float z);
SSEVector3D& operator+=(const SSEVector3D& rhs); //< Elementwise Addition
float x() const;
float y() const;
float z() const;
private:
float m_coords[3] __attribute__ ((aligned (16))); //< The x, y and z coordinates
};
So, not a whole lot going on yet, just some constructors, accessors, and one operation. Using my (admittedly limited) knowledge of SSE, I implemented the addition operation as follows:
SSEVector3D& SSEVector3D::operator+=(const SSEVector3D& rhs)
{
__m128 * pLhs = (__m128 *) m_coords;
__m128 * pRhs = (__m128 *) rhs.m_coords;
*pLhs = _mm_add_ps(*pLhs, *pRhs);
return (*this);
}
To speed-test my new vector class against the old one (to see if it's worth re-implementing the whole thing), I created a simple program that generates a random array of SSEVector3D objects and adds them together. Nothing too complicated:
SSEVector3D sseSum(0, 0, 0);
for(i=0; i<sseVectors.size(); i++)
{
sseSum += sseVectors[i];
}
printf("Total: %f %f %f\n", sseSum.x(), sseSum.y(), sseSum.z());
The sseVectors variable is an std::vector containing elements of type SSEVector3D, whose components are all initialized to random numbers between -1 and 1.
Here's the issue I'm having. If the size of sseVectors is 8,191 or less (a number I arrived at through a lot of trial and error), this runs fine. If the size is 8,192 or more, I get this error when I try to run it:
signal: SIGSEGV, si_code: 0 (memory access violation at address: 0x00000080)
However, if I comment out that print statement at the end, I get no error even if sseVectors has a size of 8,192 or more.
Is there something wrong with the way I've written this vector class? I'm running Ubuntu 12.04.1 with GCC version 4.6.
First and foremost, don't do this:
__m128 * pLhs = (__m128 *) m_coords;
__m128 * pRhs = (__m128 *) rhs.m_coords;
*pLhs = _mm_add_ps(*pLhs, *pRhs);
With SSE, always do your loads and stores explicitly via the appropriate intrinsics, never by just dereferencing. Instead of storing an array of 3 floats in your class, store a value of type __m128. That should make the compiler align instances of your class correctly, without any need for align attributes.
Note, however, that this won't work very well with MSVC. MSVC seems to generally be unable to cope with alignment requirements stronger than 8-byte aligned for by-value arguments :-(. The last time I needed to port SSE code to windows, my solution was to use Intel's C++ compiler for the SSE parts instead of MSVC...
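A minimal sketch of that suggestion, keeping the vector in a __m128 member (the fourth lane is padding; the accessor details are my own):
#include <xmmintrin.h>

class SSEVector3D
{
public:
    SSEVector3D() : m_data(_mm_setzero_ps()) {}
    SSEVector3D(float x, float y, float z)
        : m_data(_mm_set_ps(0.0f, z, y, x)) {}

    SSEVector3D& operator+=(const SSEVector3D& rhs)
    {
        // No pointer casts: the value already lives in an XMM-sized member.
        m_data = _mm_add_ps(m_data, rhs.m_data);
        return *this;
    }

    float x() const { return _mm_cvtss_f32(m_data); }
    float y() const
    {
        return _mm_cvtss_f32(_mm_shuffle_ps(m_data, m_data,
                                            _MM_SHUFFLE(1, 1, 1, 1)));
    }
    float z() const
    {
        return _mm_cvtss_f32(_mm_shuffle_ps(m_data, m_data,
                                            _MM_SHUFFLE(2, 2, 2, 2)));
    }

private:
    __m128 m_data; //< x, y, z, unused
};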
The trick is to notice that __m128 is 16-byte aligned. Use _aligned_malloc() (or the portable _mm_malloc()) to ensure that your float array is correctly aligned; then you can go ahead and cast your float array to an array of __m128. Make sure also that the number of floats you allocate is divisible by four.
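For example, a sketch using the _mm_malloc/_mm_free helpers that ship with the intrinsics headers:
#include <xmmintrin.h>
#include <cstddef>

void demo(std::size_t nfloats) // nfloats assumed divisible by four
{
    float *buf = static_cast<float *>(_mm_malloc(nfloats * sizeof(float), 16));
    __m128 *vecs = reinterpret_cast<__m128 *>(buf); // safe: 16-byte aligned
    for (std::size_t i = 0; i < nfloats / 4; ++i)
        vecs[i] = _mm_setzero_ps(); // ... fill / process ...
    _mm_free(buf);
}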

Boolean bit fields vs logical bit masking or bit shifting - C++

I have a series of classes that are going to require many boolean fields, somewhere between 4-10. I'd like to not have to use a byte for each boolean. I've been looking into bit field structs, something like:
struct BooleanBitFields
{
bool b1:1;
bool b2:1;
bool b3:1;
bool b4:1;
bool b5:1;
bool b6:1;
};
But after doing some research I see a lot of people saying that this can cause inefficient memory access and not be worth the memory savings. I'm wondering what the best method for this situation is. Should I use bit fields, or use a char with bit masking (ANDs and ORs) to store 8 bits? If the second, is it better to bit shift or use logic?
If anyone could comment as to what method they would use and why it would really help me decide which route I should go down.
Thanks in advance!
With the large address spaces on desktop boxes, an array of 32/64-bit booleans may seem wasteful, and indeed it is, but most developers don't care (me included). On RAM-restricted embedded controllers, or when accessing hardware in drivers, then sure, use bit fields; otherwise...
One other issue, apart from read/write ease and speed, is that a 32- or 64-bit boolean is more thread-safe than one bit in the middle of a word that has to be manipulated by multiple logical operations.
Bit fields are only a recommendation for the compiler. The compiler is free to implement them as it likes. On embedded systems there are compilers that guarantee 1 bit-to-bit mapping. Other compilers don't.
I would go with a regular struct, like yours but with no bit fields. Make them unsigned chars, the shortest data type. The struct will make the flags easier to access while editing, if your IDE supports auto-completion.
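A sketch of that suggestion (the names are illustrative):
// One unsigned char per flag: byte-addressable and cheap to read or write.
struct BooleanFlags
{
    unsigned char b1, b2, b3, b4, b5, b6;
}; // sizeof == 6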
Use an int bit array (leaves you lots of space to expand, and there is no advantage to a single char) and test with mask constants:
#define BOOL_A 1
#define BOOL_B (1 << 1)
#define BOOL_C (1 << 2)
#define BOOL_D (1 << 3)
/* Alternately: use const ints for encapsulation */
// declare and set
int bitray = 0 | BOOL_B | BOOL_D;
// test
if (bitray & BOOL_B) cout << "Set!\n";
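The remaining operations follow the same pattern, for example:
bitray |= BOOL_A;                    // set
bitray &= ~BOOL_D;                   // clear
bitray ^= BOOL_C;                    // toggle
bool c_set = (bitray & BOOL_C) != 0; // test into a bool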
I want to write an answer to formalize, once more, the thought "What does the transition from working with bytes to working with bits entail?", and also because the answer "I don't care" seems unreasonable to me.
Exploring char vs bitfield
Agreed, it's very tempting, especially when it's supposed to be used like this:
#define FLAG_1 1
#define FLAG_2 (1 << 1)
#define FLAG_3 (1 << 2)
#define FLAG_4 (1 << 3)
struct S1 {
char flag_1: 1;
char flag_2: 1;
char flag_3: 1;
char flag_4: 1;
}; //sizeof == 1
void MyFunction(struct S1 *obj, char flags) {
obj->flag_1 = (flags & FLAG_1) != 0;
obj->flag_2 = (flags & FLAG_2) != 0;
obj->flag_3 = (flags & FLAG_3) != 0;
obj->flag_4 = (flags & FLAG_4) != 0;
// we desire it to be as *obj = flags;
}
int main(int argc, char **argv)
{
struct S1 obj;
MyFunction(&obj, FLAG_1 | FLAG_2 | FLAG_3 | FLAG_4);
return 0;
}
But let's cover all aspects of such an optimization. Let's decompose the operations into simpler C statements, roughly corresponding to assembler instructions:
Initialization of all flags.
char flags = FLAG_1 | FLAG_3;
//obj->flag_1 = flags & FLAG_1;
//obj->flag_2 = flags & FLAG_2;
//obj->flag_3 = flags & FLAG_3;
//obj->flag_4 = flags & FLAG_4;
*obj = flags;
Writing one flag as a constant
//obj.flag_3 = 1;
char a = *obj;
a |= FLAG_3;
*obj = a;
Write a single flag using a variable
char b = 3;
//obj.flag_3 = b;
char a = *obj;
a &= ~FLAG_3;
char c = b;
c <<= 2; //Shift into the FLAG_3 (bit 2) position
c &= FLAG_3; //Fixing b > 1
a |= c;
*obj = a;
Reading one flag into variable
//char f = obj.flag_3;
char f = *obj;
f >>= 2;
f &= 0x01;
Write one flag to another
//obj.flag_2 = obj.flag_4;
char a = *obj;
char b = a;
b &= ~FLAG_2;
a &= FLAG_4;
a >>= 2; //Shift from the FLAG_4 position down to FLAG_2
b |= a;
*obj = b;
Summary

Command                        Cost, bitfield   Cost, variable
1. Init                        1                4 or less
2. obj.flag_3 = 1;             3                1
3. obj.flag_3 = b;             7                1 or 3 *
4. char f = obj.flag_3;        2                1
5. obj.flag_2 = obj.flag_4;    7                1

* if we guarantee the flag is no more than 1
All operations except initialization take many lines of code. It looks as if it would be better to leave bit fields alone after initialization. However, this is usually what happens to flags all the time: they change their state without warning, and randomly.
We are essentially trying to make the rare value-initialization operation cheaper by sacrificing the frequent value-change operations.
There are systems in which bitwise comparison operations, bit set and reset, bit copying, even bit swapping and bit branching, take one cycle. There are even systems in which mutex locking is implemented by a single assembler instruction; on such systems bit fields may not be located in ordinary memory at all (on PIC microcontrollers, for example, they live in a special register area).
Perhaps on such systems the bool type could map to a component of a bit field.
If your desire to save the insignificant bits of a byte has not yet disappeared, think about implementing addressability, atomicity of operations, and arithmetic with bytes, and about the resulting overhead in calls, data memory, code memory, and stack if the algorithms are placed in functions.
Reflections on the choice of bool or char
If your target platform represents the bool type as 2, 4, or more bytes, then bit operations on it are most likely not optimized; it is more likely a platform built for high-volume computing. That means bit operations are not much in demand on it, and neither, for that matter, are byte and word operations.
In the same way that operations on bits hurt performance, operations on a single byte can also greatly increase the number of cycles to access a variable.
No system can be equally optimal for everything at once. Instead of obsessing over memory savings on systems that are clearly built with a large memory surplus, pay attention to the strengths of those systems.
Conclusion
Use char or bool if:
You need to store the mutable state or behavior of the algorithm (and change and return flags individually).
Your flag does not accurately describe the system and could evolve into a number.
You need to be able to access the flag by address.
If your code claims to be platform independent and there is no guarantee that bit operations will be optimized on the target platform.
Use bitfields if:
You need to store a huge number of flags without having to constantly read and rewrite them.
You have unusually tight memory requirements, or memory is low.
In other deeply justified cases, with calculations and confirming experiments.
Perhaps a short rule might be:
Independent flags are stored in a bool.
P.S.: If you've read this far and still want to save 7 bits out of 8, consider why you feel no similar desire to use 7-bit bit fields for variables whose maximum value is 100.
References
Raymond Chen: The cost-benefit analysis of bitfields for a collection of booleans