How to effectively apply bitwise operation to (large) packed bit vectors? - c++

I want to implement
void bitwise_and(
char* __restrict__ result,
const char* __restrict__ lhs,
const char* __restrict__ rhs,
size_t length);
or maybe a bitwise_or(), bitwise_xor() or any other bitwise operation. Obviously it's not about the algorithm, just the implementation details - alignment, loading the largest possible element from memory, cache-awareness, using SIMD instructions etc.
I'm sure this has (more than one) fast existing implementations, but I would guess most library implementations would require some fancy container, e.g. std::bitset or boost::dynamic_bit_set - but I don't want to spend the time constructing one of those.
So do I... Copy-paste from an existing library? Find a library which can 'wrap' a raw packed bits array in memory with a nice object? Roll my own implementation anyway?
Notes:
I'm mostly interested in C++ code, but I certainly don't mind a plain C approach.
Obviously, making copies of the input arrays is out of the question - that would probably nearly-double the execution time.
I intentionally did not template the bitwise operator, in case there's some specific optimization for OR, or for AND etc.
Bonus points for discussing operations on multiple vectors at once, e.g. V_out = V_1 bitwise-and V_2 bitwise-and V_3 etc.
I noted this article comparing library implementations, but it's from 5 years ago. I can't ask which library to use since that would violate SO policy I guess...
If it helps you any, assume its uint64_ts rather than chars (that doesn't really matter - if the char array is unaligned we can just treated the heading and trailing chars separately).

This answer is going to assume you want the fastest possible way and are happy to use platform specific things. You optimising compiler may be able to produce similar code to the below from normal C but in my experiance across a few compilers something as specific as this is still best hand-written.
Obviously like all optimisation tasks, never assume anything is better/worse and measure, measure, measure.
If you could lock down you architecture to x86 with at least SSE3 you would do:
void bitwise_and(
char* result,
const char* lhs,
const char* rhs,
size_t length)
{
while(length >= 16)
{
// Load in 16byte registers
auto lhsReg = _mm_loadu_si128((__m128i*)lhs);
auto rhsReg = _mm_loadu_si128((__m128i*)rhs);
// do the op
auto res = _mm_and_si128(lhsReg, rhsReg);
// save off again
_mm_storeu_si128((__m128i*)result, res);
// book keeping
length -= 16;
result += 16;
lhs += 16;
rhs += 16;
}
// do the tail end. Assuming that the array is large the
// most that the following code can be run is 15 times so I'm not
// bothering to optimise. You could do it in 64 bit then 32 bit
// then 16 bit then char chunks if you wanted...
while (length)
{
*result = *lhs & *rhs;
length -= 1;
result += 1;
lhs += 1;
rhs += 1;
}
}
This compiles to ~10asm instructions per 16 bytes (+ change for the leftover and a little overhead).
The great thing about doing intrinsics like this (over hand rolled asm) is that the compiler is still free to do additional optimisations (such as loop unrolling) ontop of what you write. It also handles register allocation.
If you could guarantee aligned data you could save an asm instruction (use _mm_load_si128 instead and the compiler will be clever enough to avoid a second load and use it as an direct mem operand to the 'pand'.
If you could guarantee AVX2+ then you could use the 256 bit version and handle 10asm instructions per 32 bytes.
On arm theres similar NEON instructions.
If you wanted to do multiple ops just add the relevant intrinsic in the middle and it'll add 1 asm instruction per 16 bytes.
I'm pretty sure with a decent processor you dont need any additional cache control.

Don't do it this way. The individual operations will look great, sleek asm, nice performance .. but a composition of them will be terrible. You cannot make this abstraction, nice as it looks. The arithmetic intensity of those kernels is almost the worst possible (the only worse one is doing no arithmetic, such as a straight up copy), and composing them at a high level will retain that awful property. In a sequence of operations each using the result of the previous one, the results are written and read again a lot later (in the next kernel), even though the high level flow could be transposed so that the result the "next operation" needs is right there in a register. Also, if the same argument appears twice in an expression tree (and not both as operands to one operation), they will be streamed in twice, instead of reusing the data for two operations.
It doesn't have that nice warm fuzzy feeling of "look at all this lovely abstraction" about it, but what you should do is find out at a high level how you're combining your vectors, and then try to chop that in pieces that make sense from a performance perspective. In some cases that may mean making big ugly messy loops that will make people get an extra coffee before diving in, that's just too bad then. If you want performance, you often have to sacrifice something else. Usually it's not so bad, it probably just means you have a loop that has an expression consisting of intrinsics in it, instead of an expression of vector-operations that each individually have a loop.

Related

Simulating AVX-512 mask instructions

According to the documentation, from gcc 4.9 on the AVX-512 instruction set is supported, but I have gcc 4.8. I currently have code like this for summing up a block of memory (it's guaranteed to be less than 256 bytes, so no overflow worries):
__mm128i sum = _mm_add_epi16(sum, _mm_cvtepu8_epi16(*(__m128i *) &mem));
Now, looking through the documentation, if we have, say, four bytes left over, I could use:
__mm128i sum = _mm_add_epi16(sum,
_mm_mask_cvtepu8_epi16(_mm_set1_epi16(0),
(__mmask8)_mm_set_epi16(0,0,0,0,1,1,1,1),
*(__m128i *) &mem));
(Note, the type of __mmask8 doesn't seem to be documented anywhere I can find, so I am guessing...)
However, _mm_mask_cvtepu8_epi16 is an AVX-512 instruction, so is there a way to duplicate this? I tried:
mm_mullo_epi16(_mm_set_epi16(0,0,0,0,1,1,1,1),
_mm_cvtepu8_epi16(*(__m128i *) &mem));
However, there was a cache stall so just a direct for (int i = 0; i < remaining_bytes; i++) sum += mem[i]; gave better performance.
As I happened to stumble across this question, and it still hasn't gotten an answer, if this is still a problem...
For your example problem, you're on the right track.
Multiply is a relatively slow operation, so you should avoid the use of _mm_mullo_epi16. Use _mm_and_si128 instead as bitwise AND is a much faster operation, e.g. _mm_and_si128(_mm_cvtepu8_epi16(*(__m128i *) &mem), _mm_set_epi32(0, 0, -1, -1))
I'm not sure what you mean by a cache stall, but if memory access is a bottleneck, and the compiler won't put the constant for the above into a register, you could use something like _mm_srli_si128(vector, 8) which doesn't need any additional registers/memory loads. A shift may be slower than an AND.
If it's always 8 bytes, you can use _mm_move_epi64
None of this solves the case if the remaining number isn't a fixed number of elements (e.g. you have n%16 bytes for some arbitrary n). Note that AVX-512 doesn't really solve it either. If you need to deal with this case, you could have a table of masks and AND depending on what's remaining, e.g. _mm_and_si128(vector, masks[n & 0xf])
(_mm_mask_cvtepu8_epi16 only cares about the low half of the vector, so your example is somewhat confusing - that is, you don't need to mask anything because the later elements are completely ignored anway)
On a more generic level, mask operations are really just an embedded _mm_blend_epi16 (or equivalent). For zeroing idioms, they can easily be emulated with _mm_and_si128 / _mm_andnot_si128, as shown above.

fastest way to write a bitstream on modern x86 hardware

What is the fastest way to write a bitstream on x86/x86-64? (codeword <= 32bit)
by writing a bitstream I refer to the process of concatenating variable bit-length symbols into a contiguous memory buffer.
currently I've got a standard container with a 32bit intermediate buffer to write to
void write_bits(SomeContainer<unsigned int>& dst,unsigned int& buffer, unsigned int& bits_left_in_buffer,int codeword, short bits_to_write){
if(bits_to_write < bits_left_in_buffer){
buffer|= codeword << (32-bits_left_in_buffer);
bits_left_in_buffer -= bits_to_write;
}else{
unsigned int full_bits = bits_to_write - bits_left_in_buffer;
unsigned int towrite = buffer|(codeword<<(32-bits_left_in_buffer));
buffer= full_bits ? (codeword >> bits_left_in_buffer) : 0;
dst.push_back(towrite);
bits_left_in_buffer = 32-full_bits;
}
}
Does anyone know of any nice optimizations, fast instructions or other info that may be of use?
Cheers,
I wrote once a quite fast implementation, but it has several limitations: It works on 32 bit x86 when you write and read the bitstream. I don't check for buffer limits here, I was allocating larger buffer and checked it from time to time from the calling code.
unsigned char* membuff;
unsigned bit_pos; // current BIT position in the buffer, so it's max size is 512Mb
// input bit buffer: we'll decode the byte address so that it's even, and the DWORD from that address will surely have at least 17 free bits
inline unsigned int get_bits(unsigned int bit_cnt){ // bit_cnt MUST be in range 0..17
unsigned int byte_offset = bit_pos >> 3;
byte_offset &= ~1; // rounding down by 2.
unsigned int bits = *(unsigned int*)(membuff + byte_offset);
bits >>= bit_pos & 0xF;
bit_pos += bit_cnt;
return bits & BIT_MASKS[bit_cnt];
};
// output buffer, the whole destination should be memset'ed to 0
inline unsigned int put_bits(unsigned int val, unsigned int bit_cnt){
unsigned int byte_offset = bit_pos >> 3;
byte_offset &= ~1;
*(unsigned int*)(membuff + byte_offset) |= val << (bit_pos & 0xf);
bit_pos += bit_cnt;
};
It's hard to answer in general because it depends on many factors such as the distribution of bit-sizes you are reading, the call pattern in the client code and the hardware and compiler. In general, the two possible approaches for reading (writing) from a bitstream are:
Using a 32-bit or 64-bit buffer and conditionally reading (writing) from the underlying array it when you need more bits. That's the approach your write_bits method takes.
Unconditionally reading (writing) from the underlying array on every bitstream read (write) and then shifting and masking the resultant values.
The primary advantages of (1) include:
Only reads from the underlying buffer the minimally required number of times in an aligned fashion.
The fast path (no array read) is somewhat faster since it doesn't have to do the read and associated addressing math.
The method is likely to inline better since it doesn't have reads - if you have several consecutive read_bits calls, for example, the compiler can potentially combine a lot of the logic and produce some really fast code.
The primary advantage of (2) is that it is completely predictable - it contains no unpredictable branches.
Just because there is only one advantage for (2) doesn't mean it's worse: that advantage can easily overwhelm everything else.
In particular, you can analyze the likely branching behavior of your algorithm based on two factors:
How often will the bitsteam need to read from the underlying buffer?
How predictable is the number of calls before a read is needed?
For example if you are reading 1 bit 50% of the time and 2 bits 50% of time, you will do 64 / 1.5 = ~42 reads (if you can use a 64-bit buffer) before requiring an underlying read. This favors method (1) since reads of the underlying are infrequent, even if mis-predicted. On the other hand, if you are usually reading 20+ bits, you will read from the underlying every few calls. This is likely to favor approach (2), unless the pattern of underlying reads is very predictable. For example, if you always read between 22 and 30 bits, you'll perhaps always take exactly three calls to exhaust the buffer and read the underlying1 array. So the branch will be well-predicated and (1) will stay fast.
Similarly, it depends on how you call these methods, and how the compiler can inline and simplify the code. Especially if you ever call the methods repeatedly with a compile-time constant size, a lot of simplification is possible. Little to no simplification is available when the codeword is known at compile-time.
Finally, you may be able to get increased performance by offering a more complex API. This mostly applies to implementation option (1). For example, you can offer an ensure_available(unsigned size) call which ensures that at least size bits (usually limited the buffer size) are available to read. Then you can read up to that number of bits using unchecked calls that don't check the buffer size. This can help you reduce mis-predictions by forcing the buffer fills to a predictable schedule and lets you write simpler unchecked methods.
1 This depends on exactly how your "read from underlying" routine is written, as there are a few options here: Some always fill to 64-bits, some fill to between 57 and 64-bits (i.e., read an integral number of bytes), and some may fill between 32 or 33 and 64-bits (like your example which reads 32-bit chunks).
You'll probably have to wait until 2013 to get hold of real HW, but the "Haswell" new instructions will bring proper vectorised shifts (ie the ability to shift each vector element by different amounts specified in another vector) to x86/AVX. Not sure of details (plenty of time to figure them out), but that will surely enable a massive performance improvement in bitstream construction code.
I don't have the time to write it for you (not too sure your sample is actually complete enough to do so) but if you must, I can think of
using translation tables for the various input/output bit shift offsets; This optimization would make sense for fixed units of n bits (with n sufficiently large (8 bits?) to expect performance gains)
In essence, you'd be able to do
destloc &= (lookuptable[bits_left_in_buffer][input_offset][codeword]);
disclaimer: this is very sloppy pseudo code, I just hope it conveys my idea of a lookup table o prevent bitshift arithmetics
writing it in assembly (I know i386 has XLAT, but then again, a good compiler might already use something like that)
; Also, XLAT seems limited to 8 bits and the AL register, so it's not really versatile
Update
Warning: be sure to use a profiler and test your optimization for correctness and speed. Using a lookup table can result in poorer performance in the light of locality of reference. So, you might need to change the bit-streaming thread on a single core (set thread affinity) to get the benefits, and you might have to adapt the lookup table size to the processor's L2 cache.
Als, have a look at SIMD, SSE4 or GPU (CUDA) instruction sets if you know you'll have certain features at your disposal.

Implementing memcmp

The following is the Microsoft CRT implementation of memcmp:
int memcmp(const void* buf1,
const void* buf2,
size_t count)
{
if(!count)
return(0);
while(--count && *(char*)buf1 == *(char*)buf2 ) {
buf1 = (char*)buf1 + 1;
buf2 = (char*)buf2 + 1;
}
return(*((unsigned char*)buf1) - *((unsigned char*)buf2));
}
It basically performs a byte by byte comparision.
My question is in two parts:
Is there any reason to not alter this to an int by int comparison until count < sizeof(int), then do a byte by byte comparision for what remains?
If I were to do 1, are there any potential/obvious problems?
Notes: I'm not using the CRT at all, so I have to implement this function anyway. I'm just looking for advice on how to implement it correctly.
You could do it as an int-by-int comparison or an even wider data type if you wish.
The two things you have to watch out for (at a minimum) are an overhang at the start as well as the end, and whether the alignments are different between the two areas.
Some processors run slower if you access values without following their alignment rules (some even crash if you try it).
So your code could probably do char comparisons up to an int alignment area, then int comparisons, then char comparisons again but, again, the alignments of both areas will probably matter.
Whether that extra code complexity is worth whatever savings you will get depends on many factors outside your control. A possible method would be to detect the ideal case where both areas are aligned identically and do it a fast way, otherwise just do it character by character.
The optimization you propose is very common. The biggest concern would be if you try to run it on a processor that doesn't allow unaligned accesses for anything other than a single byte, or is slower in that mode; the x86 family doesn't have that problem.
It's also more complicated, and thus more likely to contain a bug.
Don't forget that when you find a mismatch within a larger chunk, you must then identify the first differing char within that chunk so that you can calculate the correct return value (memcmp() returns the difference of the first differing bytes, treated as unsigned char values).
If you compare as int, you will need to check alignment and check if count is divisible by sizeof(int) (to compare the last bytes as char).
Is that really their implementation? I have other issues besides not doing it int-wise:
castng away constness.
does that return statement work? unsigned char - unsigned char = signed int?
int at a time only works if the pointers are aligned, or if you can read a few bytes from the front of each and they are both still aligned, so if both are 1 before the alignment boundary you can read one char of each then go int-at-a-time, but if they are aligned differently eg one is aligned and one is not, there is no way to do this.
memcmp is at its most inefficient (i.e. it takes the longest) when they do actually compare (it has to go to the end) and the data is long.
I would not write my own but if you are going to be comparing large portions of data you could do things like ensure alignment and even pad the ends, then do word-at-a-time, if you want.
Another idea is to optimize for the processor cache and fetching. Processors like to fetch in large chunks rather than individual bytes at random times. Although the internal workings may already account for this, it would be a good exercise anyway. Always profile to determine the most efficient solution.
Psuedo code:
while bytes remaining > (cache size) / 2 do // Half the cache for source, other for dest.
fetch source bytes
fetch destination bytes
perform comparison using fetched bytes
end-while
perform byte by byte comparison for remainder.
For more information, search the web for "Data Driven Design" and "data oriented programming".
Some processors, such as the ARM family, allow for conditional execution of instructions (in 32-bit, non-thumb) mode. The processor fetches the instructions but will only execute them if the conditions are satisfied. In this case, try rephrasing the comparison in terms of boolean assignments. This may also reduce the number of branches taken, which improves performance.
See also loop unrolling.
See also assembly language.
You can gain a lot of performance by tailoring the algorithm to a specific processor, but loose in the portability area.
The code you found is just a debug implementation of memcmp, it's optimized for simplicity and readability, not for performance.
The intrinsic compiler implementation is platform specific and smart enough to generate processor instructions that compare dwords or qwords (depending on the target architecture) at once whenever possible.
Also, an intrinsic implementation may return immediately if both buffers have the same address (buf1 == buf2). This check is also missing in the debug implementation.
Finally, even when you know exactly on which platform you'll be running, the perfect implementation is still the less generic one as it depends on a bunch of different factors that are specific to the rest of your program:
What is the minumum guaranteed buffer alignment?
Can you read any padding bytes past the end of a buffer without triggering an access violation?
May the buffer parameters be identical?
May the buffer size be 0?
Do you only need to compare buffer contents for equality? Or do you also need to know which one is larger (return value < 0 or > 0)?
...
If performace is a concern, I suggest writing the comparison routine in assembly. Most compilers give you an option to see the assembly lising that they generate for a source. You could take that code and adapt it to your needs.
Many processors implement this as a single instruction. If you can guarantee the processor you're running on it can be implemented with a single line of inline assembler.

C++, using one byte to store two variables

I am working on representation of the chess board, and I am planning to store it in 32 bytes array, where each byte will be used to store two pieces. (That way only 4 bits are needed per piece)
Doing it in that way, results in a overhead for accessing particular index of the board.
Do you think that, this code can be optimised or completely different method of accessing indexes can be used?
c++
char getPosition(unsigned char* c, int index){
//moving pointer
c+=(index>>1);
//odd number
if (index & 1){
//taking right part
return *c & 0xF;
}else
{
//taking left part
return *c>>4;
}
}
void setValue(unsigned char* board, char value, int index){
//moving pointer
board+=(index>>1);
//odd number
if (index & 1){
//replace right part
//save left value only 4 bits
*board = (*board & 0xF0) + value;
}else
{
//replacing left part
*board = (*board & 0xF) + (value<<4);
}
}
int main() {
char* c = (char*)malloc(32);
for (int i = 0; i < 64 ; i++){
setValue((unsigned char*)c, i % 8,i);
}
for (int i = 0; i < 64 ; i++){
cout<<(int)getPosition((unsigned char*)c, i)<<" ";
if (((i+1) % 8 == 0) && (i > 0)){
cout<<endl;
}
}
return 0;
}
I am equally interested in your opinions regarding chess representations, and optimisation of the method above, as a stand alone problem.
Thanks a lot
EDIT
Thanks for your replies. A while ago I created checkers game, where I was using 64 bytes board representation. This time I am trying some different methods, just to see what I like. Memory is not such a big problem. Bit-boards is definitely on my list to try. Thanks
That's the problem with premature optimization. Where your chess board would have taken 64 bytes to store, now it takes 32. What has this really boughten you? Did you actually analyze the situation to see if you needed to save that memory?
Assuming that you used one of the least optimal search method, straight AB search to depth D with no heuristics, and you generate all possible moves in a position before searching, then absolute maximum memory required for your board is going to be sizeof(board) * W * D. If we assume a rather large W = 100 and large D = 30 then you're going to have 3000 boards in memory at depth D. 64k vs 32k...is it really worth it?
On the other hand, you've increased the amount of operations necessary to access board[location] and this will be called many millions of times per search.
When building chess AI's the main thing you'll end up looking for is cpu cycles, not memory. This may vary a little bit if you're targeting a cell phone or something, but even at that you're going to worry more about speed before you'll ever reach enough depth to cause any memory issues.
As to which representation I prefer...I like bitboards. Haven't done a lot of serious measurements but I did compare two engines I made, one bitboard and one array, and the bitboard one was faster and could reach much greater depths than the other.
Let me be the first to point out a potential bug (depending on compilers and compiler settings). And bugs being why premature optimization is evil:
//taking left part
return *c>>4;
if *c is negative, then >> may repeat the negative high bit. ie in binary:
0b10100000 >> 4 == 0b11111010
for some compilers (ie the C++ standard leaves it to the compiler to decide - both whether to carry the high bit, and whether a char is signed or unsigned).
If you do want to go forward with your packed bits (and let me say that you probably shouldn't bother, but it is up to you), I would suggest wrapping the packed bits into a class, and overriding [] such that
board[x][y]
gives you the unpacked bits. Then you can turn the packing on and off easily, and having the same syntax in either case. If you inline the operator overloads, it should be as efficient as the code you have now.
Well, 64 bytes is a very small amount of RAM. You're better off just using a char[8][8]. That is, unless you plan on storing a ton of chess boards. Doing char[8][8] makes it easier (and faster) to access the board and do more complex operations on it.
If you're still interested in storing the board in packed representation (either for practice or to store a lot of boards), I say you're "doing it right" regarding the bit operations. You may want to consider inlining your accessors if you're going for speed using the inline keyword.
Is space enough of a consideration where you can't just use a full byte to represent a square? That would make accesses easier to follow on the program and additionally most likely faster as the bit manipulations are not required.
Otherwise to make sure everything goes smoothly I would make sure all your types are unsigned: getPosition return unsigned char, and qualify all your numeric literals with "U" (0xF0U for example) to make sure they're always interpreted as unsigned. Most likely you won't have any problems with signed-ness but why take chances on some architecture that behaves unexpectedly?
Nice code, but if you are really that deep into performance optimization, you should probably learn more about your particular CPU architecture.
AFAIK, you may found that storing a chess piece in as much 8 bytes will be more efficient. Even if you recurse 15 moves deep, L2 cache size would hardly be a constraint, but RAM misalignment may be. I would guess that proper handling of a chess board would include Expand() and Reduce() functions to translate between board representations during different parts of the algorithm: some may be faster on compact representation, and some vice versa. For example, caching, and algorithms involving hashing by composition of two adjacent cells might be good for the compact structure, all else no.
I would also consider developing some helper hardware, like some FPGA board, or some GPU code, if performance is so important..
As a chess player, I can tell you: There's more to a position than the mere placement of each piece. You have to take in to consideration some other things:
Which side has to move next?
Can white and/or black castle king and/or queenside?
Can a pawn be taken en passant?
How many moves have passed since the last pawn move and/or capturing move?
If the data structure you use to represent a position doesn't reflect this information, then you're in big trouble.

Using SSE instructions

I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I heard that if I use SSE instructions for these operations it will run much faster compared to a normal loop written using bitwise AND , and if-else conditions. My question is should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work or these instructions are processor specific?
SSE instructions are processor specific. You can look up which processor supports which SSE version on wikipedia.
If SSE code will be faster or not depends on many factors: The first is of course whether the problem is memory-bound or CPU-bound. If the memory bus is the bottleneck SSE will not help much. Try simplifying your integer calculations, if that makes the code faster, it's probably CPU-bound, and you have a good chance of speeding it up.
Be aware that writing SIMD-code is a lot harder than writing C++-code, and that the resulting code is much harder to change. Always keep the C++ code up to date, you'll want it as a comment and to check the correctness of your assembler code.
Think about using a library like the IPP, that implements common low-level SIMD operations optimized for various processors.
SIMD, of which SSE is an example, allows you to do the same operation on multiple chunks of data. So, you won't get any advantage to using SSE as a straight replacement for the integer operations, you will only get advantages if you can do the operations on multiple data items at once. This involves loading some data values that are contiguous in memory, doing the required processing and then stepping to the next set of values in the array.
Problems:
1 If the code path is dependant on the data being processed, SIMD becomes much harder to implement. For example:
a = array [index];
a &= mask;
a >>= shift;
if (a < somevalue)
{
a += 2;
array [index] = a;
}
++index;
is not easy to do as SIMD:
a1 = array [index] a2 = array [index+1] a3 = array [index+2] a4 = array [index+3]
a1 &= mask a2 &= mask a3 &= mask a4 &= mask
a1 >>= shift a2 >>= shift a3 >>= shift a4 >>= shift
if (a1<somevalue) if (a2<somevalue) if (a3<somevalue) if (a4<somevalue)
// help! can't conditionally perform this on each column, all columns must do the same thing
index += 4
2 If the data is not contigous then loading the data into the SIMD instructions is cumbersome
3 The code is processor specific. SSE is only on IA32 (Intel/AMD) and not all IA32 cpus support SSE.
You need to analyse the algorithm and the data to see if it can be SSE'd and that requires knowing how SSE works. There's plenty of documentation on Intel's website.
This kind of problem is a perfect example of where a good low level profiler is essential. (Something like VTune) It can give you a much more informed idea of where your hotspots lie.
My guess, from what you describe is that your hotspot will probably be branch prediction failures resulting from min/max calculations using if/else. Therefore, using SIMD intrinsics should allow you to use the min/max instructions, however, it might be worth just trying to use a branchless min/max caluculation instead. This might achieve most of the gains with less pain.
Something like this:
inline int
minimum(int a, int b)
{
int mask = (a - b) >> 31;
return ((a & mask) | (b & ~mask));
}
If you use SSE instructions, you're obviously limited to processors that support these.
That means x86, dating back to the Pentium 2 or so (can't remember exactly when they were introduced, but it's a long time ago)
SSE2, which, as far as I can recall, is the one that offers integer operations, is somewhat more recent (Pentium 3? Although the first AMD Athlon processors didn't support them)
In any case, you have two options for using these instructions. Either write the entire block of code in assembly (probably a bad idea. That makes it virtually impossible for the compiler to optimize your code, and it's very hard for a human to write efficient assembler).
Alternatively, use the intrinsics available with your compiler (if memory serves, they're usually defined in xmmintrin.h)
But again, the performance may not improve. SSE code poses additional requirements of the data it processes. Mainly, the one to keep in mind is that data must be aligned on 128-bit boundaries. There should also be few or no dependencies between the values loaded into the same register (a 128 bit SSE register can hold 4 ints. Adding the first and the second one together is not optimal. But adding all four ints to the corresponding 4 ints in another register will be fast)
It may be tempting to use a library that wraps all the low-level SSE fiddling, but that might also ruin any potential performance benefit.
I don't know how good SSE's integer operation support is, so that may also be a factor that can limit performance. SSE is mainly targeted at speeding up floating point operations.
If you intend to use Microsoft Visual C++, you should read this:
http://www.codeproject.com/KB/recipes/sseintro.aspx
We have implemented some image processing code, similar to what you describe but on a byte array, In SSE. The speedup compared to C code is considerable, depending on the exact algorithm more than a factor of 4, even in respect to the Intel compiler. However, as you already mentioned you have the following drawbacks:
Portability. The code will run on every Intel-like CPU, so also AMD, but not on other CPUs. That is not a problem for us because we control the target hardware. Switching compilers and even to a 64 bit OS can also be a problem.
You have a steep learning curve, but I found that after you grasp the principles writing new algorithms is not that hard.
Maintainability. Most C or C++ programmers have no knowledge of assembly/SSE.
My advice to you will be to go for it only if you really need the performance improvement, and you can't find a function for your problem in a library like the intel IPP, and if you can live with the portability issues.
I can tell from my experince that SSE brings a huge (4x and up) speedup over a plain c version of the code (no inline asm, no intrinsics used) but hand-optimized assembler can beat Compiler-generated assembly if the compiler can't figure out what the programmer intended (belive me, compilers don't cover all possible code combinations and they never will).
Oh and, the compiler can't everytime layout the data that it runs at the fastest-possible speed.
But you need much experince for a speedup over an Intel-compiler (if possible).
SSE instructions were originally just on Intel chips, but recently (since Athlon?) AMD supports them as well, so if you do code against the SSE instruction set, you should be portable to most x86 procs.
That being said, it may not be worth your time to learn SSE coding unless you're already familiar with assembler on x86's - an easier option might be to check your compiler docs and see if there are options to allow the compiler to autogenerate SSE code for you. Some compilers do very well vectorizing loops in this way. (You're probably not surprised to hear that the Intel compilers do a good job of this :)
Write code that helps the compiler understand what you are doing. GCC will understand and optimize SSE code such as this:
typedef union Vector4f
{
// Easy constructor, defaulted to black/0 vector
Vector4f(float a = 0, float b = 0, float c = 0, float d = 1.0f):
X(a), Y(b), Z(c), W(d) { }
// Cast operator, for []
inline operator float* ()
{
return (float*)this;
}
// Const ast operator, for const []
inline operator const float* () const
{
return (const float*)this;
}
// ---------------------------------------- //
inline Vector4f operator += (const Vector4f &v)
{
for(int i=0; i<4; ++i)
(*this)[i] += v[i];
return *this;
}
inline Vector4f operator += (float t)
{
for(int i=0; i<4; ++i)
(*this)[i] += t;
return *this;
}
// Vertex / Vector
// Lower case xyzw components
struct {
float x, y, z;
float w;
};
// Upper case XYZW components
struct {
float X, Y, Z;
float W;
};
};
Just don't forget to have -msse -msse2 on your build parameters!
Although it is true that SSE is specific to some processors (SSE may be relatively safe, SSE2 much less in my experience), you can detect the CPU at runtime, and load the code dynamically depending on the target CPU.
SIMD intrinsics (such as SSE2) can speed this sort of thing up but take expertise to use correctly. They are very sensitive to alignment and pipeline latency; careless use can make performance even worse than it would have been without them. You'll get a much easier and more immediate speedup from simply using cache prefetching to make sure all your ints are in L1 in time for you to operate on them.
Unless your function needs a throughput of better than 100,000,000 integers per second, SIMD probably isn't worth the trouble for you.
Just to add briefly to what has been said before about different SSE versions being available on different CPUs: This can be checked by looking at the respective feature flags returned by the CPUID instruction (see e.g. Intel's documentation for details).
Have a look at inline assembler for C/C++, here is a DDJ article. Unless you are 100% certain your program will run on a compatible platform you should follow the recommendations many have given here.
I agree with the previous posters. Benefits can be quite large but to get it may require a lot of work. Intel documentation on these instructions is over 4K pages. You may want to check out EasySSE (c++ wrappers library over intrinsics + examples) free from Ocali Inc.
I assume my affiliation with this EasySSE is clear.
I don't recommend doing this yourself unless you're fairly proficient with assembly. Using SSE will, more than likely, require careful reorganization of your data, as Skizz points out, and the benefit is often questionable at best.
It would probably be much better for you to write very small loops and keep your data very tightly organized and just rely on the compiler doing this for you. Both the Intel C Compiler and GCC (since 4.1) can auto-vectorize your code, and will probably do a better job than you. (Just add -ftree-vectorize to your CXXFLAGS.)
Edit: Another thing I should mention is that several compilers support assembly intrinsics, which would probably, IMO, be easier to use than the asm() or __asm{} syntax.