fastest way to write a bitstream on modern x86 hardware

fastest way to write a bitstream on modern x86 hardware - c++

What is the fastest way to write a bitstream on x86/x86-64? (codeword <= 32bit)
by writing a bitstream I refer to the process of concatenating variable bit-length symbols into a contiguous memory buffer.
currently I've got a standard container with a 32bit intermediate buffer to write to
void write_bits(SomeContainer<unsigned int>& dst,unsigned int& buffer, unsigned int& bits_left_in_buffer,int codeword, short bits_to_write){
if(bits_to_write < bits_left_in_buffer){
buffer|= codeword << (32-bits_left_in_buffer);
bits_left_in_buffer -= bits_to_write;
}else{
unsigned int full_bits = bits_to_write - bits_left_in_buffer;
unsigned int towrite = buffer|(codeword<<(32-bits_left_in_buffer));
buffer= full_bits ? (codeword >> bits_left_in_buffer) : 0;
dst.push_back(towrite);
bits_left_in_buffer = 32-full_bits;
}
}
Does anyone know of any nice optimizations, fast instructions or other info that may be of use?
Cheers,

I wrote once a quite fast implementation, but it has several limitations: It works on 32 bit x86 when you write and read the bitstream. I don't check for buffer limits here, I was allocating larger buffer and checked it from time to time from the calling code.
unsigned char* membuff;
unsigned bit_pos; // current BIT position in the buffer, so it's max size is 512Mb
// input bit buffer: we'll decode the byte address so that it's even, and the DWORD from that address will surely have at least 17 free bits
inline unsigned int get_bits(unsigned int bit_cnt){ // bit_cnt MUST be in range 0..17
unsigned int byte_offset = bit_pos >> 3;
byte_offset &= ~1; // rounding down by 2.
unsigned int bits = *(unsigned int*)(membuff + byte_offset);
bits >>= bit_pos & 0xF;
bit_pos += bit_cnt;
return bits & BIT_MASKS[bit_cnt];
};
// output buffer, the whole destination should be memset'ed to 0
inline unsigned int put_bits(unsigned int val, unsigned int bit_cnt){
unsigned int byte_offset = bit_pos >> 3;
byte_offset &= ~1;
*(unsigned int*)(membuff + byte_offset) |= val << (bit_pos & 0xf);
bit_pos += bit_cnt;
};

It's hard to answer in general because it depends on many factors such as the distribution of bit-sizes you are reading, the call pattern in the client code and the hardware and compiler. In general, the two possible approaches for reading (writing) from a bitstream are:
Using a 32-bit or 64-bit buffer and conditionally reading (writing) from the underlying array it when you need more bits. That's the approach your write_bits method takes.
Unconditionally reading (writing) from the underlying array on every bitstream read (write) and then shifting and masking the resultant values.
The primary advantages of (1) include:
Only reads from the underlying buffer the minimally required number of times in an aligned fashion.
The fast path (no array read) is somewhat faster since it doesn't have to do the read and associated addressing math.
The method is likely to inline better since it doesn't have reads - if you have several consecutive read_bits calls, for example, the compiler can potentially combine a lot of the logic and produce some really fast code.
The primary advantage of (2) is that it is completely predictable - it contains no unpredictable branches.
Just because there is only one advantage for (2) doesn't mean it's worse: that advantage can easily overwhelm everything else.
In particular, you can analyze the likely branching behavior of your algorithm based on two factors:
How often will the bitsteam need to read from the underlying buffer?
How predictable is the number of calls before a read is needed?
For example if you are reading 1 bit 50% of the time and 2 bits 50% of time, you will do 64 / 1.5 = ~42 reads (if you can use a 64-bit buffer) before requiring an underlying read. This favors method (1) since reads of the underlying are infrequent, even if mis-predicted. On the other hand, if you are usually reading 20+ bits, you will read from the underlying every few calls. This is likely to favor approach (2), unless the pattern of underlying reads is very predictable. For example, if you always read between 22 and 30 bits, you'll perhaps always take exactly three calls to exhaust the buffer and read the underlying1 array. So the branch will be well-predicated and (1) will stay fast.
Similarly, it depends on how you call these methods, and how the compiler can inline and simplify the code. Especially if you ever call the methods repeatedly with a compile-time constant size, a lot of simplification is possible. Little to no simplification is available when the codeword is known at compile-time.
Finally, you may be able to get increased performance by offering a more complex API. This mostly applies to implementation option (1). For example, you can offer an ensure_available(unsigned size) call which ensures that at least size bits (usually limited the buffer size) are available to read. Then you can read up to that number of bits using unchecked calls that don't check the buffer size. This can help you reduce mis-predictions by forcing the buffer fills to a predictable schedule and lets you write simpler unchecked methods.
1 This depends on exactly how your "read from underlying" routine is written, as there are a few options here: Some always fill to 64-bits, some fill to between 57 and 64-bits (i.e., read an integral number of bytes), and some may fill between 32 or 33 and 64-bits (like your example which reads 32-bit chunks).

You'll probably have to wait until 2013 to get hold of real HW, but the "Haswell" new instructions will bring proper vectorised shifts (ie the ability to shift each vector element by different amounts specified in another vector) to x86/AVX. Not sure of details (plenty of time to figure them out), but that will surely enable a massive performance improvement in bitstream construction code.

I don't have the time to write it for you (not too sure your sample is actually complete enough to do so) but if you must, I can think of
using translation tables for the various input/output bit shift offsets; This optimization would make sense for fixed units of n bits (with n sufficiently large (8 bits?) to expect performance gains)
In essence, you'd be able to do
destloc &= (lookuptable[bits_left_in_buffer][input_offset][codeword]);
disclaimer: this is very sloppy pseudo code, I just hope it conveys my idea of a lookup table o prevent bitshift arithmetics
writing it in assembly (I know i386 has XLAT, but then again, a good compiler might already use something like that)
; Also, XLAT seems limited to 8 bits and the AL register, so it's not really versatile
Update
Warning: be sure to use a profiler and test your optimization for correctness and speed. Using a lookup table can result in poorer performance in the light of locality of reference. So, you might need to change the bit-streaming thread on a single core (set thread affinity) to get the benefits, and you might have to adapt the lookup table size to the processor's L2 cache.
Als, have a look at SIMD, SSE4 or GPU (CUDA) instruction sets if you know you'll have certain features at your disposal.

Related

How the size of a struct containing bitset fields is calculated [duplicate]

It seems for std::bitset<1 to 32>, the size is set to 4 bytes. For sizes 33 to 64, it jumps straight up to 8 bytes. There can't be any overhead because std::bitset<32> is an even 4 bytes.
I can see aligning to byte length when dealing with bits, but why would a bitset need to align to word length, especially for a container most likely to be used in situations with a tight memory budget?
This is under VS2010.

The most likely explanation is that bitset is using a whole number of machine words to store the array.
This is probably done for memory bandwidth reasons: it is typically relatively cheap to read/write a word that's aligned at a word boundary. On the other hand, reading (and especially writing!) an arbitrarily-aligned byte can be expensive on some architectures.
Since we're talking about a fixed-sized penalty of a few bytes per bitset, this sounds like a reasonable tradeoff for a general-purpose library.

I assume that indexing into the bitset is done by grabbing a 32-bit value and then isolating the relevant bit because this is fastest in terms of processor instructions (working with smaller-sized values is slower on x86). The two indexes needed for this can also be calculated very quickly:
int wordIndex = (index & 0xfffffff8) >> 3;
int bitIndex = index & 0x7;
And then you can do this, which is also very fast:
int word = m_pStorage[wordIndex];
bool bit = ((word & (1 << bitIndex)) >> bitIndex) == 1;
Also, a maximum waste of 3 bytes per bitset is not exactly a memory concern IMHO. Consider that a bitset is already the most efficient data structure to store this type of information, so you would have to evaluate the waste as a percentage of the total structure size.
For 1025 bits this approach uses up 132 bytes instead of 129, for 2.3% overhead (and this goes down as the bitset site goes up). Sounds reasonable considering the likely performance benefits.

The memory system on modern machines cannot fetch anything else but words from memory, apart from some legacy functions that extract the desired bits. Hence, having the bitsets aligned to words makes them a lot faster to handle, because you do not need to mask out the bits you don't need when accessing it. If you do not mask, doing something like
bitset<4> foo = 0;
if (foo) {
// ...
}
will most likely fail. Apart from that, I remember reading some time ago that there was a way to cramp several bitsets together, but I don't remember exactly. I think it was when you have several bitsets together in a structure that they can take up "shared" memory, which is not applicable to most use cases of bitfields.

I had the same feature in Aix and Linux implementations. In Aix, internal bitset storage is char based:
typedef unsigned char _Ty;
....
_Ty _A[_Nw + 1];
In Linux, internal storage is long based:
typedef unsigned long _WordT;
....
_WordT _M_w[_Nw];
For compatibility reasons, we modified Linux version with char based storage
Check which implementation are you using inside bitset.h

Because a 32 bit Intel-compatible processor cannot access bytes individually (or better, it can by applying implicitly some bit mask and shifts) but only 32bit words at time.
if you declare
bitset<4> a,b,c;
even if the library implements it as char, a,b and c will be 32 bit aligned, so the same wasted space exist. But the processor will be forced to premask the bytes before letting bitset code to do its own mask.
For this reason MS used a int[1+(N-1)/32] as a container for the bits.

Maybe because it's using int by default, and switches to long long if it overflows? (Just a guess...)

If your std::bitset< 8 > was a member of a structure, you might have this:
struct A
{
std::bitset< 8 > mask;
void * pointerToSomething;
}
If bitset<8> was stored in one byte (and the structure packed on 1-byte boundaries) then the pointer following it in the structure would be unaligned, which would be A Bad Thing. The only time when it would be safe and useful to have a bitset<8> stored in one byte would be if it was in a packed structure and followed by some other one-byte fields with which it could be packed together. I guess this is too narrow a use case for it to be worthwhile providing a library implementation.
Basically, in your octree, a single byte bitset would only be useful if it was followed in a packed structure by another one to three single-byte members. Otherwise, it would have to be padded to four bytes anyway (on a 32-bit machine) to ensure that the following variable was word-aligned.

How to effectively apply bitwise operation to (large) packed bit vectors?

I want to implement
void bitwise_and(
char* __restrict__ result,
const char* __restrict__ lhs,
const char* __restrict__ rhs,
size_t length);
or maybe a bitwise_or(), bitwise_xor() or any other bitwise operation. Obviously it's not about the algorithm, just the implementation details - alignment, loading the largest possible element from memory, cache-awareness, using SIMD instructions etc.
I'm sure this has (more than one) fast existing implementations, but I would guess most library implementations would require some fancy container, e.g. std::bitset or boost::dynamic_bit_set - but I don't want to spend the time constructing one of those.
So do I... Copy-paste from an existing library? Find a library which can 'wrap' a raw packed bits array in memory with a nice object? Roll my own implementation anyway?
Notes:
I'm mostly interested in C++ code, but I certainly don't mind a plain C approach.
Obviously, making copies of the input arrays is out of the question - that would probably nearly-double the execution time.
I intentionally did not template the bitwise operator, in case there's some specific optimization for OR, or for AND etc.
Bonus points for discussing operations on multiple vectors at once, e.g. V_out = V_1 bitwise-and V_2 bitwise-and V_3 etc.
I noted this article comparing library implementations, but it's from 5 years ago. I can't ask which library to use since that would violate SO policy I guess...
If it helps you any, assume its uint64_ts rather than chars (that doesn't really matter - if the char array is unaligned we can just treated the heading and trailing chars separately).

This answer is going to assume you want the fastest possible way and are happy to use platform specific things. You optimising compiler may be able to produce similar code to the below from normal C but in my experiance across a few compilers something as specific as this is still best hand-written.
Obviously like all optimisation tasks, never assume anything is better/worse and measure, measure, measure.
If you could lock down you architecture to x86 with at least SSE3 you would do:
void bitwise_and(
char* result,
const char* lhs,
const char* rhs,
size_t length)
{
while(length >= 16)
{
// Load in 16byte registers
auto lhsReg = _mm_loadu_si128((__m128i*)lhs);
auto rhsReg = _mm_loadu_si128((__m128i*)rhs);
// do the op
auto res = _mm_and_si128(lhsReg, rhsReg);
// save off again
_mm_storeu_si128((__m128i*)result, res);
// book keeping
length -= 16;
result += 16;
lhs += 16;
rhs += 16;
}
// do the tail end. Assuming that the array is large the
// most that the following code can be run is 15 times so I'm not
// bothering to optimise. You could do it in 64 bit then 32 bit
// then 16 bit then char chunks if you wanted...
while (length)
{
*result = *lhs & *rhs;
length -= 1;
result += 1;
lhs += 1;
rhs += 1;
}
}
This compiles to ~10asm instructions per 16 bytes (+ change for the leftover and a little overhead).
The great thing about doing intrinsics like this (over hand rolled asm) is that the compiler is still free to do additional optimisations (such as loop unrolling) ontop of what you write. It also handles register allocation.
If you could guarantee aligned data you could save an asm instruction (use _mm_load_si128 instead and the compiler will be clever enough to avoid a second load and use it as an direct mem operand to the 'pand'.
If you could guarantee AVX2+ then you could use the 256 bit version and handle 10asm instructions per 32 bytes.
On arm theres similar NEON instructions.
If you wanted to do multiple ops just add the relevant intrinsic in the middle and it'll add 1 asm instruction per 16 bytes.
I'm pretty sure with a decent processor you dont need any additional cache control.

Don't do it this way. The individual operations will look great, sleek asm, nice performance .. but a composition of them will be terrible. You cannot make this abstraction, nice as it looks. The arithmetic intensity of those kernels is almost the worst possible (the only worse one is doing no arithmetic, such as a straight up copy), and composing them at a high level will retain that awful property. In a sequence of operations each using the result of the previous one, the results are written and read again a lot later (in the next kernel), even though the high level flow could be transposed so that the result the "next operation" needs is right there in a register. Also, if the same argument appears twice in an expression tree (and not both as operands to one operation), they will be streamed in twice, instead of reusing the data for two operations.
It doesn't have that nice warm fuzzy feeling of "look at all this lovely abstraction" about it, but what you should do is find out at a high level how you're combining your vectors, and then try to chop that in pieces that make sense from a performance perspective. In some cases that may mean making big ugly messy loops that will make people get an extra coffee before diving in, that's just too bad then. If you want performance, you often have to sacrifice something else. Usually it's not so bad, it probably just means you have a loop that has an expression consisting of intrinsics in it, instead of an expression of vector-operations that each individually have a loop.

Byte swap during copy

I need to efficiently swap the byte order of an array during copying into another array.
The source array is of a certain type; char, short or int so the byte swapping required is unambiguous and will be according to that type.
My plan is to do this very simply with a multi-pass byte-wise copy (2 for short, 4 for int, ...). However are there any pre-existing "memcpy_swap_16/32/64" functions or libraries? Perhaps in image processing for BGR/RGB image processing.
EDIT
I know how to swap the bytes of individual values, that is not the problem. I want to do this process during a copy that I am going to perform anyway.
For example, if I have an array or little endian 4-byte integers I can do they swap by performing 4 bytewise copies with initial offsets of 0, 1, 2 and 3 with a stride of 4. But there may be a better way, perhaps even reading each 4-byte integer individually and using the byte-swap intrinsics _byteswap_ushort, _byteswap_ulong and _byteswap_uint64 would be faster. But I suspect there must be existing functions that do this type of processing.
EDIT 2
Just found this, which may be a useful basis for SSE, though its true that memory bandwidth probably makes it a waste of time.
Fast vectorized conversion from RGB to BGRA

Unix systems have a swab function that does what you want for 16-bit arrays. It's probably optimized, but I'm not sure. Note that modern gcc will generate extremely efficient code if you just write the naive byte swap code:
uint32_t x, y;
y = (x<<24) | (x<<8 & 0xff0000) | (x>>8 & 0xff00) | (x>>24);
i.e. it will use the bswap instruction on i486+. Presumably putting this in a loop will give an efficient loop too...
Edit: For your copying task, I would do the following in your loop:
Read a 32-bit value from const uint32_t *src.
Use the above code to swap it.
Write a 32-bit value to uint32_t *dest.
Strictly speaking this may not be portable (aliasing violations) but as long as the copy function is in its own translation unit and not getting inlined, there's very little to worry about. Forget what I wrote about aliasing; if you're swapping the data as 32-bit values, it almost surely was actually 32-bit values to begin with, not some other type of pointer that was cast, so there's no issue.

In linux, you should check the header bits/byteswap.h. there's a family of macros of the form bswap_##, and some of them use assembly instructions where appropriate.

Yes there are existing functions like the one linked in the question but its not worth the effort because the size of the data (in this case) means the set up overhead is too high. So instead, it's better to just read out 2, 4, and 8 bytes at a time and do the swap using intrinsics and write back.

Implementing memcmp

The following is the Microsoft CRT implementation of memcmp:
int memcmp(const void* buf1,
const void* buf2,
size_t count)
{
if(!count)
return(0);
while(--count && *(char*)buf1 == *(char*)buf2 ) {
buf1 = (char*)buf1 + 1;
buf2 = (char*)buf2 + 1;
}
return(*((unsigned char*)buf1) - *((unsigned char*)buf2));
}
It basically performs a byte by byte comparision.
My question is in two parts:
Is there any reason to not alter this to an int by int comparison until count < sizeof(int), then do a byte by byte comparision for what remains?
If I were to do 1, are there any potential/obvious problems?
Notes: I'm not using the CRT at all, so I have to implement this function anyway. I'm just looking for advice on how to implement it correctly.

You could do it as an int-by-int comparison or an even wider data type if you wish.
The two things you have to watch out for (at a minimum) are an overhang at the start as well as the end, and whether the alignments are different between the two areas.
Some processors run slower if you access values without following their alignment rules (some even crash if you try it).
So your code could probably do char comparisons up to an int alignment area, then int comparisons, then char comparisons again but, again, the alignments of both areas will probably matter.
Whether that extra code complexity is worth whatever savings you will get depends on many factors outside your control. A possible method would be to detect the ideal case where both areas are aligned identically and do it a fast way, otherwise just do it character by character.

The optimization you propose is very common. The biggest concern would be if you try to run it on a processor that doesn't allow unaligned accesses for anything other than a single byte, or is slower in that mode; the x86 family doesn't have that problem.
It's also more complicated, and thus more likely to contain a bug.

Don't forget that when you find a mismatch within a larger chunk, you must then identify the first differing char within that chunk so that you can calculate the correct return value (memcmp() returns the difference of the first differing bytes, treated as unsigned char values).

If you compare as int, you will need to check alignment and check if count is divisible by sizeof(int) (to compare the last bytes as char).

Is that really their implementation? I have other issues besides not doing it int-wise:
castng away constness.
does that return statement work? unsigned char - unsigned char = signed int?
int at a time only works if the pointers are aligned, or if you can read a few bytes from the front of each and they are both still aligned, so if both are 1 before the alignment boundary you can read one char of each then go int-at-a-time, but if they are aligned differently eg one is aligned and one is not, there is no way to do this.
memcmp is at its most inefficient (i.e. it takes the longest) when they do actually compare (it has to go to the end) and the data is long.
I would not write my own but if you are going to be comparing large portions of data you could do things like ensure alignment and even pad the ends, then do word-at-a-time, if you want.

Another idea is to optimize for the processor cache and fetching. Processors like to fetch in large chunks rather than individual bytes at random times. Although the internal workings may already account for this, it would be a good exercise anyway. Always profile to determine the most efficient solution.
Psuedo code:
while bytes remaining > (cache size) / 2 do // Half the cache for source, other for dest.
fetch source bytes
fetch destination bytes
perform comparison using fetched bytes
end-while
perform byte by byte comparison for remainder.
For more information, search the web for "Data Driven Design" and "data oriented programming".
Some processors, such as the ARM family, allow for conditional execution of instructions (in 32-bit, non-thumb) mode. The processor fetches the instructions but will only execute them if the conditions are satisfied. In this case, try rephrasing the comparison in terms of boolean assignments. This may also reduce the number of branches taken, which improves performance.
See also loop unrolling.
See also assembly language.
You can gain a lot of performance by tailoring the algorithm to a specific processor, but loose in the portability area.

The code you found is just a debug implementation of memcmp, it's optimized for simplicity and readability, not for performance.
The intrinsic compiler implementation is platform specific and smart enough to generate processor instructions that compare dwords or qwords (depending on the target architecture) at once whenever possible.
Also, an intrinsic implementation may return immediately if both buffers have the same address (buf1 == buf2). This check is also missing in the debug implementation.
Finally, even when you know exactly on which platform you'll be running, the perfect implementation is still the less generic one as it depends on a bunch of different factors that are specific to the rest of your program:
What is the minumum guaranteed buffer alignment?
Can you read any padding bytes past the end of a buffer without triggering an access violation?
May the buffer parameters be identical?
May the buffer size be 0?
Do you only need to compare buffer contents for equality? Or do you also need to know which one is larger (return value < 0 or > 0)?
...
If performace is a concern, I suggest writing the comparison routine in assembly. Most compilers give you an option to see the assembly lising that they generate for a source. You could take that code and adapt it to your needs.

Many processors implement this as a single instruction. If you can guarantee the processor you're running on it can be implemented with a single line of inline assembler.

which of these two methods of converting this array to integer you would suggest?

consider the following array of bytes that is intended to be converted into a single unsigned integer:
unsigned char arr[3] = {0x23, 0x45, 0x67};
each byte represents the equivalent byte in integer, now which one of the following methods would you suggest specially performance-wise:
unsigned int val1 = arr[2] << 16 | arr[1] << 8 | arr[0];
//or
unsigned int val2=arr[0];
*((char *)&val2+1)=arr[1];
*((char *)&val2+2)=arr[2];

I prefer the first method because it is portable. The second isn't due to endianness issues.

This depends on your specific processor, a lot.
For example, on the PowerPC, the second form -- writing through the character pointers -- runs into a tricky implementation detail called a load-hit-store. This is a CPU stall that occurs when you store to a location in memory, then read it back again before the store has completed. The load op cannot complete until the store has finished (most PPCs do not have memory store-forwarding), and the store may take many cycles to make it from the CPU out to the memory cache.
Because of the way the store and arithmetic units are arranged in the pipeline, the CPU will have to flush the pipeline completely until the store completes: this can be a stall of twenty cycles or more during which the CPU has stopped dead. In general, writing to memory and then reading it back immediately is very bad on this platform. So on this case, the sequential bitshifts will be much faster, as they all occur on registers, and will not incur a pipeline stall.
On the Pentium series, the situation may be entirely reversed, because that chipset does have store forwarding and a fast stack architecture, and relatively few architectural registers. On the Core Duos and i7s, it may reverse yet again, because their pipelines are very deep.
Remember: it is not the case that every opcode takes one cycle. CPUs are not simple, and things like superscalar pipes and data hazards may cause instructions to take many cycles, or even many instructions to occur per cycle, depending on just how you arrange your code.
All of this just to underscore the point: this sort of optimization is extremely specific to a particular compiler and chipset. So you must compile, test and measure.

the first is faster, translated in x86 asm. It depends on your architecture anyway. Usually the compilers are able to optimize the first expression very well, and it's more portable too

The performance depends on the compiler and the machine. For example, in my experiment with gcc 4.4.5 on x64 the second was marginally faster, while others report the first as being faster. Therefore I recommend to stick with the first one because it is cleaner (no casts) and safer (no endianness issues).

I believe bitshift will the fastest solution. In my mind the CPU can just slide in the values, but by going directly to the address, like your second example, it will have to use many temp storages.

I would suggest a solution with union :
union color {
// first representation (member of union)
struct s_color {
unsigned char a, b, g, r;
} uc_color;
// second representation (member of union)
unsigned int int_color;
};
int main()
{
color a;
a.int_color = 0x23567899;
a.uc_color.a;
a.uc_color.b;
a.uc_color.g;
a.uc_color.r;
}
Take care that it platform dependent (which endianess)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js