Byte swap during copy - c++

I need to efficiently swap the byte order of an array during copying into another array.
The source array is of a known type (char, short or int), so the byte swapping required is unambiguous and is determined by that type.
My plan is to do this very simply with a multi-pass byte-wise copy (2 passes for short, 4 for int, ...). However, are there any pre-existing "memcpy_swap_16/32/64" functions or libraries? Perhaps from image processing, for BGR/RGB conversion.
EDIT
I know how to swap the bytes of individual values, that is not the problem. I want to do this process during a copy that I am going to perform anyway.
For example, if I have an array of little-endian 4-byte integers I can do the swap by performing 4 byte-wise copies with initial offsets of 0, 1, 2 and 3 and a stride of 4. But there may be a better way; perhaps even reading each 4-byte integer individually and using the byte-swap intrinsics _byteswap_ushort, _byteswap_ulong and _byteswap_uint64 would be faster. But I suspect there must be existing functions that do this type of processing.
EDIT 2
Just found this, which may be a useful basis for SSE, though it's true that memory bandwidth probably makes it a waste of time.
Fast vectorized conversion from RGB to BGRA

Unix systems have a swab function that does what you want for 16-bit arrays. It's probably optimized, but I'm not sure. Note that modern gcc will generate extremely efficient code if you just write the naive byte swap code:
uint32_t x, y;
y = (x<<24) | (x<<8 & 0xff0000) | (x>>8 & 0xff00) | (x>>24);
i.e. it will use the bswap instruction on i486+. Presumably putting this in a loop will give an efficient loop too...
Edit: For your copying task, I would do the following in your loop:
Read a 32-bit value from const uint32_t *src.
Use the above code to swap it.
Write a 32-bit value to uint32_t *dest.
Strictly speaking this may not be portable (aliasing violations), but in practice there is very little to worry about: if you're swapping the data as 32-bit values, it almost surely was actually 32-bit values to begin with, not some other pointer type that was cast, so there's no issue.
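A minimal sketch of that loop, assuming the data really is an array of 32-bit values (the name memcpy_swap32 is illustrative, not an existing library function):
#include <cstddef>
#include <cstdint>

void memcpy_swap32(uint32_t* dest, const uint32_t* src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        uint32_t x = src[i];
        // same shift/mask swap as above; gcc, clang and MSVC typically turn it into bswap
        dest[i] = (x << 24) | ((x << 8) & 0xff0000) | ((x >> 8) & 0xff00) | (x >> 24);
    }
}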

On Linux, you should check the header bits/byteswap.h. There's a family of macros of the form bswap_##, and some of them use assembly instructions where appropriate.

Yes, there are existing functions like the one linked in the question, but it's not worth the effort because the size of the data (in this case) means the set-up overhead is too high. So instead, it's better to just read 2, 4 or 8 bytes at a time, do the swap using intrinsics, and write back.

Related

Can a plain array be put in a CPU register?

Can a plain array (arr) be put into a CPU register in the following situation:
int arr[6] {/*some values*/};
const auto lambda_f = [arr](int input) {/*some manipulations on array elements and input*/};
// further usage of lambda_f
If it can't, how should I rewrite the code to make it possible?
EDIT: Of course, I've meant the copy of the array in the lambda.
In general, yes. But it depends on the particular platform. If you have a very recent Intel x86-64 chip it will have "AVX2" instructions, which operate on 256-bit (32-byte) registers of packed integers.
Your int arr[6] is 4*6=24 bytes, so it will fit in a single AVX register, and then you can use AVX2 integer instructions on it. You'll just need to ignore the extra 8 bytes of the register, but that's usually not a problem.
If you need to support older processors, you can use SSE2 instead, which will work on most any modern-ish x86 system. You can only operate on 128 bits (4 ints) at a time though.
The first step is loading your data into a wide register. For examples, see here: What's the most efficient way to load and extract 32 bit integer values from a 128 bit SSE vector?
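As a hedged illustration (assuming an AVX2-capable CPU and compiler; the helper name is made up), _mm256_maskload_epi32 loads only the lanes whose mask bit is set, so the two unused lanes stay zero and nothing is read past the six-element array:
#include <immintrin.h>

__m256i load_arr6(const int* arr)
{
    const __m256i mask = _mm256_setr_epi32(-1, -1, -1, -1, -1, -1, 0, 0);
    return _mm256_maskload_epi32(arr, mask); // six ints into one 256-bit register
}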
On most platforms, int[6] is too much information to put in a single register, though of course you could have pointers into the array in registers suitable for holding pointers. Or if you mean to ask whether the contents of the array could be stored in some set of registers, it could be possible.
But it all depends on how the data is used, and whether or not the compiler thinks the registers are more valuable for other data. If I can assume your posted code is actually inside a function block, any optimizing compiler would ignore your posted code entirely and get rid of both arr and lambda_f, since you never use them.
The only way to determine the answer is to try it out on specific actual code that does use lambda_f, and examine the resulting assembly.

Difference between uint8_t and unspecified int for large matrices

I have a matrix that is over 17,000 x 14,000 that I'm storing in memory in C++. The values will never get over 255 so I'm thinking I should store this matrix as a uint8_t type instead of a regular int type. Will the regular int type assume the native word size (64-bit, so 8 bytes per cell) even with an optimizing compiler? I'm assuming I'll use 8x less memory if I store the array as uint8_t?
If you doubt this, you could have just tried it.
Of course it will be smaller.
However, it wholly depends on your usage patterns which will be faster. Profile! Profile! Profile!
Reasons for unexpected performance considerations:
alignment issues
elements sharing cache lines (could be positive on sequential access; negative in multicore scenarios)
increased need for locking on atomic reads/writes (in case of threading)
reduced applicability of certain optimized MIPS instructions (? - I'm not up-to-date with details here; also a very good optimizing compiler might simply register-allocate temporaries of the right size)
other, unrelated border conditions, originating from the surrounding code
The standard doesn't specify the exact size of int other than it's at least the size of short. On some 64-bit architectures (for example many Linux and Solaris x86 systems I work with) int is 32 bits and long is 64 bits. The exact size of each type will of course vary by compiler/hardware.
The best way to find out is to use sizeof(int) on your system and see how big it is. If you have enough RAM using the native type may in fact be significantly faster than the uint8_t.
Even the best optimizing compiler is not going to do an analysis of the values of the data that you put into your matrix and assume (anthropomorphizing here) "Hmmm. He said int but everything is between 0 and 255. I'm going to make that an array of uint8_t."
The compiler can interpret some keywords such as register and inline as suggestions rather than mandates. Types on the other hand are mandates. You told the compiler to use int so the compiler must use int. So switching to a uint8_t matrix will save you a considerable amount of memory here.
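To put rough numbers on it (assuming a 4-byte int, as on typical 64-bit Linux and Windows systems), a quick sketch:
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t rows = 17000, cols = 14000;
    std::printf("int matrix:     %zu bytes\n", rows * cols * sizeof(int));          // ~952 MB with a 4-byte int
    std::printf("uint8_t matrix: %zu bytes\n", rows * cols * sizeof(std::uint8_t)); // ~238 MB
    std::vector<std::uint8_t> matrix(rows * cols); // the compact storage
}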

fastest way to write a bitstream on modern x86 hardware

What is the fastest way to write a bitstream on x86/x86-64? (codeword <= 32bit)
By writing a bitstream I refer to the process of concatenating variable bit-length symbols into a contiguous memory buffer.
Currently I've got a standard container with a 32-bit intermediate buffer to write to:
void write_bits(SomeContainer<unsigned int>& dst, unsigned int& buffer,
                unsigned int& bits_left_in_buffer, int codeword, short bits_to_write)
{
    if (bits_to_write < bits_left_in_buffer) {
        buffer |= codeword << (32 - bits_left_in_buffer);
        bits_left_in_buffer -= bits_to_write;
    } else {
        unsigned int full_bits = bits_to_write - bits_left_in_buffer;
        unsigned int towrite = buffer | (codeword << (32 - bits_left_in_buffer));
        buffer = full_bits ? (codeword >> bits_left_in_buffer) : 0;
        dst.push_back(towrite);
        bits_left_in_buffer = 32 - full_bits;
    }
}
Does anyone know of any nice optimizations, fast instructions or other info that may be of use?
I once wrote a quite fast implementation, but it has several limitations: it works on 32-bit x86 for both writing and reading the bitstream, and it doesn't check for buffer limits; I allocated a larger buffer and checked it from time to time from the calling code.
unsigned char* membuff;
unsigned bit_pos; // current BIT position in the buffer, so its max size is 512 MB

// BIT_MASKS[n] has the low n bits set (n = 0..17)
const unsigned int BIT_MASKS[18] = {
    0x00000, 0x00001, 0x00003, 0x00007, 0x0000F, 0x0001F, 0x0003F, 0x0007F, 0x000FF,
    0x001FF, 0x003FF, 0x007FF, 0x00FFF, 0x01FFF, 0x03FFF, 0x07FFF, 0x0FFFF, 0x1FFFF
};

// input bit buffer: we'll round the byte address down so that it's even, and the DWORD
// read from that address will surely contain all of the (at most 17) requested bits
inline unsigned int get_bits(unsigned int bit_cnt) { // bit_cnt MUST be in range 0..17
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1u; // round down to an even byte address
    unsigned int bits = *(unsigned int*)(membuff + byte_offset);
    bits >>= bit_pos & 0xF; // at most 15 bits of shift, so at least 17 valid bits remain
    bit_pos += bit_cnt;
    return bits & BIT_MASKS[bit_cnt];
}

// output buffer: the whole destination should be memset'ed to 0 beforehand
inline void put_bits(unsigned int val, unsigned int bit_cnt) {
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1u;
    *(unsigned int*)(membuff + byte_offset) |= val << (bit_pos & 0xF);
    bit_pos += bit_cnt;
}
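For illustration, a hedged usage sketch of the routines above (it assumes little-endian x86 as stated, that BIT_MASKS is defined, and that the buffer is zeroed and large enough):
#include <cstdlib>

void demo()
{
    membuff = static_cast<unsigned char*>(std::calloc(1024, 1)); // zeroed output buffer
    bit_pos = 0;
    put_bits(0x155, 10); // write a 10-bit codeword
    put_bits(0x2A, 6);   // write a 6-bit codeword

    bit_pos = 0;               // rewind and read the stream back
    unsigned a = get_bits(10); // 0x155
    unsigned b = get_bits(6);  // 0x2A
    (void)a; (void)b;
    std::free(membuff);
}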
It's hard to answer in general because it depends on many factors such as the distribution of bit-sizes you are reading, the call pattern in the client code and the hardware and compiler. In general, the two possible approaches for reading (writing) from a bitstream are:
Using a 32-bit or 64-bit buffer and conditionally reading (writing) from the underlying array when you need more bits. That's the approach your write_bits method takes.
Unconditionally reading (writing) from the underlying array on every bitstream read (write) and then shifting and masking the resultant values.
The primary advantages of (1) include:
It only reads from the underlying buffer the minimum required number of times, in an aligned fashion.
The fast path (no array read) is somewhat faster since it doesn't have to do the read and associated addressing math.
The method is likely to inline better since it doesn't have reads - if you have several consecutive read_bits calls, for example, the compiler can potentially combine a lot of the logic and produce some really fast code.
The primary advantage of (2) is that it is completely predictable - it contains no unpredictable branches.
Just because there is only one advantage for (2) doesn't mean it's worse: that advantage can easily overwhelm everything else.
In particular, you can analyze the likely branching behavior of your algorithm based on two factors:
How often will the bitstream need to read from the underlying buffer?
How predictable is the number of calls before a read is needed?
For example, if you are reading 1 bit 50% of the time and 2 bits 50% of the time, you will do 64 / 1.5 = ~42 reads (if you can use a 64-bit buffer) before requiring an underlying read. This favors method (1) since reads of the underlying are infrequent, even if mis-predicted. On the other hand, if you are usually reading 20+ bits, you will read from the underlying every few calls. This is likely to favor approach (2), unless the pattern of underlying reads is very predictable. For example, if you always read between 22 and 30 bits, you'll perhaps always take exactly three calls to exhaust the buffer and read the underlying array1. So the branch will be well-predicted and (1) will stay fast.
Similarly, it depends on how you call these methods, and how the compiler can inline and simplify the code. Especially if you ever call the methods repeatedly with a compile-time constant size, a lot of simplification is possible. Little to no simplification is available when the size is only known at run time.
Finally, you may be able to get increased performance by offering a more complex API. This mostly applies to implementation option (1). For example, you can offer an ensure_available(unsigned size) call which ensures that at least size bits (usually limited to the buffer size) are available to read. Then you can read up to that number of bits using unchecked calls that don't check the buffer size. This can help you reduce mis-predictions by forcing the buffer fills to a predictable schedule and lets you write simpler unchecked methods.
1 This depends on exactly how your "read from underlying" routine is written, as there are a few options here: Some always fill to 64-bits, some fill to between 57 and 64-bits (i.e., read an integral number of bytes), and some may fill between 32 or 33 and 64-bits (like your example which reads 32-bit chunks).
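For concreteness, here is a hedged sketch of approach (2) for reading: every call does an unconditional 64-bit load at the current byte position, so there is no refill branch. It assumes a little-endian target, a buffer padded with at least 8 readable bytes past the end, and bit_cnt <= 57; the type name BitReader is illustrative.
#include <cstdint>
#include <cstring>

struct BitReader {
    const unsigned char* data; // bitstream storage (padded past the end)
    std::uint64_t bit_pos = 0; // current bit position

    std::uint64_t read_bits(unsigned bit_cnt)
    {
        std::uint64_t chunk;
        std::memcpy(&chunk, data + (bit_pos >> 3), sizeof chunk); // unconditional load, no refill branch
        std::uint64_t value = (chunk >> (bit_pos & 7)) & ((std::uint64_t(1) << bit_cnt) - 1);
        bit_pos += bit_cnt;
        return value;
    }
};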
You'll probably have to wait until 2013 to get hold of real HW, but the "Haswell" new instructions will bring proper vectorised shifts (ie the ability to shift each vector element by different amounts specified in another vector) to x86/AVX. Not sure of details (plenty of time to figure them out), but that will surely enable a massive performance improvement in bitstream construction code.
I don't have the time to write it for you (and I'm not too sure your sample is actually complete enough to do so), but if you must, I can think of:
using translation tables for the various input/output bit-shift offsets; this optimization would make sense for fixed units of n bits (with n sufficiently large (8 bits?) to expect performance gains)
In essence, you'd be able to do
destloc &= (lookuptable[bits_left_in_buffer][input_offset][codeword]);
disclaimer: this is very sloppy pseudo code, I just hope it conveys my idea of a lookup table to prevent bit-shift arithmetic
writing it in assembly (I know i386 has XLAT, but then again, a good compiler might already use something like that)
; Also, XLAT seems limited to 8 bits and the AL register, so it's not really versatile
Update
Warning: be sure to use a profiler and test your optimization for correctness and speed. Using a lookup table can result in poorer performance in light of locality of reference. So, you might need to pin the bit-streaming thread to a single core (set thread affinity) to get the benefits, and you might have to adapt the lookup table size to the processor's L2 cache.
Also, have a look at SIMD, SSE4 or GPU (CUDA) instruction sets if you know you'll have certain features at your disposal.

Implementing memcmp

The following is the Microsoft CRT implementation of memcmp:
int memcmp(const void* buf1,
           const void* buf2,
           size_t count)
{
    if (!count)
        return (0);

    while (--count && *(char*)buf1 == *(char*)buf2) {
        buf1 = (char*)buf1 + 1;
        buf2 = (char*)buf2 + 1;
    }
    return (*((unsigned char*)buf1) - *((unsigned char*)buf2));
}
It basically performs a byte-by-byte comparison.
My question is in two parts:
Is there any reason not to alter this to an int-by-int comparison until count < sizeof(int), then do a byte-by-byte comparison for what remains?
If I were to do 1, are there any potential/obvious problems?
Notes: I'm not using the CRT at all, so I have to implement this function anyway. I'm just looking for advice on how to implement it correctly.
You could do it as an int-by-int comparison or an even wider data type if you wish.
The two things you have to watch out for (at a minimum) are an overhang at the start as well as the end, and whether the alignments are different between the two areas.
Some processors run slower if you access values without following their alignment rules (some even crash if you try it).
So your code could probably do char comparisons up to an int alignment area, then int comparisons, then char comparisons again but, again, the alignments of both areas will probably matter.
Whether that extra code complexity is worth whatever savings you will get depends on many factors outside your control. A possible method would be to detect the ideal case where both areas are aligned identically and do it a fast way, otherwise just do it character by character.
The optimization you propose is very common. The biggest concern would be if you try to run it on a processor that doesn't allow unaligned accesses for anything other than a single byte, or is slower in that mode; the x86 family doesn't have that problem.
It's also more complicated, and thus more likely to contain a bug.
Don't forget that when you find a mismatch within a larger chunk, you must then identify the first differing char within that chunk so that you can calculate the correct return value (memcmp() returns the difference of the first differing bytes, treated as unsigned char values).
If you compare as int, you will need to check alignment and check if count is divisible by sizeof(int) (to compare the last bytes as char).
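A hedged sketch combining the points above (my_memcmp is an illustrative name, not the CRT routine): compare word-sized chunks and, on a mismatch, drop back to byte comparison inside that chunk so the sign of the result matches the first differing unsigned char; using memcpy for the loads sidesteps the alignment and aliasing concerns.
#include <cstddef>
#include <cstring>

int my_memcmp(const void* buf1, const void* buf2, std::size_t count)
{
    const unsigned char* p1 = static_cast<const unsigned char*>(buf1);
    const unsigned char* p2 = static_cast<const unsigned char*>(buf2);

    while (count >= sizeof(std::size_t)) {
        std::size_t a, b;
        std::memcpy(&a, p1, sizeof a); // word-sized loads without alignment assumptions
        std::memcpy(&b, p2, sizeof b);
        if (a != b)
            break;                     // the differing byte is located below
        p1 += sizeof(std::size_t);
        p2 += sizeof(std::size_t);
        count -= sizeof(std::size_t);
    }
    for (std::size_t i = 0; i < count; ++i) // tail bytes (and the mismatching chunk)
        if (p1[i] != p2[i])
            return p1[i] - p2[i];
    return 0;
}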
Is that really their implementation? I have other issues besides not doing it int-wise:
casting away constness.
Does that return statement work? unsigned char - unsigned char = signed int?
Int-at-a-time only works if the pointers are aligned, or if you can read a few bytes from the front of each so that both become aligned; for example, if both are 1 byte before the alignment boundary you can read one char of each and then go int-at-a-time. But if they are aligned differently (e.g. one is aligned and one is not), there is no way to do this.
memcmp is at its most inefficient (i.e. it takes the longest) when the buffers actually compare equal (it has to go all the way to the end) and the data is long.
I would not write my own but if you are going to be comparing large portions of data you could do things like ensure alignment and even pad the ends, then do word-at-a-time, if you want.
Another idea is to optimize for the processor cache and fetching. Processors like to fetch in large chunks rather than individual bytes at random times. Although the internal workings may already account for this, it would be a good exercise anyway. Always profile to determine the most efficient solution.
Pseudo code:
while bytes remaining > (cache size) / 2 do   // half the cache for source, the other half for destination
    fetch source bytes
    fetch destination bytes
    perform comparison using fetched bytes
end-while
perform byte-by-byte comparison for the remainder
For more information, search the web for "Data Driven Design" and "data oriented programming".
Some processors, such as the ARM family, allow for conditional execution of instructions (in 32-bit, non-Thumb mode). The processor fetches the instructions but will only execute them if the conditions are satisfied. In this case, try rephrasing the comparison in terms of boolean assignments. This may also reduce the number of branches taken, which improves performance.
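As a hedged illustration of the boolean-assignment idea (equality only, not ordering; the function name is made up), the loop body contains no early-exit branch, so a compiler targeting 32-bit ARM can emit conditional or select instructions:
#include <cstddef>

bool buffers_equal(const unsigned char* p1, const unsigned char* p2, std::size_t count)
{
    int equal = 1;
    for (std::size_t i = 0; i < count; ++i)
        equal &= (p1[i] == p2[i]); // boolean assignment instead of an early-exit branch
    return equal != 0;
}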
See also loop unrolling.
See also assembly language.
You can gain a lot of performance by tailoring the algorithm to a specific processor, but you lose portability.
The code you found is just a debug implementation of memcmp, it's optimized for simplicity and readability, not for performance.
The intrinsic compiler implementation is platform specific and smart enough to generate processor instructions that compare dwords or qwords (depending on the target architecture) at once whenever possible.
Also, an intrinsic implementation may return immediately if both buffers have the same address (buf1 == buf2). This check is also missing in the debug implementation.
Finally, even when you know exactly on which platform you'll be running, the perfect implementation is still the less generic one as it depends on a bunch of different factors that are specific to the rest of your program:
What is the minimum guaranteed buffer alignment?
Can you read any padding bytes past the end of a buffer without triggering an access violation?
May the buffer parameters be identical?
May the buffer size be 0?
Do you only need to compare buffer contents for equality? Or do you also need to know which one is larger (return value < 0 or > 0)?
...
If performance is a concern, I suggest writing the comparison routine in assembly. Most compilers give you an option to see the assembly listing that they generate for a source file. You could take that code and adapt it to your needs.
Many processors implement this as a single instruction. If you can guarantee the processor you're running on, it can be implemented with a single line of inline assembler.

Do bit operations cause programs to run slower?

I'm dealing with a problem which needs to work with a lot of data. Currently its values are represented as an unsigned int. I know that real values do not exceed a limit of 1000.
Questions
I can use unsigned short to store it. An upside to this is that it'll use less storage space to store the value. Will performance suffer?
If I decide to store the data as short but all the calling functions use int, I understand that I need to convert between these data types when storing or extracting values. Will performance suffer? Will the loss in performance be dramatic?
If I decide not to use short but instead pack the values as 10 bits each into an array of unsigned int, what will happen in this case compared with the previous ones?
This all depends on architecture. Bit-fields are generally slower, but if you are able to significantly cut down memory usage with them, you can even gain in performance due to better CPU caching and similar things. Likewise with short (though it is not dramatic in any case).
The best way is to make your source code able to switch representation easily (at compilation time, of course). Then you will be able to test and profile different implementations in your specific circumstances just by, say, changing one #define.
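For example, a minimal sketch of that compile-time switch (the macro and alias names are illustrative):
#include <cstdint>
#include <vector>

#define USE_COMPACT_STORAGE 1

#if USE_COMPACT_STORAGE
using value_t = std::uint16_t; // compact representation
#else
using value_t = unsigned int;  // native-width representation
#endif

std::vector<value_t> values(1000000); // same code either way; profile both builds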
Also, don't forget the rule about premature optimization. Make it work first. If it turns out to be slow or not fast enough, only then try to speed it up.
I can use unsigned short to store it.
Yes, you can use unsigned short (assuming (sizeof(unsigned short) * CHAR_BIT) >= 10).
An upside to this is that it'll use less storage space to store the value.
Less than what? Less than int? That depends on what sizeof(int) is on your system.
Will performance suffer?
Depends. The type int is supposed to be the most efficient integer type for your system so potentially using short may affect your performance. Whether it does will depend on the system. Time it and find out.
If I decide to store the data as short but all the calling functions use int, I understand that I need to convert between these data types when storing or extracting values.
Yes. But the compiler will do the conversion automatically. One thing you need to watch, though, is conversion between signed and unsigned types. If the value does not fit, the exact result may be implementation-defined.
Will performance suffer?
Maybe. If sizeof(unsigned int) == sizeof(unsigned short) then probably not. Time it and see.
Will the loss in performance be dramatic?
Time it and see.
If I decide not to use short but instead pack the values as 10 bits each into an array of unsigned int, what will happen in this case compared with the previous ones?
Time it and see.
A good compromise for you is probably packing three values into a 32 bit int (with two bits unused). Untangling 10 bits from a bit array is a lot more expensive, and doesn't save much space. You can either use bit fields, or do it by hand yourself:
(i&0x3FF) // Get i[0]
(i>>10)&0x3FF // Get i[1]
(i>>20)&0x3FF // Get i[2]
i = (i&0x3FFFFC00) | (j&0x3FF) // Set i[0] to j
i = (i&0x3FF003FF) | ((j&0x3FF)<<10) // Set i[1] to j
i = (i&0xFFFFF) | ((j&0x3FF)<<20) // Set i[2] to j
You can see here how much extra expense it is: a bit operation and 2/3 of a shift (on average) for get, and three bit operations and 2/3 of a shift (on average) to set. Probably not too bad, especially if you're mostly getting the values not setting them.
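Wrapped up as accessors, a hedged sketch (the function names are made up) for an array that packs three 10-bit values per 32-bit word:
#include <cstddef>
#include <cstdint>

inline unsigned get10(const std::uint32_t* a, std::size_t idx)
{
    return (a[idx / 3] >> (10 * (idx % 3))) & 0x3FF; // extract the idx-th 10-bit value
}

inline void set10(std::uint32_t* a, std::size_t idx, unsigned j)
{
    const unsigned shift = 10 * (idx % 3);
    a[idx / 3] = (a[idx / 3] & ~(0x3FFu << shift)) | ((j & 0x3FFu) << shift); // clear the slot, then insert j
}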