Convert each bit in byte to first bit of each nibble in 32 bit int - bit-manipulation

I have a byte b. I am looking for the most efficient bit manipulation to
convert each bit in b to the first bit of each nibble in a 32 bit int x.
For example, if b = 01010111, then x = 0x01010111.
I know I can do a brute force approach:
x = (b&1) | (((b>>1)&1)<<4) | ......
Edit: this is for an OpenCL kernel on a GPU.
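For reference, written out in full that brute-force expression would presumably be:
x = ( b        & 1)
  | (((b >> 1) & 1) << 4)
  | (((b >> 2) & 1) << 8)
  | (((b >> 3) & 1) << 12)
  | (((b >> 4) & 1) << 16)
  | (((b >> 5) & 1) << 20)
  | (((b >> 6) & 1) << 24)
  | (((b >> 7) & 1) << 28);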

PDEP
As user harold mentioned in the comments, PDEP is the instruction that does exactly what you want - but it's only available on x86 (as far as I know), and it has terrible[1] performance on the newest AMD chips.
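For reference, on x86 with BMI2 it is a one-liner (not applicable inside an OpenCL GPU kernel, of course); the helper name is mine, and it assumes the question's bit-i-to-low-bit-of-nibble-i mapping:
#include <immintrin.h>   /* compile with -mbmi2 */
#include <stdint.h>

static uint32_t spread_bits_pdep(uint8_t b)
{
    /* Deposit the 8 bits of b into bits 0, 4, 8, ..., 28 (the low bit of each nibble). */
    return _pdep_u32(b, 0x11111111u);
}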
LUT
Barring that, a lookup table of 256 x 4-byte entries seems reasonable - at the cost of 1K of pressure on your cache subsystem. You'll find a lot of smart people advocate against LUTs due to the hidden cost of cache misses - but if this particular operation is in fact "hot" then it may turn out to be the fastest even when factoring in any additional misses.
As with any LUT solution, you should be especially careful to benchmark it not only with micro-benchmarks, but in the full application to evaluate the effect of memory pressure.
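A minimal sketch of the straightforward LUT, generated at startup (in an OpenCL kernel it would more likely be a precomputed constant array; names here are mine):
#include <stdint.h>

static uint32_t lut[256];   /* 256 x 4 bytes = 1 KiB */

static void init_lut(void)
{
    for (int b = 0; b < 256; b++) {
        uint32_t x = 0;
        for (int i = 0; i < 8; i++)
            if (b & (1 << i))
                x |= 1u << (4 * i);   /* bit i of b -> low bit of nibble i */
        lut[b] = x;
    }
}

/* usage: uint32_t x = lut[b]; */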
You could also consider a compromise split-LUT solution that uses one or two 16-entry LUTs for each nibble of the byte, where the result is calculated something like:
int32 x = high_lut[(b & 0xF0) >> 4] | low_lut[b & 0xF]
This cuts the size of the LUTs down by a factor of between ~11 and 32[2], since we have far fewer entries and some entries can be 2 bytes rather than 4 bytes.
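A sketch of the two-table variant, with a 4-byte table for the high nibble and a 2-byte table for the low nibble (table contents again assume the question's bit-i-to-low-bit-of-nibble-i mapping; names are mine):
#include <stdint.h>

static uint32_t high_lut[16];   /* 16 x 4 bytes */
static uint16_t low_lut[16];    /* 16 x 2 bytes: the low-nibble result fits in 16 bits */

static void init_split_luts(void)
{
    for (int n = 0; n < 16; n++) {
        uint32_t spread = 0;
        for (int i = 0; i < 4; i++)
            if (n & (1 << i))
                spread |= 1u << (4 * i);
        low_lut[n]  = (uint16_t)spread;
        high_lut[n] = spread << 16;     /* same pattern, shifted into the high half */
    }
}

static uint32_t spread_byte(uint8_t b)
{
    return high_lut[(b & 0xF0) >> 4] | low_lut[b & 0x0F];
}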
Bit Manipulation
If you really want a bit manipulation solution, to impress your in-laws or something, you can try something like the following:
Split the byte into nibbles and use multiplication by 0x00001111 (low nibble) and 0x01111000 (high nibble) to splat the low (resp. high) nibble into the low (resp. high) half of the 4-byte word, and combine the results with an or or an add. So if your byte had bits abcd efgh you'll have a word like abcd abcd abcd abcd efgh efgh efgh efgh.
AND this result with a mask that picks out the bit that belongs in each nibble (although it usually won't be in the right place within the nibble). The mask is something like 0x84218421 and the result (in binary) will be something like a000 0b00 00c0 000d e000 0f00 00g0 000h.
Now move the 6 out of 8 bits that aren't in the high bit to the right position using the carry behavior of subtraction, something like: ((x | 0x08880888) - 0x01110111) ^ 0x08880888.
The basic idea in the last step is that you set the high bit of each nibble, and subtract 1 from the nibble. So for example, you have the 0b00 nibble, which becomes 1b00 - 1: the subtraction borrows through all the zeros and stops at the first one, which is either the high bit (if b is zero) or b itself (if b is one). So you effectively set the high bit based on the value of the selected bit. Note that you don't need to do this for a or e since they are already in the right place.
The final xor is needed because the above actually sets the high bit to the opposite value as the selected bit, so we need to flip it.
I didn't try it out, so there are no doubt bugs, but the basic idea should be sound. There are probably various ways to optimize it further, but it's not too bad as is: a couple of multiplications and perhaps a half-dozen bit operations. On platforms with slow multiplication you can probably find another approach for the first step that uses only one multiplication combined with a few more primitive operations, or none at all at the cost of several more operations.
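For concreteness, here is a sketch along those lines in plain C. It follows the multiply-and-mask steps above, but finishes with a plain shift/or collapse into the low bit of each nibble (matching the question's brute-force mapping) rather than the subtraction trick, since that variant was easier to verify; treat it as an illustration, not a drop-in for the exact steps described:
#include <stdint.h>

/* Spread bit i of b into bit 0 of nibble i of the result (bit i -> bit 4*i). */
static uint32_t spread_bits(uint8_t b)
{
    /* Splat the low nibble across the low half and the high nibble across
       the high half: abcd abcd abcd abcd efgh efgh efgh efgh. */
    uint32_t word = (uint32_t)(b & 0x0F) * 0x00001111u
                  | (uint32_t)(b & 0xF0) * 0x01111000u;

    /* Keep the one copy of each bit that lives in "its" nibble:
       a000 0b00 00c0 000d e000 0f00 00g0 000h. */
    uint32_t picked = word & 0x84218421u;

    /* Collapse each nibble's single bit down to that nibble's low bit. */
    return (picked | (picked >> 1) | (picked >> 2) | (picked >> 3)) & 0x11111111u;
}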
[1] Fully 18x worse throughput than Intel - evidently AMD opted not to implement the circuit to do PDEP in hardware, and instead implemented it via a series of more elementary operations.
[2] The largest reduction is if you share a single 16-entry LUT for both the high and low nibble, although this requires an additional shift for the result of the high-nibble lookup. The smaller reduction, shown in the example, uses two 16-entry LUTs: a 4-byte one for the high nibble and a 2-byte one for the low nibble, and avoids the shift.

Related

How do I efficiently reorder bytes of a __m256i vector (convert int32_t to uint8_t)?

I need to optimize the following compression operation (on a server with AVX2 instructions available):
take the exponents of an array of floats, shift and store to a uint8_t array
I have little experience and it was suggested that I start with the https://github.com/feltor-dev/vcl library.
Now that I have
uint8_t* uint8_t_ptr = ...;
float* float_ptr = ...;
float* final_ptr = float_ptr + offset;
for (; float_ptr < final_ptr; float_ptr += 8) {
    Vec8f vec_f = Vec8f().load(float_ptr);
    Vec8i vec_i = fraction(vec_f) + 128; // range: 0~255
    ...
}
My question is how to efficiently store the vec_i results to the uint8_t array?
I couldn't find relevant functions in the vcl library and was trying to explore the intrinsic instructions since I could access the __m256i data.
My current understanding is to use something like _mm256_shuffle_epi8, but I don't know the best way to do it efficiently.
I wonder if trying to fully utilize the bits and store 32 elements every time (using a loop with float_ptr+=32) would be the way to go.
Any suggestions are welcome. Thanks.
Probably your best bet for vectorization of this might be with vpackssdw / vpackuswb, and vpermd as a lane-crossing fixup after in-lane pack.
_mm256_srli_epi32 to shift the exponent (and sign bit) to the bottom in each 32-bit element. A logical shift leaves a non-negative result regardless of the sign bit.
Then pack pairs of vectors down to 16-bit with _mm256_packs_epi32 (signed input, signed saturation of output).
Then mask off the sign bit, leaving an 8-bit exponent. We wait until now so we can do 16x uint16_t elements per instruction instead of 8x uint32_t. Now you have 16-bit elements holding values that fit in uint8_t without overflowing.
Then pack pairs of vectors down to 8-bit with _mm256_packus_epi16 (signed input, unsigned saturation of output). This actually matters: packs would clip some valid values, because your data uses the full range of uint8_t.
VPERMD to shuffle the eight 32-bit chunks of that vector that came from each lane of 4x 256-bit input vectors. Exactly the same __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7)); shuffle as in How to convert 32-bit float to 8-bit signed char?, which does the same pack after using FP->int conversion instead of right-shift to grab the exponent field.
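A hedged sketch of that recipe with raw intrinsics (the helper name pack_exponents is mine; the exponent unbiasing discussed in the update below is left out):
#include <immintrin.h>

/* Pack the biased-exponent bytes of 32 floats (4x __m256) into one __m256i of 32 uint8_t. */
static inline __m256i pack_exponents(__m256 a, __m256 b, __m256 c, __m256 d)
{
    /* Logical right shift: sign + exponent end up in the low 9 bits of each dword. */
    __m256i ea = _mm256_srli_epi32(_mm256_castps_si256(a), 23);
    __m256i eb = _mm256_srli_epi32(_mm256_castps_si256(b), 23);
    __m256i ec = _mm256_srli_epi32(_mm256_castps_si256(c), 23);
    __m256i ed = _mm256_srli_epi32(_mm256_castps_si256(d), 23);

    /* Pack pairs of dword vectors to words: values 0..511 fit in int16, no clipping. */
    __m256i ab = _mm256_packs_epi32(ea, eb);
    __m256i cd = _mm256_packs_epi32(ec, ed);

    /* Drop the sign bit, leaving the 8-bit biased exponent in each 16-bit element. */
    const __m256i exp_mask = _mm256_set1_epi16(0x00FF);
    ab = _mm256_and_si256(ab, exp_mask);
    cd = _mm256_and_si256(cd, exp_mask);

    /* Pack words to unsigned bytes, then fix the in-lane interleaving with vpermd. */
    __m256i abcd = _mm256_packus_epi16(ab, cd);
    return _mm256_permutevar8x32_epi32(abcd,
           _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7));
}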
Per result vector, you have 4x load+shift (vpsrld ymm,[mem] hopefully), 2x vpackssdw shuffles, 2x vpand mask, 1x vpackuswb, and 1x vpermd. That's 4 shuffles, so the best we can hope for on Intel HSW/SKL is 1 result vector per 4 clocks. (Ryzen has better shuffle throughput, except for vpermd which is expensive.)
But that should be achievable, so 32 bytes of input / 8 bytes of output per clock on average.
The 10 total vector ALU uops (including the micro-fused load+ALU), and the 1 store should be able to execute in that time. We have room for 16 total uops including loop overhead before the front-end becomes a worse bottleneck than shuffles.
update: oops, I forgot to count unbiasing the exponent; that will take an extra add. But you can do that after packing down to 8-bit. (And optimize it to an XOR). I don't think we can optimize it away or into something else, like into masking away the sign bit.
With AVX512BW, you could do a byte-granularity vpaddb to unbias, with zero-masking to zero the high byte of each pair. That would fold the unbiasing into the 16-bit masking.
AVX512F also has vpmovdb 32->8 bit truncation (without saturation), but only for single inputs. So you'd get one 64-bit or 128-bit result from one input 256 or 512-bit vector, with 1 shuffle + 1 add per input instead of 2+1 shuffles + 2 zero-masked vpaddb per input vector. (Both need the right shift per input vector to align the 8-bit exponent field with a byte boundary at the bottom of a dword)
With AVX512VBMI, vpermt2b would let us grab bytes from 2 input vectors. But it costs 2 uops on CannonLake, so it's only useful on hypothetical future CPUs if it gets cheaper. The grabbed bytes can be the top byte of a dword, so we could start with vpaddd of a vector with itself to left-shift by 1. But we're probably better off with an actual shift, because the EVEX encoding of vpslld or vpsrld can take the data from memory with an immediate shift count, unlike the VEX encoding. So hopefully we get a single micro-fused load+shift uop to save front-end bandwidth.
The other option is to shift + blend, resulting in byte-interleaved results that are more expensive to fix up, unless you don't mind that order.
And byte-granularity blending (without AVX512BW) requires vpblendvb which is 2 uops. (And on Haswell only runs on port 5, so potentially a huge bottleneck. On SKL it's 2 uops for any vector ALU port.)

Is it faster to multiply low numbers in C/C++ (as opposed to high numbers)?

Example of question:
Is calculating 123 * 456 faster than calculating 123456 * 7890? Or is it the same speed?
I'm wondering about 32 bit unsigned integers, but I won't ignore answers about other types (64 bit, signed, float, etc.). If it is different, what is the difference due to? Whether or not the bits are 0/1?
Edit: If it makes a difference, I should clarify that I'm referring to any number (two random numbers lower than 100 vs two random numbers higher than 1000)
For built-in types up to at least the architecture's word size (e.g. 64 bit on a modern PC, 32 or 16 bit on most low-cost general-purpose CPUs from the last couple of decades), for every compiler/implementation/version and CPU I've ever heard of, the CPU opcode for multiplication of a particular integral size takes a certain number of clock cycles irrespective of the quantities involved. Multiplications of data with different sizes perform differently on some CPUs (e.g. the AMD K7 has 3 cycles latency for 16-bit IMUL, vs 4 for 32-bit).
It is possible that on some architecture and compiler/flags combination, a type like long long int has more bits than the CPU opcodes can operate on in one instruction, so the compiler may emit code to do the multiplication in stages and that will be slower than multiplication of CPU-supported types. But again, a small value stored at run-time in a wider type is unlikely to be treated - or perform - any differently than a larger value.
All that said, if one or both values are compile-time constants, the compiler is able to avoid the CPU multiplication operator and optimise to addition or bit shifting operators for certain values (e.g. 1 is obviously a no-op, either side 0 ==> 0 result, * 4 can sometimes be implemented as << 2). There's nothing in particular stopping techniques like bit shifting being used for larger numbers, but a smaller percentage of such numbers can be optimised to the same degree (e.g. there're more powers of two - for which multiplication can be performed using bit shifting left - between 0 and 1000 than between 1000 and 2000).
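For example (what the compiler actually emits varies by compiler, flags and target, so the comments describe only typical outcomes):
unsigned times4(unsigned x)  { return x * 4u;  }  /* typically compiled to x << 2 */
unsigned times10(unsigned x) { return x * 10u; }  /* often (x << 3) + (x << 1), or lea-based code on x86 */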
This is highly dependent on the processor architecture and model.
In the old days (ca 1980-1990), the number of ones in the two numbers would be a factor - the more ones, the longer it took to multiply [after sign adjustment, so multiplying by -1 wasn't slower than multiplying by 1, but multiplying by 32767 (15 ones) was notably slower than multiplying by 17 (2 ones)]. That's because a multiply is essentially:
unsigned int multiply(unsigned int a, unsigned int b)
{
    unsigned int res = 0;
    for (int i = 0; i < 32; i++)   // once per bit
    {
        if (b & 1)
        {
            res += a;
        }
        a <<= 1;
        b >>= 1;
    }
    return res;
}
In modern processors, multiply is quite fast either way, but a 64-bit multiply can be a clock cycle or two slower than a 32-bit one, simply because modern processors can "afford" to put down the whole logic for doing this in a single cycle - both in terms of the speed of the transistors themselves and the area that those transistors take up.
Further, in the old days there were often instructions to do 16 x 16 -> 32 bit results, but if you wanted 32 x 32 -> 32 (or 64), the compiler would have to call a library function [or inline such a function]. Today, I'm not aware of any modern high-end processor [x86, ARM, PowerPC] that can't do at least 64 x 64 -> 64, and some do 64 x 64 -> 128, all in a single instruction (not always a single cycle though).
Note that I'm completely ignoring the fact that whether the data is in cache is an important factor. Yes, that is a factor - and it's a bit like ignoring wind resistance when traveling at 200 km/h - it's not at all something you ignore in the real world. However, it is quite unimportant for THIS discussion. Just like people making sports cars care about aerodynamics, getting complex [or simple] software to run fast involves a certain amount of caring about the cache contents.
For all intents and purposes, the same speed (even if there were differences in computation speed, they would be immeasurable). Here is a reference benchmarking different CPU operations if you're curious: http://www.agner.org/optimize/instruction_tables.pdf.

Why are CRC Polynomials given as Normal, Reversed, etc.?

I'm learning about CRCs, and search engines and SO turn up nothing on this....
Why do we have "Normal" and "Reversed" and "Reciprocal" Polynomials? Does one favor Big Endian, Little Endian, or something else?
The classic definition of a CRC would use a non-reflected polynomial, which shifts the CRC left. If the word size being used for the calculation is larger than the CRC, then you would need an operation at the end to clear the high bits that were shifted in (e.g. & 0xffff for a 16-bit CRC).
You can flip the whole thing, use a reflected polynomial, and shift right instead of left. That gives the same CRC properties, but the bits from the message are effectively operated on from least to most significant bit, instead of most to least significant bit. Since you are shifting right, the extraneous bits get dropped off the bottom into oblivion, and there is no need for the additional operation. This may have been one of the early motivations to use a very slightly faster and more compact implementation.
Sometimes the specification from the original hardware is that the bits are processed from least to most significant, so then you have to use the reflected version.
No, none of this favors little or big endian. Either kind of CRC can be computed just as easily in little-endian or big-endian architectures.
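To make the difference concrete, here is a minimal bit-at-a-time sketch of both conventions using the CCITT polynomial (0x1021 in normal form, 0x8408 reflected); the zero initial value and missing final XOR are simplifications, so these are not presented as any particular named CRC spec:
#include <stdint.h>
#include <stddef.h>

/* Non-reflected: process bits MSB-first, shift left, mask off the overflow bits. */
uint16_t crc16_normal(const uint8_t *p, size_t n)
{
    uint32_t crc = 0;                       /* deliberately wider than the CRC */
    while (n--) {
        crc ^= (uint32_t)(*p++) << 8;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000) ? (crc << 1) ^ 0x1021 : (crc << 1);
        crc &= 0xFFFF;                      /* the extra clearing operation mentioned above */
    }
    return (uint16_t)crc;
}

/* Reflected: process bits LSB-first, shift right; excess bits simply fall off the bottom. */
uint16_t crc16_reflected(const uint8_t *p, size_t n)
{
    uint32_t crc = 0;
    while (n--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc & 1) ? (crc >> 1) ^ 0x8408 : (crc >> 1);
    }
    return (uint16_t)crc;
}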

Are there any good reasons to use bit shifting except for quick math?

I understand bitwise operations and how they might be useful for different purposes, e.g. permissions. However, I don't seem to understand what use the bit shift operators are. I understand how they work, but I can't think of any scenarios where I might want to use them unless I want to do some really quick multiplication or division. Are there any other reasons to use bit-shifting?
There are many reasons, here are some:
Let's say you represent a black and white image as a sequence of bits and you want to set a single pixel in this image generically. For example, your byte offset may be x>>3 and your bit offset may be x & 0x7, and you can set that bit with: byte = byte | (1 << (x & 0x7)); (a minimal sketch follows this list).
Implementing data compression algorithms where you deal with variable-length bit sequences, e.g. Huffman coding.
You're interacting with some hardware, e.g. a serial communication device, and you need to read or set some control bits.
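As a minimal sketch of the first point above (the helper name set_pixel is mine):
#include <stdint.h>
#include <stddef.h>

/* Set pixel x in a 1-bit-per-pixel bitmap: x>>3 selects the byte, x&7 the bit within it. */
static void set_pixel(uint8_t *bitmap, size_t x)
{
    bitmap[x >> 3] |= (uint8_t)(1u << (x & 0x7));
}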
For those and other reasons most processors have bit shift and/or rotation instructions as well as other logic instructions (and/or/xor/not).
Historically, multiplication and division were significantly slower, as they are more complex operations, and some CPUs didn't have them at all.
Also see here:
Have you ever had to use bit shifting in real projects?
As you indicate, a left shift is the same thing as a multiplication by two. At least it is when we're talking about unsigned quantities. The meaning of a "left shift" of a signed quantity is ... language dependent.
With modern compilers, there's really no difference between writing "i = x*2;" and "i = x << 1;". The compiler will generate the most efficient code. So in that sense there's no reason to prefer shift over multiply.
Some algorithms work by shifting a quantity left by one bit and then setting the low bit to either 0 or 1. Some simple compression algorithms work this way. For example, if your accumulated value is in the variable x, and the current value (0 or 1) is in y, then it makes more sense to write "x = (x << 1) | y", rather than "x = (x * 2) + y". Both do the same thing, but the first is more notationally correct. You don't have to think, "oh, right, multiply by two is the same as a left shift."
Also, when you're talking about algorithms that shift bits, it's more convenient to shift left or right by a particular number of bits than to figure out what multiple of 2 you want to multiply or divide by.
So, whereas there's typically no performance benefit to shifting rather than multiplying--at least not when working with high level languages--there are times when having the ability to shift makes what you're doing more easily understood.
There are a lot of places where bit shift operations are regularly used outside of their usage in numerical computations. For example, a Bitboard is a data structure that is commonly used in board games for board representation. Some of the strongest chess engines use this data structure mainly for speed and ease of move generation and evaluation. These programs use bit operations heavily, and bit-shift operations specifically are used in a lot of contexts - such as finding bit masks, generating new moves on the board, computing logarithms very quickly, etc. There are even very advanced numerical computations that can be done elegantly by clever use of bit operations. Check out this site for bit twiddling hacks - a lot of those algorithms use shift operators. Bit shift operations are regularly used in device driver programming, codec development, embedded systems programming and so on.
Shifting allows accessing specific bits within a variable. The expression (n >> p) & ((1 << m) - 1) retrieves an m-bit portion of the variable n with an offset of p bits from the right.
This allows your program to use integers that aren't multiples of 8 bits, which is useful for data compression.
For example, I used it in my Netflix Prize programs to pack records (22-bit user ID + 15-bit movie ID + 12-bit date + 3-bit rating) into a uint64_t (with 12 bits to spare).
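A sketch of that kind of packing (only the field widths come from the description above; the field order and helper names are assumptions for illustration):
#include <stdint.h>

/* 22-bit user + 15-bit movie + 12-bit date + 3-bit rating = 52 bits, 12 bits spare. */
static uint64_t pack_record(uint32_t user, uint32_t movie, uint32_t date, uint32_t rating)
{
    return ((uint64_t)user  << 30) |   /* bits 51..30 */
           ((uint64_t)movie << 15) |   /* bits 29..15 */
           ((uint64_t)date  <<  3) |   /* bits 14..3  */
            (uint64_t)rating;          /* bits  2..0  */
}

static uint32_t unpack_movie(uint64_t rec)
{
    /* (n >> p) & ((1 << m) - 1) with p = 15 and m = 15 */
    return (uint32_t)(rec >> 15) & ((1u << 15) - 1);
}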
A very common special case is to pack 8 bool variables into each byte. (Unix file permissions, black-and-white bitmaps, CPU flags registers, etc.)
Also, bit manipulation is used in UTF-8, which is a very popular character encoding. Unicode characters are represented by distributing their bits across 1, 2, 3, or 4 bytes.
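As an illustration of that last point, a minimal UTF-8 encoder is mostly shifts and masks (this sketch skips surrogate and range validation; the function name is mine):
#include <stdint.h>

/* Encode one Unicode code point (< 0x110000) as UTF-8; returns the number of bytes written. */
int utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    } else if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    } else {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
}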

Is a logical right shift by a power of 2 faster in AVR?

I would like to know if performing a logical right shift is faster when shifting by a power of 2.
For example, is
myUnsigned >> 4
any faster than
myUnsigned >> 3
I appreciate that everyone's first response will be to tell me that one shouldn't worry about tiny little things like this, it's using correct algorithms and collections to cut orders of magnitude that matters. I fully agree with you, but I am really trying to squeeze all I can out of an embedded chip (an ATMega328) - I just got a performance shift worthy of a 'woohoo!' by replacing a divide with a bit-shift, so I promise you that this does matter.
Let's look at the datasheet:
http://atmel.com/dyn/resources/prod_documents/8271S.pdf
As far as I can see, the ASR (arithmetic shift right) always shifts by one bit and cannot take the number of bits to shift; it takes one cycle to execute. Therefore, shifting right by n bits will take n cycles. Powers of two behave just the same as any other number.
In the AVR instruction set, arithmetic shift right and left happen one bit at a time. So, for this particular microcontroller, shifting >> n means the compiler actually emits n individual shift instructions, and I guess >>3 is one instruction faster than >>4.
This makes the AVR fairly unusual, by the way.
You have to consult the documentation of your processor for this information. Even for a given instruction set, there may be different costs depending on the model. On a really small processor, shifting by one could conceivably be faster than by other values, for instance (it is the case for rotation instructions on some IA32 processors, but that's only because this instruction is so rarely produced by compilers).
According to http://atmel.com/dyn/resources/prod_documents/8271S.pdf all logical shifts are done in one cycle for the ATMega328. But of course, as pointed out in the comments, all logical shifts are by one bit. So the cost of a shift by n is n cycles in n instructions.
Indeed, the ATmega doesn't have a barrel shifter, just like most (if not all) other 8-bit MCUs. Therefore it can only shift by 1 each time, instead of by an arbitrary number of bits like more powerful CPUs. As a result, shifting by 4 is theoretically slower than shifting by 3.
However, the ATmega does have a swap-nibble instruction, so in fact x >> 4 is faster than x >> 3.
Assuming x is a uint8_t, then x >>= 3 is implemented by 3 right shifts
x >>= 1;
x >>= 1;
x >>= 1;
whereas x >>= 4 only needs a swap and a bit clear
swap(x); // swap the top and bottom nibbles AB <-> BA
x &= 0x0f;
or
x &= 0xf0;
swap(x);
For bigger cross-register shifts there are also various ways to optimize them.
With a uint16_t variable y consisting of the low byte y0 and high byte y1, y >> 8 is simply
y0 = y1;
y1 = 0;
Similarly y >> 9 can be optimized to
y0 = y1 >> 1;
y1 = 0;
and hence is even faster than a shift by 3 on a char.
In conclusion, the shift time varies depending on the shift distance, but it's not necessarily slower for longer or non-power-of-2 shift counts. Generally it'll take at most 3 instructions to shift within an 8-bit char.
Here are some demos from Compiler Explorer.
A right shift by 4 is achieved with a swap and an and, as above:
swap r24
andi r24,lo8(15)
A right shift by 3 has to be done with 3 instructions:
lsr r24
lsr r24
lsr r24
Left shifts are also optimized in the same manner
See also Which is faster: x<<1 or x<<10?
It depends on how the processor is built. If the processor has a barrel-rotate it can shift any number of bits in one operation, but that takes chip space and power budget. The most economical hardware would just be able to rotate right by one, with options regarding the wrap-around bit. Next would be one that could rotate by one either left or right. I can imagine a structure that would have a 1-shifter, 2-shifter, 4-shifter, etc. in which case 4 might be faster than 3.
Disassemble first, then time the code. Don't be discouraged by people telling you that you are wasting your time. The knowledge you gain will put you in a position to be the go-to person for putting out the big company fires. The number of people with real behind-the-curtain knowledge is dropping at an alarming rate in this industry.
Sounds like others explained the real answer here, which disassembly would have shown: a single-bit shift instruction. So 4 shifts will take 133% of the time that 3 shifts took, or 3 shifts take 75% of the time of 4 shifts, depending on how you compare the numbers. And your measurements should reflect that difference; if they don't, I would continue with this experiment until you completely understand the execution times.
If your target processor has a bit-shift instruction (which is very likely), then it depends on the hardware implementation of that instruction whether there will be any difference between shifting by a power-of-2 number of bits and shifting by some other number. However, it is unlikely to make a difference.
With all respect, you should not even start talking about performance until you start measuring. Compile your program with the division. Run it. Measure the time. Repeat with the shift.
Regarding "replacing a divide with a bit-shift":
This is not the same for negative numbers:
char div2 (void)
{
return (-1) / 2;
// ldi r24,0
}
char asr1 (void)
{
return (-1) >> 1;
// ldi r24,-1
}