How do you think through bit-masking problems? - bit-manipulation

For example, in this answer to a bit-reversal function problem, posted 4 years ago:
[reverse_Bits function]
https://stackoverflow.com/a/50596723/19574301
Code:
def reverse_Bits(n, no_of_bits):
    result = 0
    for i in range(no_of_bits):
        result <<= 1
        result |= n & 1
        n >>= 1
    return result
I don't understand how to think about the problem at all.
You AND the actual number (n) with one in order to check its rightmost bit. Then you right-shift the number by one, so the next AND checks the second bit, and so on for all the bits. So basically you're adding 1 to the result if there is a 1 in the current bit. Meanwhile you left-shift the result, so I understand you're trying to put the bit at its correct index, and if there is a one you add it... I get lost here.
I mean, I know the code works and I know how, but I couldn't write it from zero without having this reference, because I don't know how you come up with every step of the algorithm.
I don't know if I've explained my problem well or if it's just a mess, but I'm hoping somebody can help me!

If your question is, "how would I write this from scratch, without any help?" then I find personally that it comes about from a combination of sketching out simple cases, working through them manually, and progressive implementation.
For example, you may have started with a simple case: you have the number 3 (because it is easy) and you want to reverse its bits:
3 = 0000 0011 b
need to &1 and if it is non-zero, write 1000 0000 b
need to &2 and if it is non-zero, write 0100 0000 b
need to &4 and as it is zero, write nothing...
...
Okay, how can I automate 1, 2, 4, 8, 16, 32, ...? I can have a variable which doubles each time, or I can left-shift a number by 1. Take your pick; it does not matter.
For writing the values, same thing: how can I write 1000 0000 b and then 0100 0000 b, etc.? Well, start off at 1000 0000 b and divide by 2 (or right-shift by 1) each time.
With these two simple things, you will end up with something like this for one bit:
result = 0
src_mask = 0x01
dst_mask = 0x80
if (number & src_mask) != 0:   # parentheses needed: != binds tighter than & in Python
    result |= dst_mask
One bit working. Then you add a loop so that you can do all the bits, with a *2 for src_mask and a /2 for dst_mask on each iteration to address each bit. Again, this is all figured out from the scribbles on paper listing what I want to happen for each bit.
Then comes optimization: I don't like the 'if', so can I figure out a way of directly adding the bit without testing? If the bit is 0 it will add 0, and if the bit is set, I add the bit.
This is generally the progression: manual scribbles, a first design, and then step-by-step enhancements.
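Putting those pieces together, the one-bit snippet plus the loop might look like this (a Python sketch; the function and variable names are mine, not from the original answer):

```python
def reverse_bits(n, no_of_bits):
    # src_mask walks up from the lowest bit (the *2 step);
    # dst_mask walks down from the highest bit (the /2 step)
    result = 0
    src_mask = 0x01
    dst_mask = 1 << (no_of_bits - 1)
    for _ in range(no_of_bits):
        if (n & src_mask) != 0:
            result |= dst_mask
        src_mask <<= 1
        dst_mask >>= 1
    return result
```

For the scribbled example above, reverse_bits(3, 8) gives 1100 0000 b, i.e. 192.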

Related

Is there anything wrong if the outcome is zero

I'm doing an exercise on two's complement; the question goes like this:
Solving 11 (base 10) – 11 (base 10) using 2's complement will lead to a problem when using a 7-bit data representation. Explain what the problem is and suggest steps to overcome the problem.
I got 0 for the answer because 11 - 11 = 0. What is the problem if the answer is 0, and is there a way to overcome it?
So 11 in base 10 is the following in 7-bit base 2:
000 1011
To subtract 11, you need to find -11 first. One of the many ways is to invert all the bits and add 1, leaving you with:
111 0101
Add the two numbers together:
1000 0000
Well, that's interesting. The 8th bit is a 1.
You didn't end up with zero. Or did you?
That's the question that your homework is attempting to get you to answer.
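To see it numerically, here's a small sketch of the 7-bit computation (my own illustration, not part of the original answer):

```python
def twos_complement_sub(a, b, bits=7):
    # -b in two's complement: invert all bits of b (within 'bits') and add 1
    mask = (1 << bits) - 1
    neg_b = ((~b) + 1) & mask
    raw = a + neg_b             # full sum, including the carry out of bit 7
    return raw, raw & mask      # (raw sum, what fits in a 7-bit register)

raw, result = twos_complement_sub(11, 11)
# raw == 0b10000000: the carry lands in an 8th bit that a 7-bit
# register doesn't have; discarding it leaves the correct answer, 0
```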

Convert each bit in byte to first bit of each nibble in 32 bit int

I have a byte b. I am looking for the most efficient bit manipulation to
convert each bit in b to the first bit of each nibble in a 32 bit int x.
For example, if b = 01010111, then x = 0x10101111
I know I can do a brute force approach:
x = (b&1) | (((b>>1)&1)<<4) | ......
Edit: this is for an OpenCL kernel on a GPU
PDEP
As user harold mentioned in the comments, PDEP is the instruction that does exactly what you want - but it's only available on x86 (as far as I know), and it has terrible[1] performance on the newest AMD chips.
LUT
Barring that, a lookup table of 256 x 4-byte entries seems reasonable - at the cost of 1K of pressure on your cache subsystem. You'll find a lot of smart people advocate against LUTs due to the hidden cost of cache misses - but if this particular operation is in fact "hot" then it may turn out to be the fastest even when factoring in any additional misses.
As with any LUT solution, you should be especially careful to benchmark it not only with micro-benchmarks, but in the full application to evaluate the effect of memory pressure.
You could also consider a compromise split-LUT solution that uses one or two 16-entry LUTs for each nibble of the byte, where the result is calculated something like:
int32 x = high_lut[(b & 0xF0) >> 4] | low_lut[b & 0xF]
This cuts the size of the LUTs down by a factor of between ~11 and 32[2], since we have far fewer entries and some entries can be 2 bytes rather than 4 bytes.
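One way such tables could be built (a Python sketch with my own names; it uses the convention that bit i of b lands in the low bit of nibble i, which may differ from the exact layout you need):

```python
def spread4(n):
    # send bit i of a 4-bit value to bit 4*i of the result
    return sum(((n >> i) & 1) << (4 * i) for i in range(4))

high_lut = [spread4(n) << 16 for n in range(16)]  # 16 x 4-byte entries
low_lut  = [spread4(n) for n in range(16)]        # entries fit in 2 bytes

def spread_byte(b):
    return high_lut[(b & 0xF0) >> 4] | low_lut[b & 0x0F]
```

For example, spread_byte(0x57) yields 0x01010111 under this convention.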
Bit Manipulation
If you really want a bit-manipulation solution, to impress your in-laws or something, you can try something like the following:
Split the byte into nibbles and use multiplication by 0x00001111 (low nibble) and 0x01111000 (high nibble) to splat the low (resp. high) nibble into the low (resp. high) half of the 4-byte word, and combine the results with OR or ADD. So if your byte had bits abcd efgh, you'll have a word like abcd abcd abcd abcd efgh efgh efgh efgh.
AND this result with a mask that picks out the bit that belongs in each nibble (although it usually won't be in the right place within the nibble). The mask is 0x84218421 and the result (in binary) will be something like a000 0b00 00c0 000d e000 0f00 00g0 000h.
Now move the 6 out of 8 bits that aren't already in the high-bit position using the borrow behavior of subtraction, something like: ((x | 0x08880888) - 0x01110111) & 0x88888888.
The basic idea in the last step is that you set the high bit of each nibble and subtract 1 from the nibble. So, for example, you have the 0b00 nibble, which becomes 1b00 - 1: the subtraction borrows through all the zeros and stops at the first one, which is either the high bit (if b is zero) or b itself (if it is one). So you effectively set the high bit of each nibble to the value of the selected bit. Note that you don't need to do this for a or e since they are already in the right place.
The final mask is needed because the subtraction leaves garbage in the low bits of each nibble, so we keep only the high bits.
I didn't try it out, so there are no doubt bugs, but the basic idea should be sound. There are probably various ways to optimize it further, but it's not too bad as is: a couple of multiplications and perhaps a half-dozen bit operations. On platforms with slow multiplication you can probably find another approach for the first step that uses only 1 multiplication combined with a few more primitive operations, or zero multiplications at the cost of several more operations.
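Sketching the above in Python (illustrative only, since the original is untested; I finish with a mask and a shift so each bit lands at the low end of its nibble, the same convention as the LUT section):

```python
def spread_bits(b):
    # step 1: splat each nibble into four copies
    x = (b & 0x0F) * 0x00001111 | (b & 0xF0) * 0x01111000
    # step 2: keep one distinct source bit per nibble (mostly misplaced)
    x &= 0x84218421
    # step 3: set the high bit of each nibble and subtract 1 per nibble;
    # the borrow stops at the selected bit, leaving it in the high bit
    x = ((x | 0x08880888) - 0x01110111) & 0x88888888
    # move each high bit down to the low bit of its nibble
    return x >> 3
```

Checking by hand, spread_bits(0x57) gives 0x01010111 and spread_bits(0xFF) gives 0x11111111.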
[1] Fully 18x worse throughput than Intel - evidently AMD opted not to implement the circuit for PDEP in hardware and instead implemented it via a series of more elementary operations.
[2] The largest reduction comes from sharing a single 16-entry LUT for both the high and low nibble, although this requires an additional shift for the result of the high-nibble lookup. The smaller reduction, shown in the example, uses two 16-entry LUTs: a 4-byte one for the high nibble and a 2-byte one for the low nibble, and avoids the shift.

Count the bits set to 1 in a binary number in C++

How many bits are set to 1 in a binary number of 15 digits?
I have no idea how to start this one. Any help/hints?
Smells like homework, so I'll be all vague and cryptic. But helpful, since that's what we do here at SO.
First, let's figure out how to check the first bit. Hint: you want to set all other bits of the variable to zero and check the value of the result. Since all other bits are zero, the value of the variable will be the value of the first bit (zero or one). Another hint: to set bits to zero, use the AND operation.
Second, let's move the second bit to the first position. There's an operation in C++ just for that.
Third, rinse and repeat until done. Count them ones as you do so.
EDIT: so in pseudocode, assuming X is the source variable:
CountOfOnes = 0
while X != 0
    Y = the first bit of X (Y becomes either 0 or 1)
    CountOfOnes = CountOfOnes + Y
    X = X right shift 1
Specifically for C++ implementation, you need to make X an unsigned variable; otherwise, the shift right operation will act up on you.
Oh, and the << and >> operators are exactly bitwise shifts. In C++ they're sometimes overloaded by classes to mean something else (like I/O), but when acting on integers they perform bit shifting.
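Translated into real code, the pseudocode above might look like this (a Python sketch; the non-negative input stands in for the "make X unsigned" caveat):

```python
def count_ones(x):
    # assumes x is non-negative, the analogue of "make X unsigned"
    count = 0
    while x != 0:
        count += x & 1  # value of the lowest bit
        x >>= 1         # bring the next bit down
    return count
```

For a 15-digit example, count_ones(0b101101101101101) returns 10.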

Can anyone explain how CRC works in this specific case?

I am taught that given:
message M = 101001
polynomial C = x^3 + x^2 + 1 = 1101
I should add k bits to the end of the message such that the result P is divisible by C (where k is the degree of the polynomial, 3 in this case).
I can find no 3 bit combination (XYZ) that when appended to M satisfies this criteria.
Does anyone know what is wrong with my understanding?
I'm 5 months late to this, but here goes:
Perhaps thinking about this as integer (or binary) division is counterproductive. Better to work it out by the continuous XOR method, which gives a checksum of 001 rather than the expected 100. This, when appended to the source, generates the codeword 101001001.
Try this C code to see a somewhat descriptive view.
I'm no expert, but I got most of my CRC fundamentals from here. Hope that helps.
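The continuous XOR method can be sketched like this (a small Python helper of my own, not the linked C code): append k zero bits, then repeatedly XOR the polynomial aligned under the current leading 1 until fewer than k+1 bits remain.

```python
def crc_remainder(msg, poly, degree):
    # append 'degree' zero bits, then do XOR long division
    rem = msg << degree
    while rem.bit_length() > degree:
        rem ^= poly << (rem.bit_length() - degree - 1)
    return rem

# M = 101001, C = 1101 (degree 3): the remainder is 001,
# so the transmitted codeword is 101001001
```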

Is a logical right shift by a power of 2 faster in AVR?

I would like to know if performing a logical right shift is faster when shifting by a power of 2.
For example, is
myUnsigned >> 4
any faster than
myUnsigned >> 3
I appreciate that everyone's first response will be to tell me that one shouldn't worry about tiny little things like this, it's using correct algorithms and collections to cut orders of magnitude that matters. I fully agree with you, but I am really trying to squeeze all I can out of an embedded chip (an ATMega328) - I just got a performance shift worthy of a 'woohoo!' by replacing a divide with a bit-shift, so I promise you that this does matter.
Let's look at the datasheet:
http://atmel.com/dyn/resources/prod_documents/8271S.pdf
As far as I can see, the ASR (arithmetic shift right) always shifts by one bit and cannot take the number of bits to shift; it takes one cycle to execute. Therefore, shifting right by n bits will take n cycles. Powers of two behave just the same as any other number.
In the AVR instruction set, arithmetic shift right and left happen one bit at a time. So, for this particular microcontroller, shifting >> n means the compiler actually makes n many individual asr ops, and I guess >>3 is one faster than >>4.
This makes the AVR fairly unusual, by the way.
You have to consult the documentation of your processor for this information. Even for a given instruction set, there may be different costs depending on the model. On a really small processor, shifting by one could conceivably be faster than by other values, for instance (it is the case for rotation instructions on some IA32 processors, but that's only because this instruction is so rarely produced by compilers).
According to http://atmel.com/dyn/resources/prod_documents/8271S.pdf all logical shifts are done in one cycle for the ATMega328. But of course, as pointed out in the comments, all logical shifts are by one bit. So the cost of a shift by n is n cycles in n instructions.
Indeed, ATMega doesn't have a barrel shifter, just like most (if not all) other 8-bit MCUs. Therefore it can only shift by 1 each time, instead of by arbitrary amounts like more powerful CPUs. As a result, shifting by 4 is theoretically slower than shifting by 3.
However ATMega does have a swap nibble instruction so in fact x >> 4 is faster than x >> 3
Assuming x is a uint8_t, then x >>= 3 is implemented by 3 right shifts:
x >>= 1;
x >>= 1;
x >>= 1;
whereas x >>= 4 only needs a swap and a bit clear:
swap(x); // swap the top and bottom nibbles AB <-> BA
x &= 0x0f;
or
x &= 0xf0;
swap(x);
For bigger cross-register shifts there are also various ways to optimize it
With a uint16_t variable y consisting of the low byte y0 and the high byte y1, y >> 8 is simply:
y0 = y1;
y1 = 0;
Similarly y >> 9 can be optimized to
y0 = y1 >> 1;
y1 = 0;
and hence is even faster than a shift by 3 on a char
In conclusion, the shift time varies depending on the shift distance, but it's not necessarily slower for longer or non-power-of-2 values. Generally it'll take at most 3 instructions to shift within an 8-bit char
Here are some demos from compiler explorer
A right shift by 4 is achieved by a swap and an and like above
swap r24
andi r24,lo8(15)
A right shift by 3 has to be done with 3 instructions
lsr r24
lsr r24
lsr r24
Left shifts are also optimized in the same manner
See also Which is faster: x<<1 or x<<10?
It depends on how the processor is built. If the processor has a barrel-rotate it can shift any number of bits in one operation, but that takes chip space and power budget. The most economical hardware would just be able to rotate right by one, with options regarding the wrap-around bit. Next would be one that could rotate by one either left or right. I can imagine a structure that would have a 1-shifter, 2-shifter, 4-shifter, etc. in which case 4 might be faster than 3.
Disassemble first, then time the code. Don't be discouraged by people telling you that you are wasting your time. The knowledge you gain will put you in a position to be the go-to person for putting out the big company fires. The number of people with real behind-the-curtain knowledge is dropping at an alarming rate in this industry.
Sounds like others have explained the real answer here, which disassembly would have shown: a single-bit shift instruction. So 4 shifts will take 133% of the time that 3 shifts took, or 3 shifts take 75% of the time of 4 shifts, depending on how you compare the numbers. Your measurements should reflect that difference; if they don't, I would continue with this experiment until you completely understand the execution times.
If your target processor has a bit-shift instruction (which is very likely), then it depends on the hardware implementation of that instruction whether there is any difference between shifting by a power-of-2 number of bits or by some other number. However, it is unlikely to make a difference.
With all respect, you should not even start talking about performance until you start measuring. Compile your program with division. Run. Measure time. Repeat with shift.
replacing a divide with a bit-shift
This is not the same for negative numbers:
char div2 (void)
{
    return (-1) / 2;
    // ldi r24,0
}

char asr1 (void)
{
    return (-1) >> 1;
    // ldi r24,-1
}