how are bitshift, bitrotate implemented in circuit? - bit-manipulation

Can you implement bit shift using only logic operations: and, or, not, xor? Can you use bitshift in a bitblt?

To implement bit shifts/rotates in circuits: you can build registers from an array of flip-flops, which in turn you can build e.g. from NAND gates.
To implement the shift/rotate you would wire up two such registers (or feed back into the same register), connecting the output of bit 0 to the input of bit 1 and so on.
The contents are then transferred from one array of flip-flops to the other, e.g. on the next rising clock edge.
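To make the wiring concrete, here is a tiny software model of that clocked transfer (a sketch of my own, not from the original answer): each destination flip-flop latches the output of the source flip-flop one position below it, which is a left shift by one; feeding the top bit back into bit 0 turns it into a rotate.

#include <stdbool.h>

#define WIDTH 8

/* One simulated clock edge: dst latches a shifted copy of src. */
void clock_edge(const bool src[WIDTH], bool dst[WIDTH], bool rotate)
{
    dst[0] = rotate ? src[WIDTH - 1] : false;  /* rotate feeds the top bit back into bit 0 */
    for (int i = 1; i < WIDTH; i++)
        dst[i] = src[i - 1];                   /* output of bit i-1 wired to input of bit i */
}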

You can emulate a left shift with the addition a + a. The result of an and/or/not/xor does not depend on adjacent bits, so you can't use those operations alone to build bit shifts. In a circuit, I'd expect shifts are hard-wired... You can use bit-shifting for fast hardware multiplication anyway.
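As a sketch of the a + a trick (my own illustration): a left shift by k is just k doublings, so it can be built from an adder rather than a dedicated shifter.

#include <stdint.h>

uint32_t shift_left_by_add(uint32_t a, unsigned k)
{
    while (k--)
        a = a + a;   /* adding a value to itself shifts it left by one bit */
    return a;
}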

Related

What is the Lower and the higher part of multiplication in assembly instructions

I was reading this link; in short, can someone explain the problem with current C++ compilers to someone who started learning about x86 and 64-bit assembly a week ago?
Unfortunately current compilers don't optimize #craigster0's nice portable version, so if you want to take advantage of 64-bit CPUs, you can't use it except as a fallback for targets you don't have an #ifdef for. (I don't see a generic way to optimize it; you need a 128-bit type or an intrinsic.)
For clarification: I was researching the benefits of assembly when I came across people saying in multiple posts that current compilers are not optimised when it comes to 64-bit multiplication, because they use only the lowest part and so do not perform a full 64-bit multiplication. What does this mean? And what is the meaning of getting the higher part? Also, I read in a book I have that on the 64-bit architecture only the lowest 32 bits of RFLAGS are used. Are these related? I am confused.
Most CPUs will allow you to start with two operands, each the size of a register, and multiply them together to get a result that fills two registers.
For example, on x86 if you multiply two 32-bit numbers, you'll get the upper 32 bits of the result in EDX and the lower 32 bits of the result in EAX. If you multiply two 64-bit numbers, you get the results in RDX and RAX instead.
On other processors, other registers are used, but the same basic idea applies: one register times one register gives a result that fills two registers.
C and C++ don't provide an easy way of taking advantage of that capability. When you operate on types smaller than int, the input operands are converted to int, then the ints are multiplied, and the result is an int. If the inputs are larger than int, then they're multiplied as the same type, and the result is the same type. Nothing is done to take into account that the result is twice as big as the input types, and virtually every processor on earth will produce a result twice as big as each input is individually.
There are, of course, ways of dealing with that. The simplest is the basic factoring we learned in grade school: take each number and break it up into upper and lower halves. We can then multiply those pieces together individually: (a+b) * (c+d) = ac + ad + bc + bd. Since each of those multiplications has only half as many non-zero bits, we can do each piece of arithmetic as a half-size operation producing a full-sized result (plus a single bit carried out from the addition). For example, if we wanted to do 64-bit multiplication on a 64-bit processor to get a 128-bit result, we'd break each 64-bit input up into 32-bit pieces. Then each multiplication would produce a 64-bit result. We'd then add the pieces together (with suitable bit-shifts) to get our final 128-bit result.
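As a rough illustration of that grade-school method (structure and names are mine, following the widely used portable idiom): split each 64-bit input into 32-bit halves, do four 32x32->64 multiplications, and recombine the partial products with shifts and adds.

#include <stdint.h>

/* Portable 64x64 -> 128-bit multiplication built from 32-bit halves. */
void mul64x64_to_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* low  x low  */
    uint64_t p1 = a_lo * b_hi;   /* low  x high */
    uint64_t p2 = a_hi * b_lo;   /* high x low  */
    uint64_t p3 = a_hi * b_hi;   /* high x high */

    uint64_t cross = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;   /* middle column plus carry */
    *lo = (cross << 32) | (uint32_t)p0;
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (cross >> 32);
}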
But, as Peter pointed out, when we do that, compilers are not smart enough to realize what we're trying to accomplish and turn that sequence of multiplications and additions back into a single multiplication producing a result twice as large as each input. Instead, they translate the expression fairly directly into a series of multiplications and additions, so it takes somewhere around four times longer than the single multiplication would have.

Why are CRC Polynomials given as Normal, Reversed, etc.?

I'm learning about CRCs, and search engines and SO turn up nothing on this....
Why do we have "Normal" and "Reversed" and "Reciprocal" Polynomials? Does one favor Big Endian, Little Endian, or something else?
The classic definition of a CRC would use a non-reflected polynomial, which shifts the CRC left. If the word size being used for the calculation is larger than the CRC, then you would need an operation at the end to clear the high bits that were shifted up out of the CRC (e.g. & 0xffff for a 16-bit CRC).
You can flip the whole thing, use a reflected polynomial, and shift right instead of left. That gives the same CRC properties, but the bits from the message are effectively operated on from least to most significant bit, instead of most to least significant bit. Since you are shifting right, the extraneous bits get dropped off the bottom into oblivion, and there is no need for the additional operation. This may have been one of the early motivations to use a very slightly faster and more compact implementation.
Sometimes the specification from the original hardware is that the bits are processed from least to most significant, so then you have to use the reflected version.
No, none of this favors little or big endian. Either kind of CRC can be computed just as easily in little-endian or big-endian architectures.
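A minimal sketch of the two loop shapes described above (polynomial and initial value are illustrative, not any particular standard): the non-reflected version shifts left and relies on the 16-bit type to discard the bits shifted off the top, while the reflected version processes bits least significant first, shifts right, and needs no masking at all.

#include <stdint.h>
#include <stddef.h>

/* Bit-at-a-time CRC-16 with a non-reflected polynomial (shift left, MSB first). */
uint16_t crc16_msb_first(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;                   /* example initial value */
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;       /* message bits enter at the top */
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        /* the uint16_t type keeps only the low 16 bits; with a wider
           accumulator you would need the & 0xffff mentioned above */
    }
    return crc;
}

/* Same idea with the reflected polynomial (shift right, LSB first):
   extraneous bits fall off the bottom, so no masking is needed. */
uint16_t crc16_lsb_first(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];                      /* message bits enter at the bottom */
        for (int b = 0; b < 8; b++)
            crc = (crc & 1) ? (crc >> 1) ^ 0x8408 : (crc >> 1);
    }
    return crc;
}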

Bitwise Comparison to establish Bit Shift

We are dealing with a SerDes system with a high throughput. To ensure word alignment, I need to estimate the number of bits we need to bitslip.
For example if the input is 14'b 11111110000000 but the decoded output is 14'b 01111111000000 then clearly we need to bitslip by 1 bit.
Due to the timing constraints on this we cannot afford the time to iteratively shift and compare.
Is there any simple (single-clock-cycle) bitwise comparison that can estimate the number of bits we need to slip, perhaps XOR etc.?
Thanks in advance.

Are there any good reasons to use bit shifting except for quick math?

I understand bitwise operations and how they might be useful for different purposes, e.g. permissions. However, I don't seem to understand what use the bit shift operators are. I understand how they work, but I can't think of any scenarios where I might want to use them unless I want to do some really quick multiplication or division. Are there any other reasons to use bit-shifting?
There are many reasons, here are some:
Let's say you represent a black-and-white image as a sequence of bits and you want to set a single pixel in this image generically. For example, your byte offset may be x>>3 and your bit offset may be x & 0x7, and you can set that bit by: byte = byte | (1 << (x & 0x7)); (a short sketch of this follows below).
Implementing data compression algorithms where you deal with variable-length bit sequences, e.g. Huffman coding.
You're interacting with some hardware, e.g. a serial communication device, and you need to read or set some control bits.
For those and other reasons most processors have bit shift and/or rotation instructions as well as other logic instructions (and/or/xor/not).
Historically multiplication and division were significantly slower as they are more complex operations and some CPUs didn't have those at all.
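Here is a minimal sketch of the black-and-white image example from the first point above (function and variable names are mine):

#include <stdint.h>
#include <stddef.h>

/* x is a linear pixel index into a packed 1-bit-per-pixel image. */
void set_pixel(uint8_t *image, size_t x)
{
    size_t byte_offset = x >> 3;         /* which byte: divide the index by 8 */
    unsigned bit_offset = x & 0x7;       /* which bit inside that byte        */
    image[byte_offset] |= (uint8_t)(1u << bit_offset);
}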
Also see here:
Have you ever had to use bit shifting in real projects?
As you indicate, a left shift is the same thing as a multiplication by two. At least it is when we're talking about unsigned quantities. The meaning of a "left shift" of a signed quantity is ... language dependent.
With modern compilers, there's really no difference between writing "i = x*2;" and "i = x << 1;" The compiler will generate the most efficient code. So in that sense there's no reason to prefer shift over multiply.
Some algorithms work by shifting a quantity left by one bit and then setting the low bit to either 0 or 1. Some simple compression algorithms work this way. For example, if your accumulated value is in the variable x, and the current value (0 or 1) is in y, then it makes more sense to write "x = (x << 1) | y", rather than "x = (x * 2) + y". Both do the same thing, but the first is more notationally correct. You don't have to think, "oh, right, multiply by two is the same as a left shift."
Also, when you're talking about algorithms that shift bits, it's more convenient to shift left or right by a particular number of bits than to figure out what multiple of 2 you want to multiply or divide by.
So, whereas there's typically no performance benefit to shifting rather than multiplying--at least not when working with high level languages--there are times when having the ability to shift makes what you're doing more easily understood.
There are a lot of places where bit shift operations are regularly used outside of numerical computation. For example, a Bitboard is a data structure that is commonly used in board games for board representation. Some of the strongest chess engines use this data structure mainly for speed and ease of move generation and evaluation. These programs use bit operations heavily, and bit-shift operations specifically are used in a lot of contexts, such as building bit masks, generating new moves on the board, computing logarithms very quickly, etc. There are even very advanced numerical computations that can be done elegantly by clever use of bit operations. Check out this site for bit twiddling hacks; a lot of those algorithms use shift operators. Bit shift operations are regularly used in device driver programming, codec development, embedded systems programming and so on.
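For instance, here is an illustrative bitboard fragment (the square mapping and names are my own, not from any particular engine): with squares a1..h8 on bits 0..63, shifting the whole 64-bit board left by 8 moves every piece up one rank, so single pawn pushes for White are one shift and one mask.

#include <stdint.h>

typedef uint64_t bitboard;

/* Squares that White pawns can reach with a single push. */
bitboard white_pawn_single_pushes(bitboard pawns, bitboard empty)
{
    return (pawns << 8) & empty;   /* advance one rank, keep only empty destination squares */
}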
Shifting allows accessing specific bits within a variable. The expression (n >> p) & ((1 << m) - 1) retrieves an m-bit portion of the variable n with an offset of p bits from the right.
This allows your program to use integers that aren't multiples of 8 bits, which is useful for data compression.
For example, I used it in my Netflix Prize programs to pack records (22-bit user ID + 15-bit movie ID + 12-bit date + 3-bit rating) into a uint64_t (with 12 bits to spare).
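A sketch of that kind of packing (the field widths come from the description above; the field order and helper names are mine):

#include <stdint.h>

/* Pack a 22-bit user ID, 15-bit movie ID, 12-bit date and 3-bit rating into one word. */
static inline uint64_t pack_record(uint32_t user, uint32_t movie,
                                   uint32_t date, uint32_t rating)
{
    return ((uint64_t)user) |
           ((uint64_t)movie  << 22) |
           ((uint64_t)date   << (22 + 15)) |
           ((uint64_t)rating << (22 + 15 + 12));
}

/* Extract the movie field using the (n >> p) & ((1 << m) - 1) idiom. */
static inline uint32_t unpack_movie(uint64_t rec)
{
    return (uint32_t)((rec >> 22) & ((1u << 15) - 1));
}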
A very common special case is to pack 8 bool variables into each byte. (Unix file permissions, black-and-white bitmaps, CPU flags registers, etc.)
Also, bit manipulation is used in UTF-8, which is a very popular character encoding. Unicode characters are represented by distributing their bits across 1, 2, 3, or 4 bytes.
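As an illustration of that (my own sketch, assuming the input is a valid Unicode scalar value): shifts and masks spread the code point's bits across a 1- to 4-byte sequence.

#include <stdint.h>
#include <stddef.h>

/* Encode one code point as UTF-8; returns the number of bytes written. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {                        /* 7 bits  -> 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    } else if (cp < 0x800) {                /* 11 bits -> 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (uint8_t)(cp >> 6);
        out[1] = 0x80 | (uint8_t)(cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {              /* 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xE0 | (uint8_t)(cp >> 12);
        out[1] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);
        out[2] = 0x80 | (uint8_t)(cp & 0x3F);
        return 3;
    } else {                                /* 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xF0 | (uint8_t)(cp >> 18);
        out[1] = 0x80 | (uint8_t)((cp >> 12) & 0x3F);
        out[2] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);
        out[3] = 0x80 | (uint8_t)(cp & 0x3F);
        return 4;
    }
}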

Is a logical right shift by a power of 2 faster in AVR?

I would like to know if performing a logical right shift is faster when shifting by a power of 2.
For example, is
myUnsigned >> 4
any faster than
myUnsigned >> 3
I appreciate that everyone's first response will be to tell me that one shouldn't worry about tiny little things like this, it's using correct algorithms and collections to cut orders of magnitude that matters. I fully agree with you, but I am really trying to squeeze all I can out of an embedded chip (an ATMega328) - I just got a performance shift worthy of a 'woohoo!' by replacing a divide with a bit-shift, so I promise you that this does matter.
Let's look at the datasheet:
http://atmel.com/dyn/resources/prod_documents/8271S.pdf
As far as I can see, the ASR (arithmetic shift right) always shifts by one bit and cannot take the number of bits to shift; it takes one cycle to execute. Therefore, shifting right by n bits will take n cycles. Powers of two behave just the same as any other number.
In the AVR instruction set, arithmetic shift right and left happen one bit at a time. So, for this particular microcontroller, shifting >> n means the compiler actually emits n individual asr ops, and I guess >>3 is one instruction faster than >>4.
This makes the AVR fairly unusual, by the way.
You have to consult the documentation of your processor for this information. Even for a given instruction set, there may be different costs depending on the model. On a really small processor, shifting by one could conceivably be faster than by other values, for instance (it is the case for rotation instructions on some IA32 processors, but that's only because this instruction is so rarely produced by compilers).
According to http://atmel.com/dyn/resources/prod_documents/8271S.pdf all logical shifts are done in one cycle for the ATMega328. But of course, as pointed out in the comments, all logical shifts are by one bit. So the cost of a shift by n is n cycles in n instructions.
Indeed, ATMega doesn't have a barrel shifter, just like most (if not all) other 8-bit MCUs. Therefore it can only shift by 1 each time, instead of by any arbitrary value like more powerful CPUs can. As a result, shifting by 4 is theoretically slower than shifting by 3.
However, ATMega does have a swap-nibble instruction, so in fact x >> 4 is faster than x >> 3.
Assuming x is a uint8_t, x >>= 3 is implemented by 3 right shifts:
x >>= 1;
x >>= 1;
x >>= 1;
whereas x >>= 4 only needs a swap and a bit clear:
swap(x); // swap the top and bottom nibbles AB <-> BA
x &= 0x0f;
or
x &= 0xf0;
swap(x);
For bigger cross-register shifts there are also various ways to optimize them.
With a uint16_t variable y consisting of the low byte y0 and high byte y1, y >> 8 is simply
y0 = y1;
y1 = 0;
Similarly y >> 9 can be optimized to
y0 = y1 >> 1;
y1 = 0;
and hence is even faster than a shift by 3 on a char
In conclusion, the shift time varies depending on the shift distance, but it's not necessarily slower for longer or non-power-of-2 values. Generally it'll take at most 3 instructions to shift within an 8-bit char.
Here are some demos from Compiler Explorer.
A right shift by 4 is achieved by a swap and an AND, as above:
swap r24
andi r24,lo8(15)
A right shift by 3 has to be done with 3 instructions
lsr r24
lsr r24
lsr r24
Left shifts are also optimized in the same manner
See also Which is faster: x<<1 or x<<10?
It depends on how the processor is built. If the processor has a barrel-rotate it can shift any number of bits in one operation, but that takes chip space and power budget. The most economical hardware would just be able to rotate right by one, with options regarding the wrap-around bit. Next would be one that could rotate by one either left or right. I can imagine a structure that would have a 1-shifter, 2-shifter, 4-shifter, etc. in which case 4 might be faster than 3.
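A software model of that staged structure (my own sketch): each stage shifts by a fixed power of two and is enabled by one bit of the shift amount, which is how a barrel shifter makes every shift distance cost the same; in hardware each stage would be a row of multiplexers.

#include <stdint.h>

/* Logical right shift of an 8-bit value by n (0-7) using 1-, 2- and 4-shifter stages. */
uint8_t barrel_shift_right(uint8_t x, unsigned n)
{
    if (n & 1) x >>= 1;   /* 1-shifter stage */
    if (n & 2) x >>= 2;   /* 2-shifter stage */
    if (n & 4) x >>= 4;   /* 4-shifter stage */
    return x;
}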
Disassemble first, then time the code. Don't be discouraged by people telling you you are wasting your time. The knowledge you gain will put you in a position to be the go-to person for putting out the big company fires. The number of people with real behind-the-curtain knowledge is dropping at an alarming rate in this industry.
Sounds like others explained the real answer here, which disassembly would have shown: a single-bit shift instruction. So 4 shifts will take 133% of the time that 3 shifts took, or 3 shifts is 75% of the time of 4 shifts, depending on how you compare the numbers. And your measurements should reflect that difference; if they don't, I would continue with this experiment until you completely understand the execution times.
If your target processor has a bit-shift instruction (which is very likely), then it depends on the hardware implementation of that instruction whether there will be any difference between shifting by a power-of-2 number of bits or shifting by some other number. However, it is unlikely to make a difference.
With all respect, you should not even start talking about performance until you start measuring. Compile your program with division. Run. Measure the time. Repeat with the shift.
replacing a divide with a bit-shift
This is not the same for negative numbers:
char div2 (void)
{
return (-1) / 2;
// ldi r24,0
}
char asr1 (void)
{
return (-1) >> 1;
// ldi r24,-1
}