Is a logical right shift by a power of 2 faster in AVR? - c++

I would like to know if performing a logical right shift is faster when shifting by a power of 2
For example, is
myUnsigned >> 4
any faster than
myUnsigned >> 3
I appreciate that everyone's first response will be to tell me that one shouldn't worry about tiny things like this; it's choosing the right algorithms and collections, which cut orders of magnitude, that matters. I fully agree with you, but I am really trying to squeeze all I can out of an embedded chip (an ATMega328) - I just got a performance boost worthy of a 'woohoo!' by replacing a divide with a bit-shift, so I promise you that this does matter.

Let's look at the datasheet:
http://atmel.com/dyn/resources/prod_documents/8271S.pdf
As far as I can see, the ASR (arithmetic shift right) always shifts by one bit and cannot take the number of bits to shift; it takes one cycle to execute. Therefore, shifting right by n bits will take n cycles. Powers of two behave just the same as any other number.

In the AVR instruction set, arithmetic shifts right and left happen one bit at a time. So, for this particular microcontroller, shifting >> n means the compiler actually emits n individual asr instructions, and I guess >> 3 is one instruction faster than >> 4.
This makes the AVR fairly unusual, by the way.

You have to consult the documentation of your processor for this information. Even for a given instruction set, there may be different costs depending on the model. On a really small processor, shifting by one could conceivably be faster than by other values, for instance (it is the case for rotation instructions on some IA32 processors, but that's only because this instruction is so rarely produced by compilers).
According to http://atmel.com/dyn/resources/prod_documents/8271S.pdf all logical shifts are done in one cycle for the ATMega328. But of course, as pointed out in the comments, all logical shifts are by one bit. So the cost of a shift by n is n cycles in n instructions.

Indeed, the ATMega doesn't have a barrel shifter, just like most (if not all) other 8-bit MCUs. Therefore it can only shift by 1 each time, instead of by any arbitrary value like more powerful CPUs. As a result, shifting by 4 is theoretically slower than shifting by 3.
However, the ATMega does have a swap-nibble instruction, so in fact x >> 4 is faster than x >> 3.
Assuming x is a uint8_t, x >>= 3 is implemented by 3 single-bit right shifts:
x >>= 1;
x >>= 1;
x >>= 1;
whereas x >>= 4 only needs a swap and a bit clear
swap(x); // swap the top and bottom nibbles AB <-> BA
x &= 0x0f;
or
x &= 0xf0;
swap(x);
For bigger cross-register shifts there are also various ways to optimize them.
With a uint16_t variable y consisting of the low byte y0 and high byte y1, y >> 8 is simply
y0 = y1;
y1 = 0;
Similarly y >> 9 can be optimized to
y0 = y1 >> 1;
y1 = 0;
and hence is even faster than a shift by 3 on a char
In conclusion, the shift time varies depending on the shift distance, but it's not necessarily slower for larger or non-power-of-2 distances. Generally it takes at most 3 instructions to shift within an 8-bit char.
Here are some demos from compiler explorer
A right shift by 4 is achieved by a swap and an and like above
swap r24
andi r24,lo8(15)
A right shift by 3 has to be done with 3 instructions
lsr r24
lsr r24
lsr r24
Left shifts are also optimized in the same manner
See also Which is faster: x<<1 or x<<10?

It depends on how the processor is built. If the processor has a barrel shifter it can shift any number of bits in one operation, but that takes chip space and power budget. The most economical hardware would just be able to rotate right by one, with options regarding the wrap-around bit. Next would be one that could rotate by one either left or right. I can imagine a structure that would have a 1-shifter, 2-shifter, 4-shifter, etc., in which case 4 might be faster than 3.

Disassemble first, then time the code. Don't be discouraged by people telling you that you are wasting your time. The knowledge you gain will put you in a position to be the go-to person for putting out the big company fires. The number of people with real behind-the-curtain knowledge is dropping at an alarming rate in this industry.
Sounds like others explained the real answer here, which disassembly would have shown: a single-bit shift instruction. So 4 shifts will take 133% of the time that 3 shifts took, or 3 shifts take 75% of the time of 4 shifts, depending on how you compare the numbers. And your measurements should reflect that difference; if they don't, I would continue with this experiment until you completely understand the execution times.

If your target processor has a bit-shift instruction (which is very likely), then it depends on the hardware implementation of that instruction whether there will be any difference between shifting by a power-of-2 number of bits and shifting by some other number. However, it is unlikely to make a difference.

With all respect, you should not even start talking about performance until you start measuring. Compile your program with division. Run. Measure time. Repeat with shift.

replacing a divide with a bit-shift
This is not the same for negative numbers:
char div2 (void)
{
return (-1) / 2;
// ldi r24,0
}
char asr1 (void)
{
return (-1) >> 1;
// ldi r24,-1
}

Related

Replace right shift by multiplication

I know that it is possible to use the left shift to implement multiplication by the power of two (x << 4 = x * 16).
Also, it is trivial to replace the right shift by division by a power of two (x >> 5 = x / 32).
I am wondering: is it possible to replace the right shift with multiplication?
It seems to be not possible in the general case, but my question is limited to modulo 2^32 and 2^64 arithmetic (unsigned 32-bit and 64-bit values). Also, maybe it can be done if we can add other cheap instructions like + and - in addition to * to emulate the right bit shift?
I assume exotic architecture where the right shift is more expensive than other arithmetic (similar to division).
uint64_t foo(uint64_t x) {
return x >> 3; // how to avoid using right shift here?
}
There is a similar question How to perform right shifting binary multiplication? that asks how to replace multiplication of two unsigned numbers by right shift. Basically, it uses a loop internally. However, maybe if the second number is a constant, this loop can be avoided (or at least unrolled to a shorter fragment)?
"Multiply-high" aka high-mul, hmul, mulh, etc, can be used to emulate a shift-right with a constant count. Usually that's not a good trade. It's also hardly related to C++.
Normal multiplication (putting floating point stuff aside) cannot be used to implement a shift-right.
my question is limited to modulo 2^32 and 2^64 arithmetic
It doesn't help. You can use that property to "unmultiply" (sort of like divide, except not really) by odd numbers; for example, if b = 5 * a then a = b * 0xCCCCCCCD, using the modular multiplicative inverse. The number being inverted must be relatively prime to the modulus. Since the modulus is a power of two, the "divisor" here cannot be a power of two (except 1, but that does nothing), so a shift-right cannot be done this way.
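As a small illustration of that inverse trick (my own sketch, not from the original answer), multiplying by 5 and then by 0xCCCCCCCD round-trips a 32-bit value:
#include <cstdint>
#include <cassert>

int main() {
    uint32_t a = 12345u;
    uint32_t b = 5u * a;            // forward: multiply by 5 (mod 2^32)
    uint32_t r = b * 0xCCCCCCCDu;   // backward: multiply by the modular inverse of 5
    assert(r == a);                 // recovers a - but no such inverse exists for even "divisors"
    return 0;
}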
Another way to look at it (probably simpler) is that what a multiplication does is conditionally add together a bunch of left-shifted versions of the multiplicand. Only left-shifted versions, not right-shifted versions. Which of those shifted versions are selected by the multiplier doesn't matter; there are no right-shifted versions to select.

Is it faster to multiply low numbers in C/C++ (as opposed to high numbers)?

Example of question:
Is calculating 123 * 456 faster than calculating 123456 * 7890? Or is it the same speed?
I'm wondering about 32 bit unsigned integers, but I won't ignore answers about other types (64 bit, signed, float, etc.). If it is different, what is the difference due to? Whether or not the bits are 0/1?
Edit: If it makes a difference, I should clarify that I'm referring to any number (two random numbers lower than 100 vs two random numbers higher than 1000)
For builtin types up to at least the architecture's word size (e.g. 64 bit on a modern PC, 32 or 16 bit on most low-cost general-purpose CPUs from the last couple of decades), for every compiler/implementation/version and CPU I've ever heard of, the CPU opcode for multiplication of a particular integral size takes a certain number of clock cycles irrespective of the quantities involved. Multiplication of data with different sizes performs differently on some CPUs (e.g. the AMD K7 has 3 cycles latency for 16-bit IMUL vs 4 for 32-bit).
It is possible that on some architecture and compiler/flags combination, a type like long long int has more bits than the CPU opcodes can operate on in one instruction, so the compiler may emit code to do the multiplication in stages and that will be slower than multiplication of CPU-supported types. But again, a small value stored at run-time in a wider type is unlikely to be treated - or perform - any differently than a larger value.
All that said, if one or both values are compile-time constants, the compiler is able to avoid the CPU multiplication instruction and optimise to addition or bit-shifting operations for certain values (e.g. 1 is obviously a no-op, 0 on either side gives a 0 result, * 4 can sometimes be implemented as << 2). There's nothing in particular stopping techniques like bit shifting being used for larger numbers, but a smaller percentage of such numbers can be optimised to the same degree (e.g. there are more powers of two - for which multiplication can be performed using a left shift - between 0 and 1000 than between 1000 and 2000).
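A minimal illustration of that strength reduction (my own example; the exact output depends on compiler, flags and target):
unsigned times16(unsigned x) { return x * 16; }  // typically compiled to a single left shift (x << 4)
unsigned times10(unsigned x) { return x * 10; }  // often a shift/add combination (or an LEA on x86)
unsigned times1(unsigned x)  { return x * 1; }   // folded away entirely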
This is highly dependent on the processor architecture and model.
In the old days (ca 1980-1990), the number of ones in the two numbers would be a factor - the more ones, the longer it took to multiply [after sign adjustment, so multiplying by -1 wasn't slower than multiplying by 1, but multiplying by 32767 (15 ones) was notably slower than multiplying by 17 (2 ones)]. That's because a multiply is essentially:
unsigned int multiply(unsigned int a, unsigned int b)
{
    unsigned int res = 0;
    for (int i = 0; i < 32; i++)   // one iteration per bit of b
    {
        if (b & 1)
        {
            res += a;
        }
        a <<= 1;
        b >>= 1;
    }
    return res;
}
In modern processors, multiply is quite fast either way, but a 64-bit multiply can be a clock cycle or two slower than a 32-bit one. Simply because modern processors can "afford" to put down the whole logic for doing this in a single cycle - both when it comes to the speed of the transistors themselves and the area that those transistors take up.
Further, in the old days, there were often instructions to do 16 x 16 -> 32 bit results, but if you wanted 32 x 32 -> 32 (or 64), the compiler would have to call a library function [or inline such a function]. Today, I'm not aware of any modern high-end processor [x86, ARM, PowerPC] that can't do at least 64 x 64 -> 64, and some do 64 x 64 -> 128, all in a single instruction (not always a single cycle though).
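For instance, with GCC/Clang a 64 x 64 -> 128 multiply can be expressed through the non-standard __int128 extension and the high half extracted with a shift - a sketch, assuming that extension is available:
#include <cstdint>

uint64_t mulhi64(uint64_t a, uint64_t b) {
    unsigned __int128 p = (unsigned __int128)a * b;  // full 128-bit product (GCC/Clang extension)
    return (uint64_t)(p >> 64);                      // upper 64 bits, i.e. "multiply-high"
}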
Note that I'm completely ignoring the fact that whether the data is in cache is an important factor. Yes, that is a factor - and it's a bit like ignoring wind resistance when traveling at 200 km/h - it's not at all something you ignore in the real world. However, it is quite unimportant for THIS discussion. Just like people making sports cars care about aerodynamics, getting complex [or simple] software to run fast involves a certain amount of caring about the cache contents.
For all intents and purposes, the same speed (even if there were differences in computation speed, they would be immeasurable). Here is a reference benchmarking different CPU operations if you're curious: http://www.agner.org/optimize/instruction_tables.pdf.

Convert each bit in byte to first bit of each nibble in 32 bit int

I have a byte b. I am looking for the most efficient bit manipulation to
convert each bit in b to the first bit of each nibble in a 32 bit int x.
For example, if b = 01010111, then x = 0x10101111
I know I can do a brute force approach:
x = (b&1) | (((b>>1)&1)<<4) | ......
Edit: this is for an OpenCL kernel for a GPU
PDEP
As user harold mentioned in the comments, PDEP is the instruction that just does exactly what you want - but it's only available on x86 (as far as I know), and it has terrible[1] performance on the newest AMD chips.
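For what it's worth, here is what that looks like with the BMI2 intrinsic (a sketch; the 0x11111111 mask matches the brute-force expression in the question, i.e. bit i of b lands in the low bit of nibble i - use 0x88888888 instead if the target is the top bit of each nibble):
#include <cstdint>
#include <immintrin.h>   // _pdep_u32, requires BMI2 (compile with -mbmi2)

uint32_t spread_pdep(uint8_t b) {
    return _pdep_u32(b, 0x11111111u);   // deposit the 8 bits of b into every 4th bit position
}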
LUT
Barring that, a lookup table of 256 x 4-byte entries seems reasonable - at the cost of 1K of pressure on your cache subsystem. You'll find a lot of smart people advocate against LUTs due to the hidden cost of cache misses - but if this particular operation is in fact "hot" then it may turn out to be the fastest even when factoring in any additional misses.
As with any LUT solution, you should be especially careful to benchmark it not only with micro-benchmarks, but in the full application to evaluate the effect of memory pressure.
You could also consider a compromise split-LUT solution that uses one or two 16-entry LUTs for each nibble of the byte, where the result is calculated something like:
int32 x = high_lut[(b & 0xF0) >> 4] | low_lut[b & 0xF]
This cuts the size of the LUTs down by a factor of between ~11 and 32[2], since we have far fewer entries and some entries can be 2 bytes rather than 4 bytes.
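A rough sketch of that split-LUT variant (my own code, assuming the same bit-i-to-low-bit-of-nibble-i mapping as the question's brute-force expression; in real code the tables would be precomputed constants rather than built at run time):
#include <cstdint>

static uint32_t spread4(uint8_t nib) {                // spread 4 bits into the low bit of 4 nibbles
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i)
        r |= (uint32_t)((nib >> i) & 1) << (4 * i);
    return r;
}

uint32_t spread_lut(uint8_t b) {
    static uint32_t low_lut[16], high_lut[16];
    static bool init = false;
    if (!init) {
        for (int i = 0; i < 16; ++i) {
            low_lut[i]  = spread4((uint8_t)i);        // bits 0..3 -> nibbles 0..3
            high_lut[i] = spread4((uint8_t)i) << 16;  // bits 4..7 -> nibbles 4..7
        }
        init = true;
    }
    return high_lut[(b & 0xF0) >> 4] | low_lut[b & 0x0F];
}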
Bit Manipulation
If you really want a bit-manipulation solution, to impress your in-laws or something, you can try something like the following:
Split the byte into nibbles and use multiplication by 0x00001111 (low nibble) and 0x01111000 (high nibble) to splat the low (resp. high) nibble into the low (resp high) half of the 4-byte word, and combine the results with or or add. So if your byte had bits abcd efgh you'll have a word like abcd abcd abcd abcd efgh efgh efgh efgh.
AND this result with a mask that picks out the bit that belongs in each nibble (although it usually won't be in the right place). The mask is something like 0x84218421 and the result (in binary) will be something like a000 0b00 00c0 000d e000 0f00 00g0 000h.
Now move the 6 out of 8 bits that aren't in the high bit to the right position using the carry behavior of subtraction, something like: ((x | 0x08880888) - 0x01110111) ^ 0x08880888.
The basic idea in the last step is that you set the high bit of each nibble, and subtract 1 from the nibble. So for example, you have the 0b00 nibble, which becomes 1b00 - 1 - the subtraction carries through all the zeros, and stops at the first one, which is either the high bit (b is zero) or b if it is one. So you effectively set the high bit based on the value of the selected bit. Note that you don't need to do this for a or e since they are already in the right place.
The final xor is needed because the above actually sets the high bit to the opposite value as the selected bit, so we need to flip it.
I didn't try it out, so there are no doubt bugs, but the basic idea should be sound. There are probably various ways to optimize it further, but it's not too bad as is: a couple of multiplications and perhaps a half-dozen bit operations. On platforms with slow multiplication you can probably find another approach for the first step that uses only 1 multiplication combined with a few more primitive operations, or zero at the cost of several more operations.
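Here is one way the idea could be written out (my sketch, not the answer author's tested code). Working through the steps, a couple of constants came out slightly differently from the prose: 0x11110000 for the high-nibble multiplier, and a final AND with 0x88888888 rather than the XOR, since the subtraction already leaves the selected bit (not its complement) in the top position. The result has each source bit in the top bit of its nibble; shift right by 3 if you want it in the bottom bit instead:
#include <cstdint>

uint32_t spread_mul(uint8_t b) {
    uint32_t x = (uint32_t)(b >> 4)   * 0x11110000u   // abcd abcd abcd abcd ....
               | (uint32_t)(b & 0x0F) * 0x00001111u;  // .... efgh efgh efgh efgh
    x &= 0x84218421u;        // a000 0b00 00c0 000d e000 0f00 00g0 000h
    x |= 0x08880888u;        // set bit 3 of the six nibbles whose bit is out of place
    x -= 0x01110111u;        // the borrow stops at the selected bit, copying it into bit 3
    return x & 0x88888888u;  // keep only the top bit of every nibble
}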
[1] Fully 18x worse throughput than Intel - evidently AMD opted not to implement the circuit to do PDEP in hardware and instead implement it via a series of more elementary operations.
[2] The largest reduction is if you share a single 16-entry LUT for both the high and low nibble, although this requires an additional shift for the result of the high-nibble lookup. The smaller reduction, shown in the example, uses two 16-entry LUTs: one 4-byte one for the high nibble, and a 2-byte one for the low nibble, and avoids the shift.

Which is better option to use for dividing an integer number by 2?

Which of the following techniques is the best option for dividing an integer by 2 and why?
Technique 1:
x = x >> 1;
Technique 2:
x = x / 2;
Here x is an integer.
Use the operation that best describes what you are trying to do.
If you are treating the number as a sequence of bits, use bitshift.
If you are treating it as a numerical value, use division.
Note that they are not exactly equivalent. They can give different results for negative integers. For example:
-5 / 2 = -2
-5 >> 1 = -3
(ideone)
Does the first one look like dividing? No. If you want to divide, use x / 2. The compiler can optimise it to use a bit-shift if possible (it's called strength reduction), which makes it a useless micro-optimisation if you do it on your own.
To pile on: there are so many reasons to favor using x = x / 2; Here are some:
it expresses your intent more clearly (assuming you're not dealing with bit twiddling register bits or something)
the compiler will reduce this to a shift operation anyway
even if the compiler didn't reduce it and chose a slower operation than the shift, the likelihood that this ends up affecting your program's performance in a measurable way is itself vanishingly small (and if it does affect it measurably, then you have an actual reason to use a shift)
if the division is going to be part of a larger expression, you're more likely to get the precedence right if you use the division operator:
x = x / 2 + 5;
x = x >> 1 + 5; // not the same as above: + binds tighter than >>, so this shifts by 6
signed arithmetic might complicate things even more than the precedence problem mentioned above
to reiterate - the compiler will already do this for you anyway. In fact, it'll convert division by a constant to a series of shifts, adds, and multiplies for all sorts of numbers, not just powers of two. See this question for links to even more information about this.
In short, you buy nothing by coding a shift when you really mean to multiply or divide, except maybe an increased possibility of introducing a bug. It's been a lifetime since compilers weren't smart enough to optimize this kind of thing to a shift when appropriate.
Which one is the best option and why for dividing the integer number by 2?
Depends on what you mean by best.
If you want your colleagues to hate you, or to make your code hard to read, I'd definitely go with the first option.
If you want to divide a number by 2, go with the second one.
The two are not equivalent; they don't behave the same if the number is negative or inside larger expressions - bit-shift has lower precedence than + or -, while division has higher precedence.
You should write your code to express what its intent is. If performance is your concern, don't worry, the optimizer does a good job at these sort of micro-optimizations.
Just use divide (/), presuming it is clearer. The compiler will optimize accordingly.
I agree with other answers that you should favor x / 2 because its intent is clearer, and the compiler should optimize it for you.
However, another reason for preferring x / 2 over x >> 1 is that the behavior of >> is implementation-dependent if x is a signed int and is negative.
From section 6.5.7, bullet 5 of the ISO C99 standard:
The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 / 2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
x / 2 is clearer, and x >> 1 is not much faster (according to a micro-benchmark, about 30% faster for a Java JVM). As others have noted, for negative numbers the rounding is slightly different, so you have to consider this when you want to process negative numbers. Some compilers may automatically convert x / 2 to x >> 1 if they know the number cannot be negative (even though I could not verify this).
Even x / 2 may not use the (slow) division CPU instruction, because some shortcuts are possible, but it is still slower than x >> 1.
(This is a C/C++ question; other programming languages have more operators. For Java there is also the unsigned right shift, x >>> 1, which is again different. It allows the mean (average) of two values to be calculated correctly, so that (a + b) >>> 1 will return the mean value even for very large values of a and b. This is required, for example, for binary search if the array indices can get very large. There was a bug in many versions of binary search because they used (a + b) / 2 to calculate the average; this doesn't work correctly when a + b overflows. The correct solution is to use (a + b) >>> 1 instead.)
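The equivalent fix in C++ terms, where there is no >>> operator, is usually written so the overflowing sum is never formed (a small sketch):
#include <cstddef>

std::size_t midpoint(std::size_t lo, std::size_t hi) {
    return lo + (hi - lo) / 2;   // assumes lo <= hi; never computes lo + hi, so it cannot wrap around
}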
Knuth said:
Premature optimization is the root of all evil.
So I suggest using x /= 2;
This way the code is easy to understand, and I also think that the optimization of this operation in that form doesn't make a big difference for the processor.
Take a look at the compiler output to help you decide. I ran this test on x86-64 with
gcc (GCC) 4.2.1 20070719 [FreeBSD]
Also see compiler outputs online at godbolt.
What you see is the compiler does use a sarl (arithmetic right-shift) instruction in both cases, so it does recognize the similarity between the two expressions. If you use the divide, the compiler also needs to adjust for negative numbers. To do that it shifts the sign bit down to the lowest order bit, and adds that to the result. This fixes the off-by-one issue when shifting negative numbers, compared to what a divide would do.
Since the divide case does 2 shifts, while the explicit shift case only does one, we can now explain some of the performance differences measured by other answers here.
C code with assembly output:
For divide, your input would be
int div2signed(int a) {
return a / 2;
}
and this compiles to
movl %edi, %eax
shrl $31, %eax # (unsigned)x >> 31
addl %edi, %eax # tmp = x + (x<0)
sarl %eax # (x + 0 or 1) >> 1 arithmetic right shift
ret
similarly for shift
int shr2signed(int a) {
return a >> 1;
}
with output:
sarl %edi
movl %edi, %eax
ret
Other ISAs can do this about as efficiently, if not moreso. For example GCC for AArch64 uses:
add w0, w0, w0, lsr 31 // x += (unsigned)x>>31
asr w0, w0, 1 // x >>= 1
ret
Just an added note -
x *= 0.5 will often be faster in some VM-based languages -- notably ActionScript, as the variable won't have to be checked for divide by 0.
Use x = x / 2; or x /= 2;, because it is possible that a new programmer will work on it in the future, and it will be easier for them to find out what is going on in that line of code. Not everyone may be aware of such optimizations.
I am speaking from the perspective of programming competitions. They generally have very large inputs, where division by 2 takes place many times and it is known whether the input is positive or negative.
x >> 1 will be better than x / 2. I checked on ideone.com by running a program where more than 10^10 division-by-2 operations took place. x / 2 took nearly 5.5 s whereas x >> 1 took nearly 2.6 s for the same program.
I would say there are several things to consider.
Bitshift should be faster, as no special computation is really needed to shift the bits; however, as pointed out, there are potential issues with negative numbers. If you are ensured to have positive numbers and are looking for speed, then I would recommend bitshift.
The division operator is very easy for humans to read. So if you are looking for code readability, you could use this. Note that the field of compiler optimization has come a long way, so making code easy to read and understand is good practice.
Depending on the underlying hardware, operations may have different speeds. Amdahl's law is to make the common case fast. So you may have hardware that can perform different operations faster than others. For example, multiplying by 0.5 may be faster than dividing by 2. (Granted, you may need to take the floor of the multiplication if you wish to enforce integer division.)
If you are after pure performance, I would recommend creating some tests that could do the operations millions of times. Sample the execution several times (your sample size) to determine which one is statistically best with your OS/Hardware/Compiler/Code.
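For example, a crude harness along those lines might look like this (my sketch - note that an optimizing compiler can strength-reduce or remove either loop, so inspect the generated code and repeat the runs before trusting the numbers):
#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t N = 100000000;
    volatile int32_t sink = 0;   // volatile so the loops are not optimized away entirely

    auto t0 = std::chrono::steady_clock::now();
    for (int64_t i = 0; i < N; ++i) sink = (int32_t)i / 2;
    auto t1 = std::chrono::steady_clock::now();
    for (int64_t i = 0; i < N; ++i) sink = (int32_t)i >> 1;
    auto t2 = std::chrono::steady_clock::now();

    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    std::printf("x / 2 : %lld us\n", (long long)us(t0, t1));
    std::printf("x >> 1: %lld us\n", (long long)us(t1, t2));
    return 0;
}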
As far as the CPU is concerned, bit-shift operations are faster than division operations. However, the compiler knows this and will optimize appropriately to the extent that it can, so you can code in the way that makes the most sense and rest easy knowing that your code is running efficiently. But remember that an unsigned int can (in some cases) be optimized better than an int, for the reasons previously pointed out. If you don't need signed arithmetic, then don't include the sign bit.
x = x / 2; is the suitable code to use, but which operation you choose depends on your own program and on the output you want to produce.
Make your intentions clearer...for example, if you want to divide, use x / 2, and let the compiler optimize it to shift operator (or anything else).
Today's processors won't let these optimizations have any impact on the performance of your programs.
The answer to this will depend on the environment you're working under.
If you're working on an 8-bit microcontroller or anything without hardware support for multiplication, bit shifting is expected and commonplace, and while the compiler will almost certainly turn x /= 2 into x >>= 1, the presence of a division symbol will raise more eyebrows in that environment than using a shift to effect a division.
If you're working in a performance-critical environment or section of code, or your code could be compiled with compiler optimization off, x >>= 1 with a comment explaining its reasoning is probably best just for clarity of purpose.
If you're not under one of the above conditions, make your code more readable by simply using x /= 2. Better to save the next programmer who happens to look at your code the 10 second double-take on your shift operation than to needlessly prove you knew the shift was more efficient sans compiler optimization.
All these assume unsigned integers. The simple shift is probably not what you want for signed. Also, DanielH brings up a good point about using x *= 0.5 for certain languages like ActionScript.
Take it mod 2 and test for equality with 1; I don't know the syntax in C, but this may be fastest.
Generally, the right shift divides:
q = i >> n; is the same as: q = i / 2**n; (i.e. i divided by 2 to the power n)
This is sometimes used to speed up programs at the cost of clarity. I don't think you should do it. The compiler is smart enough to perform the speedup automatically. This means that putting in a shift gains you nothing at the expense of clarity.
Take a look at this page from Practical C++ Programming.
Obviously, if you are writing your code for the next guy who reads it, go for the clarity of "x/2".
However, if speed is your goal, try it both ways and time the results. A few months ago I worked on a bitmap convolution routine which involved stepping through an array of integers and dividing each element by 2. I did all kinds of things to optimize it including the old trick of substituting "x>>1" for "x/2".
When I actually timed both ways, I discovered to my surprise that x/2 was faster than x>>1.
This was using Microsoft VS2008 C++ with the default optimizations turned on.
In terms of performance, the CPU's shift operations are significantly faster than divide opcodes.
So dividing by two, multiplying by 2, etc. all benefit from shift operations.
As to the look and feel: as engineers, when did we become so attached to cosmetics that even beautiful ladies don't use! :)
x / y is the correct one... and >> is the shifting operator. If we want to divide an integer we can use the division operator (/); the shift operator is used to shift the bits.
x = x / 2;
x /= 2; // we can use it like this

Are there any good reasons to use bit shifting except for quick math?

I understand bitwise operations and how they might be useful for different purposes, e.g. permissions. However, I don't seem to understand what use the bit shift operators are. I understand how they work, but I can't think of any scenarios where I might want to use them unless I want to do some really quick multiplication or division. Are there any other reasons to use bit-shifting?
There are many reasons, here are some:
Let's say you represent a black and white image as a sequence of bits and you want to set a single pixel in this image generically. For example your byte offset may be x>>3 and your bit offset may be x & 0x7 and you can set that bit by: byte = byte | (1 << (x & 0x7));
Implementing data compression algorithms where you deal with variable length bit sequences, e.g. huffman coding.
You are interacting with some hardware, e.g. a serial communication device, and you need to read or set some control bits.
For those and other reasons most processors have bit shift and/or rotation instructions as well as other logic instructions (and/or/xor/not).
Historically multiplication and division were significantly slower as they are more complex operations and some CPUs didn't have those at all.
Also see here:
Have you ever had to use bit shifting in real projects?
As you indicate, a left shift is the same thing as a multiplication by two. At least it is when we're talking about unsigned quantities. The meaning of a "left shift" of a signed quantity is ... language dependent.
With modern compilers, there's really no difference between writing "i = x*2;" and "i = x << 1;" The compiler will generate the most efficient code. So in that sense there's no reason to prefer shift over multiply.
Some algorithms work by shifting a quantity left by one bit and then setting the low bit to either 0 or 1. Some simple compression algorithms work this way. For example, if your accumulated value is in the variable x, and the current value (0 or 1) is in y, then it makes more sense to write "x = (x << 1) | y", rather than "x = (x * 2) + y". Both do the same thing, but the first is more notationally correct. You don't have to think, "oh, right, multiply by two is the same as a left shift."
Also, when you're talking about algorithms that shift bits, it's more convenient to shift left or right by a particular number of bits than to figure out what multiple of 2 you want to multiply or divide by.
So, whereas there's typically no performance benefit to shifting rather than multiplying--at least not when working with high level languages--there are times when having the ability to shift makes what you're doing more easily understood.
There are lot of places where bit shift operations are regularly used outside of their usage in numerical computations. For example, Bitboard is a data structure that is commonly used in board games for board representation. Some of the strongest chess engines use this data structure mainly for speed and ease of move generation and evaluation. These programs use bit operations heavily and bit-shift operations specifically are used in a lot of contexts - such as finding bit masks, generating new moves on the board, computing logarithm very quickly, etc. There are even very advanced numerical computations that can be done elegantly by clever use of bit operations. Check out this site for bit twiddling hacks - a lot of those algorithms use shift operators. Bit shift operations are regularly used in device driver programming, codec development, embedded systems programming and so on.
Shifting allows accessing specific bits within a variable. The expression (n >> p) & ((1 << m) - 1) retrieves an m-bit portion of the variable n with an offset of p bits from the right.
This allows your program to use integers that aren't multiples of 8 bits, which is useful for data compression.
For example, I used it in my Netflix Prize programs to pack records (22-bit user ID + 15-bit movie ID + 12-bit date + 3-bit rating) into a uint64_t (with 12 bits to spare).
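A sketch of that kind of packing (field widths taken from the description above; the field order and helper names are my own choices, not the original program's):
#include <cstdint>

uint64_t pack_record(uint32_t user, uint32_t movie, uint32_t date, uint32_t rating) {
    return  (uint64_t)(user   & 0x3FFFFFu) << 30   // 22-bit user ID in bits 30..51
          | (uint64_t)(movie  & 0x7FFFu)   << 15   // 15-bit movie ID in bits 15..29
          | (uint64_t)(date   & 0xFFFu)    << 3    // 12-bit date in bits 3..14
          | (uint64_t)(rating & 0x7u);             //  3-bit rating in bits 0..2
}

uint32_t unpack_movie(uint64_t rec) {
    return (uint32_t)((rec >> 15) & 0x7FFFu);      // the (n >> p) & ((1 << m) - 1) pattern from above
}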
A very common special case is to pack 8 bool variables into each byte. (Unix file permissions, black-and-white bitmaps, CPU flags registers, etc.)
Also, bit manipulation is used in UTF-8, which is a very popular character encoding. Unicode characters are represented by distributing their bits across 1, 2, 3, or 4 bytes.
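As a rough illustration of how those bits get distributed (my sketch, ignoring surrogate and range validation):
#include <cstdint>
#include <string>

std::string utf8_encode(uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                          // 1 byte:  0xxxxxxx
        out += (char)cp;
    } else if (cp < 0x800) {                  // 2 bytes: 110xxxxx 10xxxxxx
        out += (char)(0xC0 | (cp >> 6));
        out += (char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += (char)(0xE0 | (cp >> 12));
        out += (char)(0x80 | ((cp >> 6) & 0x3F));
        out += (char)(0x80 | (cp & 0x3F));
    } else {                                  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += (char)(0xF0 | (cp >> 18));
        out += (char)(0x80 | ((cp >> 12) & 0x3F));
        out += (char)(0x80 | ((cp >> 6) & 0x3F));
        out += (char)(0x80 | (cp & 0x3F));
    }
    return out;
}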