Can ~3 safely be widened automatically? - c++

While answering another question, I ended up trying to justify casting the operand to the ~ operator, but I was unable to come up with a scenario where not casting it would yield wrong results.
I am asking this clarification question in order to be able to clean up that other question, removing the red herrings and keeping only the most relevant information intact.
The problem in question is that we want to clear the two lowermost bits of a variable:
offset = offset & ~3;
This looks dangerous, because ~3 will be an int no matter what offset is, so we might end up masking the bits that do not fit into int's width. For example if int is 32 bits wide and offset is of a 64 bit wide type, one could imagine that this operation would lose the 32 most significant bits of offset.
However, in practice this danger does not seem to manifest itself. Instead, the result of ~3 is sign-extended to fill the width of offset, even when offset is unsigned.
Is this behavior mandated by the standard? I am asking because it seems that this behavior could rely on specific implementation and/or hardware details, but I want to be able to recommend code that is correct according to the language standard.
I can make the operation produce an undesired result if I try to clear the 32nd least significant bit. This is because the result of ~(1 << 31) will be positive in a 32-bit signed integer in two's complement representation (and indeed in a one's complement representation), so sign-extending the result will leave all the higher bits unset.
offset = offset & ~(1 << 31); // BZZT! Fragile!
In this case, if int is 32 bits wide and offset is of a wider type, this operation will clear all the high bits.
However, the proposed solution in the other question does not seem to resolve this problem!
offset = offset & ~static_cast<decltype(offset)>(1 << 31); // BZZT! Fragile!
It seems that 1 << 31 will be sign-extended before the cast, so regardless of whether decltype(offset) is signed or unsigned, the result of this cast will have all the higher bits set, such that the operation again will clear all those bits.
In order to fix this, I need to make the number unsigned before widening, either by making the integer literal unsigned (1u << 31 seems to work) or casting it to unsigned int:
offset = offset &
~static_cast<decltype(offset)>(
static_cast<unsigned int>(
1 << 31
)
);
// Now it finally looks like C++!
This change makes the original danger relevant. When the bitmask is unsigned, the inverted bitmask will be widened by setting all the higher bits to zero, so it is important to have the correct width before inverting.
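To make the "correct width before inverting" point concrete, here is a minimal sketch. The helper names clear_bit31 and clear_bit31_narrow are made up for illustration, and the second one assumes a 32-bit unsigned int.

```cpp
#include <cstdint>

// Hypothetical helper: clear bit 31 of a 64-bit offset.
// The mask is built at the full 64-bit width *before* it is inverted,
// so the inversion sets bits 32..63 instead of leaving them zero.
constexpr std::uint64_t clear_bit31(std::uint64_t offset) {
    return offset & ~(std::uint64_t{1} << 31);
}

// For contrast: invert at unsigned int width first, then widen.
// Assuming a 32-bit unsigned int, ~(1u << 31) is 0x7FFFFFFF, which
// zero-extends to 0x000000007FFFFFFF and wipes out bits 32..63 as well.
constexpr std::uint64_t clear_bit31_narrow(std::uint64_t offset) {
    return offset & ~(1u << 31);
}
```

With an all-ones input, the first helper should leave every bit except bit 31 set, while the second keeps only bits 0 to 30.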
This leads me to conclude that there are two ways to recommend clearing some bits:
1: offset = offset & ~3;
Advantages: Short, easily readable code.
Disadvantages: None that I know of. But is the behavior guaranteed by the standard?
2: offset = offset & ~static_cast<decltype(offset)>(3u);
Advantages: I understand how all elements of this code work, and I am fairly confident that its behavior is guaranteed by the standard.
Disadvantages: It doesn't exactly roll off the tongue.
Can you guys help me clarify if the behavior of option 1 is guaranteed or if I have to resort to recommending option 2?

It is not valid in sign-magnitude representation. In that representation with 32-bit ints, ~3 is -0x7FFFFFFC. When this is widened to 64-bit (signed) the value is retained, -0x7FFFFFFC. So we would not say that sign-extension happens in that system; and you will incorrectly mask off all the bits 32 and higher.
In two's complement, I think offset &= ~3 always works. ~3 is -4, so whether or not the 64-bit type is signed, you still get a mask with only the bottom 2 bits unset.
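As a rough check of the two's complement case, the two options from the question can be written side by side. The function names are made up for this sketch, and the first one assumes a two's complement int, where ~3 equals -4.

```cpp
#include <cstdint>

// Option 1 from the question: the mask ~3 has type int.
// On a two's complement platform ~3 is -4; converting -4 to a 64-bit
// unsigned type yields 0xFFFFFFFFFFFFFFFC, so only bits 0 and 1 are cleared.
constexpr std::uint64_t clear_low2_int_mask(std::uint64_t offset) {
    return offset & ~3;
}

// Option 2: widen first, then invert; no reliance on how negative
// numbers are represented.
constexpr std::uint64_t clear_low2_wide_mask(std::uint64_t offset) {
    return offset & ~static_cast<std::uint64_t>(3u);
}
```

Under that assumption both helpers should agree for every input; the difference is only in what they depend on.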
However, personally I'd try to avoid writing it, as then when checking over my code for bugs later I'd have to go through all this discussion again (and what hope does a more casual coder have of understanding the intricacies here?). I only do bitwise operations on unsigned types, to avoid all of this.

Related

Bit shifting with leading 1

When I use the >> bitwise operator on 1000 in c++ it gives this result: 1100. I want the result to be 0100. When the 1 is in any other position this is exactly what happens, but with a leading 1 it goes wrong. Why is that and how can it be avoided?
The behavior you describe is coherent with what happens on some platforms when right-shifting a signed integer with the high bit set (so, negative values).
In this case, on many platforms compilers will emit code to perform an arithmetic shift, which propagates the sign bit; on platforms with 2's complement representation for negative integers (virtually every current platform) this has the effect of giving the "x >> i = floor(x / 2^i)" behavior even on negative values. Notice that this is not contractual: as far as the C++ standard is concerned, right-shifting negative integers is implementation-defined behavior, so any compiler is free to implement different semantics for it.
To come to your question, to obtain the "regular" shift behavior (generally called "logical shift") you have to make sure to work on unsigned integers. This can be obtained either by making sure that the variable you are shifting is of unsigned type (e.g. unsigned int) or, if it's a literal, by adding a U suffix to it (e.g. 1 is an int, 1U is an unsigned int).
If the data you have is of a signed type (e.g. int) you may cast it to the corresponding unsigned type before shifting without risks (conversion from a signed int to an unsigned one is well-defined by the standard, and doesn't change the bit values on 2's complement machines).
Historically, this comes from the fact that C strove to support even machines that didn't have "cheap" arithmetic shift functionality at hardware level and/or didn't use 2's complement representation.
As mentioned by others, when right shifting on a signed int, it is implementation defined whether you will get 1s or 0s. In your case, because the left most bit in 1000 is a 1, the "replacement bits" are also 1. Assuming you must work with signed ints, in order to get rid of it, you can apply a bitmask.
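The two suggestions in these answers (convert to unsigned before shifting, or shift and then mask) might look like this for an 8-bit value. This is only a sketch; the names logical_shr and masked_shr are invented here, and an 8-bit type stands in for the 4-bit pattern in the question.

```cpp
#include <cstdint>

// Logical right shift of an 8-bit signed value: convert to the unsigned
// type of the same width first, so the vacated high bits are filled with 0.
constexpr std::uint8_t logical_shr(std::int8_t x, unsigned n) {
    return static_cast<std::uint8_t>(static_cast<std::uint8_t>(x) >> n);
}

// Alternative that shifts the signed value and then masks off whatever
// the implementation put into the vacated high bits.
// Only meaningful for 0 <= n <= 7.
constexpr std::uint8_t masked_shr(std::int8_t x, unsigned n) {
    return static_cast<std::uint8_t>((x >> n) & (0xFF >> n));
}
```

For the bit pattern 1000'0000 shifted by one, both helpers should give 0100'0000 rather than 1100'0000.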

How to avoid integral promotion for bitwise operations

I'm floored that Visual Studio 2015 insists on promoting a WORD (unsigned short) to an unsigned int when only WORD values are involved in only bit manipulations (i.e. it promotes 16 bit to 32 bit when doing 16bit | 16bit).
e.g.
// where WORD is a 'unsigned short'
const WORD kFlag = 1;
WORD old = 2;
auto value = old | kFlag; // why the blazes is value an unsigned int (32 bits)
Moreover, is there a way to get x86 intrinsics for WORD|WORD? I surely do not want to pay for (16->32 | 16->32)->16. Nor does this code need to consume more than a couple of 16 bit registers, not a few 32 bit regs.
But the register use is really just an aside. The optimizer is welcome to do as it pleases, so long as the results are indistinguishable for me (i.e. it should not change the size in a visible way).
The main problem for me is that using flags|kFlagValue results in a wider entity, and then pumping that into a template gives me a type mismatch error (template is rather much longer than I want to get into here, but the point is it takes two arguments, and they should match in type, or be trivially convertible, but aren't, due to this automatic size-promotion rule).
If I had access to a "conservative bit processing function set" then I could use:
flag non-promoting-bit-operator kFlagValue
To achieve my ends.
I guess I have to go write that, or use casts all over the place, because of this unfortunate rule.
C++ should not promote in this instance. It was a poor language choice.
Why is value promoted to a larger type? Because the language spec says it is (a 16-bit unsigned short will be converted to a 32-bit int). 16-bit ops on x86 actually incur a penalty over the corresponding 32 bit ones (due to a prefix opcode), so the 32 bit version just may run faster.
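One way to get the "non-promoting" operator the question asks for is a small wrapper that casts the result back once. The sketch below assumes int is wider than unsigned short (true for the platform in the question); or_word is a made-up name.

```cpp
#include <type_traits>

using WORD = unsigned short;  // same alias the question uses

// Both operands undergo integral promotion before |, so the expression
// has type int wherever int can represent every unsigned short value.
static_assert(std::is_same<decltype(WORD{1} | WORD{2}), int>::value,
              "unsigned short | unsigned short promotes to int here");

// Hypothetical non-promoting wrapper: do the operation, cast back once.
constexpr WORD or_word(WORD a, WORD b) {
    return static_cast<WORD>(a | b);
}
```

Passing or_word(old, kFlag) to a template then deduces WORD instead of int, which is usually what the type-matching code wants.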

Perform 64 bit calculations in 64 bit executable

I am using MinGW64 (with the -m64 flag) with Code::Blocks and am looking to know how to perform 64 bit calculations without having to cast a really big number to int64_t before multiplying it. For example, this does not result in overflow:
int64_t test = int64_t(2123123123) * 17; //Returns 36093093091
Without the cast, the calculation overflows like such:
int64_t test = 2123123123 * 17; //Returns 1733354723
A VirusTotal scan confirms that my executable is x64.
Additional Information: OS is Windows 7 x64.
The default int type is still 32 bit even in 64 bit compilations for compatibility reasons.
The "shortest" version I guess would be to add the ll suffix to the number
int64_t test = 2123123123ll * 17;
Another way would be to store the numbers in their own variables of type int64_t (or long long) and multiply the variables. It's usually rare anyway for a program to have many "magic numbers" hard-coded into the codebase.
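A small sketch of the numbers involved, assuming a 32-bit int as on the asker's platform. The function names are invented. Signed overflow itself is undefined behavior, so the truncated value is reproduced here with an explicit unsigned conversion instead.

```cpp
#include <cstdint>

// One operand is widened to 64 bits, so the multiplication is done in 64 bits.
constexpr std::int64_t widened_product() {
    return std::int64_t{2123123123} * 17;
}

// What survives if only the low 32 bits of that product are kept,
// written as an explicit (well-defined) unsigned truncation.
constexpr std::uint32_t low_32_bits_of_product() {
    return static_cast<std::uint32_t>(widened_product());
}
```

The first value is 36093093091 and the second is 1733354723, matching the two results quoted in the question.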
Some background:
Once upon a time, most computers had 8-bit arithmetic logic units and a 16-bit address bus. We called them 8-bit computers.
One of the first things we learned was that no real-world arithmetic problem can be expressed in 8-bits. It's like trying to reason about space flight with the arithmetic abilities of a chimpanzee. So we learned to write multi-word add, multiply, subtract and divide sequences. Because in most real-world problems, the numerical domain of the problem was bigger than 255.
Then we briefly had 16-bit computers (where the same problem applied, 65535 is just not enough to model things) and then quite quickly, 32-bit arithmetic logic built into chips. Gradually, the address bus caught up (20 bits, 24 bits, 32 bits if designers were feeling extravagant).
Then an interesting thing happened. Most of us didn't need to write multi-word arithmetic sequences any more. It turns out that most(tm) real world integer problems could be expressed in 32 bits (up to 4 billion).
Then we started producing more data at a faster rate than ever before, and we perceived the need to address more memory. The 64-bit computer eventually became the norm.
But still, most real-world integer arithmetic problems could be expressed in 32 bits. 4 billion is a big (enough) number for most things.
So, presumably through statistical analysis, your compiler writers decided that on your platform, the most useful size for an int would be 32 bits. Any smaller would be inefficient for 32-bit arithmetic (which we have needed from day 1) and any larger would waste space/registers/memory/cpu cycles.
Expressing an integer literal in c++ (and c) yields an int - the natural arithmetic size for the environment. In the present day, that is almost always a 32-bit value.
The c++ specification says that multiplying two ints yields an int. If it didn't then multiplying two ints would need to yield a long. But then what would multiplying two longs yield? A long long? Ok, that's possible. Now what if we multiply those? A long long long long?
So that's that.
int64_t x = 1 * 2; will do the following:
take the integer (32 bits) of value 1.
take the integer (32 bits) of value 2.
multiply them together, storing the result in an integer. If the arithmetic overflows, so be it. That's your lookout.
cast the resulting integer (whatever that may now be) to int64_t (on MinGW64, where long is 32 bits, that is a long long).
So in a nutshell, no. There is no shortcut to spelling out the type of at least one of the operands in the code snippet in the question. You can, of course, specify a literal. But there is no guarantee that a long long (LL literal suffix) on your system is the same as int64_t. If you want an int64_t, and you want the code to be portable, you must spell it out.
For what it's worth:
In a post-c++11 world all the worrying about extra keystrokes and non-DRYness can disappear:
definitely an int64:
auto test = int64_t(2123123123) * 17;
definitely a long long:
auto test = 2'123'123'123LL * 17;
definitely int64, definitely initialised with a (possibly narrowing, but that's ok) long long:
auto test = int64_t(36'093'093'091LL);
Since you're most likely in an LLP64 environment (64-bit Windows), where int and long are only 32 bits, you have to be careful about literal constants in expressions. The easiest way to do this is to get into the habit of using the proper suffix on literal constants, so you would write the above as:
int64_t test = 2123123123LL * 17LL;
2123123123 is an int (usually 32 bits).
Add an L to make it a long: 2123123123L (usually 32 or 64 bits, even in 64-bit mode).
Add another L to make it a long long: 2123123123LL (64 bits or more starting with C++11).
Note that you only need to add the suffix to constants that exceed the size of an int. Integral conversion will take care of producing the right result*.
(2123123123LL * 17) // 17 is automatically converted to long long, the result is long long
* But beware: even if individual constants in an expression fit into an int, the whole operation can still overflow like in
(1024 * 1024 * 1024 * 10)
In that case you should make sure the arithmetic is performed at sufficient width (taking operator precedence into account):
(1024LL * 1024 * 1024 * 10)
- will perform all 3 operations in 64 bits, with a 64-bit result.
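The reason the suffix goes on the leftmost constant can be checked at compile time. This is a sketch; ten_gib is just an illustrative name.

```cpp
#include <type_traits>

// Multiplication groups left to right, so the type of the leftmost
// operand decides the width of every later step in the chain.
static_assert(std::is_same<decltype(1024 * 1024), int>::value,
              "int * int stays int");
static_assert(std::is_same<decltype(1024LL * 1024), long long>::value,
              "long long * int is done in long long");

// Every multiplication here is carried out in long long.
constexpr long long ten_gib = 1024LL * 1024 * 1024 * 10;
```

If the suffix were attached only to a later factor, the earlier multiplications would still be done in int and could overflow before the widening happens.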
Edit: Literal constants (A.K.A. magic numbers) are frowned upon, so the best way to do it would be to use symbolic constants (const int64_t value = 5). See What is a magic number, and why is it bad? for more info. It's best that you don't read the rest of this answer, unless you really want to use magic numbers for some strange reason.
Also, you can use intptr_t and uintptr_t from #include <cstdint> to let the compiler choose whether to use int or __int64.
For those who stumble upon this question, `LL` at the end of a number can do the trick, but it isn't recommended, as Richard Hodges told me that `long long` may not always be 64 bit, and can increase in size in the future, although it's not likely. See Richard Hodges's answer and the comments on it for more information.
The reliable way would be to put `using QW = int64_t;` at the top and use `QW(5)` instead of `5LL`.
Personally I think there should be an option to define all literals as 64 bit without having to add any suffixes or functions to them, and use `int32_t(5)` when necessary, because some programs are unaffected by this change. Example: programs that only use numbers for normal calculations instead of relying on integer overflow to do its work. The problem is going from 64 bit to 32 bit, rather than going from 32 to 64, as the first 4 bytes are cut off.

What is wrong with this bit-manipulation code from an interview question?

I was having a look over this page: http://www.devbistro.com/tech-interview-questions/Cplusplus.jsp, and didn't understand this question:
What’s potentially wrong with the following code?
long value;
//some stuff
value &= 0xFFFF;
Note: Hint to the candidate about the base platform they’re developing for. If the person still doesn’t find anything wrong with the code, they are not experienced with C++.
Can someone elaborate on it?
Thanks!
Several answers here state that if an int has a width of 16 bits, 0xFFFF is negative. This is not true. 0xFFFF is never negative.
A hexadecimal literal is represented by the first of the following types that is large enough to contain it: int, unsigned int, long, and unsigned long.
If int has a width of 16 bits, then 0xFFFF is larger than the maximum value representable by an int. Thus, 0xFFFF is of type unsigned int, which is guaranteed to be large enough to represent 0xFFFF.
When the usual arithmetic conversions are performed for evaluation of the &, the unsigned int is converted to a long. The conversion of a 16-bit unsigned int to long is well-defined because every value representable by a 16-bit unsigned int is also representable by a 32-bit long.
There's no sign extension needed because the initial type is not signed, and the result of using 0xFFFF is the same as the result of using 0xFFFFL.
Alternatively, if int is wider than 16 bits, then 0xFFFF is of type int. It is a signed, but positive, number. In this case both operands are signed, and long has the greater conversion rank, so the int is again promoted to long by the usual arithmetic conversions.
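The literal-type rules described above can be checked at compile time. The sketch below picks the expected type from INT_MAX, so it should hold for both 16-bit and wider int; keep_low16 is a made-up helper, and its behavior for negative inputs assumes two's complement.

```cpp
#include <climits>
#include <type_traits>

// 0xFFFF is int when int can hold 65535, otherwise unsigned int.
using expected_literal_type =
    std::conditional<(INT_MAX >= 0xFFFF), int, unsigned int>::type;

static_assert(std::is_same<decltype(0xFFFF), expected_literal_type>::value,
              "type of 0xFFFF follows the first-type-that-fits rule");
static_assert(0xFFFF > 0, "a hexadecimal literal is never negative");

// The statement from the interview question, as a function:
// the literal is converted to long and keeps the value 65535.
constexpr long keep_low16(long value) {
    return value & 0xFFFF;
}
```

Either way the literal converts to the long value 65535, so the mask keeps exactly the low 16 bits.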
As others have said, you should avoid performing bitwise operations on signed operands because the numeric result is dependent upon how signedness is represented.
Aside from that, there's nothing particularly wrong with this code. I would argue that it's a style concern that value is not initialized when it is declared, but that's probably a nit-pick level comment and depends upon the contents of the //some stuff section that was omitted.
It's probably also preferable to use a fixed-width integer type (like uint32_t) instead of long for greater portability, but really that too depends on the code you are writing and what your basic assumptions are.
I think depending on the size of a long the 0xffff literal (-1) could be promoted to a larger size and being a signed value it will be sign extended, potentially becoming 0xffffffff (still -1).
I'll assume it's because there's no predefined size for a long, other than it must be at least as big as the preceding size (int). Thus, depending on the size, you might either truncate value to a subset of bits (if long is more than 32 bits) or overflow (if it's less than 32 bits).
Yeah, longs (per the spec, and thanks for the reminder in the comments) must be able to hold at least -2147483647 to 2147483647 (LONG_MIN and LONG_MAX).
For one, value isn't initialized before doing the AND, so I think the behaviour is undefined; value could be anything.
long type size is platform/compiler specific.
What you can say here is:
It is signed.
We can't know the result of value &= 0xFFFF; since it could be, for example, value &= 0x0000FFFF; and will not do what is expected.
While one could argue that since it's not a buffer-overflow or some other error that's likely to be exploitable, it's a style thing and not a bug, I'm 99% confident that the answer that the question-writer is looking for is that value is operated on before it's assigned to. The value is going to be arbitrary garbage, and that's unlikely to be what was meant, so it's "potentially wrong".
Using MSVC I think that the statement would perform what was most likely intended - that is: clear all but the least significant 16 bits of value, but I have encountered other platforms which would interpret the literal 0xffff as equivalent to (short)-1, then sign extend to convert to long, in which case the statement "value &= 0xFFFF" would have no effect.
"value &= 0x0FFFF" is more explicit and robust.

Getting 32 bit words out of 64-bit values in C/C++ and not worrying about endianness

It's my understanding that in C/C++ bitwise operators are supposed to be endian independent and behave the way you expect. I want to make sure that I'm truly getting the most significant and least significant words out of a 64-bit value and not worry about endianness of the machine. Here's an example:
uint64_t temp;
uint32_t msw, lsw;
msw = (temp & 0xFFFFFFFF00000000) >> 32;
lsw = temp & 0x00000000FFFFFFFF;
Will this work?
6.5.7 Bitwise shift operators
4 The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 × 2^E2, reduced modulo one more than the maximum value representable in the result type. If E1 has a signed type and nonnegative value, and E1 × 2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.
So, yes -- guaranteed by the standard.
It will work, but the strange propensity of some authors for doing bit-masking before bit-shifting always puzzled me.
In my opinion, a much more elegant approach would be the one that does the shift first
msw = (temp >> 32) & 0xFFFFFFFF;
lsw = temp & 0xFFFFFFFF;
at least because it uses the same "magic" bit-mask constant every time.
Now, if your target type is unsigned and already has the desired bit-width, masking becomes completely unnecessary
msw = temp >> 32;
lsw = temp;
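Wrapped up as functions, the shift-only version could look like the sketch below. The function names are made up, and the explicit casts are only there to document the intended narrowing.

```cpp
#include <cstdint>

// Upper 32 bits: shift them down, then narrow.
constexpr std::uint32_t most_significant_word(std::uint64_t v) {
    return static_cast<std::uint32_t>(v >> 32);
}

// Lower 32 bits: conversion to a 32-bit unsigned type is defined as
// reduction modulo 2^32, which is exactly the masking we want.
constexpr std::uint32_t least_significant_word(std::uint64_t v) {
    return static_cast<std::uint32_t>(v);
}
```

Both functions operate on the value, not on its bytes in memory, so the result does not depend on the endianness of the machine.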
Yes, that should work. When you're retrieving the msw, your mask isn't really accomplishing much though -- the bits you mask to zero will be discarded when you do the shift anyway. Personally, I'd probably use something like this:
uint32_t lsw = -1, msw = -1;
lsw &= temp;
msw &= temp >> 32;
Of course, to produce a meaningful result, temp has to be initialized, which it wasn't in your code.
Yes.
It should work.
Just a thought I would like to share: perhaps you could get around the endianness of a value by using the functions or macros found in <arpa/inet.h> to convert from network to host byte order and vice versa. They are mostly used in conjunction with sockets, but they could be used here to guarantee that a value such as 0xABCD from another processor is still 0xABCD on the Intel x86, instead of resorting to hand-coded custom functions to deal with the endianness of the architecture.
Edit: Here's an article about endianness on CodeProject, whose author developed macros to deal with 64-bit values.
Hope this helps,
Best regards,
Tom.
Endianness is about memory layout. Shifting is about bits (and bit layout). Word significance is about bit layout, not memory layout. So endianness has nothing to do with word significance.
I think what you are saying is quite true, but where does this get you?
If you have some literal values hanging around, then you know which end is which. But if you find yourself with values that have come from outside the program, then you can't be sure, unless they have been encoded in some way.
In addition to the other responses, I shall add that you should not worry about endianness in C. Endianness trouble comes only from looking at some bytes under a different type than what was used to write those bytes in the first place. When you do that, you are very close to having aliasing issues, which means that your code may break when using another compiler or another optimization flag.
As long as you do not try to do such trans-type accesses, your code should be endian-neutral, and run flawlessly on both little-endian and big-endian architectures. Or, in other words, if you have endianness issues, then other kinds of bigger trouble are also lurking nearby.
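For values that do come from outside the program, one endian-neutral approach is to fix a byte order for the external format and convert with shifts only, never by reinterpreting memory. The sketch below assumes a big-endian external format; be_byte and from_be_bytes are hypothetical helpers.

```cpp
#include <cstdint>

// Byte number `index` (0 = most significant, 3 = least significant)
// of a 32-bit value, extracted by value rather than by looking at memory.
constexpr std::uint8_t be_byte(std::uint32_t v, unsigned index) {
    return static_cast<std::uint8_t>(v >> (8 * (3 - index)));
}

// Reassemble a 32-bit value from four bytes given in big-endian order.
constexpr std::uint32_t from_be_bytes(std::uint8_t b0, std::uint8_t b1,
                                      std::uint8_t b2, std::uint8_t b3) {
    return (std::uint32_t{b0} << 24) | (std::uint32_t{b1} << 16) |
           (std::uint32_t{b2} << 8)  |  std::uint32_t{b3};
}
```

Because neither helper looks at the in-memory representation, the same source should behave identically on little-endian and big-endian hosts.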