I am getting a benign warning about possible data loss
warning C4244: 'argument' : conversion from 'const int' to 'float', possible loss of data
Question
I seem to remember that float has greater precision than int. So how can data be lost if I convert from a smaller data type (int) to a larger data type (float)?
Because floating-point numbers are not exact. A float cannot represent every possible value an int can hold, even though the maximum value of a float is much higher.
For instance, run this simple program:
#include <stdio.h>

int main()
{
    for(int i = 0; i < 2147483647; i++)
    {
        float value = i;
        int ivalue = value;
        if(i != ivalue)
            printf("Integer %d is represented as %d in a float\n", i, ivalue);
    }
}
You'll quickly see that there are billions of integers that can't be represented exactly as floats. For instance, the integers 16,777,219 through 16,777,221 are all represented as 16,777,220.
EDIT again Running the program above shows that there are 2,071,986,175 positive integers that cannot be represented exactly as floats. That leaves only about 75 million positive integers that fit exactly into a float. In other words, only about one integer in 28 survives the trip into a float unchanged.
I expect the numbers to be the same for the negative integers.
On most architectures int and float are the same size, in that they have the same number of bits. However, in a float those bits are split between exponent and mantissa, meaning that there are actually fewer bits of precision in the float than the int. This is only likely to be a problem for larger integers, though.
On systems where an int is 32 bits, a double is usually 64 bits and so can exactly represent any int.
Both types are composed of 4 bytes (32 bits), but only one of them, the float, allows a fraction.
Take this float for example:
34.156
(integer).(fraction)
Now follow the logic: if the float must also store fraction information, it has fewer bits left for the integer part.
Thus a float can exactly represent only a smaller range of integers than the int type can.
To be more specific, an int uses 32 bits to represent an integer (maximal unsigned value of 4,294,967,295). A float stores only 23 mantissa bits (24 counting the implicit leading 1), so it can represent every integer exactly only up to 2^24 = 16,777,216.
That's why when you convert from int to float you might lose data.
Example:
int = 1,158,354,125
You cannot store this number in a "float".
More information at:
http://en.wikipedia.org/wiki/Single_precision_floating-point_format
http://en.wikipedia.org/wiki/Integer_%28computer_science%29
Precision does not matter. The precision (resolution) of int is 1, while the relative precision of a typical float (IEEE 754 single precision) is approximately 5.96e-8. What matters is the sets of numbers that the two formats can represent. If there are numbers that int can represent exactly that float cannot, then there is a possible loss of data.
Floats and ints are typically both 32 bits these days, but that's not guaranteed. Assuming it is the case on your machine, it follows that there must be int values that float cannot represent exactly, because there are obviously float values that int cannot represent exactly. The range of one format cannot be a proper super-set of the other if both formats use the same number of bits efficiently.
A 32 bit int effectively has 31 bits that code for the absolute value of the number. An IEEE 754 float effectively has only 24 bits that code for the mantissa (one implicit).
The fact is that both a float and an int are represented using 32 bits. The integer value uses all 32 bits, so it can accommodate numbers from -2^31 to 2^31-1. However, a float uses 1 bit for the sign (including -0.0f) and 8 bits for the exponent. That leaves 32 - 9 = 23 bits for the mantissa. However, the float assumes that if the mantissa and exponent are not zero, then the mantissa starts with a 1. So you more or less have 24 bits for your integer, instead of 32. However, because the mantissa can be shifted, the float can still reach integers larger than 2^24, just not all of them.
A floating point uses a Sign, an eXponent, and a Mantissa
S X X X X X X X X M M M M M M M M M M M M M M M M M M M M M M M
An integer has a Sign, and a Mantissa
S M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M
So, a 29 bit integer such as:
0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
fits in a float because it can be shifted:
0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
| | |
| +-----------+ +-----------+
| | |
v v v
S X X X X X X X X M M M M M M M M M M M M M M M M M M M M M M M
0 1 0 0 1 1 0 1 1 1 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0
The eXponent represents a biased shift (the shift amount plus a bias, which is 127 in IEEE 754 single precision; the shift counts from the decimal point). This clearly shows you that if you have to shift by 5 bits, you're going to lose the 5 lower bits.
So this other integer can be converted to a float with a loss of 2 bits (i.e. when you convert back to an integer, the last two bits (11) come back as zeros (00) because they were not saved in the float):
1 1 1 0 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1
| ||
| || complement
| vv
| 0 0 1 1 0 0 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
| | | | | | | |
| +-----------+ +-----------+ +-+-+-+-+--> lost bits
| | |
v v v
S X X X X X X X X M M M M M M M M M M M M M M M M M M M M M M M
1 1 0 0 1 1 0 1 1 1 0 0 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1
Note: For negative numbers, we first generate the complement, which means subtracting 1 and then flipping every bit from 0 to 1 and vice versa; for a two's complement number this yields the absolute value. That complement is what gets saved in the mantissa. The sign, however, still gets copied as is.
Pretty simple stuff really.
IMPORTANT NOTE: Yes, the first 1 in the integer is the sign. The next 1 is not copied into the mantissa; it is implied to be 1, so it does not need to be stored.
A float is usually in the standard IEEE single-precision format. This means there are only 24 bits of precision in a float, while an int is likely to be 32-bit. So, if your int contains a number whose absolute value cannot fit in 24 bits, you are likely to have it rounded to the nearest representable number.
My stock answer to such questions is to read this - What Every Computer Scientist Should Know About Floating-Point Arithmetic.
Related
For example, say I have 3 integers: 18, 9 and 21.
Those 3 integers in binary: 10010, 01001, 10101.
Say there's a number x; I want x to take, in each position, the bit that most of the numbers agree on. For example, the first digit is 1 in two of the three numbers, so x will start off as "1....". The second digit is 0 in the majority, so it becomes "10...". The third digit is a mix: we have a 0, a 0 and a 1, but more 0s than 1s, so x will be "100..", and so on.
Is there any way to do this? I've been looking at bitwise operators and I'm just not sure how to do it. Bitwise AND doesn't really work on three numbers like this, because if it sees even just one zero it will return 0.
I would simply add the bits if I were you: imagine the numbers: 17, 9 and 21, and let's write them in binary:
17 : 10001
9 : 01001
21 : 10101
Put this in a "table" and sum your binary digits:
1 0 0 0 1
0 1 0 0 1
1 0 1 0 1
2 1 1 0 3
... and then you say "When I have 0 or 1, I put '0', when 2 or 3, I put '1'", then you get:
1 0 0 0 1
=> your answer becomes "10001" which equals 17.
I am doing a project on digital filters. I needed to know how to add a 4 bit binary number to the most significant 4 bits of an 8 bit number. For example:
0 1 0 0 0 0 0 0 //x
+ 1 0 1 0 //y
= 1 1 1 0 0 0 0 0 //z
Can I add using a code somewhat like this?
z=[7:4]x + y
or should I have to concatenate the 4 bit number with another four zeros and add?
Assuming y is the 4 bit number and x the 8 bit number:
If you do
assign z = x[7:4] + y
Then you are doing a 4-bit addition and the most significant part of z is padded with 0's.
If you do
assign z = y[7:4] + x
You will get an error message from the synthesizer, as subscripts for y are wrong.
So do as this:
assign z = {y,4'b0} + x
Which performs an 8-bit addition of x and the value of y shifted 4 bits to the left, which is what you wanted.
I am trying to understand the condition of an if-else statement in C++. Here is the snippet where this statement appears (note it's a shortened version):
for (int i = 0; i < 8; ++i)
{
    Point newCenter = center;
    newCenter.x += oneEighth.x * (i&4 ? 0.5f : -0.5f);
}
I do understand that the 0.5f holds if the condition is true and -0.5f otherwise, but what does the i&4 mean?
This uses two things. First, the bitwise AND operator &: it takes the binary representations of the two integers (i and 4) and computes their bitwise AND; that is, each bit of the result is 1 if and only if both arguments have a 1 at that position. Second, the implicit int-to-bool conversion, which yields true if the integer is not equal to 0.
For example, if we have i=7, then the internal bitwise representation of this in two's complement would be:
/*24 0s*/ 0 0 0 0 0 1 1 1
The two's complement representation of 4 is /*24 0s*/ 0 0 0 0 0 1 0 0, so the bitwise AND is /*24 0s*/ 0 0 0 0 0 1 0 0. Since this is not equal to zero, it is implicitly converted to true, and the condition is met.
Alternatively, if we consider i=2, then we have the internal representation:
/*24 0s*/ 0 0 0 0 0 0 1 0
And thus the bitwise AND gives /*24 0s*/ 0 0 0 0 0 0 0 0 and thus the condition is not met.
The operator is Bitwise AND.
Bitwise binary AND does the logical AND of the bits in each position of a number in its binary form.
So, in your code, i&4 is true when i is 4, 5, 6 or 7, because the base-2 representation of 4 is 100. i&4 will be true whenever the base-2 representation of i has a 1 in the third position (counting from the right).
What happens if you use a bitwise operator (&, |, etc.) to compare two bitfields of different sizes?
For example, comparing 0 1 1 0 with 0 0 1 0 0 0 0 1:
0 1 1 0 0 0 0 0 The smaller one is extended with zeros and pushed to the
0 0 1 0 0 0 0 1 most-significant side.
Or...
0 0 0 0 0 1 1 0 The smaller one is extended with zeros and pushed to the
0 0 1 0 0 0 0 1 least-significant side.
Or...
0 1 1 0 The longer one is truncated from its least-significant side,
0 0 1 0 keeping its most significant side.
Or...
0 1 1 0 The longer one is truncated from its most-significant side,
0 0 0 1 keeping its least-significant side.
The bitwise operators always work on promoted operands. So exactly what might happen can depend on whether one (or both) bitfields are signed (as that may result in sign extension).
So, for your example values, the bit-field with the binary value 0 1 1 0 will be promoted to the int 6, and the bit-field with the binary value 0 0 1 0 0 0 0 1 will be promoted to the int 33, and those are the operands that will be used with whatever the operation is.
0 0 0 0 0 1 1 0 The smaller one is extended with zeros and pushed to the
0 0 1 0 0 0 0 1 least-significant side.
If you're actually using the values as bitfields, what's the meaning of comparing bitfields of different sizes? Would it generate a meaningful result for you?
That said, both operands will be promoted to a minimum size of int/unsigned with signedness depending on the signedness of the original operands. Then these promoted values will be compared with the bitwise operator.
This behaves as your second example: the smaller one is padded with zeroes on the MSB side (pushed to the LSB side, if you prefer).
If one operand is signed and negative while the other is unsigned, the negative one will be converted to the congruent unsigned number before the bit operation takes place.
If instead of integral numbers you mean std::bitset, you can't do bitwise operations on bitsets of differing sizes.
I'm not good at English, so I can't ask this better, but please see below:
if byte in binary is 1 0 0 0 0 0 0 0 then result is 1
if byte in binary is 1 1 0 0 0 0 0 0 then result is 2
if byte in binary is 1 1 1 0 0 0 0 0 then result is 3
if byte in binary is 1 1 1 1 0 0 0 0 then result is 4
if byte in binary is 1 1 1 1 1 0 0 0 then result is 5
if byte in binary is 1 1 1 1 1 1 0 0 then result is 6
if byte in binary is 1 1 1 1 1 1 1 0 then result is 7
if byte in binary is 1 1 1 1 1 1 1 1 then result is 8
But if for example the byte in binary is 1 1 1 0 * * * * then result is 3.
I would like to determine, in one operation, how many contiguous bits are set from left to right.
The results don't have to be the numbers 1-8; anything that distinguishes the cases is fine.
I think it's possible in one or two operations, but I don't know how.
If you don't know a solution as short as 2 operations, please write that too, and I'll stop trying.
Easiest non-branching solution I can think of:
y=~x
y|=y>>4
y|=y>>2
y|=y>>1
Invert x, then smear the leftmost 1-bit (which corresponds to the leftmost 0-bit in the non-inverted value) to the right. This gives distinct values (not 1-8, but it's easy to map them).
110* ****
turns into
001* ****
001* **1*
001* 1*1*
0011 1111
EDIT:
As pointed out in a different answer, using a precomputed lookup table is probably the fastest. Given only 8 bits, it's probably even feasible in terms of memory consumption.
EDIT:
Heh, whoops, my bad... You can skip the invert and use ANDs instead.
x&=x>>4
x&=x>>2
x&=x>>1
here
110* ****
gives
110* **0*
110* 0*0*
1100 0000
As you can see all values beginning with 110 will result in the same output (1100 0000).
EDIT:
Actually, the 'and' version relies on implementation-defined behavior (right-shifting negative numbers). It will usually do the right thing if you use a signed 8-bit type (i.e. char rather than unsigned char in C, on implementations where the shift is arithmetic), but as I said, the behavior is not guaranteed and might not always work.
I'd second a lookup table... otherwise you can also do something like:
unsigned long inverse_bitscan_reverse(unsigned long value)
{
    unsigned long bsr = 0;
    _BitScanReverse(&bsr, ~value); // MSVC intrinsic for the x86 bsr instruction
    return bsr;
}
EDIT: Note that you have to be careful of the special case where "value" has no zeroed bits. See the documentation for _BitScanReverse.