Yes, I know, doing bitwise ops on double values seems like a bad idea, but I actually need it.
You don't need to read the next paragraph for my question; it's only for the curious among you:
I'm working on a special mod to Mozilla Tamarin (the ActionScript Virtual Machine). In it, every object has its first 3 bits reserved for its type (double is 7, for example). These bits reduce precision for primitive data types (int gets only 29 bits, etc.). For my mod, I need to expand this area by 2 bits. This means that when you, for example, add 2 doubles, you need to set these last 5 bits to zero, do the math, then restore them on the result. So much for the why ^^
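To make that concrete, here is a minimal sketch (my own illustration, not the actual Tamarin code) of untagging a double before arithmetic and re-tagging the result; the 5-bit mask and the tag value 0x07 are just assumptions for the example:

#include <cstdint>
#include <cstring>
#include <cstdio>

// Clear the low "tag" bits of a double's bit pattern, do the math,
// then put the tag back on the result.
int main() {
    double a = 15.25;
    uint64_t bits;
    std::memcpy(&bits, &a, sizeof bits);      // reinterpret the bit pattern safely

    const uint64_t tagMask = 0x1F;            // 5 low bits reserved for the type tag
    uint64_t tag = 0x07;                      // pretend this value is tagged "double"
    bits = (bits & ~tagMask) | tag;           // a tagged value, as the VM would hold it

    // Untag before arithmetic...
    uint64_t untagged = bits & ~tagMask;
    double clean;
    std::memcpy(&clean, &untagged, sizeof clean);

    double sum = clean + clean;               // ...do the math on the clean value...

    // ...and re-apply the tag to the result.
    uint64_t resultBits;
    std::memcpy(&resultBits, &sum, sizeof resultBits);
    resultBits = (resultBits & ~tagMask) | tag;

    std::printf("%016llX\n", (unsigned long long)resultBits);
    return 0;
}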
Now back to the code.
Here is a minimal example which shows a very similar problem:
double *d = new double;
*d = 15.25;
printf("float: %f\n", *d);
//forced hex output of double
printf("forced bitwise of double: ");
unsigned char * c = (unsigned char *) d;
int i;
for (i = sizeof (double)-1; i >=0 ; i--) {
printf ("%02X ", c[i]);
}
printf ("\n");
//cast to long long-pointer, so that bitops become possible
long long * l = (long long*)d;
//now the bitops:
printf("IntHex: %016X, float: %f\n", *l, *(double*)l); //this output is wrong!
*l = *l | 0x07;
printf("last 3 bits set to 1: %016X, float: %f\n", *l, *d);//this output is wrong!
*l = *l | 0x18;
printf("2 bits more set to 1: %016X, float: %f\n", *l, *d);//this output is wrong!
When running this in Visual Studio 2008, the first output is correct; the second, too. The 3rd yields 0 for both the hex and the float representation, which is obviously wrong. The 4th and 5th are also zero for both hex and float, but the modified bits show up in the hex value. So I thought maybe the typecast messed things up here. So, 2 more outputs:
printf("float2: %f\n", *(double*)(long long*)d); //almost right
printf("float3: %f\n", *d); //almost right
Well, they show 15.25, but it should be 15.2500000000000550670620214078. So I thought, hey, it's just a precision issue in the output. Let's modify a bit further up:
*l = *l |= 0x10000000000;
printf("float4: %f\n", *d);
Again, the output is 15.25(0000), and not 15.2519531250000550670620214078. Weirdly enough, another forced hex output (see the code above) shows no modification of d at all. So I tinkered a bit and realized that bit 31 (0x80000000) is the last one I can set by hand. And holy moly, it actually has an effect on the output (15.250004)!
So, though I slightly strayed, there is still a lot of confusion. Is printf broken? Am I having a big/little-endian mix-up here? Am I accidentally creating some kind of buffer overrun?
If anybody is interested: in the original problem (the Tamarin thing, see above) it's pretty much the inverse. There, the last three bits are already set (which represents a double). Setting them to zero works fine (that's the original implementation). Setting 2 more to zero has the same effect as above (the overall value gets floored to zero). This is, by the way, not output-specific; math ops also seem to work with those floored values (multiplying 2 values obtained like that results in 0).
Any help would be appreciated.
Greetings.
well, they show 15.25, but it should be 15.2500000000000550670620214078
By default, %f displays 6 digits after the decimal point, so you won't see the difference. You also need to specify that the first argument is long long rather than int, using the ll length modifier; otherwise, it might print garbage. If you fix that and use a higher precision, such as %.30f, you should see the expected result:
printf("last 3 bits set to 1: %016llX, float: %.30f\n", *l, *d);
printf("2 bits more set to 1: %016llX, float: %.30f\n", *l, *d);
last 3 bits set to 1: 0000000000000007, float: 15.250000000000012434497875801753
2 bits more set to 1: 000000000000001F, float: 15.250000000000055067062021407764
lets modify a bit further up:
*l = *l |= 0x10000000000;
printf("float4: %f\n", *d);
You have a rogue = giving undefined behaviour, so the value may or may not end up being modified (and the program may or may not crash, phone out for pizza, or destroy the universe). Also, if your compiler isn't C++11 compliant, the type of the integer literal might be no larger than long, which might only be 32 bits; in which case it will (probably) become zero.
Fixing those (and in my case, with your code as it is), I get the expected result:
*l = *l | 0x10000000000LL; // just one assignment, and "LL" to force "long long"
printf("float4: %f\n", *d);
float4: 15.251953
You have a mistake in the printf format. If you pass an 8-byte value you have to use %llx instead of %x.
use
printf("last 3 bits set to 1: %llX, float: %f\n", *l, *d);
*l = *l | 0x18;
printf("2 bits more set to 1: %llX, float: %f\n", *l, *d);
and your code will work
On 32-bit, the constant cannot be wider than long (32 bits), so you cannot do this:
*l |= 0x10000000000;
You have to create a variable and then shift it:
long long ll = 1;
ll <<= 40;   // 1 << 40 == 0x10000000000
*l |= ll;
I am reading a binary file and trying to convert from IBM 4-byte floating point to double in C++. How exactly would one use the first byte of IBM data to find the ccccccc in the given picture?
IBM to value conversion chart
The code below gives an exponent way larger than the data should have. I am confused about how the line
exponent = ((IBM4ByteValue[0] & 127) - 64);
executes; I do not understand the use of the & operator in this statement. Essentially, what the previous author of this code implied is that (IBM4ByteValue[0]) is the ccccccc, so does this mean that the ampersand sets a maximum value that the left side of the operator can equal? Even if this is correct, I'm not sure how this line accounts for the fact that the first byte is in big-endian bit order (I believe it is big-endian after viewing the picture). Not to mention that 1000001 and 0000001 should have the same exponent (-63), yet they will not with my current interpretation of the previously mentioned line.
So, in short, could someone show me how to find the ccccccc (shown in the picture link above) using the first byte, IBM4ByteValue[0]? Maybe by accessing each individual bit? However, I do not know the code to do this using my array.
**this code is using the std namespace
**I believe ret should be mantissa * pow(16, 24+exponent); however, if I'm wrong about the exponent, I'm probably wrong about this (I got the IBM conversion from a previously asked Stack Overflow question). **I would have just commented on the old post, but this question was a bit too large, pun intended, for a comment. It is also different in that I am asking how exactly one accesses the bits in an array storing whole bytes.
Code I put together using an IBM conversion from a previous question's answer:
for (long pos = 0; pos < fileLength; pos += BUF_LEN) {
    file.seekg(bytePosition);
    file.read((char *)(&IBM4ByteValue[0]), BUF_LEN);
    bytePosition += 4;
    printf("\n%8ld: ", pos);
    //IBM Conversion
    double ret = 0;
    uint32_t mantissa = 0;
    uint16_t exponent = 0;
    mantissa = (IBM4ByteValue[3] << 16) | (IBM4ByteValue[2] << 8) | IBM4ByteValue[1];
    exponent = ((IBM4ByteValue[0] & 127) - 64);
    ret = mantissa * exp2(-24 + 4 * exponent);
    if (IBM4ByteValue[0] & 128) ret *= -1.;
    printf(":%24f", ret);
    printf("\n");
    system("PAUSE");
}
The & operator takes the bits in that array value and masks them with the binary value of 127 (0111 1111). If a bit in the array value is 1, and the corresponding bit of 127 is 1, the resulting bit is 1; 1 & 0 is 0, as are 0 & 0 and 0 & 1. In other words, & 127 keeps only the low 7 bits and clears the top (sign) bit. Then you take the resulting value, interpreted as a decimal number, and subtract 64 from it to get your exponent.
In floating point we always have a bias (in this case, 64) for the exponent. This means that if your exponent is 5, 69 will be stored. So what this code is trying to do is find the original value of the exponent.
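As a small illustration of what that line does (the byte value 0xC2 here is just a made-up example, not your data):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t first = 0xC2;                    /* hypothetical first byte: 1100 0010     */
    int sign     = (first & 0x80) ? -1 : 1;  /* bit 7 is the sign bit                  */
    int exponent = (first & 0x7F) - 64;      /* bits 0-6 are ccccccc, stored excess-64 */
    printf("sign = %d, exponent = %d\n", sign, exponent);  /* sign = -1, exponent = 2  */
    return 0;
}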
On my Arduino, the following code produces output I don't understand:
void setup() {
    Serial.begin(9600);
    int a = 250;
    Serial.println(a, BIN);
    a = a << 8;
    Serial.println(a, BIN);
    a = a >> 8;
    Serial.println(a, BIN);
}

void loop() {}
The output is:
11111010
11111111111111111111101000000000
11111111111111111111111111111010
I do understand the first line: leading zeros are not printed to the serial terminal. However, after shifting the bits the data type of a seems to have changed from int to long (32 bits are printed). The expected behaviour is that bits are shifted to the left, and that bits which are shifted "out" of the 16 bits an int has are simply dropped. Shifting the bits back does not turn the "32bit" variable to "16bit" again.
Shifting by 7 or less positions does not show this effect.
I probably should say that I am not using the Arduino IDE, but the Makefile from https://github.com/sudar/Arduino-Makefile.
What is going on? I almost expect this to be "normal", but I don't get it. Or is it something in the printing routine which simply adds 16 "1"'s to the output?
Enno
In addition to the other answers: integers might be stored in 16 bits or 32 bits depending on which Arduino you have.
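A quick way to check on your own board (a minimal sketch):

void setup() {
    Serial.begin(9600);
    // Prints 2 on 8-bit AVR boards (Uno, Nano, Mega) and 4 on 32-bit boards (Due, Zero).
    Serial.println((int)sizeof(int));
}

void loop() {}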
The function printing numbers in Arduino is defined in /arduino-1.0.5/hardware/arduino/cores/arduino/Print.cpp
size_t Print::printNumber(unsigned long n, uint8_t base) {
    char buf[8 * sizeof(long) + 1]; // Assumes 8-bit chars plus zero byte.
    char *str = &buf[sizeof(buf) - 1];
    *str = '\0';

    // prevent crash if called with base == 1
    if (base < 2) base = 10;

    do {
        unsigned long m = n;
        n /= base;
        char c = m - base * n;
        *--str = c < 10 ? c + '0' : c + 'A' - 10;
    } while(n);

    return write(str);
}
All other functions rely on this one, so yes, your int gets promoted to an unsigned long when you print it, not when you shift it.
However, the library is correct. By shifting left 8 positions, the sign bit of the integer becomes '1', so when the integer value is promoted to unsigned long it is correctly padded with 16 extra '1's (sign extension) instead of '0's.
If you are using such a value not as a number but to contain some flags, use unsigned int instead of int.
ETA: for completeness, I'll add further explanation for the second shifting operation.
Once you set the sign bit inside the int, shifting to the right makes the runtime pad the number with '1's in order to preserve its negative value. Shifting to the right by k positions corresponds to dividing the number by 2^k, and since the number is negative to start with, the result must remain negative.
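To see the difference in isolation, here is a small plain-C sketch that mimics the Arduino's 16-bit int with fixed-width types (the right shift of a negative value is implementation-defined, but common compilers do the arithmetic, sign-extending shift shown in the comments):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int16_t  s = 250;                 /* same width as int on an AVR Arduino         */
    uint16_t u = 250;

    s = (int16_t)(s << 8) >> 8;       /* 250 << 8 sets the sign bit; shifting back   */
                                      /* sign-extends with 1s, giving -6 (…11111010) */
    u = (uint16_t)(u << 8) >> 8;      /* unsigned: shifting back pads with 0s        */

    printf("signed:   %d\n", s);      /* -6  */
    printf("unsigned: %u\n", u);      /* 250 */
    return 0;
}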
int value = 0xffffffff;
int len = 32;
int result = value << len; // result will be 0xffffffff
result = value << 32; // result will be 0x0
Why does it make a difference?
Edit:
Sorry I made a mistake. In the example above, both results are 0xffffffff.
So look at this:
unsigned int value = 0xffffffff;
unsigned int len = 32;
printf("0x%x\n", value << len); //will print 0xffffffff
printf("0x%x\n", 0xffffffff << 32); //will print 0x0
If the size of an int is 32 bits or less, your code contains undefined behavior. The number of bits shifted must be greater than or equal to 0, and strictly less than the number of bits in what is being shifted.
What is probably happening in practice is that for the variable, the compiler just passes it to a machine instruction which only considers the 5 low-order bits (which are 0 in the case of 32); when the shift count is a constant, the compiler evaluates the expression internally, likely in long long, and then truncates it. But this is just one possible behavior; anything might happen as far as the language is concerned.
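As an illustration of how to sidestep the problem (a sketch, not the only option):

#include <stdio.h>
#include <stdint.h>
#include <limits.h>

int main(void) {
    unsigned int value = 0xffffffffu;
    unsigned int len   = 32;

    /* Widen first: shifting a uint64_t left by 32 is well defined. */
    uint64_t wide = (uint64_t)value << len;
    printf("0x%x\n", (unsigned int)wide);        /* low 32 bits: 0x0 */

    /* Or guard the count before shifting the 32-bit value itself. */
    unsigned int safe = (len < sizeof value * CHAR_BIT) ? (value << len) : 0u;
    printf("0x%x\n", safe);                      /* 0x0 */
    return 0;
}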
If len is greater than or equal to the width of int in bits (i.e. len >= CHAR_BIT * sizeof(int)) or len < 0, the code contains undefined behaviour.
See this answer for more details.
I'm using a short (and I must use a short for the assignment, otherwise I would just use an int) to scan in a value between 0 and 31, and then I'm using a single integer to store 6 of these scanned values.
This is what I have so far:
int vals = 0;
short ndx, newVal;
/* more printing/scanning and error checking in between */
newVal = newVal << (5*ndx);
vals = vals | newVal;
When I try to place a valid value at spot 4 or 5, it doesn't work and vals just stays 0... I'm wondering if this is because a short is only 2 bytes long, so the bitwise left shift just gets rid of the entire value? And if this is the problem, is there some sort of cast I can add to fix it?
It's exactly what you thought. You used a bitwise shift and then assigned the result into a short variable (newVal). When you do that, even though the calculation is done in 32 bits, the result still gets truncated, and you keep only the least significant 16 bits, which are all zeros.
If you want to refrain from using an int, just drop the newVal variable completely, and calculate vals = vals | ((something) << (some other thing));
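A sketch of that approach, with ndx and newVal filled in by hand in place of the scanning code:

#include <stdio.h>

int main(void) {
    int vals = 0;
    short ndx = 4, newVal = 31;      /* stand-ins for the scanned input */

    /* The shift promotes newVal to int anyway; the point is to keep the
       result in vals (an int) instead of assigning it back to a short.  */
    vals = vals | (newVal << (5 * ndx));

    printf("vals = 0x%08X\n", vals); /* slot 4 occupies bits 20-24: 0x01F00000 */
    return 0;
}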
#include <stdio.h>
union NumericType
{
float value;
int intvalue;
}Values;
int main()
{
Values.value = 1094795585.00;
printf("%f \n",Values.value);
return 0;
}
This program outputs as :
1094795648.000000
Can anybody explain why this is happening? Why did the value of the float Values.value increase? Or am I missing something here?
First off, this has nothing whatsoever to do with the use of a union.
Now, suppose you write:
int x = 1.5;
printf("%d\n", x);
what will happen? 1.5 is not an integer value, so it gets converted to an integer (by truncation), and so x actually gets the value 1, which is exactly what is printed.
The exact same thing is happening in your example.
float x = 1094795585.0;
printf("%f\n", x);
1094795585.0 is not representable as a single-precision floating-point number, so it gets converted to a representable value. This happens via rounding. The closest representable values, with your number shown between them, are:
1094795520 (0x41414100) -- closest `float` smaller than your number
1094795585 (0x41414141) -- your number
1094795648 (0x41414180) -- closest `float` larger than your number
Because your number is slightly closer to the larger value (this is somewhat easier to see if you look at the hexadecimal representation), it rounds to that value, so that is the value stored in x, and that is the value that is printed.
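You can confirm this with nextafterf (a small sketch; the comments show what a typical IEEE-754 float gives):

#include <stdio.h>
#include <math.h>

int main(void) {
    float x = 1094795585.0f;              /* gets rounded when stored              */
    float lower = nextafterf(x, 0.0f);    /* the representable float just below it */

    printf("stored value:     %.1f\n", x);      /* 1094795648.0 */
    printf("next lower float: %.1f\n", lower);  /* 1094795520.0 */

    /* the original number is 63 away from the larger candidate but 65 away
       from the smaller one, so round-to-nearest picks the larger value     */
    printf("distance up:   %.1f\n", (double)x - 1094795585.0);      /* 63.0 */
    printf("distance down: %.1f\n", 1094795585.0 - (double)lower);  /* 65.0 */
    return 0;
}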
A float isn't as precise as you would like it to be. Its effective 24-bit mantissa only provides 7-8 decimal digits of precision. Your example requires 10 decimal digits of precision. A double has an effective 53-bit mantissa, which provides 15-16 digits of precision; that is enough for your purpose.
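A minimal sketch of that difference:

#include <stdio.h>

int main(void) {
    float  f = 1094795585.0f;   /* ~7 significant digits: rounds to 1094795648 */
    double d = 1094795585.0;    /* ~15-16 significant digits: stored exactly   */
    printf("float:  %f\n", f);  /* 1094795648.000000 */
    printf("double: %f\n", d);  /* 1094795585.000000 */
    return 0;
}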
It's because your float type doesn't have the precision to display that number. Use a double.
Floats only have about 7 significant decimal digits of precision.
When I do this, I get the same results:
int _tmain(int argc, _TCHAR* argv[])
{
float f = 1094795585.00f;
// 1094795648.000000
printf("%f \n",f);
return 0;
}
I simply don't understand why people use floats - they are often no faster than doubles and may be slower. This code:
#include <stdio.h>
union NumericType
{
double value;
int intvalue;
}Values;
int main()
{
Values.value = 1094795585.00;
printf("%lf \n",Values.value);
return 0;
}
produces:
1094795585.000000
By default, printing a float with %f gives 6 digits after the decimal point. If you want 2 digits after the decimal point, use %.2f.
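For example:

#include <stdio.h>

int main(void) {
    float f = 1094795585.0f;    /* stored as 1094795648.0f                */
    printf("%f\n", f);          /* 1094795648.000000 (6 digits, default)  */
    printf("%.2f\n", f);        /* 1094795648.00     (2 digits)           */
    return 0;
}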
Even the code below gives the same result:
#include <stdio.h>
union NumericType
{
float value;
int intvalue;
}Values;
int main()
{
Values.value = 1094795585;
printf("%f \n",Values.value);
return 0;
}
Result
./a.out
1094795648.000000
It only complicates things to speak of decimal digits, because this is binary arithmetic. To explain this, we can begin by looking at the set of integers in the single-precision format where all the integers are representable. Since the single-precision format has 23+1=24 bits of precision, that means the range is
0 to 2^24-1
This is not good or detailed enough for explaining so I'll refine it further to
0 to 2^24-2^0 in steps of 2^0
The next higher set is
0 to 2^25-2^1 in steps of 2^1
The next lower set is
0 to 2^23-2^-1 in steps of 2^-1
Your number, 1094795585 (0x41414141 in hex), falls in the range whose maximum is slightly less than 2^31. That range can be expressed in detail as 0 to 2^31-2^7 in steps of 2^7. This is logical because 2^31 is 7 powers of 2 greater than 2^24, so the increments must also be 7 powers of 2 greater.
Looking at the "next lower" and "next higher" values mentioned in another post, we see that the difference between them is 128, i.e. 2^7.
There's really nothing strange or weird or funny or even magic about this. It's actually absolutely clear and quite simple.
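You can verify the 2^7 spacing directly with a short sketch:

#include <stdio.h>
#include <math.h>

int main(void) {
    float x  = 1094795648.0f;            /* the value the original number rounds to */
    float up = nextafterf(x, INFINITY);  /* the next representable float above it   */
    printf("step between neighbouring floats here: %.0f\n", up - x);   /* 128 */
    return 0;
}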