C/C++ - Convert 24-bit signed integer to float

I'm programming in C++. I need to convert a 24-bit signed integer (stored in a 3-byte array) to float (normalizing to [-1.0,1.0]).
The platform is MSVC++ on x86 (which means the input is little-endian).
I tried this:
float convert(const unsigned char* src)
{
int i = src[2];
i = (i << 8) | src[1];
i = (i << 8) | src[0];
const float Q = 2.0 / ((1 << 24) - 1.0);
return (i + 0.5) * Q;
}
I'm not entirely sure, but it seems the results I'm getting from this code are incorrect. So, is my code wrong and if so, why?

You are not sign extending the 24 bits into an integer; the upper bits will always be zero. This code will work no matter what your int size is:
if (i & 0x800000)
i |= ~0xffffff;
Edit: Problem 2 is your scaling constant. In simple terms, you want to multiply by the new maximum and divide by the old maximum, assuming that 0 remains at 0.0 after conversion.
const float Q = 1.0 / 0x7fffff;
Finally, why are you adding 0.5 in the final conversion? I could understand if you were trying to round to an integer value, but you're going the other direction.
Edit 2: The source you point to has a very detailed rationale for your choices. Not the way I would have chosen, but perfectly defensible nonetheless. My advice for the multiplier still holds, but the maximum is different because of the 0.5 added factor:
const float Q = 1.0 / (0x7fffff + 0.5);
Because the positive and negative magnitudes are the same after the addition, this should scale both directions correctly.
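Putting the sign extension and the adjusted scale factor together, a minimal sketch of the corrected function (same little-endian 3-byte layout as the question, and keeping the question's +0.5 offset) would be:
float convert(const unsigned char* src)
{
    int i = src[2];
    i = (i << 8) | src[1];
    i = (i << 8) | src[0];

    if (i & 0x800000)       // sign-extend the 24-bit value to the full int
        i |= ~0xffffff;

    const float Q = 1.0f / (0x7fffff + 0.5f);
    return (i + 0.5f) * Q;  // the question's half-step offset, rescaled
}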

Since you are using a char array, it does not necessarily follow that the input is little endian by virtue of being x86; the char array makes the byte order architecture independent.
Your code is somewhat overcomplicated. A simple solution is to shift the 24-bit data to scale it to a 32-bit value (so that the machine's natural signed arithmetic will work), and then use a simple ratio of the result with the maximum possible value (which is INT_MAX less 256 because of the vacant lower 8 bits).
#include <limits.h>
float convert(const unsigned char* src)
{
int i = src[2] << 24 | src[1] << 16 | src[0] << 8 ;
return i / (float)(INT_MAX - 256) ;
}
Test code:
unsigned char* makeS24( unsigned int i, unsigned char* s24 )
{
s24[2] = (unsigned char)(i >> 16) ;
s24[1] = (unsigned char)((i >> 8) & 0xff);
s24[0] = (unsigned char)(i & 0xff);
return s24 ;
}
#include <iostream>
int main()
{
unsigned char s24[3] ;
volatile int x = INT_MIN / 2 ;
std::cout << convert( makeS24( 0x800000, s24 )) << std::endl ; // -1.0
std::cout << convert( makeS24( 0x7fffff, s24 )) << std::endl ; // 1.0
std::cout << convert( makeS24( 0, s24 )) << std::endl ; // 0.0
std::cout << convert( makeS24( 0xc00000, s24 )) << std::endl ; // -0.5
std::cout << convert( makeS24( 0x400000, s24 )) << std::endl ; // 0.5
}

Since the 24-bit range is not symmetrical, this is probably the best compromise.
It maps -((2^23)-1) to -1.0 and ((2^23)-1) to 1.0.
(Note: this is the same conversion style used by 24-bit WAV files.)
float convert( const unsigned char* src )
{
  int i = ( ( src[ 2 ] << 24 ) | ( src[ 1 ] << 16 ) | ( src[ 0 ] << 8 ) ) >> 8;
return ( ( float ) i ) / 8388607.0;
}

The solution that works for me:
/**
* Convert 24 bits that are stored in a char* and represent a float
* in little-endian format to a C float number.
*/
#include <cstring> // for memcpy
float convert(const unsigned char* src)
{
float num_float;
// concatenate the chars (short integers) and
// save them to a long int
long int num_integer = (
((src[2] & 0xFF) << 16) |
((src[1] & 0xFF) << 8) |
(src[0] & 0xFF)
) & 0xFFFFFFFF;
// copy the bits from the long int variable
// to the float.
memcpy(&num_float, &num_integer, 4);
return num_float;
}

Works for me:
float convert(const char* stream)
{
int fromStream =
(0x00 << 24) +
(stream[2] << 16) +
(stream[1] << 8) +
stream[0];
return (float)fromStream;
}

Looks like you're treating it as a 24-bit unsigned integer. If the most significant bit is 1, you need to make i negative by setting the remaining 8 bits to 1 as well.

I'm not sure if it's good programming practice, but this seems to work (at least with g++ on 32-bit Linux; I haven't tried it on anything else yet) and is certainly more elegant than extracting byte by byte from a char array, especially if it's not really a char array but rather a stream (in my case, a file stream) that you read from. (If it is a char array, you can use memcpy instead of istream::read.)
Just load the 24-bit value into the three less-significant bytes of a signed 32-bit integer (signed long). Then shift the long variable one byte to the left, so that the sign bit ends up where it's meant to be. Finally, just normalize the 32-bit variable, and you're all set.
union _24bit_LE{
    char access[4];
    signed long _long;
}_24bit_LE_buf;
float getnormalized24bitsample(std::ifstream& in){
    in.read(_24bit_LE_buf.access + 1, 3);
    return (_24bit_LE_buf._long << 8) / (0x7fffffff + .5);
}
(Strangely, it doesn't seem to work when you just read into the three more-significant bytes right away.)
EDIT: it turns out this method has some problems I don't fully understand yet. Better not to use it for the time being.

This one, taken from here, works for me.
typedef union {
struct {
unsigned short lo;
unsigned short hi;
} u16;
unsigned int u32;
signed int i32;
float f;
}Versatype32;
//Bipolar version (-1.0 to ~1.0)
void fInt24_to_float(float* dest, const char* src, size_t length) {
Versatype32 xTemp;
while (length--) {
xTemp.u32 = *(int*)src;
//Shift left by 8 so the 24-bit sign bit lands in bit 31
xTemp.u32 <<= 8; //(If it's negative, we'll know) (Props to Norman Wong)
//Convert to float
xTemp.f = (float)xTemp.i32;
//Skip divide down if zero
if (xTemp.u32 != 0) {
//Divide by (1<<31 or 2^31)
xTemp.u16.hi -= 0x0F80; //BAM! Bitmagic!
}
*dest = xTemp.f;
//Move to next set
dest++;
src += 3;
} //Are we done yet?
//Yes!
return;
}

Related

C++ Bitshift 4 int8_t into a normal integer (32 bit)

I had already asked a question about how to get 4 int8_t values into a 32-bit int, and I was told that I have to cast the int8_t to uint8_t first to pack them with bit shifting into a 32-bit integer.
int8_t offsetX = -10;
int8_t offsetY = 120;
int8_t offsetZ = -60;
using U = std::uint8_t;
int toShader = (U(offsetX) << 24) | (U(offsetY) << 16) | (U(offsetZ) << 8) | (0 << 0);
std::cout << (int)(toShader >> 24) << " "<< (int)(toShader >> 16) << " " << (int)(toShader >> 8) << std::endl;
My output is
-10 -2440 -624444
It's not what I expected, of course; does anyone have a solution?
In the shader I want to unpack the int16 later, and that is only possible with a 32-bit integer because GLSL does not have any other data types.
int offsetX = data[gl_InstanceID * 3 + 2] >> 24;
int offsetY = data[gl_InstanceID * 3 + 2] >> 16 ;
int offsetZ = data[gl_InstanceID * 3 + 2] >> 8 ;
What is written in the square brackets does not matter; it is about the correct shifting of the bits, or the casting after the bracket.
If any of the offsets is negative, then the shift results in undefined behaviour.
Solution: Convert the offsets to an unsigned type first.
However, this brings another potential problem: if you convert to unsigned, then negative numbers will have very large values with bits set in the most significant bytes, and OR-ing with those bits will always produce 1s there regardless of offsetX and offsetY. One solution is to convert to a small unsigned type (std::uint8_t); another is to mask the unused bytes. The former is probably simpler:
using U = std::uint8_t;
int third = U(offsetX) << 24u
| U(offsetY) << 16u
| U(offsetZ) << 8u
| 0u << 0u;
I think you're forgetting to mask the bits that you care about before shifting them.
Perhaps this is what you're looking for:
int32 offsetX = (data[gl_InstanceID * 3 + 2] & 0xFF000000) >> 24;
int32 offsetY = (data[gl_InstanceID * 3 + 2] & 0x00FF0000) >> 16 ;
int32 offsetZ = (data[gl_InstanceID * 3 + 2] & 0x0000FF00) >> 8 ;
if (offsetX & 0x80) offsetX |= 0xFFFFFF00;
if (offsetY & 0x80) offsetY |= 0xFFFFFF00;
if (offsetZ & 0x80) offsetZ |= 0xFFFFFF00;
Without the bit mask, the X part will end up in offsetY, and the X and Y part in offsetZ.
On the CPU side you can use a union to avoid bit shifts, bit masking, and branches ...
int8_t x,y,z,w; // your 8bit ints
int32_t i; // your 32bit int
union my_union // just helper union for the casting
{
int8_t i8[4];
int32_t i32;
} a;
// 4x8bit -> 32bit
a.i8[0]=x;
a.i8[1]=y;
a.i8[2]=z;
a.i8[3]=w;
i=a.i32;
// 32bit -> 4x8bit
a.i32=i;
x=a.i8[0];
y=a.i8[1];
z=a.i8[2];
w=a.i8[3];
If you do not like unions, the same can be done with pointers...
Beware: on the GLSL side this is not possible (neither unions nor pointers), and you have to use bit shifts and masks as in the other answer...
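As a CPU-side sanity check, the shift-based packing can be round-tripped by pulling each byte back out and sign-extending it by hand; this is only a sketch, and the unpack lambda is made up for illustration:
#include <cstdint>
#include <iostream>

int main()
{
    std::int8_t offsetX = -10, offsetY = 120, offsetZ = -60;
    using U = std::uint8_t;

    // Pack into an unsigned 32-bit value so the shifts are well defined.
    std::uint32_t packed = (std::uint32_t(U(offsetX)) << 24)
                         | (std::uint32_t(U(offsetY)) << 16)
                         | (std::uint32_t(U(offsetZ)) << 8);

    // Extract one byte and sign-extend it manually (fully portable).
    auto unpack = [](std::uint32_t value, int byteIndex) {
        int b = int((value >> (byteIndex * 8)) & 0xFF);
        return b >= 128 ? b - 256 : b;
    };

    std::cout << unpack(packed, 3) << " "    // -10
              << unpack(packed, 2) << " "    // 120
              << unpack(packed, 1) << "\n";  // -60
}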

How to store bytes of a float value in a string and retrieve the value afterwards?

I'm trying to figure out a way to send a sequence of float values over the network. I've seen various answers for this, and this is my current attempt:
#include <iostream>
#include <cstring>
union floatBytes
{
float value;
char bytes[sizeof (float)];
};
int main()
{
floatBytes data1;
data1.value = 3.1;
std::string string(data1.bytes);
floatBytes data2;
strncpy(data2.bytes, string.c_str(), sizeof (float));
std::cout << data2.value << std::endl; // <-- prints "3.1"
return 0;
}
Which works nicely (though I suspect I might run into problems when sending this string to other systems, please comment).
However, if the float value is a round number (like 3.0 instead of 3.1) then this doesn't work.
data1.value = 3;
std::string string(data1.bytes);
floatBytes data2;
strncpy(data2.bytes, string.c_str(), sizeof (float));
std::cout << data2.value << std::endl; // <-- prints "0"
So what is the preferred way of storing the bytes of a float value, send it, and parse it "back" to a float value?
Never use str* functions this way. They are intended to deal with C strings, and the byte representation of a float is certainly not a valid C string. What you need is to send/receive your data in a common representation. There exist a lot of them, but basically two kinds: a textual representation or a byte encoding.
Textual representation: this mostly consists of converting your float value to a string using a stringstream, then extracting the string and sending it over the connection.
Byte representation: this is much more problematic, because if the two machines are not using the same byte ordering, float encoding, etc., then you can't send the raw bytes as-is. But there exists (at least) one standard, known as XDR (RFC 4506), that specifies how to encode the bytes of a float/double value natively encoded with IEEE 754.
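For the textual option, a minimal sketch could look like the following; the function names are made up, and the precision is chosen so an IEEE 754 float survives the round trip:
#include <sstream>
#include <string>

std::string float_to_text(float value)
{
    std::ostringstream out;
    out.precision(9);        // 9 significant digits round-trip a 32-bit float
    out << value;
    return out.str();
}

float text_to_float(const std::string& text)
{
    std::istringstream in(text);
    float value = 0.0f;
    in >> value;
    return value;
}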
You can reconstitute a float portably with rather involved code, which I maintain on my IEEE 754 GitHub site. If you break the float into bytes using those functions, and reconstitute it using the other function, you will obtain the same value in the receiver as you sent, regardless of float encoding, up to the precision of the format.
https://github.com/MalcolmMcLean/ieee754
float freadieee754f(FILE *fp, int bigendian)
{
unsigned long buff = 0;
unsigned long buff2 = 0;
unsigned long mask;
int sign;
int exponent;
int shift;
int i;
int significandbits = 23;
int expbits = 8;
double fnorm = 0.0;
double bitval;
double answer;
for(i=0;i<4;i++)
buff = (buff << 8) | fgetc(fp);
if(!bigendian)
{
for(i=0;i<4;i++)
{
buff2 <<= 8;
buff2 |= (buff & 0xFF);
buff >>= 8;
}
buff = buff2;
}
sign = (buff & 0x80000000) ? -1 : 1;
mask = 0x00400000;
exponent = (buff & 0x7F800000) >> 23;
bitval = 0.5;
for(i=0;i<significandbits;i++)
{
if(buff & mask)
fnorm += bitval;
bitval /= 2;
mask >>= 1;
}
if(exponent == 0 && fnorm == 0.0)
return 0.0f;
shift = exponent - ((1 << (expbits - 1)) - 1); /* exponent = shift + bias */
if(shift == 128 && fnorm != 0.0)
return (float) sqrt(-1.0);
if(shift == 128 && fnorm == 0.0)
{
#ifdef INFINITY
return sign == 1 ? INFINITY : -INFINITY;
#endif
return (sign * 1.0f)/0.0f;
}
if(shift > -127)
{
answer = ldexp(fnorm + 1.0, shift);
return (float) answer * sign;
}
else
{
if(fnorm == 0.0)
{
return 0.0f;
}
shift = -126;
while (fnorm < 1.0)
{
fnorm *= 2;
shift--;
}
answer = ldexp(fnorm, shift);
return (float) answer * sign;
}
}
int fwriteieee754f(float x, FILE *fp, int bigendian)
{
int shift;
unsigned long sign, exp, hibits, buff;
double fnorm, significand;
int expbits = 8;
int significandbits = 23;
/* zero (can't handle signed zero) */
if (x == 0)
{
buff = 0;
goto writedata;
}
/* infinity */
if (x > FLT_MAX)
{
buff = 128 + ((1 << (expbits - 1)) - 1);
buff <<= (31 - expbits);
goto writedata;
}
/* -infinity */
if (x < -FLT_MAX)
{
buff = 128 + ((1 << (expbits - 1)) - 1);
buff <<= (31 - expbits);
buff |= (1 << 31);
goto writedata;
}
/* NaN - dodgy because many compilers optimise out this test, but
* there is no portable isnan() */
if (x != x)
{
buff = 128 + ((1 << (expbits - 1)) - 1);
buff <<= (31 - expbits);
buff |= 1234;
goto writedata;
}
/* get the sign */
if (x < 0) { sign = 1; fnorm = -x; }
else { sign = 0; fnorm = x; }
/* get the normalized form of f and track the exponent */
shift = 0;
while (fnorm >= 2.0) { fnorm /= 2.0; shift++; }
while (fnorm < 1.0) { fnorm *= 2.0; shift--; }
/* check for denormalized numbers */
if (shift < -126)
{
while (shift < -126) { fnorm /= 2.0; shift++; }
shift = -127; /* biased exponent field of zero for single-precision denormals */
}
/* out of range. Set to infinity */
else if (shift > 128)
{
buff = 128 + ((1 << (expbits - 1)) - 1);
buff <<= (31 - expbits);
buff |= (sign << 31);
goto writedata;
}
else
fnorm = fnorm - 1.0; /* take the significant bit off mantissa */
/* calculate the integer form of the significand */
/* hold it in a double for now */
significand = fnorm * ((1LL << significandbits) + 0.5f);
/* get the biased exponent */
exp = shift + ((1 << (expbits - 1)) - 1); /* shift + bias */
hibits = (long)(significand);
buff = (sign << 31) | (exp << (31 - expbits)) | hibits;
writedata:
/* write the bytes out to the stream */
if (bigendian)
{
fputc((buff >> 24) & 0xFF, fp);
fputc((buff >> 16) & 0xFF, fp);
fputc((buff >> 8) & 0xFF, fp);
fputc(buff & 0xFF, fp);
}
else
{
fputc(buff & 0xFF, fp);
fputc((buff >> 8) & 0xFF, fp);
fputc((buff >> 16) & 0xFF, fp);
fputc((buff >> 24) & 0xFF, fp);
}
return ferror(fp);
}
Let me first clear up the issue with your code.
You are using strncpy, which stops copying the moment it sees '\0'. This simply means that it is not copying all your data, and thus the 0 is expected.
Using memcpy instead of strncpy should do the trick.
I just tried this C++ code
#include <cstdio>
#include <cstring>
int main(){
float f = 3.34;
printf("before = %f\n", f);
char a[10];
memcpy(a, (char*) &f, sizeof(float));
a[sizeof(float)] = '\0'; // For sending over network
float f1 = 1.99;
memcpy((char*) &f1, a, sizeof(float));
printf("after = %f\n", f1);
return 0;
}
I get the correct output as expected.
Now coming to the correctness: I am not sure if this classifies as Undefined Behaviour. It could also be called a case of type punning, in which case it would be implementation-defined (and I assume any sane compiler would not muck this up).
This is all okay as long as it stays within the same program.
Now for your problem of sending it over the network: I don't think this would be the correct way of doing it. As @Jean-Baptiste Yunès mentioned, the two systems could be using different representations for float, or even different byte orderings.
In that case you need to use a library to convert it to some standard representation like IEEE 754.
The main problem is that C++ does not enforce IEEE 754, so the representation of your float may work between two computers and fail with another.
The problem has to be divided in two:
How to encode and decode a float to a shared format
How to serialize the value to a char array for transmission.
How to encode/decode a float to a common format
C++ does not impose a specific bit format; this means one machine might transfer a float and the value on the other machine would be different.
Example of 1.0f
Machine 1: sign + 8-bit exponent + 23-bit mantissa: 0-01111111-00000000000000000000000
Machine 2: sign + 7-bit exponent + 24-bit mantissa: 0-0111111-000000000000000000000000
Sending from machine 1 to machine 2 without a shared format would result in machine 2 receiving 0-0111111-100000000000000000000000 = 1.5
This is a complex topic and may be difficult to solve completely cross-platform. C++ includes a convenience trait that helps somewhat with this:
bool isIeee754 = std::numeric_limits<float>::is_iec559;
The main problem is that the compiler may not know the exact CPU architecture on which its output will run, so this is only half reliable. Fortunately, the bit format is correct in most cases. Additionally, if the format is not known, it may be very difficult to normalize it.
We might design some code to detect the float format, or we might decide to skip those cases as "unsupported platforms".
In the case of 32-bit IEEE 754, we may easily extract the mantissa, sign and exponent with bitwise operations, once the float's bits have been copied into an integer:
float input;
uint32_t bits;
memcpy(&bits, &input, sizeof bits); // reinterpret the float's bit pattern
uint8_t exponent = (bits>>23)&0xFF;
uint32_t mantissa = (bits&0x7FFFFF);
bool sign = (bits>>31);
A standard format for transmission could well be 32-bit IEEE 754, so most of the time it would work without even encoding:
bool isStandard32BitIeee754( float f)
{
// TODO: improve for "special" targets.
return std::numeric_limits<decltype(f)>::is_iec559 && sizeof(f)==4;
}
Finally, and especially for those non-standard platforms, special values for NaN and infinity need to be reserved.
Serialization of a float for transmission
The second issue is much simpler: it is just a matter of transforming the standardized binary into a char array. However, not all characters may be acceptable on the network, especially if it is used in an HTTP protocol or equivalent.
For this example, I will convert the stream to hexadecimal encoding (an alternative could be Base64, etc.).
Note: I know there are functions which may help, but I deliberately use simple C++ to show the steps at as low a level as possible.
void toHex( uint8_t &out1, uint8_t &out2, uint8_t in)
{
out1 = in>>4;
out1 = out1>9? out1-10+'A' : out1+'0';
out2 = in&0xF;
out2 = out2>9? out2-10+'A' : out2+'0';
}
void standardFloatToHex (float in, std::string &out)
{
union Aux
{
uint8_t c[4];
float f;
};
out.resize(8);
Aux converter;
converter.f = in;
for (int i=0; i<4; i++)
{
// Could use std::stringstream as an alternative.
uint8_t c1, c2, c = converter.c[i];
toHex(c1, c2, c);
out[i*2] = c1;
out[i*2+1] = c2;
}
}
Finally, the equivalent decoding is required on the other side.
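A possible decoder for that side, mirroring toHex/standardFloatToHex above (the names fromHex and hexToStandardFloat are made up here):
uint8_t fromHex(uint8_t c)
{
    return c > '9' ? c - 'A' + 10 : c - '0';
}

float hexToStandardFloat(const std::string& in)
{
    union Aux
    {
        uint8_t c[4];
        float f;
    };
    Aux converter;
    for (int i = 0; i < 4; i++)
        converter.c[i] = (fromHex(in[i * 2]) << 4) | fromHex(in[i * 2 + 1]);
    return converter.f;
}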
Conclusion
The standardization of the float value into a shared bit format has been explained. Some implementation-dependent conversions may be required.
The serialization for most common network protocols is shown.

How to extract a DWORD color variable into its parameters? (decimal type)

I have the example :
unsigned int dwColor = 0xAABBCCFF; //Light blue color
And its parameters from left to right are: "alpha, red, green, blue"; each parameter takes two hexadecimal digits.
The maximum value of each parameter is 255; lowest : 0
And how do I extract and then convert all parameters of a DWORD color to decimals?
I'd like the value range "0.00 -> 1.00".
For example :
float alpha = convert_to_decimal(0xAA); //It gives 0.666f
float red = convert_to_decimal(0xBB); //It gives 0.733f
float green = convert_to_decimal(0xCC); //It gives 0.800f
float blue = convert_to_decimal(0xFF); //It gives 1.000f
EDIT: I've just seen union, but the answerer says it's UB (Undefined Behaviour). Does anyone know a better solution? :)
I usually use a union:
union color
{
unsigned int value;
unsigned char component[4];
};
color c;
c.value = 0xAABBCCFF;
unsigned char r = c.component[0];
unsigned char g = c.component[1];
unsigned char b = c.component[2];
unsigned char a = c.component[3];
If you need to treat it as a float value:
float fr = c.component[0] / 255.0f;
float fg = c.component[1] / 255.0f;
float fb = c.component[2] / 255.0f;
float fa = c.component[3] / 255.0f;
EDIT:
As mentioned in the comments below, this use of union is Undefined Behaviour (UB), see this question from Luchian Grigore.
EDIT 2:
So, another way to break a DWORD into components avoiding the union is using some bitwise magic:
#define GET_COMPONENT(color, index) (((0xFF << (index * 8)) & color) >> (index * 8))
But I do not advise the macro solution, I think is better to use a function:
unsigned int get_component(unsigned int color, unsigned int index)
{
const unsigned int shift = index * 8;
const unsigned int mask = 0xFF << shift;
return (color & mask) >> shift;
}
How does it work? Let's suppose we call get_component(0xAABBCCFF, 0):
shift = 0 * 8
shift = 0
mask = 0xFF << 0
mask = 0x000000FF
0x000000FF &
0xAABBCCFF
----------
0x000000FF
0x000000FF >> 0 = 0xFF
Let's suppose we call get_component(0xAABBCCFF, 2):
shift = 2 * 8
shift = 16
mask = 0xFF << 16
mask = 0x00FF0000
0x00FF0000 &
0xAABBCCFF
----------
0x00BB0000
0x00BB0000 >> 16 = 0xBB
Warning! not all color formats will match that pattern!
But IMHO, the neatest solution is to combine the function with an enum, since we're working with a limited set of values for the index:
enum color_component
{
A,B,G,R
};
unsigned int get_component(unsigned int color, color_component component)
{
switch (component)
{
case R:
case G:
case B:
case A:
{
const unsigned int shift = component * 8;
const unsigned int mask = 0xFF << shift;
return (color & mask) >> shift;
}
default:
throw std::invalid_argument("invalid color component");
}
return 0;
}
The last approach ensures that the bitwise operations will only be performed if the input parameters are valid. This would be an example of usage:
std::cout
<< "R: " << get_component(the_color, R) / 255.0f << '\n'
<< "G: " << get_component(the_color, G) / 255.0f << '\n'
<< "B: " << get_component(the_color, B) / 255.0f << '\n'
<< "A: " << get_component(the_color, A) / 255.0f << '\n';
And here is a live demo.

How to improve fixed point square-root for small values

I am using Anthony Williams' fixed point library described in the Dr Dobb's article "Optimizing Math-Intensive Applications with Fixed-Point Arithmetic" to calculate the distance between two geographical points using the Rhumb Line method.
This works well enough when the distance between the points is significant (greater than a few kilometres), but it is very poor at smaller distances. The worst case is when the two points are equal or nearly equal: the result is a distance of 194 metres, while I need precision of at least 1 metre at distances >= 1 metre.
By comparison with a double-precision floating-point implementation, I have traced the problem to the fixed::sqrt() function, which performs poorly at small values:
x std::sqrt(x) fixed::sqrt(x) error
----------------------------------------------------
0 0 3.05176e-005 3.05176e-005
1e-005 0.00316228 0.00316334 1.06005e-006
2e-005 0.00447214 0.00447226 1.19752e-007
3e-005 0.00547723 0.0054779 6.72248e-007
4e-005 0.00632456 0.00632477 2.12746e-007
5e-005 0.00707107 0.0070715 4.27244e-007
6e-005 0.00774597 0.0077467 7.2978e-007
7e-005 0.0083666 0.00836658 1.54875e-008
8e-005 0.00894427 0.00894427 1.085e-009
Correcting the result for fixed::sqrt(0) is trivial by treating it as a special case, but that will not solve the problem for small non-zero distances, where the error starts at 194 metres and converges toward zero with increasing distance. I probably need at least an order of magnitude improvement in precision toward zero.
The fixed::sqrt() algorithm is briefly explained on page 4 of the article linked above, but I am struggling to follow it, let alone determine whether it is possible to improve it. The code for the function is reproduced below:
fixed fixed::sqrt() const
{
unsigned const max_shift=62;
uint64_t a_squared=1LL<<max_shift;
unsigned b_shift=(max_shift+fixed_resolution_shift)/2;
uint64_t a=1LL<<b_shift;
uint64_t x=m_nVal;
while(b_shift && a_squared>x)
{
a>>=1;
a_squared>>=2;
--b_shift;
}
uint64_t remainder=x-a_squared;
--b_shift;
while(remainder && b_shift)
{
uint64_t b_squared=1LL<<(2*b_shift-fixed_resolution_shift);
int const two_a_b_shift=b_shift+1-fixed_resolution_shift;
uint64_t two_a_b=(two_a_b_shift>0)?(a<<two_a_b_shift):(a>>-two_a_b_shift);
while(b_shift && remainder<(b_squared+two_a_b))
{
b_squared>>=2;
two_a_b>>=1;
--b_shift;
}
uint64_t const delta=b_squared+two_a_b;
if((2*remainder)>delta)
{
a+=(1LL<<b_shift);
remainder-=delta;
if(b_shift)
{
--b_shift;
}
}
}
return fixed(internal(),a);
}
Note that m_nVal is the internal fixed-point representation value; it is an int64_t and the representation uses Q36.28 format (fixed_resolution_shift = 28). The representation itself has enough precision for at least 8 decimal places, and as a fraction of equatorial arc is good for distances of around 0.14 metres, so the limitation is not the fixed-point representation.
Use of the rhumb line method is a standards body recommendation for this application so cannot be changed, and in any case a more accurate square-root function is likely to be required elsewhere in the application or in future applications.
Question: Is it possible to improve the accuracy of the fixed::sqrt() algorithm for small non-zero values while still maintaining its bounded and deterministic convergence?
Additional Information
The test code used to generate the table above:
#include <cmath>
#include <iostream>
#include "fixed.hpp"
int main()
{
double error = 1.0 ;
for( double x = 0.0; error > 1e-8; x += 1e-5 )
{
double fixed_root = sqrt(fixed(x)).as_double() ;
double std_root = std::sqrt(x) ;
error = std::fabs(fixed_root - std_root) ;
std::cout << x << '\t' << std_root << '\t' << fixed_root << '\t' << error << std::endl ;
}
}
Conclusion
In the light of Justin Peel's solution and analysis, and comparison with the algorithm in "The Neglected Art of Fixed Point Arithmetic", I have adapted the latter as follows:
fixed fixed::sqrt() const
{
uint64_t a = 0 ; // root accumulator
uint64_t remHi = 0 ; // high part of partial remainder
uint64_t remLo = m_nVal ; // low part of partial remainder
uint64_t testDiv ;
int count = 31 + (fixed_resolution_shift >> 1); // Loop counter
do
{
// get 2 bits of arg
remHi = (remHi << 2) | (remLo >> 62); remLo <<= 2 ;
// Get ready for the next bit in the root
a <<= 1;
// Test radical
testDiv = (a << 1) + 1;
if (remHi >= testDiv)
{
remHi -= testDiv;
a += 1;
}
} while (count-- != 0);
return fixed(internal(),a);
}
While this gives far greater precision, the improvement I needed is not achievable this way. The Q36.28 format alone just about provides the precision I need, but it is not possible to perform a sqrt() without losing a few bits of precision. However, some lateral thinking provides a better solution: my application tests the calculated distance against some distance limit, and the rather obvious solution in hindsight is to test the square of the distance against the square of the limit!
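In code the final comparison is then trivial; a sketch, assuming the fixed type provides multiplication and comparison operators (which the library does):
// Compare squared distance against squared limit; no sqrt() needed.
bool within_limit(const fixed& distance_squared, const fixed& limit)
{
    return distance_squared <= limit * limit;
}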
Given that sqrt(ab) = sqrt(a)sqrt(b), then can't you just trap the case where your number is small and shift it up by a given number of bits, compute the root and shift that back down by half the number of bits to get the result?
I.e.
sqrt(n) = sqrt(n.2^k)/sqrt(2^k)
= sqrt(n.2^k).2^(-k/2)
E.g. Choose k = 28 for any n less than 2^8.
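A sketch of that pre-scaling trick, expressed purely through the interface visible in the question's test code (fixed(double), the free sqrt(fixed), and as_double()); the function name is made up, and k must be even:
double sqrt_small(double x)
{
    const int k = 28;                                        // even, so k/2 is exact
    fixed scaled(x * double(1LL << k));                      // n * 2^k
    return sqrt(scaled).as_double() / double(1LL << (k / 2)); // sqrt(n*2^k) * 2^(-k/2)
}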
The original implementation obviously has some problems. I became frustrated with trying to fix them all with the way the code is currently done and ended up going at it with a different approach. I could probably fix the original now, but I like my way better anyway.
I treat the input number as being in Q64 to start, which is the same as shifting by 28 and then shifting back by 14 afterwards (the sqrt halves it). However, if you just do that, then the accuracy is limited to 1/2^14 = 6.1035e-5 because the last 14 bits will be 0. To remedy this, I then shift a and remainder correctly, and to keep filling in digits I do the loop again. The code can be made more efficient and cleaner, but I'll leave that to someone else.
The accuracy shown below is pretty much as good as you can get with Q36.28. If you compare the fixed-point sqrt with the floating-point sqrt of the input number after it has been truncated by fixed point (convert it to fixed point and back), then the errors are around 2e-9 (I didn't do this in the code below, but it requires a one-line change). This is right in line with the best accuracy for Q36.28, which is 1/2^28 = 3.7529e-9.
By the way, one big mistake in the original code is that the term where m = 0 is never considered so that bit can never be set. Anyway, here is the code. Enjoy!
#include <iostream>
#include <cmath>
typedef unsigned long uint64_t;
uint64_t sqrt(uint64_t in_val)
{
const uint64_t fixed_resolution_shift = 28;
const unsigned max_shift=62;
uint64_t a_squared=1ULL<<max_shift;
unsigned b_shift=(max_shift>>1) + 1;
uint64_t a=1ULL<<(b_shift - 1);
uint64_t x=in_val;
while(b_shift && a_squared>x)
{
a>>=1;
a_squared>>=2;
--b_shift;
}
uint64_t remainder=x-a_squared;
--b_shift;
while(remainder && b_shift)
{
uint64_t b_squared=1ULL<<(2*(b_shift - 1));
uint64_t two_a_b=(a<<b_shift);
while(b_shift && remainder<(b_squared+two_a_b))
{
b_squared>>=2;
two_a_b>>=1;
--b_shift;
}
uint64_t const delta=b_squared+two_a_b;
if((remainder)>=delta && b_shift)
{
a+=(1ULL<<(b_shift - 1));
remainder-=delta;
--b_shift;
}
}
a <<= (fixed_resolution_shift/2);
b_shift = (fixed_resolution_shift/2) + 1;
remainder <<= (fixed_resolution_shift);
while(remainder && b_shift)
{
uint64_t b_squared=1ULL<<(2*(b_shift - 1));
uint64_t two_a_b=(a<<b_shift);
while(b_shift && remainder<(b_squared+two_a_b))
{
b_squared>>=2;
two_a_b>>=1;
--b_shift;
}
uint64_t const delta=b_squared+two_a_b;
if((remainder)>=delta && b_shift)
{
a+=(1ULL<<(b_shift - 1));
remainder-=delta;
--b_shift;
}
}
return a;
}
double fixed2float(uint64_t x)
{
return static_cast<double>(x) * pow(2.0, -28.0);
}
uint64_t float2fixed(double f)
{
return static_cast<uint64_t>(f * pow(2, 28.0));
}
void finderror(double num)
{
double root1 = fixed2float(sqrt(float2fixed(num)));
double root2 = pow(num, 0.5);
std::cout << "input: " << num << ", fixed sqrt: " << root1 << " " << ", float sqrt: " << root2 << ", finderror: " << root2 - root1 << std::endl;
}
int main()
{
finderror(0);
finderror(1e-5);
finderror(2e-5);
finderror(3e-5);
finderror(4e-5);
finderror(5e-5);
finderror(pow(2.0,1));
finderror(1ULL<<35);
}
with the output of the program being
input: 0, fixed sqrt: 0 , float sqrt: 0, finderror: 0
input: 1e-05, fixed sqrt: 0.00316207 , float sqrt: 0.00316228, finderror: 2.10277e-07
input: 2e-05, fixed sqrt: 0.00447184 , float sqrt: 0.00447214, finderror: 2.97481e-07
input: 3e-05, fixed sqrt: 0.0054772 , float sqrt: 0.00547723, finderror: 2.43815e-08
input: 4e-05, fixed sqrt: 0.00632443 , float sqrt: 0.00632456, finderror: 1.26255e-07
input: 5e-05, fixed sqrt: 0.00707086 , float sqrt: 0.00707107, finderror: 2.06055e-07
input: 2, fixed sqrt: 1.41421 , float sqrt: 1.41421, finderror: 1.85149e-09
input: 3.43597e+10, fixed sqrt: 185364 , float sqrt: 185364, finderror: 2.24099e-09
I'm not sure how you're getting the numbers from fixed::sqrt() shown in the table.
Here's what I do:
#include <stdio.h>
#include <math.h>
#define __int64 long long // gcc doesn't know __int64
typedef __int64 fixed;
#define FRACT 28
#define DBL2FIX(x) ((fixed)((double)(x) * (1LL << FRACT)))
#define FIX2DBL(x) ((double)(x) / (1LL << FRACT))
// De-++-ified code from
// http://www.justsoftwaresolutions.co.uk/news/optimizing-applications-with-fixed-point-arithmetic.html
fixed sqrtfix0(fixed num)
{
static unsigned const fixed_resolution_shift=FRACT;
unsigned const max_shift=62;
unsigned __int64 a_squared=1LL<<max_shift;
unsigned b_shift=(max_shift+fixed_resolution_shift)/2;
unsigned __int64 a=1LL<<b_shift;
unsigned __int64 x=num;
unsigned __int64 remainder;
while(b_shift && a_squared>x)
{
a>>=1;
a_squared>>=2;
--b_shift;
}
remainder=x-a_squared;
--b_shift;
while(remainder && b_shift)
{
unsigned __int64 b_squared=1LL<<(2*b_shift-fixed_resolution_shift);
int const two_a_b_shift=b_shift+1-fixed_resolution_shift;
unsigned __int64 two_a_b=(two_a_b_shift>0)?(a<<two_a_b_shift):(a>>-two_a_b_shift);
unsigned __int64 delta;
while(b_shift && remainder<(b_squared+two_a_b))
{
b_squared>>=2;
two_a_b>>=1;
--b_shift;
}
delta=b_squared+two_a_b;
if((2*remainder)>delta)
{
a+=(1LL<<b_shift);
remainder-=delta;
if(b_shift)
{
--b_shift;
}
}
}
return (fixed)a;
}
// Adapted code from
// http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Digit-by-digit_calculation
fixed sqrtfix1(fixed num)
{
fixed res = 0;
fixed bit = (fixed)1 << 62; // The second-to-top bit is set
int s = 0;
// Scale num up to get more significant digits
while (num && num < bit)
{
num <<= 1;
s++;
}
if (s & 1)
{
num >>= 1;
s--;
}
s = 14 - (s >> 1);
while (bit != 0)
{
if (num >= res + bit)
{
num -= res + bit;
res = (res >> 1) + bit;
}
else
{
res >>= 1;
}
bit >>= 2;
}
if (s >= 0) res <<= s;
else res >>= -s;
return res;
}
int main(void)
{
double testData[] =
{
0,
1e-005,
2e-005,
3e-005,
4e-005,
5e-005,
6e-005,
7e-005,
8e-005,
};
int i;
for (i = 0; i < sizeof(testData) / sizeof(testData[0]); i++)
{
double x = testData[i];
fixed xf = DBL2FIX(x);
fixed sqf0 = sqrtfix0(xf);
fixed sqf1 = sqrtfix1(xf);
double sq0 = FIX2DBL(sqf0);
double sq1 = FIX2DBL(sqf1);
printf("%10.8f: "
"sqrtfix0()=%10.8f / err=%e "
"sqrt()=%10.8f "
"sqrtfix1()=%10.8f / err=%e\n",
x,
sq0, fabs(sq0 - sqrt(x)),
sqrt(x),
sq1, fabs(sq1 - sqrt(x)));
}
printf("sizeof(double)=%d\n", (int)sizeof(double));
return 0;
}
And here's what I get (with gcc and Open Watcom):
0.00000000: sqrtfix0()=0.00003052 / err=3.051758e-05 sqrt()=0.00000000 sqrtfix1()=0.00000000 / err=0.000000e+00
0.00001000: sqrtfix0()=0.00311279 / err=4.948469e-05 sqrt()=0.00316228 sqrtfix1()=0.00316207 / err=2.102766e-07
0.00002000: sqrtfix0()=0.00445557 / err=1.656955e-05 sqrt()=0.00447214 sqrtfix1()=0.00447184 / err=2.974807e-07
0.00003000: sqrtfix0()=0.00543213 / err=4.509667e-05 sqrt()=0.00547723 sqrtfix1()=0.00547720 / err=2.438148e-08
0.00004000: sqrtfix0()=0.00628662 / err=3.793423e-05 sqrt()=0.00632456 sqrtfix1()=0.00632443 / err=1.262553e-07
0.00005000: sqrtfix0()=0.00701904 / err=5.202484e-05 sqrt()=0.00707107 sqrtfix1()=0.00707086 / err=2.060551e-07
0.00006000: sqrtfix0()=0.00772095 / err=2.501943e-05 sqrt()=0.00774597 sqrtfix1()=0.00774593 / err=3.390476e-08
0.00007000: sqrtfix0()=0.00836182 / err=4.783859e-06 sqrt()=0.00836660 sqrtfix1()=0.00836649 / err=1.086198e-07
0.00008000: sqrtfix0()=0.00894165 / err=2.621519e-06 sqrt()=0.00894427 sqrtfix1()=0.00894409 / err=1.777289e-07
sizeof(double)=8
EDIT:
I've missed the fact that the above sqrtfix1() won't work well with large arguments. It can be fixed by appending 28 zeroes to the argument and essentially calculating the exact integer square root of that. This comes at the expense of doing internal calculations in 128-bit arithmetic, but it's pretty straightforward:
fixed sqrtfix2(fixed num)
{
unsigned __int64 numl, numh;
unsigned __int64 resl = 0, resh = 0;
unsigned __int64 bitl = 0, bith = (unsigned __int64)1 << 26;
numl = num << 28;
numh = num >> (64 - 28);
while (bitl | bith)
{
unsigned __int64 tmpl = resl + bitl;
unsigned __int64 tmph = resh + bith + (tmpl < resl);
tmph = numh - tmph - (numl < tmpl);
tmpl = numl - tmpl;
if (tmph & 0x8000000000000000ULL)
{
resl >>= 1;
if (resh & 1) resl |= 0x8000000000000000ULL;
resh >>= 1;
}
else
{
numl = tmpl;
numh = tmph;
resl >>= 1;
if (resh & 1) resl |= 0x8000000000000000ULL;
resh >>= 1;
resh += bith + (resl + bitl < resl);
resl += bitl;
}
bitl >>= 2;
if (bith & 1) bitl |= 0x4000000000000000ULL;
if (bith & 2) bitl |= 0x8000000000000000ULL;
bith >>= 2;
}
return resl;
}
And it gives pretty much the same results as this answer (slightly better for 3.43597e+10):
0.00000000: sqrtfix0()=0.00003052 / err=3.051758e-05 sqrt()=0.00000000 sqrtfix2()=0.00000000 / err=0.000000e+00
0.00001000: sqrtfix0()=0.00311279 / err=4.948469e-05 sqrt()=0.00316228 sqrtfix2()=0.00316207 / err=2.102766e-07
0.00002000: sqrtfix0()=0.00445557 / err=1.656955e-05 sqrt()=0.00447214 sqrtfix2()=0.00447184 / err=2.974807e-07
0.00003000: sqrtfix0()=0.00543213 / err=4.509667e-05 sqrt()=0.00547723 sqrtfix2()=0.00547720 / err=2.438148e-08
0.00004000: sqrtfix0()=0.00628662 / err=3.793423e-05 sqrt()=0.00632456 sqrtfix2()=0.00632443 / err=1.262553e-07
0.00005000: sqrtfix0()=0.00701904 / err=5.202484e-05 sqrt()=0.00707107 sqrtfix2()=0.00707086 / err=2.060551e-07
0.00006000: sqrtfix0()=0.00772095 / err=2.501943e-05 sqrt()=0.00774597 sqrtfix2()=0.00774593 / err=3.390476e-08
0.00007000: sqrtfix0()=0.00836182 / err=4.783859e-06 sqrt()=0.00836660 sqrtfix2()=0.00836649 / err=1.086198e-07
0.00008000: sqrtfix0()=0.00894165 / err=2.621519e-06 sqrt()=0.00894427 sqrtfix2()=0.00894409 / err=1.777289e-07
2.00000000: sqrtfix0()=1.41419983 / err=1.373327e-05 sqrt()=1.41421356 sqrtfix2()=1.41421356 / err=1.851493e-09
34359700000.00000000: sqrtfix0()=185363.69654846 / err=5.097361e-06 sqrt()=185363.69655356 sqrtfix2()=185363.69655356 / err=1.164153e-09
Many many years ago I worked on a demo program for a small computer our outfit had built. The computer had a built-in square-root instruction, and we built a simple program to demonstrate the computer doing 16-bit add/subtract/multiply/divide/square-root on a TTY. Alas, it turned out that there was a serious bug in the square root instruction, but we had promised to demo the function. So we created an array of the squares of the values 1-255, then used a simple lookup to match the value typed in to one of the array values. The index was the square root.

How to determine how many bytes an integer needs?

I'm looking for the most efficient way to calculate the minimum number of bytes needed to store an integer without losing precision.
e.g.
int: 10 = 1 byte
int: 257 = 2 bytes;
int: 18446744073709551615 (UINT64_MAX) = 8 bytes;
Thanks
P.S. This is for a hash function which will be called many millions of times.
Also the byte sizes don't have to be a power of two
The fastest solution seems to be one based on tronics answer:
int bytes;
if (hash <= UINT32_MAX)
{
if (hash < 16777216U)
{
if (hash <= UINT16_MAX)
{
if (hash <= UINT8_MAX) bytes = 1;
else bytes = 2;
}
else bytes = 3;
}
else bytes = 4;
}
else if (hash <= UINT64_MAX)
{
if (hash < 72057594000000000ULL)
{
if (hash < 281474976710656ULL)
{
if (hash < 1099511627776ULL) bytes = 5;
else bytes = 6;
}
else bytes = 7;
}
else bytes = 8;
}
The speed difference using mostly 56-bit values was minimal (but measurable) compared to Thomas Pornin's answer. Also, I didn't test the solution using __builtin_clzl, which could be comparable.
Use this:
int n = 0;
while (x != 0) {
x >>= 8;
n ++;
}
This assumes that x contains your (positive) value.
Note that zero will be declared encodable as no byte at all. Also, most variable-size encodings need some length field or terminator to know where the encoding stops in a file or stream (usually, when you encode an integer and care about its size, there is more than one integer in your encoded object).
You need just two simple ifs if you are interested in the common sizes only. Consider this (assuming that you actually have unsigned values):
if (val < 0x10000) {
if (val < 0x100) // 8 bit
else // 16 bit
} else {
if (val < 0x100000000L) // 32 bit
else // 64 bit
}
Should you need to test for other sizes, choosing a middle point and then doing nested tests will keep the number of tests very low in any case. However, in that case making the testing a recursive function might be a better option, to keep the code simple. A decent compiler will optimize away the recursive calls so that the resulting code is still just as fast.
Assuming a byte is 8 bits, to represent an integer x you need [log2(x) / 8] + 1 bytes where [x] = floor(x).
OK, I see now that the byte sizes aren't necessarily a power of two. Consider a byte size of b bits; the formula is still [log2(x) / b] + 1.
Now, to calculate the log, either use lookup tables (best way speed-wise) or use binary search, which is also very fast for integers.
The function to find the position of the first '1' bit from the most significant side (clz or bsr) is usually a simple CPU instruction (no need to mess with log2), so you could divide that by 8 to get the number of bytes needed. In gcc, there's __builtin_clz for this task:
#include <limits.h>
int bytes_needed(unsigned long long x) {
    if (x == 0)  // __builtin_clzll(0) is undefined, and 0 still needs one byte
        return 1;
    int bits_needed = sizeof(x)*CHAR_BIT - __builtin_clzll(x);
    return (bits_needed + 7) / 8;
}
(On MSVC you would use the _BitScanReverse intrinsic.)
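For what it is worth, since C++20 the same bit-scan idea is available portably through std::bit_width; a sketch (the function name is made up, and std::bit_width(0) returns 0, so zero is clamped to one byte):
#include <bit>   // std::bit_width, C++20

int bytes_needed_cxx20(unsigned long long x) {
    int bits = int(std::bit_width(x));   // 0 for x == 0, else 1 + floor(log2(x))
    return bits == 0 ? 1 : (bits + 7) / 8;
}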
You may first get the highest bit set, which is the same as log2(N), and then get the bytes needed by ceil(log2(N) / 8).
Here are some bit hacks for getting the position of the highest bit set, which are copied from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious, and you can click the URL for details of how these algorithms work.
Find the integer log base 2 of an integer with an 64-bit IEEE float
int v; // 32-bit integer to find the log base 2 of
int r; // result of log_2(v) goes here
union { unsigned int u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = v;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
Find the log base 2 of an integer with a lookup table
static const char LogTable256[256] =
{
#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
-1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
};
unsigned int v; // 32-bit word to find the log of
unsigned r; // r will be lg(v)
register unsigned int t, tt; // temporaries
if (tt = v >> 16)
{
r = (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt];
}
else
{
r = (t = v >> 8) ? 8 + LogTable256[t] : LogTable256[v];
}
Find the log base 2 of an N-bit integer in O(lg(N)) operations
unsigned int v; // 32-bit value to find the log2 of
const unsigned int b[] = {0x2, 0xC, 0xF0, 0xFF00, 0xFFFF0000};
const unsigned int S[] = {1, 2, 4, 8, 16};
int i;
register unsigned int r = 0; // result of log2(v) will go here
for (i = 4; i >= 0; i--) // unroll for speed...
{
if (v & b[i])
{
v >>= S[i];
r |= S[i];
}
}
// OR (IF YOUR CPU BRANCHES SLOWLY):
unsigned int v; // 32-bit value to find the log2 of
register unsigned int r; // result of log2(v) will go here
register unsigned int shift;
r = (v > 0xFFFF) << 4; v >>= r;
shift = (v > 0xFF ) << 3; v >>= shift; r |= shift;
shift = (v > 0xF ) << 2; v >>= shift; r |= shift;
shift = (v > 0x3 ) << 1; v >>= shift; r |= shift;
r |= (v >> 1);
// OR (IF YOU KNOW v IS A POWER OF 2):
unsigned int v; // 32-bit value to find the log2 of
static const unsigned int b[] = {0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0,
0xFF00FF00, 0xFFFF0000};
register unsigned int r = (v & b[0]) != 0;
for (i = 4; i > 0; i--) // unroll for speed...
{
r |= ((v & b[i]) != 0) << i;
}
Find the number of bits by taking the log2 of the number, then divide that by 8 to get the number of bytes.
You can find logn of x by the formula:
logn(x) = log(x) / log(n)
Update:
Since you need to do this really quickly, Bit Twiddling Hacks has several methods for quickly calculating log2(x). The look-up table approach seems like it would suit your needs.
This will get you the number of bytes. It's not strictly the most efficient, but unless you're programming a nanobot powered by the energy contained in a red blood cell, it won't matter.
int count = 0;
while (numbertotest > 0)
{
numbertotest >>= 8;
count++;
}
You could write a little template meta-programming code to figure it out at compile time if you need it for array sizes:
#include <climits>
#include <cstddef>
#include <iostream>
template<unsigned long long N> struct NBytes
{ static const size_t value = NBytes<N/256>::value+1; };
template<> struct NBytes<0>
{ static const size_t value = 0; };
int main()
{
std::cout << "short = " << NBytes<SHRT_MAX>::value << " bytes\n";
std::cout << "int = " << NBytes<INT_MAX>::value << " bytes\n";
std::cout << "long long = " << NBytes<ULLONG_MAX>::value << " bytes\n";
std::cout << "10 = " << NBytes<10>::value << " bytes\n";
std::cout << "257 = " << NBytes<257>::value << " bytes\n";
return 0;
}
output:
short = 2 bytes
int = 4 bytes
long long = 8 bytes
10 = 1 bytes
257 = 2 bytes
Note: I know this isn't answering the original question, but it answers a related question that people will be searching for when they land on this page.
Floor((log2(N) / 8) + 1) bytes
You need exactly the log function
nb_bytes = floor(log(x)/log(256))+1
if you use log2, log2(256) == 8 so
floor(log2(x)/8)+1
You need to raise 256 to successive powers until the result is larger than your value.
For example: (Tested in C#)
long long limit = 1;
int byteCount;
for (byteCount = 1; byteCount < 8; byteCount++) {
limit *= 256;
if (limit > value)
break;
}
If you only want byte sizes to be powers of two (If you don't want 65,537 to return 3), replace byteCount++ with byteCount *= 2.
I think this is a portable implementation of the straightforward formula:
#include <limits.h>
#include <math.h>
#include <stdio.h>
int main(void) {
int i;
unsigned int values[] = {10, 257, 67898, 140000, INT_MAX, INT_MIN};
for ( i = 0; i < sizeof(values)/sizeof(values[0]); ++i) {
printf("%d needs %.0f bytes\n",
values[i],
1.0 + floor(log(values[i]) / (M_LN2 * CHAR_BIT))
);
}
return 0;
}
Output:
10 needs 1 bytes
257 needs 2 bytes
67898 needs 3 bytes
140000 needs 3 bytes
2147483647 needs 4 bytes
-2147483648 needs 4 bytes
Whether and how much the lack of speed and the need to link floating-point libraries matter depends on your needs.
I know this question didn't ask for this type of answer but for those looking for a solution using the smallest number of characters, this does the assignment to a length variable in 17 characters, or 25 including the declaration of the length variable.
//Assuming v is the value that is being counted...
int l=0;
for(;v>>l*8;l++);
This is based on SoapBox's idea of creating a solution that contains no jumps, branches, etc. Unfortunately his solution was not quite correct. I have adopted the spirit, and here's a 32-bit version; the 64-bit checks can be applied easily if desired.
The function returns number of bytes required to store the given integer.
unsigned short getBytesNeeded(unsigned int value)
{
unsigned short c = 0; // 0 => size 1
c |= !!(value & 0xFF00); // 1 => size 2
c |= (!!(value & 0xFF0000)) << 1; // 2 => size 3
c |= (!!(value & 0xFF000000)) << 2; // 4 => size 4
static const int size_table[] = { 1, 2, 3, 3, 4, 4, 4, 4 };
return size_table[c];
}
Shift the int eight bits to the right, up to eight times, and see if there are still 1-bits left. The number of times you shift before you stop is the number of bytes you need.
More succinctly, the minimum number of bytes you need is ceil(min_bits/8), where min_bits is the index (i+1) of the highest set bit.
There are a multitude of ways to do this.
Option #1.
int numBytes = 0;
do {
numBytes++;
} while (i >>= 8);
return (numBytes);
In the above example, i is the number you are testing, and this generally works for any processor and any size of integer.
However, it might not be the fastest. Alternatively, you can try a series of if statements ...
For a 32-bit integer:
if ((upper = (value >> 16)) == 0) {
/* Bit in lower 16 bits may be set. */
if ((high = (value >> 8)) == 0) {
return (1);
}
return (2);
}
/* Bit in upper 16 bits is set */
if ((high = (upper >> 8)) == 0) {
return (3);
}
return (4);
For 64-bit integers, another level of if statements would be required.
If the speed of this routine is as critical as you say, it might be worthwhile to do this in assembler if you want it as a function call. That could allow you to avoid creating and destroying the stack frame, saving a few extra clock cycles if it is that critical.
A bit basic, but since there will be a limited number of outputs, can you not pre-compute the breakpoints and use a case statement? No need for calculations at run-time, only a limited number of comparisons.
Why not just use a 32-bit hash?
That will work at near-top-speed everywhere.
I'm rather confused as to why a large hash would even be wanted. If a 4-byte hash works, why not just use it always? Excepting cryptographic uses, who has hash tables with more than 2^32 buckets anyway?
There are lots of great recipes for stuff like this over at Sean Anderson's "Bit Twiddling Hacks" page.
This code has 0 branches, which could be faster on some systems. Also, on some systems (GPGPU) it's important for threads in the same warp to execute the same instructions. This code is always the same number of instructions no matter what the input value is.
inline int get_num_bytes(unsigned long long value) // where unsigned long long is the largest integer value on this platform
{
int size = 1; // starts at 1 so that 0 will return 1 byte
size += !!(value & 0xFF00);
size += !!(value & 0xFFFF0000);
if (sizeof(unsigned long long) > 4) // every sane compiler will optimize this out
{
size += !!(value & 0xFFFFFFFF00000000ull);
if (sizeof(unsigned long long) > 8)
{
size += !!(value & 0xFFFFFFFFFFFFFFFF0000000000000000ull);
}
}
static const int size_table[] = { 1, 2, 4, 8, 16 };
return size_table[size];
}
g++ -O3 produces the following (verifying that the ifs are optimized out):
xor %edx,%edx
test $0xff00,%edi
setne %dl
xor %eax,%eax
test $0xffff0000,%edi
setne %al
lea 0x1(%rdx,%rax,1),%eax
movabs $0xffffffff00000000,%rdx
test %rdx,%rdi
setne %dl
lea (%rdx,%rax,1),%rax
and $0xf,%eax
mov _ZZ13get_num_bytesyE10size_table(,%rax,4),%eax
retq
Why so complicated? Here's what I came up with:
bytesNeeded = (numBits/8)+((numBits%8) != 0);
Basically, numBits divided by eight, plus 1 if there is a remainder.
There are already a lot of answers here, but if you know the number ahead of time, in C++ you can use a template to compute the answer at compile time.
template <unsigned long long N>
struct RequiredBytes {
enum : int { value = 1 + (N > 255 ? RequiredBytes<(N >> 8)>::value : 0) };
};
template <>
struct RequiredBytes<0> {
enum : int { value = 1 };
};
const int REQUIRED_BYTES_18446744073709551615 = RequiredBytes<18446744073709551615ULL>::value; // 8
or for a bits version:
template <unsigned long long N>
struct RequiredBits {
enum : int { value = 1 + RequiredBits<(N >> 1)>::value };
};
template <>
struct RequiredBits<1> {
enum : int { value = 1 };
};
template <>
struct RequiredBits<0> {
enum : int { value = 1 };
};
const int REQUIRED_BITS_42 = RequiredBits<42>::value; // 6