Is it possible to combine a number of float values into one float value and extract the values when needed? - c++

I am working on an algorithm for an iPhone app where the data I need to keep in memory exceeds the limit, so is it possible to represent a number of float values as one float value and retrieve those values when I need them?
For instance:
float array[4];
array[0]=0.12324;
array[1]=0.56732;
array[2]=0.86555;
array[3]=0.34545;
float combinedvalue=?

Not in general, no. You can't store 4N bits of information in only N bits.
If there's some pattern in your numbers, then you might find a scheme. For example, if all your numbers are of similar value, you could potentially store only the differences between them at lower precision.
However, this kind of thing is difficult, and limited.

If those numbers are exactly 5 digits each, you can treat them as ints by multiplying by 100000. Then you'll need 17 bits for each number, 68 bits in total, which (with some bit-shifting) takes up 9 bytes. Does that help, 9 bytes instead of 16?
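A minimal packing sketch under those assumptions (non-negative values scaled by 100000 into 17-bit integers; pack4/unpack1 are illustrative names, and the memcpy assumes a little-endian layout):

#include <cstdint>
#include <cstring>

// pack four 17-bit integers (0..131071) into 9 bytes; 68 of the 72 bits are used
void pack4(const uint32_t v[4], uint8_t out[9])
{
    uint64_t lo = 0, hi = 0;
    for (int i = 0; i < 4; ++i) {
        int shift = 17 * i;
        lo |= (uint64_t)(v[i] & 0x1FFFF) << shift;
        if (shift + 17 > 64)                          // last value straddles the 64-bit boundary
            hi |= (uint64_t)(v[i] & 0x1FFFF) >> (64 - shift);
    }
    std::memcpy(out, &lo, 8);                         // assumes little-endian byte order
    out[8] = (uint8_t)hi;
}

// recover value i; divide the result by 100000.0f to get the float back
uint32_t unpack1(const uint8_t in[9], int i)
{
    uint64_t lo;
    std::memcpy(&lo, in, 8);
    int shift = 17 * i;
    uint64_t bits = lo >> shift;
    if (shift + 17 > 64)
        bits |= (uint64_t)in[8] << (64 - shift);
    return (uint32_t)(bits & 0x1FFFF);
}

So 0.12324 would be stored as round(0.12324 * 100000) = 12324 and recovered as 12324 / 100000.0f.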
Please note that the implementation of your algorithm will also take up memory!

What you are requiring could be accomplished in several different ways.
For instance, in C++ you generally have single-precision floats (4 bytes) as the smallest precision available, though I wouldn't be surprised if there are packages that handle smaller floating-point precisions.
Therefore, if you are using double precision floating point values and can get by with less precision then you can switch to a smaller precision.
Now, depending on the range of values you want to store, you might be able to use a fixed-point representation as well, but you will need to be familiar with the nuances of bit shifting, masking, and so on. An added benefit of this approach is that it could make your program run faster, since fixed-point (integer) arithmetic is much faster than floating-point arithmetic.
The right option depends on the data you need to store and how comfortable you are with lower-level binary arithmetic.
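As an illustration of the fixed-point idea, here is a minimal Q15 sketch; the format choice and names are mine, and it only covers values in [-1, 1):

#include <cstdint>

// Q15 fixed point: stored value = real value * 32768, usable range [-1, 1)
typedef int16_t fixed15;

fixed15 to_fixed(float f)  { return (fixed15)(f * 32768.0f); }
float to_float(fixed15 x)  { return x / 32768.0f; }

// multiplication widens to 32 bits, then shifts the doubled scale factor back out
fixed15 mul(fixed15 a, fixed15 b)
{
    return (fixed15)(((int32_t)a * b) >> 15);
}

Each value then takes 2 bytes instead of 4, at the cost of range and precision.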

Related

More Precise Floating point Data Types than double?

In my project I have to compute division, multiplication, subtraction, addition on a matrix of double elements.
The problem is that as the size of the matrix increases, the accuracy of my output is drastically affected.
Currently I am using double for each element, which I believe uses 8 bytes of memory and has an accuracy of about 16 significant digits, irrespective of decimal position.
Even for large size of matrix the memory occupied by all the elements is in the range of few kilobytes. So I can afford to use datatypes which require more memory.
So I wanted to know which data type is more precise than double.
I tried searching in some books and I could find long double, but I don't know what its precision is.
And what if I want more precision than that?
According to Wikipedia, the 80-bit "Intel" IEEE 754 extended-precision long double, which is 80 bits padded to 16 bytes in memory, has a 64-bit mantissa with no implicit bit, which gets you 19.26 decimal digits. This has been the almost universal standard for long double for ages, but recently things have started to change.
The newer 128-bit quad-precision format has 112 mantissa bits plus an implicit bit, which gets you 34 decimal digits. GCC implements this as the __float128 type and there is (if memory serves) a compiler option to set long double to it.
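For instance, with GCC on x86 (a sketch; __float128, the Q literal suffix, and quadmath_snprintf are GCC/libquadmath extensions):

// GCC-specific; compile with: g++ file.cpp -lquadmath
#include <quadmath.h>
#include <cstdio>

int main()
{
    __float128 x = 1.0Q / 3.0Q;
    char buf[64];
    quadmath_snprintf(buf, sizeof buf, "%.33Qg", x);  // ~34 significant digits
    printf("%s\n", buf);
}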
You might want to consider the sequence of operations, i.e. do the additions in an ordered sequence starting with the smallest values first. This will increase overall accuracy of the results using the same precision in the mantissa:
1e00 + 1e-16 + ... + 1e-16 (1e16 times) = 1e00
1e-16 + ... + 1e-16 (1e16 times) + 1e00 = 2e00
The point is that adding small numbers to a large number makes them disappear, so the latter approach reduces the numerical error.
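A quick demonstration of the effect, scaled down to 10^7 additions so it finishes quickly:

#include <cstdio>

int main()
{
    // 1e-16 is below half an ulp of 1.0, so each addition to 1.0 is absorbed
    double big_first = 1.0;
    for (long i = 0; i < 10000000; ++i) big_first += 1e-16;

    // accumulating the small terms first keeps them from disappearing
    double small_first = 0.0;
    for (long i = 0; i < 10000000; ++i) small_first += 1e-16;
    small_first += 1.0;

    printf("%.17g\n%.17g\n", big_first, small_first);  // prints 1, then 1.000000001...
}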
Floating point data types with greater precision than double are going to depend on your compiler and architecture.
In order to get more than double precision, you may need to rely on some math library that supports arbitrary precision calculations. These probably won't be fast though.
On Intel architectures the precision of long double is 80 bits.
What kind of values do you want to represent? Maybe you are better off using fixed precision.

Real numbers - how to determine whether float or double is required?

Given a real value, can we check if a float data type is enough to store the number, or a double is required?
I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?
For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic
Unfortunately, I don't think there is any way to automate the decision.
Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.
In practice, most calculations will work with enough precision for usable results, using a 64 bit type. Many calculations will not get usable results using only 32 bits.
In modern processors, buses and arithmetic units are wide enough to give 32 bit and 64 bit floating point similar performance. The main motivation for using 32 bit is to save space when storing a very large array.
That leads to the following strategy:
If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32 bit type gives good enough results, and if so use it. Otherwise, use a 64 bit type.
I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.
Suppose that you get this real number by specifying it in code or through user input; a way to check if a float or a double would be enough to store it without precision loss is to just count the number of significant bits and check that against the data range for float and double.
If the number is given as an expression (i.e. 1/7 or sqrt(2)), you will also want ways of detecting:
Whether the number is rational and, if it is, whether it has repeating (cyclic) decimals.
What happens when you have an irrational number?
Moreover, there are numbers, such as 0.9, that float/double cannot in theory represent "exactly" (at least not in our binary computation paradigm); see Jon Skeet's excellent answer on this.
Lastly, see additional discussion on float vs. double.
Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.
Single precision assigns 23 bits of "mantissa," or binary digits after the radix point (decimal point). Since the bit before the dot is always one, this equates to a 24-bit fraction. Dividing 24 by log2(10) ≈ 3.32, a float gets you 7.2 decimal digits of precision.
Following the same process for double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).
The bits besides the mantissa are used for the exponent. The number of exponent bits determines the range of numbers allowed. Single goes to ~10^±38, double goes to ~10^±308.
As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.
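These figures can be queried at compile time via std::numeric_limits, for example:

#include <iostream>
#include <limits>

int main()
{
    std::cout << std::numeric_limits<float>::digits << '\n'          // 24 mantissa bits
              << std::numeric_limits<float>::digits10 << '\n'        // 6 guaranteed decimal digits
              << std::numeric_limits<double>::digits10 << '\n'       // 15
              << std::numeric_limits<long double>::digits10 << '\n'; // 18 with Intel's 80-bit format
}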
A very detailed post that may or may not answer your question.
An entire series in floating point complexities!
Couldn't you simply store the value to a float and a double variable and then compare the two? The comparison implicitly converts the float back to a double; if there is no difference, the float is sufficient:
float f = static_cast<float>(value);
double d = value;

// the conversion back to double is exact, so this tests whether
// the round trip through float lost any information
if (static_cast<double>(f) == d)
{
    // float is sufficient
}
You cannot represent real numbers with float or double variables, only a subset of the rational numbers.
When you do floating point computation, your CPU floating point unit will decide the best approximation for you.
I might be wrong, but I thought that the float (4 bytes) and double (8 bytes) floating-point representations were actually specified independently of computer architectures.

Conversion Big Integer <-> double in C++

I am writing my own long arithmetic library in C++ for fun and it is already fairly complete; I have even implemented several cryptographic algorithms with it. But one important thing is still missing: I want to convert doubles (and floats/long doubles) into my number type and vice versa. My numbers are represented as a variable-sized array of unsigned long ints plus a sign bit.
I tried to find the answer with google, but the problem is that people rarely ever implement such things themselves, so I only find things about how to use Java BigInteger etc.
Conceptually, it is rather easy: I take the mantissa, shift it by the number of bits dictated by the exponent and set the sign. In the other direction I truncate it so that it fits into the mantissa and set the exponent depending on my log2 function.
But I am having a hard time figuring out the details. I could play around with some bit patterns and cast them to a double, but I haven't found an elegant way to achieve that; or I could "calculate" it by starting with 2 and exponentiating, multiplying, etc., but that doesn't seem very efficient.
I would appreciate a solution that doesn't use any library calls, because I am trying to avoid libraries for this project; otherwise I could just have used GMP. Furthermore, as on several other occasions, I often keep two solutions: one using inline assembler, which is efficient, and one that is more platform independent. So either kind of answer is useful to me.
edit: I use uint64_t for my parts, but I would like to be able to change that depending on the machine, and I am willing to write different implementations with some #ifdefs to achieve that.
I'm going to make a non-portable assumption here: namely, that an unsigned long long has more bits of precision than a double's mantissa. (This is true on all modern desktop systems that I know of.)
First, convert the most significant integer(s) into an unsigned long long. Then convert that to a double S. Let M be the number of integers less significant than those used in that first step. Multiply S by (1ull << (sizeof(unsigned) * CHAR_BIT * M)). (If shifting by more than 63 bits, you will have to split it into separate shifts and do some arithmetic.) Finally, if the original number was negative, multiply the result by -1.
This rounds a lot, but even with this rounding, due to the above assumption, no digits are lost that wouldn't be lost anyway with the conversion to a double. I think this is a similar process to what Mark Ransom said, but I'm not certain.
For converting from a double to a big integer, first separate the mantissa into a double M and the exponent into an int E, using frexp. Multiply M by UNSIGNED_MAX, and store that result in an unsigned R. If std::numeric_limits<double>::radix is 2 (note it is a constant, not a function; it is 2 for IEEE formats such as x86/x64), you can simply shift R left by E - (sizeof(unsigned) * CHAR_BIT) bits and you're done. Otherwise the result will instead be R * (radix ** (E - sizeof(unsigned) * CHAR_BIT)) (where ** means "to the power of").
If performance is a concern, you can add an overload to your bignum class for multiplying by std::integral_constant<unsigned, 10>, which simply returns (LHS << 3) + (LHS << 1). You can similarly optimize other constants if you wish.
This blog post might help you: Clarifying and optimizing Integer>>asFloat.
Otherwise, you can get an idea of the algorithm from this SO question: Converting from unsigned long long to float with round to nearest even.
You don't say explicitly, but I assume your library is integer only and the unsigned longs are 32 bit and binary (not decimal). The conversion to double is simple, so I'll tackle that first.
Start with a multiplier for the current piece; if the number is positive it will be 1.0, if negative it will be -1.0. For each of the unsigned long ints in your bignum, multiply by the current multiplier and add it to the result, then multiply your multiplier by pow(2.0, 32) (4294967296.0) for 32 bits or pow(2.0, 64) (18446744073709551616.0) for 64 bits.
You can optimize this process by working with only the 2 most significant values. You need to use 2 even if the number of bits in your integer type is larger than the precision of a double, since the number of used bits in the most significant value might only be 1. You can generate the multiplier by taking a power of 2 to the number of skipped bits, e.g. pow(2.0, most_significant_count*sizeof(bit_array[0])*8). You can't use a bit shift as given in another answer because it will overflow after the first value.
To convert from double, you can get the exponent and mantissa separated from each other with the frexp function. The mantissa will come as a floating point value between 0.5 and 1.0 so you'll want to multiply it by pow(2.0, 32) or pow(2.0, 64) to convert it to an integer, then adjust the exponent by -32 or -64 to compensate.
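A sketch of both directions under the layout assumed in this answer (32-bit limbs, least significant first; the function names are illustrative):

#include <cmath>
#include <cstdint>
#include <vector>

// bignum -> double: walk the limbs from least to most significant
double to_double(const std::vector<uint32_t>& limbs, bool negative)
{
    double result = 0.0;
    double multiplier = negative ? -1.0 : 1.0;
    for (uint32_t limb : limbs) {
        result += limb * multiplier;
        multiplier *= 4294967296.0;   // advance the weight by 2^32
    }
    // for numbers with very many limbs, use only the top two limbs
    // as suggested above to keep the multiplier in range
    return result;
}

// double -> integer mantissa plus binary exponent, via frexp as described
void split_double(double d, uint64_t& mantissa, int& exponent)
{
    int e;
    double m = std::frexp(std::fabs(d), &e);  // m in [0.5, 1), |d| = m * 2^e
    mantissa = (uint64_t)std::ldexp(m, 64);   // scale into a 64-bit integer
    exponent = e - 64;                        // |d| = mantissa * 2^exponent
}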
To go from a big integer to a double, just do it the same way you parse numbers. For example, you parse the number "531" as "1 + (3 * 10) + (5 * 100)". Compute each portion using doubles, starting with the least significant portion.
To go from a double to a big integer, do it the same way but in reverse starting with the most significant portion. So, to convert 531, you first see that it's more than 100 but less than 1000. You find the first digit by dividing by 100. Then you subtract to get the remainder of 31. Then find the next digit by dividing by 10. And so on.
Of course, you won't be using tens (unless you store your big integers as digits). Exactly how you break it apart depends on how your big integer class is constructed. For example, if it uses 64-bit units, then you'll use powers of 2^64 instead of powers of 10.

How do I find the largest integer fully supported by hardware arithmetics?

I am implementing a BigInt class that must support arbitrary-precision operations on integers.
Quote from "The Algorithm Design Manual" by S.Skiena:
What base should I do [editor's note: arbitrary-precision] arithmetic in? - It is perhaps simplest to implement your own high-precision arithmetic package in decimal, and thus represent each integer as a string of base-10 digits. However, it is far more efficient to use a higher base, ideally equal to the square root of the largest integer supported fully by hardware arithmetic.
How do I find the largest integer supported fully by hardware arithmetic? If I understand correctly, my machine being an x64-based PC, the largest integer supported should be 2^64 (http://en.wikipedia.org/wiki/X86-64 - Architectural features: 64-bit integer capability), so I should use base 2^32. But is there a way in C++ to get this size programmatically, so I can typedef my base_type to it?
You might be searching for std::uintmax_t and std::intmax_t.
static_cast<unsigned>(-1) is the maximum value of unsigned, i.e. all bits set to 1. Is that what you are looking for?
You can also use std::numeric_limits<unsigned>::max() or UINT_MAX; all of these yield the same result, namely the maximum value that can be stored in an unsigned type.
int (and, by extension, unsigned int) is the "natural" size for the architecture. So a type that has half the bits of an int should work reasonably well. Beyond that, you really need to configure for the particular hardware; the type of the storage unit and the type of the calculation unit should be typedefs in a header and their type selected to match the particular processor. Typically you'd make this selection after running some speed tests.
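A sketch of what such a configuration header might look like (the type names are mine):

#include <cstdint>

// the storage unit is half the width of the calculation unit, so a full
// limb-by-limb product can never overflow: (2^32 - 1)^2 < 2^64
typedef std::uint32_t limb;    // storage unit
typedef std::uint64_t dlimb;   // calculation unit

dlimb mul_limbs(limb a, limb b)
{
    return (dlimb)a * (dlimb)b;  // exact, with headroom left for carries
}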
INT_MAX doesn't help here; it tells you the largest value that can be stored in an int, which may or may not be the largest value that the hardware can support directly. Similarly, INTMAX_MAX is no help, either; it tells you the largest value that can be stored as an integral type, but doesn't tell you whether operations on such a value can be done in hardware or require software emulation.
Back in the olden days, the rule of thumb was that operations on ints were done directly in hardware, and operations on longs were done as multiple integer operations, so operations on longs were much slower than operations on ints. That's no longer a good rule of thumb.
Things are not so black and white. There are MANY issues here, and you may have other things worth considering. I've now written two variable-precision tools (in MATLAB, VPI and HPF) and I've chosen different approaches in each. It also matters whether you are writing an integer form or a high-precision floating point form.
The difference is, integers can grow without bound in the number of digits. But if you are doing a floating point implementation with a user specified number of digits, you always know the number of digits in the mantissa. This is fixed.
First of all, it is simplest to use a single integer for each decimal digit. This makes many things work nicely, so I/O is easy. It is a bit inefficient in terms of storage though. Adds and subtracts are easy. And if you use integers for each digit, then multiplies are easy too. In MATLAB, for example, conv is pretty fast, though it is still O(n^2). I think gmp uses an FFT multiply, so it is faster yet.
But assuming you use a basic conv multiply, you need to worry about overflows for numbers with a huge number of digits. Suppose I store decimal digits as 8-bit signed integers. Using conv, followed by carries, I can do a multiply. For example, suppose I have the number 9999.
N = repmat(9,1,4)
N =
9 9 9 9
conv(N,N)
ans =
81 162 243 324 243 162 81
Thus even to form the product 9999*9999, I'd need to be careful, as the digits will overflow an 8-bit signed integer. If I'm using 16-bit integers to accumulate the convolution products, then a multiply between a pair of 1000-digit integers can cause an overflow.
N = repmat(9,1,1000);
max(conv(N,N))
ans =
81000
So if you are worried about the possibility of millions of digits, you need to watch out.
One alternative is to use what I call migits, essentially working in a higher base than 10. Thus by using base 1000000 and doubles to store the elements, I can store 6 decimal digits per element. A convolution will still cause overflows for larger numbers though.
N = repmat(999999,1,10000);
log2(max(conv(N,N)))
ans =
53.151
Thus a convolution between two sets of base-1000000 migits that are 10000 migits in length (60000 decimal digits) will overflow past the point where a double can represent an integer exactly.
So again, if you will use numbers with millions of digits, beware. A nice thing about using a higher base of migits with a convolution-based multiply is that, since the conv operation is O(n^2), going from base 10 to base 100 gives you a 4-to-1 speedup. Going to base 1000 yields a 9-to-1 speedup in the convolutions.
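The overflow bound is easy to compute in advance: the largest single convolution term when multiplying two n-migit, base-B numbers is n*(B-1)^2, and it must fit the accumulator type. A quick C++ check mirroring the MATLAB example above:

#include <cmath>
#include <cstdio>

int main()
{
    const double B = 1000000.0;  // base 10^6 migits
    const double n = 10000.0;    // 10000 migits = 60000 decimal digits
    double worst = n * (B - 1.0) * (B - 1.0);  // largest single convolution term
    printf("%.3f bits needed\n", std::log2(worst));  // ~53.151, past double's exact-integer range
}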
Finally, the use of a base other than 10 as migits makes it logical to implement guard digits (for floats.) In floating point arithmetic, you should never trust the least significant bits of a computation, so it makes sense to keep a few digits hidden in the shadows. So when I wrote my HPF tool, I gave the user control of how many digits would be carried along. This is not an issue for integers of course.
There are many other issues. I discuss them in the docs carried with those tools.

How do I compress a large number of similar doubles?

I want to store billions (10^9) of double precision floating point numbers in memory and save space. These values are grouped in thousands of ordered sets (they are time series), and within a set, I know that the difference between values is usually not large (compared to their absolute value). Also, the closer to each other, the higher the probability of the difference being relatively small.
A perfect fit would be a delta encoding that stores only the difference of each value to its predecessor. However, I want random access to subsets of the data, so I can't depend on going through a complete set in sequence. I'm therefore using deltas to a set-wide baseline that yields deltas which I expect to be within 10 to 50 percent of the absolute value (most of the time).
I have considered the following approaches:
divide the smaller value by the larger one, yielding a value between 0 and 1 that could be stored as an integer of some fixed precision plus one bit for remembering which number was divided by which. This is fairly straightforward and yields satisfactory compression, but is not a lossless method and thus only a secondary choice.
XOR the IEEE 754 binary64 encoded representations of both values and store the length of the long stretches of zeroes at the beginning of the exponent and mantissa plus the remaining bits which were different. Here I'm quite unsure how to judge the compression, although I think it should be good in most cases.
Are there standard ways to do this? What might be problems about my approaches above? What other solutions have you seen or used yourself?
Rarely are all the bits of a double-precision number meaningful.
If you have billions of values that are the result of some measurement, find the calibration and error of your measurement device. Quantize the values so that you only work with meaningful bits.
Often, you'll find that you only need 16 bits of actual dynamic range. You can probably compress all of this into arrays of "short" that retain all of the original input.
Use a simple "Z-score technique" where every value is really a signed fraction of the standard deviation.
So a sequence of samples with a mean of m and a standard deviation of s gets transformed into a bunch of Z-scores. Normal Z-score transformations use a double, but you should use a fixed-point version: s/1000 or s/16384 or something that retains only the actual precision of your data, not the noise bits on the end.
for u in samples:
    z = int(16384 * (u - m) / s)

for z in scaled_samples:
    u = s * (z / 16384.0) + m
Your Z-scores retain a pleasant easy-to-work with statistical relationship with the original samples.
Let's say you use a signed 16-bit Z-score. You have a range of -32,768 to 32,767. Scale this by 16,384 and your Z-scores have an effective resolution of 0.000061 decimal.
If you use a signed 24-bit Z-score, you have +/- 8 million. Scale this by 4,194,304 and you have a resolution of 0.00000024.
I seriously doubt you have measuring devices this accurate. Further, any arithmetic done as part of filter, calibration or noise reduction may reduce the effective range because of noise bits introduced during the arithmetic. A badly thought-out division operator could make a great many of your decimal places nothing more than noise.
Whatever compression scheme you pick, you can decouple it from the need to perform arbitrary seeks by compressing into fixed-size blocks and prepending to each block a header containing all the data required to decompress it. For a delta encoding scheme, for instance, the block would contain deltas encoded in some fashion that takes advantage of their small magnitude to make them take less space (fewer bits for exponent/mantissa, conversion to a fixed-point value, Huffman encoding, etc.), and the header a single uncompressed sample. Seeking then becomes a matter of cheaply selecting the appropriate block, then decompressing it.
If the compression ratio is so variable that much space is being wasted padding the compressed data to produce fixed size blocks, a directory of offsets into the compressed data could be built instead and the state required to decompress recorded in that.
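A minimal sketch of the fixed-size-block idea; the fixed-point delta format and the scale factor are assumptions chosen to keep the example short, and a real implementation would validate the delta range against the data:

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Block {
    double baseline;              // one uncompressed sample per block (the "header")
    std::vector<int32_t> deltas;  // fixed-point deltas against the baseline
};

const double SCALE = 1048576.0;   // 2^20 = 20 fractional bits, an assumed precision

Block compress(const double* samples, std::size_t n)
{
    Block b;
    b.baseline = samples[0];
    b.deltas.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        b.deltas.push_back((int32_t)std::lround((samples[i] - b.baseline) * SCALE));
    return b;
}

// random access: recover one value without decoding its neighbours
double sample_at(const Block& b, std::size_t i)
{
    return b.baseline + b.deltas[i] / SCALE;
}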
If you know a group of doubles has the same exponent, you could store the exponent once, and only store the mantissa for each value.
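A sketch of pulling those fields apart, assuming the standard IEEE 754 binary64 layout:

#include <cstdint>
#include <cstring>

// split a binary64 into its 11-bit biased exponent (stored once per group)
// and the sign bit plus 52 explicit mantissa bits (stored per value)
void split_fields(double d, uint64_t& sign_mantissa, uint16_t& biased_exponent)
{
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    biased_exponent = (uint16_t)((bits >> 52) & 0x7FF);
    sign_mantissa   = bits & 0x800FFFFFFFFFFFFFull;  // sign bit + low 52 bits
}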