I'm using floats to represent a position in my game:
struct Position
{
float x;
float y;
};
I'm wondering if this is the best choice and what the consequences will be as the position values continue to grow larger. I took some time to brush up on how floats are stored and realized that I am a little confused.
(I'm using the Microsoft Visual C++ compiler.)
In float.h, FLT_MAX is defined as follows:
#define FLT_MAX 3.402823466e+38F /* max value */
which is 340282346600000000000000000000000000000.
That value is much greater than UINT_MAX which is defined as:
#define UINT_MAX 0xffffffff
and corresponds to the value 4294967295.
Based on this, it seems like a float would be a good choice to store a very large number like a position. Even though FLT_MAX is very large, I'm wondering how the precision issues will come into play.
Based on my understanding, a float uses 1 bit to store the sign, 8 bits to store the exponent, and 23 bits to store the mantissa (a leading 1 is assumed):
S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM
That means FLT_MAX looks like this (note that an all-ones exponent field is reserved for infinity and NaN, so the largest finite exponent field is 11111110):
0 11111110 11111111111111111111111
which is the equivalent of:
1.11111111111111111111111 x 2^127
or, written out in binary (24 ones followed by 104 zeros):
11111111111111111111111100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Even knowing this, I have trouble visualizing the loss of precision and I'm getting confused thinking about what will happen as the values continue to increase.
Is there any easier way to think about this? Are floats or doubles generally used to store very large numbers over something like an unsigned int?
A way of thinking about the precision of a float is to consider that it carries roughly six to seven significant decimal digits, of which perhaps five remain trustworthy once a few arithmetic operations have accumulated error. So if your units are meters, and you have something 1 km away, that's 1000 m; attempting to deal with that object at a resolution of 10 cm (0.1 m) or less may become problematic.
The usual approach in a game would be to use floats, but to divide the world up such that positions are relative to local co-ordinate systems (for example, divide the world into a grid, and for each grid square have a translation value). Everything will have enough precision until it gets transformed relative to the camera for rendering, at which point the imprecision for far away things is not a problem.
As an example, imagine a game set in the solar system. If the origin of your co-ordinate system is in the heart of the sun, then co-ordinates on the surface of planets will be impossible to represent accurately in a float. However if you instead have a co-ordinate system relative to the planet's surface, which in turn is relative to the center of the planet, and then you know where the planet is relative to the sun, you can operate on things in a local space with accuracy, and then transform into whatever space you want for rendering.
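To make that concrete, here is a minimal sketch of grid-relative positions; the names GridPosition, CELL_SIZE and move are my own invention for illustration, not from any particular engine:

#include <cmath>

// Assumed for illustration: the world is divided into cells of
// CELL_SIZE meters; a position is an exact integer cell index plus a
// small local offset, so the float part never grows large enough to
// lose precision.
const float CELL_SIZE = 1024.0f;

struct GridPosition {
    int   cellX, cellY;    // which grid square we are in (exact integers)
    float localX, localY;  // offset inside the cell, kept in [0, CELL_SIZE)
};

// Re-normalize after movement so the local offset stays small.
void move(GridPosition& p, float dx, float dy) {
    p.localX += dx;
    p.localY += dy;
    int shiftX = (int)std::floor(p.localX / CELL_SIZE);
    int shiftY = (int)std::floor(p.localY / CELL_SIZE);
    p.cellX += shiftX;  p.localX -= shiftX * CELL_SIZE;
    p.cellY += shiftY;  p.localY -= shiftY * CELL_SIZE;
}

Rendering then only needs the difference between an object's cell and the camera's cell, which is small, before adding the local offsets.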
No, they're not.
Let's say your position needs to increase by 10 cm for a certain frame since the game object moved.
Assuming a game world scaled in meters, this is 0.10. But once your float value passes about 2^21 (roughly 2 million), the spacing between adjacent floats exceeds 0.2, so adding 0.10 rounds back to the original value and your attempt to increase the position simply fails.
Do you need to store a value greater than about 8.4 million (2^23) while keeping a fractional part? Then float is already too small, and beyond 16.7 million (2^24) even whole numbers start to be skipped.
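A quick way to see this for yourself (a throwaway test, not production code):

#include <cstdio>

int main() {
    float position = 2500000.0f;      // ~2.5 million "meters"
    float moved = position + 0.1f;    // try to move 10 cm
    // At this magnitude adjacent floats are 0.25 apart, so the
    // addition rounds straight back to the original value.
    std::printf("%s\n", moved == position ? "no movement!" : "moved");
}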
This series by Bruce Dawson may help.
If you really need to handle very large numbers, then consider using an arbitrary-precision arithmetic library. You will have to profile your code, because these libraries are slower than arithmetic on built-in types.
It is possible that you do not really need very large coordinate values. For example, you could wrap around the edges of your world, and use modulo arithmetic for handling positions.
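For instance, a wrap-around along one axis might look like this (a sketch; worldSize stands in for whatever your map width is):

#include <cmath>

// Keep a coordinate inside [0, worldSize) so it never grows without bound.
float wrapCoordinate(float x, float worldSize) {
    x = std::fmod(x, worldSize);
    if (x < 0.0f) x += worldSize;
    return x;
}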
What's the problem with floats of fewer than 10 bits? Why don't we have 8-bit floats? I can imagine how it would affect the outcome if GLfloat is used for colors, but I can't imagine how it affects vertices. Do problems start to occur when we zoom into objects?
Yes, the GPU treats many values as 32-bit values, and even the OpenGL Red Book suggests using half floats whenever possible. But how can I know my lower limit in precision?
Why don't we have 8-bit floats? I can imagine how it would affect the outcome if GLfloat is used for colors, but I can't imagine how it affects vertices. Do problems start to occur when we zoom into objects?
Well, yes. Given that even low-resolution displays these days have at least 1024 pixels in one direction, you need at least 10 bits of significant digits to accurately represent a position on the screen. So assuming that the whole transformation chain is performed without loss of precision (which is obviously not the case), you need at least 11 bits of significant digits in the original data.
In a floating-point value, the mantissa is what gives the significant digits. In a half-precision float (16 bits total) the mantissa is 10 stored bits plus an implicit leading 1, i.e. 11 significant bits, which is the least amount of precision needed to represent vertices in screen space without transformation round-off artifacts becoming visible on a low-resolution screen.
8 bits would be just too little precision to be useful for anything.
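To make that concrete, here is a small experiment (my own illustration) that rounds a screen coordinate to the 11 significant bits a half float carries and prints the error in pixels:

#include <cmath>
#include <cstdio>

// Keep only 11 significant bits of v, mimicking a half float's mantissa
// (ignoring its narrower exponent range for simplicity).
float roundTo11Bits(float v) {
    int e;
    float m = std::frexp(v, &e);            // v = m * 2^e, with m in [0.5, 1)
    m = std::round(m * 2048.0f) / 2048.0f;  // 2^11 steps for the significand
    return std::ldexp(m, e);
}

int main() {
    float x = 1500.3f;   // a coordinate on a 2048-pixel-wide screen
    std::printf("error: %f pixels\n", roundTo11Bits(x) - x);
    // For coordinates in [1024, 2048) the representable step is a full
    // pixel, which is exactly why 11 bits is the bare minimum for a
    // screen of this size.
}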
A small amount of background: I am working on a converter that bridges between a map maker (Tiled) that outputs XML, and an engine (Angel2D) that inputs Lua tables. Most of this is straightforward.
However, Tiled outputs in pixel offsets (integers of absolute values), while Angel2D inputs OpenGL units (floats of relative values); a conversion factor between these two is needed (for example, 32px = 1gu). Since OpenGL units are abstract, and the camera can zoom in or out if the objects are too small or big, the actual conversion factor isn't important; I could use a random number, and the user would merely have to zoom in or out.
But it would be best if the conversion factor was selected such that most numbers outputted were small and whole (or fractions of small whole numbers), because that makes it easier to work with (and the whole point of the OpenGL units is that they are easy to work with).
How would I find such a conversion factor reliably?
My first attempt was to use the smallest number given; this resulted in no fractions below 1, but often led to lots of decimal places where the factors didn't line up.
Then I tried the mode of the sequence, which led to the largest number of 1's possible, but often led to very long floats for background images.
My current approach gets the GCD of the whole sequence, which, when it works, works great, but can easily be thrown off course by a single bad apple.
Note that while I could easily just pass the numbers I am given along, or pick some fixed factor, or use one of the conversions I specified above, I am looking for a method to reliably scale this list of integers to small, whole numbers or simple fractions, because this would most likely be unsurprising to the end user; this is not a one off conversion.
The end users tend to use 1.0 as their "base" for manipulations (because it's simple and obvious), so it would make more sense for the sizes of entities to cluster around this.
How about the 'largest number which is a factor of some % of the values'.
So the GCD is the 'largest number which is a factor of 100%' of the values.
You could pick the largest number which is a factor of, say, 60% of the values. I don't know if it's a technical term, but it's sort of a 'rough GCD if not a precise GCD'.
You might have to do trial and error to find it (possibly a binary search). But you could also consider sampling, i.e. if you have a million data points, just pick 100 or 1000 at random to find a number which divides evenly into your goal percentage of the sample set; that might be good enough.
Some rough C++:
#include <algorithm>
#include <vector>

/** Return the fraction of values in samples for which x is a factor. */
double percentIsFactorOf(int x, const std::vector<int>& samples) {
    int factorCount = 0;
    for (int sample : samples)
        if (sample % x == 0) factorCount++;
    return (double)factorCount / samples.size();
}

/** Find the largest value which is a factor of goalPercent of samples. */
int findGoodEnoughCommonFactor(const std::vector<int>& samples, double goalPercent) {
    // Slow algorithm here -- add binary search, sampling, or something
    // smarter to improve it if you like.
    int start = *std::max_element(samples.begin(), samples.end());
    while (percentIsFactorOf(start, samples) < goalPercent)
        start--;
    return start;
}
Your input is in N^2 (two-dimensional space over the natural numbers, i.e. non-negative integers), and you need to output to R^2 (two-dimensional space over the real numbers, which in this case will be represented/approximated with floats).
Forget about scaling for a minute and let the output be of the same scale as the input. The first step is to realize that the input coordinate <0, 0> does not represent <0, 0> in the output; it represents <0.5f, 0.5f>, the center of the pixel. Similarly, the input <2, 3> becomes <2.5, 3.5>. In general, the conversion can be performed like this:
float x_prime = (float)x + 0.5f;
float y_prime = (float)y + 0.5f;
Next, you probably want to pick a scaling factor, as you have mentioned. I've always found it useful to pick some real-world unit, usually meters. This way you can reason about other physical aspects of what you're trying to model, because they have units; i.e. speeds, accelerations, can now be in meters per second, or meters per second squared. How many meters tall or wide is the thing you are making? How many meters is a pixel? Pick something that makes sense, and then your formula becomes this:
float x_prime = ((float)x + 0.5f) * (float)units_per_pixel;
float y_prime = ((float)y + 0.5f) * (float)units_per_pixel;
You may not want all of your output coordinates to be in the positive quadrant; that is, you may want the origin to be in the center of the object. If so, you probably want your starting coordinate system's domain to include negative integers, or you can provide some offset to the true center. Let's say you provide a pixel offset to the true center. Your conversion then becomes this:
float x_prime = ((float)x + 0.5f - (float)x_offset) * (float)units_per_pixel;
float y_prime = ((float)y + 0.5f - (float)y_offset) * (float)units_per_pixel;
Discarding your background information, I understand that the underlying problem you are trying to solve is the following:
Given a finite number of (positive) integers {x_1, ... x_N} find some (rational) number f such that all x_i / f are "nice".
If you insist on "nice" meaning integer and as small as possible, then f = GCD is the (mathematically) exact answer to this question. There just is nothing "better"; if the GCD is 1, tough luck.
If "nice" is supposed to mean rational with small numerator and denominator, the question gets more interesting and depending on what "small" means, find your trade off between small absolute value (f = max) and small denominator (f = GCD). Notice, however, that small numerator/denominator does not mean small floating point representation, e.g. 1/3 = 0.333333... in base 10.
If you want short floating points, make sure that f is a power of your base, i.e. 10 or 2, depending on whether the numbers should look short to the user or actually have a reasonable machine representation. This is what is used for scientific representation of floating points, which might be the best answer to the question of how to make decimal numbers look nice in the first place.
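A sketch of that last idea, snapping a candidate factor down to a power of two so that divisions by it are exact in binary (the function name is mine):

#include <cmath>

// Snap f to the largest power of two that does not exceed it.
double snapToPowerOfTwo(double f) {
    return std::exp2(std::floor(std::log2(f)));
}
// e.g. snapToPowerOfTwo(37.0) == 32.0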
I have no idea what you are talking about with "GL units".
At the most abstract level, GL has no unit. Vertex coordinates are in object-space initially, and go through half a dozen user-defined transformations before they eventually produce coordinates (window-space) with familiar units (pixels).
You are absolutely correct that even in window-space, coordinates are still not whole numbers. You would not want this in fact, or triangles would jump all over the place and generally would not resemble triangles if their vertex positions were snapped to integer pixel coordinates.
Instead, GL throws sub-pixel precision into the mix. Coordinates still ultimately wind up quantized to integer values, but each integer may cover 1/256th of a pixel given 8-bit sub-pixel precision. Pixel coverage testing is done at the sub-pixel level.
(illustration of sub-pixel coverage testing omitted; source: microsoft.com)
GL never attempts to find any conversion factor like you are discussing, it just splits the number space for pixel coordinates up into a fixed division between integral and fractional... fixed-point in other words. You might consider doing the same thing.
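If you do, a minimal version might look like the following; the 8 fractional bits mirror the sub-pixel precision described above, and the 24.8 layout is just one common choice, not something the OP's tools require:

#include <cstdint>
#include <cmath>

// 24.8 fixed point: the high 24 bits are the integer pixel, the low
// 8 bits are 1/256th-of-a-pixel steps.
int32_t toFixed(float v)     { return (int32_t)std::lround(v * 256.0f); }
float   fromFixed(int32_t f) { return (float)f / 256.0f; }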
You can recycle the code you probably already use for vector normalisation and normalise the values to fit within a maximum value of 1. For example:
The formula for normalising a 3D vector works fine here.
Get the length first:
|a| = sqrt((ax * ax) + (ay * ay) + (az * az))
Then you will need to divide the values of each component by the length:
x = ax/|a|
y = ay/|a|
z = az/|a|
Now all the x, y, z values will fall within the range -1 to 1, the same as the OpenGL base coordinate system.
I know this does not generate the whole-number system you would like; however, it does give a smaller, more unified feel to the range.
If you want to limit the range to whole numbers only, simply use a function like the following, which will take the normalised value and convert it to an integer-only range value:
#include <algorithm> // this allows the use of std::min

const int maxVal = 256;

unsigned char convertToSpread(float floatValueToConvert) {
    return (unsigned char) std::min(maxVal - 1, (int)(floatValueToConvert * maxVal));
}
The above will spread your values between 0 and 255, simply increase the value of maxVal to what you need and change the unsigned char to a datatype which suits your needs.
So if you want 1024 values, simply change maxVal to 1024 and unsigned char to unsigned int.
Hope this helps; let me know if you need more information and I can elaborate. :)
Say I have a huge floating-point number, say a trillion decimal places out. Obviously a long double can't hold this. Let's also assume I have a computer with more than enough memory to hold it. How do you do something like this?
You need arbitrary-precision arithmetic.
Arbitrary-precision math.
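For example, with GMP's C++ interface (assuming libgmp/libgmpxx are installed; link with -lgmpxx -lgmp), an integer grows as large as memory allows:

#include <gmpxx.h>
#include <iostream>

int main() {
    mpz_class n = 1;
    for (int i = 2; i <= 1000; ++i)
        n *= i;                 // 1000!, far beyond any built-in type
    std::cout << n.get_str().size() << " digits\n";  // prints: 2568 digits
}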
It's easy to say "arbitrary precision arithmetic" (or something similar), but I think it's worth adding that it's difficult to conceive of ways to put numbers anywhere close to this size to use.
Just for example: the current estimates of the size of the universe are somewhere in the vicinity of 150-200 billion light years. At the opposite end of the spectrum, the diameter of a single electron is estimated at a little less than 1 attometer. 1 light year is roughly 9.46x10^15 meters (for simplicity, we'll treat it as 10^16 meters).
So, let's take 1 attometer as our unit, and figure out the size of number for the diameter of the universe in that unit. 10^18 units/meter * 10^16 meters/light year * 10^11 light years/universe diameter = about a 45-digit number to express the diameter of the universe in units of roughly the diameter of an electron.
Even if we went the next step, and expressed it in terms of the theorized size of a superstring, and added a few extra digits just in case the current estimates are off by a couple orders of magnitude, we'd still end up with a number around 65 digits or so.
This means, for example, that if we knew the diameter of the universe to the size of a single superstring, and we wanted to compute something like volume of the universe in terms of superstring diameters, our largest intermediate result would be something like 600-700 digits or so.
Consider another salient point: if you were to program a 64-bit computer running at, say, 10 GHz to do nothing but count -- incrementing a register once per clock cycle -- it would take roughly 58 years just to cycle through the 64-bit numbers and wrap around to 0 again.
The bottom line is that it's incredibly difficult to come up with excuses (much less real reasons) to carry out calculations to anywhere close to millions, billions/milliards or trillions/billions of digits. The universe isn't that big, doesn't contain that many atoms, etc.
Sounds like what logarithms were invented for.
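In case it isn't obvious how that helps, here is the idea in a few lines (an illustration with made-up inputs): store log10 of the value instead of the value, and multiplication becomes addition.

#include <cmath>
#include <cstdio>

int main() {
    double log_x = 1e12 * std::log10(2.0);  // log10 of 2^(10^12), a ~301-billion-digit number
    double log_y = 5e11 * std::log10(3.0);  // log10 of 3^(5*10^11)
    double log_product = log_x + log_y;     // log10 of their product
    std::printf("product has about %.0f digits\n", std::floor(log_product) + 1);
}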
Without knowing what you intend to do with the number, it's impossible to accurately say how to represent it.
I've implemented a plotting class that is currently capable of handling integer values only. I would like advice about techniques/mechanisms for handling floating-point numbers. The library used is GDI.
At some point, they need to be converted to integers to draw actual pixels.
Generally speaking, however, you do not want to just cast each float to int, and draw -- you'll almost certainly get a mess. Instead, you need/want to scale the floats, then round the scaled value to an integer. In most cases, you'll want to make the scaling factor variable so the user can zoom in and out as needed.
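A minimal sketch of that scale-then-round approach (the scale and origin parameters are placeholders for whatever zoom/pan state your plotting class keeps):

#include <windows.h>
#include <cmath>

// Map a data-space point to pixel coordinates, then draw with GDI.
POINT toPixel(double x, double y, double scale, double originX, double originY) {
    POINT p;
    p.x = (LONG)std::lround((x - originX) * scale);
    p.y = (LONG)std::lround((y - originY) * scale);
    return p;
}

void drawSegment(HDC hdc, double x0, double y0, double x1, double y1,
                 double scale, double originX, double originY) {
    POINT a = toPixel(x0, y0, scale, originX, originY);
    POINT b = toPixel(x1, y1, scale, originX, originY);
    MoveToEx(hdc, a.x, a.y, NULL);
    LineTo(hdc, b.x, b.y);
}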
Another possibility is to let the hardware handle most of the work -- you could use OpenGL (for one example) to render your points, leaving them as floating point internally, and letting the driver/hardware handle issues like scaling and conversion to integers. This has a rather steep cost up-front (learning enough OpenGL to get it to do anything useful), but can have a fairly substantial payoff as well, such as fast, hardware-based rendering, and making it relatively easy to handle some things like scaling and (if you ever need it) being able to display 3D points as easily as 2D.
Edit (mostly a response to the comment): Ultimately it comes down to this: the resolution of a screen is lower than the resolution of a floating-point number. For example, a really high resolution screen might display 2048 pixels horizontally -- that's 11 bits of resolution. Even a single-precision floating-point number has around 24 bits of precision. No matter how you do it, reducing 24-bit resolution to 11-bit resolution is going to lose something -- usually a lot.
That's why you pretty nearly have to make your scaling factor variable -- so the user can choose whether to zoom out and see the whole picture with reduced resolution, or zoom in to see a small part at high resolution.
Since sub-pixel resolution was mentioned: it does help, but only a little. It's not going to resolve a thousand different items that map to a single pixel.
What do these float values represent? I will assume they are some co-ordinates. You will need to know two things:
The source resolution (i.e. the dpi at which these co-ordinates are drawn)
The range that you need to address
After that, this becomes a problem of scaling the points to suitable integer co-ordinates (based on your screen-resolution).
Edit: A simple formula will be:
X(dst) = X(src) * DPI(dst) / DPI(src)
You'll have to convert them to integers and then pass them to functions like MoveTo() and LineTo().
Scale. For example, multiply all the integral values by 10. Multiply the floating point values by 10.0 and then truncate or round (your choice). Now plot as normal.
This will give you extra precision in your graphing. Just remember the scale factor when you look at the picture.
Otherwise convert the floats to int before plotting.
You can try to use GDI+ instead of GDI; it has functions that take float coordinates.
I want to store billions (10^9) of double precision floating point numbers in memory and save space. These values are grouped in thousands of ordered sets (they are time series), and within a set, I know that the difference between values is usually not large (compared to their absolute value). Also, the closer to each other, the higher the probability of the difference being relatively small.
A perfect fit would be a delta encoding that stores only the difference of each value to its predecessor. However, I want random access to subsets of the data, so I can't depend on going through a complete set in sequence. I'm therefore using deltas to a set-wide baseline that yields deltas which I expect to be within 10 to 50 percent of the absolute value (most of the time).
I have considered the following approaches:
divide the smaller value by the larger one, yielding a value between 0 and 1 that could be stored as an integer of some fixed precision plus one bit for remembering which number was divided by which. This is fairly straightforward and yields satisfactory compression, but is not a lossless method and thus only a secondary choice.
XOR the IEEE 754 binary64 encoded representations of both values and store the length of the long stretches of zeroes at the beginning of the exponent and mantissa plus the remaining bits which were different. Here I'm quite unsure how to judge the compression, although I think it should be good in most cases.
Are there standard ways to do this? What might be problems about my approaches above? What other solutions have you seen or used yourself?
Rarely are all the bits of a double-precision number meaningful.
If you have billions of values that are the result of some measurement, find the calibration and error of your measurement device. Quantize the values so that you only work with meaningful bits.
Often, you'll find that you only need 16 bits of actual dynamic range. You can probably compress all of this into arrays of "short" that retain all of the original input.
Use a simple "Z-score technique" where every value is really a signed fraction of the standard deviation.
So a sequence of samples with a mean of m and a standard deviation of s gets transformed into a bunch of Z-scores. Normal Z-score transformations use a double, but you should use a fixed-point version: s/1000 or s/16384 or something that retains only the actual precision of your data, not the noise bits on the end.
# forward: quantize each sample to a fixed-point Z-score
scaled_samples = [int(16384 * (u - m) / s) for u in samples]
# backward: recover an approximation of each original sample
restored = [s * (z / 16384.0) + m for z in scaled_samples]
Your Z-scores retain a pleasant easy-to-work with statistical relationship with the original samples.
Let's say you use a signed 16-bit Z-score. You have a range of -32,768 to +32,767. Scale this by 16,384 and your Z-scores have an effective resolution of 0.000061 decimal.
If you use a signed 24-bit Z-score, you have about +/- 8 million. Scale this by 4,194,304 and you have a resolution of 0.00000024.
I seriously doubt you have measuring devices this accurate. Further, any arithmetic done as part of filter, calibration or noise reduction may reduce the effective range because of noise bits introduced during the arithmetic. A badly thought-out division operator could make a great many of your decimal places nothing more than noise.
Whatever compression scheme you pick, you can decouple it from the problem of needing arbitrary seeks by compressing into fixed-size blocks and prepending to each block a header containing all the data required to decompress it. For a delta encoding scheme, the block would contain deltas encoded in some fashion that takes advantage of their small magnitude to make them take less space (e.g. fewer bits for exponent/mantissa, conversion to a fixed-point value, Huffman encoding, etc.), and the header would contain a single uncompressed sample. Seeking then becomes a matter of cheaply selecting the appropriate block, then decompressing it.
If the compression ratio is so variable that much space is being wasted padding the compressed data to produce fixed size blocks, a directory of offsets into the compressed data could be built instead and the state required to decompress recorded in that.
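To illustrate the block/seek mechanics (not the encoding itself), a toy version with fixed-size blocks and 16-bit fixed-point deltas might look like this; the 1/1024 quantum and block size are arbitrary assumptions, and a real lossless variant would pack variable-width deltas instead:

#include <cstddef>
#include <cstdint>
#include <vector>

// One block: an uncompressed baseline sample plus small fixed-point deltas.
struct Block {
    static const std::size_t N = 256;  // samples per block (baseline + 255 deltas)
    double  baseline;
    int16_t delta[N - 1];              // (value - baseline) * 1024, rounded
};

// Random access: pick the block, then read one delta -- no need to
// decompress the whole series.
double sampleAt(const std::vector<Block>& blocks, std::size_t i) {
    const Block& b = blocks[i / Block::N];
    std::size_t k = i % Block::N;
    return k == 0 ? b.baseline : b.baseline + b.delta[k - 1] / 1024.0;
}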
If you know a group of doubles has the same exponent, you could store the exponent once, and only store the mantissa for each value.