How to parse the division of two very large numbers into a double? - c++

I have a geometric algorithm which takes as input a polygon. However, the files I am supposed to use as input files store the coordinates of the polygons in a rather peculiar way. Each file consists of one line, a counterclockwise sequence of the vertices. Each vertex is represented by its x and y coordinates each of which is written as the quotient of two integers int/int. However, these integers are incredibly large. I wrote a program that parses them from a string into long long using the function std::stoll. However, it appears that some of the numbers in the input file are larger than 2^64.
The output coordinates are usually quite small, in the range 0-1000. How do I go about parsing these numbers and then dividing them, obtaining doubles? Is there any standard library way of doing this, or should I use something like the boost library?

If you are after a ratio of two large numbers as string, you can shorten the strings:
"194725681173571753193674" divided by "635482929374729202" is the same as
"1947256811735717" divided by "6354829293" to at least 9 digits (I just removed the same amount of digits on both sides). Depending on the needed precision, this might be the simplest solution. Just remove digits before converting to long long.

You can parse the inputs directly into a long double (for example with std::stold), I believe. However, that approach introduces rounding errors, so avoid it if exact results matter.
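A minimal sketch of the long double route, assuming each coordinate arrives as a string of the form "numerator/denominator". `parse_fraction` is a made-up name. Both parts are rounded to long double precision before the division, so the result is approximate.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: parses "numerator/denominator" and returns the quotient.
double parse_fraction(const std::string& text) {
    const std::size_t slash = text.find('/');
    if (slash == std::string::npos) {
        throw std::invalid_argument("expected numerator/denominator");
    }
    // std::stold rounds each integer to the nearest representable long double.
    const long double num = std::stold(text.substr(0, slash));
    const long double den = std::stold(text.substr(slash + 1));
    return static_cast<double>(num / den);
}
```

The width of long double is platform dependent. Where it is the same as double, this gives about 15 to 16 significant digits.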
A general solution for precise results is to represent the large integer as an array of integers, where one integer holds the low-order bits, the next integer holds the next higher-order bits, and so on. This is generally called arbitrary precision arithmetic.
"Is there any standard library way of doing this"
No, other than basic building blocks such as vector for storing the array.
"or should I use something like the boost library?"
That's often a good place to start. Boost happens to have a library for this.

Related

How does Microsoft Calculator, calculate such large numbers?

I am up to about 8E10000 so how is it calculating such large number, there is no variable that can hold such large numbers.
Normal types in C can usually only store up to 64 bits. Instead of a single variable, you can use an array of characters to store the digits of your number and write functions for each operation (addition, subtraction and so on) in your program.
You may look at this: GNU Multiple Precision Arithmetic Library
In a nutshell, they aren't using one variable to hold the operands but data structures that can hold arbitrarily long numbers (like an array), and they evaluate operations by treating the number as written in a large radix.
When you actually do a math operation, the operands aren't variables but arrays (or any other suitable data structure), and you perform the operation (where available) component-wise.
When you want to add two arrays, you choose a radix, then loop over the arrays and add op1[i] to op2[i] plus the carry from the previous position. The part of that sum below the radix stays in position i, and the overflow becomes the carry that you add in the next position.
sum = op1[i] + op2[i] + carry
result[i] = sum % radix; carry = sum / radix
You need to be careful in choosing the radix and the underlying data type so that an addition doesn't cause an overflow.
This is also how you add numbers in base 10 by hand, except that there you don't have to think about the radix.
You can also look over this for a bigint package.
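The carry-based addition described above might look roughly like this in C++. The radix 10^9 and the names `BigNum`, `kRadix` and `add` are choices made for this sketch, not taken from any particular library.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Digits are stored least-significant first, each in the range [0, kRadix).
using BigNum = std::vector<std::uint32_t>;

// 10^9: two digits plus a carry of 1 still fit in an unsigned 32-bit integer.
constexpr std::uint32_t kRadix = 1000000000;

BigNum add(const BigNum& a, const BigNum& b) {
    BigNum sum;
    std::uint32_t carry = 0;
    const std::size_t n = std::max(a.size(), b.size());
    for (std::size_t i = 0; i < n || carry != 0; ++i) {
        std::uint32_t total = carry;
        if (i < a.size()) total += a[i];
        if (i < b.size()) total += b[i];
        sum.push_back(total % kRadix);  // digit that stays in position i
        carry = total / kRadix;         // 0 or 1, added in the next position
    }
    return sum;
}
```

For example, adding `{999999999}` and `{1}` yields `{0, 1}`, which represents 10^9.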

How do I write files that gzip well?

I'm working on a web project, and I need to create a format to transmit files very efficiently (lots of data). The data is entirely numerical, and split into a few sections. Of course, this will be transferred with gzip compression.
I can't seem to find any information on what makes a file compress better than another file.
How can I encode floats (32bit) and short integers (16bit) in a format that results in the smallest gzip size?
P.s. it will be a lot of data, so saving 5% means a lot here. There won't likely be any repeats in the floats, but the integers will likely repeat about 5-10 times in each file.
The only way to compress data is to remove redundancy. This is essentially what any compression tool does - it looks for redundant/repeatable parts and replaces them with link/reference to the same data that was observed before in your stream.
If you want to make your data format more efficient, you should remove everything that could be possibly removed. For example, it is more efficient to store numbers in binary rather than in text (JSON, XML, etc). If you have to use text format, consider removing unnecessary spaces or linefeeds.
One good example of an efficient binary format is Google Protocol Buffers. It has lots of benefits, not least that it stores numbers in a variable number of bytes (i.e. the number 1 consumes less space than the number 1000000).
Whether you use text or binary, if you can sort your data before sending it, the gzip compressor has a better chance of finding redundant parts, which will most likely improve the compression ratio.
Since you said 32-bit floats and 16-bit integers, you are already coding them in binary.
Consider the range and useful accuracy of your numbers. If you can limit those, you can recode the numbers using fewer bits. Especially the floats, which may have more bits than you need.
If the right number of bits is not a multiple of eight, then treat your stream of bytes as a stream of bits and use only the bits needed. Be careful to deal with the end of your data properly so that the added bits to go to the next byte boundary are not interpreted as another number.
If your numbers have some correlation to each other, then you should take advantage of that. For example, if the difference between successive numbers is usually small, which is the case for a representation of a waveform for example, then send the differences instead of the numbers. Differences can be coded using variable-length integers or Huffman coding or a combination, e.g. Huffman codes for ranges and extra bits within each range.
If there are other correlations that you can use, then design a predictor for the next value based on the previous values. Then send the difference between the actual and predicted value. In the previous example, the predictor is simply the last value. An example of a more complex predictor is a 2D predictor for when the numbers represent a 2D table and both adjacent rows and columns are correlated. The PNG image format has a few examples of 2D predictors.
All of this will require experimentation with your data, ideally large amounts of your data, to see what helps and what doesn't or has only marginal benefit.
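As a rough illustration of sending differences with variable-length integers, here is a sketch for the 16-bit integers from the question. The zigzag mapping and the 7-bits-per-byte varint layout are one common choice, and the names `delta_encode` and `delta_decode` are made up. Whether this actually beats plain binary after gzip has to be measured on real data.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encodes each value as the difference to its predecessor, zigzag-mapped
// (0, -1, 1, -2, ... -> 0, 1, 2, 3, ...) and written as a varint.
std::vector<std::uint8_t> delta_encode(const std::vector<std::int16_t>& values) {
    std::vector<std::uint8_t> out;
    std::int32_t previous = 0;
    for (std::int16_t v : values) {
        const std::int32_t delta = v - previous;
        previous = v;
        std::uint32_t z = delta < 0 ? static_cast<std::uint32_t>(-delta) * 2 - 1
                                    : static_cast<std::uint32_t>(delta) * 2;
        // Varint: 7 payload bits per byte, high bit set while more bytes follow.
        while (z >= 0x80) {
            out.push_back(static_cast<std::uint8_t>(z | 0x80));
            z >>= 7;
        }
        out.push_back(static_cast<std::uint8_t>(z));
    }
    return out;
}

// Inverse of delta_encode; assumes the bytes were produced by it.
std::vector<std::int16_t> delta_decode(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::int16_t> values;
    std::int32_t previous = 0;
    std::uint32_t z = 0;
    int shift = 0;
    for (std::uint8_t byte : bytes) {
        assert(shift <= 28);  // guards against malformed input
        z |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
        if (byte & 0x80) {  // more bytes belong to this number
            shift += 7;
            continue;
        }
        const std::int32_t delta = (z & 1) ? -static_cast<std::int32_t>((z + 1) / 2)
                                           : static_cast<std::int32_t>(z / 2);
        previous += delta;
        values.push_back(static_cast<std::int16_t>(previous));
        z = 0;
        shift = 0;
    }
    return values;
}
```

Small differences take one byte instead of two, while large jumps cost up to three bytes, so this only pays off when successive values are close.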
Use binary instead of text.
A float in its text representation with 8 digits (a float carries roughly seven to eight significant decimal digits), plus decimal separator, plus field separator, consumes 10 bytes. In binary representation, it takes only 4.
If you need to use text, use hex. It needs fewer digits.
But although this makes a lot of difference for the uncompressed file, these differences might disappear after compression, since the compression algorithm should implicitly take care of that. But you may try.

Is it possible to combine a number of float values into one float value and extract the values when needed?

I am working on an algorithm for an iPhone app where the data I need to keep in memory exceeds the limit. Is it possible to represent a number of float values as one float value and retrieve those values when I need them?
For instance:
float array[4];
array[0]=0.12324;
array[1]=0.56732;
array[2]=0.86555;
array[3]=0.34545;
float combinedvalue=?
Not in general, no. You can't store 4N bits of information in only N bits.
If there's some pattern in your numbers, then you might find a scheme. For example, if all your numbers are of similar value, you could potentially store only the differences between the numbers in lower precision.
However, this kind of thing is difficult, and limited.
If those numbers are exactly 5 digits each, you can treat them as ints by multiplying with 100000. Then you'll need 17 bits for each number, 68 bits in total, which (with some bit-shifting) takes up 9 bytes. Does that help, 9 bytes instead of 16?
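The 17-bits-per-value idea could be sketched as follows, assuming every value lies in [0, 1) and has at most five decimal digits. `pack4` and `unpack4` are hypothetical names, and the bit order within the 9 bytes is an arbitrary choice.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Packs four values in [0, 1) into 9 bytes: 4 * 17 = 68 bits, 4 bits unused.
std::array<std::uint8_t, 9> pack4(const std::array<float, 4>& values) {
    std::array<std::uint8_t, 9> out{};
    int bit = 0;  // next free bit position in the output
    for (float v : values) {
        assert(v >= 0.0f && v < 1.0f);
        // 0..99999 (100000 if v rounds up), which is below 2^17 = 131072.
        const auto scaled = static_cast<std::uint32_t>(std::lround(v * 100000.0));
        for (int k = 0; k < 17; ++k, ++bit) {
            if ((scaled >> k) & 1u) out[bit / 8] |= static_cast<std::uint8_t>(1u << (bit % 8));
        }
    }
    return out;
}

std::array<float, 4> unpack4(const std::array<std::uint8_t, 9>& in) {
    std::array<float, 4> values{};
    int bit = 0;
    for (float& v : values) {
        std::uint32_t scaled = 0;
        for (int k = 0; k < 17; ++k, ++bit) {
            if ((in[bit / 8] >> (bit % 8)) & 1u) scaled |= 1u << k;
        }
        v = static_cast<float>(scaled / 100000.0);
    }
    return values;
}
```

Any digits beyond the fifth decimal place are lost in the round trip, so this is only a fit when the data really has that fixed precision.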
Please note that the implementation of your algorithm will also take up memory!
What you are requiring could be accomplished in several different ways.
For instance, in C++ you generally have single precision floats (4 bytes) as the smallest precision available, though I wouldn't be surprised if there are other packages that handle smaller precision floating point values.
Therefore, if you are using double precision floating point values and can get by with less precision then you can switch to a smaller precision.
Now, depending on the range of values you want to store, you might be able to use a fixed-point representation as well, but you will need to be familiar with the nuances of bit shifting and masking, etc. An added benefit of this approach is that it could make your program run faster, since fixed-point (integer) arithmetic can be faster than floating-point arithmetic, depending on the hardware.
The choice of options depends on your data you need to store and how comfortable you are with lower level binary arithmetic.

How do I compress a large number of similar doubles?

I want to store billions (10^9) of double precision floating point numbers in memory and save space. These values are grouped in thousands of ordered sets (they are time series), and within a set, I know that the difference between values is usually not large (compared to their absolute value). Also, the closer to each other, the higher the probability of the difference being relatively small.
A perfect fit would be a delta encoding that stores only the difference of each value to its predecessor. However, I want random access to subsets of the data, so I can't depend on going through a complete set in sequence. I'm therefore using deltas to a set-wide baseline that yields deltas which I expect to be within 10 to 50 percent of the absolute value (most of the time).
I have considered the following approaches:
divide the smaller value by the larger one, yielding a value between 0 and 1 that could be stored as an integer of some fixed precision plus one bit for remembering which number was divided by which. This is fairly straightforward and yields satisfactory compression, but is not a lossless method and thus only a secondary choice.
XOR the IEEE 754 binary64 encoded representations of both values and store the length of the long stretches of zeroes at the beginning of the exponent and mantissa plus the remaining bits which were different. Here I'm quite unsure how to judge the compression, although I think it should be good in most cases.
Are there standard ways to do this? What might be problems about my approaches above? What other solutions have you seen or used yourself?
Rarely are all the bits of a double-precision number meaningful.
If you have billions of values that are the result of some measurement, find the calibration and error of your measurement device. Quantize the values so that you only work with meaningful bits.
Often, you'll find that you only need 16 bits of actual dynamic range. You can probably compress all of this into arrays of "short" that retain all of the original input.
Use a simple "Z-score technique" where every value is really a signed fraction of the standard deviation.
So a sequence of samples with a mean of m and a standard deviation of s gets transformed into a sequence of Z-scores. Normal Z-score transformations use a double, but you should use a fixed-point version of that double, in units of s/1000 or s/16384 or whatever retains only the actual precision of your data, not the noise bits on the end.
for u in samples:
    z = int( 16384*(u-m)/s )
for z in scaled_samples:
    u = s*(z/16384.0)+m
Your Z-scores retain a pleasant easy-to-work with statistical relationship with the original samples.
Let's say you use a signed 16-bit Z-score. You have +/- 32,768. Scale this by 16,384 and your Z-scores have an effective resolution of 0.000061 decimal.
If you use a signed 24-bit Z-score, you have +/- 8 million. Scale this by 4,194,304 and you have a resolution of 0.00000024.
I seriously doubt you have measuring devices this accurate. Further, any arithmetic done as part of filter, calibration or noise reduction may reduce the effective range because of noise bits introduced during the arithmetic. A badly thought-out division operator could make a great many of your decimal places nothing more than noise.
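A C++ sketch of the fixed-point Z-score idea, under a few assumptions of mine: the population standard deviation is used, values are rounded rather than truncated, and out-of-range scores are clamped. `Quantized`, `quantize` and `restore` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr double kScale = 16384.0;  // one step is stddev / 16384

struct Quantized {
    double mean;
    double stddev;
    std::vector<std::int16_t> z;  // (u - mean) / stddev in units of 1 / kScale
};

Quantized quantize(const std::vector<double>& samples) {
    Quantized q{0.0, 1.0, {}};
    if (samples.empty()) return q;
    for (double u : samples) q.mean += u;
    q.mean /= samples.size();
    double variance = 0.0;
    for (double u : samples) variance += (u - q.mean) * (u - q.mean);
    variance /= samples.size();
    if (variance > 0.0) q.stddev = std::sqrt(variance);
    for (double u : samples) {
        const double scaled = std::round(kScale * (u - q.mean) / q.stddev);
        // With this scale, 16 bits cover only about +/- 2 standard deviations,
        // so anything further out is clamped (and therefore distorted).
        q.z.push_back(static_cast<std::int16_t>(std::max(-32768.0, std::min(32767.0, scaled))));
    }
    return q;
}

double restore(const Quantized& q, std::size_t i) {
    assert(i < q.z.size());
    return q.stddev * (q.z[i] / kScale) + q.mean;
}
```

If the data has outliers beyond roughly two standard deviations, a smaller scale factor or a wider integer type would be needed to avoid the clamping.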
Whatever compression scheme you pick, you can decouple it from the need to perform arbitrary seeks by compressing into fixed-size blocks and prepending to each block a header containing all the data required to decompress it. For a delta encoding scheme, for example, the header would hold a single uncompressed sample, and the block would contain deltas encoded in some fashion that takes advantage of their small magnitude to save space (fewer bits for exponent/mantissa, conversion to fixed-point values, Huffman encoding, etc.). Seeking then becomes a matter of cheaply selecting the appropriate block, then decompressing it.
If the compression ratio is so variable that much space is being wasted padding the compressed data to produce fixed size blocks, a directory of offsets into the compressed data could be built instead and the state required to decompress recorded in that.
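To make the block idea concrete, here is a small sketch with one uncompressed sample per block as the header and quantized deltas after it. The block size, the quantization step `kStep` (which makes this variant lossy) and all names are assumptions for illustration. A real encoder would also store the deltas in fewer bits than the `int32_t` used here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlockSize = 4;  // samples per block (tiny, for illustration)
constexpr double kStep = 1e-3;         // assumed quantization step; lossy

struct Block {
    double first = 0.0;                // header: one uncompressed sample
    std::vector<std::int32_t> deltas;  // differences to the previous sample, in units of kStep
};

std::vector<Block> compress(const std::vector<double>& samples) {
    std::vector<Block> blocks;
    double reconstructed = 0.0;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        if (i % kBlockSize == 0) {
            blocks.push_back(Block{});
            blocks.back().first = samples[i];
            reconstructed = samples[i];
        } else {
            // Delta against the reconstructed value so rounding errors do not accumulate.
            const auto d = static_cast<std::int32_t>(std::lround((samples[i] - reconstructed) / kStep));
            blocks.back().deltas.push_back(d);
            reconstructed += d * kStep;
        }
    }
    return blocks;
}

// Random access: pick the block by index, then replay at most kBlockSize - 1 deltas.
double sample_at(const std::vector<Block>& blocks, std::size_t index) {
    assert(index / kBlockSize < blocks.size());
    const Block& block = blocks[index / kBlockSize];
    double value = block.first;
    for (std::size_t k = 0; k < index % kBlockSize; ++k) value += block.deltas[k] * kStep;
    return value;
}
```

The block size trades seek cost against overhead: larger blocks amortize the header better but require replaying more deltas per lookup.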
If you know a group of doubles has the same exponent, you could store the exponent once, and only store the mantissa for each value.

How to output floating point numbers in the original format in C++?

It's a coding practice. I read these numbers as double from a file:
112233 445566
8717829120000 2.4
16000000 1307674.368
10000 2092278988.8
1234567 890123
After some computation, I should output some of them. I want to make them appear just the same as in the file, no filling zeros, no scientific notation, how could I achieve it? Do I have to read in as string then convert them?
Edit: Erm... Do you guys mean that there is actually no way for the program to know what the numbers originally looked like?
If you want the output to be identical to the input, then yes, you need to read them in as strings and store the strings to be output later.
Why? When dealing with floating point numbers, the computer can't represent most decimal fractional parts exactly in binary. So in a number like 2.4, the internal representation won't be exactly 2.4, it will be slightly different. Most of the time, the C/C++ I/O libraries will take such a binary number and print 2.4, but for some numbers, it might print something like 2.40000000001 or 2.399999999.
So, that's why you want to keep the original strings around.
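A small sketch that makes the inexact representation visible. `format` is a hypothetical helper that prints a double with a chosen number of significant digits using the default stream notation.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Formats a double with the given number of significant digits.
std::string format(double value, int digits) {
    std::ostringstream out;
    out << std::setprecision(digits) << value;
    return out.str();
}
```

With the default of 6 significant digits, `format(2.4, 6)` gives back "2.4". With 17 digits the stored binary value shows through and the result is no longer "2.4", whereas a value like 0.5 that is exactly representable in binary still prints as "0.5".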
There's no built-in way to do this. Yes, you'll need to keep the original representation, and some link between that representation and the parsed number (a map<string, double> for example).
If you need to output the original numbers as they were, you might be better off storing the original strings somewhere. During the decimal -> binary -> decimal conversion, some precision may get lost, due to the precision limits of double. You may not end up printing the exact same decimal digits.
You read them as strings, record all the many finicky aspects (do they have decimal parts, how long, any padding zeroes, etc, etc) and remember all those aspects for each field, convert them into numbers if you absolutely must, then use all the crazy little aspects you stored when formatting the numbers for output again. Crazy, but the only way to achieve literally what you're asking for.
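The simplest form of the advice above is to keep each original token next to its parsed value and print the token, not the double. A minimal sketch, with `Field` and `read_fields` as made-up names and whitespace-separated input assumed:

```cpp
#include <sstream>
#include <string>
#include <vector>

struct Field {
    std::string text;  // exactly as it appeared in the input
    double value;      // parsed value, used only for computation
};

// Splits the input on whitespace and keeps both representations of each number.
std::vector<Field> read_fields(const std::string& input) {
    std::vector<Field> fields;
    std::istringstream in(input);
    std::string token;
    while (in >> token) {
        fields.push_back(Field{token, std::stod(token)});  // std::stod throws on non-numeric tokens
    }
    return fields;
}
```

Computations use `value`, and output writes `text`, so "2.4" and "2092278988.8" come out exactly as they went in. This only works for numbers that are passed through unchanged. Newly computed results still have to be formatted from a double.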