How do I compress a large number of similar doubles? - compression

I want to store billions (10^9) of double precision floating point numbers in memory and save space. These values are grouped in thousands of ordered sets (they are time series), and within a set, I know that the difference between values is usually not large (compared to their absolute value). Also, the closer to each other, the higher the probability of the difference being relatively small.
A perfect fit would be a delta encoding that stores only the difference of each value to its predecessor. However, I want random access to subsets of the data, so I can't depend on going through a complete set in sequence. I'm therefore using deltas to a set-wide baseline that yields deltas which I expect to be within 10 to 50 percent of the absolute value (most of the time).
I have considered the following approaches:
divide the smaller value by the larger one, yielding a value between 0 and 1 that could be stored as an integer of some fixed precision plus one bit for remembering which number was divided by which. This is fairly straightforward and yields satisfactory compression, but is not a lossless method and thus only a secondary choice.
XOR the IEEE 754 binary64 encoded representations of both values and store the length of the long stretches of zeroes at the beginning of the exponent and mantissa plus the remaining bits which were different. Here I'm quite unsure how to judge the compression, although I think it should be good in most cases.
Are there standard ways to do this? What might be problems about my approaches above? What other solutions have you seen or used yourself?
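A minimal sketch of the XOR idea from the second approach, assuming Python's struct module is used to reinterpret the bits. The helper names double_to_bits, bits_to_double, xor_encode and xor_decode are made up for illustration, and the sketch only shows how many leading bits cancel out, not a complete bit-packing format.

```python
import struct

def double_to_bits(x: float) -> int:
    """Reinterpret an IEEE 754 binary64 value as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_to_double(b: int) -> float:
    """Inverse of double_to_bits."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def xor_encode(value: float, baseline: float):
    """Return (leading_zero_count, residual) for value XOR baseline."""
    residual = double_to_bits(value) ^ double_to_bits(baseline)
    leading_zeros = 64 - residual.bit_length()
    return leading_zeros, residual

def xor_decode(residual: int, baseline: float) -> float:
    """Recover the original value exactly from the residual and the baseline."""
    return bits_to_double(residual ^ double_to_bits(baseline))

zeros, residual = xor_encode(101.0, 100.0)
restored = xor_decode(residual, 100.0)
```

Only the low 64 - zeros bits of residual would need to be stored, plus a small field for the count, so the compression depends on how many leading bits the value shares with the baseline.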

Rarely are all the bits of a double-precision number meaningful.
If you have billions of values that are the result of some measurement, find the calibration and error of your measurement device. Quantize the values so that you only work with meaningful bits.
Often, you'll find that you only need 16 bits of actual dynamic range. You can probably compress all of this into arrays of "short" that retain all of the original input.
Use a simple "Z-score technique" where every value is really a signed fraction of the standard deviation.
So a sequence of samples with a mean of m and a standard deviation of s gets transformed into a sequence of Z-scores. Normal Z-score transformations use a double, but you should use a fixed-point version of that double, in steps of s/1000 or s/16384 or something else that retains only the actual precision of your data, not the noise bits on the end.
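# Encode: fixed-point Z-score, 16384 steps per standard deviation
for u in samples:
    z = int(16384 * (u - m) / s)
# Decode: approximate inverse of the encoding
for z in scaled_samples:
    u = s * (z / 16384.0) + m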
Your Z-scores retain a pleasant easy-to-work with statistical relationship with the original samples.
Let's say you use a signed 16-bit Z-score. You have +/- 32,768. With a scale of 16,384 that covers roughly +/- 2 standard deviations, and your Z-scores have an effective resolution of about 0.000061.
If you use a signed 24-bit Z-score, you have +/- 8 million. With a scale of 4,194,304 you have a resolution of about 0.00000024.
I seriously doubt you have measuring devices this accurate. Further, any arithmetic done as part of filter, calibration or noise reduction may reduce the effective range because of noise bits introduced during the arithmetic. A badly thought-out division operator could make a great many of your decimal places nothing more than noise.
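The round trip described above can be sketched as follows. This is only an illustration under assumptions: the sample values are invented, the population standard deviation is used, and outliers are clamped to the 16-bit range (which with a scale of 16384 means roughly +/- 2 standard deviations), a choice the answer does not spell out.

```python
import statistics
from array import array

SCALE = 16384  # fixed-point steps per standard deviation

def quantize(samples):
    """Return (mean, stdev, signed 16-bit array of fixed-point Z-scores)."""
    m = statistics.fmean(samples)
    s = statistics.pstdev(samples)
    z = array("h")  # signed 16-bit storage
    for u in samples:
        q = round(SCALE * (u - m) / s)
        # Assumption: clamp values that fall outside the int16 range.
        z.append(max(-32768, min(32767, q)))
    return m, s, z

def restore(m, s, z):
    """Approximate inverse of quantize."""
    return [s * (q / SCALE) + m for q in z]

samples = [20.1, 20.4, 19.8, 20.0, 20.3, 19.9]  # invented measurements
m, s, z = quantize(samples)
restored = restore(m, s, z)
max_error = max(abs(a - b) for a, b in zip(samples, restored))
```

For samples that are not clamped, the reconstruction error stays within one quantization step of s / SCALE, which is the sense in which only the noise bits are discarded.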

Whatever compression scheme you pick, you can decouple it from the need for arbitrary seeks by compressing into fixed-size blocks and prepending to each block a header containing all the data required to decompress it. For a delta encoding scheme, for example, the header would hold a single uncompressed sample, and the block would contain deltas encoded in some fashion that takes advantage of their small magnitude to make them take less space (fewer bits for exponent/mantissa, conversion to a fixed-point value, Huffman encoding, etc.). Seeking then becomes a matter of cheaply selecting the appropriate block, then decompressing it.
If the compression ratio is so variable that much space is being wasted padding the compressed data to produce fixed size blocks, a directory of offsets into the compressed data could be built instead and the state required to decompress recorded in that.
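A minimal sketch of the block idea, assuming a fixed number of samples per block rather than a fixed number of bytes, and using lossless integer deltas between the 64-bit patterns of successive doubles. BLOCK_SIZE, encode_blocks and get are hypothetical names, and the deltas are left as plain integers; a real scheme would go on to store them in fewer bits.

```python
import struct

BLOCK_SIZE = 4  # samples per block; tiny here for illustration (assumption)

def to_bits(x: float) -> int:
    """Reinterpret a double as a signed 64-bit integer (lossless)."""
    return struct.unpack("<q", struct.pack("<d", x))[0]

def from_bits(b: int) -> float:
    """Inverse of to_bits."""
    return struct.unpack("<d", struct.pack("<q", b))[0]

def encode_blocks(samples):
    """Split samples into blocks of (header, deltas).

    header: bit pattern of the block's first sample, stored uncompressed.
    deltas: differences between successive bit patterns within the block.
    """
    blocks = []
    for start in range(0, len(samples), BLOCK_SIZE):
        bits = [to_bits(x) for x in samples[start:start + BLOCK_SIZE]]
        deltas = [b - a for a, b in zip(bits, bits[1:])]
        blocks.append((bits[0], deltas))
    return blocks

def get(blocks, index: int) -> float:
    """Random access: pick the block, then decode only up to the wanted sample."""
    header, deltas = blocks[index // BLOCK_SIZE]
    bits = header
    for d in deltas[: index % BLOCK_SIZE]:
        bits += d
    return from_bits(bits)

samples = [100.0, 100.5, 101.25, 99.75, 98.0, 98.5, 97.0, 103.0, 104.5]
blocks = encode_blocks(samples)
```

Reading one sample costs at most BLOCK_SIZE - 1 additions, so the block size trades seek cost against the overhead of one uncompressed header per block.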

If you know a group of doubles has the same exponent, you could store the exponent once, and only store the mantissa for each value.
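As a sketch of what storing the exponent once could look like, the following uses math.frexp and math.ldexp from the Python standard library. split_group and join_group are hypothetical helpers, and the sketch assumes normal, finite values that really do share one exponent.

```python
import math

def split_group(values):
    """Return one shared exponent plus a signed integer mantissa per value.

    Raises ValueError if the values do not all share the same exponent.
    """
    exponents = {math.frexp(v)[1] for v in values}
    if len(exponents) != 1:
        raise ValueError("values do not share a single exponent")
    exponent = exponents.pop()
    # frexp returns a mantissa m with 0.5 <= |m| < 1; scaling by 2**53
    # turns it into an exact integer of at most 53 bits.
    mantissas = [int(math.frexp(v)[0] * 2**53) for v in values]
    return exponent, mantissas

def join_group(exponent, mantissas):
    """Exact inverse of split_group."""
    return [math.ldexp(m, exponent - 53) for m in mantissas]

exponent, mantissas = split_group([1.0, 1.5, 1.75, 1.999])
restored = join_group(exponent, mantissas)
```

The saving is modest: 11 exponent bits out of 64 per value, with the sign still kept alongside each mantissa.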

Related

Does log2 (n)*(x) set compression limits.

I may get all kinds of flags and penalties thrown at me for this. So please be patient. 2 questions
If the minimal number of bits needed to represent an arbitrary string of decimal digits is calculated by log2(n)*x, where n is the range and x is the length, then should you be able to calculate the maximum compression of a file by converting it from binary to decimal?
Is this result a law that one can not compress below the theoretical min compression limit, or is it an approximated limit?
Jon Hutton
It's actually a bit (ha) trickier. That formula assumes that the number is drawn from a uniform distribution, which is often not the case, but notably is the case for what is commonly called "random data" (though that is an inaccurate name, since data may be random but drawn from a non-uniform distribution).
The entropy H of X in bits is given by the formula:
H(X) = - sum[i](P(x[i]) log2(P(x[i])))
Where P gives the probability of every value x[i] that X may take. The bounds of i are implied and irrelevant, impossible options have a probability of zero anyway. In the uniform case, P(x[i]) is (by definition) 1/N for any possible x[i], we have H(X) = -N * (1/N log2(1/N)) = -log2(1/N) = log2(N).
The formula should in general not simply be multiplied by the length of the data, that only works if all symbols are independent and identically distributed (so for example on your file with IID uniform-random digits, it does work). Often for meaningful data, the probability distribution for a symbol depends on its context, and indeed a lot of compression techniques are aimed at exploiting this.
There is no law that says you cannot get lucky and thereby compress an individual file to fewer bits than are suggested by its entropy. You can arrange for it to be possible on purpose (but it won't necessarily happen), for example, let's say we expect that any letter is equally probable, but we decide to go against the flow and encode an A with the single bit 0, and any other letter as a 1 followed by 5 bits that indicate which letter it is. This is obviously a bad encoding given the expectation, there are only 26 letters and they're equally probable but we're using more than log2(26) ≈ 4.7 bits on average, the average would be (1 + 25 * 6)/26 ≈ 5.8. However, if by some accident we happen to actually get an A (there is a chance of 1/26th that this happens, the odds are not too bad), we compress it to a single bit, which is much better than expected. Of course one cannot rely on luck, it can only come as a surprise.
For further reference you could read about entropy (information theory) on Wikipedia.
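The numbers in this answer can be checked with a few lines of Python. This is just a sketch of the entropy formula given above; entropy_bits is a made-up helper name and the skewed distribution at the end is an invented example.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy H(X) = -sum(p * log2(p)) in bits.

    Zero-probability outcomes are skipped, since they contribute nothing.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Uniform case: 26 equally likely letters.
h_uniform = entropy_bits([1 / 26] * 26)

# Average length of the deliberately poor code from the text:
# 'A' costs 1 bit, each of the other 25 letters costs 1 + 5 = 6 bits.
average_code_length = (1 * 1 + 25 * 6) / 26

# Invented skewed distribution: 'A' half the time, the rest shared equally.
h_skewed = entropy_bits([0.5] + [0.5 / 25] * 25)
```

Here h_uniform matches log2(26), the poor code's average length lies above it, and the skewed distribution has lower entropy than the uniform one, which is the room a compressor can exploit.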

How do I write files that gzip well?

I'm working on a web project, and I need to create a format to transmit files very efficiently (lots of data). The data is entirely numerical, and split into a few sections. Of course, this will be transferred with gzip compression.
I can't seem to find any information on what makes a file compress better than another file.
How can I encode floats (32bit) and short integers (16bit) in a format that results in the smallest gzip size?
P.s. it will be a lot of data, so saving 5% means a lot here. There won't likely be any repeats in the floats, but the integers will likely repeat about 5-10 times in each file.
The only way to compress data is to remove redundancy. This is essentially what any compression tool does - it looks for redundant/repeatable parts and replaces them with link/reference to the same data that was observed before in your stream.
If you want to make your data format more efficient, you should remove everything that could be possibly removed. For example, it is more efficient to store numbers in binary rather than in text (JSON, XML, etc). If you have to use text format, consider removing unnecessary spaces or linefeeds.
One good example of an efficient binary format is Google Protocol Buffers. It has lots of benefits, not least that it stores integers in a variable number of bytes (i.e. the number 1 consumes less space than the number 1000000).
Whether the format is text or binary, sorting your data before sending can increase the chance that the gzip compressor finds redundant parts, and so will most likely increase the compression ratio.
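To illustrate the variable-length idea, here is a sketch of a base-128 varint in the style Protocol Buffers uses for unsigned integers. It is a hand-written illustration, not the library's API; encode_varint and decode_varint are hypothetical names and negative numbers are not handled.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, least significant first.

    The high bit of each byte signals that more bytes follow.
    """
    if n < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)         # last group
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Decode a single varint produced by encode_varint."""
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            break
    return result

small = encode_varint(1)
large = encode_varint(1_000_000)
```

Small values take a single byte while 1000000 takes three, so the format pays off when most numbers are small.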
Since you said 32-bit floats and 16-bit integers, you are already coding them in binary.
Consider the range and useful accuracy of your numbers. If you can limit those, you can recode the numbers using fewer bits. Especially the floats, which may have more bits than you need.
If the right number of bits is not a multiple of eight, then treat your stream of bytes as a stream of bits and use only the bits needed. Be careful to deal with the end of your data properly so that the added bits to go to the next byte boundary are not interpreted as another number.
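A sketch of treating the output as a stream of bits, assuming unsigned values of a fixed width. pack_bits and unpack_bits are hypothetical helpers; they build one large integer for simplicity, which is fine for a demonstration but would be slow for very long streams.

```python
def pack_bits(values, width: int) -> bytes:
    """Pack unsigned integers of `width` bits each into a byte string.

    The final byte is zero-padded; the caller must remember the value count
    so the padding is not decoded as an extra number.
    """
    acc = 0
    for v in values:
        if not 0 <= v < (1 << width):
            raise ValueError(f"{v} does not fit in {width} bits")
        acc = (acc << width) | v
    total_bits = width * len(values)
    padding = -total_bits % 8
    return (acc << padding).to_bytes((total_bits + padding) // 8, "big")

def unpack_bits(data: bytes, width: int, count: int):
    """Inverse of pack_bits; `count` says where the real data ends."""
    acc = int.from_bytes(data, "big")
    padding = len(data) * 8 - width * count
    acc >>= padding  # drop the padding bits at the end
    mask = (1 << width) - 1
    return [(acc >> (width * (count - 1 - i))) & mask for i in range(count)]

packed = pack_bits([5, 1023, 0], width=10)
unpacked = unpack_bits(packed, width=10, count=3)
```

Three 10-bit values occupy 30 bits and therefore 4 bytes, with 2 padding bits that the explicit count keeps from being read as data.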
If your numbers have some correlation to each other, then you should take advantage of that. For example, if the difference between successive numbers is usually small, which is the case for a representation of a waveform for example, then send the differences instead of the numbers. Differences can be coded using variable-length integers or Huffman coding or a combination, e.g. Huffman codes for ranges and extra bits within each range.
If there are other correlations that you can use, then design a predictor for the next value based on the previous values. Then send the difference between the actual and predicted value. In the previous example, the predictor is simply the last value. An example of a more complex predictor is a 2D predictor for when the numbers represent a 2D table and both adjacent rows and columns are correlated. The PNG image format has a few examples of 2D predictors.
All of this will require experimentation with your data, ideally large amounts of your data, to see what helps and what doesn't or has only marginal benefit.
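The kind of experiment suggested here can be sketched with the standard zlib module, which implements the same deflate compression that gzip uses. The signal below is synthetic (a seeded random walk standing in for real data), and delta_encode, delta_decode and packed_size are hypothetical helper names.

```python
import random
import struct
import zlib

def delta_encode(values):
    """Keep the first value, then store each difference from the previous one."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Inverse of delta_encode: a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# Synthetic slowly varying signal (assumption: stands in for real data).
rng = random.Random(0)
signal, level = [], 0
for _ in range(5000):
    level += rng.randint(-3, 3)
    signal.append(level)

def packed_size(values):
    """Size in bytes after packing as little-endian int16 and deflating."""
    raw = struct.pack(f"<{len(values)}h", *values)
    return len(zlib.compress(raw, 9))

raw_size = packed_size(signal)
delta_size = packed_size(delta_encode(signal))
```

Comparing raw_size with delta_size on your own data shows whether the last-value predictor helps; on this synthetic signal the delta form compresses to fewer bytes.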
Use binary instead of text.
A float in its text representation with 8 digits (a 32-bit float carries about 7 significant decimal digits, and up to 9 are needed to round-trip it exactly), plus decimal separator, plus field separator, consumes 10 bytes. In binary representation, it takes only 4.
If you need to use text, use hex. It consumes fewer digits.
Although this makes a lot of difference for the uncompressed file, the differences might disappear after compression, since the compression algorithm should implicitly take care of that. But you may try.
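One way to "try" is to measure both encodings before and after compression. The sketch below uses invented sample values and the standard struct and zlib modules; the sizes it produces say nothing about real data, so treat it only as a template for the experiment.

```python
import struct
import zlib

# Made-up sample data; replace with real values for a meaningful comparison.
values = [i * 0.37 + 0.001 * (i % 7) for i in range(2000)]

# Text: up to 8 significant digits per value, comma-separated.
as_text = ",".join(f"{v:.8g}" for v in values).encode("ascii")
# Binary: 4 bytes per value as little-endian 32-bit floats.
as_binary = struct.pack(f"<{len(values)}f", *values)

sizes = {
    "text": len(as_text),
    "binary": len(as_binary),
    "text+deflate": len(zlib.compress(as_text, 9)),
    "binary+deflate": len(zlib.compress(as_binary, 9)),
}
```

The uncompressed binary form is always 4 bytes per value; whether that advantage survives compression is exactly what the two deflate entries let you check.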

How do I find the largest integer fully supported by hardware arithmetics?

I am implementing a BigInt class that must support arbitrary-precision operations on integers.
Quote from "The Algorithm Design Manual" by S.Skiena:
What base should I do [editor's note: arbitrary-precision] arithmetic in? - It is perhaps simplest to implement your own high-precision arithmetic package in decimal, and thus represent each integer as a string of base-10 digits. However, it is far more efficient to use a higher base, ideally equal to the square root of the largest integer supported fully by hardware arithmetic.
How do I find the largest integer supported fully by hardware arithmetic? If I understand correctly, since my machine is an x64-based PC, the largest integer supported should be 2^64 - 1 (http://en.wikipedia.org/wiki/X86-64 - Architectural features: 64-bit integer capability), so I should use base 2^32. Is there a way in C++ to get this size programmatically so I can typedef my base_type to it?
You might be searching for std::uintmax_t and std::intmax_t.
static_cast<unsigned>(-1) is the maximum unsigned int, i.e. all bits set to 1. Is that what you are looking for?
You can also use std::numeric_limits<unsigned>::max() or UINT_MAX; all of these yield the same result. What these values tell you is the maximum capacity of the unsigned type, i.e. the maximum value that can be stored in an unsigned.
int (and, by extension, unsigned int) is the "natural" size for the architecture. So a type that has half the bits of an int should work reasonably well. Beyond that, you really need to configure for the particular hardware; the type of the storage unit and the type of the calculation unit should be typedefs in a header and their type selected to match the particular processor. Typically you'd make this selection after running some speed tests.
INT_MAX doesn't help here; it tells you the largest value that can be stored in an int, which may or may not be the largest value that the hardware can support directly. Similarly, INTMAX_MAX is no help, either; it tells you the largest value that can be stored as an integral type, but doesn't tell you whether operations on such a value can be done in hardware or require software emulation.
Back in the olden days, the rule of thumb was that operations on ints were done directly in hardware, and operations on longs were done as multiple integer operations, so operations on longs were much slower than operations on ints. That's no longer a good rule of thumb.
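Why a base of half the word size works can be sketched with schoolbook multiplication. The example is written in Python purely for illustration and assumes a 64-bit hardware word with 32-bit limbs; since Python integers never overflow, the assert stands in for the limit a type such as a 64-bit unsigned integer would impose. to_limbs, from_limbs and multiply are hypothetical names.

```python
BASE_BITS = 32            # half of an assumed 64-bit hardware word
BASE = 1 << BASE_BITS
WORD_MAX = (1 << 64) - 1  # largest value of the assumed 64-bit unsigned word

def to_limbs(n: int):
    """Split a non-negative integer into base-2**32 limbs, least significant first."""
    limbs = []
    while True:
        limbs.append(n & (BASE - 1))
        n >>= BASE_BITS
        if n == 0:
            return limbs

def from_limbs(limbs) -> int:
    """Reassemble an integer from its limbs."""
    return sum(limb << (BASE_BITS * i) for i, limb in enumerate(limbs))

def multiply(a, b):
    """Schoolbook multiplication; every intermediate fits in one 64-bit word."""
    result = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            t = x * y + result[i + j] + carry
            assert t <= WORD_MAX  # limb product plus two limbs never overflows
            result[i + j] = t & (BASE - 1)
            carry = t >> BASE_BITS
        result[i + len(b)] = carry
    return result

x, y = 2**100 + 12345, 2**90 + 6789
product = from_limbs(multiply(to_limbs(x), to_limbs(y)))
worst_case = (BASE - 1) ** 2 + 2 * (BASE - 1)
```

The worst case of one limb product plus an accumulated limb plus a carry is exactly 2^64 - 1, which is why the base is chosen as the square root of the largest natively supported integer.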
Things are not so black and white. There are MANY issues here, and you may have other things worth considering. I've now written two variable precision tools (in MATLAB, VPI and HPF) and I've chosen different approaches in each. It also matters whether you are writing an integer form or a high precision floating point form.
The difference is, integers can grow without bound in the number of digits. But if you are doing a floating point implementation with a user specified number of digits, you always know the number of digits in the mantissa. This is fixed.
First of all, it is simplest to use a single integer for each decimal digit. This makes many things work nicely, so I/O is easy. It is a bit inefficient in terms of storage though. Adds and subtracts are easy though. And if you use integers for each digit, then multiplies are even easy. In MATLAB for example, conv is pretty fast, though it is still O(n^2). I think gmp uses an fft multiply, so faster yet.
But assuming you use a basic conv multiply, then you need to worry about overflows for numbers with a huge number of digits. For example, suppose I store decimal digits as 8 bit signed integers. Using conv, followed by carries, I can do a multiply. For example, suppose I have the number 9999.
N = repmat(9,1,4)
N =
9 9 9 9
conv(N,N)
ans =
81 162 243 324 243 162 81
Thus even to form the product 9999*9999, I'd need to be careful as the digits will overflow an 8 bit signed integer. If I'm using 16 bit integers to accumulate the convolution products, then a multiply between a pair of 1000 digits integers can cause an overflow.
N = repmat(9,1,1000);
max(conv(N,N))
ans =
81000
So if you are worried about the possibility of millions of digits, you need to watch out.
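The same convolution-plus-carries multiply can be sketched outside MATLAB. The Python below is a plain O(n^2) illustration, with conv written by hand to mirror the MATLAB function of the same name and propagate_carries as a hypothetical helper for the carry step.

```python
def conv(a, b):
    """Plain O(n*m) convolution of two digit lists (most significant digit first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def propagate_carries(columns, base=10):
    """Turn raw convolution sums into proper digits in the given base."""
    digits, carry = [], 0
    for value in reversed(columns):
        carry, digit = divmod(value + carry, base)
        digits.append(digit)
    while carry:
        carry, digit = divmod(carry, base)
        digits.append(digit)
    return digits[::-1]

nines = [9] * 4
columns = conv(nines, nines)          # raw column sums for 9999 * 9999
digits = propagate_carries(columns)   # decimal digits of the product

# Largest column sum for two 1000-digit numbers made of nines.
peak_1000 = max(conv([9] * 1000, [9] * 1000))
```

The column sums are what must fit in the accumulator type: 324 already exceeds a signed 8-bit integer, and the 1000-digit peak exceeds a signed 16-bit one.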
One alternative is to use what I call migits, essentially working in a higher base than 10. Thus by using base 1000000 and doubles to store the elements, I can store 6 decimal digits per element. A convolution will still cause overflows for larger numbers though.
N = repmat(999999,1,10000);
log2(max(conv(N,N)))
ans =
53.151
Thus a convolution between two sets of base 1000000 migits that are 10000 migits in length (60000 decimal digits) will pass 2^53, the point beyond which a double can no longer represent every integer exactly.
So again, if you will use numbers with millions of digits, beware. A nice thing about the use of a higher base of migits with a convolution based multiply is since the conv operation is O(n^2), then going from base 10 to base 100 gives you a 4-1 speedup. Going to base 1000 yields a 9-1 speedup in the convolutions.
Finally, the use of a base other than 10 as migits makes it logical to implement guard digits (for floats.) In floating point arithmetic, you should never trust the least significant bits of a computation, so it makes sense to keep a few digits hidden in the shadows. So when I wrote my HPF tool, I gave the user control of how many digits would be carried along. This is not an issue for integers of course.
There are many other issues. I discuss them in the docs carried with those tools.

Is it possible to combine a number of float values into one float value and extract the values when needed?

I am working on an algorithm for an iPhone app where the data I need to keep in memory exceeds the limit. Is it possible to represent a number of float values as one float value and retrieve those values when I need them?
For instance:
float array[4];
array[0]=0.12324;
array[1]=0.56732;
array[2]=0.86555;
array[3]=0.34545;
float combinedvalue=?
Not in general, no. You can't store 4N bits of information in only N bits.
If there's some pattern in your numbers, then you might find a scheme. For example, if all your numbers are of similar value, you could potentially store only the differences between the numbers in lower precision.
However, this kind of thing is difficult, and limited.
If those numbers are exactly 5 digits each, you can treat them as ints by multiplying with 100000. Then you'll need 17 bits for each number, 68 bits in total, which (with some bit-shifting) takes up 9 bytes. Does that help, 9 bytes instead of 16?
Please note that the implementation of your algorithm will also take up memory!
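The 17-bit packing described above can be sketched as follows, using the four values from the question. It assumes every value lies in [0, 1) with exactly five decimal places; pack and unpack are hypothetical names, and Python is used here only to show the arithmetic.

```python
BITS = 17          # 2**17 = 131072 > 99999, so five decimal digits fit
SCALE = 100_000    # assumes exactly five decimal places per value

def pack(values) -> bytes:
    """Pack four scaled values into 68 bits, stored in 9 bytes."""
    acc = 0
    for v in values:
        acc = (acc << BITS) | round(v * SCALE)
    return acc.to_bytes(9, "big")

def unpack(data: bytes, count: int = 4):
    """Recover the values by shifting and masking."""
    acc = int.from_bytes(data, "big")
    mask = (1 << BITS) - 1
    ints = [(acc >> (BITS * (count - 1 - i))) & mask for i in range(count)]
    return [n / SCALE for n in ints]

packed = pack([0.12324, 0.56732, 0.86555, 0.34545])
restored = unpack(packed)
```

The result occupies 9 bytes instead of 16, at the cost of the shifting and masking needed on every access.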
What you are requiring could be accomplished in several different ways.
For instance, in C++ you generally have single precision floats (4 bytes) as the smallest precision available, though I wouldn't be surprised if there are other packages that handle smaller precision floating point values.
Therefore, if you are using double precision floating point values and can get by with less precision then you can switch to a smaller precision.
Now, depending on the range of values you want to store, you might be able to use a fixed-point representation as well, but you will need to be familiar with the nuances of bit shifting and masking, etc. Another added benefit of this approach is that it could make your program run faster, since fixed-point (integer) arithmetic can be much faster than floating-point arithmetic on hardware without a fast FPU.
The choice of options depends on your data you need to store and how comfortable you are with lower level binary arithmetic.

Floating point versus fixed point: what are the pros/cons?

Floating point type represents a number by storing its significant digits and its exponent separately on separate binary words so it fits in 16, 32, 64 or 128 bits.
Fixed point type stores numbers with 2 words, one representing the integer part, another representing the part past the radix, in negative exponents, 2^-1, 2^-2, 2^-3, etc.
Floats are better because they have a wider range in the exponent sense, but not if one wants to store numbers with more precision within a certain range, for example only using integer parts from -16 to 16 and thus using more bits to hold digits past the radix.
In terms of performances, which one has the best performance, or are there cases where some is faster than the other ?
In video game programming, does everybody use floating point because the FPU makes it faster, or because the performance drop is just negligible, or do they make their own fixed type ?
Why isn't there any fixed type in C/C++ ?
That definition covers a very limited subset of fixed point implementations.
It would be more correct to say that in fixed point only the mantissa is stored and the exponent is a constant determined a-priori. There is no requirement for the binary point to fall inside the mantissa, and definitely no requirement that it fall on a word boundary. For example, all of the following are "fixed point":
64 bit mantissa, scaled by 2^-32 (this fits the definition listed in the question)
64 bit mantissa, scaled by 2^-33 (now the integer and fractional parts cannot be separated by an octet boundary)
32 bit mantissa, scaled by 2^4 (now there is no fractional part)
32 bit mantissa, scaled by 2^-40 (now there is no integer part)
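The "mantissa plus constant exponent" view of the list above can be sketched in a few lines. fixed_to_real and real_to_fixed are hypothetical helpers built on math.ldexp, and the sample values are chosen only so that the conversions are exact.

```python
import math

def fixed_to_real(mantissa: int, exponent: int) -> float:
    """Interpret an integer mantissa with a constant power-of-two scale."""
    return math.ldexp(mantissa, exponent)

def real_to_fixed(value: float, exponent: int) -> int:
    """Nearest integer mantissa for the given scale."""
    return round(math.ldexp(value, -exponent))

# Scale 2^-32: integer and fractional parts split at bit 32.
q32_32 = real_to_fixed(3.25, -32)
# Scale 2^4: no fractional part, only multiples of 16 are representable.
no_fraction = real_to_fixed(48.0, 4)
# Scale 2^-40: no integer part, a 32-bit mantissa stays below 2^-8.
no_integer = real_to_fixed(2.0**-20, -40)
```

In each case the stored quantity is just an integer; the exponent lives in the code that interprets it, not in the data.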
GPUs tend to use fixed point with no integer part (typically 32-bit mantissa scaled by 2^-32). Therefore APIs such as OpenGL and Direct3D often use floating-point types which are capable of holding these values. However, manipulating the integer mantissa is often more efficient so these APIs allow specifying coordinates (in texture space, color space, etc) this way as well.
As for your claim that C++ doesn't have a fixed point type, I disagree. All integer types in C++ are fixed point types. The exponent is often assumed to be zero, but this isn't required and I have quite a bit of fixed-point DSP code implemented in C++ this way.
At the code level, fixed-point arithmetic is simply integer arithmetic with an implied denominator.
For many simple arithmetic operations, fixed-point and integer operations are essentially the same. However, there are some operations which the intermediate values must be represented with a higher number of bits and then rounded off. For example, to multiply two 16-bit fixed-point numbers, the result must be temporarily stored in 32-bit before renormalizing (or saturating) back to 16-bit fixed-point.
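The 16-bit multiply described above can be sketched as follows, assuming the common Q15 layout (15 fractional bits in a signed 16-bit word) with rounding to nearest and saturation. to_q15 and q15_mul are hypothetical names; in a fixed-width language the intermediate product would need a 32-bit type.

```python
Q = 15                                  # fractional bits (Q15 format)
INT16_MIN, INT16_MAX = -32768, 32767

def to_q15(x: float) -> int:
    """Convert a real number in [-1, 1) to Q15, saturating at the ends."""
    return max(INT16_MIN, min(INT16_MAX, round(x * (1 << Q))))

def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 numbers via a wider intermediate product."""
    wide = a * b                        # up to 32 bits in a fixed-width language
    wide += 1 << (Q - 1)                # round to nearest before renormalizing
    result = wide >> Q                  # back to Q15 scale
    return max(INT16_MIN, min(INT16_MAX, result))  # saturate to 16 bits

half = to_q15(0.5)
quarter = q15_mul(half, half)
# (-1) * (-1) = +1 is not representable in Q15, so the result saturates.
saturated = q15_mul(INT16_MIN, INT16_MIN)
```

Addition and subtraction need none of this: they are ordinary integer operations as long as both operands share the same scale.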
When the software does not take advantage of vectorization (such as CPU-based SIMD or GPGPU), integer and fixed-point arithmetic is often faster than FPU arithmetic. When vectorization is used, the efficiency of vectorization matters a lot more, such that the performance difference between fixed-point and floating-point is moot.
Some architectures provide hardware implementations for certain math functions, such as sin, cos, atan, sqrt, for floating-point types only. Some architectures do not provide any hardware implementation at all. In both cases, specialized math software libraries may provide those functions by using only integer or fixed-point arithmetic. Often, such libraries will provide multiple level of precisions, for example, answers which are only accurate up to N-bits of precision, which is less than the full precision of the representation. The limited-precision versions may be faster than the highest-precision version.
Fixed point is widely used in DSP and embedded-systems where often the target processor has no FPU, and fixed point can be implemented reasonably efficiently using an integer ALU.
In terms of performance, that is likely to vary depending on the target architecture and application. Obviously if there is no FPU, then fixed point will be considerably faster. When you have an FPU it will depend on the application too. For example, performing some functions such as sqrt() or log() will be much faster when directly supported in the instruction set rather than implemented algorithmically.
There is no built-in fixed point type in C or C++, I imagine because they (or at least C) were envisaged as systems level languages and the need for fixed point is somewhat domain specific, and also perhaps because on a general purpose processor there is typically no direct hardware support for fixed point.
In C++, defining a fixed-point data type class with suitable operator overloads and associated math functions can easily overcome this shortcoming. However, there are good and bad solutions to this problem. A good example can be found here: http://www.drdobbs.com/cpp/207000448. The link to the code in that article is broken, but I tracked it down to ftp://66.77.27.238/sourcecode/ddj/2008/0804.zip
You need to be careful when discussing "precision" in this context.
For the same number of bits in representation the maximum fixed point value has more significant bits than any floating point value (because the floating point format has to give some bits away to the exponent), but the minimum fixed point value has fewer than any non-denormalized floating point value (because the fixed point value wastes most of its mantissa in leading zeros).
Also depending on the way you divide the fixed point number up, the floating point value may be able to represent smaller numbers meaning that it has a more precise representation of "tiny but non-zero".
And so on.
The difference between floating point and integer math depends on the CPU you have in mind. On Intel chips the difference is not big in clock ticks. Int math is still faster because there are multiple integer ALUs that can work in parallel. Compilers are also smart enough to use special address calculation instructions to perform an add and a multiply in a single instruction. Conversion counts as an operation too, so just choose your type and stick with it.
In C++ you can build your own type for fixed point math. You just define a struct with one int and overload the appropriate operators, and make them do what they normally do plus a shift to put the radix point back in the right position.
You don't use float in games because it is faster or slower; you use it because it is easier to implement the algorithms in floating point than in fixed point. You are assuming the reason has to do with computing speed and that is not the reason; it has to do with ease of programming.
For example, you may define the width of the screen/viewport as going from 0.0 to 1.0, the height of the screen 0.0 to 1.0, the depth of the world 0.0 to 1.0, and so on. Matrix math, etc. makes things really easy to implement. Do all of the math that way up to the point where you need to compute real pixels on a real screen size, say 800x400. Project the ray from the eye to the point on the object in the world and compute where it pierces the screen, using 0 to 1 math, then multiply x by 800 and y by 400 and place that pixel.
Floating point does not store the exponent and mantissa in separate words, and the mantissa is a goofy size: what is left over after the exponent and sign, like 23 bits, not 16 or 32 or 64 bits.
Floating point math at its core uses fixed point logic with extra logic and extra steps required. By definition, compared apples to apples, fixed point math is cheaper because you don't have to manipulate the data on the way into the ALU and don't have to manipulate the data on the way out (normalize). When you add in IEEE and all of its garbage, that adds even more logic, more clock cycles, etc. (properly signed infinity, quiet and signaling NaNs, different results for the same operation if there is an exception handler enabled). As someone pointed out in a comment, in a real system where you can do fixed and float in parallel, you can take advantage of some or all of the processors and recover some clocks that way. With both float and fixed, the clock rate can be increased by using vast quantities of chip real estate; fixed will remain cheaper, but float can approach fixed speeds using these kinds of tricks as well as parallel operation.
One issue not covered in the answers is power consumption. Though it depends heavily on the specific hardware architecture, the FPU usually consumes much more energy than the ALU in a CPU, so if you target mobile applications where power consumption is important, it's worth considering a fixed point implementation of the algorithm.
It depends on what you're working on. If you're using fixed point then you lose precision; you have to select the number of places after the decimal place (which may not always be good enough). In floating point you don't need to worry about this as the precision offered is nearly always good enough for the task in hand - uses a standard form implementation to represent the number.
The pros and cons come down to speed and resources. On modern 32-bit and 64-bit platforms there is really no need to use fixed point. Most systems come with built-in FPUs that are hardwired to be optimised for floating point operations. Furthermore, most modern CPUs come with intrinsics for instruction sets such as SIMD, which help optimise vector based methods via vectorisation and unrolling. So fixed point only comes with a down side.
On embedded systems and small microcontrollers (8bit and 16bit) you may not have an FPU nor extended instruction sets. In which case you may be forced to use fixed point methods or the limited floating point instruction sets that are not very fast. So in these circumstances fixed point will be a better - or even your only - choice.