How would one predict execution time and/or the resulting compression ratio when compressing a file with a certain lossless compression algorithm? I am especially concerned with local compression, since if you know the time and compression ratio for local compression, you can easily calculate the time for network compression based on the currently available network throughput.
Let's say you have some information about the file, such as size, redundancy, and type (we can say text to keep it simple). Maybe we have some statistical data from actual prior measurements. What else would be needed to perform a prediction of execution time and/or compression ratio (even a very rough one)?
For just local compression, the size of the file would have an effect, since reading and writing data to/from the storage media (SD card, hard drive) would take up the dominant portion of the total execution time.
The actual compression portion will probably depend on redundancy/type, since most compression algorithms work by compressing small blocks of data (100 KB or so). For example, larger HTML/JavaScript files compress better since they have higher redundancy.
I guess there is also the problem of scheduling, but this could probably be ignored for a rough estimation.
This is a question that has been on my mind for quite some time. I have been wondering whether some low-overhead code (say, on the server) could predict how long it would take to compress a file before performing the actual compression.
Sample the file by taking 10-100 small pieces from random locations. Compress them individually. This should give you a lower bound on compression ratio.
This only returns meaningful results if the chunks are not too small. The compression algorithm must be able to make use of a certain size of history to predict the next bytes.
It depends on the data, but with images you can take small samples. Downsampling would change the result. Here is an example: PHP - Compress Image to Meet File Size Limit.
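A minimal sketch of that sampling idea, assuming zlib (DEFLATE) as the compressor; the chunk size, sample count, and the crude time extrapolation are my own guesses that you would tune against real measurements:

import os
import random
import time
import zlib

def estimate_compression(path, samples=20, chunk_size=64 * 1024):
    # Compress a handful of random chunks and extrapolate ratio and time.
    file_size = os.path.getsize(path)
    raw_total = 0
    compressed_total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(samples):
            f.seek(random.randint(0, max(0, file_size - chunk_size)))
            chunk = f.read(chunk_size)
            if not chunk:
                continue
            raw_total += len(chunk)
            compressed_total += len(zlib.compress(chunk, 6))
    elapsed = time.perf_counter() - start
    ratio = raw_total / compressed_total if compressed_total else 1.0
    # Scale the sample time up to the whole file for a very rough duration
    # estimate (this ignores I/O caching, scheduling, and per-file overhead).
    estimated_time = elapsed * (file_size / raw_total) if raw_total else 0.0
    return ratio, estimated_time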
The compression ratio can be calculated with these formulas:
Compression Ratio = Uncompressed Size / Compressed Size
Space Savings = 1 - (Compressed Size / Uncompressed Size)
And performance benchmarking can be done using V8 or SunSpider.
You can also use algorithms like DEFLATE or LZMA as the actual compression mechanism. PPM (Prediction by Partial Matching) can be used for prediction.
What would be the expected performance gain of using (u)int16 over float in an OpenCL kernel? If any?
I expect the memory transfer to be roughly halved, but what about the device load?
Strangely, I can hardly find any benchmarks or documentation on the subject. (Or maybe my Google-fu is just failing me...)
I'm working on image processing (filtering, mostly). The precision is not that critical, since the result of several kernel operations is cast to a char. We narrowed down several operations where using shorter data types is acceptable. So I was wondering whether those operations can be sped up by using shorter data where the precision is not critical.
Thanks for your help.
GPUs tend to do floating-point operations better than integer ones. For example, some have extra pipelines for floating-point ops, and making everything integer just reduces the GPU's throughput. Data copy may not be your bottleneck, and halving the amount by using 16-bit integers may not help. Moreover, on integrated GPUs like Intel's or AMD's you can get zero-copy behavior, so the effect of image or buffer size is minimal (to a point).
Also, you might look into 16-bit floating-point (half) support. That gets you the best of both worlds: half the data, with floating-point behavior.
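This is not OpenCL code, just a hedged host-side illustration (NumPy assumed) of the memory-transfer half of the question: halving the element size halves the buffer you have to move to the device. Whether the kernel itself runs faster is the device-dependent part discussed above.

import numpy as np

pixels = np.random.rand(1920, 1080).astype(np.float32)   # a made-up image buffer
as_half = pixels.astype(np.float16)                       # cl_khr_fp16-style half floats
as_u16 = (pixels * 65535).astype(np.uint16)               # 16-bit integer quantization

print(pixels.nbytes)   # ~8.3 MB to transfer as float32
print(as_half.nbytes)  # ~4.1 MB as half precision
print(as_u16.nbytes)   # ~4.1 MB as uint16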
I am making a rigid body physics engine from scratch (for educational purposes), and I'm wondering if I should choose single or double precision floats for it.
I will be using OpenGL to visualize it and the glm library to calculate things internally in the engine as well as for the visualization. The convention seems to be to use floats for OpenGL pretty much everywhere, and glm::vec3 and glm::vec4 seem to use float internally. I also noticed that there are glm::dvec3 and glm::dvec4, though nobody seems to be using them. How do I decide which one to use? double seems to make sense, as it has more precision and pretty much the same performance on today's hardware (as far as I know), but everything else seems to use float, except for some of GLU's functions and some of GLFW's.
This is all going to depend on your application. You pretty much already understand the tradeoffs between the two:
Single-precision
Less accurate
Faster computations, even on today's hardware. Takes up less memory, and operations are faster. Gets more out of cache optimizations, etc.
Double-precision
More accurate
Slower computations.
Typically in graphics applications the precision of floats is plenty, given the number of pixels on the screen and the scaling of the scene. In scientific settings or smaller-scale simulations you may need the extra precision. It may also depend on your hardware. For instance, I coded a physically based simulation of rigid bodies on a netbook, and switching to float gained on average 10-15 FPS, which almost doubled the frame rate at that point in my implementation.
My recommendation is that if this is an educational activity, use floats and target the graphics application. If you find through your studies, timing, and personal experience that you need double precision, then head in that direction.
Surely the general rule is correctness first and performance second? That means using doubles unless you can convince yourself that you'll get the fidelity you require using floats.
The thing to look at is the effective size of one bit of the coordinate system relative to the smallest size you intend to model.
For example, if you use Earth coordinates, 100 degrees works out to around 1E7 metres.
An IEEE 754 float has only a 23-bit mantissa, so that gives a relative precision of only about 1E-7.
Hence the coordinate is only accurate to around 1 metre. This may or may not be sufficient for the problem.
I have learnt from experience to always use doubles for the physics and physical-modelling calculations, but concede that this cannot be a universal requirement.
It does not, of course, follow that the rendering should use double; you may well want that as a float.
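To see that ~1 metre figure for yourself, here is a quick check (NumPy assumed): the spacing between adjacent representable values near 1E7 is about 1 for float32 and about 2E-9 for float64.

import numpy as np

coordinate = 1.0e7  # roughly 100 degrees of longitude expressed in metres
print(np.spacing(np.float32(coordinate)))  # about 1.0 metre per float32 step
print(np.spacing(np.float64(coordinate)))  # about 1.9e-9 metres per float64 step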
I used typedef in a common header and went with float as my default.
typedef float real_t;
I would not recommend using templates for this, because it causes huge design problems once you try to use polymorphic/virtual functions.
Why floats do work
Floats worked fine for me, for three reasons:
First, almost every physical simulation involves adding some noise to the forces and torques to be realistic. This random noise is usually far larger in magnitude than the precision of floats.
Second, having limited precision is actually beneficial in many instances. Consider that almost none of the classical mechanics of rigid bodies applies exactly in the real world, because there is no such thing as a perfect rigid body. So when you apply a force to a less-than-perfect rigid body, you don't get an acceleration that is accurate to the 7th digit anyway.
Third, many simulations run for a short duration, so the accumulated errors remain small enough. Using double precision doesn't change this automatically. Creating long-running simulations that match the real world is extremely difficult and would be a very specialized project.
When floats don't work
Here are the situations where I had to consider using double.
Latitude and longitude should be double. Floats simply don't have good enough resolution for these quantities for most purposes.
Computing the integral of very small quantities over time. For example, a Gauss-Markov process is a good way to represent random walk in sensor bias. However, the values will typically be very small and accumulate over time. Errors in the calculation can be much bigger with floats than with doubles (see the sketch after this list).
Specialized simulations that go beyond the usual classical mechanics of linear and rotational motion of rigid bodies. For example, if you do things with protein molecules, crystal growth, micro-gravity physics, etc., then you probably want to use double.
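A minimal sketch of that accumulation problem, with made-up numbers rather than anything from a real sensor (NumPy assumed): once the running total is large relative to the increment, float32 starts dropping the increments entirely, while float64 keeps tracking them.

import numpy as np

bias32 = np.float32(100.0)
bias64 = np.float64(100.0)
tiny_step = 1e-6              # e.g. integrating a very small drift per sample

for _ in range(100_000):
    bias32 += np.float32(tiny_step)
    bias64 += tiny_step

print(bias32)  # still 100.0: each step is below float32 resolution at this magnitude
print(bias64)  # about 100.1, as expected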
When doubles don't work
There are actually times when the higher precision of double hurts, although it's rare. An example from What Every Computer Scientist Should Know About Floating-Point Arithmetic: suppose you have some quantity that converges to 1 over time, you take its log, and you do something when the result is 0. When using double, you might never get to 1 because the rounding might never happen, but with floats it might.
Another example: you need to use special code to compare real values. That code often has a default comparison epsilon, which for float is a fairly reasonable 1E-6 but for double is 1E-15. If you are not careful, this can give you a lot of surprises.
Performance
Here's another surprise: on modern x86 hardware there is little difference between the raw performance of float and double. Memory alignment, caching, etc. overwhelmingly dominate over the choice of floating-point type. On my machine, a simple summation test of 100M random numbers took 22 seconds with floats and 25 seconds with doubles. So floats are indeed about 12% faster, but I still think that is too little to abandon double just for performance. However, if you use SSE instructions, GPUs, or embedded/mobile hardware like an Arduino, then floats will be much faster, and that can most certainly be the driving factor.
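If you want to run a similar comparison on your own machine, here is a hedged sketch using NumPy (a vectorized sum, so not identical to the test above, and it needs roughly 1.2 GB of RAM; your numbers will differ from the 22 s / 25 s figures, which are the author's):

import time
import numpy as np

rng = np.random.default_rng()
data32 = rng.random(100_000_000, dtype=np.float32)  # ~400 MB
data64 = data32.astype(np.float64)                  # ~800 MB

start = time.perf_counter()
data32.sum()
print("float32:", time.perf_counter() - start)

start = time.perf_counter()
data64.sum()
print("float64:", time.perf_counter() - start)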
A physics engine that does nothing but the linear and rotational motion of rigid bodies can run at 2000 Hz on today's desktop-grade hardware on a single thread, and you can trivially parallelize it across many cores. A lot of simple low-end simulations require just 50 Hz. At 100 Hz things start to get pretty smooth. If you have things like PID controllers, you might have to go up to 500 Hz. But even at that worst-case rate, you can still simulate plenty of objects on a good enough desktop.
In summary, don't let performance be your driving factor unless you actually measure it.
What to do
A rule of thumb is to use as much precision as you need to get your code to work. For a simple rigid-body physics engine, floats are often good enough. However, you want to be able to change your mind without revamping your code, so the best approach is to use a typedef as mentioned at the start and make sure your code works for float as well as double. Then measure often and choose the type as your project evolves.
Another vital thing in your case: keep the physics engine religiously separated from the rendering system. The output from the physics engine can be either double or float and should be cast to whatever the rendering system needs.
Here's the short answer.
Q. Why does OpenGL use float rather than double?
A. Because most of the time you don't need the precision and doubles
are twice the size.
Another thing to consider is that you shouldn't use doubles everywhere, just as some things may require using a double as opposed to a float. For example, if you are drawing a circle by drawing squares while looping through the angles, only so many squares can be shown on the screen. They will overlap, and in this case doubles would be pointless. However, if you're doing arbitrary floating-point arithmetic, you may need the extra precision if you're trying to accurately represent the Mandelbrot set (although that totally depends on your algorithm).
Either way, in the end, you will usually need to cast back to float if you intend to use those values in drawing.
Single-precision operations are faster, and the data uses less memory and less network bandwidth. So you only use double if you gain something in exchange for slower operations and the extra memory and bandwidth required. There are certainly applications of rigid-body physics where the extra precision would be worth it, such as manipulating lat/lon, where single precision only gives you metre accuracy, but is this your case?
Since it's for educational purposes, maybe you want to educate yourself in the use of high-precision physics algorithms where the extra accuracy matters. But a lot of rigid-body physics involves processes that can only be approximately quantified, such as friction between two solids or the collision response after detection; for those, the extra precision won't matter, you'll just get more precise approximate behavior :)
I'm seeking advice on how to help compression tools achieve better lossless compression.
I have many large files (>100 MB) containing sensor readings from a variety of sensors. The samples from the various sensors are of different bit sizes (16-bit, 24-bit, 32-bit) and different frequencies (70 Hz to 250 Hz). With the common compressors I'm aware of (zip, gzip, bzip2) I can get a compressed file about 70% of the original file size. It seems to me that if I could tell the compression tool that these bytes are one type of sample and those bytes are another, there might be compression gains to be had, but I'm not aware of anything that would let me do this.
Step 0 would be to code the data in binary. (16 bits in two bytes, 24 bits in three bytes, etc.) I hope that you're already doing that.
Step 1 would be to use differences. From your description, I bet that successive values don't change much. Therefore differences will be small and have many leading zero bits. Try that, and then a general-purpose compressor.
Step 2 would be to use variable-length integer coding. The high bit of each byte determines the span of each integer. The first byte of an integer always has a high bit of zero. All subsequent bytes of the same integer have a high bit of one. Build the integer out of the low seven bits of each byte. (I take the first byte to have the least significant bits, but you could do it most-significant bit order as well.) This will code your small differences in one byte. Also this coding will handle any number of bits in the samples, which is convenient in your application. Try this, and then a general-purpose compressor.
Step 3 might be more detailed analysis of the waveforms for a better predictor. Step 1 simply uses the last value as the predictor. You could have a more complex function of the previous n values as the predictor for the next value. Whether this would help is highly dependent on your data.
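Here is a rough sketch of Steps 1 and 2, in Python for readability. The byte layout follows the description above (the first byte of an integer has its high bit clear, continuation bytes have it set, seven payload bits per byte, least significant bits first); the zig-zag mapping for negative differences and the made-up sample data are my own additions, and a real encoder would still need to unpack the 16/24/32-bit record framing of your files first.

def encode(samples):
    out = bytearray()
    previous = 0
    for value in samples:
        delta = value - previous
        previous = value
        # Zig-zag map signed deltas to unsigned so small magnitudes stay small
        # (0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ...).
        n = (delta << 1) ^ (delta >> 63)
        out.append(n & 0x7F)                    # first byte: high bit clear
        n >>= 7
        while n:
            out.append(0x80 | (n & 0x7F))       # continuation bytes: high bit set
            n >>= 7
    return bytes(out)

def decode(data):
    samples = []
    previous = 0
    i = 0
    while i < len(data):
        n = data[i] & 0x7F
        shift = 7
        i += 1
        while i < len(data) and data[i] & 0x80:
            n |= (data[i] & 0x7F) << shift
            shift += 7
            i += 1
        previous += (n >> 1) ^ -(n & 1)         # undo the zig-zag mapping
        samples.append(previous)
    return samples

# Slowly varying made-up "sensor" samples: most deltas fit in a single byte.
samples = [100000 + (i % 50) for i in range(10000)]
packed = encode(samples)
assert decode(packed) == samples
print(len(samples) * 4, len(packed))            # raw 4-byte ints vs. packed size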
I was thinking about compression, and it seems like there would have to be some sort of limit to the compression that could be applied to a file, otherwise everything would end up as a single byte.
So my question is, how many times can I compress a file before:
It does not get any smaller?
The file becomes corrupt?
Are these two points the same or different?
Where does the point of diminishing returns appear?
How can these points be found?
I'm not talking about any specific algorithm or particular file, just in general.
For lossless compression, the only way you can know how many times you can gain by recompressing a file is by trying. It's going to depend on the compression algorithm and the file you're compressing.
Two files can never compress to the same output, so you can't go down to one byte. How could one byte represent all the files you could decompress to?
The reason that the second compression sometimes works is that a compression algorithm can't do omniscient perfect compression. There's a trade-off between the work it has to do and the time it takes to do it. Your file is being changed from all data to a combination of data about your data and the data itself.
Example
Take run-length encoding (probably the simplest useful compression) as an example.
04 04 04 04 43 43 43 43 51 52  10 bytes
That series of bytes could be compressed as:
[4] 04 [4] 43 [-2] 51 52 7 bytes (I'm putting meta data in brackets)
Where the positive number in brackets is a repeat count and the negative number in brackets is a command to emit the next -n characters as they are found.
In this case we could try one more compression:
[3] 04 [-4] 43 fe 51 52 7 bytes (fe is your -2 seen as two's complement data)
We gained nothing, and we'll start growing on the next iteration:
[-7] 03 04 fc 43 fe 51 52 8 bytes
We'll grow by one byte per iteration for a while, but it will actually get worse. One byte can only hold negative numbers to -128. We'll start growing by two bytes when the file surpasses 128 bytes in length. The growth will get still worse as the file gets bigger.
There's a headwind blowing against the compression program--the meta data. And also, for real compressors, the header tacked on to the beginning of the file. That means that eventually the file will start growing with each additional compression.
RLE is a starting point. If you want to learn more, look at LZ77 (which looks back into the file to find patterns) and LZ78 (which builds a dictionary). Compressors like zip often try multiple algorithms and use the best one.
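A hedged sketch of that scheme in Python (PackBits-style: a positive count byte means "repeat the next byte", a negative one means "copy the next -n bytes literally"; the 127-item caps are my own choice). Feeding its output back into itself reproduces the shrink-then-grow behaviour worked out above.

def rle(data):
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 127:
            run += 1
        if run > 1:
            out += bytes([run, data[i]])            # [count] value
            i += run
        else:
            start = i                               # literal run: no repeats ahead
            while (i < len(data) and i - start < 127
                   and (i + 1 >= len(data) or data[i + 1] != data[i])):
                i += 1
            out += bytes([(256 - (i - start)) & 0xFF]) + data[start:i]
    return bytes(out)

payload = bytes([0x04] * 4 + [0x43] * 4 + [0x51, 0x52])
for n in range(5):
    print(f"pass {n}: {len(payload)} bytes")        # 10, 7, 7, 8, 9 ...
    payload = rle(payload)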
Here are some cases I can think of where multiple compression has worked.
I worked at an Amiga magazine that shipped with a disk. Naturally, we packed the disk to the gills. One of the tools we used let you pack an executable so that when it was run, it decompressed and ran itself. Because the decompression algorithm had to be in every executable, it had to be small and simple. We often got extra gains by compressing twice. The decompression was done in RAM. Since reading a floppy was slow, we often got a speed increase as well!
Microsoft supported RLE compression on bmp files. Also, many word processors did RLE encoding. RLE files are almost always significantly compressible by a better compressor.
A lot of the games I worked on used a small, fast LZ77 decompressor. If you compress a large rectangle of pixels (especially if it has a lot of background color, or if it's an animation), you can very often compress twice with good results. (The reason? You only have so many bits to specify the lookback distance and the length, So a single large repeated pattern is encoded in several pieces, and those pieces are highly compressible.)
Generally the limit is one compression. Some algorithms result in a higher compression ratio than others, and using a poor algorithm followed by a good algorithm will often result in improvements. But using the good algorithm in the first place is the proper thing to do.
There is a theoretical limit to how much a given set of data can be compressed. To learn more about this you will have to study information theory.
In general for most algorithms, compressing more than once isn't useful. There's a special case though.
If you have a large number of duplicate files, the zip format will zip each one independently, and you can then zip the first zip file to remove the duplicated zip information. Specifically, for 7 identical Excel files of 108 KB, zipping them with 7-Zip results in a 120 KB archive. Zipping again results in an 18 KB archive. Going past that, you get diminishing returns.
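A hedged sketch of that effect using Python's zipfile module and made-up identical payloads instead of the Excel files from the example; it relies on the duplicated compressed members being small enough to fall inside DEFLATE's 32 KB window on the second pass.

import io
import os
import zipfile

payload = os.urandom(10_000)                     # incompressible on its own

first = io.BytesIO()
with zipfile.ZipFile(first, "w", zipfile.ZIP_DEFLATED) as zf:
    for i in range(7):
        zf.writestr(f"copy_{i}.bin", payload)    # 7 identical members

second = io.BytesIO()
with zipfile.ZipFile(second, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("first.zip", first.getvalue())   # zip the zip

print(len(first.getvalue()))    # roughly 7x the payload size
print(len(second.getvalue()))   # much smaller: the identical members deflate against each other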
Suppose we have a file N bits long, and we want to compress it losslessly, so that we can recover the original file. There are 2^N possible files N bits long, and so our compression algorithm has to change one of these files to one of 2^N possible others. However, we can't express 2^N different files in less than N bits.
Therefore, if we can take some files and shorten them by compression, we have to have some files that lengthen under compression, to balance out the ones that shorten.
This means that a compression algorithm can only compress certain files, and it actually has to lengthen some. This means that, on the average, compressing a random file can't shorten it, but might lengthen it.
Practical compression algorithms work because we don't usually use random files. Most of the files we use have some sort of structure or other properties, whether they're text or program executables or meaningful images. By using a good compression algorithm, we can dramatically shorten files of the types we normally use.
However, the compressed file is not one of those types. If the compression algorithm is good, most of the structure and redundancy have been squeezed out, and what's left looks pretty much like randomness.
No compression algorithm, as we've seen, can effectively compress a random file, and that applies to a random-looking file also. Therefore, trying to re-compress a compressed file won't shorten it significantly, and might well lengthen it some.
So, the normal number of times a compression algorithm can be profitably run is one.
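A small check of this argument, assuming zlib as the compressor: random bytes do not shrink, and a second pass over already-compressed data typically grows it slightly.

import os
import zlib

random_data = os.urandom(100_000)
print(len(zlib.compress(random_data)))      # slightly larger than 100000

text = b"the quick brown fox jumps over the lazy dog " * 2000
once = zlib.compress(text)
twice = zlib.compress(once)
print(len(text), len(once), len(twice))     # the second pass usually grows a little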
Corruption only happens when we're talking about lossy compression. For example, you can't necessarily recover an image precisely from a JPEG file. This means that a JPEG compressor can reliably shorten an image file, but only at the cost of not being able to recover it exactly. We're often willing to do this for images, but not for text, and particularly not executable files.
In this case, there is no stage at which corruption begins. It starts when you begin to compress it, and gets worse as you compress it more. That's why good image-processing programs let you specify how much compression you want when you make a JPEG: so you can balance quality of image against file size. You find the stopping point by considering the cost of file size (which is more important for net connections than storage, in general) versus the cost of reduced quality. There's no obvious right answer.
Usually compressing once is good enough if the algorithm is good.
In fact, compressing multiple times could lead to an increase in size.
Your two points are different:
Compression done repeatedly with no improvement in size reduction is an expected theoretical condition.
Repeated compression causing corruption is likely to be an error in the implementation (or maybe in the algorithm itself).
Now let's look at some exceptions or variations:
Encryption may be applied repeatedly without any reduction in size (in fact, at times an increase in size) for the purpose of increased security.
Image, video or audio files that are increasingly (lossily) compressed will lose data (effectively be "corrupted", in a sense).
You can compress a file as many times as you like, but for most compression algorithms the additional compression from the second time on will be negligible.
Compression (I'm thinking lossless) basically means expressing something more concisely. For example,
111111111111111
could be more concisely expressed as
15 X '1'
This is called run-length encoding. Another method that a computer can use is to find a pattern that is regularly repeated in a file.
There is clearly a limit to how much these techniques can be used; for example, run-length encoding is not going to be effective on
15 X '1'
since there are no repeating patterns. Similarly, if a pattern-replacement method converts long patterns to 3-character ones, reapplying it will have little effect, because the only remaining repeating patterns will be 3 characters long or shorter. Generally, applying compression to an already compressed file makes it slightly bigger, because of various overheads. Applying good compression to a poorly compressed file is usually less effective than applying just the good compression in the first place.
How many times can I compress a file before it does not get any smaller?
In general, not even once. Whatever compression algorithm you use, there must always exist a file that does not get smaller at all, otherwise, by your own argument, you could compress repeatedly until you reached 1 byte.
How many times can I compress a file before it becomes corrupt?
If the program you use to compress the file does its job, the file will never become corrupt (of course, I am thinking of lossless compression).
You can compress an unlimited number of times. However, the second and further compressions will usually only produce a file larger than the previous one, so there is no point in compressing more than once.
It is a very good question. You can view the file from different points of view. Maybe you know a priori that this file contains an arithmetic series.
Let's view it as a data stream of "bytes", "symbols", or "samples".
Some answers can be given to you by information theory and mathematical statistics.
Please check the monographs of these researchers for a full, deep understanding:
A. Kolmogorov
S. Kullback
C. Shannon
N. Wiener
One of the main concepts in information theory is entropy.
If you have a stream of bytes, the entropy of those bytes doesn't depend on the values of your "bytes" or "samples".
It is defined only by the frequencies with which the bytes take different values.
Maximum entropy occurs for a completely random data stream.
Minimum entropy, which is equal to zero, occurs when your "bytes" all have an identical value.
It does not get any smaller?
So the entropy is the minimum number of bits per "byte" that you need to use when writing the information to disk. Of course, that only holds if you use an ideal algorithm; real-life lossless compression heuristics fall short of it.
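A small sketch of how you might compute this (order-0) entropy in Python: it looks only at byte frequencies, exactly as described above, and ignores any correlation between neighbouring bytes.

import math
from collections import Counter

def entropy_bits_per_byte(data):
    # Shannon entropy of the byte-value distribution, in bits per byte.
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy_bits_per_byte(b"aaaaaaaa"))            # 0.0 (all bytes identical)
print(entropy_bits_per_byte(bytes(range(256)) * 4))  # 8.0 (uniformly spread)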
The file becomes corrupt?
I don't understand the sense of the question. You could write no bits to the disk, and you would have written a corrupted file of size 0 bits. Of course it is corrupted, but its size is zero bits.
Here is the ultimate compression algorithm (in Python) which by repeated use will compress any string of digits down to size 0 (it's left as an exercise to the reader how to apply this to a string of bytes).
def compress(digitString):
    if digitString == "":
        raise ValueError("already as small as possible")
    currentLen = len(digitString)
    if digitString == "0" * currentLen:
        return "9" * (currentLen - 1)
    n = str(int(digitString) - 1)             # convert to number and decrement
    newLen = len(n)
    return ("0" * (currentLen - newLen)) + n  # pad with zeros to keep the same length

# test it
x = "12"
while not x == "":
    print(x)
    x = compress(x)
The program outputs 12 11 10 09 08 07 06 05 04 03 02 01 00 9 8 7 6 5 4 3 2 1 0 and then the empty string. It doesn't compress the string on every pass, but with enough passes it will compress any digit string down to a zero-length string. Make sure you write down how many times you sent it through the compressor, otherwise you won't be able to get it back.
I would like to suggest that the limit of compression hasn't really been pushed to its fullest. Since every pixel or written character is just a black-or-white outline, one could write a program that restores, say, a book flawlessly, while compressing the pixel patterns and words into a better system of compression. It would probably take much longer to compress, but as files grow to gigabytes or terabytes, the repeated letters P, R and q, and the black-and-white deviations, could be compressed much further by a complex automated formula. The machine doesn't need the data to make sense; it can just make a game of producing a highly compressed pattern. That in turn would allow us humans to create a customized compression-reading engine, meaning real compression power: an entire engine that restores the information on the user side. The engine would have its own optimal language, no spaces, just filling black-and-white pixel boxes of the smallest possible size, or even writing its own pattern-based language. For the most optimal performance, it could give out a unique cipher or decompression formula when it is done, so the file is optimally compressed and has a password that is unique for the engine to decompress it later. The machine could do an almost limitless number of iterations to compress the file further. It would be like taking an open book and putting all the written stories of humanity onto one A4 sheet. I don't know, but it is another theory. What happens in practice is that the volume just gets split, because the formula to decompress has its own size; even the naming of the folder and the icon information have a size, so one could go further and put every form of data into a single string of information. Hmm...
It all depends on the algorithm. In other words, the question becomes how many times a file can be compressed using this algorithm first, then that one next...
Here is an example of a more advanced compression technique using "a double table, or cross matrix". It also eliminates extraneous, unnecessary symbols from the algorithm.
[PREVIOUS EXAMPLE]
Take run-length encoding (probably the simplest useful compression) as an example.
04 04 04 04 43 43 43 43 51 52  10 bytes
That series of bytes could be compressed as:
[4] 04 [4] 43 [-2] 51 52 7 bytes (I'm putting meta data in brackets)
[TURNS INTO]
04.43.51.52 VALUES
4.4.**-2 COMPRESSION
Further Compression Using Additional Symbols as substitute values
04.A.B.C VALUES
4.4.**-2 COMPRESSION
In theory, we will never know; it is a never-ending thing:
In computer science and mathematics, the term full employment theorem
has been used to refer to a theorem showing that no algorithm can
optimally perform a particular task done by some class of
professionals. The name arises because such a theorem ensures that
there is endless scope to keep discovering new techniques to improve
the way at least some specific task is done. For example, the full
employment theorem for compiler writers states that there is no such
thing as a provably perfect size-optimizing compiler, as such a proof
for the compiler would have to detect non-terminating computations and
reduce them to a one-instruction infinite loop. Thus, the existence of
a provably perfect size-optimizing compiler would imply a solution to
the halting problem, which cannot exist, making the proof itself an
undecidable problem.
(source)