Read sequential file - Compressed file vs Uncompressed

Read sequential file - Compressed file vs Uncompressed - c++

I am looking for the fastest way to read a sequential file from disk.
I read in some posts that if I compressed the file using, for example, lz4, I could achieve better performance than read the flat file, because I will minimize the i/o operations.
But when I try this approach, scanning a lz4 compressed file gives me a poor performance than scanning the flat file. I didn't try the lz4demo above, but looking for it, my code is very similar.
I have found this benchmarks:
http://skipperkongen.dk/2012/02/28/uncompressed-versus-compressed-read/
http://code.google.com/p/lz4/source/browse/trunk/lz4demo.c?r=75
Is it really possible to improve performance reading a compressed sequential file over an uncompressed one? What am I doing wrong?

Yes, it is possible to improve disk read by using compression.
This effect is most likely to happen if you use a multi-threaded reader : while one thread reads compressed data from disk, the other one decode the previous compressed block within memory.
Considering the speed of LZ4, the decoding operation is likely to finish before the other thread complete reading the next block. This way, you'll achieved a bandwidth improvement, proportional to the compression ratio of the tested file.
Obviously, there are other effects to consider when benchmarking. For example, seek times of HDD are several order of magnitude larger than SSD, and under bad circumstances, it can become the dominant part of the timing, reducing any bandwidth advantage to zero.

It depends on the speed of the disk vs. the speed and space savings of decompression. I'm sure you can put this into a formula.
Is it really possible to improve performance reading a compresses
sequential file over an uncompressed one? What am i doing wrong?
Yes, it is possible (example: a 1kb zip file could contain 1GB of data - it would most likely be faster to read and decompress the ZIP).
Benchmark different algorithms and their decompression speeds. There are compression benchmark websites for that. There are also special-purpose high-speed compression algorithms.
You could also try to change the data format itself. Maybe switch to protobuf which might be faster and smaller than CSV.

Related

Prediciting time or compression ratio for lossless compress of a file?

How would one be able to predict execution time and/or resulting compression ratio when compressing a file using a certain lossless compression algorithm? I am especially more concerned with local compression, since if you know time and compression ratio for local compression, you can easily calculate time for network compression based on currently available network throughput.
Let's say you have some information about file such as size, redundancy, type (we can say text to keep it simple). Maybe we have some statistical data from actual prior measurements. What else would be needed to perform prediction for execution time and/or compression ratio (even if a very rough one).
For just local compression, the size of the file would have effect since actual reading and writing data to/from storage media (sdcard, hard drive) would take more dominant portion of total execution.
The actual compression portion, will probably depend on redundancy/type, since most compression algorithms work by compressing small blocks of data (100kb or so). For example, larger HTML/Javascripts files compress better since they have higher redundancy.
I guess there is also a problem of scheduling, but this could probably be ignored for rough estimation.
This is a question that been in my head for quiet sometimes. I been wondering if some low overhead code (say on the server) can predict how long it would take to compress a file before performing actual compression?

Sample the file by taking 10-100 small pieces from random locations. Compress them individually. This should give you a lower bound on compression ratio.
This only returns meaningful results if the chunks are not too small. The compression algorithm must be able to make use of a certain size of history to predict the next bytes.

It depends on the data but with images you can take small small samples. Downsampling would change the result. Here is an example:PHP - Compress Image to Meet File Size Limit.

The compression ratio can be calculated with these formulas:
And the performance benchmarking can be done using V8 or Sunspider.
You can also use algorithms like DEFLATE or LZMA to compute the mechanism. PPM (Partial by Predicting Matching) can be used for predicting.

Packing and compressing resource data

I try to pack and compress game client resource data using zlib. If I compress the data, it will reduce Disk I/O as reduced file size but it increases CPU usage when uncompress.
Question1
if a resource used for rendering is compressed, processing (rendering and uncompressing) uses CPU, so i think it seems to be rather slow, is it right?
If no compression, Disk I/O has not changed and an additional CPU usage does not occur. And if you read only a portion of the file, DISK I/O can be reduced by using the CreateFileMapping(), MapViewOfFile() functions.
Question2
In the case of the resource, such as uncompressed image(for example tga, not png) when we have to read whole file (ex. image file), we can't get adventage of CreateFileMapping(), MapViewOfFile(), so i think compressing resource is better, how do you think?
Question3
What do you think about compressing resource data when packing?

Resources for games are not only packed to reduce size, but also to reduce the number of seeks by collapsing many small files into one, which matters a lot more than the size on disk. A single unnecessary seek on a conventional hard disk costs as much time as reading a gigabyte of data. Even if your "compression" consists of only concatenating small files together, you already gain performance.
As a small bonus, having resources packed in an archive somewhat obscures them from computer unsavy people, deterring them from modifying game assets (though admittedly, this is not a very big hurdle!).
Q1: Depending on what compression algorithm you use, you can easily get upwards of 1 GB/s decompression (close to 2 GB/s with a fast CPU). Sequential disk I/O is still around 300-400 MB/s maximum even on solid state (and usually less). Random access disk I/O is 5-20 times slower, depending on the disk and the access pattern.
On the other hand, you can get as little as a few dozen kilobytes per second in decompression speed if you choose a slow algorithm, which is much worse than just loading more data from disk. The secret is to choose an algorithm that compresses reasonably well (not perfectly, just reasonably) and runs at good decompression speed. Compression speed usually does not matter, since this is done offline once. Candidate algorithms are for example LZF, Snappy, or LZ4.
File mapping can generally be used regardless of whether the contents are compressed. Also, filemapping is not only an advantage for very small portions, on the contrary. The larger your reads, the more advantageous it becomes (very small views may actually be faster using conventional reads).
Q2: Uncompressed images do not normally occur in a game. Most of the time you will want to use DXT compression, not so much to reduce disk I/O but to reduce memory and PCIe bandwidth requirements and GPU memory consumption. DXT is a very poor compression, but it works in hardware and has an exactly predictable compression ratio. You can compress DXT-compressed textures again with a conventional general-purpose compressor (with varying rates, depending on what compressor you used, there are some that are especially optimized for that purpose).
Q3: Packing resources is definitively advisable for any non-trivial game.

Can reading zipped files be faster than uncompressed?

Is there any chance that packing a large file with some simple algorithm enables me to read the data faster than from an uncompressed file (due to the hard drive being slower than uncompressing)?
What kind of compression rate would I need to have? Can any fast compression algorithm do that?

Yes. That is often the case with deflate compression, used by zip, gzip, and zlib, when reading from hard drives with a typical compression factor of, say, four.
From SSDs, you may need to go to something with faster decompression. One you could try is lz4.
Your mileage may vary.

You could also try Density, its command line client "sharc" is benchmarked here.

What approach works best for quickly reading files off of optical drives?

When reading files off of a hard drive, mmap is generally regarded as a good way to quickly get data into memory. When working with optical drives, accesses take more time and you have a higher latency to worry about. What approach/abstraction do you use to hide/eliminate as much latency and/or overall load time of the optical drive as possible?

There's no real abstraction you can employ. Optical drives have very specific characteristics that must be optimized for to get the best performance.
Some tips:
The biggest killer on optical drives is seek time. Where possible make sure all the files you are reading are sequential on disc and as closely packed as possible. If you must seek then seek in one direction and as infrequently as possible.
Asynchronous reading can also massively improve performance. If you need to load and process files A,B & C then before processing A you should start reading file B, and while processing B you should be reading file C and so on.
Generally the more data you can read in one go the better, e.g avoid lots of little reads(). You will only get the theoretical throughput of a disc while reading large amounts of data. Some OS's /drivers will minimize the penalty of reading lots of little files by caching sectors, some will not.
Doing lots of exists(filename) checking can also be detrimental on some filesystems / OSs where only parts of the TOC are cached.
In our applications we usually pack files into one or more "lumped" files and have them ordered sequentially based on their access order. Some files (and directories) are compressed and read in their entirety before being decompressed in memory. This can be a win if you have a directory that contains a multitude of small files (e.g XML or scripts).
Basically lots of benchmarking and tweaking :)

Minimize or eliminate seeks by reading in giant chunks of data sequentially from a few files (optimally one).

First you must keep in mind, that modern optical drives are quite fast reading sequential data, but seeking data is still a lot slower than on HDs. So if you must seek a lot within a big file (e.g. jump randomly around within a 500+ MB file), it might actually be faster to first copy the whole 500 MB to HD (into a temporary file), which will be done in sequential, fast reads, perform the operation on the temp file (much faster since much faster access times on HD) and delete the file again if you are done with it.
The same of above applies to little big vs many small files as well. Working with a couple of big files is much faster than with many small files, since every time you switch from one small file to another one the huge seeking time will give you headaches again. This is the reason why many games that ship on optical media packs game data in huge archive files (e.g. all textures of one level are in one huge file instead of having one small file per texture), so try keeping data well structured in big files you can read as sequential as possible.
HD caching itself is a good technique. There is this game I remember, though I forgot the title, that always kept the 3D data of your environment on HD. While you were moving through the world, it was constantly copying data from DVD to HD. Thus the surrounding 3D landscape was always available on HD for fast access, however not the whole DVD was copied, only about 200-300 MB were temporarily cached on HD to save HD space. The only annoying thing about that was that you often had DVD access "noise" while playing the game, however most of the time the whole process was happening only during CPU idle times, so it did not really affect game play. Only if you ran very fast constantly within the same direction it could happen that the DVD drive was falling back and all of a sudden the game stopped with a loading indicator for a couple of seconds. However I've been playing this games for days and maybe saw this loading indicator three times within a single week. If you were moving slow or not constantly into the same direction, there never was a loading indicator.

Slow drives are going to be slow. Sorry. However, optical drive hardware will normally be optimized to do sequential reads, so if you can make your code work that way you might see some improvement. I doubt you'll see much difference between mmap(), fread(), et al, for sequential access. You might also be able to tune your read buffer size to be a multiple of the drive's block size, if your OS isn't already doing that for you. Optical drive can have large block sizes compared to hard drives, and if your buffers aren't large enough you're paying a price.

I'm not sure that there is a lot that you can do by the time that you are reading it. You could look at the create file API -- you can pass some hints to Windows that tell it that you are opening the file for Sequential or Random access. That is supposed to allow Windows to optimize the caching strategy used for the file.
You can tune the "chunks" that you bite off when reading your file to make them larger or smaller. You might get a slight improvement if you read in chunks that are multiples of the allocation unit size on the disk.
The hardware and media can make a difference. Say you have a DVD drive that reads at 16x. It will require media that is rated at 16x or higher, and some drives don't work well with some media brands. So even if the media meets the ratings, you might not be reading at the maximum speed. (usually a good hardware review on an optical drive will include details like this).
The layout of the files on the optical disk could be important. Was it burned all at once? Was it just mounted as a disk (like a packet-mode R/W?). I don't have experience with this, but given the longer seek times on an optical drive, fragmented files might have a greater impact than they do with a modern hard drive.

What is the best compression algorithm that allows random reads/writes in a file? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 11 months ago.
Improve this question
What is the best compression algorithm that allows random reads/writes in a file?
I know that any adaptive compression algorithms would be out of the question.
And I know huffman encoding would be out of the question.
Does anyone have a better compression algorithm that would allow random reads/writes?
I think you could use any compression algorithm if you write it in blocks, but ideally I would not like to have to decompress a whole block at a time. But if you have suggestions on an easy way to do this and how to know the block boundaries, please let me know. If this is part of your solution, please also let me know what you do when the data you want to read is across a block boundary?
In the context of your answers please assume the file in question is 100GB, and sometimes I'll want to read the first 10 bytes, and sometimes I'll want to read the last 19 bytes, and sometimes I'll want to read 17 bytes in the middle. .

I am stunned at the number of responses that imply that such a thing is impossible.
Have these people never heard of "compressed file systems",
which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology?
I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.
Perhaps the simplest and best thing to do is to turn on file system compression for that file, and let the OS deal with the details.
But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression.
Also check out:
"StackOverflow: Compression formats with good support for random access within archives?"

A dictionary-based compression scheme, with each dictionary entry's code being encoded with the same size, will result in being able to begin reading at any multiple of the code size, and writes and updates are easy if the codes make no use of their context/neighbors.
If the encoding includes a way of distinguishing the start or end of codes then you do not need the codes to be the same length, and you can start reading anywhere in the middle of the file. This technique is more useful if you're reading from an unknown position in a stream.

I think Stephen Denne might be onto something here. Imagine:
zip-like compression of sequences to codes
a dictionary mapping code -> sequence
file will be like a filesystem
each write generates a new "file" (a sequence of bytes, compressed according to dictionary)
"filesystem" keeps track of which "file" belongs to which bytes (start, end)
each "file" is compressed according to dictionary
reads work filewise, uncompressing and retrieving bytes according to "filesystem"
writes make "files" invalid, new "files" are appended to replace the invalidated ones
this system will need:
defragmentation mechanism of filesystem
compacting dictionary from time to time (removing unused codes)
done properly, housekeeping could be done when nobody is looking (idle time) or by creating a new file and "switching" eventually
One positive effect would be that the dictionary would apply to the whole file. If you can spare the CPU cycles, you could periodically check for sequences overlapping "file" boundaries and then regrouping them.
This idea is for truly random reads. If you are only ever going to read fixed sized records, some parts of this idea could get easier.

I don't know of any compression algorithm that allows random reads, never mind random writes. If you need that sort of ability, your best bet would be to compress the file in chunks rather than as a whole.
e.g.We'll look at the read-only case first. Let's say you break up your file into 8K chunks. You compress each chunk and store each compressed chunk sequentially. You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O. You will need to figure out which chunk it's in (O / 8K), decompress that chunk and grab those bytes. The data you need may span multiple chunks, so you have to deal with that scenario.
Things get complicated when you want to be able to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller. You may need to add some extra padding to each chunk in case it expands (it's still the same size uncompressed, but different data will compress to different sizes). You may even need to move chunks if the compressed data is too big to fit back in the original space it was given.
This is basically how compressed file systems work. You might be better off turning on file system compression for your files and just read/write to them normally.

Compression is all about removing redundancy from the data. Unfortunately, it's unlikely that the redundancy is going to be distributed with monotonous evenness throughout the file, and that's about the only scenario in which you could expect compression and fine-grained random access.
However, you could get close to random access by maintaining an external list, built during the compression, which shows the correspondence between chosen points in the uncompressed datastream and their locations in the compressed datastream. You'd obviously have to choose a method where the translation scheme between the source stream and its compressed version does not vary with the location in the stream (i.e. no LZ77 or LZ78; instead you'd probably want to go for Huffman or byte-pair encoding.) Obviously this would incur a lot of overhead, and you'd have to decide on just how you wanted to trade off between the storage space needed for "bookmark points" and the processor time needed to decompress the stream starting at a bookmark point to get the data you're actually looking for on that read.
As for random-access writing... that's all but impossible. As already noted, compression is about removing redundancy from the data. If you try to replace data that could be and was compressed because it was redundant with data that does not have the same redundancy, it's simply not going to fit.
However, depending on how much random-access writing you're going to do -- you may be able to simulate it by maintaining a sparse matrix representing all data written to the file after the compression. On all reads, you'd check the matrix to see if you were reading an area that you had written to after the compression. If not, then you'd go to the compressed file for the data.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js