Related
I was just wondering what could be the impact if, say, Microsoft decided to automaticly "lossless" compress every single file saved in a computer.
What are the pros? The cons? Is it feasible?
Speed.
When compressing a file of any kind you're encoding its contents in a more compact form, often using dictionaries and/or prefix codes (An example: huffman coding). To access the data you have to uncompress it, and this translates to time and used memory, as to access a specific piece of the file you have to decompress it as a whole. While decompressing you ave to save the results somewhere and the most appropriate place is RAM.
Of course this wouldn't be a great problem (decompressing the whole file) if all of it needed to be read, and not even in the case of a stream reading it, but if a program wanted to write to the compressed file all the data would have to be compressed again, or at least a part of it.
As you can see, compressing files in the filesystem would reduce a lot the bandwidth available to applications - to read a single byte you have to read a chunk of the file and decompress it - and also require more RAM.
In the Unix world, there is a famous format called "tar.gz".
But now, I want to develop a game and random accessing a file will be more efficient. If it is archived first, it will cause sequential access.
I know that there is an alternative format called zip or 7z, but what about other formats?
Not only gz.tar, I'd like to a minor compressing library and also get archiving features.
Should I use *.tar or other solutions are available?
PS: I'm using C++.
"Random" access is not good on a .tar.gz, since that is a .tar file that has been wrapped in a .gz compression, so to get to things in the .tar file, you'd first have to decompress the .tar file.
It would be possible to use a .tar file that contains individual files compressed with .gz. You can read the table of content of the .tar file and find/store where all the files are in the archive, and then extract as you need. However, you may find that using your own format is "better" (for example, if I remember correctly, the "header" for a tar-archive is a file at a time, you may want to build your header in one lump, before you store the files [which does mean at least enumerating all the relevant files first, then forming the compressed variant and "patching up" the header with the offsets in compressed form]
For a game, one critical factor would probably be the decompression speed, so you may want to look at different libraries and which one has the best decompression speed. I found this when searching for a comparison:
http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
You may also care about memory usage, which also varies a bit depending on algorithm.
And I'm guessing your individual files will be much smaller than the entire tar-ball of Linux, so you may want to do your own benchmark, with your own data - after all, the speed of different compression formats does, to some degree, depend on the format of the data.
Normally, for computer games, what you need is a format where each file is compressed individually before being assembled into one file. This is the crucial difference between .tar.gz and .zip / .7z formats, that is, tar-gz is a "compressed archive" while zip / 7z are "archives of compressed files". In fact, both file formats use the same compression algorithm (by default), and the only reason that .tar.gz files are typically smaller is because they compress the entire archive instead of file-by-file, which increases the overall compression ratio.
AFAIK, most computer games use a zip format or a custom format that closely matches it, because it does per-file compression. For instance, Quake engines have always (.pak, .pk3, .pk4) relied on an off-the-shelf zip format with a few minor additions (like a built-in checksum, I think).
The .tar.gz format is created by first making an archive that puts all the (uncompressed) files into one .tar file. Then, that big archive file is compressed with the gzip method to create the final .tar.gz file. The point is that to get any one of the files from the archive you have the decompress the entire thing. This is very appropriate for backups or large transfers, but not appropriate at all for a game engine media archive.
That said, you could technically do the reverse of tar-gz, which is to compress each file individually with gzip, and then put them together in a .tar archive. But this is probably not worth the extra trouble, as it is pretty much exactly what zip files are (in "one easy step"). So, it will be a lot easier to use an off-the-shelf all-in-one format like zip that will allow you to extract individual files at a time. There are many off-the-shelf libraries for extracting and manipulating files in zip archives, just start with libzip (not to be confused with zlib (for gzip or .gz)).
In the Unix world, there is a famous format called "tar.gz".
Probably the biggest reason why "tar-ballz" are so popular and famously used in Unix-like systems is that they preserve file permissions (and other meta-data, I guess). I think that some implementations of zip and 7z might provide that feature as an extension to the format, but most don't have it. The convenient thing with tar archives is that whatever you put in there comes out exactly the same at the other end, with all permissions and whatever else preserved. And the "gzip" compression (from zlib) has just been historically an industry-standard compression algorithm, although, now, there are better ones, also supported by tar, such as .tar.lzma (or .tlz) or .tar.xz.
but what about other formats?
There aren't really that many other formats. Mostly, compressed archive formats often reuse the same few algorithms (DEFLATE, LZ77 / LZMA / LZMA2, BZIP, etc.), and often, formats like zip / 7z / rar are only really container formats that can employ any of those compression algorithms (and even mix and match depending on the individual file types). The point is that you won't really find much that is better than zip or 7z. And their competitors are more or less gone today (like rar?).
Should I use *.tar or other solutions are available?
No, use zip or 7z. Tar-balls are for backups. They are optimized for that purpose (e.g., dump a large folder full of files into a tar-ball, and recover it later, with everything preserved and with best full-archive compression). For your application, zip or 7z is more appropriate.
Is this possible without any consistency problem.
Which popular compression algorithm will best suite for these kind of purpose ?
A compressed file system is the simplest way to do that. You can enable compression in NTFS, ZFS or BTRFS, depending on your OS. But the compression is not great.
A fast algorithm with an flexible container would also do. Zip seems workable: How to update one file in a zip archive
I have a map-reduce java program in which I try to only compress the mapper output but not the reducer output. I thought that this would be possible by setting the following properties in the Configuration instance as listed below. However, when I run my job, the generated output by the reducer still is compressed since the file generated is: part-r-00000.gz. Has anyone successfully just compressed the mapper data but not the reducer? Is that even possible?
//Compress mapper output
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString());
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
mapred.compress.map.output: Is the compression of data between the mapper and the reducer. If you use snappy codec this will most likely increase read write speed and reduce network overhead. Don't worry about spitting here. These files are not stored in hdfs. They are temp files that exist only for the map reduce job.
mapred.map.output.compression.codec: I would use snappy
mapred.output.compress: This boolean flag will define is the whole map/reduce job will output compressed data. I would always set this to true also. Faster read/write speeds and less disk spaced used.
mapred.output.compression.type: I use block. This will make the compression splittable even for all compression formats (gzip, snappy, and bzip2) just make sure you're using a splitable file format like sequence, RCFile, or Avro.
mapred.output.compression.codec: this is the compression codec for the map/reduce job. I mostly use one of the three: Snappy (Fastest r/w 2x-3x compression), gzip (normal r fast w 5x-8x compression), bzip2 (slow r/w 8x-12x compression)
Also remember when compression mapred output, that because of splitting compression will differ base on your sorting order. The close like data is together the better the compression.
With MR2, now we should set
conf.set("mapreduce.map.output.compress", true)
conf.set("mapreduce.output.fileoutputformat.compress", false)
For more details, refer: http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
"output compression" will compress your final output. To compress map-outputs only, use something like this:
conf.set("mapred.compress.map.output", "true")
conf.set("mapred.output.compression.type", "BLOCK");
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec");
You need to set "mapred.compress.map.output" to true.
Optionally you can choose your compression codec by setting "mapred.map.output.compression.codec".
NOTE1: mapred output compression should never be BLOCK. See the following JIRA for detail:
https://issues.apache.org/jira/browse/HADOOP-1194
NOTE2: GZIP and BZ2 are CPU intensive. If you have slow network and GZIP or BZ2 gives better compression ratio, it may justify the spending of CPU cycles. Otherwise, consider LZO or Snappy codec.
NOTE3: if you want to use map output compression, consider install the native codec which is invoked via JNI and gives you better performance.
If you use MapR's distribution for Hadoop, you can get the benefits of compression without all the folderol with the codecs.
MapR compresses natively at the file system level so that the application doesn't need to know or care. Compression can be turned on or off at the directory level so you can compress inputs, but not outputs or whatever you like. Generally, the compression is so fast (it uses an algorithm similar to snappy by default) that most applications see a performance boost when using native compression. If your files are already compressed, that is detected very quickly and compression is turned off automatically so you don't see a penalty there, either.
I have started to look into Hadoop. If my understanding is right i could process a very big file and it would get split over different nodes, however if the file is compressed then the file could not be split and wold need to be processed by a single node (effectively destroying the advantage of running a mapreduce ver a cluster of parallel machines).
My question is, assuming the above is correct, is it possible to split a large file manually in fixed-size chunks, or daily chunks, compress them and then pass a list of compressed input files to perform a mapreduce?
BZIP2 is splittable in hadoop - it provides very good compression ratio but from CPU time and performances is not providing optimal results, as compression is very CPU consuming.
LZO is splittable in hadoop - leveraging hadoop-lzo you have splittable compressed LZO files. You need to have external .lzo.index files to be able to process in parallel. The library provides all means of generating these indexes in local or distributed manner.
LZ4 is splittable in hadoop - leveraging hadoop-4mc you have splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with provided command line tool or by Java/C code, inside/outside hadoop. 4mc makes available on hadoop LZ4 at any level of speed/compression-ratio: from fast mode reaching 500 MB/s compression speed up to high/ultra modes providing increased compression ratio, almost comparable with GZIP one.
Consider using LZO compression. It's splittable. That means a big .lzo file can be processed by many mappers. Bzip2 can do that, but it's slow.
Cloudera had an introduction about it. For MapReduce, LZO sounds a good balance between compression ratio and compress/decompress speed.
yes, you could have one large compressed file, or multiple compressed files (multiple files specified with -files or the api).
TextInputFormat and descendants should automatically handle .gz compressed files. you can also implement your own InputFormat (which will split the input file into chunks for processing) and RecordReader (which extract one record at a time from the chunk)
another alternative for generic copmression might be to use a compressed file system (such as ext3 with the compression patch, zfs, compFUSEd, or FuseCompress...)
You can use bz2 as your compress codec, and this format also can been split.