Very basic question about Hadoop and compressed input files - compression

I have started to look into Hadoop. If my understanding is right i could process a very big file and it would get split over different nodes, however if the file is compressed then the file could not be split and wold need to be processed by a single node (effectively destroying the advantage of running a mapreduce ver a cluster of parallel machines).
My question is, assuming the above is correct, is it possible to split a large file manually in fixed-size chunks, or daily chunks, compress them and then pass a list of compressed input files to perform a mapreduce?

BZIP2 is splittable in hadoop - it provides very good compression ratio but from CPU time and performances is not providing optimal results, as compression is very CPU consuming.
LZO is splittable in hadoop - leveraging hadoop-lzo you have splittable compressed LZO files. You need to have external .lzo.index files to be able to process in parallel. The library provides all means of generating these indexes in local or distributed manner.
LZ4 is splittable in hadoop - leveraging hadoop-4mc you have splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with provided command line tool or by Java/C code, inside/outside hadoop. 4mc makes available on hadoop LZ4 at any level of speed/compression-ratio: from fast mode reaching 500 MB/s compression speed up to high/ultra modes providing increased compression ratio, almost comparable with GZIP one.

Consider using LZO compression. It's splittable. That means a big .lzo file can be processed by many mappers. Bzip2 can do that, but it's slow.
Cloudera had an introduction about it. For MapReduce, LZO sounds a good balance between compression ratio and compress/decompress speed.

yes, you could have one large compressed file, or multiple compressed files (multiple files specified with -files or the api).
TextInputFormat and descendants should automatically handle .gz compressed files. you can also implement your own InputFormat (which will split the input file into chunks for processing) and RecordReader (which extract one record at a time from the chunk)
another alternative for generic copmression might be to use a compressed file system (such as ext3 with the compression patch, zfs, compFUSEd, or FuseCompress...)

You can use bz2 as your compress codec, and this format also can been split.

Related

Compressing large, near-identical files

I have a bunch of large HDF5 files (all around 1.7G), which share a lot of their content – I guess that more than 95% of the data of each file is found repeated in every other.
I would like to compress them in an archive.
My first attempt using GNU tar with the -z option (gzip) failed: the process was terminated when the archive reached 50G (probably a file size limitation imposed by the sysadmin). Apparently, gzip wasn't able to take advantage of the fact that the files are near-identical in this setting.
Compressing these particular files obviously doesn't require a very fancy compression algorithm, but a veeery patient one.
Is there a way to make gzip (or another tool) detect these large repeated blobs and avoid repeating them in the archive?
Sounds like what you need is a binary diff program. You can google for that, and then try using binary diff between two of them, and then compressing one of them and the resulting diff. You could get fancy and try diffing all combinations, picking the smallest ones to compress, and send only one original.

Streaming mode vs block mode

I can't figure out what exactly is the streaming mode offered by modern compression/decompression algorithms (eg ZStandard or LZ4) and how I can exploit it.
As an example, suppose I have 4x16KB file. I can (individually) compress each file and obtain 4xDifferentCompressedLength files. However I could compress all 4 files together (sending them sequentially, right?) using streaming mode and obtain 1xCompressedLength and expect the compression ratio to be better.
Can I decompress (say) only the 3rd file without decompressing all the previous files? Do streaming mode introduce dependency between the files I appended?
Yes, streaming introduce dependency between files.
In your example, decoding file3 would require to decode first file1 then file2.
Note also that data will appear as appended, with no specific marker between files. So one would need a way to know where each file starts and ends if it's important. Sometimes it's implicit (ex : fixed 16KB size), sometimes it can be deducted from data itself (specific end-of-mark), sometimes it needs additional metadata. It all depends on the application.
You are correct that the compression ratio of C(4xFiles) is expected to be better than 4xC(File), especially if the 4 files are somewhat related (for example if they all are text files).

Should I use .tar.gz?

In the Unix world, there is a famous format called "tar.gz".
But now, I want to develop a game and random accessing a file will be more efficient. If it is archived first, it will cause sequential access.
I know that there is an alternative format called zip or 7z, but what about other formats?
Not only gz.tar, I'd like to a minor compressing library and also get archiving features.
Should I use *.tar or other solutions are available?
PS: I'm using C++.
"Random" access is not good on a .tar.gz, since that is a .tar file that has been wrapped in a .gz compression, so to get to things in the .tar file, you'd first have to decompress the .tar file.
It would be possible to use a .tar file that contains individual files compressed with .gz. You can read the table of content of the .tar file and find/store where all the files are in the archive, and then extract as you need. However, you may find that using your own format is "better" (for example, if I remember correctly, the "header" for a tar-archive is a file at a time, you may want to build your header in one lump, before you store the files [which does mean at least enumerating all the relevant files first, then forming the compressed variant and "patching up" the header with the offsets in compressed form]
For a game, one critical factor would probably be the decompression speed, so you may want to look at different libraries and which one has the best decompression speed. I found this when searching for a comparison:
http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
You may also care about memory usage, which also varies a bit depending on algorithm.
And I'm guessing your individual files will be much smaller than the entire tar-ball of Linux, so you may want to do your own benchmark, with your own data - after all, the speed of different compression formats does, to some degree, depend on the format of the data.
Normally, for computer games, what you need is a format where each file is compressed individually before being assembled into one file. This is the crucial difference between .tar.gz and .zip / .7z formats, that is, tar-gz is a "compressed archive" while zip / 7z are "archives of compressed files". In fact, both file formats use the same compression algorithm (by default), and the only reason that .tar.gz files are typically smaller is because they compress the entire archive instead of file-by-file, which increases the overall compression ratio.
AFAIK, most computer games use a zip format or a custom format that closely matches it, because it does per-file compression. For instance, Quake engines have always (.pak, .pk3, .pk4) relied on an off-the-shelf zip format with a few minor additions (like a built-in checksum, I think).
The .tar.gz format is created by first making an archive that puts all the (uncompressed) files into one .tar file. Then, that big archive file is compressed with the gzip method to create the final .tar.gz file. The point is that to get any one of the files from the archive you have the decompress the entire thing. This is very appropriate for backups or large transfers, but not appropriate at all for a game engine media archive.
That said, you could technically do the reverse of tar-gz, which is to compress each file individually with gzip, and then put them together in a .tar archive. But this is probably not worth the extra trouble, as it is pretty much exactly what zip files are (in "one easy step"). So, it will be a lot easier to use an off-the-shelf all-in-one format like zip that will allow you to extract individual files at a time. There are many off-the-shelf libraries for extracting and manipulating files in zip archives, just start with libzip (not to be confused with zlib (for gzip or .gz)).
In the Unix world, there is a famous format called "tar.gz".
Probably the biggest reason why "tar-ballz" are so popular and famously used in Unix-like systems is that they preserve file permissions (and other meta-data, I guess). I think that some implementations of zip and 7z might provide that feature as an extension to the format, but most don't have it. The convenient thing with tar archives is that whatever you put in there comes out exactly the same at the other end, with all permissions and whatever else preserved. And the "gzip" compression (from zlib) has just been historically an industry-standard compression algorithm, although, now, there are better ones, also supported by tar, such as .tar.lzma (or .tlz) or .tar.xz.
but what about other formats?
There aren't really that many other formats. Mostly, compressed archive formats often reuse the same few algorithms (DEFLATE, LZ77 / LZMA / LZMA2, BZIP, etc.), and often, formats like zip / 7z / rar are only really container formats that can employ any of those compression algorithms (and even mix and match depending on the individual file types). The point is that you won't really find much that is better than zip or 7z. And their competitors are more or less gone today (like rar?).
Should I use *.tar or other solutions are available?
No, use zip or 7z. Tar-balls are for backups. They are optimized for that purpose (e.g., dump a large folder full of files into a tar-ball, and recover it later, with everything preserved and with best full-archive compression). For your application, zip or 7z is more appropriate.

Hadoop, how to compress mapper output but not the reducer output

I have a map-reduce java program in which I try to only compress the mapper output but not the reducer output. I thought that this would be possible by setting the following properties in the Configuration instance as listed below. However, when I run my job, the generated output by the reducer still is compressed since the file generated is: part-r-00000.gz. Has anyone successfully just compressed the mapper data but not the reducer? Is that even possible?
//Compress mapper output
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString());
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
mapred.compress.map.output: Is the compression of data between the mapper and the reducer. If you use snappy codec this will most likely increase read write speed and reduce network overhead. Don't worry about spitting here. These files are not stored in hdfs. They are temp files that exist only for the map reduce job.
mapred.map.output.compression.codec: I would use snappy
mapred.output.compress: This boolean flag will define is the whole map/reduce job will output compressed data. I would always set this to true also. Faster read/write speeds and less disk spaced used.
mapred.output.compression.type: I use block. This will make the compression splittable even for all compression formats (gzip, snappy, and bzip2) just make sure you're using a splitable file format like sequence, RCFile, or Avro.
mapred.output.compression.codec: this is the compression codec for the map/reduce job. I mostly use one of the three: Snappy (Fastest r/w 2x-3x compression), gzip (normal r fast w 5x-8x compression), bzip2 (slow r/w 8x-12x compression)
Also remember when compression mapred output, that because of splitting compression will differ base on your sorting order. The close like data is together the better the compression.
With MR2, now we should set
conf.set("mapreduce.map.output.compress", true)
conf.set("mapreduce.output.fileoutputformat.compress", false)
For more details, refer: http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
"output compression" will compress your final output. To compress map-outputs only, use something like this:
conf.set("mapred.compress.map.output", "true")
conf.set("mapred.output.compression.type", "BLOCK");
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec");
You need to set "mapred.compress.map.output" to true.
Optionally you can choose your compression codec by setting "mapred.map.output.compression.codec".
NOTE1: mapred output compression should never be BLOCK. See the following JIRA for detail:
https://issues.apache.org/jira/browse/HADOOP-1194
NOTE2: GZIP and BZ2 are CPU intensive. If you have slow network and GZIP or BZ2 gives better compression ratio, it may justify the spending of CPU cycles. Otherwise, consider LZO or Snappy codec.
NOTE3: if you want to use map output compression, consider install the native codec which is invoked via JNI and gives you better performance.
If you use MapR's distribution for Hadoop, you can get the benefits of compression without all the folderol with the codecs.
MapR compresses natively at the file system level so that the application doesn't need to know or care. Compression can be turned on or off at the directory level so you can compress inputs, but not outputs or whatever you like. Generally, the compression is so fast (it uses an algorithm similar to snappy by default) that most applications see a performance boost when using native compression. If your files are already compressed, that is detected very quickly and compression is turned off automatically so you don't see a penalty there, either.

What compression/archive formats support inter-file compression?

This question on archiving PDF's got me wondering -- if I wanted to compress (for archival purposes) lots of files which are essentially small changes made on top of a master template (a letterhead), it seems like huge compression gains can be had with inter-file compression.
Do any of the standard compression/archiving formats support this? AFAIK, all the popular formats focus on compressing each single file.
Several formats do inter-file compression.
The oldest example is .tar.gz; a .tar has no compression but concatenates all the files together, with headers before each file, and a .gz can compress only one file. Both are applied in sequence, and it's a traditional format in the Unix world. .tar.bz2 is the same, only with bzip2 instead of gzip.
More recent examples are formats with optional "solid" compression (for instance, RAR and 7-Zip), which can internally concatenate all the files before compressing, if enabled by a command-line flag or GUI option.
Take a look at google's open-vcdiff.
http://code.google.com/p/open-vcdiff/
It is designed for calculating small compressed deltas and implements RFC 3284.
http://www.ietf.org/rfc/rfc3284.txt
Microsoft has an API for doing something similar, sans any semblance of a standard.
In general the algorithms you are looking for are ones based on Bentley/McIlroy:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8470
In particular these algorithms will be a win if the size of the template is larger than the window size (~32k) used by gzip or the block size (100-900k) used by bzip2.
They are used by Google internally inside of their BIGTABLE implementation to store compressed web pages for much the same reason you are seeking them.
Since LZW compression (which pretty much they all use) involves building a table of repeated characters as you go along, such as schema as you desire would limit you to having to decompress the entire archive at once.
If this is acceptable in your situation, it may be simpler to implement a method which just joins your files into one big file before compression.