bzip2 (i.e. this program by Julian Seward) lists available block sizes between 100k and 900k:
$ bzip2 --help
bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
usage: bzip2 [flags and input files in any order]
-1 .. -9 set block size to 100k .. 900k
This number corresponds to the hundred_k_blocksize value written into the header of a compressed file.
From the documentation, memory requirements are as follows:
Compression: 400k + ( 8 x block size )
Decompression: 100k + ( 4 x block size ), or
100k + ( 2.5 x block size )
At the time the original program was written (1996), I imagine 7.6M (400k + 8 * 900k) might have been a hefty amount of memory on a computer, but for today's machines it's nothing.
My question has two parts:
1) Would better compression be achieved with larger block sizes? (Naively I'd assume yes.) Is there any reason not to use larger blocks? How does the CPU time for compression scale with the block size?
2) Practically, are there any forks of the bzip2 code (or alternate implementations) that allow for larger block sizes? Would this require significant revision to the source code?
The file format seems flexible enough to handle this. For example ... since hundred_k_blocksize holds an 8-bit character that indicates the block size, one could continue up the ASCII table to indicate larger block sizes (e.g. ':' = 0x3A => 1000k, ';' = 0x3B => 1100k, '<' = 0x3C => 1200k, ...).
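For illustration, here is a minimal C sketch of that mapping (the helper block_size_bytes and the extension past '9' are hypothetical, not taken from the bzip2 sources); the real format stores the size as a single ASCII character following the "BZh" signature:

#include <stdio.h>

/* bzip2 records the block size as the ASCII character '1'..'9' after "BZh",
 * i.e. block size = (c - '0') * 100k. Continuing up the ASCII table as
 * suggested above, ':' (0x3A) would mean 1000k, ';' 1100k, '<' 1200k, ... */
static long block_size_bytes(char hundred_k_blocksize)
{
    return (hundred_k_blocksize - '0') * 100000L;
}

int main(void)
{
    /* '1'..'9' are the real values; ':', ';', '<' are the hypothetical extension. */
    for (char c = '1'; c <= '<'; c++)
        printf("header '%c' -> block %ldk, compression memory ~%ldk\n",
               c, block_size_bytes(c) / 1000,
               400 + 8 * (block_size_bytes(c) / 1000));
    return 0;
}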
Your intuition that a larger block size should lead to a higher compression ratio is supported by Matt Mahoney's compilation of programs from his large text compression benchmark. For example, the open-source BWT program BBB (http://mattmahoney.net/dc/text.html#1640) shows a ~40% improvement in compression ratio going from a block size of 10^6 to 10^9. Between those two values, the compression time doubles.
Now that the xz program, which uses an LZ variant (LZMA2) originally described by 7-Zip author Igor Pavlov, is beginning to overtake bzip2 as the default strategy for compressing source code, it is worth studying the possibility of upping bzip2's block size to see if it might be a viable alternative. Also, bzip2 avoided arithmetic coding due to patent restrictions, which have since expired. Combined with the possibility of using the fast asymmetric numeral systems for entropy coding developed by Jarek Duda, a modernized bzip2 could very well be competitive with xz in both compression ratio and speed.
Parallel compression of a bzip2 byte stream can be done easily with a FIFO queue, where every chunk is processed as a parallel task and streamed into a file.
The other way round, parallel decompression is not so easy, because everything is bit-aligned and the exact bit length of a block is only known after it has been decompressed.
As far as I can see, parallel decompression implementations use the magic numbers for block start and stream end and perform a bit scan. Isn't there a small chance that one of the streams contains such a magic value by coincidence?
Possible block validations:
4 Bytes CRC
6 Bytes "compressed magic"
6 Bytes "end of stream magic"
some bit combinations for the Huffman trees are not allowed
max. x bytes of Huffman stream (range to search for the next magic)
Per file:
4 Bytes File CRC
padding at the end
I could implement such a scan by just bit-shifting through the stream until I find a magic value. But then, when I read block N and it fails, I should (or maybe should not) take into account that it was a false positive. For a parallel implementation I could then stop all tasks for blocks N, N+1, N+2, ..., try to find the next signature, and go on. That makes everything very complicated, and I don't know if it's worth the effort. I guess maybe not, but is there a chance that a parallel bzip2 implementation fails?
I'm also wondering why the file format uses magic numbers as markers but doesn't include jump hints. I guess the magic numbers are important for filesystem recovery, but still, why can't a block contain, e.g., 16 bits telling how far to jump to the next block?
Yes, the source code you linked notes that the magic 48-bit value can show up in compressed data by chance. It also notes the probability, around 10^-14 (actually 2^-48, closer to 3.55x10^-15). That probability applies at every bit position scanned, so on average one false positive will occur in every 32 terabytes of compressed data. That's about one month of run time on one core on my machine. Not all that long. In a production environment, you should assume that it will happen. Because it will.
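For the record, the 32-terabyte figure is just an expectation calculation, assuming each bit offset in the compressed data matches the 48-bit magic independently with probability 2^-48:

E[\text{false positives}] = N \cdot 2^{-48}, \qquad E = 1 \iff N = 2^{48}\ \text{bits} = 2^{45}\ \text{bytes} = 32\ \text{TiB}.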
Also, as noted in the source you linked, due to the possibility of a false positive you then need to validate the remainder of the block. You would not stop the processing of subsequent possible blocks, since it is extremely likely that they are all valid. Just validate all of them and keep the validated ones. When combining, verify that the valid blocks exactly cover the input, with no overlaps. A properly implemented parallel bzip2 decompressor will always work on valid bzip2 streams.
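That final consistency check might look something like the sketch below (the struct and function names are made up for illustration): each validated block records the bit range it decoded from, and the ranges must tile the whole stream exactly.

#include <stdbool.h>
#include <stddef.h>

/* Candidate blocks that decoded cleanly and passed their CRC, sorted by start bit. */
struct block { unsigned long long start_bit, end_bit; };

/* Returns true if the validated blocks exactly tile [0, total_bits):
 * no gaps and no overlaps between consecutive blocks. */
static bool blocks_cover_stream(const struct block *b, size_t n,
                                unsigned long long total_bits)
{
    unsigned long long pos = 0;
    for (size_t i = 0; i < n; i++) {
        if (b[i].start_bit != pos)   /* gap or overlap at this boundary */
            return false;
        pos = b[i].end_bit;
    }
    return pos == total_bits;        /* must end exactly at the stream end */
}

int main(void)
{
    const struct block demo[] = { {0, 1000}, {1000, 2500}, {2500, 4096} };
    return blocks_cover_stream(demo, 3, 4096) ? 0 : 1;   /* exits 0: exact tiling */
}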
It would need to be more than 16 bits, but yes, in principle a block could have contained the offset to the next block, since it already contains a CRC at the start of the block. Julian did consider that when revising bzip2, but decided against it:
bzip2-1.0.X, 0.9.5 and 0.9.0 use exactly the same file format as the
original version, bzip2-0.1. This decision was made in the interests
of stability. Creating yet another incompatible compressed file format
would create further confusion and disruption for users.
...
The compressed file format was never designed to be handled by a
library, and I have had to jump through some hoops to produce an
efficient implementation of decompression. It's a bit hairy. Try
passing decompress.c through the C preprocessor and you'll see what I
mean. Much of this complexity could have been avoided if the
compressed size of each block of data was recorded in the data stream.
I have a string like the one below
ff8870fd30db56efd72e8b499a454c4e27be6ab70e23dd59a864563628e998a
which is around 2 Kbytes. I tried to compress it but I am not getting a good compression ratio:
with gz I got only a 400 byte reduction and with deflate I got a 450 byte reduction.
Is there any better algorithm to get more compression, at least more than a 50% reduction?
By definition, you cannot compress random data, because it will not contain any structure you can represent/describe in a more efficient way, using fewer bits.
If that were possible, the data would contain a structure and would no longer be random.
A common counter-argument is that, given enough tries, even an all-zero string can be generated by an RNG, but the devil is in the details: it is all about odds!
Even in a tiny 2 KB space you have 2^(2048*8) possible strings. If the data is generated by a true RNG, or by a robust PRNG seeded with a reasonable amount of noise, the vast majority of those strings will not contain any reasonable amount of "order" you can compress.
The fact that you are obtaining a 400 B / 450 B reduction on 2 KB is a strong hint that the string you are looking at is not really random, just non-human-readable or "random-looking".
The gz format is based on the Deflate compression algorithm, so it is not clear why the two figures are presented separately - Deflate accepts various parameters for fine-tuning compression at the expense of speed, so different settings can explain the different results.
To get better compression on random-looking (but not really random!) data, you can try LZMA2 (7-Zip) or, even better, ZPAQ (http://mattmahoney.net/dc/zpaq.html).
I know that this is much later than the OP... however, if you look at HOW the data is being represented, then yes, it is going to be hard to find repetition as a string... however...
with the example
"ff8870fd30db56efd72e8b499a454c4e27be6ab70e23dd59a864563628e998a" as given...
How else COULD this information be represented? These look to all be HEX couplets, for example
"0xff 0x88 0x70" etc. So if this was stored as raw bytes, you immediately get a 2:1 reduction, since each two-character couplet fits in a single byte.
If we wanted to get very clever, we could look at some math where we map this data to more easily compressible data. Of course, this would only be beneficial for very large data, as the encoding of small amounts of data would likely make it larger.
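As a rough sketch of that idea (the function names here are made up for illustration), packing the hex text into raw bytes halves its size before any general-purpose compressor even runs; a shortened piece of the string from the question is used as the input:

#include <stdio.h>
#include <string.h>

static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;                                   /* not a hex digit */
}

/* Packs 2*n hex characters into n bytes; returns n, or -1 on bad input. */
static long hex_to_bytes(const char *hex, unsigned char *out)
{
    size_t len = strlen(hex);
    if (len % 2) return -1;
    for (size_t i = 0; i < len; i += 2) {
        int hi = hexval(hex[i]), lo = hexval(hex[i + 1]);
        if (hi < 0 || lo < 0) return -1;
        out[i / 2] = (unsigned char)((hi << 4) | lo);
    }
    return (long)(len / 2);
}

int main(void)
{
    const char *hex = "ff8870fd30db56efd72e8b499a454c4e";    /* first 32 chars of the example */
    unsigned char buf[128];
    long n = hex_to_bytes(hex, buf);
    printf("%zu hex chars -> %ld bytes\n", strlen(hex), n);  /* 32 -> 16 */
    return 0;
}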
Will using only the 2 upper/lower bytes of a CRC-32 sum make it weaker than a CRC-16?
Background:
I'm currently implementing a wireless protocol.
I have chunks of 64 bytes each, and according to
Data Length vs CRC Length
I would need at most a CRC-16.
Using a CRC-16 instead of a CRC-32 would free up bandwidth for use in forward error correction (64 bytes is one block in the FEC).
However, my hardware is quite low-powered but has hardware support for CRC32.
So my idea was to use the hardware crc32 engine and just throw away 2 of the result bytes.
I know that this is not a crc16 sum, but that does not matter because I control both sides of the transmission.
In case it matters: I can use both crc32 (poly 0x04C11DB7) or crc32c (poly 0x1EDC6F41).
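For concreteness, the approach described above might look roughly like this sketch (a plain software CRC-32 stands in for the hardware engine, and the 64-byte frame is just dummy zeros):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Bitwise CRC-32 (poly 0x04C11DB7, reflected form 0xEDB88320), standing in
 * for the hardware CRC engine in this sketch. */
static uint32_t crc32_sw(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}

int main(void)
{
    uint8_t frame[64] = { 0 };                       /* one 64-byte chunk */
    uint32_t full = crc32_sw(frame, sizeof frame);
    uint16_t check = (uint16_t)(full & 0xFFFF);      /* "throw away" 2 of the 4 bytes */
    printf("crc32 = %08X, truncated 16-bit check = %04X\n",
           (unsigned)full, (unsigned)check);
    return 0;
}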
Yes, it will be weaker, but only for small numbers of bit errors. By taking half of a CRC-32 you get none of the guarantees of a CRC-16, e.g. the number of bits in a burst that are always detectable.
What is the noise source that you are trying to protect against?
I am looking at making a file format that interleaves two types of chunks of raw bytes.
One chunk will contain a block of bzip2-compressed data, which has a header containing the usual bzip2 magic number (BZh9).
The second chunk will consist of the other data of interest, which has a header containing a different magic number (TBD).
The two magic numbers would be used for seeking, identifying and processing the two data block types differently.
My question is: is there a magic number I can pick for the second block type that would be very unlikely (or, better, impossible) to be found inside a bzip2-compressed block of bytes?
In other words, are there particular bytes that bzip2 excludes or would be probabilistically unlikely to use when compressing, within some statistical threshold, which I could use for a header for another data type in the same file?
One option is that, when I find header bytes for a second block type, I would simply try to process data in the second block type, and if that processing fails, then I assume I am accidentally inside a compressed bzip2 block. But I'd like to know if there is the possibility that there are bytes that would not be found in a bzip2 block, or would not be likely to be found.
No. bzip2 compressed data can contain any pair of bytes, essentially all with equal probability. All you could do would be to define a longer series of bytes as the signature, to reduce the probability that that series accidentally appears in the compressed data. But it still could.
The bzip2 format is self-terminating, so if you're willing to take the time to decode the bzip2 data, you can always find where the next thing is.
To answer the question in a comment, the entire bzip2 stream necessarily terminates on a byte boundary. The last byte may have 0 to 7 bits of zero pad. You can search backwards from the start of your second stream component to look for the bzip2 end marker 0x177245385090 (first 12 decimal digits of the square root of pi), which can start at any bit in a specific byte. It would be 80 to 87 bits back.
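As a rough sketch of that backward scan (the function names here are invented for illustration, not taken from any bzip2 library), one can simply try the eight possible amounts of zero padding, assuming buf ends exactly where the bzip2 stream ends:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Read 48 bits, MSB first (as bzip2 writes them), starting "bits_back" bits
 * before the end of buf[0..len). */
static uint64_t read48_from_end(const uint8_t *buf, size_t len, unsigned bits_back)
{
    uint64_t v = 0;
    size_t start = len * 8 - bits_back;
    for (unsigned i = 0; i < 48; i++) {
        size_t bit = start + i;
        v = (v << 1) | ((buf[bit / 8] >> (7 - bit % 8)) & 1);
    }
    return v;
}

/* Returns how many bits back the end-of-stream marker starts (80..87), or 0 if not found. */
static unsigned find_bz2_eos(const uint8_t *buf, size_t len)
{
    const uint64_t EOS = 0x177245385090ULL;          /* sqrt(pi) end-of-stream magic */
    for (unsigned pad = 0; pad <= 7; pad++) {        /* 0..7 bits of zero padding */
        unsigned bits_back = 48 + 32 + pad;          /* marker + combined CRC + pad */
        if (len * 8 >= bits_back && read48_from_end(buf, len, bits_back) == EOS)
            return bits_back;
    }
    return 0;
}

int main(void)
{
    /* Toy buffer: end-of-stream marker followed by a dummy 4-byte combined CRC. */
    const uint8_t demo[10] = { 0x17, 0x72, 0x45, 0x38, 0x50, 0x90, 0, 0, 0, 0 };
    printf("marker found %u bits back\n", find_bz2_eos(demo, sizeof demo));
    return 0;
}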
This question came to mind when I was trying to solve this problem.
I have a hard drive with a capacity of 120 GB, of which 100 GB is occupied by a single huge file. So 20 GB is still free.
My question is, how can we split this huge file into smaller ones, say 1 GB each? I see that if I had ~100 GB of free space, it would probably be possible with a simple algorithm. But given only 20 GB of free space, we can write up to twenty 1 GB files. I have no idea how to delete contents from the bigger file while reading from it.
Any solution?
It seems I have to truncate the file by 1 GB once I finish writing one file, but that boils down to this question:
Is it possible to truncate a part of a file? How exactly?
I would like to see an algorithm (or an outline of an algorithm) that works in C or C++ (preferably Standard C and C++), so I may know the lower level details. I'm not looking for a magic function, script or command that can do this job.
According to this question (Partially truncating a stream), on a POSIX-compliant system you should be able to use a call to int ftruncate(int fildes, off_t length) to resize an existing file.
Modern implementations will probably resize the file "in place" (though this is unspecified in the documentation). The only gotcha is that you may have to do some extra work to ensure that off_t is a 64 bit type (provisions exist within the POSIX standard for 32 bit off_t types).
You should take steps to handle error conditions, just in case it fails for some reason, since obviously, any serious failure could result in the loss of your 100GB file.
Pseudocode (assume, and take steps to ensure, all data types are large enough to avoid overflows):
open (string filename) // opens a file, returns a file descriptor
file_size (descriptor file) // returns the absolute size of the specified file
seek (descriptor file, position p) // moves the caret to specified absolute point
copy_to_new_file (descriptor file, string newname)
// creates file specified by newname, copies data from specified file descriptor
// into newfile until EOF is reached
set descriptor = open ("MyHugeFile")
set gigabyte = 2^30 // 1024 * 1024 * 1024 bytes
set filesize = file_size(descriptor)
set blocks = (filesize + gigabyte - 1) / gigabyte
loop (i = blocks; i > 0; --i)
    set truncpos = gigabyte * (i - 1)
    seek (descriptor, truncpos)
    copy_to_new_file (descriptor, "MyHugeFile" + i)
    ftruncate (descriptor, truncpos)
Obviously some of this pseudocode is analogous to functions found in the standard library. In other cases, you will have to write your own.
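For reference, here is one way that loop might look in POSIX C (the output naming scheme is made up, error handling is kept to a bare minimum, and on 32-bit systems you would compile with -D_FILE_OFFSET_BITS=64 so that off_t is 64 bits, as noted above):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define GIGABYTE ((off_t)1 << 30)

/* Copies everything from offset "from" to EOF of fd into a new file "newname". */
static void copy_tail_to_new_file(int fd, off_t from, const char *newname)
{
    char buf[1 << 16];
    ssize_t n;
    int out = open(newname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) { perror("open"); exit(1); }
    if (lseek(fd, from, SEEK_SET) < 0) { perror("lseek"); exit(1); }
    while ((n = read(fd, buf, sizeof buf)) > 0)
        if (write(out, buf, (size_t)n) != n) { perror("write"); exit(1); }
    close(out);
}

int main(void)
{
    int fd = open("MyHugeFile", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    off_t filesize = lseek(fd, 0, SEEK_END);
    off_t blocks = (filesize + GIGABYTE - 1) / GIGABYTE;

    for (off_t i = blocks; i > 0; --i) {
        off_t truncpos = GIGABYTE * (i - 1);
        char name[64];
        snprintf(name, sizeof name, "MyHugeFile.%lld", (long long)i);
        copy_tail_to_new_file(fd, truncpos, name);   /* copy the last chunk out */
        if (ftruncate(fd, truncpos) != 0) {          /* then chop it off the big file */
            perror("ftruncate");
            return 1;
        }
    }
    close(fd);
    return 0;
}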
There is no standard function for this job.
For Linux you can use the ftruncate method, while for Windows you can use _chsize or SetEndOfFile. A simple #ifdef will make it cross-platform.
Also read this Q&A.
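Along those lines, a minimal cross-platform truncate wrapper might look like this sketch (truncate_fd is just an illustrative name; the Windows branch assumes _chsize_s from <io.h>):

#include <stdio.h>

#ifdef _WIN32
#include <io.h>
#define truncate_fd(fd, len) _chsize_s((fd), (len))
#else
#include <unistd.h>
#define truncate_fd(fd, len) ftruncate((fd), (off_t)(len))
#endif

int main(void)
{
    FILE *f = fopen("MyHugeFile", "r+b");          /* must already exist */
    if (!f) { perror("fopen"); return 1; }
    if (truncate_fd(fileno(f), 1LL << 30) != 0)    /* drop everything past 1 GB */
        perror("truncate_fd");
    fclose(f);
    return 0;
}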