Parallel bzip2 decompression scan may fail? - compression

Bzip2 byte stream compression in parallel can be easily done with a FIFO queue where every chunk is processed as parallel task and streamed into a file.
The other way round parallel decompression is not so easy, because everything is bit-aligned and the exact bit-length of a block is known after it's decompressed.
As far as I can see, parallel decompress implementations use magic numbers for block start and stream end and perform a bit-scan. Isn't there a small chance that one of the streams contain such a magic value by coincidence?
Possible block validations:
4 Bytes CRC
6 Bytes "compressed magic"
6 Bytes "end of stream magic"
some bit combinations for the huffman trees are not allowed
max. x Bytes of huffman stream (range to search for next magic)
Per file:
4 Bytes File CRC
padding at the end
I could implement such a scan by just bit-shifting from the stream until I have a magic. But then when I read block N and it fails, I should (maybe also not) take into account, that it was a false positive. For a parallel implementation I can then stop all tasks for blocks N, N+1, N+2, .. , then try to find the next signature and go one. That makes everything very complicated and I don't know if it's worth the effort? I guess maybe not, but is there a chance that a parallel bzip2 implementation fails?
I'm wondering why a file format uses magic numbers as markers, but doesn't include jump hints. I guess the magic numbers are important for filesystem recovery, but anyway, why can't a block contain e.g 16bits for telling how far to jump to the next block.

Yes, the source code you linked notes that the magic 48-bit value can show up in compressed data by chance. It also notes the probability, around 10-14 (actually 2-48, closer to 3.55x10-15). That probability is at every sample, so on average one will occur in every 32 terabytes of compressed data. That's about one month of run time on one core on my machine. Not all that long. In a production environment, you should assume that it will happen. Because it will.
Also as noted in the source you linked, due to the possibility of a false positive, you need to then validate the remainder of the block. You would not stop the subsequent possible block processing, since it is extremely likely that they are all valid blocks. Just validate all and keep the validated ones. Verify when combining that the valid blocks exactly covered the input, with no overlaps. A properly implemented parallel bzip2 decompressor will always work on valid bzip2 streams.
It would need to be more than 16 bits, but yes, in principle a block could have contained the offset to the next block, since it already contains a CRC at the start of the block. Julian did consider that in the revision of bzip2, but decided against it:
bzip2-1.0.X, 0.9.5 and 0.9.0 use exactly the same file format as the
original version, bzip2-0.1. This decision was made in the interests
of stability. Creating yet another incompatible compressed file format
would create further confusion and disruption for users.
...
The compressed file format was never designed to be handled by a
library, and I have had to jump though some hoops to produce an
efficient implementation of decompression. It's a bit hairy. Try
passing decompress.c through the C preprocessor and you'll see what I
mean. Much of this complexity could have been avoided if the
compressed size of each block of data was recorded in the data stream.

Related

Best compression algo for random string

I have some string like below
ff8870fd30db56efd72e8b499a454c4e27be6ab70e23dd59a864563628e998a
which is around 2Kbytes I tried to compress but I am not getting good compression ratio
with gz I got only 400 bytes reduction and with defalte I got 450 reduction ..
Is there any better alogoritham to have more compression atleast more than 50% reduction .
By definition, you cannot compress random data because it will not contain any structure you can represent / describe in a more efficient way, using less bits.
If this is possible, the data contains a structure and is no longer random.
A common counter-argument is that, given enough odds, even an all 0 string can be generated by a RNG, but devil is in details: it is all about odds!
Even in a tiny 2KB space you have 2^(2048*8) possible strings if the data is generated by a true RNG or a robust PRNG algorithm seeded with reasonable amount of noise, and the vast majority of those stings will not contain any reasonable amout of "order" you can compress.
The fact you are obtaining a 400 B / 450 B compression on 2 KB is a strong hint the string you are looking at is not really random, just non-human readable or "random-looking".
GZ format is based on Deflate compression algorithm, so it is not clear why the two figures are presented separately - Deflate accepts various parameters for fine tuning compression at the expense of speed, so different settings can justify the different results.
To get better compression on random-looking (but not really random!) data can try with LZMA2 (7-Zip) or even better ZPAQ (http://mattmahoney.net/dc/zpaq.html).
I do know that this is much later than the OP.... however if you look at HOW the data is being represented then yes it is going to be hard to find repetition as a string... however...
with the example
"ff8870fd30db56efd72e8b499a454c4e27be6ab70e23dd59a864563628e998a" as given...
How else COULD this information be represented? these look to all be HEX couplets.. .for example
"0xff 0x88 0x70" etc... so.....if this was stored in bytes.... you automatically get 100% compression since each character is a single byte in itself...
if We wanted to get very clever we could look at some math where say we could map this data to more easily compressible data.. of course this would only be beneficial for very large data, as the encoding of small amounts of data would likely make it larger...

Interleaving bzip2 and non-bzip2 data

I am looking at making a file format that interleaves two types of chunks of raw bytes.
One chunk will contain a block of bzip2-compressed data, which has a header containing the usual bzip2 magic number (BZh9).
The second chunk will consist of the other data of interest, which has a header containing a different magic number (TBD).
The two magic numbers would be used for seeking, identifying and processing the two data block types differently.
My question is: Is there a magic number I can pick for the second block type, which would very unlikely (or better, impossible) to be found inside a bzip2-compressed block of bytes?
In other words, are there particular bytes that bzip2 excludes or would be probabilistically unlikely to use when compressing, within some statistical threshold, which I could use for a header for another data type in the same file?
One option is that, when I find header bytes for a second block type, I would simply try to process data in the second block type, and if that processing fails, then I assume I am accidentally inside a compressed bzip2 block. But I'd like to know if there is the possibility that there are bytes that would not be found in a bzip2 block, or would not be likely to be found.
No. bzip2 compressed data can contain any pair of bytes, essentially all with equal probability. All you could do would be to define a longer series of bytes as the signature, to reduce the probability that that series accidentally appears in the compressed data. But it still could.
The bzip2 format is self-terminating, so if you're willing to take the time to decode the bzip2 data, you can always find where the next thing is.
To answer the question in a comment, the entire bzip2 stream necessarily terminates on a byte boundary. The last byte may have 0 to 7 bits of zero pad. You can search backwards from the start of your second stream component to look for the bzip2 end marker 0x177245385090 (first 12 decimal digits of the square root of pi), which can start at any bit in a specific byte. It would be 80 to 87 bits back.

How to decompress less than original size with Lz4 library?

I'm using LZ4 library and when decompressing data with:
int LZ4_decompress_fast_continue (void* LZ4_streamDecode, const char* source, char* dest, int originalSize);
I need only first n bytes of the originally encoded N bytes, where n < N. So in order to improve the performance, it makes sense to decompress only a part of the original buffer.
I wonder if I can pass n instead of N to the originalSize argument of the function?
My initial test showed, that it's not possible (I got incorrectly decompressed data). Though maybe there is a way, for example if n is a multiple of some CHUNK_SIZE? All original N bytes were compressed with 1 call of a compress function.
LZ4_decompress_safe_continue() and LZ4_decompress_fast_continue() can only decode full blocks. They consider a partial block as an error, and report it as such. They also consider that if there is not enough room to decompress a full block, it's also an error.
The functionality you are looking for doesn't exist yet. But there is a close cousin that might help.
LZ4_decompress_safe_partial() can decode a part of a block.
Note that, in contrast with _continue() variants, it only works on independent blocks.
Note also that the compressed block must nonetheless be complete, and the output buffer must nonetheless have enough space to decode the entire block. So the only advantage provided by this function is speed : if you want only the first 10 bytes, it will stop as soon as it has generated enough bytes.
"as soon as" doesn't mean "exactly at 10". It could be much later, and in the worst case, it could be after decoding the entire block. That's because the internal decoding engine is still the same : it decodes entire sequences, and doesn't "break them" in the middle, for speed considerations.
If you need to extract less bytes than a full block in order to save some memory, I'm afraid there is no solution yet. Report it as a feature request to upstream.
This seems to have been implemented in lz4 1.8.3.

Proper, efficient file reading

I'd like to read and process (e.g. print) entries from the first row of a CSV file one at a time. I assume Unix-style \n newlines, that no entry is longer than 255 chars and (for now) that there's a newline before EOF. This is meant to be a more efficient alternative to fgets() followed by strtok().
#include <stdio.h>
#include <string.h>
int main() {
int i;
char ch, buf[256];
FILE *fp = fopen("test.csv", "r");
for (;;) {
for (i = 0; ; i++) {
ch = fgetc(fp);
if (ch == ',') {
buf[i] = '\0';
puts(buf);
break;
} else if (ch == '\n') {
buf[i] = '\0';
puts(buf);
fclose(fp);
return 0;
} else buf[i] = ch;
}
}
}
Is this method as efficient and correct as possible?
What is the best way to test for EOF and file reading errors using this method? (Possibilities: testing against the character macro EOF, feof(), ferror(), etc.).
Can I perform the same task using C++ file I/O with no loss of efficiency?
What is most efficient is going to depend a lot on the operating system, standard libraries (e.g. libc), and even the hardware you are running on. This makes it nearly impossible to tell you what's "most efficient".
That having been said, there are a few things you could try:
Use mmap() or a local operating system equivalent (Windows has CreateFileMapping / OpenFileMapping / MapViewOfFile, and probably others). Then you don't do explicit file reads: you simply access the file as if it were already in memory, and anything that isn't there will be faulted in by the page fault mechanism.
Read the entire file into a buffer manually, then work on that buffer. The fewer times you call into file read functions, the fewer function-call overheads you take, and likely also fewer application/OS domain switches. Obviously this uses more memory, but may very well be worth it.
Use a more optimal string scanner for your problem and platform. Going character-by-character yourself is almost never as fast as relying on something existing that's close to your problem domain. For example, you can bet that strchr and memchr are probably better-optimized than most code you can roll yourself, doing things like reading entire cachelines or words at once, scanning using better algorithms for this kind of search, etc. For more complicated cases, you might consider a full regular expression engine that could compile your regex to something fast for your complicated case.
Avoid copying your string around. It may be helpful to think in terms of "find delimiters" and then "output between delimiters". You could for example use strchr to find the next character of interest, and then fwrite or something to write to stdout directly from your input buffer. Then you're keeping most of your work in a few local registers, rather than using a stack or heap buf.
When in doubt, though, try a few possibilities and profile, profile, profile.
Also for this kind of problem, be very aware of differences between runs that are caused by OS and hardware caches: profile a bunch of runs rather than just one after each change -- and if possible, use tests that will either likely always hit caches (if you're trying to measure best-case performance) or tests that will likely miss (if you're trying to measure worst-case performance).
Regarding C++ file IO (fstream and such), just be aware that they're larger, more complicated beasts. They tend to include things such as locale management, automatic buffering, and the like -- as well as being less prone to particular types of coding mistakes.
If you're doing something pretty simple (like what you describe here), I tend to find C++ library stuff gets in the way. (Use a debugger and "step instruction" through a stringstream method versus some C string functions some time, you'll get a good feel for this quickly.)
It all depends on whether you're going to want or need that additional functionality or safety in the future.
Finally, the obligatory "don't sweat the small stuff". Only spend time on optimizing here if it's really important. Otherwise trust the libraries and OS to do the right thing for you most of the time -- if you get too far into micro-optimizations you'll find you're shooting yourself in the foot later. This is not to discourage you from thinking in terms of "should I read the whole file in ahead of time, will that break future use cases" -- because that's macro, rather than micro.
But generally speaking if you're not doing this kind of "make it faster" investigation for a good reason -- i.e. "need this app to perform better now that I've written it, and this code shows up as slow in profiler", or "doing this for fun so I can better understand the system" -- well, spend your time elsewhere first. =)
One method, provided you are going to scan through the file serially, is to use 2 buffers of a decent enough size (16K is the optimal size for SSDs and 4K for HDDs IIRC. But 16K should suffice for both). You start off by performing an asynchronous load (In windows look up Overlapped I/O and on Unix/OSX use O_NONBLOCK) of the first 16K into buffer 0 and then start another load into buffer 1 of bytes 16K to 32K. When your read position hits 16K, swap the buffers (so you are now reading from buffer 1 instead) wait for any further loads to complete into buffer 1 and perform an asynchronous load of bytes 32K to 48K into buffer 0 and so on. This way, you have far less chance of ever having to wait for a load to complete as it should be happening while you are processing the previous 16K.
I moved over to a scheme like this in my XML parser having been using fopen and fgetc previously and the speedup was huge. Loading in a 15 meg XML file and processing it reduced from minutes to seconds. Of course, Your milage may vary.
use fgets to read one line at a time. C++ file I/O are basically wrapper code with some compiler optimization tucked inside ( and many unwanted functionality ). Unless you are reading millions of lines of code and measuring time, it does not matter.

Unexpected "padding" in a Fortran unformatted file

I don't understand the format of unformatted files in Fortran.
For example:
open (3,file=filename,form="unformatted",access="sequential")
write(3) matrix(i,:)
outputs a column of a matrix into a file. I've discovered that it pads the file with 4 bytes on either end, however I don't really understand why, or how to control this behavior. Is there a way to remove the padding?
For unformated IO, Fortran compilers typically write the length of the record at the beginning and end of the record. Most but not all compilers use four bytes. This aids in reading records, e.g., length at the end assists with a backspace operation. You can suppress this with the new Stream IO mode of Fortran 2003, which was added for compatibility with other languages. Use access='stream' in your open statement.
I never used sequential access with unformatted output for this exact reason. However it depends on the application and sometimes it is convenient to have a record length indicator (especially for unstructured data). As suggested by steabert in Looking at binary output from fortran on gnuplot, you can avoid this by using keyword argument ACCESS = 'DIRECT', in which case you need to specify record length. This method is convenient for efficient storage of large multi-dimensional structured data (constant record length). Following example writes an unformatted file whose size equals the size of the array:
REAL(KIND=4),DIMENSION(10) :: a = 3.141
INTEGER :: reclen
INQUIRE(iolength=reclen)a
OPEN(UNIT=10,FILE='direct.out',FORM='UNFORMATTED',&
ACCESS='DIRECT',RECL=reclen)
WRITE(UNIT=10,REC=1)a
CLOSE(UNIT=10)
END
Note that this is not the ideal aproach in sense of portability. In an unformatted file written with direct access, there is no information about the size of each element. A readme text file that describes the data size does the job fine for me, and I prefer this method instead of padding in sequential mode.
Fortran IO is record based, not stream based. Every time you write something through write() you are not only writing the data, but also beginning and end markers for that record. Both record markers are the size of that record. This is the reason why writing a bunch of reals in a single write (one record: one begin marker, the bunch of reals, one end marker) has a different size with respect to writing each real in a separate write (multiple records, each of one begin marker, one real, and one end marker). This is extremely important if you are writing down large matrices, as you could balloon the occupation if improperly written.
Fortran Unformatted IO I am quite familiar with differing outputs using the Intel and Gnu compilers. Fortunately my vast experience dating back to 1970's IBM's allowed me to decode things. Gnu pads records with 4 byte integer counters giving the record length. Intel uses a 1 byte counter and a number of embedded coding values to signify a continuation record or the end of a count. One can still have very long record lengths even though only 1 byte is used.
I have software compiled by the Gnu compiler that I had to modify so it could read an unformatted file generated by either compiler, so it has to detect which format it finds. Reading an unformatted file generated by the Intel compiler (which follows the "old' IBM days) takes "forever" using Gnu's fgetc or opening the file in stream mode. Converting the file to what Gnu expects results in a factor of up to 100 times faster. It depends on your file size if you want to bother with detection and conversion or not. I reduced my program startup time (which opens a large unformatted file) from 5 minutes down to 10 seconds. I had to add in options to reconvert back again if the user wants to take the file back to an Intel compiled program. It's all a pain, but there you go.