How to decompress less than original size with Lz4 library?

I'm using LZ4 library and when decompressing data with:
int LZ4_decompress_fast_continue (void* LZ4_streamDecode, const char* source, char* dest, int originalSize);
I need only first n bytes of the originally encoded N bytes, where n < N. So in order to improve the performance, it makes sense to decompress only a part of the original buffer.
I wonder if I can pass n instead of N to the originalSize argument of the function?
My initial test showed that it's not possible (I got incorrectly decompressed data). But maybe there is a way, for example if n is a multiple of some CHUNK_SIZE? All N original bytes were compressed with a single call to a compress function.

LZ4_decompress_safe_continue() and LZ4_decompress_fast_continue() can only decode full blocks. They consider a partial block as an error, and report it as such. They also consider that if there is not enough room to decompress a full block, it's also an error.
The functionality you are looking for doesn't exist yet. But there is a close cousin that might help.
LZ4_decompress_safe_partial() can decode a part of a block.
Note that, in contrast with _continue() variants, it only works on independent blocks.
Note also that the compressed block must nonetheless be complete, and the output buffer must nonetheless have enough space to decode the entire block. So the only advantage provided by this function is speed : if you want only the first 10 bytes, it will stop as soon as it has generated enough bytes.
"as soon as" doesn't mean "exactly at 10". It could be much later, and in the worst case, it could be after decoding the entire block. That's because the internal decoding engine is still the same : it decodes entire sequences, and doesn't "break them" in the middle, for speed considerations.
If you need to extract less bytes than a full block in order to save some memory, I'm afraid there is no solution yet. Report it as a feature request to upstream.

This seems to have been implemented in lz4 1.8.3.

Related

Parallel bzip2 decompression scan may fail?

Bzip2 byte stream compression in parallel can be easily done with a FIFO queue where every chunk is processed as parallel task and streamed into a file.
Going the other way, parallel decompression is not so easy, because blocks are bit-aligned and the exact bit length of a block is known only after it has been decompressed.
As far as I can see, parallel decompression implementations use the magic numbers for block start and stream end and perform a bit-scan. Isn't there a small chance that one of the streams contains such a magic value by coincidence?
Possible block validations:
4 Bytes CRC
6 Bytes "compressed magic"
6 Bytes "end of stream magic"
some bit combinations for the huffman trees are not allowed
max. x Bytes of huffman stream (range to search for next magic)
Per file:
4 Bytes File CRC
padding at the end
I could implement such a scan by just bit-shifting through the stream until I find a magic number. But then, when I read block N and it fails, I should (or maybe should not) take into account that it was a false positive. For a parallel implementation I could then stop all tasks for blocks N, N+1, N+2, ..., try to find the next signature and go on. That makes everything very complicated, and I don't know if it's worth the effort. I guess maybe not, but is there a chance that a parallel bzip2 implementation fails?
I'm wondering why a file format uses magic numbers as markers but doesn't include jump hints. I guess the magic numbers are important for filesystem recovery, but why can't a block contain, e.g., 16 bits telling how far to jump to the next block?

Yes, the source code you linked notes that the magic 48-bit value can show up in compressed data by chance. It also notes the probability, around 10^-14 (actually 2^-48, closer to 3.55x10^-15). That probability applies at every bit position tested, so on average one will occur in every 2^48 bits, i.e. 2^45 bytes or 32 terabytes, of compressed data. That's about one month of run time on one core on my machine. Not all that long. In a production environment, you should assume that it will happen. Because it will.
Also as noted in the source you linked, due to the possibility of a false positive, you need to then validate the remainder of the block. You would not stop the subsequent possible block processing, since it is extremely likely that they are all valid blocks. Just validate all and keep the validated ones. Verify when combining that the valid blocks exactly covered the input, with no overlaps. A properly implemented parallel bzip2 decompressor will always work on valid bzip2 streams.
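A minimal sketch of the scanning step described above, assuming the whole compressed stream is already in memory. The name find_magic_bits is made up for illustration; 0x314159265359 is the bzip2 block-header magic, and bits are consumed most-significant-first within each byte, which is how bzip2 packs them. Every offset returned is only a candidate: each one still has to be validated by decoding the block and checking its CRC.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// bzip2 block-header magic (48 bits).
const std::uint64_t kBlockMagic = 0x314159265359ULL;

// Returns every bit offset in `data` at which the 48-bit pattern `magic`
// starts. Offsets are counted from the most significant bit of data[0].
// Hypothetical helper: hits are candidates only and may be false positives.
std::vector<std::size_t> find_magic_bits(const std::vector<std::uint8_t>& data,
                                         std::uint64_t magic) {
    const std::uint64_t mask = (1ULL << 48) - 1;
    std::vector<std::size_t> hits;
    std::uint64_t window = 0;  // the last 48 bits seen
    const std::size_t nbits = data.size() * 8;
    for (std::size_t bit = 0; bit < nbits; ++bit) {
        const unsigned b = (data[bit / 8] >> (7 - bit % 8)) & 1u;
        window = ((window << 1) | b) & mask;
        // Only compare once a full 48-bit window has been read.
        if (bit >= 47 && window == magic) hits.push_back(bit - 47);
    }
    return hits;
}
```

Each worker can then be handed the span between two consecutive candidates. When combining results, keep only the blocks that validated and check that they cover the input exactly, with no gaps or overlaps.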
It would need to be more than 16 bits, but yes, in principle a block could have contained the offset to the next block, since it already contains a CRC at the start of the block. Julian did consider that in the revision of bzip2, but decided against it:
bzip2-1.0.X, 0.9.5 and 0.9.0 use exactly the same file format as the
original version, bzip2-0.1. This decision was made in the interests
of stability. Creating yet another incompatible compressed file format
would create further confusion and disruption for users.
...
The compressed file format was never designed to be handled by a
library, and I have had to jump though some hoops to produce an
efficient implementation of decompression. It's a bit hairy. Try
passing decompress.c through the C preprocessor and you'll see what I
mean. Much of this complexity could have been avoided if the
compressed size of each block of data was recorded in the data stream.

Interleaving bzip2 and non-bzip2 data

I am looking at making a file format that interleaves two types of chunks of raw bytes.
One chunk will contain a block of bzip2-compressed data, which has a header containing the usual bzip2 magic number (BZh9).
The second chunk will consist of the other data of interest, which has a header containing a different magic number (TBD).
The two magic numbers would be used for seeking, identifying and processing the two data block types differently.
My question is: Is there a magic number I can pick for the second block type that would be very unlikely (or better, impossible) to be found inside a bzip2-compressed block of bytes?
In other words, are there particular bytes that bzip2 excludes or would be probabilistically unlikely to use when compressing, within some statistical threshold, which I could use for a header for another data type in the same file?
One option is that, when I find header bytes for a second block type, I would simply try to process data in the second block type, and if that processing fails, then I assume I am accidentally inside a compressed bzip2 block. But I'd like to know if there is the possibility that there are bytes that would not be found in a bzip2 block, or would not be likely to be found.

No. bzip2 compressed data can contain any pair of bytes, essentially all with equal probability. All you could do would be to define a longer series of bytes as the signature, to reduce the probability that that series accidentally appears in the compressed data. But it still could.
The bzip2 format is self-terminating, so if you're willing to take the time to decode the bzip2 data, you can always find where the next thing is.
To answer the question in a comment, the entire bzip2 stream necessarily terminates on a byte boundary. The last byte may have 0 to 7 bits of zero pad. You can search backwards from the start of your second stream component to look for the bzip2 end marker 0x177245385090 (first 12 decimal digits of the square root of pi), which can start at any bit in a specific byte. It would be 80 to 87 bits back.

Proper, efficient file reading

I'd like to read and process (e.g. print) entries from the first row of a CSV file one at a time. I assume Unix-style \n newlines, that no entry is longer than 255 chars and (for now) that there's a newline before EOF. This is meant to be a more efficient alternative to fgets() followed by strtok().
#include <stdio.h>
#include <string.h>
int main() {
    int i;
    char ch, buf[256];
    FILE *fp = fopen("test.csv", "r");
    for (;;) {
        for (i = 0; ; i++) {
            ch = fgetc(fp);
            if (ch == ',') {
                buf[i] = '\0';
                puts(buf);
                break;
            } else if (ch == '\n') {
                buf[i] = '\0';
                puts(buf);
                fclose(fp);
                return 0;
            } else buf[i] = ch;
        }
    }
}
Is this method as efficient and correct as possible?
What is the best way to test for EOF and file reading errors using this method? (Possibilities: testing against the character macro EOF, feof(), ferror(), etc.).
Can I perform the same task using C++ file I/O with no loss of efficiency?

What is most efficient is going to depend a lot on the operating system, standard libraries (e.g. libc), and even the hardware you are running on. This makes it nearly impossible to tell you what's "most efficient".
That having been said, there are a few things you could try:
Use mmap() or a local operating system equivalent (Windows has CreateFileMapping / OpenFileMapping / MapViewOfFile, and probably others). Then you don't do explicit file reads: you simply access the file as if it were already in memory, and anything that isn't there will be faulted in by the page fault mechanism.
Read the entire file into a buffer manually, then work on that buffer. The fewer times you call into file read functions, the fewer function-call overheads you take, and likely also fewer application/OS domain switches. Obviously this uses more memory, but may very well be worth it.
Use a more optimal string scanner for your problem and platform. Going character-by-character yourself is almost never as fast as relying on something existing that's close to your problem domain. For example, you can bet that strchr and memchr are probably better-optimized than most code you can roll yourself, doing things like reading entire cachelines or words at once, scanning using better algorithms for this kind of search, etc. For more complicated cases, you might consider a full regular expression engine that could compile your regex to something fast for your complicated case.
Avoid copying your string around. It may be helpful to think in terms of "find delimiters" and then "output between delimiters". You could for example use strchr to find the next character of interest, and then fwrite or something to write to stdout directly from your input buffer. Then you're keeping most of your work in a few local registers, rather than using a stack or heap buf.
When in doubt, though, try a few possibilities and profile, profile, profile.
Also for this kind of problem, be very aware of differences between runs that are caused by OS and hardware caches: profile a bunch of runs rather than just one after each change -- and if possible, use tests that will either likely always hit caches (if you're trying to measure best-case performance) or tests that will likely miss (if you're trying to measure worst-case performance).
Regarding C++ file IO (fstream and such), just be aware that they're larger, more complicated beasts. They tend to include things such as locale management, automatic buffering, and the like -- as well as being less prone to particular types of coding mistakes.
If you're doing something pretty simple (like what you describe here), I tend to find C++ library stuff gets in the way. (Use a debugger and "step instruction" through a stringstream method versus some C string functions some time, you'll get a good feel for this quickly.)
It all depends on whether you're going to want or need that additional functionality or safety in the future.
Finally, the obligatory "don't sweat the small stuff". Only spend time on optimizing here if it's really important. Otherwise trust the libraries and OS to do the right thing for you most of the time -- if you get too far into micro-optimizations you'll find you're shooting yourself in the foot later. This is not to discourage you from thinking in terms of "should I read the whole file in ahead of time, will that break future use cases" -- because that's macro, rather than micro.
But generally speaking if you're not doing this kind of "make it faster" investigation for a good reason -- i.e. "need this app to perform better now that I've written it, and this code shows up as slow in profiler", or "doing this for fun so I can better understand the system" -- well, spend your time elsewhere first. =)

One method, provided you are going to scan through the file serially, is to use two buffers of a decent size (16K is the optimal size for SSDs and 4K for HDDs, IIRC, but 16K should suffice for both). Start by issuing an asynchronous load (on Windows, look up Overlapped I/O; on Unix/OS X, look at POSIX AIO such as aio_read(), since O_NONBLOCK generally has no effect on regular files) of the first 16K into buffer 0, then start another load of bytes 16K to 32K into buffer 1. When your read position hits 16K, swap the buffers (so you are now reading from buffer 1), wait for any outstanding load into buffer 1 to complete, and issue an asynchronous load of bytes 32K to 48K into buffer 0, and so on. This way you are far less likely to have to wait for a load to complete, because it should be happening while you process the previous 16K.
I moved over to a scheme like this in my XML parser, having previously used fopen and fgetc, and the speedup was huge. Loading and processing a 15 MB XML file went from minutes to seconds. Of course, your mileage may vary.

Use fgets to read one line at a time. C++ file I/O is basically wrapper code with some compiler optimization tucked inside (and a lot of unwanted functionality). Unless you are reading millions of lines and measuring the time, it does not matter.
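On the question of how to test for EOF and read errors when reading character by character: fgetc returns an int, and storing the result in a char (as the code in the question does) makes EOF indistinguishable from a valid byte or, where char is unsigned, impossible to detect at all. A small sketch of one way to handle it, with read_first_row as a made-up name:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Reads the comma-separated entries of the first line of fp into out.
// Returns false only if a read error occurred. Reaching end-of-file
// before a '\n' simply ends the row. Illustrative helper only.
bool read_first_row(std::FILE* fp, std::vector<std::string>& out) {
    std::string field;
    int ch;  // int, not char: EOF must differ from every valid byte value
    while ((ch = std::fgetc(fp)) != EOF && ch != '\n') {
        if (ch == ',') {
            out.push_back(field);
            field.clear();
        } else {
            field.push_back(static_cast<char>(ch));
        }
    }
    // EOF alone does not say why reading stopped; ferror() does.
    if (ch == EOF && std::ferror(fp)) return false;
    out.push_back(field);
    return true;
}
```

Comparing against EOF first and calling ferror() (or feof()) afterwards is the usual pattern. Using a std::string per field also removes the 255-character assumption, at the cost of an allocation per entry.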

Writing binary files C++, way to force something to be at byte 18?

I'm currently trying to write a .bmp file in C++ and for the most part it works, there is however, just one issue. When I start trying to save images with different widths and heights everything goes askew and I'm struggling to solve it, so is there any way to force something to write to a specific byte (adding padding in between it and the last thing written)?

There are several sort of obvious answers, such as keeping your data in memory in a buffer, then putting the desired value in as bufr[offset]=mydata;. I presume you want something a little fancier than that, because you are, for example, doing this in a streaming sort of application where you can't have the whole object in memory at the same time.
In that case, what you're looking for is the magic offered by fseek(3) and ftell(3) (see the man pages). fseek positions the file at a specific offset; ftell gets the file's current offset. If it's a constant offset of 18, then you simply finish up with the file and do
fseek(fp, 18L, SEEK_SET);
where fp is the file pointer, SEEK_SET is a constant declared in stdio.h that makes the offset count from the start of the file (SEEK_CUR would instead move 18 bytes forward from the current position), and 18 is the number 18.
Update
By the way, this is based on the system call lseek(2). Something that confuses people (read "me"; I never remember this until I have been searching) is that there is no matching "ltell(2)" system call. Instead, to get the current file offset, you use
off_t offset;
offset = lseek(fd, 0L, SEEK_CUR);
where fd is a file descriptor (not a FILE pointer), because lseek returns the offset after its operation. The example code above gives us the offset after moving 0 bytes from the current offset, which is of course the current offset.
Update
Aha, C++. For C++, the stream classes have member functions for seek and tell (seekg/tellg for input, seekp/tellp for output). See the fstream documentation.

Count how many bytes have been written. Write zeroes until the count hits 18. Then resume writing your real data.

If you are on Windows, it all comes down to writing predefined structures: see "Bitmap Storage".
There is also an example that shows how they should be filled: "Storing an Image".
If you are writing code that is not just for Windows, you can mimic these structs and follow the guide.

Marshall multiple protobuf to file

Background:
I'm using Google's protobuf, and I would like to read/write several gigabytes of protobuf marshalled data to a file using C++. As it's recommended to keep the size of each protobuf object under 1MB, I figured a binary stream (illustrated below) written to a file would work. Each offset contains the number of bytes to the next offset until the end of the file is reached. This way, each protobuf can stay under 1MB, and I can glob them together to my heart's content.
[int32 offset]
[protobuf blob 1]
[int32 offset]
[protobuf blob 2]
...
[eof]
I have an implementation that works on Github:
src/glob.hpp
src/glob.cpp
test/readglob.cpp
test/writeglob.cpp
But I feel I have written some poor code, and would appreciate some advice on how to improve it. Thus,
Questions:
I'm using reinterpret_cast<char*> to read/write the 32 bit integers to and from the binary fstream. Since I'm using protobuf, I'm making the assumption that all machines are little-endian. I also assert that an int is indeed 4 bytes. Is there a better way to read/write a 32 bit integer to a binary fstream given these two limiting assumptions?
In reading from fstream, I create a temporary fixed-length char buffer, so that I can then pass this fixed-length buffer to the protobuf library to decode using ParseFromArray, as ParseFromIstream will consume the entire stream. I'd really prefer just to tell the library to read at most the next N bytes from fstream, but there doesn't seem to be that functionality in protobuf. What would be the most idiomatic way to pass a function at most N bytes of an fstream? Or is my design sufficiently upside down that I should consider a different approach entirely?
Edit:
@codymanix: I'm casting to char since istream::read requires a char array, if I'm not mistaken. I'm also not using the extraction operator >> since I read it was poor form to use with binary streams. Or is this last piece of advice bogus?
@Martin York: Removed new/delete in favor of std::vector<char>. glob.cpp is now updated. Thanks!

Don't use new[]/delete[].
Instead, use a std::vector, as deallocation is guaranteed in the event of exceptions.
Don't assume that reading will return all the bytes you requested.
Check with gcount() to make sure that you got what you asked for.
Rather than have Glob implement the code for both input and output depending on a switch in the constructor, implement two specialized classes, like ifstream/ofstream. This will simplify both the interface and the usage.
