Interleaving bzip2 and non-bzip2 data - compression

I am looking at making a file format that interleaves two types of chunks of raw bytes.
One chunk will contain a block of bzip2-compressed data, which has a header containing the usual bzip2 magic number (BZh9).
The second chunk will consist of the other data of interest, which has a header containing a different magic number (TBD).
The two magic numbers would be used for seeking, identifying and processing the two data block types differently.
My question is: Is there a magic number I can pick for the second block type, which would very unlikely (or better, impossible) to be found inside a bzip2-compressed block of bytes?
In other words, are there particular bytes that bzip2 excludes or would be probabilistically unlikely to use when compressing, within some statistical threshold, which I could use for a header for another data type in the same file?
One option is that, when I find header bytes for a second block type, I would simply try to process data in the second block type, and if that processing fails, then I assume I am accidentally inside a compressed bzip2 block. But I'd like to know if there is the possibility that there are bytes that would not be found in a bzip2 block, or would not be likely to be found.

No. bzip2 compressed data can contain any pair of bytes, essentially all with equal probability. All you could do would be to define a longer series of bytes as the signature, to reduce the probability that that series accidentally appears in the compressed data. But it still could.
The bzip2 format is self-terminating, so if you're willing to take the time to decode the bzip2 data, you can always find where the next thing is.
To answer the question in a comment, the entire bzip2 stream necessarily terminates on a byte boundary. The last byte may have 0 to 7 bits of zero pad. You can search backwards from the start of your second stream component to look for the bzip2 end marker 0x177245385090 (first 12 decimal digits of the square root of pi), which can start at any bit in a specific byte. It would be 80 to 87 bits back.

Related

Parallel bzip2 decompression scan may fail?

Bzip2 byte stream compression in parallel can be easily done with a FIFO queue where every chunk is processed as parallel task and streamed into a file.
The other way round parallel decompression is not so easy, because everything is bit-aligned and the exact bit-length of a block is known after it's decompressed.
As far as I can see, parallel decompress implementations use magic numbers for block start and stream end and perform a bit-scan. Isn't there a small chance that one of the streams contain such a magic value by coincidence?
Possible block validations:
4 Bytes CRC
6 Bytes "compressed magic"
6 Bytes "end of stream magic"
some bit combinations for the huffman trees are not allowed
max. x Bytes of huffman stream (range to search for next magic)
Per file:
4 Bytes File CRC
padding at the end
I could implement such a scan by just bit-shifting from the stream until I have a magic. But then when I read block N and it fails, I should (maybe also not) take into account, that it was a false positive. For a parallel implementation I can then stop all tasks for blocks N, N+1, N+2, .. , then try to find the next signature and go one. That makes everything very complicated and I don't know if it's worth the effort? I guess maybe not, but is there a chance that a parallel bzip2 implementation fails?
I'm wondering why a file format uses magic numbers as markers, but doesn't include jump hints. I guess the magic numbers are important for filesystem recovery, but anyway, why can't a block contain e.g 16bits for telling how far to jump to the next block.
Yes, the source code you linked notes that the magic 48-bit value can show up in compressed data by chance. It also notes the probability, around 10-14 (actually 2-48, closer to 3.55x10-15). That probability is at every sample, so on average one will occur in every 32 terabytes of compressed data. That's about one month of run time on one core on my machine. Not all that long. In a production environment, you should assume that it will happen. Because it will.
Also as noted in the source you linked, due to the possibility of a false positive, you need to then validate the remainder of the block. You would not stop the subsequent possible block processing, since it is extremely likely that they are all valid blocks. Just validate all and keep the validated ones. Verify when combining that the valid blocks exactly covered the input, with no overlaps. A properly implemented parallel bzip2 decompressor will always work on valid bzip2 streams.
It would need to be more than 16 bits, but yes, in principle a block could have contained the offset to the next block, since it already contains a CRC at the start of the block. Julian did consider that in the revision of bzip2, but decided against it:
bzip2-1.0.X, 0.9.5 and 0.9.0 use exactly the same file format as the
original version, bzip2-0.1. This decision was made in the interests
of stability. Creating yet another incompatible compressed file format
would create further confusion and disruption for users.
...
The compressed file format was never designed to be handled by a
library, and I have had to jump though some hoops to produce an
efficient implementation of decompression. It's a bit hairy. Try
passing decompress.c through the C preprocessor and you'll see what I
mean. Much of this complexity could have been avoided if the
compressed size of each block of data was recorded in the data stream.

Efficient ways to filter unwanted data from a buffer in c++

Lets assume, data is stored in a character buffer in below format
===========================================================================
|length| message1| length| message2| length| message3|...|length |messagen|
===========================================================================
Length - indicates size of following message
From this character buffer, suppose only message2 is unwanted and rest all are relevant,
How efficiently, message2 can be removed so that rest all data in buffer can be used happily?
I have come accross, in-place algorithm, where we can shift the messages in the buffer itself without any additional copy
But with this also, there is an overhead of shifting (n-2) messages because message2 is irrelevant
Is there any better approach/ solution to it to do in c++?
Let me add more details-
Requirement here is need to remove/filter irrelevant data from the buffer and then pass it as input to another function for further processing
irrelevant data can come at any position in the character buffer. Just for example, referred as message 2
You don't say how many bits are in your length field, but assuming you can spare an extra bit to make the length-values signed rather than unsigned, I'd be tempted to adopt a convention that says "if a length-header has a negative value, that indicates that the its message-body is invalid and should be ignored".
Once you've adopted that convention, then flagging message 2 as invalid is just a matter of overwriting its length-header with the negation of its current value.
Of course, the code that is reading the buffer later on has to also follow the convention, so if e.g. if it sees a length-header with value of -57 then it should just skip ahead by 57 bytes without processing them.

How to decompress less than original size with Lz4 library?

I'm using LZ4 library and when decompressing data with:
int LZ4_decompress_fast_continue (void* LZ4_streamDecode, const char* source, char* dest, int originalSize);
I need only first n bytes of the originally encoded N bytes, where n < N. So in order to improve the performance, it makes sense to decompress only a part of the original buffer.
I wonder if I can pass n instead of N to the originalSize argument of the function?
My initial test showed, that it's not possible (I got incorrectly decompressed data). Though maybe there is a way, for example if n is a multiple of some CHUNK_SIZE? All original N bytes were compressed with 1 call of a compress function.
LZ4_decompress_safe_continue() and LZ4_decompress_fast_continue() can only decode full blocks. They consider a partial block as an error, and report it as such. They also consider that if there is not enough room to decompress a full block, it's also an error.
The functionality you are looking for doesn't exist yet. But there is a close cousin that might help.
LZ4_decompress_safe_partial() can decode a part of a block.
Note that, in contrast with _continue() variants, it only works on independent blocks.
Note also that the compressed block must nonetheless be complete, and the output buffer must nonetheless have enough space to decode the entire block. So the only advantage provided by this function is speed : if you want only the first 10 bytes, it will stop as soon as it has generated enough bytes.
"as soon as" doesn't mean "exactly at 10". It could be much later, and in the worst case, it could be after decoding the entire block. That's because the internal decoding engine is still the same : it decodes entire sequences, and doesn't "break them" in the middle, for speed considerations.
If you need to extract less bytes than a full block in order to save some memory, I'm afraid there is no solution yet. Report it as a feature request to upstream.
This seems to have been implemented in lz4 1.8.3.

Can a file end with a 0x80 char?

I'm implementing my own version of blowfish encoder/decoder. I use a standard padding of 0x80 if necessary.
My question is if I need to add a padding chars even if I don't need it, because in the case of a file that ends naturally with 0x80, in the deconding part I will remove this character, but in this case it is a wrong action, since te 0x80 is part of the file itself.
This of course can be solved by adding a final char even if the total number of characters is a multiple of the encoding block (64bit in this case). I can implement this countermeasure, but first I'd prefer to know if I really need it.
Natural consequence is thinking if this type of char is chosen because never happens in a file (so the wrong situation above never happens), but I'm not sure at all.
Thanks! and sorry for the dummy question..
On Linux and other filesystem a file can contain any sequence of any bytes. So ideally you can not depended on any particular byte to decide file end. (Although EOF is there..!!)
What i am suggesting is the Most file formats are using.
You can have specific 4-5 magic bytes Header for your file format. and followed by you can have size of rest of bytes. So after some bytes your last byte would be there.
Edit:
In above suggestion In encoder you need to update size of file after adding any new data in files.
If you do not want that then you can encode your data in perticular chunk of data and then encode them packet by packet. Your file will be number of some packet. such things are used in NAL units
Blowfish is a block cipher. It always takes 64 bit input and outputs 64 bit output. If you want to encrypt a stream that is not a multiple of 64 bit long you will need to add some padding bytes. When you decrypt the encrypted stream you always get a multiple of 64 bit. But you have no information if the encrypted stream contained 'real' data or padding bytes. You need to keep track of that yourself. A simple approach would be to store the set of 'data length' and 'encrypted stream'. Another approach would be to prepend the clear text stream with a data length value, for example a 64 bit unsigned integer. Then after decrypting the encrypted stream you will have that length as the first value and then you know how many bytes of the last block are real data and how many are just padding.
And regarding your question about what bytes can be at the end of a file: any. You can have files with arbitrary content. Each byte in the file can be of any value, there is no restriction.
Regular binary file can contain any bytes sequence, so file can end with 0x80, with NULL or any other.
If you are talking about some specific standard, so it depends.. However I think that there is no such file type that could not contain some specific character in the end, I know about file types that ignores as many last characters as not needed (because header determines size) so you should do so, but never heard about illegal file data (except cracked).
So as mentioned use header, reserve for example 8 bytes that determines size. That is easy solution.
Also, before asking such question, you should ask yourself, why file should end with some special character?
The answer is Yes. On every operating system in current use, a file can end with any possible sequence of bytes. In fact, you should generate such a file to test your implementation.
In the general case you cannot recognise trailing padding characters or remove them reliably without knowing the length of the file. Therefore encoding the length of the file must be part of your cryptographic protocol.
Simply put the length of the file at the beginning and encrypt the whole thing, including any padding bytes you like (random is probably best). Once unencrypted you will have the file length to tell you where to truncate.

Most efficient to read file into separate variables using fstream

I have tons of files which look a little like:
12-3-125-BINARYDATA
What would be the most efficient way to save the 12, 3 and 125 as separate integer variables, and the BINARYDATA as a char-vector?
I'd really like to use fstream, but I don't exactly know how to (got it working with std::strings, but the BINARYDATA part was all messed up).
The most efficient method for reading data is to read many "chunks", or records into memory using the fewest I/O function calls, then parsing the data in memory.
For example, reading 5 records with one fread call is more efficient than 5 calls to fread to read in a record. Accessing memory is always faster than accessing external data such as files.
Some platforms have the ability to memory-map a file. This may be more efficient than reading the using I/O functions. Profiling will determine the most efficient.
Fixed length records are always more efficient than variable length records. Variable length records involve either reading until a fixed size is read or reading until a terminal (sentinel) value is found. For example, a text line is a variable record and must be read one byte at a time until the terminating End-Of-Line marker is found. Buffering may help in this case.
What do you mean by Binary Data? Is it a 010101000 char by char or "real" binary data? If they are real "binary data", just read the file as binary file. First read 2 bytes for the first int, next 1 bytes for -, 2 bytes for 3,and so on, until you read the first pos of binary data, just get file length and read all of it.