Problem: sometimes we have to interleave multiple streams into one,
in which case it's necessary to provide a way to identify block
boundaries within a stream. What kind of format would be good
for such a task?
(All processing has to be purely sequential and i/o operations are
blockwise and aligned.)
[1] From the decoding side, the best way is to have length prefixes
for blocks. But on the encoding side, it requires either random access
to the output file (seek to the stream start and write the header), or
being able to cache the whole streams, which is, in general, impossible.
[2] Alternatively, we can add length headers (+ some flags) to
blocks of cacheable size. It's surely a solution, but handling is much more
complex than [1], especially at encoding (presuming that i/o operations
are done with aligned fixed-size blocks).
Well, one possible implementation is to write a 0 byte into the buffer,
then stream data until it's filled. So prefix byte = 0 would mean that
there are bufsize-1 bytes of stream data next, and !=0 would mean that
there's less... in which case we would be able to insert another prefix
byte if end-of-stream is reached. This would only work with bufsize=32k
or so, because otherwise the block length would require 3+ bytes to store,
and there would be a problem with handling the case where end-of-stream
is reached when there's only one byte of free space left in the buffer.
(One solution to that would be storing 2-byte prefixes in each buffer
and adding a 3rd byte when necessary; another is to provide a 2-byte encoding
for some special block lengths like bufsize-2.)
Either way it's not so good, because even 1 extra byte per 64k would accumulate
to a noticeable number with large files (1526 bytes per 100M). Also, hardcoding
the block size into the format is bad.
[3] Escape prefix. E.g. EC 4B A7 00 = EC 4B A7, EC 4B A7 01 = end-of-stream.
Now this is really easy to encode, but decoding is pretty painful - it requires
a messy state machine even to extract single bytes.
But overall it adds the least overhead, so it seems that we still need to find
a good implementation for buffered decoding.
[4] Escape prefix with all the same bytes (e.g. FF FF FF). Much easier to check,
but runs of the same byte in the stream would produce a huge overhead (like 25%),
and that's not unlikely for any byte value chosen as the escape code.
[5] Escape postfix. Store the payload byte before the marker - then the decoder
just has to skip 1 byte before a masked marker, and 4 bytes for a control code.
So this basically introduces a fixed 4-byte delay for the decoder, while [3]
has a complex path where marker bytes have to be returned one by one.
Still, with [3] the encoder is much simpler (it just has to write an extra 0
when the marker matches), and this doesn't really simplify the buffer processing.
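For reference, here is a minimal sketch of that [3] encoder (my own code, not from the thread; marker EC 4B A7, with 00 = escaped marker and 01 = end-of-stream as above):

#include <stdio.h>

static const unsigned char MARK[3] = { 0xEC, 0x4B, 0xA7 };

// Copy `in` to `out`, escaping any literal marker and appending an end-of-stream code.
static void EncodeStream( FILE* in, FILE* out ) {
  int match = 0, c;
  while( (c = getc(in)) != EOF ) {
    putc( c, out );
    if( (unsigned char)c == MARK[match] ) match++;
    else match = ((unsigned char)c == MARK[0]);
    if( match == 3 ) { putc( 0x00, out ); match = 0; }  // marker occurred in the data -> escape it
  }
  putc( MARK[0], out ); putc( MARK[1], out ); putc( MARK[2], out );
  putc( 0x01, out );                                    // end-of-stream control code
}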
Update: Actually I'm pretty sure that [3] or [5] would be the option I'd use;
I only listed the other options in the hope of getting more alternatives (for example, it
would be ok if the redundancy were 1 bit per block on average). So the main question
atm is how to parse the stream for [3]... the current state machine looks like this:
#include <stdio.h>

static const char marker[] = "\xEC\x4B\xA7";

int saved_c;                 // byte read right after a partial marker match
int o_last = 0, c_last = 0;  // o_last: marker bytes matched so far (+10 while flushing them back)

int GetByte( FILE* f ) {
  int c;
Start:
  if( o_last>=10 ) {
    // flush the buffered marker bytes one by one, then the saved byte
    if( c_last>=(o_last-10) ) { c=saved_c; o_last=0; }
    else c=(unsigned char)marker[c_last++];
  } else {
    c = getc(f);
    if( o_last<3 ) {
      if( (char)c==marker[o_last] ) { o_last++; goto Start; } // another marker byte matched
      else if( o_last>0 ) { saved_c=c; c_last=0; o_last+=10; goto Start; } // partial match broken -> flush it (11,12)
      // else just return c
    } else {
      // full marker matched; c is the control byte
      if( c>0 ) { c=-1-c, o_last=0; printf( "c=%i\n", c ); } // control code, returned as a negative value
      else { saved_c=0xA7; c_last=0; o_last+=10-1; goto Start; } // 00 = escaped marker, return EC 4B A7 literally (12)
    }
  }
  return c;
}
and it's certainly ugly (and slow)
How about using blocks of fixed size, e.g. 1KB?
Each block would contain a byte (or 4 bytes) indicating which stream it is, followed by just data.
Benefits:
You don't have to pay attention to the data itself. Data cannot accidentally trigger any behaviour in your system (e.g. accidentally terminate the stream).
It does not require random file access when encoding. In particular, you don't store the length of the block, as it is fixed.
If data gets corrupted, the system can recover at the next block.
Drawbacks:
If you have to switch from stream to stream a lot, with each having only a few bytes of data, the blocks may be underutilised. Lots of bytes will be empty.
If the block size is too small (e.g. if you want to solve the above problem), you can get huge overhead from the header.
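For illustration, a minimal sketch of the fixed-size framing described above (1 KiB blocks, one stream-ID byte per block; the zero padding and the function name are my own assumptions, and a real format would still need a way to mark how much of the final block is payload):

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 1024

// Write one fixed-size block of stream `id` holding `len` payload bytes
// (len <= BLOCK_SIZE-1); the unused tail is zero-padded.
static int write_block( FILE* out, unsigned char id, const unsigned char* data, size_t len ) {
  unsigned char block[BLOCK_SIZE] = { 0 };
  block[0] = id;
  memcpy( block + 1, data, len );
  return fwrite( block, 1, BLOCK_SIZE, out ) == BLOCK_SIZE ? 0 : -1;
}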
From a standpoint of simplicity for sequential read and write, I would take solution 1, and just use a short buffer, limited to 256 bytes. Then you have a one-byte length followed by data. If a single stream has more than 256 consecutive bytes, you simply write out another length header and the data.
If you have any further requirements though you might have to do something more elaborate. For instance random-access reads probably require a magic number that can't be duplicated in the real data (or that's always escaped when it appears in the real data).
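A minimal sketch of that one-byte length-prefix framing (the 0 terminator is one possible convention, not something specified above):

#include <stdio.h>

// Copy one stream into `out` as a sequence of <length><payload> chunks,
// with length in 1..255; a length of 0 marks the end of this stream.
static int copy_chunked( FILE* in, FILE* out ) {
  unsigned char buf[255];
  size_t n;
  while( (n = fread( buf, 1, sizeof(buf), in )) > 0 ) {
    putc( (int)n, out );          // one-byte length prefix
    fwrite( buf, 1, n, out );     // chunk payload
  }
  putc( 0, out );                 // end-of-stream marker (assumed convention)
  return (ferror(in) || ferror(out)) ? -1 : 0;
}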
So I ended up using the ugly parser from the OP in my actual project
( http://encode.ru/threads/1231-lzma-recompressor )
But now it seems that the actual answer is to let data type handlers
terminate their streams however they want, which means escape coding
in the case of uncompressed data; but when some kind of entropy coding is
used in the stream, it's usually possible to incorporate a more efficient EOF code.
Related
Bzip2 byte stream compression in parallel can easily be done with a FIFO queue, where every chunk is processed as a parallel task and streamed into a file.
The other way round, parallel decompression is not so easy, because everything is bit-aligned and the exact bit length of a block is only known after it's decompressed.
As far as I can see, parallel decompression implementations use magic numbers for block start and stream end and perform a bit-scan. Isn't there a small chance that one of the streams contains such a magic value by coincidence?
Possible block validations:
4 Bytes CRC
6 Bytes "compressed magic"
6 Bytes "end of stream magic"
some bit combinations for the Huffman trees are not allowed
max. x bytes of Huffman stream (range to search for the next magic)
Per file:
4 Bytes File CRC
padding at the end
I could implement such a scan by just bit-shifting through the stream until I find a magic. But then, when I read block N and it fails, I should (or maybe should not) take into account that it was a false positive. For a parallel implementation I could then stop all tasks for blocks N, N+1, N+2, ..., then try to find the next signature and go on. That makes everything very complicated, and I don't know if it's worth the effort? I guess maybe not, but is there a chance that a parallel bzip2 implementation fails?
I'm also wondering why a file format uses magic numbers as markers but doesn't include jump hints. I guess the magic numbers are important for filesystem recovery, but still, why can't a block contain e.g. 16 bits telling how far to jump to the next block?
Yes, the source code you linked notes that the magic 48-bit value can show up in compressed data by chance. It also notes the probability, around 10^-14 (actually 2^-48, closer to 3.55x10^-15). That probability applies at every bit position, so on average a false positive will occur once in every 32 terabytes of compressed data. That's about one month of run time on one core on my machine. Not all that long. In a production environment, you should assume that it will happen. Because it will.
Also as noted in the source you linked, due to the possibility of a false positive, you need to then validate the remainder of the block. You would not stop the subsequent possible block processing, since it is extremely likely that they are all valid blocks. Just validate all and keep the validated ones. Verify when combining that the valid blocks exactly covered the input, with no overlaps. A properly implemented parallel bzip2 decompressor will always work on valid bzip2 streams.
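For reference, the candidate scan itself is just a 48-bit shift register compared against the two magic values from the bzip2 format; a minimal sketch (every hit is only a candidate and still needs the validation described above):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_MAGIC 0x314159265359ULL   // start of a compressed block (BCD digits of pi)
#define EOS_MAGIC   0x177245385090ULL   // end-of-stream footer (BCD digits of sqrt(pi))

// Report every bit offset where one of the 48-bit magics appears in `buf`.
static void scan_candidates( const uint8_t* buf, size_t nbytes ) {
  uint64_t window = 0;
  uint64_t bits = 0;                                     // bits consumed so far
  for( size_t i = 0; i < nbytes; i++ ) {
    for( int b = 7; b >= 0; b-- ) {                      // bzip2 streams are MSB-first
      window = ((window << 1) | ((buf[i] >> b) & 1)) & 0xFFFFFFFFFFFFULL;
      bits++;
      if( bits >= 48 && (window == BLOCK_MAGIC || window == EOS_MAGIC) )
        printf( "candidate at bit offset %llu\n", (unsigned long long)(bits - 48) );
    }
  }
}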
It would need to be more than 16 bits, but yes, in principle a block could have contained the offset to the next block, since it already contains a CRC at the start of the block. Julian did consider that in the revision of bzip2, but decided against it:
bzip2-1.0.X, 0.9.5 and 0.9.0 use exactly the same file format as the
original version, bzip2-0.1. This decision was made in the interests
of stability. Creating yet another incompatible compressed file format
would create further confusion and disruption for users.
...
The compressed file format was never designed to be handled by a
library, and I have had to jump though some hoops to produce an
efficient implementation of decompression. It's a bit hairy. Try
passing decompress.c through the C preprocessor and you'll see what I
mean. Much of this complexity could have been avoided if the
compressed size of each block of data was recorded in the data stream.
I have this code:
a = File.open("/dev/urandom")
b = a.read(1024)
a.close
puts b
I expected to get the first 1024 bytes from the /dev/urandom device file; instead I got an error saying that read accepts only a Slice and not an Integer.
So I tried to do it like that:
b = a.read(("a" * 1000).to_slice)
But then I got back "1000" in the output.
What is the right way to read x bytes from a file in Crystal?
What you did is not ideal, but it actually worked. IO#read(Slice(UInt8)) returns the number of bytes actually read, in case the file is smaller than what you requested or the data isn't available for some other reason. In other words, it's a partial read. So you get 1000 in b because the slice you passed was filled with 1000 bytes. There's IO#read_fully(Slice(UInt8)), which blocks until it has fulfilled as much of the request as possible, but even that can't guarantee it in every case.
A better approach looks like this:
File.open("/dev/urandom") do |io|
buffer = Bytes.new(1000) # Bytes is an alias for Slice(UInt8)
bytes_read = io.read(buffer)
# We truncate the slice to what we actually got back,
# /dev/urandom never blocks, so this isn't needed in this specific
# case, but good practice in general
buffer = buffer[0, bytes_read]
pp buffer
end
IO also provides various convenience functions for reading strings until specific tokens or up to a limit, in various encodings. Many types also implement the from_io interface, which allows you to easily read structured data.
I'm using the LZ4 library, and when decompressing data with:
int LZ4_decompress_fast_continue (void* LZ4_streamDecode, const char* source, char* dest, int originalSize);
I need only the first n bytes of the originally encoded N bytes, where n < N. So in order to improve performance, it makes sense to decompress only a part of the original buffer.
I wonder if I can pass n instead of N as the originalSize argument of the function?
My initial test showed that it's not possible (I got incorrectly decompressed data). Though maybe there is a way, for example if n is a multiple of some CHUNK_SIZE? All original N bytes were compressed with one call of a compress function.
LZ4_decompress_safe_continue() and LZ4_decompress_fast_continue() can only decode full blocks. They consider a partial block as an error, and report it as such. They also consider that if there is not enough room to decompress a full block, it's also an error.
The functionality you are looking for doesn't exist yet. But there is a close cousin that might help.
LZ4_decompress_safe_partial() can decode a part of a block.
Note that, in contrast with _continue() variants, it only works on independent blocks.
Note also that the compressed block must nonetheless be complete, and the output buffer must nonetheless have enough space to decode the entire block. So the only advantage provided by this function is speed: if you want only the first 10 bytes, it will stop as soon as it has generated enough bytes.
"As soon as" doesn't mean "exactly at 10". It could be much later, and in the worst case, it could be after decoding the entire block. That's because the internal decoding engine is still the same: it decodes entire sequences, and doesn't "break" them in the middle, for speed considerations.
If you need to extract fewer bytes than a full block in order to save some memory, I'm afraid there is no solution yet. Report it as a feature request to upstream.
This seems to have been implemented in lz4 1.8.3.
Is there some reasonably fast code out there which can help me quickly search a large bitmap (a few megabytes) for runs of contiguous zero or one bits?
By "reasonably fast" I mean something that can take advantage of the machine word size and compare entire words at once, instead of doing bit-by-bit analysis which is horrifically slow (such as one does with vector<bool>).
It's very useful for e.g. searching the bitmap of a volume for free space (for defragmentation, etc.).
Windows has an RTL_BITMAP data structure one can use along with its APIs.
But I needed the code for this some time ago, and so I wrote it here (warning, it's a little ugly):
https://gist.github.com/3206128
I have only partially tested it, so it might still have bugs (especially on reverse). But a recent version (only slightly different from this one) seemed to be usable for me, so it's worth a try.
The fundamental operation for the entire thing is being able to -- quickly -- find the length of a run of bits:
long long GetRunLength(
const void *const pBitmap, unsigned long long nBitmapBits,
long long startInclusive, long long endExclusive,
const bool reverse, /*out*/ bool *pBit);
Everything else should be easy to build upon this, given its versatility.
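I haven't dug through the gist, but assuming the return value is the length of the run starting at startInclusive and *pBit receives the run's bit value (my reading of the declaration above, not verified against the code), iterating over all runs would look roughly like this:

#include <stdio.h>
#include <stdbool.h>

// Declaration from the gist (see above); the usage below is hypothetical.
long long GetRunLength(
    const void *const pBitmap, unsigned long long nBitmapBits,
    long long startInclusive, long long endExclusive,
    const bool reverse, /*out*/ bool *pBit );

static void dump_runs( const void* pBitmap, unsigned long long nBitmapBits ) {
  long long pos = 0;
  while( pos < (long long)nBitmapBits ) {
    bool bit;
    long long len = GetRunLength( pBitmap, nBitmapBits, pos, (long long)nBitmapBits, false, &bit );
    if( len <= 0 ) break;                      // assumed: no more runs
    printf( "%lld bits of %d at offset %lld\n", len, (int)bit, pos );
    pos += len;
  }
}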
I tried to include some SSE code, but it didn't noticeably improve the performance. However, in general, the code is many times faster than doing bit-by-bit analysis, so I think it might be useful.
It should be easy to test if you can get a hold of vector<bool>'s buffer somehow -- and if you're on Visual C++, then there's a function I included which does that for you. If you find bugs, feel free to let me know.
I can't figure out how to do this well directly on memory words, so I've made up a quick solution which works on bytes; for convenience, let's sketch the algorithm for counting contiguous ones:
Construct two tables of size 256 where, for each number between 0 and 255, you store the number of consecutive 1s at the beginning and at the end of the byte. For example, for the number 167 (10100111 in binary), put 1 in the first table and 3 in the second table. Let's call the first table BBeg and the second table BEnd. Then, for each byte b, there are two cases: if it is 255, add 8 to the sum for your current contiguous run of ones, and you stay in a region of ones. Otherwise, you end a region with BBeg[b] additional bits and begin a new one with BEnd[b] bits.
Depending on what information you want, you can adapt this algorithm (which is the reason I don't put any code here: I don't know what output you want).
A flaw is that it does not count (small) contiguous runs of ones contained inside a single byte ...
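Still, as a concrete illustration, here is a minimal sketch of the two-table idea (my own code; it treats bits MSB-first within each byte, reports the longest run, and shares the flaw just mentioned):

#include <stdio.h>
#include <stddef.h>

static unsigned char BBeg[256], BEnd[256];    // leading / trailing ones per byte value

static void build_tables( void ) {
  for( int b = 0; b < 256; b++ ) {
    int lead = 0, trail = 0;
    while( lead  < 8 && (b & (0x80 >> lead)) )  lead++;
    while( trail < 8 && (b & (1 << trail)) )    trail++;
    BBeg[b] = (unsigned char)lead;
    BEnd[b] = (unsigned char)trail;
  }
}

static long long longest_ones_run( const unsigned char* buf, size_t n ) {
  long long cur = 0, best = 0;
  for( size_t i = 0; i < n; i++ ) {
    unsigned b = buf[i];
    if( b == 0xFF ) { cur += 8; }               // run continues through the whole byte
    else {
      cur += BBeg[b];                           // run ends inside this byte
      if( cur > best ) best = cur;
      cur = BEnd[b];                            // a new run starts at the end of the byte
    }
  }
  return cur > best ? cur : best;
}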
Besides this algorithm, a friend tells me that if it is for disk compression, just look for bytes different from 0 (empty disk area) and 255 (full disk area). It is a quick heuristic to build a map of which blocks you have to compress. Maybe it is beyond the scope of this topic ...
Sounds like this might be useful:
http://www.aggregate.org/MAGIC/#Population%20Count%20%28Ones%20Count%29
and
http://www.aggregate.org/MAGIC/#Leading%20Zero%20Count
You don't say if you want to do some sort of RLE or to simply count in-byte zero and one bits (e.g. 0b1001 should return 1x1 2x0 1x1).
A look-up table plus a SWAR algorithm for a fast check might give you that information easily.
A bit like this:
typedef unsigned char byte;
typedef unsigned int uint;

byte lut[0x10000] = { /* see below */ };
for (uint *word = words; word < words + bitmapSize; word++) {
    if (*word == 0 || *word == (uint)-1) // Fast bailout for all-0 / all-1 words
    {
        // Do what you want if all 0 or all 1
        continue;
    }
    byte hiVal = lut[*word >> 16], loVal = lut[*word & 0xFFFF];
    // Do what you want with hiVal and loVal
}
The LUT will have to be constructed depending on your intended algorithm. If you want to count the number of contiguous 0s and 1s in each half-word, you'd build it like this:
for (int i = 0; i < (int)sizeof(lut); i++)
    lut[i] = countContiguousZero(i); // Or countContiguousOne(i)
// The implementation of countContiguousZero can be slow, you don't care.
// It should return the largest number of contiguous zeros (0 to 15, stored in the
// 4 low bits of the byte), and might return the position of the run in the 4 high bits.
// Since you've already dismissed *word == 0, you don't need the 16-contiguous-zero case.
This is my function which creates a binary file
void writefile()
{
ofstream myfile ("data.abc", ios::out | ios::binary);
streamoff offset = 1;
if(myfile.is_open())
{
char c='A';
myfile.write(&c, offset );
c='B';
myfile.write(&c, offset );
c='C';
myfile.write(&c,offset);
myfile.write(StartAddr,streamoff (16) );
myfile.close();
}
else
cout << "Some error" << endl ;
}
The value of StartAddr is 1000, hence the expected output file is:
A B C 1000 NUL NUL NUL
However, strangely, my output file has this appended to it: data.abc
So the final outcome is: A B C 1000 NUL NUL NUL data.abc
Please help me out with this. How do I deal with it? Why is this strange behavior happening?
I recommend you quit binary writing and work on writing the data in a textual format. You've already encountered some of the problems with writing binary data. There are still issues for you to come across with reading the data and with portability. Expect more pain if you continue down this route.
Use textual representations. For simplicity you can put one field per line and use std::getline to read it in. The textual representation allows you to view the data easily in any text editor. Try using Notepad to view a binary file!
Oh, but binary data is so much faster and takes up less space in the file. You've already wasted more time and money than you would gain by using binary data. The speed of computers and huge memory capacities (disk and RAM) make binary representations a thing of the past (except in extreme cases).
As a learning tool, go ahead and use binary. For ease of development and quick schedules (IOW, finishing early), use textual representations.
Search Stack Overflow for "C++ micro optimization" for the justifications.
There are several issues with this code.
For starters, if you want to write individual characters to a stream, you don't need to use ostream::write. Instead, just use ostream::put, as shown here:
myfile.put('A');
Second, if you want to write out a string into a file stream, just use the stream insertion operator:
myfile << StartAddr;
This is perfectly safe, even in binary mode.
As for the particular problem you're reporting, I think that the issue is that you're trying to write out a string of length four (StartAddr), but you've told the stream to write out sixteen bytes. This means that you're writing out the four bytes of the string contents, then the null terminator, and then eleven bytes of whatever happens to be in memory after the buffer. In your case, this is two more null bytes, then the meaningless text that you saw after that. To fix this, either change your code to write fewer bytes or, if StartAddr is a string, just write it using <<.
With the line myfile.write(StartAddr, streamoff(16)); you are instructing the myfile object to write 16 bytes to the stream starting at the address StartAddr. Imagine that StartAddr is an array of 16 bytes:
char StartAddr[16] = "1000\0\0\0data.abc";
myfile.write(StartAddr, sizeof(StartAddr));
Would generate the output that you see. Without seeing the declaration / definition of StartAddr I cannot say for certain, but it appears you are writing out a five-byte nul-terminated string "1000" followed by whatever happens to reside in the next 11 bytes after StartAddr. In this case, it appears a couple of nul bytes followed by the constant nul-terminated string "data.abc" (which the compiler must put somewhere in memory) are what follow StartAddr.
Regardless, it is clear that you overread a buffer.
If you are trying to write a 16 bit integer type to a stream you have a couple of options, both based on the fact that there are typically 8 bits in a byte. The 'cleanest' one would be something like:
char x = static_cast<char>(StartAddr & 0xFF);
myfile.write(&x, 1);   // low byte first
x = static_cast<char>(StartAddr >> 8);
myfile.write(&x, 1);   // then the high byte
This assumes StartAddr is a 16 bit integer type and does not take into account any translation that might occur (such as potential conversion of a value of 10 [a linefeed] into a carriage return / linefeed sequence).
Alternatively, you could write something like:
myfile.write(reinterpret_cast<char*>(&StartAddr), sizeof(StartAddr));