Checking for a specific value sequence within data during a CRC

I'd like to preface this by saying that my knowledge of CRC techniques is very limited. I spent most of the day googling and reading, but I can't quite find what I'm looking for. It may very well not be possible; if so, just let me know!
What I have is a sequence of seemingly random data:
0xAF 0xBC 0x1F 0x5C... etc
Within this data there is a field that is not random (I put it there), and I want to use a CRC check of the entire data set to see whether this field is set to the correct value (let's say 0x12 0x34 0x56 0x78). I am trying to do this sneakily, and that is key: I don't want a casual observer to know that I am looking for that field, which is why I don't just read out the location I want and compare it against the expected value.
The field's value is constant, the rest is pretty much random. There are some fields here and there that will also be constant if that helps.
Is this possible to do? I am not limited in the number of times I do the CRC check, or in which direction I go through the data, or if I change the polynomial, or really anything. I can also start from the middle of the array, or a third of the way in, or whatever, but I would prefer not to start near my field of interest.

The only function that comes to mind that will do what you want is a discrete wavelet transform. (A CRC will always depend on all of the bits that you are computing it over — that's kind of the point.)
You can find the coefficients to apply to the set of discrete wavelet basis functions that will give you a function with a finite basis that covers only the region of interest, using the orthogonality of the basis functions. It will appear that the wavelet functions are over the entire message, but the coefficients are rigged so that the values outside the region of interest cancel in the sum.
While this all may not be obvious to a casual reader of the code, it would be straightforward to write down the functions and coefficients, and multiply it out to see what bytes in the message are selected by the coefficients.
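For what it's worth, here is a toy numeric sketch of that cancellation idea in C, assuming an 8-byte message and the orthonormal Haar basis; the message bytes, the region of interest (bytes 2..5), and all names are invented for illustration. The coefficients are the Haar projections of an indicator over the region, so the weighted sum over the whole transform collapses to a sum over just those bytes; in a "sneaky" version the coefficients would be baked in as opaque constants.
#include <math.h>
#include <stdio.h>

#define N 8

int main(void)
{
    /* Orthonormal Haar basis for 8 samples. */
    const double r8 = 1.0 / sqrt(8.0), r2 = 1.0 / sqrt(2.0), h = 0.5;
    const double basis[N][N] = {
        { r8,  r8,  r8,  r8,  r8,  r8,  r8,  r8 },
        { r8,  r8,  r8,  r8, -r8, -r8, -r8, -r8 },
        {  h,   h,  -h,  -h,   0,   0,   0,   0 },
        {  0,   0,   0,   0,   h,   h,  -h,  -h },
        { r2, -r2,   0,   0,   0,   0,   0,   0 },
        {  0,   0,  r2, -r2,   0,   0,   0,   0 },
        {  0,   0,   0,   0,  r2, -r2,   0,   0 },
        {  0,   0,   0,   0,   0,   0,  r2, -r2 },
    };

    /* Indicator for the region of interest (bytes 2..5), projected onto the
       basis.  In a real program these coefficients would be precomputed
       constants, so nothing in the code points at the region. */
    const double sel[N] = { 0, 0, 1, 1, 1, 1, 0, 0 };
    double coef[N];
    for (int k = 0; k < N; k++) {
        coef[k] = 0;
        for (int i = 0; i < N; i++)
            coef[k] += sel[i] * basis[k][i];
    }

    const unsigned char msg[N] = { 0xAF, 0xBC, 0x12, 0x34, 0x56, 0x78, 0x1F, 0x5C };

    /* Looks like a transform over the whole message, but by orthonormality
       everything outside bytes 2..5 cancels in the sum. */
    double check = 0;
    for (int k = 0; k < N; k++) {
        double proj = 0;
        for (int i = 0; i < N; i++)
            proj += (double)msg[i] * basis[k][i];
        check += coef[k] * proj;
    }
    printf("check = %.1f (0x12 + 0x34 + 0x56 + 0x78 = %d)\n",
           check, 0x12 + 0x34 + 0x56 + 0x78);
    return 0;
}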

OK, so, to confirm, you have something like this as your data:
0xAF 0xBC 0x1F 0x5C 0x11 0x1F 0x5C 0x11
0x2D 0xAB 0xBB 0xCC 0x00 0xBB 0xCC 0x00
0x12 0x34 0x56 0x78 0xFF 0x56 0x78 0xFF
and you're trying to isolate something in a particular location of that data, e.g., to find the 0x12 0x34 0x56 0x78 value there.
To clarify: are you wanting to (1) check that value (the value at that particular address range) and (2) then do a CRC on the whole thing? Or are you wanting to integrate the hunt for the value into the CRC algorithm itself?
Honestly, I'm trying to understand where you're going. I realize this isn't really an answer, but it's a better place for this than a comment.

Using unnecessary hexadecimal notation?

Whilst reading Apple's dlfcn.h header I came across these macros:
#define RTLD_LAZY 0x1
#define RTLD_NOW 0x2
#define RTLD_LOCAL 0x4
#define RTLD_GLOBAL 0x8
Is there any reason why the author of this header wrote these as hexadecimal numbers when the 0x prefix could safely be removed? 0x1, 0x2, etc. are the same numbers without the 0x.
Is this just personal coding style?
It's conventional to use hexadecimal rather than decimal for powers-of-2 sequences, because it scales legibly.
Sure, there are only four options here, but consider a possible extension:
0x01
0x02
0x04
0x08
0x10
0x20
0x40
0x80
// ...
While the equivalent sequence rendered in decimal notation will be instantly familiar to any programmer, it's not as legible/symmetrical:
1
2
4
8
16
32
64
128
// ...
So, using hexadecimal here becomes a convention.
Besides that, sure, it's style. Some people like to use hex for "numbers used by the computer" because it looks kind of robotic; cf. decimal for "numbers used by humans".
Consider also that values built from these constants are likely to be manipulated with bitwise operators (which are similarly convenient to reason about in hex) and viewed in debuggers that show byte values in hexadecimal. It's easier to cross-reference the values in source code if they use the same base as the program's operations and your tools: 0x22 is easier to understand in this context than 34 is.
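As a small illustration of that cross-referencing point (reusing the macro definitions from the question rather than including dlfcn.h):
#include <stdio.h>

#define RTLD_LAZY   0x1
#define RTLD_NOW    0x2
#define RTLD_LOCAL  0x4
#define RTLD_GLOBAL 0x8

int main(void)
{
    /* 0x2 | 0x8 = 0x0A; the hex form shows at a glance which bits are set,
       where the decimal 10 would not. */
    int mode = RTLD_NOW | RTLD_GLOBAL;
    if (mode & RTLD_GLOBAL)
        printf("mode = 0x%02X, RTLD_GLOBAL is set\n", mode);
    return 0;
}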
Ultimately, you may as well ask why we ever use hexadecimal instead of decimal, since there is always a valid conversion between the two. The truth is that some bases are just more convenient than others in certain scenarios. You wouldn't count on your fingers in binary, because you have ten of them.

How to calculate CRC of a WinRAR file?

I know the CRC calculation algorithm from Wikipedia, and I read about the structure of a RAR file here. For example, it says:
The file has the magic number of:
0x 52 61 72 21 1A 07 00
This breaks down as follows to describe an Archive Header:
0x6152 - HEAD_CRC
0x72 - HEAD_TYPE
0x1A21 - HEAD_FLAGS
0x0007 - HEAD_SIZE
If I understand correctly, HEAD_CRC (0x6152) is the CRC value of the Marker Block (MARK_HEAD). Somewhere I read that the CRC of a WinRAR file is calculated with the standard polynomial 0xEDB88320, but that when the size of the CRC is less than 4 bytes, the less significant bytes have to be used. In this case (if I understand correctly) the CRC value is 0x6152, so it is 2 bytes. Now I don't know which bytes I have to take as the less significant ones. From the standard polynomial (0xEDB88320)? Then 0x8320 would presumably be its less significant bytes. Next, how do I calculate the CRC of the Marker Block (i.e. of the bytes 0x52 61 72 21 1A 07 00) once we have the right polynomial?
There was likely a 16-bit check in an older format that is not derived from a 32-bit CRC. The standard 32-bit CRC used by zip and rar, applied to the last five bytes of the header, has no portion equal to the first two bytes. The Polish page appears to be incorrect in claiming that the two-byte check is the low two bytes of a 32-bit CRC.
It does appear from the documentation that the header is constructed the same way as other blocks in the older format, and that the author, for fun, arranged for his format to give the check value "Ra", so that the magic number spells out "Rar!" followed by a text-terminating control-Z.
I found another 16-bit check in the unrar source code, but that check does not result in those values either.
Oh, and no, you can't take part of a CRC polynomial and expect it to be a good CRC polynomial for a smaller check. What the page in Polish is saying is that you would compute the full 32-bit CRC and then take the low two bytes of the result. However, that doesn't work for the magic number header.
Per WinRAR TechNote.txt file included with the install:
The marker block is actually considered as a fixed byte sequence: 0x52 0x61 0x72 0x21 0x1a 0x07 0x00
And as you already indicated, at the very end you can read:
The CRC is calculated using the standard polynomial 0xEDB88320. In case the size of the CRC is less than 4 bytes, only the low order bytes are used.
In Python, the calculation and the grabbing of the two low-order bytes go like this:
import zlib
zlib.crc32(correct_byte_range) & 0xffff
rerar has some code that does this, just like the rarfile library that it uses. ReScene .NET source code has an algorithm in C# for calculating the CRC32 hash. See also How do I calculate CRC32 mathematically?
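For reference, here is a minimal bit-by-bit sketch in C of that same computation (the standard reflected CRC-32 with polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF, which is what zlib.crc32 implements), plus a helper that keeps only the two low-order bytes as the TechNote describes; the function names are mine.
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit CRC-32: reflected polynomial 0xEDB88320, init and final XOR 0xFFFFFFFF. */
uint32_t crc32_bytes(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFu;
}

/* "Only the low order bytes are used" when the field is 2 bytes wide. */
uint16_t crc16_field(const uint8_t *data, size_t len)
{
    return (uint16_t)(crc32_bytes(data, len) & 0xFFFF);
}
(As noted above, this still does not reproduce the 0x6152 in the marker block, which is simply a fixed byte sequence.)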

Bitstream parsing and Endianness

I am trying to parse a bitstream, and I am having trouble getting my head around endianness. I have a byte buffer, and I need to be able to read bitfields out which are of varying lengths, anywhere from 1 bit to 8 bits mostly.
My problem comes with the endianness of the bytes. When I step through with a debugger, the bottom 4 bits appear to be in the top portion of the byte. That is, I am expecting the first two bits to be 10 (they must be 10), but the first byte in the bitstream is 0xA3, or 1010 0011, when checked with the debugger. That means that, assuming the bits are in the "correct" order, the first two bits are in fact 11 (reading right to left).
It would seem, however, that if the bits were not in the right order, and should be 0x3A, or 0011 1010, I then have 10 as my expected first two bits.
This confuses me, because it doesn't seem to be a matter of bit order, MSb-to-LSb versus LSb-to-MSb, but rather nibble order. How does this happen? That just seems to be the way it came out of the file. There is a possibility this is an invalid bitstream, but I have seen this kind of thing before when reading files in hex editors, with nibbles seemingly in the "wrong" order.
I am just confused and would like some help understanding what's going on. I don't often deal with things at this level.
You don't need to worry about bit order, because in C/C++ there is no way to iterate through the bits of a byte using pointer arithmetic. You can only manipulate the bits using bitwise operators, which are independent of the bit order of the local machine. What you mention in the OP is just a matter of visualization: different debuggers may choose different ways to display the bits in a byte, and there is no right or wrong here, just preference. What really matters is the byte order.
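For example, extracting the two most significant bits of the first byte with shifts and masks gives the same answer regardless of how any particular debugger chooses to display the byte (a rough sketch; the value is taken from the question):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t stream[] = { 0xA3 };                 /* 1010 0011 */
    unsigned top_two = (stream[0] >> 6) & 0x3;   /* take the two most significant bits */
    printf("top two bits = %u\n", top_two);      /* prints 2, i.e. binary 10 */
    return 0;
}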

Reducing a CRC32 value to 16 or 8 bit

In a message framing scheme I want to secure packets with a CRC error detection. Those packets are sent over TCP connection.
For small packets with a length smaller than 16 bytes, I'd choose CRC8. For packets smaller than 4096 bytes a CRC16 algorithm. For larger packets the CRC32 algorithm.
The most attractive CRC implementation currently is CRC32C, because of hardware support (at least on some Intel CPUs). But there are no special instructions for 8-bit and 16-bit CRCs.
My question is now: is it possible to reduce the 32-bit value of the CRC32C algorithm to a 16- or 8-bit value without hurting the error detection performance compared with native CRC16 or CRC8 algorithms?
An example:
char buffer[256];
...
uint32_t crc32 = compute_crc32c_of_block( buffer, 256 );
uint16_t fake_crc16 = ( crc32 >> 16 ) ^ crc32;
uint8_t fake_crc8 = ( fake_crc16 >> 8 ) ^ fake_crc16;
Is fake_crc8 as good as a real CRC8 implementation?
Thanks in advance.
The low 8 bits of a 32-bit CRC will not have error-detection properties as good as those of an 8-bit CRC, e.g. the assurance of detecting burst errors. However, it may be good enough for your application, depending on the characteristics of the noise source. If you have a good number of bit errors with correlation in their location, then you should use a real CRC. If you have either rare bit flips or lots of gross errors, then a portion of a CRC would likely work just as well.
There is no substitute for testing to see how they perform.
If you can specify the polynomial for hardware CRC then padding it with zeros for the bits that you don't want will result in CRC that also has zeros at those bit positions. Then you simply discard them.
Using the default data from the calculator (0x31 0x32 0x33 0x34 0x35 0x36 0x37 0x38 0x39, http://www.sunshine2k.de/coding/javascript/crc/crc_js.html), initial value 0x0, final XOR 0x0, here is an example:
CRC8 with 0x7 poly is 0xF4
CRC16 with 0x700 poly is 0xF400
CRC32 with 0x07000000 is 0xF4000000
Now, the problem is that the hardware may not support this kind of polynomial. For example, the 16-bit SPI hardware CRC calculator in STM32 processors only supports odd polynomials. The polynomial 0x07 is odd, but 0x0700 is even, and the hardware produces garbage with it.
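To check the zero-padding behaviour in software, here is a rough generic MSB-first CRC in C (no reflection, initial value 0x0, no final XOR, matching the calculator settings above); the function name and the framing are mine. It reproduces the three check values listed above:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Generic MSB-first (non-reflected) CRC, init 0x0, no final XOR. */
static uint32_t crc_msb(const uint8_t *data, size_t len, int width, uint32_t poly)
{
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    uint32_t topbit = 1u << (width - 1);
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << (width - 8);
        for (int b = 0; b < 8; b++)
            crc = ((crc & topbit) ? (crc << 1) ^ poly : (crc << 1)) & mask;
    }
    return crc;
}

int main(void)
{
    const uint8_t msg[] = "123456789";            /* 0x31 .. 0x39 */
    size_t len = strlen((const char *)msg);
    printf("CRC8  poly 0x07       -> 0x%02X\n", (unsigned)crc_msb(msg, len,  8, 0x07));        /* 0xF4 */
    printf("CRC16 poly 0x0700     -> 0x%04X\n", (unsigned)crc_msb(msg, len, 16, 0x0700));      /* 0xF400 */
    printf("CRC32 poly 0x07000000 -> 0x%08X\n", (unsigned)crc_msb(msg, len, 32, 0x07000000));  /* 0xF4000000 */
    return 0;
}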

What's the concept behind zip compression?

What's the concept behind zip compression? I can understand the concept of removing empty space, etc., but presumably something has to be added to say how much free space needs to be added back in, and where, during decompression?
What's the basic process for compressing a stream of bytes?
A good place to start would be to look up the Huffman compression scheme. The basic idea behind Huffman is that in a given file some bytes appear more frequently than others (in a plain-text file many bytes won't appear at all). Rather than spending 8 bits to encode every byte, why not use a shorter bit sequence to encode the most common characters and longer sequences to encode the less common ones (these sequences are determined by building a Huffman tree)?
Once you get a handle on using these trees to encode/decode files based on character frequency, imagine that you then start working on word frequency: instead of encoding "they" as a sequence of 4 characters, why not treat it as a single symbol, given its frequency, and assign it its own leaf in the Huffman tree? This is more or less the basis of ZIP and other lossless compression: they look for common "words" (sequences of bytes) in a file (including sequences of just one byte, if common enough) and use a tree to encode them. The zip file then only needs to include the tree info (a copy of each sequence and the number of times it appears) to allow the tree to be reconstructed and the rest of the file to be decoded.
Follow up:
To better answer the original question: the idea behind lossless compression is not so much to remove empty space as to remove redundant information.
If you created a database to store music lyrics, you'd find a lot of space was being used to store the chorus, which repeats several times. Instead of using all that space, you could simply place the word CHORUS before the first instance of the chorus lines, and then every time the chorus repeats, just use CHORUS as a placeholder. (In fact, this is pretty much the idea behind LZW compression: in LZW, each line of the song would have a number shown before it, and if a line repeats later in the song, rather than writing out the whole line, only the number is shown.)
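To make the Huffman-tree idea above concrete, here is a toy coder in C (the message, the names, and the simple O(n^2) merge loop are all mine, purely for illustration): it counts byte frequencies, repeatedly merges the two least-frequent live nodes under a new parent, then prints each symbol's variable-length code and the total bit count.
#include <stdio.h>
#include <string.h>

struct node { long freq; int sym, left, right, alive; };

static struct node t[512];        /* at most 2*256 - 1 nodes */
static int n;
static char code[256][64];        /* bit string for each byte value */

/* Walk the finished tree, recording "0" for left edges and "1" for right edges. */
static void walk(int i, char *buf, int depth)
{
    if (t[i].left < 0) {                          /* leaf */
        buf[depth] = '\0';
        strcpy(code[t[i].sym], depth ? buf : "0");
        return;
    }
    buf[depth] = '0'; walk(t[i].left,  buf, depth + 1);
    buf[depth] = '1'; walk(t[i].right, buf, depth + 1);
}

int main(void)
{
    const char *msg = "they went there and they saw them";
    long freq[256] = { 0 };
    for (const char *p = msg; *p; p++) freq[(unsigned char)*p]++;

    /* One leaf per distinct byte. */
    for (int s = 0; s < 256; s++)
        if (freq[s]) t[n++] = (struct node){ freq[s], s, -1, -1, 1 };

    /* Merge the two least-frequent live nodes until a single root remains. */
    for (int live = n; live > 1; live--) {
        int a = -1, b = -1;
        for (int i = 0; i < n; i++)
            if (t[i].alive) {
                if (a < 0 || t[i].freq < t[a].freq) { b = a; a = i; }
                else if (b < 0 || t[i].freq < t[b].freq) b = i;
            }
        t[a].alive = t[b].alive = 0;
        t[n++] = (struct node){ t[a].freq + t[b].freq, -1, a, b, 1 };
    }

    char buf[64];
    walk(n - 1, buf, 0);

    long bits = 0;
    for (int s = 0; s < 256; s++)
        if (freq[s]) {
            printf("'%c' x%ld -> %s\n", s, freq[s], code[s]);
            bits += freq[s] * (long)strlen(code[s]);
        }
    printf("%zu bytes = %zu bits raw, %ld bits with this code (plus the tree)\n",
           strlen(msg), 8 * strlen(msg), bits);
    return 0;
}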
The basic concept is that instead of using eight bits to represent each byte, you use shorter representations for more frequently occurring bytes or sequences of bytes.
For example, if your file consists solely of the byte 0x41 (A) repeated sixteen times, then instead of representing it as the 8-bit sequence 01000001, shorten it to the 1-bit sequence 0. The file can then be represented by 0000000000000000 (sixteen 0s), so the file of the byte 0x41 repeated sixteen times can be represented by a file consisting of the byte 0x00 repeated twice.
So what we have here is that for this file (0x41 repeated sixteen times) the bits 01000001 don't convey any additional information over the bit 0. So, in this case, we throw away the extraneous bits to obtain a shorter representation.
That is the core idea behind compression.
As another example, consider the eight byte pattern
0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48
and now repeat it 2048 times. One way to follow the approach above is to represent bytes using three bits.
000 0x41
001 0x42
010 0x43
011 0x44
100 0x45
101 0x46
110 0x47
111 0x48
Now we can represent the above byte pattern by 00000101 00111001 01110111 (this is the three-byte pattern 0x05 0x39 0x77) repeated 2048 times.
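(A quick sanity check of that packing, just concatenating the eight 3-bit codes most significant bits first:)
#include <assert.h>
#include <stdint.h>

int main(void)
{
    /* Codes 000..111 for 0x41..0x48, packed MSB first. */
    uint32_t packed = 0;
    for (uint32_t c = 0; c < 8; c++)
        packed = (packed << 3) | c;               /* 24 bits in total */

    assert(((packed >> 16) & 0xFF) == 0x05);
    assert(((packed >>  8) & 0xFF) == 0x39);
    assert(( packed        & 0xFF) == 0x77);
    return 0;
}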
But an even better approach is to represent the byte pattern
0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48
by the single bit 0. Then we can represent the above byte pattern by 0 repeated 2048 times which becomes the byte 0x00 repeated 256 times. Now we only need to store the dictionary
0 -> 0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48
and the byte pattern 0x00 repeated 256 times, and we have compressed the file from 16,384 bytes down to (modulo the dictionary) 256 bytes.
That, in a nutshell, is how compression works. The whole business comes down to finding short, efficient representations of the bytes and byte sequences in a given file. That's the simple idea, but the details (finding the representations) can be quite challenging.
See for example:
Data compression
Run length encoding
Huffman compression
Shannon-Fano coding
LZW
The concept behind compression is basically statistical. If you've got a series of bytes, the chance of byte N being X in practice depends on the value distribution of the previous bytes 0..N-1. Without compression, you allocate 8 bits for each possible value X. With compression, the number of bits allocated to each value X depends on the estimated chance p(N,X).
For instance, given the sequence "aaaa", a compression algorithm can assign a high value to p(5,a) and lower values to p(5,b). When p(N,X) is high, the bitstring assigned to X will be short; when p(N,X) is low, a long bitstring is used. In this way, if p(N,X) is a good estimate, the average bitstring will be shorter than 8 bits.
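As a rough numeric illustration of that last point (the probabilities below are invented): an ideal coder spends about -log2(p) bits on a symbol of probability p, so a heavily skewed distribution needs far fewer than 8 bits per symbol on average.
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical next-symbol probabilities after seeing "aaaa". */
    const char   sym[] = { 'a', 'b', 'c' };
    const double p[]   = { 0.90, 0.06, 0.04 };
    double avg = 0;
    for (int i = 0; i < 3; i++) {
        double ideal = -log2(p[i]);               /* ideal code length in bits */
        printf("p('%c') = %.2f -> about %.2f bits\n", sym[i], p[i], ideal);
        avg += p[i] * ideal;
    }
    printf("average: %.2f bits per symbol (vs. 8 bits uncompressed)\n", avg);
    return 0;
}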