Can two different data blocks have the same CRC - crc

Is it possible (even if very low possibility) to have two different data blocks (each 4K for example), and when calculating the CRC, they are found to match ?

Yes there will be conflicts. The number of possible combinations for 4K block would be 24096 * 28. The number of possible combinations for 32-bit CRC would be 232. So on average, 24072 different arrangements of bits in the 4K block will map to the same CRC number. Though the chance that you take two 4K blocks containing random data and they have a CRC match for 32-bit CRC is 2-32 i.e. a quarter of a billion.

Related

Selection of CRC-16-CCITT

Cyclic redundancy checks are used often, and works well with proper config. The ITU's ("CCIT") CRC gets used a lot -
Ref - CRC16-CCITT Reference - Joe Geluso
Why are the ITU's CRC values used so frequently? A common 'default' found, so to speak, just curious as to why
Polynomial 0x11021 is used for floppy disks. Part of the reason for choosing that polynomial is that there are only three 1 bits in 0x1021, which simplifies hardware based CRC calculations. This is also true for 0x10007 (FOP-16) and 0x14003 (CRC16, CRC16-IBM), so I'm not sure why 0x11021 was chosen versus the other two somewhat common ones with only three 1 bits in the lower 16 bits.
0x11021 is also used for XMODEM (a serial file transfer program for old computers), which is typically implemented in software, where the number of 1 bits in the polynomial doesn't matter, but may have been chosen since it was used for floppy disks.
0x11021 is the product of two prime polynomials: 0xf01f and 0x3. The 0x3 (x+1) will detect any odd number of bit errors, and it's 2 bit error detection is good for up to 32751 data bits + 16 crc bits = 32767 bits, good enough for floppy disk sector sizes 128, 256, 512, and 1024 bytes (could also be used for 2048 bytes, but I don't recall a floppy disk with a 2048 byte sector size). I'm not aware of any advantage in the choice of a polynomial for single burst error detection. Some polynomials would be better for single burst error correction, but single burst correction isn't common.
The two other polynomials I mentioned are similar, 0x10007 = 0xfffd * 0x3 , 0x14003 = 0xc001 * 0x3.

Maximum message length for CRC codes? [duplicate]

I've seen 8-bit, 16-bit, and 32-bit CRCs.
At what point do I need to jump to a wider CRC?
My gut reaction is that it is based on the data length:
1-100 bytes: 8-bit CRC
101 - 1000 bytes: 16-bit CRC
1001 - ??? bytes: 32-bit CRC
EDIT:
Looking at the Wikipedia page about CRC and Lott's answer, here' what we have:
<64 bytes: 8-bit CRC
<16K bytes: 16-bit CRC
<512M bytes: 32-bit CRC
It's not a research topic. It's really well understood: http://en.wikipedia.org/wiki/Cyclic_redundancy_check
The math is pretty simple. An 8-bit CRC boils all messages down to one of 256 values. If your message is more than a few bytes long, the possibility of multiple messages having the same hash value goes up higher and higher.
A 16-bit CRC, similarly, gives you one of the 65,536 available hash values. What are the odds of any two messages having one of these values?
A 32-bit CRC gives you about 4 billion available hash values.
From the wikipedia article: "maximal total blocklength is equal to 2**r − 1". That's in bits. You don't need to do much research to see that 2**9 - 1 is 511 bits. Using CRC-8, multiple messages longer than 64 bytes will have the same CRC checksum value.
The effectiveness of a CRC is dependent on multiple factors. You not only need to select the SIZE of the CRC but also the GENERATING POLYNOMIAL to use. There are complicated and non-intuitive trade-offs depending on:
The expected bit error rate of the channel.
Whether the errors tend to occur in bursts or tend to be spread out (burst is common)
The length of the data to be protected - maximum length, minimum length and distribution.
The paper Cyclic Redundancy Code Polynominal Selection For Embedded Networks, by Philip Koopman and Tridib Chakravarty, publised in the proceedings of the 2004 International Conference on Dependable Systems and Networks gives a very good overview and makes several recomendations. It also provides a bibliography for further understanding.
http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
The choice of CRC length versus file size is mainly relevant in cases where one is more likely to have an input which differs from the "correct" input by three or fewer bits than to have a one which is massively different. Given two inputs which are massively different, the possibility of a false match will be about 1/256 with most forms of 8-bit check value (including CRC), 1/65536 with most forms of 16-bit check value (including CRC), etc. The advantage of CRC comes from its treatment of inputs which are very similar.
With an 8-bit CRC whose polynomial generates two periods of length 128, the fraction of single, double, or triple bit errors in a packet shorter than that which go undetected won't be 1/256--it will be zero. Likewise with a 16-bit CRC of period 32768, using packets of 32768 bits or less.
If packets are longer than the CRC period, however, then a double-bit error will go undetected if the distance between the erroneous bits is a multiple of the CRC period. While that might not seem like a terribly likely scenario, a CRC8 will be somewhat worse at catching double-bit errors in long packets than at catching "packet is totally scrambled" errors. If double-bit errors are the second most common failure mode (after single-bit errors), that would be bad. If anything that corrupts some data is likely to corrupt a lot of it, however, the inferior behavior of CRCs with double-bit errors may be a non-issue.
I think the size of the CRC has more to do with how unique of a CRC you need instead of of the size of the input data. This is related to the particular usage and number of items on which you're calculating a CRC.
The CRC should be chosen specifically for the length of the messages, it is not just a question of the size of the CRC: http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
Here is a nice "real world" evaluation of CRC-N
http://www.backplane.com/matt/crc64.html
I use CRC-32 and file-size comparison and have NEVER, in the billions of files checked, run into a matching CRC-32 and File-Size collision. But I know a few exist, when not purposely forced to exist. (Hacked tricks/exploits)
When doing comparison, you should ALSO be checking "data-sizes". You will rarely have a collision of the same data-size, with a matching CRC, within the correct sizes.
Purposely manipulated data, to fake a match, is usually done by adding extra-data until the CRC matches a target. However, that results in a data-size that no-longer matches. Attempting to brute-force, or cycle through random, or sequential data, of the same exact size, would leave a real narrow collision-rate.
You can also have collisions within the data-size, just by the generic limits of the formulas used, and constraints of using bits/bytes and base-ten systems, which depends on floating-point values, which get truncated and clipped.
The point you would want to think about going larger, is when you start to see many collisions which can not be "confirmed" as "originals". (When they both have the same data-size, and (when tested backwards, they have a matching CRC. Reverse/byte or reverse/bits, or bit-offsets)
In any event, it should NEVER be used as the ONLY form of comparison, just for a quick form of comparison, for indexing.
You can use a CRC-8 to index the whole internet, and divide everything into one of N-catagories. You WANT those collisions. Now, with those pre-sorted, you only have to check one of N-directories, looking for "file-size", or "reverse-CRC", or whatever other comparison you can do to that smaller data-set, fast...
Doing a CRC-32 forwards and backwards on the same blob of data is more reliable than using CRC-64 in just one direction. (Or an MD5, for that matter.)
You can detect a single bit error with a CRC in any size packet. Detecting double bit errors or correction of single bit errors is limited to the number of distinct values the CRC can take, so for 8 bits, that would 256; for 16 bits, 65535; etc. 2^n; In practice, though, CRCs actually take on fewer distinct values for single bit errors. For example what I call the 'Y5' polynomial, the 0x5935 polynomial only takes on up to 256 different values before they repeat going back farther, but on the other hand it is able to correct double bit errors that distance, which is 30 bytes plus 2 bytes for errors in the CRC itself.
The number of bits you can correct with forward error correction is also limited by the Hamming Distance of the polynomial. For example, if the Hamming distance is three, you have to flip three bits to change from a set of bits that represents one valid message with matching CRC to another valid message with its own matching CRC. If that is the case, you can correct one bit with confidence. If the Hamming distance were 5, you could correct two bits. But when correcting multiple bits, you are effectively indexing multiple positions, so you need twice as many bits to represent the indexes of two corrected bits rather than one.
With forward error correction, you calculate the CRC on a packet and CRC together, and get a residual value. A good message with zero errors will always have the expected residual value (zero unless there's a nonzero initial value for the CRC register), and each bit position of error has a unique residual value, so use it to identify the position. If you ever get a CRC result with that residual, you know which bit (or bits) to flip to correct the error.

32 bit CRC with some inputs set to zero. Is this less accurate than dummy data?

Sorry if I should be able to answer this simple question myself!
I am working on an embedded system with a 32bit CRC done in hardware for speed. A utility exists that I cannot modify that initially takes 3 inputs (words) and returns a CRC.
If a standard 32 bit was implemented, would generating a CRC from a 32 bit word of actual data and 2 32 bit words comprising only of zeros produce a less reliable CRC than if I just made up/set some random values for the last 2 32?
Depending on the CRC/polynomial, my limited understanding of CRC would say the more data you put in the less accurate it is. But don't zero'd data reduce accuracy when performing the shifts?
Using zeros will be no different than some other value you might pick. The input word will be just as well spread among the CRC bits either way.
I agree with Mark Adler that zeros are mathematically no worse than other numbers. However, if the utility you can't change does something bad like set the initial CRC to zero, then choose non-zero pad words. An initial CRC=0 + Data=0 + Pads=0 produces a final CRC=0. This is technically valid, but routinely getting CRC=0 is undesirable for data integrity checking. You could compensate for a problem like this with non-zero pad characters, e.g. pad = -1.

Highly compressing a grid of numbers

I have a square grid which contains numbers and I need to compress it a lot so it can easily be transferred over a network. For instance, I need to be able to compress a 40x40 grid into less than 512 bytes, regardless of the values of the numbers in the grid. That is my basic requirement.
Each of the grid's cells contains a number from 0-7, so each cell can fit in 3 bits.
Does anyone know of a good algorithm that can achieve what I want?
You can encode your information differently. You dont't need to assign all numbers 0 to 7 a code with the same number of bits. You can assign based on number of times in the sequence.
First read the whole sequence counting the number of appearances of every number.
Based on that you can assign the code to each number.
If you assign the code following for example Huffman code the codes will be prefix code, meaning there is no extra character to separate numbers.
There are certain variations that you can introduce on the algorithm based on your test results to fine tune the compression ratio.
I used this technique in a project (university) and it provides, in general, good results. At least it should approximmate your theoretical 3 bits per character and can be much better if the distribution of probabilities helps.
What you want to do is perform a "burrowes-wheeler" transform on your data, and then compress it. Run-length encoding will be enough in this case.
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
This will likely outperform huffman in your case.
It is true that in some cases you will need more than 512 bytes. So in your protocol just make an exception for "perverse" grids. But in the general case you should be under 512 easily.
As others have stated, the problem as stated is not possible as 600 bytes are required to represent all possible grids. 600 bytes is from 40 rows, 40 columns, 3 bits per cell, and 8 bits per byte (40 * 40 * 3 / 8). As Kerrek SB explained in the comments, you pack 8 cells into 3 bytes.
In your own comments, you mentioned this was game state being transferred over the network. Assuming you have a mechanism to assure reliable transport of the data, then if there is a reasonable limit to the number of cells that can change between updates, or if you are allowed to send your updates when a certain number of cells change, you can achieve a representation in 512 bytes. If you use 1 bit to represent whether or not a cell has changed, you would need 200 bytes. Then, you have 312 remaining bytes to represent the new values of the cells that have changed. So, you can represent up to 312*8/3 = 832 modified cells.
As an aside, this representation can represent up to 1064 changed cells in less than 600 bytes.

Data Length vs CRC Length

I've seen 8-bit, 16-bit, and 32-bit CRCs.
At what point do I need to jump to a wider CRC?
My gut reaction is that it is based on the data length:
1-100 bytes: 8-bit CRC
101 - 1000 bytes: 16-bit CRC
1001 - ??? bytes: 32-bit CRC
EDIT:
Looking at the Wikipedia page about CRC and Lott's answer, here' what we have:
<64 bytes: 8-bit CRC
<16K bytes: 16-bit CRC
<512M bytes: 32-bit CRC
It's not a research topic. It's really well understood: http://en.wikipedia.org/wiki/Cyclic_redundancy_check
The math is pretty simple. An 8-bit CRC boils all messages down to one of 256 values. If your message is more than a few bytes long, the possibility of multiple messages having the same hash value goes up higher and higher.
A 16-bit CRC, similarly, gives you one of the 65,536 available hash values. What are the odds of any two messages having one of these values?
A 32-bit CRC gives you about 4 billion available hash values.
From the wikipedia article: "maximal total blocklength is equal to 2**r − 1". That's in bits. You don't need to do much research to see that 2**9 - 1 is 511 bits. Using CRC-8, multiple messages longer than 64 bytes will have the same CRC checksum value.
The effectiveness of a CRC is dependent on multiple factors. You not only need to select the SIZE of the CRC but also the GENERATING POLYNOMIAL to use. There are complicated and non-intuitive trade-offs depending on:
The expected bit error rate of the channel.
Whether the errors tend to occur in bursts or tend to be spread out (burst is common)
The length of the data to be protected - maximum length, minimum length and distribution.
The paper Cyclic Redundancy Code Polynominal Selection For Embedded Networks, by Philip Koopman and Tridib Chakravarty, publised in the proceedings of the 2004 International Conference on Dependable Systems and Networks gives a very good overview and makes several recomendations. It also provides a bibliography for further understanding.
http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
The choice of CRC length versus file size is mainly relevant in cases where one is more likely to have an input which differs from the "correct" input by three or fewer bits than to have a one which is massively different. Given two inputs which are massively different, the possibility of a false match will be about 1/256 with most forms of 8-bit check value (including CRC), 1/65536 with most forms of 16-bit check value (including CRC), etc. The advantage of CRC comes from its treatment of inputs which are very similar.
With an 8-bit CRC whose polynomial generates two periods of length 128, the fraction of single, double, or triple bit errors in a packet shorter than that which go undetected won't be 1/256--it will be zero. Likewise with a 16-bit CRC of period 32768, using packets of 32768 bits or less.
If packets are longer than the CRC period, however, then a double-bit error will go undetected if the distance between the erroneous bits is a multiple of the CRC period. While that might not seem like a terribly likely scenario, a CRC8 will be somewhat worse at catching double-bit errors in long packets than at catching "packet is totally scrambled" errors. If double-bit errors are the second most common failure mode (after single-bit errors), that would be bad. If anything that corrupts some data is likely to corrupt a lot of it, however, the inferior behavior of CRCs with double-bit errors may be a non-issue.
I think the size of the CRC has more to do with how unique of a CRC you need instead of of the size of the input data. This is related to the particular usage and number of items on which you're calculating a CRC.
The CRC should be chosen specifically for the length of the messages, it is not just a question of the size of the CRC: http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
Here is a nice "real world" evaluation of CRC-N
http://www.backplane.com/matt/crc64.html
I use CRC-32 and file-size comparison and have NEVER, in the billions of files checked, run into a matching CRC-32 and File-Size collision. But I know a few exist, when not purposely forced to exist. (Hacked tricks/exploits)
When doing comparison, you should ALSO be checking "data-sizes". You will rarely have a collision of the same data-size, with a matching CRC, within the correct sizes.
Purposely manipulated data, to fake a match, is usually done by adding extra-data until the CRC matches a target. However, that results in a data-size that no-longer matches. Attempting to brute-force, or cycle through random, or sequential data, of the same exact size, would leave a real narrow collision-rate.
You can also have collisions within the data-size, just by the generic limits of the formulas used, and constraints of using bits/bytes and base-ten systems, which depends on floating-point values, which get truncated and clipped.
The point you would want to think about going larger, is when you start to see many collisions which can not be "confirmed" as "originals". (When they both have the same data-size, and (when tested backwards, they have a matching CRC. Reverse/byte or reverse/bits, or bit-offsets)
In any event, it should NEVER be used as the ONLY form of comparison, just for a quick form of comparison, for indexing.
You can use a CRC-8 to index the whole internet, and divide everything into one of N-catagories. You WANT those collisions. Now, with those pre-sorted, you only have to check one of N-directories, looking for "file-size", or "reverse-CRC", or whatever other comparison you can do to that smaller data-set, fast...
Doing a CRC-32 forwards and backwards on the same blob of data is more reliable than using CRC-64 in just one direction. (Or an MD5, for that matter.)
You can detect a single bit error with a CRC in any size packet. Detecting double bit errors or correction of single bit errors is limited to the number of distinct values the CRC can take, so for 8 bits, that would 256; for 16 bits, 65535; etc. 2^n; In practice, though, CRCs actually take on fewer distinct values for single bit errors. For example what I call the 'Y5' polynomial, the 0x5935 polynomial only takes on up to 256 different values before they repeat going back farther, but on the other hand it is able to correct double bit errors that distance, which is 30 bytes plus 2 bytes for errors in the CRC itself.
The number of bits you can correct with forward error correction is also limited by the Hamming Distance of the polynomial. For example, if the Hamming distance is three, you have to flip three bits to change from a set of bits that represents one valid message with matching CRC to another valid message with its own matching CRC. If that is the case, you can correct one bit with confidence. If the Hamming distance were 5, you could correct two bits. But when correcting multiple bits, you are effectively indexing multiple positions, so you need twice as many bits to represent the indexes of two corrected bits rather than one.
With forward error correction, you calculate the CRC on a packet and CRC together, and get a residual value. A good message with zero errors will always have the expected residual value (zero unless there's a nonzero initial value for the CRC register), and each bit position of error has a unique residual value, so use it to identify the position. If you ever get a CRC result with that residual, you know which bit (or bits) to flip to correct the error.