For functional safety reasons I need to store a CRC-16 or similar to protect data. The data length is up to 80 bytes. I need to reserve one of the 16-bit values to indicate that the data was modified intentionally and the CRC has not been calculated yet.
As far as I understand, every 16-bit value could be the result of a CRC-16, so there is no unused value that could indicate "uninitialised".
What is the best solution?
take "0" as the uninitialised value and store "1" if the calculation delivers "0"
use a smaller CRC, e.g. CRC-15
is there a better solution?
I use C and C++ but this should not play a big role.
Update, taking into account the suggestion of rcgldr to use CRC-15:
I will store either the CRC-15 value (0..32767) or the value 65432, which indicates that the data should be checked and the CRC calculated. I do not want to use a single bit, 0x0000, or 0xFFFF to invalidate the CRC, as these bit patterns are more likely to occur by accident than an arbitrary value outside the valid CRC-15 range, like 65432.
Taking into account the suggestion of Adler:
I calculate CRC-16 and if the value is 65432, write e.g. 0xFFFF instead. 65432 is thus reserved for indicating the modification.
I have the feeling that CRC-15 looks cleaner, but Adler is right that I lose information. On the other hand, my data (calibration data) are stored in memory, where bit errors are less likely than in a transfer over a serial interface (which will be protected separately). The chance that an error is not detected is about 1:32768.
Your first option. You can preserve most of the power of a 16-bit CRC by mapping a CRC value of 0 to 1. Then the value 1 appears twice as often as the other non-zero values, and 0 never appears. This very slightly weakens the power of the CRC to detect errors, and you have now freed up the zero value to indicate that the CRC has not been calculated.
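A minimal sketch of this mapping in C, assuming a CRC-16 with polynomial 0x1021 and a zero initial value (the XMODEM variant; the question does not name a polynomial). The names `crc16_stored`, `check_block`, and `CRC_NOT_CALCULATED` are made up for illustration. The same mapping is applied when storing and when checking, so a raw CRC of 0 or 1 both compare equal to a stored 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC_NOT_CALCULATED 0x0000u /* reserved marker (assumed name) */

/* Bitwise CRC-16, polynomial 0x1021, initial value 0, no reflection. */
static uint16_t crc16_raw(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Value to store: a raw result of 0 is mapped to 1, so 0 never occurs. */
static uint16_t crc16_stored(const uint8_t *data, size_t len)
{
    uint16_t crc = crc16_raw(data, len);
    return (crc == 0u) ? (uint16_t)1u : crc;
}

typedef enum { CHECK_OK, CHECK_MISMATCH, CHECK_NOT_CALCULATED } check_result;

/* Three-way check: marker present, CRC matches, or CRC differs. */
static check_result check_block(const uint8_t *data, size_t len, uint16_t stored)
{
    if (stored == CRC_NOT_CALCULATED) {
        return CHECK_NOT_CALCULATED;
    }
    return (crc16_stored(data, len) == stored) ? CHECK_OK : CHECK_MISMATCH;
}
```

Code that modifies the data would write `CRC_NOT_CALCULATED` into the CRC field, and the routine that later validates the data would replace it with `crc16_stored(...)`.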
Taking an entire bit for that indication weakens the power of the CRC, indicated by the probability of a false positive, by a factor of two.
Without knowing why or how often a CRC is not initialized (calculated), the expected error rate and how non-initialized CRC is handled, I'm not sure what to recommend.
take "0" as the uninitialised value and store "1" if the calculation delivers "0"
Both 0 and 1 are possible CRC-16 values for valid data and for corrupted data, but as Mark Adler answered, the mapping only slightly weakens the CRC, since calculated CRCs of 0 and 1 are both stored as 1. When the data is received and the CRC is recalculated over just the data, a message CRC of 1 means the code accepts a recalculated CRC of either 0 or 1 as indicating no errors.
use a smaller CRC, e.g. CRC-15
Use a 15 bit CRC and a flag bit to indicate if the 15 bit CRC is calculated or not. As Mark Adler answered, it's weaker than having 0 and 1 mapped to 1, but if the error rate for 82 bytes is very low, it may not matter from a practical standpoint.
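A rough sketch of the flag-bit variant in C. It assumes the CRC-15 polynomial 0x4599 (the one used by CAN) with a zero initial value; any other 15-bit polynomial would work the same way. `CRC_VALID_FLAG`, `make_check_word`, and `check_word_ok` are illustrative names, not part of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CRC_VALID_FLAG 0x8000u /* top bit set: the low 15 bits hold a CRC */

/* Bitwise CRC-15, polynomial 0x4599, initial value 0, MSB-first. */
static uint16_t crc15(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            unsigned in = (data[i] >> bit) & 1u;
            unsigned top = (crc >> 14) & 1u;
            crc = (uint16_t)((crc << 1) & 0x7FFFu);
            if (in ^ top) {
                crc ^= 0x4599u;
            }
        }
    }
    return crc;
}

/* 16-bit word to store once the CRC has been calculated. */
static uint16_t make_check_word(const uint8_t *data, size_t len)
{
    return (uint16_t)(CRC_VALID_FLAG | crc15(data, len));
}

/* True only if the flag bit is set and the CRC-15 matches. */
static bool check_word_ok(const uint8_t *data, size_t len, uint16_t word)
{
    if ((word & CRC_VALID_FLAG) == 0u) {
        return false; /* CRC has not been calculated yet */
    }
    return (uint16_t)(word & 0x7FFFu) == crc15(data, len);
}
```

With this layout every word whose top bit is clear means "not calculated", which is where the factor-of-two loss in detection strength comes from.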
If we are sure that we are in "single error" mode, how can we correct that error using the received CRC and the expected CRC? I know how to detect errors, but how do I correct them?
Depending on the number of bits in a message (data + crc), and the CRC polynomial, a single bit error can be corrected. In order for this to work, every single bit error would have to produce a unique CRC. If there are any duplicates, it won't work, but a different CRC polynomial might solve the issue.
If the number of bits is not too large, a table can be used. Each entry in the table would contain a CRC and the bit index of the error. The table can be sorted by CRC so that a binary search can be used.
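As an illustration, here is a sketch in C that generates the same (CRC, bit index) pairs such a table would hold, but scans them linearly instead of storing and sorting them. It assumes a CRC-16 with polynomial 0x1021, a zero initial value, and the CRC appended high byte first, so that an error-free codeword has a CRC of 0. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16, polynomial 0x1021, initial value 0, no reflection. */
static uint16_t crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/*
 * cw holds the data followed by its CRC (high byte first).
 * Returns 0 if no error was found, 1 if one bit was flipped back,
 * -1 if the CRC matches no single-bit error.
 */
static int correct_single_bit(uint8_t *cw, size_t len)
{
    uint16_t syndrome = crc16(cw, len);
    uint16_t s = 0x1021u; /* CRC produced by an error in the very last bit */

    assert(cw != NULL && len * 8u <= 32767u); /* stay within the CRC period */
    if (syndrome == 0u) {
        return 0;
    }
    for (size_t k = 0; k < len * 8u; k++) { /* k = bit index counted from the end */
        if (s == syndrome) {
            cw[len - 1u - k / 8u] ^= (uint8_t)(1u << (k % 8u));
            return 1;
        }
        /* advance to the CRC of an error one bit closer to the start */
        s = (s & 0x8000u) ? (uint16_t)((s << 1) ^ 0x1021u) : (uint16_t)(s << 1);
    }
    return -1;
}
```

The routine cannot tell a single-bit error from a multi-bit error that happens to produce the same CRC, so it is only safe when single errors really are the only expected failure.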
Another option is to compute the CRC, then reverse cycle the CRC until only the least significant bit is 1 while the rest are 0. This can be expanded to handle single burst correction, reverse cycling the CRC until more than half of the most significant bits are 0, depending on the CRC and message length.
http://www2.hawaii.edu/~tmandel/papers/CRCBurst.pdf
The algorithm in CRCBurst.pdf is similar to reverse cycling a CRC, except that it requires the least significant bit to be 1, which is a problem for a short burst at the beginning of a message. With reverse cycling, the CRC can be cycled backwards until the least significant bit is 1, and the leading bits of the CRC that correspond to bits that would precede the message are ignored.
There is a 32 bit CRC that can correct up to 3 error bits for a message with 1024 bits (992 data, 32 CRC), but the table is huge (1.4 GB):
https://stackoverflow.com/a/62201417/3282056
Link to example code:
https://github.com/jeffareid/misc/blob/master/crccor3.c
An error correcting BCH code could be used for 992 bits of data and 30 parity bits for 3 error bit correction.
A CRC is not an error-correcting code, and does not have the information required in general to locate the error, even if you assume that there is only one bit in error. You don't even know if the bit in error is in the message or in the CRC. A CRC is an error-detecting code.
If you have a short enough message, there are ways to locate where the error may be. See https://stackoverflow.com/a/6169837/1180620
There are many error-correcting codes you could choose from. Reed-Solomon codes are commonly used, and can be tuned to your application with the choice of n and k.
Cyclic redundancy checks are used often and work well when configured properly. The ITU's ("CCITT") CRC gets used a lot.
Ref - CRC16-CCITT Reference - Joe Geluso
Why are the ITU's CRC values used so frequently? They seem to be a common default, and I'm curious as to why.
Polynomial 0x11021 is used for floppy disks. Part of the reason for choosing that polynomial is that there are only three 1 bits in 0x1021, which simplifies hardware based CRC calculations. This is also true for 0x10007 (FOP-16) and 0x14003 (CRC16, CRC16-IBM), so I'm not sure why 0x11021 was chosen versus the other two somewhat common ones with only three 1 bits in the lower 16 bits.
0x11021 is also used for XMODEM (a serial file transfer program for old computers), which is typically implemented in software, where the number of 1 bits in the polynomial doesn't matter, but may have been chosen since it was used for floppy disks.
0x11021 is the product of two prime polynomials: 0xf01f and 0x3. The 0x3 factor (x+1) detects any odd number of bit errors, and the polynomial's 2-bit error detection is good for up to 32751 data bits + 16 CRC bits = 32767 bits, which is enough for floppy disk sector sizes of 128, 256, 512, and 1024 bytes (it could also be used for 2048 bytes, but I don't recall a floppy disk with a 2048-byte sector size). I'm not aware of any advantage in the choice of polynomial for single burst error detection. Some polynomials would be better for single burst error correction, but single burst correction isn't common.
The two other polynomials I mentioned are similar, 0x10007 = 0xfffd * 0x3 , 0x14003 = 0xc001 * 0x3.
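These factorizations can be checked with carry-less multiplication, that is, polynomial multiplication over GF(2) where addition is XOR. A small sketch in C; `gf2_mul` is an illustrative helper, not a library function.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two GF(2) polynomials given as bit masks (bit i = coefficient of x^i). */
static uint32_t gf2_mul(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    assert(a <= 0xFFFFu && b <= 0xFFFFu); /* keeps the product within 32 bits */
    while (b != 0u) {
        if (b & 1u) {
            product ^= a; /* add (XOR) the shifted copy of a */
        }
        a <<= 1;
        b >>= 1;
    }
    return product;
}
```

For example, `gf2_mul(0xf01f, 0x3)` is `0xf01f ^ (0xf01f << 1)`, which gives 0x11021; the same call with 0xfffd and 0xc001 gives 0x10007 and 0x14003.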
I've seen 8-bit, 16-bit, and 32-bit CRCs.
At what point do I need to jump to a wider CRC?
My gut reaction is that it is based on the data length:
1-100 bytes: 8-bit CRC
101 - 1000 bytes: 16-bit CRC
1001 - ??? bytes: 32-bit CRC
EDIT:
Looking at the Wikipedia page about CRC and Lott's answer, here's what we have:
<64 bytes: 8-bit CRC
<16K bytes: 16-bit CRC
<512M bytes: 32-bit CRC
It's not a research topic. It's really well understood: http://en.wikipedia.org/wiki/Cyclic_redundancy_check
The math is pretty simple. An 8-bit CRC boils all messages down to one of 256 values. If your message is more than a few bytes long, the possibility of multiple messages having the same hash value goes up higher and higher.
A 16-bit CRC, similarly, gives you one of the 65,536 available hash values. What are the odds of any two messages having one of these values?
A 32-bit CRC gives you about 4 billion available hash values.
From the Wikipedia article: "maximal total blocklength is equal to 2**r − 1". That's in bits. For an 8-bit CRC, 2**8 − 1 is 255 bits, or about 32 bytes. Beyond that length, a CRC-8 can no longer guarantee that every two-bit error is detected.
The effectiveness of a CRC is dependent on multiple factors. You not only need to select the SIZE of the CRC but also the GENERATING POLYNOMIAL to use. There are complicated and non-intuitive trade-offs depending on:
The expected bit error rate of the channel.
Whether the errors tend to occur in bursts or tend to be spread out (burst is common)
The length of the data to be protected - maximum length, minimum length and distribution.
The paper Cyclic Redundancy Code Polynomial Selection For Embedded Networks, by Philip Koopman and Tridib Chakravarty, published in the proceedings of the 2004 International Conference on Dependable Systems and Networks, gives a very good overview and makes several recommendations. It also provides a bibliography for further understanding.
http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
The choice of CRC length versus file size is mainly relevant in cases where one is more likely to have an input which differs from the "correct" input by three or fewer bits than to have one which is massively different. Given two inputs which are massively different, the possibility of a false match will be about 1/256 with most forms of 8-bit check value (including CRC), 1/65536 with most forms of 16-bit check value (including CRC), etc. The advantage of CRC comes from its treatment of inputs which are very similar.
With an 8-bit CRC whose polynomial generates two cycles of length 127, the fraction of single, double, or triple bit errors that go undetected in a packet no longer than that won't be 1/256; it will be zero. Likewise with a 16-bit CRC of period 32767, using packets of 32767 bits or less.
If packets are longer than the CRC period, however, then a double-bit error will go undetected if the distance between the erroneous bits is a multiple of the CRC period. While that might not seem like a terribly likely scenario, a CRC8 will be somewhat worse at catching double-bit errors in long packets than at catching "packet is totally scrambled" errors. If double-bit errors are the second most common failure mode (after single-bit errors), that would be bad. If anything that corrupts some data is likely to corrupt a lot of it, however, the inferior behavior of CRCs with double-bit errors may be a non-issue.
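The effect can be reproduced with a short experiment. The sketch below, in C, assumes the CRC-8 polynomial 0x07 (x^8 + x^2 + x + 1) with a zero initial value; the answer does not name a polynomial, so this choice and the helper names are only for illustration. It computes the period of the polynomial and then flips two bits a given distance apart in a 32-byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLY8 0x07u /* x^8 + x^2 + x + 1 (assumed example polynomial) */

/* Bitwise CRC-8, initial value 0, no reflection. */
static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ POLY8) : (uint8_t)(crc << 1);
        }
    }
    return crc;
}

/* Smallest n > 0 with x^n = 1 modulo the polynomial: the CRC period in bits. */
static unsigned crc8_period(void)
{
    uint8_t s = 1;
    unsigned n = 0;
    do {
        s = (s & 0x80u) ? (uint8_t)((s << 1) ^ POLY8) : (uint8_t)(s << 1);
        n++;
    } while (s != 1u);
    return n;
}

/* Flip the first bit and the bit `distance` positions later; 1 if the CRC changes. */
static int double_error_detected(unsigned distance)
{
    uint8_t buf[32];
    uint8_t before;

    assert(distance > 0u && distance < 8u * sizeof buf);
    for (size_t i = 0; i < sizeof buf; i++) {
        buf[i] = (uint8_t)(i * 37u + 1u); /* arbitrary test data */
    }
    before = crc8(buf, sizeof buf);
    buf[0] ^= 0x80u;
    buf[distance / 8u] ^= (uint8_t)(0x80u >> (distance % 8u));
    return crc8(buf, sizeof buf) != before;
}
```

For this polynomial `crc8_period()` returns 127, and `double_error_detected` reports a miss exactly when the distance is a multiple of 127, for example 127 or 254 bits.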
I think the size of the CRC has more to do with how unique a CRC you need than with the size of the input data. This is related to the particular usage and the number of items on which you're calculating a CRC.
The CRC should be chosen specifically for the length of the messages; it is not just a question of the size of the CRC: http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
Here is a nice "real world" evaluation of CRC-N
http://www.backplane.com/matt/crc64.html
I use CRC-32 and file-size comparison and have NEVER, in the billions of files checked, run into a collision where both the CRC-32 and the file size match. But I know a few exist, even when they are not purposely forced to exist (by hacks or exploits).
When doing a comparison, you should ALSO be checking the data sizes. You will rarely have a collision between items of the same data size with a matching CRC.
Purposely manipulated data, made to fake a match, is usually produced by adding extra data until the CRC matches a target. However, that results in a data size that no longer matches. Attempting to brute-force, or to cycle through random or sequential data of exactly the same size, would leave a very low collision rate.
You can also have collisions within the same data size, simply because a fixed-width check value can take only a limited number of distinct values.
The point at which you would want to think about going larger is when you start to see many collisions that cannot be "confirmed" as "originals" (when both have the same data size and, when tested backwards, a matching CRC: reversed bytes, reversed bits, or bit offsets).
In any event, it should NEVER be used as the ONLY form of comparison, just for a quick form of comparison, for indexing.
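A sketch of that layering in C: size and CRC-32 act as cheap filters, and a full byte comparison gives the final answer. The CRC-32 here is the common reflected variant with polynomial 0xEDB88320 (as used by zlib and PNG); the `blob` struct and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32, reflected polynomial 0xEDB88320, init and final XOR 0xFFFFFFFF. */
static uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Hypothetical index record: the data plus its precomputed size and CRC. */
typedef struct {
    const uint8_t *bytes;
    size_t size;
    uint32_t crc;
} blob;

static blob make_blob(const uint8_t *bytes, size_t size)
{
    blob b = { bytes, size, crc32_bitwise(bytes, size) };
    return b;
}

/* Cheap checks first; only candidates that pass both are compared byte by byte. */
static bool blobs_equal(const blob *a, const blob *b)
{
    if (a->size != b->size) {
        return false;
    }
    if (a->crc != b->crc) {
        return false;
    }
    return memcmp(a->bytes, b->bytes, a->size) == 0;
}
```

In an index, `size` and `crc` would be stored once per item, so most non-matching pairs are rejected without touching the data.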
You can use a CRC-8 to index the whole internet and divide everything into one of N categories. You WANT those collisions. Now, with those pre-sorted, you only have to check one of N directories, looking for "file-size", or "reverse-CRC", or whatever other comparison you can do on that smaller data set, fast...
Doing a CRC-32 forwards and backwards on the same blob of data is more reliable than using CRC-64 in just one direction. (Or an MD5, for that matter.)
You can detect a single bit error with a CRC in any size packet. Detecting double bit errors or correcting single bit errors is limited by the number of distinct values the CRC can take: for 8 bits, that would be 256; for 16 bits, 65536; in general, 2^n. In practice, though, CRCs take on fewer distinct values for single bit errors. For example, what I call the 'Y5' polynomial, 0x5935, only takes on up to 256 different values before they repeat going back farther, but on the other hand it is able to correct double bit errors over that distance, which is 30 bytes plus 2 bytes for errors in the CRC itself.
The number of bits you can correct with forward error correction is also limited by the Hamming Distance of the polynomial. For example, if the Hamming distance is three, you have to flip three bits to change from a set of bits that represents one valid message with matching CRC to another valid message with its own matching CRC. If that is the case, you can correct one bit with confidence. If the Hamming distance were 5, you could correct two bits. But when correcting multiple bits, you are effectively indexing multiple positions, so you need twice as many bits to represent the indexes of two corrected bits rather than one.
With forward error correction, you calculate the CRC on a packet and CRC together, and get a residual value. A good message with zero errors will always have the expected residual value (zero unless a final XOR is applied to the CRC), and each bit position of error has a unique residual value, so use it to identify the position. If you ever get a CRC result with that residual, you know which bit (or bits) to flip to correct the error.