Guidelines for choosing a CRC polynomial for a given message

I'm trying to write code that performs a cyclic redundancy check (CRC) on an input message with a given polynomial.
For specific examples, I've noticed that with some polynomials the CRC passes as if there were no error, even when the message that reaches the receiver is wrong.
What are the general guidelines for choosing a polynomial that will detect errors, and what factors determine this (e.g. does it depend on the message size, does it have anything to do with parity, do longer polynomials catch more errors)?
For example, I am given the message 1101 with the polynomial 10, where the CRC is generated according to even parity.
First, I perform binary long division and get the remainder as 0.
Then, I append it to the message and send it as 11010
The problem is on the receiver side: even if the received message is wrong, the CRC will not detect the error, since the last digit 0 is always divisible by 2 regardless of the preceding bits. For example, 11110, 10000, etc. will go undetected.
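Here is a minimal Python sketch of the long-division procedure described above (the function crc_remainder and the sample messages are made up for illustration). It shows that with the divisor 10 the appended check bit is always 0, so the corrupted words above slip through, while the one-bit divisor 11 (a parity bit) catches a single-bit change:

    # Sketch only: GF(2) long division as described in the question.
    def crc_remainder(message: str, divisor: str) -> str:
        """Append len(divisor)-1 zero bits, divide by XOR, return the remainder."""
        n = len(divisor) - 1                 # number of CRC bits
        bits = list(message + "0" * n)       # message shifted left by n
        for i in range(len(message)):
            if bits[i] == "1":               # divide only where the leading bit is 1
                for j, d in enumerate(divisor):
                    bits[i + j] = str(int(bits[i + j]) ^ int(d))
        return "".join(bits[len(message):])

    # Divisor "10" (the polynomial x): the check bit is always 0, so the corrupted
    # data word 1111 produces the same check bit as the original 1101.
    print(crc_remainder("1101", "10"))       # '0' -> transmit 11010
    print(crc_remainder("1111", "10"))       # '0' -> corrupted 11110 still passes

    # Divisor "11" (x + 1, i.e. even parity): the same corruption is detected,
    # because the recomputed check bit no longer matches the transmitted one.
    print(crc_remainder("1101", "11"))       # '1' -> transmit 11011
    print(crc_remainder("1111", "11"))       # '0' != the transmitted check bit 1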

If by "polynomial 10" you mean the polynomial x, then that is not a valid CRC polynomial. A CRC polynomial must always end with a 1. The one-bit CRC polynomial is x+1, or 11 in your notation. x gives you a zero-bit CRC!
As for guidelines on choosing a polynomial, look at Koopman's research and resulting good performance CRCs for various message lengths.

Primitivity of CRC polynomials

I found the following on the web about the performance of CRCs:
Primitive polynomial. This has optimal length for HD=3, and good HD=2 performance above that length.
I don't get it. Optimal length for HD=3 is understandable; but what does good HD=2 performance mean? AFAIK all CRCs have infinite data length at HD=2.
So what does "good HD=2 performance above that length" for primitive polynomials mean?
... has optimal length for HD=3, and good HD=2 performance above that length.
The statement is poorly worded. I find it at the bottom of this web page under "Notation:"
https://users.ece.cmu.edu/~koopman/crc
In this and other articles I have found, the abbreviation "HD" stands for the minimum Hamming distance of a CRC: if HD=k+1, the CRC can detect any pattern of k bit errors in a message up to some length (as shown in the tables). As you stated, "all CRCs have infinite data length at HD=2".
The usage of the phrase "good HD=2 performance above that length" is confusing. The web site above links to the web site below which includes the statement "HD=2 lengths are always infinite, and so are always left out of this list."
https://users.ece.cmu.edu/~koopman/crc/notes.html
The Wikipedia article on Hamming distance explains the relationship between bit error detection and Hamming distance: "a code C is said to be k error detecting if, and only if, the minimum Hamming distance between any two of its codewords is at least k+1".
As you stated, "all CRCs have infinite data length at HD=2", meaning all CRCs can detect any single bit error regardless of message length.
As for "optimal length for HD=3", which means being able to detect a 2 bit error, consider a linear feedback shift register based on the CRC polynomial, initialized with any non-zero value, if you cycle the register enough times, it will end up back with that initial value. For a n bit CRC based on a n+1 bit primitive polynomial, the register will cycle through all 2^n - 1 non-zero values before repeating. The maximum length of the message (which is the length of data plus the length of CRC) where no failure to detect a 2 bit error can occur is 2^n - 1. For a message of length 2^n or greater, then for any "i", if bit[0+i] and bit[(2^n)-1+i] are in error, the primitive CRC will fail to detect the 2 bit error. If the CRC polynomial is not primitive, the the maximum length for failproof 2 error bit detection will be decreased, and not "optimal".
For an LFSR based on any CRC polynomial, initialized to any non-zero value, no matter how many times it is cycled, it will never reach a value of zero. This is one way to explain why "all CRCs have infinite data length at HD=2" (i.e., are able to detect any single-bit error).
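The LFSR argument above can be checked directly. Below is a small Python sketch (the function lfsr_period and the example polynomials are mine, for illustration): it steps a Galois-style LFSR, which multiplies the state by x modulo the polynomial, and counts how many steps it takes to return to the starting value.

    def lfsr_period(poly: int, degree: int, seed: int = 1) -> int:
        """Count steps of a Galois LFSR (state * x mod poly) until the seed reappears."""
        mask = (1 << degree) - 1
        state, steps = seed, 0
        while True:
            carry = state >> (degree - 1)    # bit about to be shifted out
            state = (state << 1) & mask
            if carry:
                state ^= poly & mask         # reduce modulo the polynomial
            steps += 1
            if state == seed:
                return steps

    # x^8 + x^4 + x^3 + x^2 + 1 (0x11D), a commonly used primitive degree-8 polynomial:
    print(lfsr_period(0x11D, 8))             # 255 == 2^8 - 1, the full period
    # x^8 + 1 (0x101) is not primitive, so its period is far shorter:
    print(lfsr_period(0x101, 8))             # 8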
The Author said: "Generally the primitive polynomials tend to have pretty good (i.e., low) weights at HD=2 compared to many other polynomials. It has been a while since I looked, but I think in all cases right above the HD=2 break point the lowest weight polynomial was always primitive."
For some algorithm implementations, a low weight may allow faster computation.

CRC Procedure - Checking Efficiently

Suppose we have an m-bit message where the last n bits are the CRC bits. As far as I know, in order to check whether it was received correctly, we should XOR all m bits with the polynomial of the specific CRC algorithm. If the result is all zeros, we can say there are no errors.
Here are my questions:
1) What about calculating the n CRC bits using the first (m-n) bits and then comparing them to the last n bits of the received message? This way we can say there are no errors if the received and calculated n bits are equal. Is this approach correct?
2) If it is true, which is more efficient?
Your description of how to check a CRC doesn't really parse. But anyway, yes, the way that a CRC check is normally done is to calculate the CRC of the pre-CRC bits, and then to compare that to the CRC sent. It is very marginally more efficient that way. More importantly, it is more easily verifiable to be correct, since that is the way the CRC is generated and appended on the other end.
That method also extends to any style of check value, whereas other check values do not have the mathematical property of yielding zero if you run the algorithm over the check value after the data that precedes it. CRCs with pre- and post-conditioning, which is most of them, won't have that property either: you would need to undo the post-conditioning and then compare the result with the pre-condition value.
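For a plain CRC with no pre/post-conditioning, the two checking styles really do agree; here is a hedged Python sketch (crc8 with polynomial 0x07 and zero initial value is just an example, not tied to any particular protocol):

    def crc8(data: bytes, poly: int = 0x07) -> int:
        """Bitwise CRC-8: initial value 0, no reflection, no final XOR."""
        crc = 0
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
        return crc

    message = b"hello world"
    sent = message + bytes([crc8(message)])   # append the CRC byte

    # Style 1: recompute the CRC of the data and compare with the received CRC byte.
    ok1 = crc8(sent[:-1]) == sent[-1]
    # Style 2: run the CRC over data + CRC and check for a zero remainder.
    # (This only works here because there is no pre/post-conditioning.)
    ok2 = crc8(sent) == 0
    print(ok1, ok2)                           # True True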

Data Length vs CRC Length

I've seen 8-bit, 16-bit, and 32-bit CRCs.
At what point do I need to jump to a wider CRC?
My gut reaction is that it is based on the data length:
1-100 bytes: 8-bit CRC
101 - 1000 bytes: 16-bit CRC
1001 - ??? bytes: 32-bit CRC
EDIT:
Looking at the Wikipedia page about CRC and Lott's answer, here's what we have:
<64 bytes: 8-bit CRC
<16K bytes: 16-bit CRC
<512M bytes: 32-bit CRC
It's not a research topic. It's really well understood: http://en.wikipedia.org/wiki/Cyclic_redundancy_check
The math is pretty simple. An 8-bit CRC boils all messages down to one of 256 values. If your message is more than a few bytes long, the probability of multiple messages having the same hash value gets higher and higher.
A 16-bit CRC, similarly, gives you one of the 65,536 available hash values. What are the odds of any two messages having one of these values?
A 32-bit CRC gives you about 4 billion available hash values.
From the wikipedia article: "maximal total blocklength is equal to 2**r − 1". That's in bits. You don't need to do much research to see that 2**9 - 1 is 511 bits. Using CRC-8, multiple messages longer than 64 bytes will have the same CRC checksum value.
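To put a rough number on the "what are the odds" question above, here is a small Python sketch. It treats an n-bit check value as a uniformly random hash (a simplification; CRCs only behave roughly this way for unrelated messages) and applies the birthday bound:

    import math

    def collision_probability(num_messages: int, crc_bits: int) -> float:
        """Probability that at least two of num_messages share a check value."""
        n = 2 ** crc_bits
        # P(no collision) = product over i of (1 - i/n), summed in log space for stability
        log_p = sum(math.log1p(-i / n) for i in range(num_messages))
        return 1.0 - math.exp(log_p)

    for bits in (8, 16, 32):
        print(bits, collision_probability(100, bits))
    # With just 100 messages: an 8-bit check is virtually certain to collide,
    # a 16-bit check collides roughly 7% of the time, a 32-bit check almost never.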
The effectiveness of a CRC is dependent on multiple factors. You not only need to select the SIZE of the CRC but also the GENERATING POLYNOMIAL to use. There are complicated and non-intuitive trade-offs depending on:
The expected bit error rate of the channel.
Whether the errors tend to occur in bursts or tend to be spread out (burst is common)
The length of the data to be protected - maximum length, minimum length and distribution.
The paper Cyclic Redundancy Code Polynomial Selection For Embedded Networks, by Philip Koopman and Tridib Chakravarty, published in the proceedings of the 2004 International Conference on Dependable Systems and Networks, gives a very good overview and makes several recommendations. It also provides a bibliography for further understanding.
http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
The choice of CRC length versus file size is mainly relevant in cases where one is more likely to have an input which differs from the "correct" input by three or fewer bits than one which is massively different. Given two inputs which are massively different, the probability of a false match will be about 1/256 with most forms of 8-bit check value (including CRC), 1/65536 with most forms of 16-bit check value (including CRC), etc. The advantage of a CRC comes from its treatment of inputs which are very similar.
With an 8-bit CRC whose polynomial generates two periods of length 128, the fraction of single, double, or triple bit errors in a packet shorter than that which go undetected won't be 1/256--it will be zero. Likewise with a 16-bit CRC of period 32768, using packets of 32768 bits or less.
If packets are longer than the CRC period, however, then a double-bit error will go undetected if the distance between the erroneous bits is a multiple of the CRC period. While that might not seem like a terribly likely scenario, a CRC8 will be somewhat worse at catching double-bit errors in long packets than at catching "packet is totally scrambled" errors. If double-bit errors are the second most common failure mode (after single-bit errors), that would be bad. If anything that corrupts some data is likely to corrupt a lot of it, however, the inferior behavior of CRCs with double-bit errors may be a non-issue.
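The period behaviour described above can be demonstrated directly. The Python sketch below is illustrative only: it uses the CRC-8 polynomial x^8 + x^2 + x + 1 (0x107) as an example, measures its period, and then flips two bits exactly one period apart to show that the CRC cannot tell the difference, while a different two-bit error is caught.

    import random

    def crc_bits(bits, poly, degree):
        """Plain bitwise CRC (MSB first): remainder of message(x) * x^degree mod poly."""
        reg, mask, low = 0, (1 << degree) - 1, poly & ((1 << degree) - 1)
        for b in bits:
            feedback = ((reg >> (degree - 1)) & 1) ^ b
            reg = (reg << 1) & mask
            if feedback:
                reg ^= low
        return reg

    def period(poly, degree):
        """Smallest P with x^P = 1 (mod poly): step state 1 by x until it returns."""
        mask, low = (1 << degree) - 1, poly & ((1 << degree) - 1)
        state, steps = 1, 0
        while True:
            carry = state >> (degree - 1)
            state = (state << 1) & mask
            if carry:
                state ^= low
            steps += 1
            if state == 1:
                return steps

    poly, degree = 0x107, 8                    # x^8 + x^2 + x + 1, an example CRC-8
    P = period(poly, degree)
    random.seed(0)
    msg = [random.randint(0, 1) for _ in range(P + 10)]

    bad = msg[:]
    bad[3] ^= 1                                # two flipped bits, one period apart:
    bad[3 + P] ^= 1                            # the CRC does not change
    near = msg[:]
    near[3] ^= 1                               # two flipped bits close together:
    near[4] ^= 1                               # the CRC does change
    print(P,
          crc_bits(bad, poly, degree) == crc_bits(msg, poly, degree),    # True (missed)
          crc_bits(near, poly, degree) == crc_bits(msg, poly, degree))   # False (caught)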
I think the size of the CRC has more to do with how unique a CRC you need than with the size of the input data. This is related to the particular usage and the number of items on which you're calculating a CRC.
The CRC should be chosen specifically for the length of the messages; it is not just a question of the size of the CRC: http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
Here is a nice "real world" evaluation of CRC-N
http://www.backplane.com/matt/crc64.html
I use CRC-32 and file-size comparison and have NEVER, in the billions of files checked, run into a matching CRC-32 and File-Size collision. But I know a few exist, when not purposely forced to exist. (Hacked tricks/exploits)
When doing comparisons, you should ALSO be checking data sizes. You will rarely have a collision where both the data size and the CRC match.
Purposely manipulating data to fake a match is usually done by adding extra data until the CRC matches a target. However, that results in a data size that no longer matches. Attempting to brute-force, or cycle through random or sequential data of the same exact size, would leave a really narrow collision rate.
You can also have collisions within the same data size, just from the generic limits of the formulas used and the constraints of working with bits/bytes and base-ten systems, which depend on floating-point values that get truncated and clipped.
The point at which you would want to think about going larger is when you start to see many collisions which cannot be "confirmed" as "originals" (when they both have the same data size and, when tested backwards, a matching CRC: reversed bytes, reversed bits, or bit offsets).
In any event, it should NEVER be used as the ONLY form of comparison, just as a quick form of comparison for indexing.
You can use a CRC-8 to index the whole internet and divide everything into one of N categories. You WANT those collisions. Now, with those pre-sorted, you only have to check one of N directories, looking for file size, reverse CRC, or whatever other comparison you can do on that smaller data set, fast...
Doing a CRC-32 forwards and backwards on the same blob of data is more reliable than using CRC-64 in just one direction. (Or an MD5, for that matter.)
You can detect a single-bit error with a CRC in any size packet. Detecting double-bit errors, or correcting single-bit errors, is limited by the number of distinct values the CRC can take: for 8 bits that would be 256, for 16 bits 65,536, and in general 2^n. In practice, though, CRCs actually take on fewer distinct values for single-bit errors. For example, what I call the 'Y5' polynomial, 0x5935, only takes on up to 256 different values before they repeat going back farther, but on the other hand it is able to correct double-bit errors within that distance, which is 30 bytes plus 2 bytes for errors in the CRC itself.
The number of bits you can correct with forward error correction is also limited by the Hamming Distance of the polynomial. For example, if the Hamming distance is three, you have to flip three bits to change from a set of bits that represents one valid message with matching CRC to another valid message with its own matching CRC. If that is the case, you can correct one bit with confidence. If the Hamming distance were 5, you could correct two bits. But when correcting multiple bits, you are effectively indexing multiple positions, so you need twice as many bits to represent the indexes of two corrected bits rather than one.
With forward error correction, you calculate the CRC over the packet and its appended CRC together, and get a residual value. A good message with zero errors always produces the expected residual (zero, unless there's a nonzero initial value for the CRC register), and each bit position of error has a unique residual value, so you can use it to identify the position. If you get a CRC result with that residual, you know which bit (or bits) to flip to correct the error.
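Below is a hedged Python sketch of that residual-lookup idea (the plain CRC-8 with polynomial 0x07 and zero initial value is just an example; real protocols usually add pre/post-conditioning, which would change the expected residual):

    def crc8(data: bytes, poly: int = 0x07) -> int:
        """Plain bitwise CRC-8: initial value 0, no reflection, no final XOR."""
        crc = 0
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
        return crc

    def build_syndrome_table(packet_len: int) -> dict:
        """Residual of every possible single-bit error in a packet of packet_len bytes."""
        table = {}
        for bit in range(packet_len * 8):
            err = bytearray(packet_len)
            err[bit // 8] ^= 0x80 >> (bit % 8)    # one flipped bit, everything else zero
            table[crc8(bytes(err))] = bit         # the CRC is linear, so this is the residual
        return table

    message = b"hello"
    packet = bytearray(message + bytes([crc8(message)]))
    table = build_syndrome_table(len(packet))

    packet[2] ^= 0x10                             # corrupt one bit "in transit"
    residual = crc8(bytes(packet))                # nonzero residual signals an error
    if residual in table:
        bit = table[residual]
        packet[bit // 8] ^= 0x80 >> (bit % 8)     # flip the indicated bit back
    print(bytes(packet[:-1]))                     # b'hello' again

This only works while the packet is short enough that every single-bit residual is distinct (i.e. within the polynomial's period), which is exactly the limit described above.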

How to correct a message using Hamming Code

So I want to work on this summer project to correct errors in a message transmission using Hamming Code, but I cannot figure out how it really works. I've read many articles online, but I don't really understand the algorithm. Can anybody explain it in simple terms?
Thanks.
It's all about Hamming distance.
The Hamming distance between two base-2 values is the number of bits at which they differ. So if you transmit A, but I receive B, then the number of bits which must have been switched in transmission is the Hamming distance between A and B.
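As a tiny illustration (Python, values chosen arbitrarily), the Hamming distance between two words is just the number of set bits in their XOR:

    def hamming_distance(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    print(hamming_distance(0b1101, 0b1001))   # 1: one bit was flipped in transmission
    print(hamming_distance(0b1101, 0b0110))   # 3: three bits differ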
Hamming codes are useful when the bits in each code word are transmitted somehow separately. We don't care whether they're serial or parallel, but they aren't for instance combined into an analogue value representing several bits, or compressed/encrypted after encoding.
Thus each bit is, independently and at random with some fixed probability, either received correctly or flipped. Assuming the transmission is fairly reliable, most bits are received correctly. So errors in a small number of bits are more likely, and simultaneous errors in large numbers of bits are unlikely.
So, a Hamming code usually aims to correct 1-bit errors, and/or to detect 2-bit errors (see the Wikipedia article for details of the two main types). Codes which correct/detect bigger errors can be constructed, but AFAIK aren't used as much.
The code works by evenly spacing out the code points in "Hamming space", which in mathematical terms is the metric space consisting of all values of the relevant word size, with Hamming distance as the metric. Imagine that each code point is surrounded by a little "buffer zone" of invalid values. If a value is received that isn't a code point, then an error must have occurred, because only valid code points are ever transmitted.
If a value in the buffer zone is received, then on the assumption that a 1-bit error occurred, the value which was transmitted must be at distance 1 from the value received. But because the code points are spread out, there is only one code point that close. So it is "corrected" to that code point, on the grounds that a 1-bit error is more likely than the larger error that would be needed for any other code point to produce the value received. In probability terms, the conditional probability that you sent the nearby code point is greater than the conditional probability that you sent any other code point, given that I received the value I did. So I guess that you sent the nearby one, with a confidence that depends on the reliability of the transmission and the number of bits in each word.
If an invalid value is received which is equidistant from two code points, then I can't say that one is more likely to be the true value than the other. So I detect the error, but I can't correct it.
Obviously 3-bit errors are not corrected by a SECDED Hamming code: the received value is then closer to some other code point than to the value you actually sent, and I erroneously "correct" it to the wrong value. So you either need transmission reliable enough that you don't care about them, or else you need higher-level error detection as well (for example, a CRC over an entire message).
Specifically from Wikipedia, the algorithm is as follows:
Number the bits starting from 1: bit 1, 2, 3, 4, 5, etc.
Write the bit numbers in binary. 1, 10, 11, 100, 101, etc.
All bit positions that are powers of two (have only one 1 bit in the binary form of their position) are parity bits.
All other bit positions, with two or more 1 bits in the binary form of their position, are data bits.
Each data bit is included in a unique set of 2 or more parity bits, as determined by the binary form of its bit position.
Parity bit 1 covers all bit positions which have the least significant bit set: bit 1 (the parity bit itself), 3, 5, 7, 9, etc.
Parity bit 2 covers all bit positions which have the second least significant bit set: bit 2 (the parity bit itself), 3, 6, 7, 10, 11, etc.
Parity bit 4 covers all bit positions which have the third least significant bit set: bits 4–7, 12–15, 20–23, etc.
Parity bit 8 covers all bit positions which have the fourth least significant bit set: bits 8–15, 24–31, 40–47, etc.
In general each parity bit covers all bits where the binary AND of the parity position and the bit position is non-zero.
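Here is a short Python sketch of the scheme listed above (written for clarity, not efficiency, and not taken from any particular library): parity bits sit at the power-of-two positions, each covering the positions whose binary index shares that bit, and the failing parity positions add up to the index of a single flipped bit.

    def hamming_encode(data_bits):
        """Place data bits at non-power-of-two positions, then fill in even parity bits."""
        d = list(data_bits)
        n = 0
        while True:                              # find the codeword length n
            n += 1
            parity_positions = [p for p in range(1, n + 1) if p & (p - 1) == 0]
            if n - len(parity_positions) == len(d):
                break
        code = [0] * (n + 1)                     # index 0 unused; positions are 1-based
        it = iter(d)
        for pos in range(1, n + 1):
            if pos & (pos - 1) != 0:             # not a power of two -> data bit
                code[pos] = next(it)
        for p in parity_positions:               # even parity over positions that AND with p
            code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
        return code[1:]

    def hamming_correct(code_bits):
        """Recompute parities; the sum of failing parity positions points at the bad bit."""
        code = [0] + list(code_bits)
        n = len(code) - 1
        syndrome = 0
        for p in (q for q in range(1, n + 1) if q & (q - 1) == 0):
            if sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2:
                syndrome += p
        if syndrome:
            code[syndrome] ^= 1                  # flip the indicated position
        return code[1:], syndrome

    word = hamming_encode([1, 0, 1, 1])          # Hamming(7,4) codeword for these 4 bits
    word[4] ^= 1                                 # corrupt one bit (position 5, 1-based)
    fixed, pos = hamming_correct(word)
    print(pos, fixed == hamming_encode([1, 0, 1, 1]))   # 5 True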
The wikipedia article explains it quite nicely.
If you don't understand a specific aspect of the algorithm, then you will need to rephrase (or detail) your question, so that someone can address your specific part of the problem.