Related
The algorithm to calculate a CRC involves dividing (mod 2) the data by a polynomial, and that, by nature starts at the biggest bit using the basic long division algorithm and works down (unless you're taking the shortcuts and using tables).
Now, the stream I'm dealing with has the requirements that the data is added little endian and the CRC remainder goes at the end, whereas if the CRC was applied and appended; the CRC remainder bits would appear at the leftmost point in the least significant bit given the bitstream is little endian.
So here's the question. We have a little endian stream with the CRC remainder at the "unexpected" end (correct me if I'm wrong please), should the CRC remainder be added big endian at the end of the bytestream, and then the CRC run on the whole bytestream (this is what I expect from the requirements) or something else?
How in industry is this normally done?
Major update for clarity thanks to Mark Adler's highly helpful answer.
I've read a few posts, but seen nothing where there seems to be a little endian bytestream with a CRC in the MSB (rightmost).
The image above should describe what I'm doing. All the bytes are big endian bit order, but the issue is that the requirements state that the bytes should be little endian byte ordered, and then the CRC tacked on the end.
For the bytestream as a sequence of bits to be validated by the CRC remainder being placed at the end, the CRC remainder bytes should be added bit endian, therefore allowing the message as a whole to be validated with the polynomial. However this involves adding bytes to the stream in a mix of endiannesses, which sounds highly ugly and wrong.
I will assume that by "biggest" bit, you mean most significant. First off, you can take either the most-significant bit or the least-significant bit of the first byte as the highest power of x for polynomial division. Both are in common use. There is no "by nature" here. And that has nothing to do with whether tables are used or not. Taking the least-significant bit as the highest power of x, the one you would call "not by nature" is in very common use, due to slightly faster and simpler software implementations as compared to using the most-significant bit.
Second, bit streams are neither "little endian", nor "big endian". Those terms are used for how integers are broken up into a sequence of bytes. That has nothing to do with the interpretation of a stream of bits as a polynomial. The terms you seem to be looking for are "reflected" and "not reflected" bit streams in and CRCs out. "reflected" means that the highest power of x is the least significant bit, and "not reflected" means it is the most significant bit.
If you look at Greg Cook's catalogue of CRCs, you will see as part of each definition refin=false refout=false or refin=true refout=true, meaning that the data coming in is reflected or not, and the CRC coming out is reflected or not, referring to where the highest power of x is found. For the CRC, the entire n-bits is reflected or not. In actual implementations, no bits are flipped for the input data or the output CRC. Instead, the constant CRC polynomial is reflected to match the data and CRC reflections. That is done once as the code is written, never during execution. (There is one outlier CRC in Greg's catalogue, CRC-12/UMTS, that has refin=false refout=true. For that one, the implementation would in fact have to reflect the CRC result every time.)
Given all that, I am left attempting to intepret your question. What do you mean by "the data is added little endian"? Does that mean the CRC is being calculated using the least-significant bit as the highest power of x (the opposite of your "by nature")? What does "the CRC remainder bits would appear at the leftmost point in the least significant bit given the bitstream is little endian" mean? That one is really confusing, since there is no leftmost point of a bit, and I can't tell at all what you're trying to say about the arrangement of the remainder bits.
The only thing I think I understand and can try to answer here is: "How in industry is this normally done?"
Well, as you can tell from the list of over a hundred CRCs, there is little normalcy established. What I can say is that CRCs have a special property that leads to a "natural" (now I can use that word) ordering of the CRC bits and bytes at the end of the stream of bits and bytes that the CRC was calculated on. That property is that if you append it properly, the CRC of the entire message, including the CRC at the end, will always be the same constant, if there are no errors in the message. Now little and big endian are useful terms, but only for the CRC itself, not the bit or byte stream. The proper order is little endian for reflected CRCs and big endian for non-reflected CRCs. (This assumes that the input and output have the same reflection, so this won't work for that one outlier CRC.)
Of course, I have seen many cases where a reflected CRC is used, but is appended to the stream big-endian, and vice versa, in which case this calculation of the CRC on the entire message doesn't work. That's ok, since the alternative way to check the CRC is to simply repeat what was done before transmission, which is to calculate the CRC only on the data portion of the message, then properly assemble the CRC from the bytes that follow it, and compare the two values. That is what would be done for any other hash that doesn't have that elegant mathematical property of CRCs.
Can anyone with good knowledge of CRC calculation verify that this code
https://github.com/psvanstrom/esphome-p1reader/blob/main/p1reader.h#L120
is actually calculating crc according to this description?
CRC is a CRC16 value calculated over the preceding characters in the data message (from
“/” to “!” using the polynomial: x16+x15+x2
+1). CRC16 uses no XOR in, no XOR out and is
computed with least significant bit first. The value is represented as 4 hexadecimal characters (MSB first).
There's nothing in the linked code about where it starts and ends, and how the result is eventually represented, but yes, that code implements that specification.
Are there CRC algorithms in which the remainder is checked for zero during the generation, and changed when that happens?
When calculating a CRC, if you start with an initial value (the CRC value, aka the remainder value) of 0, leading zeroes in the data being CRCd will not have an effect. So the CRC for "\0\0\0Hello" will be the same as "Hello".
https://xcore.github.io/doc_tips_and_tricks/crc.html#the-initial-value
This means that if, as the CRC is being calculated, the value becomes zero at some point, any zeroes immediately following that point will have no effect on the CRC. If one or more of the zeroes is lost, the CRC will not be changed.
I want to generate a CRC for some data, so that I can determine when a byte is lost. Is this an esoteric application of the CRC? Because when looking for examples of calculating a CRC, I have not found any examples where the CRC value is checked for 0 during the computation, and some non-zero bits fed in a that point. I would think that this would be the only way to be sure to detect a loss of one or more bits at a point in the data, if the CRC up to that point happened to be 0.
To detect and/or correct dropped bits, something like Marker Code or Watermark Code is needed.
https://link.springer.com/article/10.1007/BF03219806
https://ieeexplore.ieee.org/document/866775
Marker codes and Watermark codes are usually implemented in hardware, such as the read / write logic in a hard drive, and operate at the bit level and the media interface. Additional bits are intermixed with data bits to allow detection of lost synchronization or signal, which could otherwise lead to dropped or inserted bits. Here is a better link that describes this. The turbopaper.pdf file mentions LDPC, low density parity codes, which Wiki has an article for. Note that when receiving or reading data, the Marker | Watermark code is handled first at the hardware interface, and the LDPC (or CRC or Reed Solomon) is done after the hardware has detected no dropped or inserted bits.
http://www.inference.org.uk/ear23/turbopaper.pdf
https://en.wikipedia.org/wiki/Low-density_parity-check_code
If you did that, then it wouldn't be a CRC anymore. So, no. By definition there are no such CRC algorithms.
I've seen 8-bit, 16-bit, and 32-bit CRCs.
At what point do I need to jump to a wider CRC?
My gut reaction is that it is based on the data length:
1-100 bytes: 8-bit CRC
101 - 1000 bytes: 16-bit CRC
1001 - ??? bytes: 32-bit CRC
EDIT:
Looking at the Wikipedia page about CRC and Lott's answer, here' what we have:
<64 bytes: 8-bit CRC
<16K bytes: 16-bit CRC
<512M bytes: 32-bit CRC
It's not a research topic. It's really well understood: http://en.wikipedia.org/wiki/Cyclic_redundancy_check
The math is pretty simple. An 8-bit CRC boils all messages down to one of 256 values. If your message is more than a few bytes long, the possibility of multiple messages having the same hash value goes up higher and higher.
A 16-bit CRC, similarly, gives you one of the 65,536 available hash values. What are the odds of any two messages having one of these values?
A 32-bit CRC gives you about 4 billion available hash values.
From the wikipedia article: "maximal total blocklength is equal to 2**r − 1". That's in bits. You don't need to do much research to see that 2**9 - 1 is 511 bits. Using CRC-8, multiple messages longer than 64 bytes will have the same CRC checksum value.
The effectiveness of a CRC is dependent on multiple factors. You not only need to select the SIZE of the CRC but also the GENERATING POLYNOMIAL to use. There are complicated and non-intuitive trade-offs depending on:
The expected bit error rate of the channel.
Whether the errors tend to occur in bursts or tend to be spread out (burst is common)
The length of the data to be protected - maximum length, minimum length and distribution.
The paper Cyclic Redundancy Code Polynominal Selection For Embedded Networks, by Philip Koopman and Tridib Chakravarty, publised in the proceedings of the 2004 International Conference on Dependable Systems and Networks gives a very good overview and makes several recomendations. It also provides a bibliography for further understanding.
http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
The choice of CRC length versus file size is mainly relevant in cases where one is more likely to have an input which differs from the "correct" input by three or fewer bits than to have a one which is massively different. Given two inputs which are massively different, the possibility of a false match will be about 1/256 with most forms of 8-bit check value (including CRC), 1/65536 with most forms of 16-bit check value (including CRC), etc. The advantage of CRC comes from its treatment of inputs which are very similar.
With an 8-bit CRC whose polynomial generates two periods of length 128, the fraction of single, double, or triple bit errors in a packet shorter than that which go undetected won't be 1/256--it will be zero. Likewise with a 16-bit CRC of period 32768, using packets of 32768 bits or less.
If packets are longer than the CRC period, however, then a double-bit error will go undetected if the distance between the erroneous bits is a multiple of the CRC period. While that might not seem like a terribly likely scenario, a CRC8 will be somewhat worse at catching double-bit errors in long packets than at catching "packet is totally scrambled" errors. If double-bit errors are the second most common failure mode (after single-bit errors), that would be bad. If anything that corrupts some data is likely to corrupt a lot of it, however, the inferior behavior of CRCs with double-bit errors may be a non-issue.
I think the size of the CRC has more to do with how unique of a CRC you need instead of of the size of the input data. This is related to the particular usage and number of items on which you're calculating a CRC.
The CRC should be chosen specifically for the length of the messages, it is not just a question of the size of the CRC: http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
Here is a nice "real world" evaluation of CRC-N
http://www.backplane.com/matt/crc64.html
I use CRC-32 and file-size comparison and have NEVER, in the billions of files checked, run into a matching CRC-32 and File-Size collision. But I know a few exist, when not purposely forced to exist. (Hacked tricks/exploits)
When doing comparison, you should ALSO be checking "data-sizes". You will rarely have a collision of the same data-size, with a matching CRC, within the correct sizes.
Purposely manipulated data, to fake a match, is usually done by adding extra-data until the CRC matches a target. However, that results in a data-size that no-longer matches. Attempting to brute-force, or cycle through random, or sequential data, of the same exact size, would leave a real narrow collision-rate.
You can also have collisions within the data-size, just by the generic limits of the formulas used, and constraints of using bits/bytes and base-ten systems, which depends on floating-point values, which get truncated and clipped.
The point you would want to think about going larger, is when you start to see many collisions which can not be "confirmed" as "originals". (When they both have the same data-size, and (when tested backwards, they have a matching CRC. Reverse/byte or reverse/bits, or bit-offsets)
In any event, it should NEVER be used as the ONLY form of comparison, just for a quick form of comparison, for indexing.
You can use a CRC-8 to index the whole internet, and divide everything into one of N-catagories. You WANT those collisions. Now, with those pre-sorted, you only have to check one of N-directories, looking for "file-size", or "reverse-CRC", or whatever other comparison you can do to that smaller data-set, fast...
Doing a CRC-32 forwards and backwards on the same blob of data is more reliable than using CRC-64 in just one direction. (Or an MD5, for that matter.)
You can detect a single bit error with a CRC in any size packet. Detecting double bit errors or correction of single bit errors is limited to the number of distinct values the CRC can take, so for 8 bits, that would 256; for 16 bits, 65535; etc. 2^n; In practice, though, CRCs actually take on fewer distinct values for single bit errors. For example what I call the 'Y5' polynomial, the 0x5935 polynomial only takes on up to 256 different values before they repeat going back farther, but on the other hand it is able to correct double bit errors that distance, which is 30 bytes plus 2 bytes for errors in the CRC itself.
The number of bits you can correct with forward error correction is also limited by the Hamming Distance of the polynomial. For example, if the Hamming distance is three, you have to flip three bits to change from a set of bits that represents one valid message with matching CRC to another valid message with its own matching CRC. If that is the case, you can correct one bit with confidence. If the Hamming distance were 5, you could correct two bits. But when correcting multiple bits, you are effectively indexing multiple positions, so you need twice as many bits to represent the indexes of two corrected bits rather than one.
With forward error correction, you calculate the CRC on a packet and CRC together, and get a residual value. A good message with zero errors will always have the expected residual value (zero unless there's a nonzero initial value for the CRC register), and each bit position of error has a unique residual value, so use it to identify the position. If you ever get a CRC result with that residual, you know which bit (or bits) to flip to correct the error.
I've seen 8-bit, 16-bit, and 32-bit CRCs.
At what point do I need to jump to a wider CRC?
My gut reaction is that it is based on the data length:
1-100 bytes: 8-bit CRC
101 - 1000 bytes: 16-bit CRC
1001 - ??? bytes: 32-bit CRC
EDIT:
Looking at the Wikipedia page about CRC and Lott's answer, here' what we have:
<64 bytes: 8-bit CRC
<16K bytes: 16-bit CRC
<512M bytes: 32-bit CRC
It's not a research topic. It's really well understood: http://en.wikipedia.org/wiki/Cyclic_redundancy_check
The math is pretty simple. An 8-bit CRC boils all messages down to one of 256 values. If your message is more than a few bytes long, the possibility of multiple messages having the same hash value goes up higher and higher.
A 16-bit CRC, similarly, gives you one of the 65,536 available hash values. What are the odds of any two messages having one of these values?
A 32-bit CRC gives you about 4 billion available hash values.
From the wikipedia article: "maximal total blocklength is equal to 2**r − 1". That's in bits. You don't need to do much research to see that 2**9 - 1 is 511 bits. Using CRC-8, multiple messages longer than 64 bytes will have the same CRC checksum value.
The effectiveness of a CRC is dependent on multiple factors. You not only need to select the SIZE of the CRC but also the GENERATING POLYNOMIAL to use. There are complicated and non-intuitive trade-offs depending on:
The expected bit error rate of the channel.
Whether the errors tend to occur in bursts or tend to be spread out (burst is common)
The length of the data to be protected - maximum length, minimum length and distribution.
The paper Cyclic Redundancy Code Polynominal Selection For Embedded Networks, by Philip Koopman and Tridib Chakravarty, publised in the proceedings of the 2004 International Conference on Dependable Systems and Networks gives a very good overview and makes several recomendations. It also provides a bibliography for further understanding.
http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
The choice of CRC length versus file size is mainly relevant in cases where one is more likely to have an input which differs from the "correct" input by three or fewer bits than to have a one which is massively different. Given two inputs which are massively different, the possibility of a false match will be about 1/256 with most forms of 8-bit check value (including CRC), 1/65536 with most forms of 16-bit check value (including CRC), etc. The advantage of CRC comes from its treatment of inputs which are very similar.
With an 8-bit CRC whose polynomial generates two periods of length 128, the fraction of single, double, or triple bit errors in a packet shorter than that which go undetected won't be 1/256--it will be zero. Likewise with a 16-bit CRC of period 32768, using packets of 32768 bits or less.
If packets are longer than the CRC period, however, then a double-bit error will go undetected if the distance between the erroneous bits is a multiple of the CRC period. While that might not seem like a terribly likely scenario, a CRC8 will be somewhat worse at catching double-bit errors in long packets than at catching "packet is totally scrambled" errors. If double-bit errors are the second most common failure mode (after single-bit errors), that would be bad. If anything that corrupts some data is likely to corrupt a lot of it, however, the inferior behavior of CRCs with double-bit errors may be a non-issue.
I think the size of the CRC has more to do with how unique of a CRC you need instead of of the size of the input data. This is related to the particular usage and number of items on which you're calculating a CRC.
The CRC should be chosen specifically for the length of the messages, it is not just a question of the size of the CRC: http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
Here is a nice "real world" evaluation of CRC-N
http://www.backplane.com/matt/crc64.html
I use CRC-32 and file-size comparison and have NEVER, in the billions of files checked, run into a matching CRC-32 and File-Size collision. But I know a few exist, when not purposely forced to exist. (Hacked tricks/exploits)
When doing comparison, you should ALSO be checking "data-sizes". You will rarely have a collision of the same data-size, with a matching CRC, within the correct sizes.
Purposely manipulated data, to fake a match, is usually done by adding extra-data until the CRC matches a target. However, that results in a data-size that no-longer matches. Attempting to brute-force, or cycle through random, or sequential data, of the same exact size, would leave a real narrow collision-rate.
You can also have collisions within the data-size, just by the generic limits of the formulas used, and constraints of using bits/bytes and base-ten systems, which depends on floating-point values, which get truncated and clipped.
The point you would want to think about going larger, is when you start to see many collisions which can not be "confirmed" as "originals". (When they both have the same data-size, and (when tested backwards, they have a matching CRC. Reverse/byte or reverse/bits, or bit-offsets)
In any event, it should NEVER be used as the ONLY form of comparison, just for a quick form of comparison, for indexing.
You can use a CRC-8 to index the whole internet, and divide everything into one of N-catagories. You WANT those collisions. Now, with those pre-sorted, you only have to check one of N-directories, looking for "file-size", or "reverse-CRC", or whatever other comparison you can do to that smaller data-set, fast...
Doing a CRC-32 forwards and backwards on the same blob of data is more reliable than using CRC-64 in just one direction. (Or an MD5, for that matter.)
You can detect a single bit error with a CRC in any size packet. Detecting double bit errors or correction of single bit errors is limited to the number of distinct values the CRC can take, so for 8 bits, that would 256; for 16 bits, 65535; etc. 2^n; In practice, though, CRCs actually take on fewer distinct values for single bit errors. For example what I call the 'Y5' polynomial, the 0x5935 polynomial only takes on up to 256 different values before they repeat going back farther, but on the other hand it is able to correct double bit errors that distance, which is 30 bytes plus 2 bytes for errors in the CRC itself.
The number of bits you can correct with forward error correction is also limited by the Hamming Distance of the polynomial. For example, if the Hamming distance is three, you have to flip three bits to change from a set of bits that represents one valid message with matching CRC to another valid message with its own matching CRC. If that is the case, you can correct one bit with confidence. If the Hamming distance were 5, you could correct two bits. But when correcting multiple bits, you are effectively indexing multiple positions, so you need twice as many bits to represent the indexes of two corrected bits rather than one.
With forward error correction, you calculate the CRC on a packet and CRC together, and get a residual value. A good message with zero errors will always have the expected residual value (zero unless there's a nonzero initial value for the CRC register), and each bit position of error has a unique residual value, so use it to identify the position. If you ever get a CRC result with that residual, you know which bit (or bits) to flip to correct the error.