How to find the Hamming distance of a code generated by a certain CRC?
Assume that I have a generating polynomial of order, say, 4 and 11 bits of data.
How do I compute the HD based only on this information?
You should be able to pad your results with zeros, making both values 11 bits long. XORing the two bit strings and counting the ones should yield the Hamming distance for your data set.
Hope this helps...
I found the following on the web about the performance of CRCs:
Primitive polynomial. This has optimal length for HD=3, and good HD=2
performance above that length.
I don't get it. Optimal length for HD=3 is understandable; but what does good HD=2 performance mean? AFAIK all CRCs have infinite data length at HD=2.
So what does "good HD=2 performance above that length" for primitive polynomials mean?
... has optimal length for HD=3, and good HD=2 performance above that length.
The statement is poorly worded. I find it at the bottom of this web page under "Notation:"
https://users.ece.cmu.edu/~koopman/crc
In this and other articles I find, the abbreviation "HD" represents the minimum Hamming distance for a CRC: if HD=k+1, then the CRC can detect any pattern of k bit errors in a message up to some length (as shown in the tables). As you stated, "all CRCs have infinite data length at HD=2".
The usage of the phrase "good HD=2 performance above that length" is confusing. The web site above links to the web site below which includes the statement "HD=2 lengths are always infinite, and so are always left out of this list."
https://users.ece.cmu.edu/~koopman/crc/notes.html
The Wikipedia article on Hamming distance explains the relationship between bit error detection and Hamming distance: "a code C is said to be k error detecting if, and only if, the minimum Hamming distance between any two of its codewords is at least k+1"
As you stated, "all CRCs have infinite data length at HD=2", meaning all CRCs can detect any single bit error regardless of message length.
As for "optimal length for HD=3", which means being able to detect a 2 bit error, consider a linear feedback shift register based on the CRC polynomial, initialized with any non-zero value, if you cycle the register enough times, it will end up back with that initial value. For a n bit CRC based on a n+1 bit primitive polynomial, the register will cycle through all 2^n - 1 non-zero values before repeating. The maximum length of the message (which is the length of data plus the length of CRC) where no failure to detect a 2 bit error can occur is 2^n - 1. For a message of length 2^n or greater, then for any "i", if bit[0+i] and bit[(2^n)-1+i] are in error, the primitive CRC will fail to detect the 2 bit error. If the CRC polynomial is not primitive, the the maximum length for failproof 2 error bit detection will be decreased, and not "optimal".
For a linear feedback shift register based on any CRC polynomial, initialized to any non-zero value, no matter how many times it it cycled, it will never include a value of zero. This is one way to explain why "all CRCs have infinite data length at HD=2" (are able to detect single bit errors).
The author said: "Generally the primitive polynomials tend to have pretty good (i.e., low) weights at HD=2 compared to many other polynomials. It has been a while since I looked, but I think in all cases right above the HD=2 break point the lowest weight polynomial was always primitive."
For some algorithm implementations, a low weight may allow faster computation.
Sorry if I should be able to answer this simple question myself!
I am working on an embedded system with a 32-bit CRC done in hardware for speed. A utility exists that I cannot modify that initially takes 3 inputs (words) and returns a CRC.
If a standard 32-bit CRC is implemented, would generating a CRC from one 32-bit word of actual data and two 32-bit words consisting only of zeros produce a less reliable CRC than if I just made up/set some random values for the last two 32-bit words?
Depending on the CRC/polynomial, my limited understanding of CRC would say the more data you put in, the less accurate it is. But doesn't zeroed data reduce accuracy when performing the shifts?
Using zeros will be no different than some other value you might pick. The input word will be just as well spread among the CRC bits either way.
I agree with Mark Adler that zeros are mathematically no worse than other numbers. However, if the utility you can't change does something bad like set the initial CRC to zero, then choose non-zero pad words. An initial CRC=0 + Data=0 + Pads=0 produces a final CRC=0. This is technically valid, but routinely getting CRC=0 is undesirable for data integrity checking. You could compensate for a problem like this with non-zero pad characters, e.g. pad = -1.
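A quick way to see that corner case: the bit-by-bit CRC-32 sketch below (my own, using the standard 0x04C11DB7 polynomial with no reflection or final XOR; it is not meant to match your hardware unit or the utility exactly) shows that a zero register fed all-zero data stays at zero, while a non-zero initial value does not.

#include <cstddef>
#include <cstdint>
#include <iostream>

// Minimal bitwise CRC-32, MSB first, no reflection, no final XOR.
uint32_t crc32(uint32_t crc, const uint8_t* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<uint32_t>(data[i]) << 24;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : (crc << 1);
    }
    return crc;
}

int main() {
    uint8_t msg[12] = {0};   // one 32-bit data word plus two all-zero pad words
    std::cout << std::hex;
    std::cout << "init=0:          " << crc32(0u, msg, sizeof msg) << "\n";          // prints 0
    std::cout << "init=0xFFFFFFFF: " << crc32(0xFFFFFFFFu, msg, sizeof msg) << "\n"; // non-zero
}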
How can I efficiently store binary codes? For certain fixed sizes, such as 32 bits, there are primitive types that can be used. But what if my binary codes are much longer?
What is the fastest way to compute the Hamming distance between two binary codes?
Use std::bitset<N>, defined in the <bitset> header, where N is the number of bits (not bytes).
Compute the Hamming distance between two binary codes a and b using (a ^ b).count().
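A minimal sketch of that suggestion (the code length N and the sample values here are made up):

#include <bitset>
#include <cstddef>
#include <iostream>

int main() {
    constexpr std::size_t N = 128;      // number of bits in each binary code
    std::bitset<N> a, b;
    a.set(3); a.set(70); a.set(127);    // arbitrary example codes
    b.set(3); b.set(71); b.set(127);

    // Hamming distance: XOR the two codes, then count the set bits.
    std::size_t hd = (a ^ b).count();
    std::cout << "Hamming distance = " << hd << "\n";   // prints 2
}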
While studying for a class in computer networks, the prof talked about the Hamming distance between 2 valid code words in a sample code. I have read about Hamming distance, and it makes sense as a measure of how many bit positions differ between 2 strings. For example:
Code Word 1 = 10110
The sender sends code word 1, an error is introduced, and the receiver receives 10100. So you see that the 4th bit was corrupted. This would result in a Hamming distance of 1 because:
Valid Code Word: 10110
Error Code Word: 10100
-----
XOR 00010
The XOR of the 2 strings results in a single 1, so the Hamming distance is 1. I understand it up to that point. But then the prof asks:
What is the hamming distance of the standard CRC-16 bit protocol?
What is the hamming distance of the standard CRC-32 bit protocol?
I'm a bit confused, and was wondering if someone could help. Thanks.
You probably figured it out by now, but what he asked for was most likely the minimum number of bit errors that a CRC code would not detect. The answer depends on the width, the polynomial, and the length of the message. For instance, the best known CRC-32 polynomial (0x1EDC6F41) has a Hamming distance of 6 or better for messages of up to 5,275 bits (Castagnoli, Bräuer, Herrmann: Optimization of Cyclic Redundancy-Check Codes with 24 and 32 Parity Bits, IEEE Transactions on Communications, vol. 41, no. 6, June 1993), which means it is guaranteed to detect up to 5 flipped bits in a single message of 5,275 bits or less.
BTW, the code word includes the checksum, so your example is incorrect.
I need to compress some spatially correlated data records. Currently I am getting 1.2x-1.5x compression with zlib, but I figure it should be possible to get more like 2x. The data records have various fields, but for example, zlib seems to have trouble compressing lists of points.
The points represent a road network. They are pairs of fixed-point 4-byte integers of the form XXXXYYYY. Typically, if a single data block has 100 points, there will only be a few combinations of the top two bytes of X and Y (spatial correlation). But the bottom bytes are always changing and must look like random data to zlib.
Similarly, the records have 4-byte IDs which tend to have constant high bytes and variable low bytes.
Is there another algorithm that would be able to compress this kind of data better? I'm using C++.
Edit: Please no more suggestions to change the data itself. My question is about automatic compression algorithms. If somebody has a link to an overview of all popular compression algorithms I'll just accept that as answer.
You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.
General-purpose compression algorithms just treat your data as a bitstream. They look for commonly used sequences of bits and replace them with shorter dictionary indices.
But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.
As I understand it, you have a large number of data points of the form
XXxxYYyy, where the upper-case letters are very uniform. So factor them out.
Rewrite the list as something similar to this:
XXYY // a header describing the common first and third byte for all the subsequent entries
xxyy // the remaining bytes, which vary
xxyy
xxyy
xxyy
...
XXYY // next unique combination of 1st and 3rd byte
xxyy
xxyy
...
Now, each combination of the rarely varying bytes is listed only once, rather than duplicated for every entry it occurs in. That adds up to a significant space saving.
Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.
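A hedged sketch of that factoring step, in C++ since that's what you're using. The Point layout, the per-group entry count, the big-endian byte order, and the name factorHighBytes are all my own assumptions; I split each 4-byte coordinate into its top two and bottom two bytes, matching the "top two bytes" description in the question.

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Point { uint32_t x, y; };   // fixed-point 4-byte X and Y

// Emit one 4-byte XXYY header per unique pair of high halves, followed by a
// count and the 2-byte low halves (xxyy) of every point in that group.
std::vector<uint8_t> factorHighBytes(const std::vector<Point>& pts) {
    std::map<std::pair<uint16_t, uint16_t>,
             std::vector<std::pair<uint16_t, uint16_t>>> groups;
    for (const Point& p : pts)
        groups[{uint16_t(p.x >> 16), uint16_t(p.y >> 16)}]
            .push_back({uint16_t(p.x & 0xFFFF), uint16_t(p.y & 0xFFFF)});

    std::vector<uint8_t> out;
    auto put16 = [&out](uint16_t v) { out.push_back(v >> 8); out.push_back(v & 0xFF); };
    for (const auto& [hi, lows] : groups) {
        put16(hi.first);                     // XX header
        put16(hi.second);                    // YY header
        put16(uint16_t(lows.size()));        // how many xxyy entries follow
        for (const auto& lo : lows) { put16(lo.first); put16(lo.second); }
    }
    return out;
}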
Another approach might be, instead of storing these coordinates as absolute numbers, to write them as deltas: relative deviations from some location chosen to be as close as possible to all the entries. Your deltas will be smaller numbers, which can be stored using fewer bits.
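A minimal sketch of that idea, assuming 32-bit fixed-point coordinates, the first point of a block as the reference location, and that every point lies within +/-32767 of it (otherwise wider offsets are needed):

#include <cstdint>
#include <vector>

struct Point { int32_t x, y; };

struct DeltaBlock {
    Point ref;                      // reference location stored once per block
    std::vector<int16_t> dx, dy;    // small signed offsets from the reference
};

// pts is assumed non-empty, with every coordinate within +/-32767 of pts[0].
DeltaBlock toDeltas(const std::vector<Point>& pts) {
    DeltaBlock b{pts.front(), {}, {}};
    for (const Point& p : pts) {
        b.dx.push_back(static_cast<int16_t>(p.x - b.ref.x));
        b.dy.push_back(static_cast<int16_t>(p.y - b.ref.y));
    }
    return b;
}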
Not specific to your data, but I would recommend checking out 7zip instead of zlib if you can. I've seen ridiculously good compression ratios using this.
http://www.7-zip.org/
Without seeing the data and its exact distribution, I can't say for certain what the best method is, but I would suggest that you start each group of 1-4 records with a byte whose 8 bits indicate the following:
0-1 Number of bytes of ID that should be borrowed from previous record
2-4 Format of position record
6-7 Number of succeeding records that use the same 'mode' byte
Each position record may be stored in one of eight ways; all formats other than 000 use signed displacements. The number after the bit code is the size of the position record in bytes.
000 - 8 - Two full four-byte positions
001 - 3 - Twelve bits for X and Y
010 - 2 - Ten-bit X and six-bit Y
011 - 2 - Six-bit X and ten-bit Y
100 - 4 - Two sixteen-bit signed displacements
101 - 3 - Sixteen-bit X and 8-bit Y signed displacement
110 - 3 - Eight-bit signed displacement for X; 16-bit for Y
111 - 2 - Two eight-bit signed displacements
A mode byte of zero will store all the information applicable to a point without reference to any previous point, using a total of 13 bytes to store 12 bytes of useful information. Other mode bytes will allow records to be compacted based upon their similarity to previous records. If four consecutive records differ only in the last byte of the ID, and either have both X and Y within +/- 127 of the previous record, or have X within +/- 31 and Y within +/- 511, or X within +/- 511 and Y within +/- 31, then all four records may be stored in 13 bytes (an average of 3.25 bytes each, a 73% reduction in space).
A "greedy" algorithm may be used for compression: examine a record to see what size ID and XY it will have to use in the output, and then grab up to three more records until one is found that either can't "fit" with the previous records using the chosen sizes, or could be written smaller (note that if e.g. the first record has X and Y displacements both equal to 12, the XY would be written with two bytes, but until one reads following records one wouldn't know which of the three two-byte formats to use).
Before setting your format in stone, I'd suggest running your data through it. It may be that a small adjustment (e.g. using 7+9 or 5+11 bit formats instead of 6+10) would let much of the data pack better. The only real way to know, though, is to see what happens with your real data.
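To make the bit layout concrete, here is a hedged sketch of just the mode byte and the smallest position format; the exact bit positions inside the byte and the helper names are my own reading of the list above, not part of the proposal.

#include <cstdint>
#include <vector>

// Pack the mode byte: bits 0-1 = ID bytes borrowed from the previous record,
// bits 2-4 = position format (000..111), bits 6-7 = number of succeeding
// records sharing this mode byte.  Bit 5 is left unused.
uint8_t packModeByte(unsigned idBytesFromPrev, unsigned positionFormat,
                     unsigned recordsSharingMode) {
    return uint8_t((idBytesFromPrev & 0x3)
                 | ((positionFormat & 0x7) << 2)
                 | ((recordsSharingMode & 0x3) << 6));
}

// Format 111: two eight-bit signed displacements from the previous point.
// The caller must already have checked that both deltas fit in [-128, 127].
void appendFormat111(std::vector<uint8_t>& out, int dx, int dy) {
    out.push_back(static_cast<uint8_t>(static_cast<int8_t>(dx)));
    out.push_back(static_cast<uint8_t>(static_cast<int8_t>(dy)));
}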
It looks like the Burrows–Wheeler transform might be useful for this problem. It has a peculiar tendency to put runs of repeating bytes together, which might make zlib compress better. This article suggests I should combine other algorithms than zlib with BWT, though.
Intuitively it sounds expensive, but a look at some source code shows that reverse BWT is O(N) with 3 passes over the data and a moderate space overhead, likely making it fast enough on my target platform (WinCE). The forward transform is roughly O(N log N) or slightly over, assuming an ordinary sort algorithm.
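For experimenting with the idea, here is a naive sketch of the forward transform that sorts full rotations; it is the O(N^2 log N) textbook version, so a real implementation would use suffix sorting to get close to the O(N log N) figure mentioned above.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Naive forward Burrows-Wheeler transform: sort all rotations of s and take
// the last column.  Returns the transformed text plus the index of the row
// holding the original string, which the inverse transform needs.
std::pair<std::string, std::size_t> bwtForward(const std::string& s) {
    const std::size_t n = s.size();
    std::vector<std::size_t> rot(n);
    std::iota(rot.begin(), rot.end(), 0);
    std::sort(rot.begin(), rot.end(), [&](std::size_t a, std::size_t b) {
        for (std::size_t k = 0; k < n; ++k) {       // compare rotations lexicographically
            char ca = s[(a + k) % n], cb = s[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;                               // equal rotations
    });
    std::string last(n, '\0');
    std::size_t primary = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (rot[i] == 0) primary = i;               // row containing the original string
        last[i] = s[(rot[i] + n - 1) % n];          // character preceding each rotation
    }
    return {last, primary};
}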
Sort the points by some kind of proximity measure such that the average distance between adjacent points is small. Then store the difference between adjacent points.
You might do even better if you manage to sort the points so that most differences are positive in both the x and y axes, but I can't say for sure.
As an alternative to zlib, universal codes are a family of compression techniques that work well when the probability distribution is skewed towards small numbers. They would have to be tweaked for signed numbers (encode abs(x)<<1 + (x < 0 ? 1 : 0)).
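A short sketch of that tweak plus one universal code (Elias gamma); the bit sink is just a vector<bool> to keep it brief, and the function names are mine.

#include <cstdint>
#include <vector>

// Map a signed delta to an unsigned value, exactly as suggested above
// (ignoring the INT32_MIN corner case).
uint32_t toUnsigned(int32_t x) {
    return (static_cast<uint32_t>(x < 0 ? -x : x) << 1) | (x < 0 ? 1u : 0u);
}

// Elias gamma coding, a classic universal code: write (bit length - 1) zeros,
// then the value itself in binary.  Gamma cannot encode 0, so encode v + 1.
void eliasGamma(std::vector<bool>& out, uint32_t v) {
    v += 1;
    int bits = 0;
    for (uint32_t t = v; t; t >>= 1) ++bits;        // bit length of v
    for (int i = 0; i < bits - 1; ++i) out.push_back(false);
    for (int i = bits - 1; i >= 0; --i) out.push_back(((v >> i) & 1) != 0);
}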
You might want to write two lists to the compressed file: a NodeList and a LinkList. Each node would have an ID, x, y. Each link would have a FromNode and a ToNode, along with a list of intermediate xy values. You might be able to have a header record with a false origin and have node xy values relative to that.
This would provide the most benefit if your streets follow an urban grid network, by eliminating duplicate coordinates at intersections.
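The suggested layout might look roughly like this; every field name and width below is a guess, since the answer only sketches the idea.

#include <cstdint>
#include <vector>

struct Node {
    uint32_t id;
    int32_t x, y;                    // could be stored relative to the false origin
};

struct Link {
    uint32_t fromNode, toNode;       // indices into the node list
    std::vector<int32_t> shapeXY;    // intermediate x,y pairs along the road
};

struct RoadBlock {
    int32_t originX, originY;        // false origin from the header record
    std::vector<Node> nodes;
    std::vector<Link> links;
};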
If the compression is not required to be lossless, you could use truncated deltas for intermediate coordinates. While someone above mentioned deltas, keep in mind that a loss in connectivity would likely cause more problems than a loss in shape, which is what would happen if you use truncated deltas to represent the last coordinate of a road (which is often an intersection).
Again, if your roads aren't on an urban grid, this probably wouldn't buy you much.