Primitivity of CRC polynomials - crc

I found the following in the web about the performance of CRCs:
Primitive polynomial. This has optimal length for HD=3, and good HD=2
performance above that length.
I don't get it. Optimal length for HD=3 is understandable; but what does good HD=2 performance mean? AFAIK all CRCs have infinite data length at HD=2.
So what does "good HD=2 performance above that length" for primitive polynomials mean?

... has optimal length for HD=3, and good HD=2 performance above that length.
The statement is poorly worded. I find it at the bottom of this web page under "Notation:"
https://users.ece.cmu.edu/~koopman/crc
In this and other articles I find, the abbreviation "HD" represents the minimum Hamming Distance for a CRC: for HD=k+1, then the CRC can detect any pattern of k bit errors in a message up to some length (as shown in the tables). As you stated, "all CRCs have infinite data length at HD=2".
The usage of the phrase "good HD=2 performance above that length" is confusing. The web site above links to the web site below which includes the statement "HD=2 lengths are always infinite, and so are always left out of this list."
https://users.ece.cmu.edu/~koopman/crc/notes.html
Wiki hamming distance explains the relationship between bit error detection and Hamming distance: "a code C is said to be k error detecting if, and only if, the minimum Hamming distance between any two of its codewords is at least k+1"
As you stated, "all CRCs have infinite data length at HD=2", meaning all CRCs can detect any single bit error regardless of message length.
As for "optimal length for HD=3", which means being able to detect a 2 bit error, consider a linear feedback shift register based on the CRC polynomial, initialized with any non-zero value, if you cycle the register enough times, it will end up back with that initial value. For a n bit CRC based on a n+1 bit primitive polynomial, the register will cycle through all 2^n - 1 non-zero values before repeating. The maximum length of the message (which is the length of data plus the length of CRC) where no failure to detect a 2 bit error can occur is 2^n - 1. For a message of length 2^n or greater, then for any "i", if bit[0+i] and bit[(2^n)-1+i] are in error, the primitive CRC will fail to detect the 2 bit error. If the CRC polynomial is not primitive, the the maximum length for failproof 2 error bit detection will be decreased, and not "optimal".
For a linear feedback shift register based on any CRC polynomial, initialized to any non-zero value, no matter how many times it it cycled, it will never include a value of zero. This is one way to explain why "all CRCs have infinite data length at HD=2" (are able to detect single bit errors).

The Author said: "Generally the primitive polynomials tend to have pretty good (i.e., low) weights at HD=2 compared to many other polynomials. It has been a while since I looked, but I think in all cases right above the HD=2 break point the lowest weight polynomial was always primitive."
For some algorithm implementation low weight may provide faster computation.

Related

Does log2 (n)*(x) set compression limits.

I may get all kinds of flags and penalties thrown at me for this. So please be patient. 2 questions
If the minimal number of bits to represent an arbitrary number of decimals is calculated by log2 (n)*(x)....n is range x is length, then you should be able to calculate max compression by turning the file into decimals by the>>> bin to dec.?
Is this result a law that one can not compress below the theoretical min compression limit, or is it an approximated limit?
Jon Hutton
It's actually a bit (ha) trickier. That formula assumes that the number is drawn from a uniform distribution, which is often not the case, but notably is the case for what is commonly called "random data" (though that is an inaccurate name, since data may be random but drawn from a non-uniform distribution).
The entropy H of X in bits is given by the formula:
H(X) = - sum[i](P(x[i]) log2(P(x[i])))
Where P gives the probability of every value x[i] that X may take. The bounds of i are implied and irrelevant, impossible options have a probability of zero anyway. In the uniform case, P(x[i]) is (by definition) 1/N for any possible x[i], we have H(X) = -N * (1/N log2(1/N)) = -log2(1/N) = log2(N).
The formula should in general not simply be multiplied by the length of the data, that only works if all symbols are independent and identically distributed (so for example on your file with IID uniform-random digits, it does work). Often for meaningful data, the probability distribution for a symbol depends on its context, and indeed a lot of compression techniques are aimed at exploiting this.
There is no law that says you cannot get lucky and thereby compress an individual file to fewer bits than are suggested by its entropy. You can arrange for it to be possible on purpose (but it won't necessarily happen), for example, let's say we expect that any letter is equally probable, but we decide to go against the flow and encode an A with the single bit 0, and any other letter as a 1 followed by 5 bits that indicate which letter it is. This is obviously a bad encoding given the expectation, there are only 26 letters and they're equally probable but we're using more than log2(26) ≈ 4.7 bits on average, the average would be (1 + 25 * 6)/26 ≈ 5.8. However, if by some accident we happen to actually get an A (there is a chance of 1/26th that this happens, the odds are not too bad), we compress it to a single bit, which is much better than expected. Of course one cannot rely on luck, it can only come as a surprise.
For further reference you could read about entropy (information theory) on Wikipedia.

Maximum message length for CRC codes? [duplicate]

I've seen 8-bit, 16-bit, and 32-bit CRCs.
At what point do I need to jump to a wider CRC?
My gut reaction is that it is based on the data length:
1-100 bytes: 8-bit CRC
101 - 1000 bytes: 16-bit CRC
1001 - ??? bytes: 32-bit CRC
EDIT:
Looking at the Wikipedia page about CRC and Lott's answer, here' what we have:
<64 bytes: 8-bit CRC
<16K bytes: 16-bit CRC
<512M bytes: 32-bit CRC
It's not a research topic. It's really well understood: http://en.wikipedia.org/wiki/Cyclic_redundancy_check
The math is pretty simple. An 8-bit CRC boils all messages down to one of 256 values. If your message is more than a few bytes long, the possibility of multiple messages having the same hash value goes up higher and higher.
A 16-bit CRC, similarly, gives you one of the 65,536 available hash values. What are the odds of any two messages having one of these values?
A 32-bit CRC gives you about 4 billion available hash values.
From the wikipedia article: "maximal total blocklength is equal to 2**r − 1". That's in bits. You don't need to do much research to see that 2**9 - 1 is 511 bits. Using CRC-8, multiple messages longer than 64 bytes will have the same CRC checksum value.
The effectiveness of a CRC is dependent on multiple factors. You not only need to select the SIZE of the CRC but also the GENERATING POLYNOMIAL to use. There are complicated and non-intuitive trade-offs depending on:
The expected bit error rate of the channel.
Whether the errors tend to occur in bursts or tend to be spread out (burst is common)
The length of the data to be protected - maximum length, minimum length and distribution.
The paper Cyclic Redundancy Code Polynominal Selection For Embedded Networks, by Philip Koopman and Tridib Chakravarty, publised in the proceedings of the 2004 International Conference on Dependable Systems and Networks gives a very good overview and makes several recomendations. It also provides a bibliography for further understanding.
http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
The choice of CRC length versus file size is mainly relevant in cases where one is more likely to have an input which differs from the "correct" input by three or fewer bits than to have a one which is massively different. Given two inputs which are massively different, the possibility of a false match will be about 1/256 with most forms of 8-bit check value (including CRC), 1/65536 with most forms of 16-bit check value (including CRC), etc. The advantage of CRC comes from its treatment of inputs which are very similar.
With an 8-bit CRC whose polynomial generates two periods of length 128, the fraction of single, double, or triple bit errors in a packet shorter than that which go undetected won't be 1/256--it will be zero. Likewise with a 16-bit CRC of period 32768, using packets of 32768 bits or less.
If packets are longer than the CRC period, however, then a double-bit error will go undetected if the distance between the erroneous bits is a multiple of the CRC period. While that might not seem like a terribly likely scenario, a CRC8 will be somewhat worse at catching double-bit errors in long packets than at catching "packet is totally scrambled" errors. If double-bit errors are the second most common failure mode (after single-bit errors), that would be bad. If anything that corrupts some data is likely to corrupt a lot of it, however, the inferior behavior of CRCs with double-bit errors may be a non-issue.
I think the size of the CRC has more to do with how unique of a CRC you need instead of of the size of the input data. This is related to the particular usage and number of items on which you're calculating a CRC.
The CRC should be chosen specifically for the length of the messages, it is not just a question of the size of the CRC: http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
Here is a nice "real world" evaluation of CRC-N
http://www.backplane.com/matt/crc64.html
I use CRC-32 and file-size comparison and have NEVER, in the billions of files checked, run into a matching CRC-32 and File-Size collision. But I know a few exist, when not purposely forced to exist. (Hacked tricks/exploits)
When doing comparison, you should ALSO be checking "data-sizes". You will rarely have a collision of the same data-size, with a matching CRC, within the correct sizes.
Purposely manipulated data, to fake a match, is usually done by adding extra-data until the CRC matches a target. However, that results in a data-size that no-longer matches. Attempting to brute-force, or cycle through random, or sequential data, of the same exact size, would leave a real narrow collision-rate.
You can also have collisions within the data-size, just by the generic limits of the formulas used, and constraints of using bits/bytes and base-ten systems, which depends on floating-point values, which get truncated and clipped.
The point you would want to think about going larger, is when you start to see many collisions which can not be "confirmed" as "originals". (When they both have the same data-size, and (when tested backwards, they have a matching CRC. Reverse/byte or reverse/bits, or bit-offsets)
In any event, it should NEVER be used as the ONLY form of comparison, just for a quick form of comparison, for indexing.
You can use a CRC-8 to index the whole internet, and divide everything into one of N-catagories. You WANT those collisions. Now, with those pre-sorted, you only have to check one of N-directories, looking for "file-size", or "reverse-CRC", or whatever other comparison you can do to that smaller data-set, fast...
Doing a CRC-32 forwards and backwards on the same blob of data is more reliable than using CRC-64 in just one direction. (Or an MD5, for that matter.)
You can detect a single bit error with a CRC in any size packet. Detecting double bit errors or correction of single bit errors is limited to the number of distinct values the CRC can take, so for 8 bits, that would 256; for 16 bits, 65535; etc. 2^n; In practice, though, CRCs actually take on fewer distinct values for single bit errors. For example what I call the 'Y5' polynomial, the 0x5935 polynomial only takes on up to 256 different values before they repeat going back farther, but on the other hand it is able to correct double bit errors that distance, which is 30 bytes plus 2 bytes for errors in the CRC itself.
The number of bits you can correct with forward error correction is also limited by the Hamming Distance of the polynomial. For example, if the Hamming distance is three, you have to flip three bits to change from a set of bits that represents one valid message with matching CRC to another valid message with its own matching CRC. If that is the case, you can correct one bit with confidence. If the Hamming distance were 5, you could correct two bits. But when correcting multiple bits, you are effectively indexing multiple positions, so you need twice as many bits to represent the indexes of two corrected bits rather than one.
With forward error correction, you calculate the CRC on a packet and CRC together, and get a residual value. A good message with zero errors will always have the expected residual value (zero unless there's a nonzero initial value for the CRC register), and each bit position of error has a unique residual value, so use it to identify the position. If you ever get a CRC result with that residual, you know which bit (or bits) to flip to correct the error.

Hash 16-bit integer to a 256-bit space efficiently

It sounds weird to be going bigger, but that's what I'm trying to do. I want to take the entire sequence of 16-bit integers and hash each one in such a way that it maps to 256-bit space uniformly.
The reason for this is that I'm trying to put a subset of the 16-bit number space into a 256-bit bloom filter, for fast membership testing.
I could use some well-known hashing function on each integer, but I'm looking for an extremely efficient implementation (just a few instructions) so that this runs well in a GPU shader program. I feel like the fact that the hash input is known to be only 16-bits can inform the hash function is designed somehow, but I am failing to see the solution.
Any ideas?
EDITS
Based on the responses, my original question is confusing. Sorry about that. I will try to restate it with a more concrete example:
I have a subset S1 of n numbers from the set S, which is in the range (0, 2^16-1). I need to represent this subset S1 with a 256-bit bloom filter constructed with a single hashing function. The reason for the bloom filter is a space consideration. I've chosen a 256-bit bloom filter because it fits my space requirements, and has a low enough probability of false positives. I'm looking to find a very simple hashing function that can take a number from set S and represent it in 256 bits such that each bit has roughly equal probability of being 1 or 0.
The reason for the requirement of simplicity in the hashing function is that this hashing function is going to have to run thousands of times per pixel, so anywhere where I can trim instructions is a win.
If you multiply (using uint32_t) a 16 bit value by prime (or for that matter any odd number) p between 2^31 and 2^32, then you "probably" smear the results fairly evenly across the 32 bit space. Then you might want to add another prime value, to prevent 0 mapping to 0 (you want each bit to have an equal probability of being 0 or 1, only one input value in 2^256 should have output all zeros, and since there are only 2^16 inputs that means you want none of them to have output all zeros).
So that's how to expand 16 bits to 32 with one operation (plus whatever instructions are needed to load the constant). Use four different values p1 ... p4 to get 256 bits, and run some tests with different p values to find good ones (i.e. those that produce not too many more false positives than what you expect for your Bloom filter given the size of the set you're encoding and assuming an ideal hashing function). For example I'm pretty sure -1 is a bad p-value.
No matter how good the values you'll see some correlations, though: for example as I've described it above the lowest bit of all 4 separate values will be equal, which is a pretty serious dependency. So you probably want a couple more "mixing" operations. For example you might say that each byte of the final output shall be the XOR of two of the bytes of what I've described (and not two least-siginficant bytes!), just to get rid of the simple arithmetic relations.
Unless I've misunderstood the question, though, this is not how a Bloom filter usually works. Usually you want your hash to produce an exact fixed number of set bits for each input, and all the arithmetic to compute the false positive rate relies on this. That's why for a Bloom filter 256 bits in size you'd normally have k 8-bit hashes, not one 256-bit hash. k is normally rather less than half the size of the filter in bits (the optimal value is the number of bits per value in the filter, times ln(2) which is about 0.7). So normally you don't want the probability of each bit being 1 to be anything like as high as 0.5.
The reason is that once you've ORed as few as 4 such 256-bit values together, almost all the bits in your filter are set (15 in 16 of them). So you're looking at a lot of false positives already.
But if you've done the math and you're happy with a single hash function producing a variable number of set bits averaging half of them, then fair enough. Or is the double-occurrence of the number 256 just a coincidence, because k happens to be 32 for the set size you have chosen and you're actually using the 256-bit hash as 32 8-bit hashes?
[Edit: your comment clarifies this, but anyway k should not get so high that you need 256 bits of hash in total. Clearly there's no point in this case using a Bloom filter with more than 16 bits per value (i.e fewer than 16 values), since using the same amount of space you could just list the values, and have a false positive rate of 0. A filter with 16 bits per value gives a false positive rate of something like 1 in 2200. Even there, optimal k is only 23, that is you should set 23 bits in the filter for each value in the set. If you expect the sets to be bigger than 16 values then you want to set fewer bits for each element, and you'll get a higher false positive rate.]
I believe there is some confusion in the question as posed. I will first try to clear up any inconsistencies I've noticed above.
OP originally states that he is trying to map a smaller space into a larger one. If this is truly the case, then the use of the bloom filter algorithm is unnecessary. Instead, as has been suggested in the comments above, the identity function is the only "hash" function necessary to set and test each bit. However, I make the assertion that this is not really what the OP is looking for. If so, then the OP must be storing 2^256 bits in memory (based on how the question is stated) in order for the space of 16-bit integers (i.e. 2^16) to be smaller than his set size; this is an unreasonable amount of memory to be using and is highly unlikely to be the case.
Therefore, I make the assumption that the problem constraints are as follows: we have a 256-bit bit vector in which we want to map the space of 16-bit integers. That is, we have 256 bits available to map 2^16 possible different integers. Thus, we are not actually mapping into a larger space, but, instead, a much smaller space. Similarly, it does appear (again, as previously pointed out in the comments above) that the OP is requesting a single hash function. If this is the case, there is clear misunderstanding about how bloom filters work.
Bloom filters typically use a set of hash independent hash functions to reduce false positives. Without going into too much detail, every input to the bloom filter runs through all n hash functions and then the resulting index in the bit vector is tested for each function. If all indices tested are set to 1, then the value may be in the set (with proper collisions in all n hash functions or overlap, false positives will occur). Moreover, if any of the indices is set to 0, then the value is absolutely not in the set. With this in mind, it is important to notice that an entirely saturated bloom filter has no benefit. That is, every query to the bloom filter will return that the item is in the set.
Hash Function Concerns
Now, back to the OP's original question. It is likely going to be best to use known hashing algorithms (since these are mathematically difficult to write and "rolling your own" typically doesn't end well). If you are worried about efficiency down to clock-cycles, implement the algorithm yourself in the appropriate assembly language for your architecture to reduce running time for each hash function. Remember, algorithmically, hash functions should run in O(1) time, so they should not contribute too much overhead if implemented properly. To start you off, I would recommend considering the modified bernstein hash. I have written a version for your specific case below (mostly for example purposes):
unsigned char modified_bernstein(short key)
{
unsigned ret = key & 0xff;
ret = 33 * ret ^ (key >> 8);
return ret % 256; // Try to do some modulo math to keep it in range
}
The bernstein method I have adapted generally runs as a function of the number of bytes of the input. Since a short type is 2 bytes or 16-bits, I have removed any variables and loops from the algorithm and simply performed some bit twiddling to get at each byte. Finally, an unsigned char can return a value in the range of [0,256) which forces the hash function to return a valid index in the bit vector.

Data Length vs CRC Length

I've seen 8-bit, 16-bit, and 32-bit CRCs.
At what point do I need to jump to a wider CRC?
My gut reaction is that it is based on the data length:
1-100 bytes: 8-bit CRC
101 - 1000 bytes: 16-bit CRC
1001 - ??? bytes: 32-bit CRC
EDIT:
Looking at the Wikipedia page about CRC and Lott's answer, here' what we have:
<64 bytes: 8-bit CRC
<16K bytes: 16-bit CRC
<512M bytes: 32-bit CRC
It's not a research topic. It's really well understood: http://en.wikipedia.org/wiki/Cyclic_redundancy_check
The math is pretty simple. An 8-bit CRC boils all messages down to one of 256 values. If your message is more than a few bytes long, the possibility of multiple messages having the same hash value goes up higher and higher.
A 16-bit CRC, similarly, gives you one of the 65,536 available hash values. What are the odds of any two messages having one of these values?
A 32-bit CRC gives you about 4 billion available hash values.
From the wikipedia article: "maximal total blocklength is equal to 2**r − 1". That's in bits. You don't need to do much research to see that 2**9 - 1 is 511 bits. Using CRC-8, multiple messages longer than 64 bytes will have the same CRC checksum value.
The effectiveness of a CRC is dependent on multiple factors. You not only need to select the SIZE of the CRC but also the GENERATING POLYNOMIAL to use. There are complicated and non-intuitive trade-offs depending on:
The expected bit error rate of the channel.
Whether the errors tend to occur in bursts or tend to be spread out (burst is common)
The length of the data to be protected - maximum length, minimum length and distribution.
The paper Cyclic Redundancy Code Polynominal Selection For Embedded Networks, by Philip Koopman and Tridib Chakravarty, publised in the proceedings of the 2004 International Conference on Dependable Systems and Networks gives a very good overview and makes several recomendations. It also provides a bibliography for further understanding.
http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
The choice of CRC length versus file size is mainly relevant in cases where one is more likely to have an input which differs from the "correct" input by three or fewer bits than to have a one which is massively different. Given two inputs which are massively different, the possibility of a false match will be about 1/256 with most forms of 8-bit check value (including CRC), 1/65536 with most forms of 16-bit check value (including CRC), etc. The advantage of CRC comes from its treatment of inputs which are very similar.
With an 8-bit CRC whose polynomial generates two periods of length 128, the fraction of single, double, or triple bit errors in a packet shorter than that which go undetected won't be 1/256--it will be zero. Likewise with a 16-bit CRC of period 32768, using packets of 32768 bits or less.
If packets are longer than the CRC period, however, then a double-bit error will go undetected if the distance between the erroneous bits is a multiple of the CRC period. While that might not seem like a terribly likely scenario, a CRC8 will be somewhat worse at catching double-bit errors in long packets than at catching "packet is totally scrambled" errors. If double-bit errors are the second most common failure mode (after single-bit errors), that would be bad. If anything that corrupts some data is likely to corrupt a lot of it, however, the inferior behavior of CRCs with double-bit errors may be a non-issue.
I think the size of the CRC has more to do with how unique of a CRC you need instead of of the size of the input data. This is related to the particular usage and number of items on which you're calculating a CRC.
The CRC should be chosen specifically for the length of the messages, it is not just a question of the size of the CRC: http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
Here is a nice "real world" evaluation of CRC-N
http://www.backplane.com/matt/crc64.html
I use CRC-32 and file-size comparison and have NEVER, in the billions of files checked, run into a matching CRC-32 and File-Size collision. But I know a few exist, when not purposely forced to exist. (Hacked tricks/exploits)
When doing comparison, you should ALSO be checking "data-sizes". You will rarely have a collision of the same data-size, with a matching CRC, within the correct sizes.
Purposely manipulated data, to fake a match, is usually done by adding extra-data until the CRC matches a target. However, that results in a data-size that no-longer matches. Attempting to brute-force, or cycle through random, or sequential data, of the same exact size, would leave a real narrow collision-rate.
You can also have collisions within the data-size, just by the generic limits of the formulas used, and constraints of using bits/bytes and base-ten systems, which depends on floating-point values, which get truncated and clipped.
The point you would want to think about going larger, is when you start to see many collisions which can not be "confirmed" as "originals". (When they both have the same data-size, and (when tested backwards, they have a matching CRC. Reverse/byte or reverse/bits, or bit-offsets)
In any event, it should NEVER be used as the ONLY form of comparison, just for a quick form of comparison, for indexing.
You can use a CRC-8 to index the whole internet, and divide everything into one of N-catagories. You WANT those collisions. Now, with those pre-sorted, you only have to check one of N-directories, looking for "file-size", or "reverse-CRC", or whatever other comparison you can do to that smaller data-set, fast...
Doing a CRC-32 forwards and backwards on the same blob of data is more reliable than using CRC-64 in just one direction. (Or an MD5, for that matter.)
You can detect a single bit error with a CRC in any size packet. Detecting double bit errors or correction of single bit errors is limited to the number of distinct values the CRC can take, so for 8 bits, that would 256; for 16 bits, 65535; etc. 2^n; In practice, though, CRCs actually take on fewer distinct values for single bit errors. For example what I call the 'Y5' polynomial, the 0x5935 polynomial only takes on up to 256 different values before they repeat going back farther, but on the other hand it is able to correct double bit errors that distance, which is 30 bytes plus 2 bytes for errors in the CRC itself.
The number of bits you can correct with forward error correction is also limited by the Hamming Distance of the polynomial. For example, if the Hamming distance is three, you have to flip three bits to change from a set of bits that represents one valid message with matching CRC to another valid message with its own matching CRC. If that is the case, you can correct one bit with confidence. If the Hamming distance were 5, you could correct two bits. But when correcting multiple bits, you are effectively indexing multiple positions, so you need twice as many bits to represent the indexes of two corrected bits rather than one.
With forward error correction, you calculate the CRC on a packet and CRC together, and get a residual value. A good message with zero errors will always have the expected residual value (zero unless there's a nonzero initial value for the CRC register), and each bit position of error has a unique residual value, so use it to identify the position. If you ever get a CRC result with that residual, you know which bit (or bits) to flip to correct the error.

Shannon's entropy formula. Help my confusion

my understanding of the entropy formula is that it's used to compute the minimum number of bits required to represent some data. It's usually worded differently when defined, but the previous understanding is what I relied on until now.
Here's my problem. Suppose I have a sequence of 100 '1' followed by 100 '0' = 200 bits. The alphabet is {0,1}, base of entropy is 2. Probability of symbol "0" is 0.5 and "1" is 0.5. So the entropy is 1 or 1 bit to represent 1 bit.
However you can run-length encode it with something like 100 / 1 / 100 / 0 where it's number of bits to output followed by the bit. It seems like I have a representation smaller than the data. Especially if you increase the 100 to much larger number.
I'm using: http://en.wikipedia.org/wiki/Information_entropy as reference at the moment.
Where did I go wrong? Is it the probability assigned to symbols? I don't think it's wrong. Or did I get the connection between compression and entropy wrong? Anything else?
Thanks.
Edit
Following some of the answers my followup are: would you apply the entropy formula to a particular instance of a message to try to find out its information content? Would it be valid to take the message "aaab" and say the entropy is ~0.811. If yes then what's the entropy of 1...10....0 where 1s and 0s are repeated n times using the entropy formula. Is the answer 1?
Yes I understand that you are creating a random variable of your input symbols and guessing at the probability mass function based on your message. What I'm trying to confirm is the entropy formula does not take into account the position of the symbols in the message.
Or did I get the connection between compression and entropy wrong?
You're pretty close, but this last question is where the mistake was. If you're able to compress something into a form that was smaller than its original representation, it means that the original representation had at least some redundancy. Each bit in the message really wasn't conveying 1 bit of information.
Because redundant data does not contribute to the information content of a message, it also does not increase its entropy. Imagine, for example, a "random bit generator" that only returns the value "0". This conveys no information at all! (Actually, it conveys an undefined amount of information, because any binary message consisting of only one kind of symbol requires a division by zero in the entropy formula.)
By contrast, had you simulated a large number of random coin flips, it would be very hard to reduce the size of this message by much. Each bit would be contributing close to 1 bit of entropy.
When you compress data, you extract that redundancy. In exchange, you pay a one-time entropy price by having to devise a scheme that knows how to compress and decompress this data; that itself takes some information.
However you can run-length encode it with something like 100 / 1 / 100 / 0 where it's number of bits to output followed by the bit. It seems like I have a representation smaller than the data. Especially if you increase the 100 to much larger number.
To summarize, the fact that you could devise a scheme to make the encoding of the data smaller than the original data tells you something important. Namely, it says that your original data contained very little information.
Further reading
For a more thorough treatment of this, including exactly how you'd calculate the entropy for any arbitrary sequence of digits with a few examples, check out this short whitepaper.
Have a look at Kolmogorov complexity
The minimum number of bits into which a string can be compressed without losing information. This is defined with respect to a fixed, but universal decompression scheme, given by a universal Turing machine.
And in your particular case, don't restrict yourself to alphabet {0,1}. For your example use {0...0, 1...1} (hundred of 0's and hundred of 1's)
Your encoding works in this example, but it is possible to conceive an equally valid case: 010101010101... which would be encoded as 1 / 0 / 1 / 1 / ...
Entropy is measured across all possible messages that can be constructed in the given alphabet, and not just pathological examples!
John Feminella got it right, but I think there is more to say.
Shannon entropy is based on probability, and probability is always in the eye of the beholder.
You said that 1 and 0 were equally likely (0.5). If that is so, then the string of 100 1s followed by 100 0s has a probability of 0.5^200, of which -log(base 2) is 200 bits, as you expect. However, the entropy of that string (in Shannon terms) is its information content times its probability, or 200 * 0.5^200, still a really small number.
This is important because if you do run-length coding to compress the string, in the case of this string it will get a small length, but averaged over all 2^200 strings, it will not do well. With luck, it will average out to about 200, but not less.
On the other hand, if you look at your original string and say it is so striking that whoever generated it is likely to generate more like it, then you are really saying its probability is larger than 0.5^200, so you are making a different assumptions about the original probability structure of the generator of the string, namely that it has lower entropy than 200 bits.
Personally, I find this subject really interesting, especially when you look into Kolmogorov (Algorithmic) information. In that case, you define the information content of a string as the length of the smallest program that could generate it. This leads to all sorts of insights into software engineering and language design.
I hope that helps, and thanks for your question.