Regular expressions for bit patterns or binary data?

A question out of curiosity: is there a way to do pattern matching at the bit level?
All the regex systems I have seen operate on a byte- or character-based representation; I haven't seen any that let you match at the bit level.
For example, if I have a bit field like this:
011101100011100110110001
(24 bits!) can I check that bits 7, 8 & 9 are the pattern 100?
Language-agnostic answers are preferable, but as I know of nothing that does it, I would appreciate any insight.
NOTE: I wish to do this on an arbitrary number of bits, so converting to bytes (or padding to a byte size) and applying a convoluted normal regexp is NOT what I want!
Thanks,

Certainly there is no theoretical limit which would make it impossible. In fact, the associated theory can apply to any alphabet, and examples often use quite small alphabets, though not usually the one consisting of the symbols 0 and 1. You might want to read a book about computational theory.

Assuming that you're trying to check actual bits and not a string of 1s and 0s, I don't believe you can do this with regex per se, but you can apply a bit mask to check the status of certain bits. For example, to check that the leftmost (most significant) bit is 1:
11000100
AND
10000000
= 10000000
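The same mask idea can be applied to the 24-bit example from the question. This is only a minimal Python sketch: it assumes bits are numbered from 1 starting at the leftmost bit (the question does not say), and `bits_match` is a hypothetical helper name, not a library function.

```python
def bits_match(value: int, width: int, start: int, pattern: str) -> bool:
    """Check whether `pattern` occurs at 1-based bit position `start`.

    Assumption: bit 1 is the leftmost (most significant) bit of a field
    that is `width` bits wide.
    """
    n = len(pattern)
    if start < 1 or start - 1 + n > width:
        raise ValueError("pattern does not fit inside the field")
    shift = width - (start - 1) - n      # number of bits to the right of the window
    mask = (1 << n) - 1                  # n low bits set
    return ((value >> shift) & mask) == int(pattern, 2)


field = int("011101100011100110110001", 2)   # the 24-bit field from the question
print(bits_match(field, 24, 7, "100"))       # checks bits 7, 8 and 9
```

Because the width and the start position are plain parameters, nothing here depends on byte boundaries, so the same check works for an arbitrary number of bits.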

Related

Is there any regex to find that a value is hashmap or not?

I want to build a PL/SQL function that says whether a value is hashed or not.
For example:
1. TIM
2. F6099C0932D0E2B13286218F99C265975B33FD84
My regex should be able to tell me that expression 1 (TIM) is not hashed,
whereas expression 2 (F6099C0932D0E2B13286218F99C265975B33FD84) is hashed.
A hash is just a number of bits of a particular size. Cryptographic hashes generally have a 256- to 512-bit output size, which gives roughly 128 to 256 bits of security against collisions.
Other hashes used in a hash map may be smaller, as collision resistance is usually not required; instead the hash just needs to be well distributed, so that hashed values are distributed equally.
Computers generally only address bytes, not bits. So commonly the hashes are a multiple of 8 bits. Even more generally, they are commonly a power of two, or two or three powers of two added together (160 bits for 128 + 32 bits).
Now to view those well distributed bytes we need to have some kind of way to view these bit values using printable characters. One way to do that is base 64. However, for these relatively short values hexadecimals are usually preferred, and that's what you have in the question.
So can you see if it is a hash value or not? Well, yes and no. You can see with a pretty good likelihood that it is a 40-character hexadecimal value, which represents a 20-byte, or 20 * 8 = 160-bit, value. We can also "see" that it is pretty well distributed and that it doesn't encode printable ASCII (as there are values above 7E hex).
Testing with a regex that the contents are (upper or lowercase) hexadecimals is easy enough. That it is 40 characters for 160 bits should be easy as well. However, to test that it is indeed a well distributed value is not really possible with regular expressions. It is even not easy for any program code, as "random" values may now and then look surprisingly non-random. Besides that, not only hashes consist of well distributed byte values. Ciphertext and - of course - random byte values have similar properties.
So, yeah, you can verify that the output format is compatible with a hash value, but testing if it is a hash value is not really possible.
The regexp:
[0-9A-Fa-f]{40}
would of course wipe "Tim" out. You can be 100% sure that "Tim" is not a 160 bit hash value encoded in hexadecimals after all.
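As a rough illustration of that format check, here is a small Python sketch (the question is about PL/SQL, so this only shows the idea). `looks_like_160_bit_hex` is a hypothetical name. Note that the pattern has to match the whole string, otherwise a longer value that merely contains 40 hex digits would also pass.

```python
import re

# Format check only: exactly 40 hexadecimal characters (160 bits).
HEX_160 = re.compile(r"[0-9A-Fa-f]{40}")


def looks_like_160_bit_hex(value: str) -> bool:
    """True if the whole string is 40 hex digits; says nothing about its origin."""
    return HEX_160.fullmatch(value) is not None


print(looks_like_160_bit_hex("TIM"))
print(looks_like_160_bit_hex("F6099C0932D0E2B13286218F99C265975B33FD84"))
```

A value that passes is only compatible with being a hex-encoded 160-bit hash; ciphertext or random bytes of the same length would pass as well.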

How does the quality of 128 bit MurmurHash3 change in case of small key length or output truncation?

I have a 64-bit machine and I want to use the 128-bit MurmurHash3 for its speed (the MurmurHash3_x64_128 function in https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp).
But my inputs to this hash function won't be more than 30 bytes long, in which case the for loop in MurmurHash3_x64_128 will only iterate once, and then the tail part will be done. In such a scheme, it seems like the mixing won't be that great. Am I right? If not, could you please elaborate on why? If yes, what would you suggest as a reasonable minimum input key length for the 128-bit MurmurHash3, so that the hashing is good?
The second thing is about truncating the output bits. As far as I understood from the answer https://stackoverflow.com/a/11488383/7056851, although it raises the collision rate because the output range is smaller, slicing the output will still give good hash values if the original hash function is "random" enough. My question is then whether the 128-bit MurmurHash3 is a good candidate for output truncation. The reason I am asking is that I want to use MurmurHash3_x64_128 for its speed, but I only need 32-bit hash values, so I am planning to split the 128 bits into four 32-bit hash values for a given key. But I am doubtful about how good the resulting hash values are.
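To make the slicing described above concrete, here is a minimal Python sketch. It does not use MurmurHash3: `hashlib.md5` serves purely as a stand-in for "some 128-bit digest", and `four_32bit_slices` is a hypothetical helper name. It shows only the mechanics of the split, not the quality of the resulting values.

```python
import hashlib
import struct


def four_32bit_slices(key: bytes) -> tuple:
    """Split one 128-bit digest into four 32-bit unsigned integers."""
    digest = hashlib.md5(key).digest()     # stand-in 128-bit hash, NOT MurmurHash3
    return struct.unpack("<4I", digest)    # four little-endian 32-bit words


print(four_32bit_slices(b"example key"))
```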
One last question is about the endianness. If you look at the comment at line 52 in the link to the source code, it says:
Block read - if your platform needs to do endian-swapping or can only handle aligned reads, do the conversion here
Why does it matter whether the platform is little endian or big endian? After all, all the bits are multiplied by some constants, rotated, XORed, etc., and what we want from a hash function is basically to map the input keys to the output range with a uniform distribution. How does the endianness change the picture? And even if it does, what if the input is an array of char? The endianness shouldn't matter, at least for keys that are arrays of chars, should it?
As you can see, I am not very good at analyzing hash functions. Any clear explanation is appreciated.
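One way to see what the block-read comment is about: the same array of chars turns into a different integer depending on the byte order used to read it as a block, so any arithmetic done on that integer differs too. The sketch below uses a toy mixing step with arbitrary constants (an assumption for illustration, not MurmurHash3's actual constants or code).

```python
def toy_mix(block: int) -> int:
    """One multiply-and-rotate step on a 32-bit block (toy constants)."""
    k = (block * 0x9E3779B1) & 0xFFFFFFFF          # arbitrary odd multiplier
    return ((k << 15) | (k >> 17)) & 0xFFFFFFFF    # rotate left by 15 bits


key = b"\x01\x02\x03\x04"                   # the same four chars on every platform
as_little = int.from_bytes(key, "little")   # 0x04030201
as_big = int.from_bytes(key, "big")         # 0x01020304
print(hex(toy_mix(as_little)), hex(toy_mix(as_big)))
```

Both results may be equally well distributed; the point is only that they differ, which matters when hash values computed on one platform have to match those computed on another.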

32 bit CRC with some inputs set to zero. Is this less accurate than dummy data?

Sorry if I should be able to answer this simple question myself!
I am working on an embedded system with a 32-bit CRC done in hardware for speed. A utility exists that I cannot modify that initially takes 3 inputs (words) and returns a CRC.
If a standard 32-bit CRC were implemented, would generating a CRC from one 32-bit word of actual data and two 32-bit words consisting only of zeros produce a less reliable CRC than if I just made up some random values for the last two 32-bit words?
Depending on the CRC/polynomial, my limited understanding of CRC would say that the more data you put in, the less accurate it is. But doesn't zeroed data reduce accuracy when performing the shifts?
Using zeros will be no different than some other value you might pick. The input word will be just as well spread among the CRC bits either way.
I agree with Mark Adler that zeros are mathematically no worse than other numbers. However, if the utility you can't change does something bad like set the initial CRC to zero, then choose non-zero pad words. An initial CRC=0 + Data=0 + Pads=0 produces a final CRC=0. This is technically valid, but routinely getting CRC=0 is undesirable for data integrity checking. You could compensate for a problem like this with non-zero pad characters, e.g. pad = -1.
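The all-zero case can be reproduced with a toy software model. This sketch assumes a plain MSB-first CRC with the common CRC-32 polynomial 0x04C11DB7, no bit reflection and no final XOR; the real hardware and the utility may well be configured differently, and `crc32_words` is a hypothetical name.

```python
POLY = 0x04C11DB7  # common CRC-32 generator polynomial, MSB-first form


def crc32_words(words, init=0):
    """Bitwise MSB-first CRC over 32-bit words, with no final XOR (toy model)."""
    crc = init
    for word in words:
        crc ^= word & 0xFFFFFFFF
        for _ in range(32):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ POLY) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc


print(hex(crc32_words([0, 0, 0], init=0)))            # zero init, zero data, zero pads
print(hex(crc32_words([0, 0, 0], init=0xFFFFFFFF)))   # same input, non-zero init
```

In this model the register never leaves zero when both the initial value and all inputs are zero, whereas a non-zero initial value (or non-zero pad words) keeps the result away from zero. Different data words with the same pads still map to different CRCs either way.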

Shannon's entropy formula. Help my confusion

My understanding of the entropy formula is that it's used to compute the minimum number of bits required to represent some data. It's usually worded differently when defined, but that understanding is what I have relied on until now.
Here's my problem. Suppose I have a sequence of 100 '1' followed by 100 '0' = 200 bits. The alphabet is {0,1}, base of entropy is 2. Probability of symbol "0" is 0.5 and "1" is 0.5. So the entropy is 1 or 1 bit to represent 1 bit.
However you can run-length encode it with something like 100 / 1 / 100 / 0 where it's number of bits to output followed by the bit. It seems like I have a representation smaller than the data. Especially if you increase the 100 to much larger number.
I'm using: http://en.wikipedia.org/wiki/Information_entropy as reference at the moment.
Where did I go wrong? Is it the probability assigned to symbols? I don't think it's wrong. Or did I get the connection between compression and entropy wrong? Anything else?
Thanks.
Edit
Following some of the answers, my follow-up questions are: would you apply the entropy formula to a particular instance of a message to try to find out its information content? Would it be valid to take the message "aaab" and say the entropy is ~0.811? If yes, then what is the entropy of 1...10...0, where the 1s and 0s are each repeated n times, using the entropy formula? Is the answer 1?
Yes, I understand that you are creating a random variable from your input symbols and guessing at the probability mass function based on your message. What I'm trying to confirm is that the entropy formula does not take into account the position of the symbols in the message.
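The calculation being described, where symbol probabilities are estimated from the message itself, can be sketched in a few lines of Python. `empirical_entropy` is a hypothetical name; the function looks only at symbol counts, so it cannot see symbol positions.

```python
import math
from collections import Counter


def empirical_entropy(message: str) -> float:
    """Shannon entropy in bits per symbol, using symbol frequencies only."""
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(round(empirical_entropy("aaab"), 3))        # the ~0.811 from the question
print(empirical_entropy("1" * 100 + "0" * 100))   # 100 ones followed by 100 zeros
print(empirical_entropy("10" * 100))              # same counts, different order
```

The last two calls give the same value because the two messages have identical symbol counts, which is exactly the position-blindness the edit asks about.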
Or did I get the connection between compression and entropy wrong?
You're pretty close, but this last question is where the mistake was. If you're able to compress something into a form that was smaller than its original representation, it means that the original representation had at least some redundancy. Each bit in the message really wasn't conveying 1 bit of information.
Because redundant data does not contribute to the information content of a message, it also does not increase its entropy. Imagine, for example, a "random bit generator" that only returns the value "0". This conveys no information at all! (The entropy formula agrees: the symbol that never appears contributes a 0 * log 0 term, which is taken to be 0 by convention, so the entropy comes out as exactly 0.)
By contrast, had you simulated a large number of random coin flips, it would be very hard to reduce the size of this message by much. Each bit would be contributing close to 1 bit of entropy.
When you compress data, you extract that redundancy. In exchange, you pay a one-time entropy price by having to devise a scheme that knows how to compress and decompress this data; that itself takes some information.
However you can run-length encode it with something like 100 / 1 / 100 / 0 where it's number of bits to output followed by the bit. It seems like I have a representation smaller than the data. Especially if you increase the 100 to much larger number.
To summarize, the fact that you could devise a scheme to make the encoding of the data smaller than the original data tells you something important. Namely, it says that your original data contained very little information.
Further reading
For a more thorough treatment of this, including exactly how you'd calculate the entropy for any arbitrary sequence of digits with a few examples, check out this short whitepaper.
Have a look at Kolmogorov complexity
The minimum number of bits into which a string can be compressed without losing information. This is defined with respect to a fixed, but universal decompression scheme, given by a universal Turing machine.
And in your particular case, don't restrict yourself to the alphabet {0,1}. For your example, use {0...0, 1...1} (a hundred 0s and a hundred 1s).
Your encoding works in this example, but it is possible to conceive an equally valid case: 010101010101... which would be encoded as 1 / 0 / 1 / 1 / ...
Entropy is measured across all possible messages that can be constructed in the given alphabet, and not just pathological examples!
John Feminella got it right, but I think there is more to say.
Shannon entropy is based on probability, and probability is always in the eye of the beholder.
You said that 1 and 0 were equally likely (0.5). If that is so, then the string of 100 1s followed by 100 0s has a probability of 0.5^200, whose -log (base 2) is 200 bits, as you expect. However, that one string's contribution to the entropy (in Shannon terms) is its information content times its probability, or 200 * 0.5^200, still a really small number.
This is important because if you do run-length coding to compress the string, in the case of this string it will get a small length, but averaged over all 2^200 strings, it will not do well. With luck, it will average out to about 200, but not less.
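Written out, under the stated assumption that every one of the 2^200 possible strings s is equally likely, the sum over all strings gives the 200-bit figure:

```latex
H = -\sum_{s \in \{0,1\}^{200}} p(s)\,\log_2 p(s)
  = 2^{200} \cdot \left( 2^{-200} \cdot 200 \right)
  = 200 \text{ bits}
```

Each individual string contributes only 200 * 2^-200, but there are 2^200 of them, which is why no lossless code can average fewer than 200 bits over this source.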
On the other hand, if you look at your original string and say it is so striking that whoever generated it is likely to generate more like it, then you are really saying its probability is larger than 0.5^200, so you are making a different assumption about the original probability structure of the generator of the string, namely that it has lower entropy than 200 bits.
Personally, I find this subject really interesting, especially when you look into Kolmogorov (Algorithmic) information. In that case, you define the information content of a string as the length of the smallest program that could generate it. This leads to all sorts of insights into software engineering and language design.
I hope that helps, and thanks for your question.

Compression for a unique stream of data

I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.
Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?
Update: zlib after converting it to an array of xor deltas shrinks it to about 20% of the original size.
If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.
Take an input stream like:
1101
1101
1110
1110
0110
and output:
1101
0000
0011
0000
1000
A bit of pseudocode:
compressed[0] = uncompressed[0]
for i = 1 to n-1:
    compressed[i] = uncompressed[i-1] ^ uncompressed[i]
We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.
When you want to decompress:
uncompressed[0] = compressed[0]
for i = 1 to n-1:
    uncompressed[i] = uncompressed[i-1] ^ compressed[i]
This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.
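A runnable version of the pseudocode above might look like the following Python sketch (function names are hypothetical). Starting from a previous value of 0 makes the first output equal to the first input, so no special case is needed.

```python
def xor_delta_encode(values):
    """Replace each value by its XOR with the previous one (first is kept as-is)."""
    out = []
    prev = 0
    for v in values:
        out.append(prev ^ v)   # 0 ^ v == v for the first element
        prev = v
    return out


def xor_delta_decode(deltas):
    """Invert xor_delta_encode by accumulating the XORs."""
    out = []
    prev = 0
    for d in deltas:
        prev ^= d
        out.append(prev)
    return out


stream = [0b1101, 0b1101, 0b1110, 0b1110, 0b0110]
encoded = xor_delta_encode(stream)
print([format(x, "04b") for x in encoded])
```

The encoded list can then be serialized and handed to zlib or a run-length coder, which is where the actual size reduction happens.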
Have you considered Run-length encoding?
Or try this: Instead of storing the numbers themselves, you store the differences between the numbers. 1 1 2 2 2 3 5 becomes 1 0 1 0 0 1 2. Now most of the numbers you have to encode are very small. To store a small integer, use an 8-bit integer instead of the 32-bit one you'll encode on most platforms. That's a factor of 4 right there. If you do need to be prepared for bigger gaps than that, designate the high-bit of the 8-bit integer to say "this number requires the next 8 bits as well".
You can combine that with run-length encoding for even better compression ratios, depending on your data.
Neither of these options is particularly hard to implement, and they all run very fast and with very little memory (as opposed to, say, bzip).
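The difference-plus-continuation-bit idea can be sketched as follows. Two details are assumptions added here, not part of the answer: each byte carries 7 payload bits with the high bit meaning "more bytes follow", and negative differences are folded into non-negative numbers with a zigzag mapping. All function names are hypothetical.

```python
def delta_encode(values):
    """Differences between neighbours; the first value is a difference from 0."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def pack_deltas(deltas) -> bytes:
    """Store each difference in as few bytes as possible (7 bits per byte)."""
    out = bytearray()
    for d in deltas:
        u = 2 * d if d >= 0 else -2 * d - 1     # zigzag: small |d| -> small u
        while u >= 0x80:
            out.append((u & 0x7F) | 0x80)        # high bit set: more bytes follow
            u >>= 7
        out.append(u)
    return bytes(out)


def unpack_values(data: bytes):
    """Invert pack_deltas and undo the differencing in one pass."""
    values, prev, u, shift = [], 0, 0, 0
    for b in data:
        u |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
            continue
        prev += (u >> 1) if u % 2 == 0 else -((u + 1) >> 1)
        values.append(prev)
        u, shift = 0, 0
    return values


print(delta_encode([1, 1, 2, 2, 2, 3, 5]))
```

With small differences every value costs one byte; a large jump simply spills into additional bytes instead of breaking the format.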
You want to preprocess your data -- reversibly transform it to some form that is better-suited to your back-end data compression method, first. The details will depend on both the back-end compression method, and (more critically) on the properties you expect from the data you're compressing.
In your case, zlib is a byte-wise compression method, but your data comes in (32-bit?) integers. You don't need to reimplement zlib yourself, but you do need to read up on how it works, so you can figure out how to present it with easily compressible data, or if it's appropriate for your purposes at all.
Zlib implements a form of Lempel-Ziv coding (LZ77 followed by Huffman coding). JPG and many others use Huffman coding for their backend. Run-length encoding is popular for many ad hoc uses. Etc., etc. ...
Perhaps the answer is to pre-filter the arrays in a way analogous to the Filtering used to create small PNG images. Here are some ideas right off the top of my head. I've not tried these approaches, but if you feel like playing, they could be interesting.
Break each of your ints up into 4 bytes, so i[0], i[1], i[2], ..., i[n] becomes b[0][0], b[0][1], b[0][2], b[0][3], b[1][0], b[1][1], b[1][2], b[1][3], ..., b[n][0], b[n][1], b[n][2], b[n][3]. Then write out all the b[i][0]s, followed by the b[i][1]s, b[i][2]s, and b[i][3]s. If most of the time your numbers differ only by a bit or two, you should get nice long runs of repeated bytes, which should compress really nicely using something like Run-length Encoding or zlib. This is my favourite of the methods I present.
If the integers in each array are closely-related to the one before, you could maybe store the original integer, followed by diffs against the previous entry - this should give a smaller set of values to draw from, which typically results in a more compressed form.
If you have various bits differing, you still may have largish differences, but if you're more likely to have large numeric differences that correspond to (usually) one or two bits differing, you may be better off with a scheme where you create a byte array: use the first 4 bytes to encode the first integer, and then for each subsequent entry, use 0 or more bytes to indicate which bits should be flipped, storing 0, 1, 2, ..., or 31 in the byte, with a sentinel (say 32) to indicate when you're done. This could reduce the raw number of bytes needed to represent an integer to something close to 2 on average, with most bytes coming from a limited set (0 - 32). Run that stream through zlib, and maybe you'll be pleasantly surprised.
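The first of these ideas, grouping the bytes of all integers by position, can be sketched like this. The sketch assumes unsigned 32-bit integers serialized little-endian, and the function names are hypothetical; the result is meant to be fed to zlib or a run-length coder afterwards.

```python
import struct


def split_byte_planes(ints) -> bytes:
    """Write byte 0 of every int, then byte 1 of every int, and so on."""
    raw = struct.pack(f"<{len(ints)}I", *ints)       # 4 bytes per integer
    return b"".join(raw[k::4] for k in range(4))     # one plane per byte position


def join_byte_planes(data: bytes):
    """Invert split_byte_planes."""
    n = len(data) // 4
    planes = [data[k * n:(k + 1) * n] for k in range(4)]
    raw = bytes(b for group in zip(*planes) for b in group)
    return list(struct.unpack(f"<{n}I", raw))


values = [0x11223344, 0x11223345, 0x11223347, 0x11223347]
planes = split_byte_planes(values)
print(planes.hex())
```

For slowly changing values, the three planes holding the higher-order bytes turn into long runs of identical bytes, which is the pattern byte-oriented compressors handle best.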
Did you try bzip2 for this?
http://bzip.org/
It's always worked better than zlib for me.
Since your concern is to reduce disk IO, you'll want to compress each integer array independently, without making reference to other integer arrays.
A common technique for your scenario is to store the differences, since a small number of differences can be encoded with short codewords. It sounds like you need to come up with your own coding scheme for differences, since they are multi-bit differences, perhaps using an 8 bit byte something like this as a starting point:
1 bit to indicate that a complete new integer follows, or that this byte encodes a difference from the last integer,
1 bit to indicate that there are more bytes following, recording more single bit differences for the same integer.
6 bits to record the bit number to switch from your previous integer.
If there are more than 4 bits different, then store the integer.
This scheme might not be appropriate if you also have a lot of completely different codes, since they'll take 5 bytes each now instead of 4.
"Zlib shrinks it by a factor of about 4x." means that a file of 100K now takes up negative 300K; that's pretty impressive by any definition :-). I assume you mean it shrinks it by 75%, i.e., to 1/4 its original size.
One possibility for an optimized compression is as follows (it assumes a 32-bit integer and at most 3 bits changing from element to element).
Output the first integer (32 bits).
Output the number of bit changes (n=0-3, 2 bits).
Output n bit specifiers (0-31, 5 bits each).
Worst case for this compression is 3 bit changes in every integer (2+5+5+5 bits) which will tend towards 17/32 of original size (46.875% compression).
I say "tends towards" since the first integer is always 32 bits but, for any decent-sized array, that first integer would be negligible.
Best case is a file of identical integers (no bit changes for every integer, just the 2 zero bits) - this will tend towards 2/32 of original size (93.75% compression).
Where you average 2 bits different per consecutive integer (as you say is your common case), you'll get 2+5+5 bits per integer which will tend towards 12/32 or 62.5% compression.
Your break-even point (if zlib gives 75% compression) is 8 bits per integer which would be
single-bit changes (2+5 = 7 bits) : 80% of the transitions.
double-bit changes (2+5+5 = 12 bits) : 20% of the transitions.
This means your average would have to be 1.2 bit changes per integer to make this worthwhile.
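The bit counts used in this estimate can be checked against real data with a short Python sketch. It only computes the size the scheme would produce, it does not emit the packed bits, and `encoded_bits` is a hypothetical name.

```python
def encoded_bits(ints) -> int:
    """Size in bits under the scheme above: 32 bits for the first integer,
    then a 2-bit change count plus 5 bits per changed bit position."""
    if not ints:
        return 0
    total = 32
    for prev, cur in zip(ints, ints[1:]):
        changes = bin(prev ^ cur).count("1")     # number of differing bits
        if changes > 3:
            raise ValueError("scheme assumes at most 3 bit changes per step")
        total += 2 + 5 * changes
    return total


data = [0b1000, 0b1000, 0b1001, 0b1011, 0b0011]
print(encoded_bits(data), "bits versus", 32 * len(data), "bits raw")
```

Running this over a sample of the actual arrays gives the average bits per integer directly, which can then be compared with the 8 bits per integer that zlib is said to reach.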
One thing I would suggest looking at is 7zip - this has a very liberal licence and you can link it with your code (I think the source is available as well).
I notice (for my stuff anyway) it performs much better than WinZip on a Windows platform so it may also outperform zlib.