Find a repeating symmetric bit pattern in a small stream of 128 bits - c++

How can I quickly scan groups of 128 bits that are exact repeating binary patterns, such as 010101... or 0011001100...?
I have a number of 128-bit blocks, and wish to see if they match the patterns where the number of 1s is equal to the number of 0s, e.g. 010101... or 00110011... or 0000111100001111... but NOT 001001001...
The problem is that patterns may not start on their boundary, so the pattern 00110011... may begin as 0110011..., and will end 1 bit shifted as well (note the 128 bits are not circular, so the start doesn't join to the end).
The 010101... case is easy, it is simply 0xAAAA... or 0x5555.... However, as the patterns get longer, the permutations get longer. Currently I use repeating shifting values such as outlined in the question Fastest way to scan for bit pattern in a stream of bits, but something quicker would be nice, as I'm spending 70% of all CPU in this routine. Other posters have solutions for general cases, but I am hoping the symmetric nature of my patterns might lead to something more optimal.
If it helps, I am only interested in patterns up to 63 bits long, and most interested in the power-of-2 patterns (0101... 00110011... 0000111100001111... etc.). Patterns such as 5 ones/5 zeros are present, but these non-power-of-2 sequences are less than 0.1% of cases, so they can be ignored if that helps the common cases go quicker.
Other constraints for a perfect solution would be a small number of assembler instructions and no wildly random memory access (i.e., large rainbow tables are not ideal).
Edit: more precise pattern details.
I am mostly interested in the patterns 0011 and 0000,1111 and 0000,0000,1111,1111 and 16 zeros/ones and 32 zeros/ones (commas for readability only), where each pattern repeats continuously within the 128 bits. Patterns whose repeating portion is not 2, 4, 8, 16 or 32 bits long are not as interesting and can be ignored (e.g. 000111...).
The complexity in scanning is that the pattern may start at any position, not just on the 01 or 10 transition. So, for example, all of the following would match the 4-bit repeating pattern of 00001111... (commas every 4th bit for readability, ellipses mean it repeats identically):
0000,1111... or 0001,1110... or 0011,1100... or 0111,1000... or 1111,0000... or 1110,0001... or 1100,0011... or 1000,0111
Within the 128 bits, the same pattern needs to repeat; two different patterns being present is not of interest. E.g. this is NOT a valid pattern: 0000,1111,0011,0011... as we have changed from 4 bits repeating to 2 bits repeating.
I have already verified that the number of 1s is 64, which is true for all power-of-2 patterns, and now need to identify how many bits make up the repeating pattern (2, 4, 8, 16, 32) and how much the pattern is shifted. E.g. pattern 0000,1111 is a 4-bit pattern, shifted 0, while 0111,1000... is a 4-bit pattern shifted 3.

Let's start with the case where the patterns do start on their boundary. You can check the first bit and use it to determine your state. Then start looping through your block: check the first bit, increment a count, left shift and repeat until you find that you've gotten the opposite bit. You can now use this initial length as the bitset length. Reset the count to 1, then count the next set of opposite bits. When you switch, check the length against the initial length and error out if they're not equal. Here's a quick function - it seems to work as expected for chars, and it shouldn't be too hard to expand it to deal with blocks of 16 bytes.
#include <stdio.h>

int findSetLen(unsigned char myblock)   /* e.g. myblock = 0x33 */
{
    unsigned char mask = 0x80, prod = 0x00;
    int setlen = 0, count = 0, ones = 0;

    prod = myblock & mask;
    if (prod == 0x80)
        ones = 1;
    for (int i = 0; i < 8; i++) {
        prod = myblock & mask;
        myblock = myblock << 1;
        if ((prod == 0x80 && ones) || (prod == 0x00 && !ones)) {
            count++;
        } else {
            if (setlen == 0) setlen = count;
            if (count != setlen) {
                printf("Bad block\n");
                return -1;
            }
            count = 1;
            ones = (ones == 1) ? 0 : 1;
        }
    }
    if (count != setlen) {   /* the final run must be checked too */
        printf("Bad block\n");
        return -1;
    }
    printf("Good block with %d repeating bits\n", setlen);
    return setlen;
}
Now to deal with blocks where there's an offset, I'd suggest counting the number of bits until the first 'flip'. Store this number, then run the above routine until you hit the last segment, which should have a length unequal to the rest of the sets. Add the initial bits to the last segment's length, and then you should be able to compare it with the size of the rest of the sets correctly.
This code is pretty small, and bit shifting through a buffer shouldn't require too much work on the CPU's part. I'd be interested to see how this solution ends up performing against your current one.
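Here's a hedged sketch of that offset handling, staying at char level to match the routine above. It restructures the work into measuring runs first, so treat it as an illustration of the idea rather than a drop-in extension (with only two runs visible in a single char the period cannot be confirmed; the real 128-bit case always shows at least three runs for units of 32 ones/zeros or less):

#include <stdio.h>

int setLenWithOffset(unsigned char b)
{
    int runs[8], nruns = 0, count = 1;
    for (int i = 6; i >= 0; i--) {              /* walk MSB -> LSB */
        int prev = (b >> (i + 1)) & 1;
        int cur  = (b >> i) & 1;
        if (cur == prev)
            count++;
        else {                                  /* a 'flip': close the run */
            runs[nruns++] = count;
            count = 1;
        }
    }
    runs[nruns++] = count;                      /* the final run */
    if (nruns < 3)
        return -1;                              /* no complete interior run */
    int setlen = runs[1];
    for (int i = 1; i < nruns - 1; i++)         /* interior runs must match */
        if (runs[i] != setlen)
            return -1;
    int head = runs[0], tail = runs[nruns - 1];
    if (head + tail == setlen ||                /* shifted: edges fold into one run */
        (head == setlen && tail == setlen))     /* unshifted: both edges are full   */
        return setlen;
    return -1;
}

/* setLenWithOffset(0x66) == 2 (0110,0110), setLenWithOffset(0x78) == 4 (0111,1000) */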

The generic solution for this kind of problem is to create a good hash function for the patterns and store each pattern in a hash map. Once you have the hash map created for the patterns, try to look up the input stream in the table. I don't have code yet, but let me know if you are stuck on the code; post it and I can work on it.
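For this particular problem, a hedged sketch of that idea can be tiny; it leans on the observation (used by other answers here) that every valid power-of-two pattern consists of two identical 64-bit halves, so the 64-bit half can serve as the key:

#include <cstdint>
#include <unordered_set>

// validHalves is filled once at startup with the 64-bit half of every
// valid shifted pattern (126 entries for the power-of-two cases);
// classification is then a single hash lookup. Names are illustrative.
std::unordered_set<uint64_t> validHalves;

bool matchesSomePattern(uint64_t hi, uint64_t lo) {
    return hi == lo && validHalves.count(lo) != 0;
}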

I've thought about making a state machine, so every next byte (out of 16) would advance its state, and after some 16 state transitions you'd have the pattern identified. But that doesn't look very promising: the data structures and logic look more complex.
Instead, why not precompute all those 126 patterns (from 01 up to 32 zeroes + 32 ones), sort them and perform a binary search? That would give you at most 7 iterations of the binary search. And you don't need to store all 16 bytes of every pattern, as its halves are identical. That gives you 126*16/2 = 1008 bytes for the array of patterns. You also need something like 2 bytes per pattern to store the length of the zero (one) runs and the shift relative to whatever pattern you consider unshifted. That's a total of 126*(16/2+2) = 1260 bytes of data (which should be gentle on the data cache) and a very simple and tiny binary search algorithm. Basically, it's just an improvement over the answer that you mentioned in the question.
You might want to try switching to linear search after 4-5 iterations of binary search. That may give a small boost to the overall algorithm.
Ultimately, the winner is determined by testing/profiling. And that's what you should do, get a few implementations and compare them on the real data in the real system.
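A sketch of how that table and lookup might look (assuming, per the question's edit, only the power-of-two unit sizes, so each valid 128-bit block has two identical 64-bit halves and there are 2+4+8+16+32+64 = 126 shifted patterns; the shift field counts rotations of the base pattern, and mapping it to the question's textual shift depends on bit order):

#include <algorithm>
#include <cstdint>
#include <vector>

struct Pattern {
    uint64_t half;   // one 64-bit half of the 128-bit pattern
    uint8_t  k;      // run length (1, 2, 4, 8, 16, 32)
    uint8_t  shift;  // rotation relative to the base pattern
};

std::vector<Pattern> buildTable() {
    std::vector<Pattern> table;
    for (int k = 1; k <= 32; k *= 2) {
        // Base pattern: k zeros then k ones, repeated across 64 bits.
        uint64_t base = 0;
        for (int bit = 0; bit < 64; ++bit)
            if ((bit / k) & 1) base |= uint64_t(1) << bit;
        for (int s = 0; s < 2 * k; ++s) {
            // Rotating the periodic half by s equals shifting the
            // infinite pattern by s (the period divides 64).
            uint64_t rot = s ? (base << s) | (base >> (64 - s)) : base;
            table.push_back({rot, uint8_t(k), uint8_t(s)});
        }
    }
    std::sort(table.begin(), table.end(),
              [](const Pattern& a, const Pattern& b) { return a.half < b.half; });
    return table;
}

// Returns the matching pattern, or nullptr.
const Pattern* classify(const std::vector<Pattern>& table,
                        uint64_t hi, uint64_t lo) {
    if (hi != lo) return nullptr;            // halves must be identical
    auto it = std::lower_bound(table.begin(), table.end(), lo,
        [](const Pattern& p, uint64_t v) { return p.half < v; });
    return (it != table.end() && it->half == lo) ? &*it : nullptr;
}

A plain sorted array plus std::lower_bound keeps the memory access pattern predictable, in line with the question's "no large rainbow tables" constraint.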

The restriction that the pattern repeats all through the 128-bit stream makes the number of combinations limited, and the sequence will have properties that make it easy to check:
One needs to iteratively check whether the high and low parts are the same; if instead they are opposites, check whether that particular length consists of consecutive ones.
8-bit repeat at offset 3: 00011111 11100000 00011111 11100000
==> high and low 16 bits are the same
00011111 11100000 ==> high and low parts are inverted.
Neither the same nor inverted means the pattern is rejected.
At that point one needs to check whether the ones form a single consecutive run: add the lowest set bit to n and check whether the result is a power of two; n == (n & -n) is the textbook check for that.
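A hedged C++ sketch of this halving check, for a 128-bit block held as two 64-bit words. It assumes the block already passed the question's 64-ones popcount test and that the repeating unit is at most 32 zeros + 32 ones (so a valid block's two 64-bit halves are identical); under those assumptions, a balanced window whose ones form one circular run is exactly a shifted unit, which is why no separate inverted-halves test is needed:

#include <cstdint>

// True when the ones of v (within `width` bits) form one contiguous run,
// counting wrap-around across the window edges. Assumes v is neither all
// zeros nor all ones within the window.
bool isCircularRun(uint64_t v, int width) {
    uint64_t mask = (width == 64) ? ~uint64_t(0) : (uint64_t(1) << width) - 1;
    v &= mask;
    uint64_t m = v + (v & (~v + 1));             // adding the lowest set bit
    if ((m & (m - 1) & mask) == 0) return true;  // ...closes a single run
    uint64_t c = ~v & mask;                      // run may wrap: test the zeros
    uint64_t mc = c + (c & (~c + 1));
    return (mc & (mc - 1) & mask) == 0;
}

// Returns the period of the repeating unit (2, 4, ..., 64; twice the run
// length the question counts), or -1 if the block is not a valid pattern.
int classifyPeriod(uint64_t hi, uint64_t lo) {
    if (hi != lo) return -1;     // period divides 64 for all units of interest
    for (int width = 64; width >= 2; width /= 2) {
        uint64_t half = ((width == 64) ? ~uint64_t(0)
                                       : (uint64_t(1) << width) - 1) >> (width / 2);
        uint64_t top = (lo >> (width / 2)) & half;
        uint64_t bot = lo & half;
        if (top == bot) continue;                // still periodic: halve again
        // Halves differ: this width must be the (shifted) unit.
        return isCircularRun(lo, width) ? width : -1;
    }
    return -1;                                   // constant word: rejected
}

The shift can then be recovered from the position of the first transition in the bottom width bits, e.g. with a count-trailing-zeroes.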

Related

How to negate only least significant bits?

Starting from the position of the highest order 1 bit, how can I negate that bit and all lower order bits?
Example (C#)
int inputNumber = 0b_10001;
inputNumber = ~inputNumber; // bitwise complement
int expectedNumber = 0b_000000000000000000000000000_01110;
uint actualNumber = 0b_111111111111111111111111111_01110; // almost, but not quite
Details
0b is just a marker to start writing the number in binary. The _ is just a separator to make it visually easier to follow (but still valid C# code). Example: 0b_1010_1111 is actually 10101111 in binary.
I feel I'm close to the solution - I need to make a mask to get rid of the unwanted bits, but I'm not sure how.
How can I get from 10001 to 01110, basically negating each bit, but without having the leading 1s?
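One way to build the mask the question asks for, hedged as a sketch (C++ shown for illustration; the same shift cascade works in C#): smear the highest set bit into all lower positions, then XOR flips exactly those bits.

#include <cstdint>
#include <cstdio>

uint32_t negateThroughHighestOne(uint32_t n) {
    uint32_t mask = n;
    mask |= mask >> 1;    // copy the highest set bit downward...
    mask |= mask >> 2;
    mask |= mask >> 4;
    mask |= mask >> 8;
    mask |= mask >> 16;   // ...until every lower bit is set too
    return n ^ mask;      // flip only the covered bits
}

int main() {
    printf("%x\n", negateThroughHighestOne(0b10001));  // prints e (0b01110)
}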

How to calculate the set bit positions in a number?

If n = 100011 in binary, then I want to retrieve the positions of the set bits, which in this case are 1, 5, 6 when measured from left to right.
How can I calculate such positions without literally checking bit by bit whether it is zero or not, going over every bit position?
In the most common convention, a binary number is written in the same order as a number in other common positional representations (decimal etc.): with the least significant digit in the rightmost position. It also makes more sense to label that digit as "digit zero", so that the label of every digit corresponds to the exponent in the associated weight (e.g. bit 0 has weight 2^0 = 1, and so forth). This doesn't really matter, as it's easy enough to re-number the digits, but it's usually easier to follow the conventions.
Since you asked
How to calculate such positions without literally checking for bit is zero or not by going to every bit position?
I will address that portion of the question. Checking the bits one by one is not completely disastrous, however, even for BigInts. The number of results could be as high as the number of bits anyway. For numbers known to be sparse, there is still not much that can be done: every bit has to be checked somehow, because if any bit were ignored completely, that bit might have been set and we'd miss it. But in the context of a machine word, there are tricks, for example based on find-first-set.
Using the find-first-set function (or count trailing zeroes), the index of the set bit with the lowest index can be found in one step (if you accept this function as being one step, which is a reasonable assumption on most hardware, and in theory you can just define it to be one step), and then that bit can be removed so the next find-first-set will find the index of the next bit. For example:
while bitmask != 0:
    yield return find-first-set(bitmask)
    bitmask &= bitmask - 1  // remove lowest set bit
This is easy to adapt to BigInts, just do this on every limb of the number and add the appropriate offset.
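In C++, the same loop might look like this, using the GCC/Clang builtin __builtin_ctzll as the find-first-set step (C++20's std::countr_zero in <bit> is a portable alternative); note it reports positions counted from the least significant bit, per the convention discussed above:

#include <cstdint>
#include <cstdio>

void printSetBitPositions(uint64_t bitmask) {
    while (bitmask != 0) {
        printf("%d\n", __builtin_ctzll(bitmask));  // index of the lowest set bit
        bitmask &= bitmask - 1;                    // remove that bit
    }
}

// printSetBitPositions(0b100011) prints 0, 1, 5.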
To do that you use masks.
Each position from right to left is a power of two.
For example 0101 is 1*2^0 + 0*2^1 + 1*2^2 + 0*2^3 = 1 + 0 + 4 + 0 = 5
Then to check whether these two bits are on in a byteToTest variable, you AND it with 5: byteToTest & 5 == 5
Given that 1 & 0 = 0 and 1 & 1 = 1
If bytesToTest is 1111 then 1111 & 0101 will give 0101
If bytesToTest is 1010 then 1010 & 0101 will give 0000
Following this reasoning for the particular case of 100011
To retrieve 1, 5, and 6 from left to right (the three bits set to 1)
The mask is: 1+2+32 = 35
With this information you should be able to define individual masks for each bit, test them one by one, and answer in which positions you find bits that are on and in which positions bits that are off.
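In code, that test is just (C++ shown; binary literals need C++14):

unsigned byteToTest = 0b100011;             // the example number
unsigned mask = 1 + 2 + 32;                 // = 35, the three target bits
bool allOn = (byteToTest & mask) == mask;   // true: bits 1, 5 and 6 (from the left) are set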

Find all partial matches to vector of unsigned

For an AI project of mine, I need to apply to a factored state all rules that apply to its partial components. This needs to be done very frequently so I'm looking for a way to make this as fast as possible.
I'm going to describe my problem with strings, however the true problem works in the same way with vectors of unsigned integers.
I have a bunch of entries (of length N) like this which I need to store in some way:
__a_b
c_e__
___de
abcd_
fffff
__a__
My input is a single entry ciede for which I must find, as fast as possible, all stored entries which match it. For example, in this case the matches would be c_e__ and ___de. Removal and adding of entries should be supported, however I don't care how slow that is. What I would like to be as fast as possible is:
for ( const auto & entry : matchedEntries(input) )
My problem, as I said, is one where each letter is actually an unsigned integer, and the vector is of an unspecified (but known) length. I have no requirements for how entries should be stored, or what type of metadata is going to be associated with them. The naive algorithm, matching the input against every stored entry, is linear in the number of entries; is it possible to do better? The number of reasonable entries I need stored is <= 100k.
I'm thinking some kind of sorting might help, or some weird-looking tree structure, but I can't seem to figure out a good way to approach this problem. It also looks like something word processors already need to do, so someone might be able to help.
The easiest solution is to build a trie containing your entries. When searching the trie, you start in the root and recursively follow any edge that matches the next character of your input. There will be at most two such edges in each node: one for the wildcard _ and one for the actual letter.
In the worst case you have to follow two edges from each node, which adds up to O(2^n) complexity, where n is the length of the input, while the space complexity is linear.
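A compact sketch of such a trie (the 26-letter alphabet, the node layout and all names are illustrative assumptions):

#include <memory>
#include <string>
#include <vector>

struct Node {
    std::unique_ptr<Node> kids[27];   // 26 letters plus '_' at index 26
    std::vector<int> ids;             // ids of entries that end at this node
};

void insert(Node* root, const std::string& entry, int id) {
    Node* n = root;
    for (char c : entry) {
        int k = (c == '_') ? 26 : c - 'a';
        if (!n->kids[k]) n->kids[k] = std::make_unique<Node>();
        n = n->kids[k].get();
    }
    n->ids.push_back(id);
}

// Follow at most two edges per input character: the exact letter and '_'.
void match(const Node* n, const std::string& input, size_t pos,
           std::vector<int>& out) {
    if (!n) return;
    if (pos == input.size()) {
        out.insert(out.end(), n->ids.begin(), n->ids.end());
        return;
    }
    match(n->kids[input[pos] - 'a'].get(), input, pos + 1, out);
    match(n->kids[26].get(), input, pos + 1, out);
}

With the entries above, match(&root, "ciede", 0, result) would collect the ids of c_e__ and ___de.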
A different approach would be to preprocess the entries to allow for linear search. This is basically what compiling regular expressions does. For your example, consider the following regular expression, which matches your desired input:
(..a.b|c.e..|...de|abcd.|fffff|..a..)
This expression can be implemented as a nondeterministic finite state automaton (NFSA), with the initial state having ε-moves to a deterministic automaton for each of the single entries. This NFSA can then be turned into a deterministic FSA, using the standard powerset construction.
Although this construction can increase the number of states substantially, searching the input word can then be done in linear time, simply simulating the deterministic automaton.
Below is an example for entries ab, a_, ba, _a and __. First start with a nondeterministic automaton, which upon removing ε-moves and joining equivalent states is actually a trie for the set.
Then turn it into a deterministic machine, with states corresponding to subsets of states of the NFSA (each state's name below lists the NFSA states it stands for). Start in state 0, and for each edge other than _, create the next state as the union of the states of the original machine that are reachable from any state in the current set.
For example, when the DFSA is in state 16, that means the NFSA could be in either state 1 or state 6. Upon a transition on a, the NFSA could get to states 3 (from 1), 7 or 8 (from 6) - that set will be your next state in the DFSA.
The standard construction would preserve the _-edges, but we can omit them, as long as the input does not contain _.
Now if you have the word ab on the input, you simulate this automaton (i.e. traverse its transition graph) and end up in state 238, from which you can easily recover the original entries.
Store the data in a tree, where the 1st layer represents the 1st element (character or integer), and so on. This means the tree will have a constant depth of 5 (excluding the root) in your example. Don't treat wildcards ("_") specially at this point; just store them like the other elements.
When searching for the matches, traverse the tree by doing a breadth-first search and dynamically build up your result set. Whenever you encounter a wildcard, add another element to your result set for all other nodes of this layer that do not match. If no subnode matches, remove the entry from your result set.
You should also skip redundant entries when building up the tree: in your example, __a_b is redundant, because whenever it matches, __a__ matches as well.
I've got an algorithm in mind which I plan to implement and benchmark, but I'll describe the approach already. It needs n_templates * template_length * n_symbols bits of storage (so for 100k templates of length 100 and 256 distinct symbols it needs 2.56 * 10^9 bits = 320 MB of RAM). This does not scale nicely to a large number of symbols unless a succinct data structure is used.
A query takes O(n_templates * template_length) bit operations, processed 64 templates at a time, so it should perform quite well thanks to the bit-wise operations.
Let's say we have the given set of templates:
__a_b
c_e__
___de
abcd_
_ied_
bi__e
The set of symbols is abcdei, for each symbol we pre-calculate a bit mask indicating whether the template differs from the symbol at that location or not:
aaaaa bbbbb ccccc ddddd eeeee iiiii
....b ..a.. ..a.b ..a.b ..a.b ..a.b
c.e.. c.e.. ..e.. c.e.. c.... c.e..
...de ...de ...de ....e ...d. ...de
.bcd. a.cd. ab.d. abc.. abcd. abcd.
.ied. .ied. .ied. .ie.. .i.d. ..ed.
bi..e .i..e bi..e bi..e bi... b...e
Same tables expressed in binary:
aaaaa bbbbb ccccc ddddd eeeee iiiii
00001 00100 00101 00101 00101 00101
10100 10100 00100 10100 10000 10100
00011 00011 00011 00001 00010 00011
01110 10110 11010 11100 11110 11110
01110 01110 01110 01100 01010 00110
11001 01001 11001 11001 11000 10001
These are stored in columnar order, 64 templates per unsigned integer. To determine which templates match ciede, we check the 1st column of the c table, the 2nd column of the i table, the 3rd of e, and so forth:
ciede ciede
__a_b ..a.b 00101
c_e__ ..... 00000
___de ..... 00000
abcd_ abc.. 11100
_ied_ ..... 00000
bi__e b.... 10000
We find matching templates as rows of zeros, which indicate that no differences were found. We can check 64 templates at once, and the algorithm itself is very simple (Python-like code):
for i_block in range(n_templates // 64):
    mask = 0
    for i in range(template_length):
        # Accumulate difference-indicating bits
        mask |= tables[i_block][word[i]][i]
        if mask == 0xFFFFFFFFFFFFFFFF:
            # All 64 templates differ, we can stop early
            break
    for i in range(64):
        if mask & (1 << i) == 0:
            print('Match at template', i_block * 64 + i)
As I said I haven't yet actually tried implementing this, so I have no clue how fast it is in practice.

Hamming SEC/DED extra parity bit

I'm having some trouble with the SEC/DED error correction code. It seems I've found some cases in which the decoder thinks a double bit flip occurred when only one really occurred. I suppose I did something wrong, but I was not able to understand what.
Let me show you an example.
Suppose I want to encode the 4 bits 1011 using a (7,4) code plus an extra bit needed to perform two-error detection. The coded word should be 00110011, where the most significant bit is the extra parity bit, the following two are p0 and p1, and so on.
Now, let's suppose that during a transmission the least significant bit is flipped; thus the received word will be 00110010. The receiver will extract from this code the four received data bits 1010 and will construct a new code, which will result in 01011010. Finally, the receiver will perform a bitwise xor of the two codes, obtaining 0111. The last three bits say that bit 7 has been flipped (which is right), but the first bit is 0 and, as far as I know, the decoder should consider this situation as if more than one bit flip had occurred.
What did I do wrong?
I think I've solved the problem.
In the example above I calculated the syndrome and then computed a new overall parity bit from the resulting codeword. Instead, I should check the overall parity of the received word, set the error_happened boolean to that value, and then calculate the syndrome.
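A minimal decode sketch showing that ordering, for an assumed (8,4) layout with the overall parity bit at the top and the Hamming parity bits at positions 1, 2 and 4 (1-based, counted from the least significant bit; the question's layout is mirrored, but only the order of the checks matters here):

#include <cstdint>
#include <iostream>

int parity(uint8_t v) {                  // parity of all bits of v
    v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return v & 1;
}

int bit(uint8_t w, int pos) {            // 1-based position from the LSB
    return (w >> (pos - 1)) & 1;
}

void decode(uint8_t received) {
    // Syndrome: check i covers the positions whose index has bit i set.
    int s1 = bit(received, 1) ^ bit(received, 3) ^ bit(received, 5) ^ bit(received, 7);
    int s2 = bit(received, 2) ^ bit(received, 3) ^ bit(received, 6) ^ bit(received, 7);
    int s4 = bit(received, 4) ^ bit(received, 5) ^ bit(received, 6) ^ bit(received, 7);
    int syndrome = s1 | (s2 << 1) | (s4 << 2);

    // The fix described above: take the overall parity of the *received*
    // word, before any correction is applied.
    int errorHappened = parity(received);

    if (syndrome == 0 && !errorHappened)
        std::cout << "no error\n";
    else if (errorHappened)              // odd number of flips: assume one
        std::cout << "single error at position "
                  << (syndrome ? syndrome : 8) << ", correctable\n";
    else                                 // nonzero syndrome, even parity
        std::cout << "double error detected, not correctable\n";
}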

Compression algorithms for numbers only

I am to compress location data (latitude, longitude, date, time). All the numbers are in fixed format. Two of them (latitude, longitude) have a decimal format. The other two are integers.
These numbers are currently in fixed-format strings.
What are the algorithms for compressing numbers in fixed format?
Are number-only compressions (if there are any) better than string compression?
Should I compress the string directly, or convert it to numbers first and then compress?
Thanks in advance.
This is one of these places where a little theory is helpful. You need to think about several things:
what is the resolution of your measurements: 0.1° or 0.001°? 1 second or one microsecond?
are the measurements associated and in some order, or tossed together randomly?
Let's say, just for example, that the resolution is 0.01°. Then you know that your values range from -180° to +180°, or 36000 different values. lg(36000) ≈ 15.1, so you need 16 bits; 15 bits for -90° to +90°. Clearly, if you're storing this kind of value as floating-point, you can compress the data by half immediately.
Similarly with date and time: what's the range, and how many bits must you have?
Now, if the data is in some order (like samples taken sequentially aboard a single ship) then all you need is a start value and a delta; that can make a big difference. With a ship traveling at 30 knots, the position can't change any more than about 0.5 degrees an hour, or about 0.00014 degrees a second. Those deltas are going to be very small values, so you can store them in very few bits.
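A sketch of that start-plus-delta idea for an ordered track (positions are assumed pre-scaled to integer hundredths of a degree, the resolution used in the example above; all names are illustrative):

#include <cstdint>
#include <vector>

struct Track {
    int32_t start;                 // first position, full precision
    std::vector<int8_t> deltas;    // tiny per-sample changes
};

Track deltaEncode(const std::vector<int32_t>& positions) {
    // Assumes at least one sample.
    Track t{positions.front(), {}};
    for (size_t i = 1; i < positions.size(); ++i)
        // By the speed argument above, consecutive samples differ by only
        // a few hundredths of a degree, so a signed byte suffices here.
        t.deltas.push_back(int8_t(positions[i] - positions[i - 1]));
    return t;
}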
The point is that there are a number of things you can do, but you have to know more about the data than we do to make a recommendation.
Update: Oh, wait, fixed point strings?!
Okay, this is (relatively) easy. Just to start with, yes, you want to convert your strings into some binary representation. Just making up a data item, you might have
040.00105.0020090518212100Z
which you could convert to
| 4000 | short int, 16 bits |
| 10500 | short int, 16 bits |
| 20090518212100Z | 64 bits |
So that's 96 bits, 12 bytes versus 26 bytes.
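One possible packed layout for that made-up record, as a sketch (the field widths follow the table above; the timestamp encoding is left abstract):

#include <cstdint>

#pragma pack(push, 1)
struct Fix {
    int16_t  lat;    // e.g. hundredths of a degree
    int16_t  lon;    // e.g. hundredths of a degree
    uint64_t time;   // e.g. "20090518212100Z" mapped to seconds since an epoch
};
#pragma pack(pop)
static_assert(sizeof(Fix) == 12, "96 bits, as in the text");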
Compression typically works on a byte stream. When a stream has a non-uniform distribution of byte values (for instance text, or numbers stored as text), the compression ratio you can achieve will be higher, since fewer bits are used to store the bytes which appear more frequently (in Huffman compression).
Typically, the data you are talking about will simply be stored as binary numbers (not text), and that's usually space and retrieval efficient.
I recommend you have a look at The Data Compression Book
What kind of data are you compressing? How is it distributed? Is it ordered in any way? All of these things can affect how well it compresses, and perhaps allow you to convert the data in to something more easily compressed, or simply smaller right out the gate.
Data compression works poorly on "random" data. If your data is within a smaller range, you may well be able to leverage that.
In truth, you should simply try running any of the common algorithms and see if the data is "compressed enough". If not, and you know more about the data than can be "intuited" by the compression algorithms, you should leverage that information.
An example: say that your data is not just lats and longs, but they're assumed to be "close" to each other. Then you could probably store an "origin" lat and long, and the rest as differentials. Perhaps these differences are small enough to fit into a single signed byte.
That's just a simple example of things you can do with knowledge of the data vs what some generic algorithm may not be able to figure out.
It depends on what you are going to do with the data, and how much precision you need.
Lat/long is traditionally given in degrees, minutes, and seconds, with 60 seconds to the minute, 60 minutes to the degree, and 1 degree of latitude nominally equal to 60 nautical miles (nmi). 1 minute is then 1 nmi, and 1 second is just over 100 ft.
Latitude goes from -90 to +90 degrees. Representing latitude as integer seconds gives you a range of -324000..+324000, or about 20 bits. Longitude goes -180 to +180, so representing longitude the same way requires 1 more bit.
So you can represent a complete lat/long position, to +/- 50 ft, in 41 bits.
Obviously, if you don't need that much precision, you can back down your bit count.
Observe that a traditional single-precision 32-bit float uses about 24 bits of mantissa, so you are down to about +/- 6 feet if you just convert your lat/long in seconds to float. It is kind of hard to beat two single-precision floats for this kind of thing.
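A sketch of the 41-bit packing counted out above: bias the signed second counts so they are non-negative, then pack 20 + 21 bits into one 64-bit word (names are illustrative):

#include <cstdint>

uint64_t packLatLon(int32_t latSeconds, int32_t lonSeconds) {
    uint64_t lat = uint64_t(latSeconds + 324000);   // 0..648000 fits in 20 bits
    uint64_t lon = uint64_t(lonSeconds + 648000);   // 0..1296000 fits in 21 bits
    return (lat << 21) | lon;
}

void unpackLatLon(uint64_t packed, int32_t& latSeconds, int32_t& lonSeconds) {
    lonSeconds = int32_t(packed & ((uint64_t(1) << 21) - 1)) - 648000;
    latSeconds = int32_t(packed >> 21) - 324000;
}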
Depending on the available characters, you could make something quite easily.
For example, if the input is only digits (0..9), here's a solution that will encode and decode them, in Kotlin (similar code would work in Java):
fun encodeDigitsOnlyString(stringWithDigitsOnly: String): ByteArray {
    // We couple each 2 digits together into a single byte.
    // For the last digit, if it has no digit to pair with, it's paired with something that's not a digit.
    val result = ArrayList<Byte>()
    val length = stringWithDigitsOnly.length
    var lastDigit: Byte? = null
    for (i in 0 until length) {
        val char = stringWithDigitsOnly[i]
        val digitAsByte = char.toString().toInt().toByte()
        if (lastDigit == null) {
            if (i == length - 1) {
                // last digit
                val newByte = (digitAsByte + 0xf0).toByte()
                result.add(newByte)
            } else {
                // more to go
                lastDigit = digitAsByte
            }
        } else {
            val newByte = (digitAsByte + lastDigit.toInt().shl(4)).toByte()
            result.add(newByte)
            lastDigit = null
        }
    }
    return result.toByteArray()
}

fun decodeByteArrayToDigitsOnlyString(encodedDigitsOnlyByteArray: ByteArray): String {
    val sb = StringBuilder(encodedDigitsOnlyByteArray.size * 2)
    for (byte in encodedDigitsOnlyByteArray) {
        val hex = Integer.toHexString(byte.toInt()).takeLast(2).padStart(2, '0')
        if (hex[0].isLetter())
            sb.append(hex.last())
        else
            sb.append(hex)
    }
    return sb.toString()
}
Example usage:
val inputString="12345"
val byteArray=encodeDigitsOnlyString(inputString) //produces a byte array of size 3
val outputString=decodeByteArrayToDigitsOnlyString(byteArray) //should be the same as the input