How to implement regular expression NFA with character ranges? - regex

When you read posts such as Regex: NFA and Thompson's algorithm, everything looks rather straightforward, until you realize that in real life you need not only literal characters like "7" or "b", but also:
[A-Z]
[^_]
.
namely character classes (or ranges). And thus my question: how do you build an NFA using character ranges? By using meta-characters like "not A" and "anything else" and then computing the overlapping ranges? That would lead to a tree-like structure in the final automaton, instead of a simple table.
Update: please assume a non-trivial alphabet (size >> 256).
I am asking about the NFA, but later I would like to convert the NFA to a DFA as well.

The simplest approach would be:
Use segments as labels for transitions in both the NFA and the DFA. For example, the range [a-z] would be represented as the segment [97, 122]; the single character 'a' would become [97, 97]; and the any-character '.' would become [minCode, maxCode].
Each negated range [^a-z] results in two transitions from the starting state to the next state. In this example the two transitions [minCode, 96] and [123, maxCode] should be created.
When a range is given by enumerating characters, as in [abcz], either a transition per character can be created, or the code might first group the characters into ranges to reduce the number of transitions. So [abcz] would become [a-c]|z: two transitions instead of four.
This is enough for the NFA. However, the classical powerset construction for transforming an NFA into a DFA will not work when there are transitions with intersecting character ranges.
To solve this issue, two extra steps are required. Step 1: once the set of all input symbols is created (in our case it will be a set of segments), transform it into a set of non-intersecting segments. This can be done in O(n*log(n)) time, where n is the number of segments, using a priority queue (PQ) in which segments are ordered by their left endpoint. Example:
Procedure DISJOIN:
Input <- [97, 99] [97, 100] [98, 108]
Output -> [97, 97] [98, 99], [100, 100], [101, 108]
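A small sketch of DISJOIN (my own code; for brevity it checks segment coverage naively instead of using the O(n*log(n)) priority-queue approach described above):

    # Split possibly overlapping segments into non-intersecting segments
    # covering the same code points.
    def disjoin(segments):
        points = set()
        for lo, hi in segments:
            points.add(lo)       # a left end starts a new piece
            points.add(hi + 1)   # a right end closes a piece just after it
        points = sorted(points)
        result = []
        for lo, hi in zip(points, points[1:]):
            # keep only pieces covered by at least one input segment
            if any(a <= lo and hi - 1 <= b for a, b in segments):
                result.append((lo, hi - 1))
        return result

    print(disjoin([(97, 99), (97, 100), (98, 108)]))
    # -> [(97, 97), (98, 99), (100, 100), (101, 108)]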
Step 2: to create new transitions from a "set state", the algorithm should be modified as follows:
for each symbol in DISJOIN(input symbols)
    S <- empty set of symbols
    T <- empty "set state"
    for each state in "set state"
        for each transition in state.transitions
            I <- intersection(symbol, transition.label)
            if I is not empty
                add I to the set S
                add transition.To to T
    for each segment in DISJOIN(S)
        create a transition labeled with the segment from "set state" to T
To speed up matching, when searching for a transition on an input symbol C, the transitions of each state can be kept sorted by segment so that binary search can be applied.
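A sketch of that lookup (my own code), using Python's bisect over the left endpoints of the already-disjoint segments of a state:

    import bisect

    # transitions of one state, sorted by the left end of their segments
    transitions = [((97, 97), 'S1'), ((98, 99), 'S2'), ((101, 108), 'S3')]
    lefts = [seg[0] for seg, _ in transitions]

    def find_transition(c):
        i = bisect.bisect_right(lefts, c) - 1
        if i >= 0:
            (lo, hi), target = transitions[i]
            if lo <= c <= hi:
                return target
        return None        # no transition on this symbol

    print(find_transition(ord('b')))   # 98 falls in [98, 99] -> 'S2'
    print(find_transition(100))        # gap between segments -> None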

Related

Formulation of language and regular expressions

I can't figure out the formal language and regular expression of this automaton:
[Image: DFA automaton]
I know that the number of occurrences of 'a' and of 'b' each has to be even.
At first I thought the language was:
L = { a^i b^j | i mod 2 = j mod 2 = 0, i, j >= 0 }
But the automaton can also start with 'b', so that language is incorrect.
Also, the regular expression I found, ((aa)* + (bb)), doesn't match either -- it can't produce abab, for example.
The regex I got by progressively ripping out nodes (order: 3,1,2,0) is:
(aa|bb|(ab|ba)(bb|aa)*(ab|ba))*
As far as I can tell, that's the simplest it gets. (I'd love to know if anyone has a simpler reduction -- I'm actually taking a test on this stuff this week!)
Step-by-step process
We start off by adding a new start and accept state. Every old accept state (in this case, there's only one) gets linked to the new accept state with an ε transition:
Next, we rip out state 3. We need to preserve all paths that run through state 3. In this case we've added a path from state 0 back to itself, paths from state 0 to state 2, and a path from state 2 back to itself:
We do the same with state 1:
We can simplify this a bit: we'll concatenate the looping-back transitions with commas. (At the end, these will turn into the union operator: | or ⋃, etc., depending on your notation.)
We'll remove state 2 next, and get everything smooshed onto one big loop:
Loops become stars; then we remove the last state so that we just have a transition from the start state to the accept state labeled with one big regular expression:
And that's our regular expression!
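If you want to double-check the result, here is a brute-force comparison against the parity condition (my own test, not part of the original answer):

    import re
    from itertools import product

    pattern = re.compile(r'(aa|bb|(ab|ba)(bb|aa)*(ab|ba))*')

    for n in range(8):
        for w in map(''.join, product('ab', repeat=n)):
            even = w.count('a') % 2 == 0 and w.count('b') % 2 == 0
            # the regex should accept exactly the even/even strings
            assert bool(pattern.fullmatch(w)) == even, w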
Language definition
You're pretty close with the language definition. If you can allow something a little looser, it would be this:
L = { w | w contains an even number of 'a's and 'b's }
The problem with your definition is that it makes every string start with all of its a's, whereas the only restriction is on the parity of the number of a's and b's.

Find all partial matches to vector of unsigned

For an AI project of mine, I need to apply to a factored state all rules that apply to its partial components. This needs to be done very frequently so I'm looking for a way to make this as fast as possible.
I'm going to describe my problem with strings, however the true problem works in the same way with vectors of unsigned integers.
I have a bunch of entries (of length N) like this which I need to store in some way:
__a_b
c_e__
___de
abcd_
fffff
__a__
My input is a single entry, e.g. ciede, for which I must find, as fast as possible, all stored entries that match it. For example, in this case the matches would be c_e__ and ___de. Adding and removing entries should be supported, however I don't care how slow those are. What I would like to be as fast as possible is:
for ( const auto & entry : matchedEntries(input) )
My problem, as I said, is one where each letter is actually an unsigned integer, and the vector is of an unspecified (but known) length. I have no requirements for how entries should be stored, or what type of metadata is going to be associated with them. The naive algorithm of matching against every entry is linear in the number of entries; is it possible to do better? The number of entries I need stored is <= 100k.
I'm thinking some kind of sorting might help, or some weird-looking tree structure, but I can't seem to figure out a good way to approach this problem. It also looks like something word processors already need to do, so someone might be able to help.
The easiest solution is to build a trie containing your entries. When searching the trie, you start at the root and recursively follow every edge that matches the current character of your input. There will be at most two such edges in each node, one for the wildcard _ and one for the actual letter.
In the worst case you have to follow two edges from each node, which would add up to O(2^n) complexity, where n is the length of the input, while the space complexity is linear.
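A minimal sketch of this approach (my own code; class and function names are made up):

    class TrieNode:
        def __init__(self):
            self.children = {}   # maps a letter or '_' to the next node
            self.entry = None    # stored entry, set on terminal nodes

    def insert(root, entry):
        node = root
        for ch in entry:
            node = node.children.setdefault(ch, TrieNode())
        node.entry = entry

    def matches(node, word, i=0):
        # Yield all stored entries matching word; '_' matches any letter.
        if i == len(word):
            if node.entry is not None:
                yield node.entry
            return
        for ch in (word[i], '_'):            # at most two viable edges
            child = node.children.get(ch)
            if child is not None:
                yield from matches(child, word, i + 1)

    root = TrieNode()
    for e in ["__a_b", "c_e__", "___de", "abcd_", "fffff", "__a__"]:
        insert(root, e)
    print(list(matches(root, "ciede")))   # -> ['c_e__', '___de']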
A different approach would be to preprocess the entries to allow for linear-time search. This is basically what compiling regular expressions does. For your example, consider the following regular expression, which matches your desired input:
(..a.b|c.e..|...de|abcd.|fffff|..a..)
This expression can be implemented as a nondeterministic finite state automaton (NFSA), with the initial state having ε-moves to a deterministic automaton for each of the single entries. This NFSA can then be turned into a deterministic FSA using the standard powerset construction.
Although this construction can increase the number of states substantially, searching the input word can then be done in linear time, simply simulating the deterministic automaton.
Below is an example for the entries ab, a_, ba, _a and __. First start with a nondeterministic automaton which, upon removing ε-moves and joining equivalent states, is actually a trie for the set.
Then turn it into a deterministic machine, with states corresponding to subsets of states of the NFSA. Start in state 0 and, for each edge other than _, create the next state as the union of the states of the original machine that are reachable from any state in the current set.
For example, when DFSA is in state 16, that means the NFSA could be either in state 1 or 6. Upon transition on a, the NFSA could get to states 3 (from 1), 7 or 8 (from 6) - that will be your next state in the DFSA.
The standard construction would preserve the _-edges, but we can omit them, as long as the input does not contain _.
Now if you have a word ab on the input, you simulate this automaton (i.e. traverse its transition graph) and end up in state 238, from which you can easily recover the original entries.
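If you'd rather not materialize the DFSA up front, the same powerset idea can also be simulated on the fly; a rough sketch (my own code, where a state is a pattern/position pair):

    patterns = ["ab", "a_", "ba", "_a", "__"]

    def match_all(word):
        # start every pattern at position 0
        states = {(p, 0) for p in range(len(patterns))}
        for ch in word:
            states = {(p, i + 1) for (p, i) in states
                      if i < len(patterns[p]) and patterns[p][i] in (ch, '_')}
        return sorted(patterns[p] for (p, i) in states if i == len(patterns[p]))

    print(match_all("ab"))   # -> ['__', 'a_', 'ab']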
Store the data in a tree: the 1st layer represents the 1st element (character or integer), and so on. This means the tree will have a constant depth of 5 (excluding the root) in your example. Don't care about wildcards (_) at this point; just store them like the other elements.
When searching for the matches, traverse the tree by doing a breadth first search and dynamically build up your result set. Whenever you encounter a wildcard, add another element to your result set for all other nodes of this layer that do not match. If no subnode matches, remove the entry from your result set.
You should also skip redundant entries when building up the tree: in your example, __a_b is redundant, because whenever it matches, __a__ also matches.
I've got an algorithm in mind which I plan to implement and benchmark, but I'll describe the approach already. It needs n_templates * template_length * n_symbols bits of storage (so for 100k templates of length 100 and 256 distinct symbols it needs 2.56 Gbit = 320 MB of RAM). This does not scale nicely to a large number of symbols unless a succinct data structure is used.
A query takes O(n_templates * template_length / 64) word operations, and should perform quite well thanks to the bit-wise operations.
Let's say we have the given set of templates:
__a_b
c_e__
___de
abcd_
_ied_
bi__e
The set of symbols is abcdei, for each symbol we pre-calculate a bit mask indicating whether the template differs from the symbol at that location or not:
aaaaa bbbbb ccccc ddddd eeeee iiiii
....b ..a.. ..a.b ..a.b ..a.b ..a.b
c.e.. c.e.. ..e.. c.e.. c.... c.e..
...de ...de ...de ....e ...d. ...de
.bcd. a.cd. ab.d. abc.. abcd. abcd.
.ied. .ied. .ied. .ie.. .i.d. ..ed.
bi..e .i..e bi..e bi..e bi... b...e
Same tables expressed in binary:
aaaaa bbbbb ccccc ddddd eeeee iiiii
00001 00100 00101 00101 00101 00101
10100 10100 00100 10100 10000 10100
00011 00011 00011 00001 00010 00011
01110 10110 11010 11100 11110 11110
01110 01110 01110 01100 01010 00110
11001 01001 11001 11001 11000 10001
These are stored in columnar order, 64 templates per unsigned integer. To determine which templates match ciede we check the 1st column of the c table, the 2nd column of the i table, the 3rd of the e table, and so forth:
ciede ciede
__a_b ..a.b 00101
c_e__ ..... 00000
___de ..... 00000
abcd_ abc.. 11100
_ied_ ..... 00000
bi__e b.... 10000
We find matching templates as rows of zeros, which indicate that no differences were found. We can check 64 templates at once, and the algorithm itself is very simple (python-like code):
for i_block in range(n_templates // 64):
    mask = 0
    for i in range(template_length):
        # Accumulate difference-indicating bits for this block of 64 templates
        mask |= tables[i_block][word[i]][i]
        if mask == 0xFFFFFFFFFFFFFFFF:
            # All 64 templates differ, we can stop early
            break
    for i in range(64):
        if mask & (1 << i) == 0:
            print('Match at template ' + str(i_block * 64 + i))
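For completeness, one way the difference tables could be precomputed (my own sketch; indices past the last template must be ignored when reporting matches, since unused bits of the final block stay zero):

    # tables[i_block][symbol][i] has bit t set iff template i_block*64 + t
    # differs from `symbol` at position i; wildcards '_' never differ.
    def build_tables(templates, symbols, template_length):
        n_blocks = (len(templates) + 63) // 64
        tables = [{s: [0] * template_length for s in symbols}
                  for _ in range(n_blocks)]
        for t, template in enumerate(templates):
            block, bit = divmod(t, 64)
            for i, ch in enumerate(template):
                if ch == '_':
                    continue            # wildcard matches every symbol
                for s in symbols:
                    if s != ch:
                        tables[block][s][i] |= 1 << bit
        return tables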
As I said I haven't yet actually tried implementing this, so I have no clue how fast it is in practice.

Steps to draw a DFA (or NFA) from a simple statement?

I am given a simple statement: construct a DFA over the alphabet {0, 1} that accepts all the strings that end in 101.
My question is: what are the steps to design it? Or to design an NFA, because I know the clear steps to convert an NFA to a DFA, so I will then convert the NFA to the DFA.
Note: it is just a minor course for me, so I have never studied anything like regular expressions, or the algorithms typically used to construct DFAs.
If you want more of an explanation on how I derived this, I'd be happy to explain, but for now I just drew the DFA and explained each state.
Sorry about the screenshot...I didn't know how to convert it straight to an image.
On input 0 at state 0, it loops back to itself. On 1, it prepares
itself to end because it could possibly be '101'.
q1 loops to itself on input 1 because it's still preparing to end on
'101'. Input '0' on q1 means it is preparing for input '10', so it goes to q2.
Input '0' on q2 breaks the whole cycle and goes back to q0. Input '1'
results in moving to q3, the accepting state.
Any input on q3 results in going back to whatever point in the cycle
the input corresponds with.
That is, on '1' it goes back to q1, or the state where the first '1'
was encountered in '101', preparing to end.
On '0', it goes to q2 because in order to get to q3, there must have
been an input of '1' from q2, so no matter what, the last two input
symbols are '10' now.
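Written out as a transition table (my own sketch of the DFA described above), the machine looks like this:

    # state q3 is accepting: the input seen so far ends in 101
    delta = {
        ('q0', '0'): 'q0', ('q0', '1'): 'q1',
        ('q1', '0'): 'q2', ('q1', '1'): 'q1',
        ('q2', '0'): 'q0', ('q2', '1'): 'q3',
        ('q3', '0'): 'q2', ('q3', '1'): 'q1',
    }

    def accepts(w):
        state = 'q0'
        for ch in w:
            state = delta[(state, ch)]
        return state == 'q3'

    assert accepts('101') and accepts('0101') and not accepts('1010')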
TikZ DFA examples.
Here, the string should end with 101, so we need to draw an NFA for it and later convert it into a DFA.
Here the total states are A, B, C, D.
In the image I have drawn the NFA, then its transition table, then the transition table for the NFA-to-DFA conversion, and finally the resulting DFA.
In an NFA, when a specific input is given to the current state, the machine can go to multiple states: it can have zero, one or more than one move on a given input symbol. In a DFA, on the other hand, when a specific input is given to the current state, the machine goes to exactly one state: a DFA has only one move on a given input symbol.
THE STEPS FOR CONVERTING NFA TO DFA:
Step 1: Initially Q' = ∅.
Step 2: Add q0 of the NFA to Q'. Then find the transitions from this start state.
Step 3: For each state in Q', find the set of reachable states for each input symbol. If this set of states is not in Q', add it to Q'.
Step 4: The final states of the DFA are all the states that contain F (the final states of the NFA).
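A compact sketch of these steps (my own code; the NFA is assumed to be given as a dict mapping (state, symbol) to a set of states):

    from itertools import chain

    def nfa_to_dfa(nfa, start, finals, alphabet):
        start_set = frozenset([start])   # Step 2: Q' starts with {q0}
        dfa, work = {}, [start_set]
        while work:                      # Step 3: explore new state sets
            S = work.pop()
            if S in dfa:
                continue
            dfa[S] = {}
            for a in alphabet:
                T = frozenset(chain.from_iterable(nfa.get((q, a), ()) for q in S))
                dfa[S][a] = T
                work.append(T)
        # Step 4: a DFA state is final iff it contains an NFA final state
        dfa_finals = {S for S in dfa if S & finals}
        return dfa, start_set, dfa_finals

    # NFA for "ends with 101", with the states A, B, C, D named above:
    nfa = {('A', '0'): {'A'}, ('A', '1'): {'A', 'B'},
           ('B', '0'): {'C'}, ('C', '1'): {'D'}}
    dfa, q0, finals = nfa_to_dfa(nfa, 'A', {'D'}, '01')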

Build a Regular Expression and Finite Automata

I need some help understanding how to take the following and make a regular expression that will be used to generate an epsilon NFA.
Alphabet is {0,1}
Language is: The set of all strings beginning with 101 and ending with 01010.
Valid strings would be:
101010
10101010
101101010
1011101010
I am more concerned with understanding how to make the regular expression.
The regular expression you need is pretty simple:
101010|101(0|1)*01010 (theoretical)
or
^101010|101[01]*01010$ (used in most programming languages)
which means either:
Match 1, 0, 1, 0, 1, 0
or
Match 1, 0, and 1.
Keep matching 0 or 1, zero or more times.
Match 0, 1, 0, 1, 0.
The following non-deterministic automaton should work:
To get an idea of what you are looking for, it is helpful to use the intersection operator (denoted & below). It does not belong to the core set of rational expressions, yet it preserves rationality --- in other words, you can use it, and always find a means to express the same language without it.
Using Vcsn, I get this in text mode:
In [1]: import vcsn
In [2]: vcsn.B.expression('(101[01]*)&([01]*01010)').derived_term().expression()
Out[2]: 101010+101(0+1)*01010
and this in graphical mode, showing the intermediate automaton computed by derived_term (which includes details about the "meaning" of each state, so strip is called afterwards to get something simpler to read):
I'd suggest a pattern that includes both the base case and the general case. You need to cover the base case of 101010, where the two patterns overlap (the string starts with "101", ends with "01010", and the last two digits of the first pattern are the first two digits of the second pattern). Then you can cover the general case of "101", any 0s or 1s, "01010", as given by Oscar.
So the full pattern would be:
^(101010|(101[01]*01010))$
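As a quick sanity check (my own, using Python's re module), the pattern accepts the example strings and rejects strings too short to both start with 101 and end with 01010:

    import re

    pattern = re.compile(r'^(101010|(101[01]*01010))$')
    valid = ['101010', '10101010', '101101010', '1011101010']
    assert all(pattern.match(w) for w in valid)
    assert not pattern.match('10101')   # too short for both markers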

Checking if a string contains an English sentence

As of right now, I decided to take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing everything from that newline to the next newline, then I do string.find() to see if that English word is somewhere in there. This takes a VERY long time, each word taking about a quarter to half a second to verify.
It is working perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (multithreading), but it still only checks about 10 a second. (I need thousands.)
I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but still not get the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contain complete garbage, just random letters.
I can't simply check for impossible combinations of letters, because then that string would be thrown out because of the 'tm' in 'thatmust'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word and build a search table for it. You need to do this only once. Your searches for individual words will then proceed at a faster pace, because the "false starts" will be eliminated.
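For illustration, a standard KMP implementation might look like this (my own sketch, not code from the answer):

    def kmp_table(pattern):
        # failure table: table[i] = length of the longest proper prefix of
        # pattern[:i+1] that is also a suffix of it
        table = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k and pattern[i] != pattern[k]:
                k = table[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            table[i] = k
        return table

    def kmp_find(text, pattern, table):
        k = 0
        for i, ch in enumerate(text):
            while k and ch != pattern[k]:
                k = table[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                return i - k + 1        # position of the first match
        return -1

    word = "string"
    print(kmp_find("hithisisastringthatmustbechecked", word, kmp_table(word)))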
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice, store the longer substring.)
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus, you can't ever find a word beginning with, say, 'r' anywhere before the first 'r' in the string. And some words won't even trigger a search if their first letter isn't in the string at all.
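A tiny sketch of Idea 1 (my own code). Because any occurrence of a word starts at or after the first occurrence of its first letter, keeping the longest suffix per letter is enough:

    text = "hithisisastringthatmustbechecked"
    suffix = {}
    for i, ch in enumerate(text):
        suffix.setdefault(ch, text[i:])   # first occurrence = longest suffix

    def contains_word(word):
        s = suffix.get(word[0])
        return s is not None and word in s

    print(contains_word("string"))   # True
    print(contains_word("zebra"))    # False: no 'z' in the text at all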
Idea 2
Expand upon that idea by noting the length of the longest word in the dictionary, and trimming the stored strings to that many letters, since a match can never extend further than that.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.
e - substr[4] = "echecked"
You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
Idea 3
This has nothing to do with ideas 1 and 2. It's an idea you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write out the regular expression, too, and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (It's a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
| | |
| o -> b -> * y -> *
| |
| d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like: /(arun)|(b(ill(y?)|o(b|dy)))|(jose)/. This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking the alternatives and, if one matches, move forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move on to the next column.
This is a lot of work but can, sometimes, be handy.
Side note: I built one of these some time back by writing a program that wrote the code that ran the algorithm directly, instead of having code looking at the binary-tree data structure.
Think of each set of vertical-bar options as a switch statement against a particular character column, and each arrow as a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970 is a
space-efficient probabilistic data structure that is used to test
whether an element is a member of a set. False positive matches are
possible, but false negatives are not; i.e. a query returns either
"inside set (may be wrong)" or "definitely not in set". Elements can
be added to the set, but not removed (though this can be addressed
with a "counting" filter). The more elements that are added to the
set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell-checkers rely on bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
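A toy Bloom filter sketch (my own code, not production quality) to illustrate the fast in/not-in pre-check:

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 20, n_hashes=5):
            self.size = size_bits
            self.n_hashes = n_hashes
            self.bits = 0            # a Python int used as a bitset

        def _positions(self, word):
            for i in range(self.n_hashes):
                h = hashlib.sha256(f"{i}:{word}".encode()).digest()
                yield int.from_bytes(h[:8], 'big') % self.size

        def add(self, word):
            for p in self._positions(word):
                self.bits |= 1 << p

        def might_contain(self, word):
            return all(self.bits >> p & 1 for p in self._positions(word))

    bloom = BloomFilter()
    for w in ("hi", "this", "is", "a", "string"):
        bloom.add(w)
    assert bloom.might_contain("this")    # no false negatives
    # bloom.might_contain("qqqq") is almost certainly False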
This code was modified from How to split text without spaces into list of words?:
from math import log
words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    costsum = 0
    i = len(s)
    while i > 0:
        c,k = best_match(i)
        assert c == cost[i]
        costsum += c
        i -= k
    return costsum
Using the same dictionary of that answer and testing your string outputs
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word it returns inf, since it would have to split everything into single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary- even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So, you can make it ignore spaces, by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
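A bare-bones sketch of the character-bigram version of this idea (my own code; the add-one smoothing and the threshold are my assumptions, not from a specific library):

    from collections import Counter
    from math import log

    def train(text):
        # strip everything but letters, mirroring the space-stripping above
        text = ''.join(ch for ch in text.lower() if ch.isalpha())
        pairs = Counter(zip(text, text[1:]))
        firsts = Counter(text[:-1])
        # log P(b | a) with add-one smoothing over a 26-letter alphabet
        return {(a, b): log((pairs[(a, b)] + 1) / (firsts[a] + 26))
                for a in set(text) for b in set(text)}

    def avg_logprob(s, model, floor=log(1.0 / 26)):
        s = s.lower()
        scores = [model.get(p, floor) for p in zip(s, s[1:])]
        return sum(scores) / max(len(scores), 1)

    # Strings scoring above a threshold tuned on held-out data are classified
    # as probably-English; the rest as garbage.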
I see that you are concerned about time: Training may take a few minutes (assuming you have already compiled gold standard "sentences" and "random scrambled strings" texts). You only need to train once, you can easily save the "trained" model to disk and reuse it for subsequent runs by loading from disk, which may take a few seconds. Making a call on a string would take a trivially small number of floating point multiplications to get a probability, so after you finish training it, it should be very fast.