Walkthrough of regex match

Could someone help me understand how a regex engine matches the following:
a(bc)*
Against the text: abc.
For example, how many steps does it take? What happens at each step? For example, something like:
The first step is to match the letter "a" from the regex against the "a" in the text "abc". Because this is not optional/repeated there is no backtrack stored at this position.

Ideally, the regular expression (if it is a true regular expression) is first converted to a graph representation of an NFA (non-deterministic finite automaton), perhaps something like this:
a(bc)*:
(0) -- a --> (1) -- ε --> ((3))
(1) -- b --> (2)
(2) -- c --> (1)
0 is the start state; ((3)) is the acceptance state. ε is an empty transition that consumes no input.
An NFA can be executed directly by the NFA simulation algorithm.
It can also be compiled to a DFA (deterministic finite automaton) using the "subset construction". The states of the DFA correspond to sets of the original NFA states. We end up with something like this:
DFA state     NFA states   Input   Next state
---------------------------------------------
0             { 0 }        a       1
1 (accept)    { 1, 3 }     b       2
2             { 2 }        c       1
State 1 of the DFA corresponds to two states of the NFA: when the DFA is in state 1, the corresponding NFA simulator has to be in states 1 and 3 simultaneously, because 3 is reachable from 1 via an epsilon transition (no input symbol consumed). DFA state 1 is an acceptance state because the NFA set it corresponds to, { 1, 3 }, contains an acceptance state. For the text abc the DFA goes 0, 1, 2, 1 and ends in the acceptance state.
The DFA requires very few steps; basically we just read characters and dispatch to the next state in the table based on the current state and the input character. If we are not able to dispatch, then there is a mismatch; we can stop reading more input. If we process the entire input, and are left in an acceptance state, then there is a match.
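The NFA simulation and the subset construction described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the NFA for a(bc)* is encoded as 0 -a-> 1, 1 -ε-> 3, 1 -b-> 2, 2 -c-> 1 with 3 accepting, and the names `NFA`, `eps_closure`, `subset_construction` and `dfa_match` are made up for this example.

```python
from collections import deque

EPS = None  # marker for an epsilon transition (assumption of this sketch)

# (state, symbol) -> set of next states, for a(bc)*
NFA = {
    (0, "a"): {1},
    (1, EPS): {3},
    (1, "b"): {2},
    (2, "c"): {1},
}
START, ACCEPT = 0, {3}


def eps_closure(states):
    """All NFA states reachable from `states` using only epsilon moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def subset_construction(alphabet="abc"):
    """Build the DFA table; each DFA state is a frozenset of NFA states."""
    start = eps_closure({START})
    table, queue = {}, deque([start])
    while queue:
        current = queue.popleft()
        if current in table:
            continue
        table[current] = {}
        for ch in alphabet:
            moved = set()
            for s in current:
                moved |= NFA.get((s, ch), set())
            if moved:
                nxt = eps_closure(moved)
                table[current][ch] = nxt
                queue.append(nxt)
    return start, table


def dfa_match(text):
    """Run the DFA: one table lookup per input character."""
    state, table = subset_construction()
    for ch in text:
        state = table[state].get(ch)
        if state is None:  # no transition: mismatch, stop reading
            return False
    return bool(state & ACCEPT)
```

For example, `dfa_match("abc")` takes exactly three dispatch steps, one per character, and never backtracks.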

Related

Regex to validate a password in order to create a DFA

I need to create a regex to validate a password and then create a DFA with it.
The sets are:
a = {a,...,z}
A = {A,...,Z}
d = {0,...,9}
The criteria are:
Must begin with a letter (doesn't matter if upper or lower case).
Must contain at least 1 upper case.
Must contain at least 1 lower case.
Must contain at least 1 number.
So far, I've come up with the following regex:
(aa*(AA*a*dd*|dd*a*AA*)|AA*(aa*A*dd*|dd*A*aa*))(a|A|d)*
Is it correct?

The inner parts of your expression are slightly wrong. For example, with AA*a*dd* you are trying to handle the case where the string begins with 'a' and an 'A' occurs before the first 'd', but this does not match "aAaAd". Here is a correct version:
(aa*(d(a|d)*A|A(a|A)*d)|AA*(d(A|d)*a|a(A|a)*d))(a|A|d)*
The DFA is left to you as an exercise. It should have 7 states, including start and end. Think about establishing one state for each combination of character kinds that you still have left to match.
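One way to gain confidence in a regex like this is to compare it against a direct check of the criteria on every short string. The sketch below assumes the abstract symbols a, A and d stand for the classes [a-z], [A-Z] and [0-9]; the helper names `matches`, `meets_criteria` and `agree_up_to` are invented for this example.

```python
import re
from itertools import product

ABSTRACT = "(aa*(d(a|d)*A|A(a|A)*d)|AA*(d(A|d)*a|a(A|a)*d))(a|A|d)*"
CLASSES = {"a": "[a-z]", "A": "[A-Z]", "d": "[0-9]"}

# Substitute each abstract symbol by its character class.
PATTERN = re.compile("".join(CLASSES.get(ch, ch) for ch in ABSTRACT))


def matches(password):
    """True if the whole password matches the regex."""
    return PATTERN.fullmatch(password) is not None


def meets_criteria(password):
    """Direct check of the four stated rules (ASCII only)."""
    first = password[:1]
    starts_with_letter = ("a" <= first <= "z") or ("A" <= first <= "Z")
    has_lower = any("a" <= c <= "z" for c in password)
    has_upper = any("A" <= c <= "Z" for c in password)
    has_digit = any("0" <= c <= "9" for c in password)
    return starts_with_letter and has_lower and has_upper and has_digit


def agree_up_to(max_len, alphabet="aA0"):
    """Compare regex and direct check on all strings up to max_len."""
    return all(
        matches(w) == meets_criteria(w)
        for n in range(max_len + 1)
        for w in map("".join, product(alphabet, repeat=n))
    )
```

One representative per class ("a", "A", "0") is enough for the exhaustive comparison, because the regex only distinguishes the three classes.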

RE: Number of a's is divisible by 6 and Number of b's is divisible by 8

Find a regular expression which represents strings made of {a, b}, where number of a's is divisible by 6 and number of b's is divisible by 8.
I tried to create a DFA which accepts such strings. My idea was to use all the remainders mod 6 and mod 8, leading to a total of 48 pairs of remainders. Thus each state in the DFA is a pair (r, s), where r varies from 0 to 5 and s varies from 0 to 7. The start state (which is also the only accepting state) is (0, 0), and the transitions are easy to give: on input "a" the state (r, s) transitions to ((r + 1) mod 6, s), and on input "b" it transitions to (r, (s + 1) mod 8).
However it is too difficult to work with a DFA of 48 states and I am not sure if this can be minimized by using the DFA minimization algorithm by hand.
I am really not sure how then we can get to a regular expression representing such strings.

If you are allowed to use lookaheads:
^(?=b*((ab*){6})+$)a*((ba*){8})+$
Example of matched string: bbaabbaabbaabb
The idea is simple: we know how to match a string whose number of a's is divisible by 6, ^((b*ab*){6})+$, and we know how to match a string whose number of b's is divisible by 8, ^((a*ba*){8})+$. So we put one regex into a lookahead and use the other one as the matching part.
If you also need to match strings consisting only of a's or only of b's, then the following regex will do:
^(?=b*((ab*){6})*$)a*((ba*){8})*$
Examples of matched strings: aaaaaa, bbbbbbbb, bbaabbaabbaabb
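The 48-state DFA from the question does not need to be written out by hand: it can be simulated with two counters. The sketch below compares such a simulation with the lookahead regex that also allows zero a's or zero b's; the function names are my own, and the comparison is only a spot check, not a proof.

```python
import re
from itertools import product

LOOKAHEAD = re.compile(r"^(?=b*((ab*){6})*$)a*((ba*){8})*$")


def dfa_accepts(text):
    """Simulate the (r, s) product DFA: r counts a's mod 6, s counts b's mod 8."""
    r, s = 0, 0
    for ch in text:
        if ch == "a":
            r = (r + 1) % 6
        elif ch == "b":
            s = (s + 1) % 8
        else:
            return False  # symbol outside the alphabet {a, b}
    return (r, s) == (0, 0)


def regex_accepts(text):
    return LOOKAHEAD.match(text) is not None


def agree_up_to(max_len):
    """Check both recognizers on every string over {a, b} up to max_len."""
    return all(
        dfa_accepts(w) == regex_accepts(w)
        for n in range(max_len + 1)
        for w in map("".join, product("ab", repeat=n))
    )
```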

NFA to an RE Kleene's Theorem

Here is my NFA:
Here is my attempt.
Create new start and final nodes
Next eliminate the 2nd node from the left which gives me ab
Next eliminate the 2nd node from the right which gives me ab*a
Next eliminate the 2nd node from the left which gives me abb*b
Next eliminate the 2nd node from the right which gives me b+ab*a
Which leads to abbb (b+aba)*
Is this the correct answer?

No, that is not correct :(
You do not need to create a new start state: the first state, marked with a - sign, is the start state. Also, the label a,b means a or b, not ab.
There is a theorem called Arden's theorem that is quite helpful for converting an NFA into an RE.
What is Regular Expression for this NFA?
In your NFA, the initial part is:
step-1:
(-) --a,b-->(1)
means (a+b)
step-2: next, from state 1 to state 2; note that state 2 is the accepting (final) state, marked with a + sign.
(1) --b--->(2+)
So you need (a+b)b to reach to final state.
step-3: Once you are at final state 2, any number of b's is accepted (any number means zero or more). This is because of the self-loop on state 2 with label b.
So, b* accepted on state-2.
step-4:
Actually there are two loops on state 2.
One is the self-loop with label b that I described in step-3. Its expression is b*.
The second loop on state 2 goes via state 3.
The expression for the second loop on state 2 is aa*b.
Why aa*b?
Because state 3 has a self-loop labeled a:
(2+) --a--> (3) --b--> (2+), with (3) --a--> (3)   ====> aa*b
So, from step-3 and step-4, a run that is in state 2 can loop back either via the edge labeled b or via aa*b ===> (b + aa*b)*
So regular expression for your NFA is:
(a+b) b (b + aa*b)*
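A quick way to sanity-check a derived expression is to simulate the automaton and compare it with the regex on all short strings. The transition table below is reconstructed from the steps in this answer (the original picture is not available), so treat it as an assumption: state 0 is the start state and state 2 is final. In Python's syntax the expression becomes (a|b)b(b|aa*b)*.

```python
import re
from itertools import product

# (state, symbol) -> set of next states, reconstructed from the steps above
DELTA = {
    (0, "a"): {1}, (0, "b"): {1},
    (1, "b"): {2},
    (2, "b"): {2}, (2, "a"): {3},
    (3, "a"): {3}, (3, "b"): {2},
}
START, FINAL = 0, {2}
EXPRESSION = re.compile(r"(a|b)b(b|aa*b)*")


def nfa_accepts(text):
    """Track the set of states the NFA can be in after each symbol."""
    current = {START}
    for ch in text:
        current = set().union(*(DELTA.get((s, ch), set()) for s in current))
        if not current:
            return False
    return bool(current & FINAL)


def agree_up_to(max_len):
    """Compare the automaton and the regex on all strings up to max_len."""
    return all(
        nfa_accepts(w) == (EXPRESSION.fullmatch(w) is not None)
        for n in range(max_len + 1)
        for w in map("".join, product("ab", repeat=n))
    )
```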

simulate a deterministic pushdown automaton (PDA) in c++

I was reading a UVA exercise in which I need to simulate a deterministic pushdown automaton, to see whether certain strings are accepted by a PDA that is given as input in the following format:
The first line of input is an integer C, the number of test cases. The first line of each test case contains five integers E, T, F, S and C, where E is the number of states in the automaton, T the number of transitions, F the number of final states, S the initial state and C the number of test strings. The next line contains F integers, the final states of the automaton. Then come T lines, each with 2 integers I and J and 3 strings L, T and A, where I and J (0 ≤ I, J < E) are the origin and destination states of a transition. L is the character read from the tape for the transition, T is the symbol found at the top of the stack, and A is the action to perform on the top of the stack at the end of the transition. The character that represents the bottom of the stack is always Z. The character £ (<alt+156>) is used to represent the end of the string, the action of popping, or not taking the top of the stack into account for the transition. The stack alphabet consists of capital letters. For the string A, the symbols are pushed from right to left (in the same way as in the program JFLAP, i.e. the new top of the stack is the leftmost character). Then come C lines, each with an input string. The input strings may contain lowercase letters and digits (not necessarily present in any transition).
The output in the first line of each test case must display the following string "Case G:", where G represents the number of test case (starting at 1). Then C lines on which to print the word "OK" if the automaton accepts the string or "Reject" otherwise.
For example:
Input:
2
3 5 1 0 5
2
0 0 1 Z XZ
0 0 1 X XX
0 1 0 X X
1 1 1 X £
1 2 £ Z Z
111101111
110111
011111
1010101
11011
4 6 1 0 5
3
1 2 b A £
0 0 a Z AZ
0 1 a A AAA
1 0 a A AA
2 3 £ Z Z
2 2 b A £
aabbb
aaaabbbbbb
c1bbb
abbb
aaaaaabbbbbbbbb
this is the output:
Output:
Case 1:
Accepted
Rejected
Rejected
Rejected
Accepted
Case 2:
Accepted
Accepted
Rejected
Rejected
Accepted
I need some help or ideas on how to simulate this PDA. I am not asking for code that solves the problem, because I want to write my own code (the idea is to learn, right?), but I need some help (an idea or pseudocode) to begin the implementation.

You first need a data structure to keep the transitions. You can use a vector of transition structs, each holding one transition quintuple. Alternatively, you can use the fact that states are integers and create a vector that keeps the transitions from state 0 at index 0, the transitions from state 1 at index 1, and so on. This way you reduce the time spent searching for the correct transition.
You can simply use the stack from the STL for the stack. You also need a search function, which may change depending on your implementation. If you use the first method, you can use a function like:
int findIndex(vector<quintuple> v) // finds the index of the correct transition, otherwise returns -1
and then use the return value to get the new state and the new stack symbol.
Or you can use a for loop over the vector and a bool flag that records whether a transition was found.
With the second method you can use a function that takes references to the new state and the new stack symbol and sets them if it finds an appropriate transition.
For the inputs you can use a vector; the element type is a matter of personal taste. You can implement your main method with for loops, but if you want an extra challenge you can implement a recursive function. Good luck.
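To make the second method concrete, here is a rough sketch in Python rather than C++, so it stays short and leaves the real implementation to you. It rests on my own reading of the statement: £ as the read symbol means end of input, £ as the stack top means the top is ignored, £ as the action means pop without pushing, and a string is accepted if the automaton is in a final state after the input (taking at most one end-of-input transition). All names are hypothetical.

```python
EMPTY = "£"  # "nothing" marker from the problem statement


def build_table(transitions):
    """Group (src, dst, read, top, push) quintuples by source state."""
    table = {}
    for src, dst, read, top, push in transitions:
        table.setdefault(src, []).append((read, top, dst, push))
    return table


def find_transition(table, state, symbol, stack):
    """Return (top, dst, push) of the first matching transition, else None."""
    current_top = stack[-1] if stack else None
    for read, top, dst, push in table.get(state, []):
        if read == symbol and top in (EMPTY, current_top):
            return top, dst, push
    return None


def apply_action(stack, top, push):
    """Pop the matched top (unless ignored), then push the replacement."""
    if top != EMPTY:
        stack.pop()
    if push != EMPTY:
        stack.extend(reversed(push))  # leftmost symbol ends up on top


def accepts(table, start, finals, text):
    state, stack = start, ["Z"]
    for symbol in text:
        found = find_transition(table, state, symbol, stack)
        if found is None:
            return False
        top, state, push = found
        apply_action(stack, top, push)
    if state in finals:
        return True
    found = find_transition(table, state, EMPTY, stack)  # end of input
    if found is None:
        return False
    top, state, push = found
    apply_action(stack, top, push)
    return state in finals


# First automaton from the sample input.
CASE_1 = build_table([
    (0, 0, "1", "Z", "XZ"), (0, 0, "1", "X", "XX"), (0, 1, "0", "X", "X"),
    (1, 1, "1", "X", EMPTY), (1, 2, EMPTY, "Z", "Z"),
])
for text in ["111101111", "110111", "011111", "1010101", "11011"]:
    print("Accepted" if accepts(CASE_1, 0, {2}, text) else "Rejected")
```

Grouping the transitions by source state is the "index by state" idea from above; a dictionary plays the role of the outer vector.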

Efficient algorithm for converting a character set into a nfa/dfa

I'm currently working on a scanner generator.
The generator already works fine. But when using character classes the algorithm gets very slow.
The scanner generator produces a scanner for UTF8 encoded files. The full range of characters (0x000000 to 0x10ffff) should be supported.
If I use large character sets, like the any operator '.' or the Unicode property {L}, the NFA (and also the DFA) contains a lot of states (> 10000). So the conversion from NFA to DFA and the creation of the minimal DFA take a long time (even if the resulting minimal DFA contains only a few states).
Here's my current implementation of creating a character set part of the nfa.
void CreateNfaPart(int startStateIndex, int endStateIndex, Set<int> characters)
{
    transitions[startStateIndex] = CreateEmptyTransitionsArray();
    foreach (int character in characters) {
        // get the UTF-8 encoded bytes for the character
        byte[] encoded = EncodingHelper.EncodeCharacter(character);
        int tStartStateIndex = startStateIndex;
        // walk (or create) one intermediate state per leading byte
        for (int i = 0; i < encoded.Length - 1; i++) {
            int tEndStateIndex = transitions[tStartStateIndex][encoded[i]];
            if (tEndStateIndex == -1) {
                tEndStateIndex = CreateState();
                transitions[tEndStateIndex] = CreateEmptyTransitionsArray();
            }
            transitions[tStartStateIndex][encoded[i]] = tEndStateIndex;
            tStartStateIndex = tEndStateIndex;
        }
        // the last byte leads to the shared end state
        transitions[tStartStateIndex][encoded[encoded.Length - 1]] = endStateIndex;
    }
}
Does anyone know how to implement the function much more efficiently to create only the necessary states?
EDIT:
To be more specific I need a function like:
List<Set<byte>[]> Convert(Set<int> characters)
{
???????
}
A helper function to convert a character (int) to a UTF8 encoding byte[] is defined as:
byte[] EncodeCharacter(int character)
{ ... }

There are a number of ways to handle it. They all boil down to treating whole sets of characters at a time in the data structures, instead of ever enumerating the entire alphabet. That is also how you make scanners for Unicode in a reasonable amount of memory.
You have many choices about how to represent and process sets of characters. I'm presently working with a solution that keeps an ordered list of boundary conditions and corresponding target states. You can process operations on these lists much faster than you could if you had to scan the entire alphabet at each juncture. In fact, it's fast enough that it runs in Python with acceptable speed.
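As an illustration of the "ordered list of boundaries" idea, here is a minimal sketch of how the transitions of a single state might be stored. It is not the author's implementation; the class name `RangeTransitions` and its layout are assumptions made for this example, and the input ranges are assumed not to overlap.

```python
import bisect


class RangeTransitions:
    """Transitions of one state: sorted range starts plus a target per range."""

    def __init__(self, ranges):
        # ranges: (first_codepoint, last_codepoint, target_state) triples,
        # assumed to be non-overlapping.
        self.starts = []   # sorted codepoints where the target changes
        self.targets = []  # target state from starts[i] up to starts[i+1]-1
        next_free = 0
        for first, last, target in sorted(ranges):
            if first > next_free:
                # gap without a transition
                self.starts.append(next_free)
                self.targets.append(None)
            self.starts.append(first)
            self.targets.append(target)
            next_free = last + 1
        # everything after the last range has no transition
        self.starts.append(next_free)
        self.targets.append(None)

    def target(self, codepoint):
        """Binary search for the range containing codepoint."""
        i = bisect.bisect_right(self.starts, codepoint) - 1
        return self.targets[i] if i >= 0 else None
```

A class such as [a-z0-9] then costs a handful of boundaries per state instead of 36 separate entries, and a lookup is a binary search over those boundaries.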

I'll clarify what I think you're asking for: to union a set of Unicode codepoints such that you produce a state-minimal DFA where transitions represent UTF8-encoded sequences for those codepoints.
When you say "more efficiently", that could apply to runtime, memory usage, or to compactness of the end result. The usual meaning for "minimal" in finite automata refers to using the fewest states to describe any given language, which is what you're getting at by "create only the necessary states".
Every finite automaton has exactly one equivalent state-minimal DFA (see the Myhill-Nerode theorem [1], or Hopcroft & Ullman [2]). For your purposes, we can construct this minimal DFA directly using the Aho-Corasick algorithm [3].
To do this, we need a mapping from Unicode codepoints to their corresponding UTF8 encodings. There's no need to store all of these UTF8 byte sequences in advance; they can be encoded on the fly. The UTF8 encoding algorithm is well documented and I won't repeat it here.
Aho-Corasick works by first constructing a trie. In your case this would be a trie of each UTF8 sequence added in turn. Then that trie is annotated with transitions turning it into a DAG per the rest of the algorithm. There's a nice overview of the algorithm here, but I suggest reading the paper itself.
Pseudocode for this approach:
trie = empty
foreach codepoint in input_set:
    bytes[] = utf8_encode(codepoint)
    trie_add_key(bytes)
dfa = add_failure_edges(trie) # per the rest of AC
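The trie-building half of that pseudocode might look like the following in Python. This is a sketch only: it uses nested dictionaries, relies on Python's built-in UTF-8 encoder (so surrogate codepoints are assumed to be absent), and leaves out the Aho-Corasick failure-edge step entirely. The function names are invented for the example.

```python
def build_utf8_trie(codepoints):
    """Insert the UTF-8 byte sequence of every codepoint into a trie."""
    root = {}
    for cp in codepoints:
        node = root
        for byte in chr(cp).encode("utf-8"):
            node = node.setdefault(byte, {})  # shared prefixes share nodes
        node[None] = True  # the key None marks the end of a sequence
    return root


def trie_accepts(root, data):
    """True if `data` is exactly one of the inserted byte sequences."""
    node = root
    for byte in data:
        node = node.get(byte)
        if node is None:
            return False
    return None in node


def count_nodes(node):
    """Number of trie nodes, i.e. states before any minimization."""
    return 1 + sum(
        count_nodes(child) for key, child in node.items() if key is not None
    )
```

Note that every accepted sequence still ends in its own leaf here; merging those equivalent end states is what the later DFA construction and minimization take care of.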
This approach (forming a trie of UTF8-encoded sequences, then Aho-Corasick, then rendering out DFA) is the approach taken in the implementation for my regexp and finite state machine libraries, where I do exactly this for constructing Unicode character classes. Here you can see code for:
UTF8-encoding a Unicode codepoint: examples/utf8dfa/main.c
Construction of the trie: libre/ac.c
Rendering out of minimal DFA for each character class: libre/class/
Other approaches (as mentioned in other answers to this question) include working on codepoints and expressing ranges of codepoints, rather than spelling out every byte sequence.
[1] Myhill-Nerode: Nerode, Anil (1958), Linear Automaton Transformations, Proceedings of the AMS, 9, JSTOR 2033204
[2] Hopcroft & Ullman (1979), Section 3.4, Theorem 3.10, p.67
[3] Aho, Alfred V.; Corasick, Margaret J. (June 1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM. 18 (6): 333–340.

Look at what regular expression libraries like Google RE2 and TRE are doing.

I had the same problem with my scanner generator, so I came up with the idea of replacing intervals by ids, which are determined using an interval tree. For instance, the range a..z in a DFA can be represented as 97, 98, 99, ..., 122; instead I represent ranges as [97, 122] and build an interval tree out of them, so that in the end each range is represented by an id that refers into the interval tree. Given the RE a..z+, we end up with this DFA:
0 -> a -> 1
0 -> b -> 1
0 -> c -> 1
0 -> ... -> 1
0 -> z -> 1
1 -> a -> 1
1 -> b -> 1
1 -> c -> 1
1 -> ... -> 1
1 -> z -> 1
1 -> E -> ACCEPT
Now compress intervals:
0 -> a..z -> 1
1 -> a..z -> 1
1 -> E -> ACCEPT
Extract all intervals from your DFA and build an interval tree out of them:
{
"left": null,
"middle": {
id: 0,
interval: [a, z],
},
"right": null
}
Replace the actual intervals with their ids:
0 -> 0 -> 1
1 -> 0 -> 1
1 -> E -> ACCEPT
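The same effect can be approximated without a full interval tree. The sketch below swaps in a simpler technique: it splits the intervals that occur in the automaton into disjoint pieces, numbers them, and finds the id of a character by binary search over the sorted pieces. It is a stand-in for the interval tree described above, not a reproduction of it, and the function names are hypothetical.

```python
import bisect


def partition(intervals):
    """Split possibly overlapping closed intervals into disjoint pieces."""
    cuts = sorted({lo for lo, _ in intervals} | {hi + 1 for _, hi in intervals})
    pieces = []
    for lo, nxt in zip(cuts, cuts[1:]):
        # keep the piece only if some original interval covers it
        if any(a <= lo and nxt - 1 <= b for a, b in intervals):
            pieces.append((lo, nxt - 1))
    return pieces  # the index in this list is the interval id


def interval_id(pieces, codepoint):
    """Return the id of the piece containing codepoint, or None."""
    i = bisect.bisect_right(pieces, (codepoint, float("inf"))) - 1
    if i >= 0 and pieces[i][0] <= codepoint <= pieces[i][1]:
        return i
    return None
```

With the single interval [97, 122] from the example, `partition` yields one piece with id 0, matching the compressed DFA above.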

In this library (http://mtimmerm.github.io/dfalex/) I do it by putting a range of consecutive characters on each transition, instead of single characters. This is carried through all the steps of NFA construction, NFA->DFA conversion, DFA minimization, and optimization.
It's quite compact, but it adds code complexity to every step.