NFA to DFA implementation in C++

The title says it all. I need some ideas.
The nfa input looks like this and has no epsilon moves.
1 2
0 a 2
1 b 3
etc., meaning that 1 and 2 are final states and that with 'a' I can get from state 0 to state 2.
I use
struct nod
{
int ns;
char c;
};
vector<nod> v[100];
to save the data, where v[i] is the vector containing all the ways that can go from state i.
I need an idea on how to name the new states when there is a multiset of states like
0 a 1
0 a 2
0 a 3
because I can't create state 123 or something like that.
And how can I check if a multiset has already been transformed into a state?

In the worst case, an NFA with n states will require a DFA with 2^n states: one state for every subset of states in the NFA. Another way to think of this is that all states of the DFA can be labeled with a binary string of length n where the kth bit is set to 1 iff the kth state of the NFA is included in the subset the DFA state corresponds to.
OK, that is possibly a dense explanation to parse. An example will help. Suppose the NFA has states 0, 1 and 2. There are 2^3 subsets of states so our DFA will have up to 8 states. The bit strings are as follows:
000: the empty set, containing no states
001: the set containing only state 0
010: the set containing only state 1
011: the set containing states 0 and 1
100: the set containing only state 2
101: the set containing states 0 and 2
110: the set containing states 1 and 2
111: the set containing all states
We can interpret these bit strings as integers themselves, and this gives us a natural way to label states of the DFA:
0: the empty set, containing no states
1: the set containing only state 0
2: the set containing only state 1
3: the set containing states 0 and 1
4: the set containing only state 2
5: the set containing states 0 and 2
6: the set containing states 1 and 2
7: the set containing all states
The only issue with this ordering is that if you don't end up needing all the states (and frequently you won't), you may have (sometimes large) gaps between states that are used. Your DFA may only really need states 1, 2 and 4 (if your NFA happens to already be a minimal DFA, for instance). Maybe you need states 1 and 7 only. Other possibilities exist.
You can always separate the name and the index for the states you are creating, if you are creating them on demand. If you have created k states and you create a new state as you find you need it, you can assign it index k+1 and name it according to the bit string. That way, your indices are packed densely together and you have traceability back to the subset of NFA states via the name.
I need an idea on how to name the new states when there is a multiset of states like
Name them according to the bit string for the corresponding subset.
And how can I check if a multiset has already been transformed into a state?
Check existing states to see if the bit string for the subset you are encountering has already been used.
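As a rough sketch of the idea above (not the only way to do it), the subset can be stored as a 64-bit mask and a std::map can translate each mask into a densely packed DFA index. The names Mask, Dfa and determinize are made up for this example, and the code assumes at most 64 NFA states, no epsilon moves, and a single start state passed in by the caller.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <queue>
#include <utility>
#include <vector>

// One NFA transition, as in the question: on character c, go to state ns.
struct nod {
    int ns;
    char c;
};

typedef std::uint64_t Mask;  // bit k set <=> NFA state k is in the subset

// Hypothetical result type for this sketch.
struct Dfa {
    std::vector<Mask> name;                 // name[i]: subset behind DFA state i
    std::vector<std::map<char, int> > next; // next[i][c]: target DFA state
    std::vector<bool> accepting;            // accepting[i]: subset has a final NFA state
};

// Assumes at most 64 NFA states and no epsilon moves.
Dfa determinize(const std::vector<std::vector<nod> >& nfa, int start, Mask nfa_final) {
    Dfa dfa;
    std::map<Mask, int> index_of;  // bit string -> dense DFA index
    std::queue<int> todo;          // DFA states whose transitions are not built yet

    // Look a subset up by its bit string; create the DFA state on first sight.
    auto intern = [&](Mask m) -> int {
        auto res = index_of.insert(std::make_pair(m, static_cast<int>(dfa.name.size())));
        if (res.second) {
            dfa.name.push_back(m);
            dfa.next.push_back(std::map<char, int>());
            dfa.accepting.push_back((m & nfa_final) != 0);
            todo.push(res.first->second);
        }
        return res.first->second;
    };

    intern(Mask(1) << start);
    while (!todo.empty()) {
        const int d = todo.front();
        todo.pop();
        const Mask subset = dfa.name[d];
        std::map<char, Mask> moves;  // union of NFA targets per input character
        for (std::size_t s = 0; s < nfa.size(); ++s) {
            if (((subset >> s) & 1) == 0) continue;
            for (std::size_t k = 0; k < nfa[s].size(); ++k)
                moves[nfa[s][k].c] |= Mask(1) << nfa[s][k].ns;
        }
        for (auto it = moves.begin(); it != moves.end(); ++it) {
            const int target = intern(it->second);
            dfa.next[d][it->first] = target;
        }
    }
    return dfa;
}
```

For the three transitions 0 a 1, 0 a 2, 0 a 3 from the question, this creates only two DFA states: index 0 named 0001 and index 1 named 1110. The map lookup inside intern is the "has this subset been seen already?" check.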

Related

Regex: Binary string contains at least 3 of a certain integer

I am working with regular expressions for a class, and we need a pattern that requires at least three 0s anywhere in the string. They don't need to be next to each other, so 101010 would pass but 101011 would fail because it has only two 0s.
^[0-1]*(?:0){3,}[0-1]*$
This is what I currently have but that requires them to be adjacent.
How about:
([01]*0[01]*){3}
The 0 in the center has no quantifier, so every repetition of the group must contain a 0, and the {3} therefore guarantees at least three of them in the string. The [01]* on either side matches any run of 0s and 1s (including none), giving it some wiggle room so that the three zeros need not occur consecutively.
(Demo) (regexr.com)
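If you want to try the pattern from C++, a minimal sketch with std::regex could look like the following. The function name has_three_zeros is made up for this example; std::regex_match is used because it only succeeds when the whole string matches, which plays the role of the ^ and $ anchors.

```cpp
#include <regex>
#include <string>

// True if s consists only of 0s and 1s and contains at least three 0s,
// which do not have to be adjacent.
bool has_three_zeros(const std::string& s) {
    static const std::regex pattern("([01]*0[01]*){3}");
    return std::regex_match(s, pattern);  // whole-string match
}
```

With this, has_three_zeros("101010") holds while has_three_zeros("101011") does not.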

How to parse an input string for matches given a DFA

I'm implementing a regular expression parser from scratch by generating an NFA from a regular expression and then a DFA from the NFA. The problem is that a DFA can only say whether a computation is accepting. If the regex is "n*" and the string to match is "cannot", the DFA goes to a failed state after it sees the c, so I drop the first character and retry with "annot", then "nnot". At this point it sees the n and goes to a final state, and it would return just the single n, so I told it to keep going until the next character would take it off a final state. However, when it finishes it removes the first character again, so the input becomes "not" and it matches the "n". I don't want the subsequent matches; I only want "nn". I don't know how this would be possible.
Here's a simple but possibly not optimal algorithm. We try an anchored match at each successive point in the string by running the DFA starting at that point. When the DFA is being run, we record the last point in the string where the DFA was in an accepting state. When we eventually either reach the end of the string or a point at which the DFA can no longer advance, we can return a match if we passed through an accepting state; in other words, if we saved an accepting position, which will be the end of the match. Otherwise, we backtrack to the next starting position and continue.
(Note: in both pseudocode algorithms below, it's assumed that a variable which holds a string index can have an Undefined value. In a practical implementation, this value could be, for example, -1.)
In pseudocode:
Set <start> to 0
Repeat A:
    Initialise the DFA state to the starting state.
    Set <scan> to <start>
    Set <accepted> to Undefined
    Repeat B:
        If there is a character at <scan> and
        the DFA has a transition on that character:
            Advance the DFA to the indicated next state
            Increment <scan>
            If the DFA is now in an accepting state, set <accepted> to <scan>
            Continue Loop B
        Otherwise, the DFA cannot advance:
            If <accepted> is still Undefined:
                If <start> is at the end of the string, return a failure indication
                Increment <start> and continue Loop A
            Otherwise, <accepted> has a value:
                Return a match from <start> to <accepted> (semi-inclusive)
The problem with the above algorithm is that Loop B could execute an arbitrary number of times before failing and backtracking to the next starting position. So in the worst case, the search time would be quadratic in the string length. That would happen, for example, with pattern a*b and a string consisting of a large number of a's.
An alternative is to run several DFAs in parallel. Each DFA corresponds to a different starting point in the string. We scan the string linearly; at each position, we can spawn a new DFA corresponding to that position, whose state is the initial state.
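The simple restart-at-each-position search can be sketched in C++ roughly as follows. The Dfa struct, kUndefined and find_match are names invented for this example, and the transition table layout (one std::map per state) is just an assumption to keep the code short. As in the description, only matches that consume at least one character are reported.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical DFA representation: next[q][c] is the target state, if any.
struct Dfa {
    int start;
    std::vector<bool> accepting;
    std::vector<std::map<char, int> > next;
};

const std::size_t kUndefined = std::string::npos;  // stands in for "Undefined"

// Returns {begin, end} of the first match (end exclusive),
// or {kUndefined, kUndefined} if nothing matches.
std::pair<std::size_t, std::size_t> find_match(const Dfa& dfa, const std::string& s) {
    for (std::size_t start = 0; start < s.size(); ++start) {      // Loop A
        int q = dfa.start;
        std::size_t accepted = kUndefined;
        for (std::size_t scan = start; scan < s.size();) {        // Loop B
            auto it = dfa.next[q].find(s[scan]);
            if (it == dfa.next[q].end()) break;                   // DFA cannot advance
            q = it->second;
            ++scan;
            if (dfa.accepting[q]) accepted = scan;                // remember last accept
        }
        if (accepted != kUndefined) return std::make_pair(start, accepted);
    }
    return std::make_pair(kUndefined, kUndefined);
}
```

For a one-state DFA for n* (state 0 accepting, with an n loop) and the input "cannot", this returns the range [2, 4), that is, "nn".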
It's important to note that not every starting point has a DFA, because it is not necessary to keep two DFAs with the same state. Since the search is for the first match in the string, if two DFAs share the same state, only the one with the earlier starting point is a plausible match. Furthermore, once some DFA reaches an accepting state, it is no longer necessary to retain any DFA whose starting point comes later, which means that as soon as an accepting state is reached by any DFA, we no longer add new DFAs in the scan.
Since the number of active DFAs is at most the number of states in the DFA, this algorithm runs in O(NM) where N is the length of the string and M is the number of states in the DFA. In practice, the number of active DFAs is usually much less than the number of states (unless there are very few states).
Nonetheless, pathological worst cases still exist because the NFA⇒DFA transformation can exponentially increase the number of states. The exponential blowup can be avoided by using a collection of NFAs instead of DFAs. It's convenient to simplify the NFA transitions by using ε-free NFAs, either by doing an ε-closure on the Thompson automaton or by building a Glushkov automaton. Using the Glushkov automaton guarantees that the number of states is no greater than the length of the pattern.
In pseudocode:
Initialise a vector <v> of <index, state> pairs. Initially the vector
is empty, and its maximum size is the number of states. This vector is
always kept in increasing order by index.
Initialise another vector <active> of string indices, one for each state.
Initially all the elements in <active> are Undefined. This vector records
the most recent index at which some Automaton was in each state.
Initialise <match> to a pair of index positions, both Undefined. This
will record the best match found so far.
For each position <scan> in the string:
    If <match> has not yet been set:
        Append {<scan>, <start_state>} to <v>.
    If <v> is empty:
        Return <match> if it has been set, or otherwise
        return a failure indication.
    Copy <v> to <v'> and empty <v>. (It's possible to recycle <v>,
    but it's easier to write the pseudocode with a copy.)
    For each pair {<i>, <q>} in <v'>:
        If <i> is greater than the starting index in <match>:
            Terminate this loop and continue with the outer loop.
        If there is no transition from state <q> on the symbol at <scan>:
            Continue with the next pair.
        Otherwise, there is a transition to <q'> (if using NFAs, do this for each transition):
            If the index in <active> corresponding to <q'> has already
            been set to <scan>:
                Continue with the next pair.
            Otherwise, <q'> is not yet in <v>:
                Append the pair {<i>, <q'>} at the end of <v>.
                Set the index in <active> at <q'> to <scan>.
                If <q'> is an accepting state:
                    Set <match> to {<i>, <scan>}.
If the end of the string is reached:
    Return <match> if it has been set, or otherwise
    return a failure indication.
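A possible C++ rendering of the parallel scan, for the DFA case only, is sketched below. The Dfa struct and the names kUndefined and find_match_linear are assumptions made for this example. One deliberate difference from the pseudocode: the match is stored with an exclusive end (scan + 1), so the pair can be used directly as a half-open range.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical DFA representation: next[q][c] is the target state, if any.
struct Dfa {
    int start;
    std::vector<bool> accepting;
    std::vector<std::map<char, int> > next;
};

const std::size_t kUndefined = std::string::npos;  // stands in for "Undefined"

// Returns {begin, end} of the first match (end exclusive),
// or {kUndefined, kUndefined} if nothing matches.
std::pair<std::size_t, std::size_t> find_match_linear(const Dfa& dfa, const std::string& s) {
    typedef std::pair<std::size_t, int> Run;  // {starting index, current state}
    std::vector<Run> v, prev;                 // kept in increasing order of starting index
    std::vector<std::size_t> active(dfa.next.size(), kUndefined);
    std::pair<std::size_t, std::size_t> match(kUndefined, kUndefined);

    for (std::size_t scan = 0; scan < s.size(); ++scan) {
        // Spawn a new automaton here only while no match has been found.
        if (match.first == kUndefined) v.push_back(Run(scan, dfa.start));
        if (v.empty()) break;
        prev.swap(v);
        v.clear();
        for (std::size_t k = 0; k < prev.size(); ++k) {
            const std::size_t i = prev[k].first;
            const int q = prev[k].second;
            // Automata that started after the current best match are useless.
            if (match.first != kUndefined && i > match.first) break;
            auto it = dfa.next[q].find(s[scan]);
            if (it == dfa.next[q].end()) continue;  // this automaton dies
            const int q2 = it->second;
            if (active[q2] == scan) continue;       // an earlier start is already in q2
            v.push_back(Run(i, q2));
            active[q2] = scan;
            if (dfa.accepting[q2]) match = std::make_pair(i, scan + 1);
        }
    }
    return match;
}
```

With a two-state DFA for a*b, the input "aaab" yields [0, 4): the automata spawned at positions 1 and 2 are discarded immediately because the one started at position 0 already occupies the same state.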

How to derive the RegEx from the state diagram?

I found a state diagram of a DFA (deterministic finite automaton) with its RegEx in a script, but this diagram is just a sample without any explanations. So I tried to derive the RegEx from the DFA state diagram by myself and got the expression: ab+a+b(a*b)*. I don't understand how to get the original RegEx (ab+a*)+ab+ mentioned in the script. Here is my derivation:
I am grateful for any help, links, references and hints!
You've derived the regex correctly here. The expression you have ab+a+b(a*b)* is equivalent to (ab+a*)+ab+ - once you've finished the DFA state elimination (you have a single transition from the start state to an accepting state), there aren't any more derivations to do. You may however get different final regexes depending on the order you eliminate the states in, and they should all be valid assuming you did the eliminations correctly. The state elimination method is also not guaranteed to be able to produce all equivalent regular expressions for a particular DFA, so it's okay that you didn't arrive at exactly the original regex. You can also check the equivalence of two regular expressions here.
For your particular example though to show that this DFA is equivalent to this original regex (ab+a*)+ab+, take a look at the DFA at this state of the elimination (somewhere in between the second and third steps you've shown above):
Let's expand our expression (ab+a*)+ab+ to (ab+a*)(ab+a*)*ab+. So in the DFA, the first (ab+a*) gets us from state 0 to partway between states 2 and 3 (the a* in a*a).
Then the next part (ab+a*)* means we're allowed to have 0 or more copies of (ab+a*). If there are 0 copies, we'll just finish with an ab+, reading an a from the second half of the a*a transition from 2 to 3 and a b from the 3 to 4 transition, landing us in state 4 which is accepting and where we can take the self loop and read as many b's as we want.
Otherwise we have 1 or more copies of (ab+a*), again reading an a from the second half of the a*a transition from 2 to 3 and a b from the 3 to 4 transition. The a* comes from the first half of the a*ab self loop on state 4 and the second half ab is either the final ab+ of the regex or the start of another copy of (ab+a*). I'm not sure if there's a state elimination that arrives at exactly the expression (ab+a*)+ab+ but for what it's worth, I think the regex you derived more clearly captures the structure of this DFA.
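If you would like a quick sanity check of the equivalence without an online tool, one option is to compare the two expressions on every string over {a, b} up to some length. The helper below is a made-up example using std::regex; a bounded comparison like this is only evidence, not a proof.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns true if both patterns accept exactly the same strings over {a, b}
// among all strings of length 0..max_len.
bool agree_up_to(const std::string& lhs, const std::string& rhs, int max_len) {
    const std::regex r1(lhs), r2(rhs);
    for (int len = 0; len <= max_len; ++len) {
        // Each bit pattern of width len encodes one string: 0 -> 'a', 1 -> 'b'.
        for (unsigned bits = 0; bits < (1u << len); ++bits) {
            std::string s(static_cast<std::size_t>(len), 'a');
            for (int i = 0; i < len; ++i)
                if ((bits >> i) & 1u) s[static_cast<std::size_t>(i)] = 'b';
            if (std::regex_match(s, r1) != std::regex_match(s, r2)) return false;
        }
    }
    return true;
}
```

Calling agree_up_to("ab+a+b(a*b)*", "(ab+a*)+ab+", 10) compares the two expressions from the question on all 2047 strings of length at most 10.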

Convert a regular expression to DFA

I have been trying different ways to solve this problem for over an hour and am getting very frustrated.
The problem is: Give regular expressions and DFAs for each of the following languages over Sigma = {0,1}.
a). {w ∈ Σ* | w contains an even number of 0s or an odd number of 1s}
If anyone could provide hints or get me started on figuring this one out, it would be very appreciated!
I know it is something along the lines of this DFA but this one is for
{w ∈ Σ* | w contains an even number of 0s or exactly two 1's}
so it's a bit different but I can't figure it out.
You can see it as follows: you always have to remember two things:
whether the number of 0s is even or odd; and
whether the number of 1s is even or odd.
Now if we denote even with e and odd with o, we consider four states: ee (both even), eo (even number of 0s and odd number of 1s), oe and oo.
Now when we read a zero (0), we simply swap the first state token, so it means we introduce transitions from:
ee - 0 -> oe;
eo - 0 -> oo;
oe - 0 -> ee; and
oo - 0 -> eo.
The same for ones (1):
ee - 1 -> eo;
eo - 1 -> ee;
oe - 1 -> oo; and
oo - 1 -> oe.
Now we only need to determine the initial state and the accepting state(s). The initial state is ee, since at that moment we have considered no zeros and no ones.
Furthermore the accepting states can be determined by the condition:
w contains an even number of 0s or an odd number of 1s
So that means the accepting states are ee, eo and oo. A drawing of this DFA is shown below:
There exists an algorithmic way to convert a DFA into an equivalent regular expression as is stated here.
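The four-state DFA described above is small enough to write down directly. In the sketch below the state is packed into two bits, which is an encoding chosen for this example: bit 0 is set when the number of 0s read so far is odd, bit 1 when the number of 1s is odd, so 0 = ee, 1 = oe, 2 = eo and 3 = oo.

```cpp
#include <string>

// Simulates the parity DFA: accepts strings over {0, 1} with an even number
// of 0s or an odd number of 1s. Any other character rejects the input.
bool accepts(const std::string& w) {
    int state = 0;                      // ee: nothing read yet
    for (char c : w) {
        if (c == '0') state ^= 1;       // reading 0 flips the parity of the 0s
        else if (c == '1') state ^= 2;  // reading 1 flips the parity of the 1s
        else return false;
    }
    return state != 1;                  // only oe (odd 0s, even 1s) rejects
}
```

Each XOR corresponds to one of the eight transitions listed above, for example ee - 0 -> oe is 0 ^ 1 = 1.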
You can construct a regular expression by splitting the problem into two easier problems:
a regex that checks if the number of 0s is even; and
a regex that checks if the number of 1s is odd.
For the first, you can use the regex:
(1*01*0)*1*
Indeed: you first have a group (1*01*0). This group ensures that there are two zeros, and 1s can appear everywhere in between. We allow an arbitrary number of repetitions, since the number always remains even. The regex ends with 1* since it is still possible that there are additional ones in the string.
The second problem can be solved with the regex:
0*1(0*10*1)*0*
The solution is more or less the same. The expression between the brackets, (0*10*1), ensures that the remaining ones occur in pairs. By adding a 1 in front, we ensure the number of 1s is odd.
A regular expression that then solves the problem is:
(1*01*0)*1*|0*1(0*10*1)*0*
Since the "pipe" (|) means "or".
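One way to gain some confidence in the combined expression is to compare it against a direct count of the characters. The two helper functions below are made up for this example and use std::regex with a whole-string match.

```cpp
#include <algorithm>
#include <regex>
#include <string>

// Whole-string match against the combined regular expression.
bool matches_regex(const std::string& w) {
    static const std::regex re("(1*01*0)*1*|0*1(0*10*1)*0*");
    return std::regex_match(w, re);
}

// Reference answer obtained by counting, for strings over {0, 1}.
bool by_counting(const std::string& w) {
    const auto zeros = std::count(w.begin(), w.end(), '0');
    const auto ones = std::count(w.begin(), w.end(), '1');
    return zeros % 2 == 0 || ones % 2 == 1;
}
```

For binary strings the two functions should always agree; for instance both reject "011" (one 0, two 1s) and both accept "0111".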
Think about what possible states you can ever be in.
A string contains either an even number of 0's or an odd number of 0's. (2 possible states)
A string contains either an even number of 1's or an odd number of 1's. (2 possible states)
Now let's look at what combinations are accepted by your language:
even 0's, even 1's: accept
even 0's, odd 1's: accept
odd 0's, even 1's: reject
odd 0's, odd 1's: accept
As a result, your DFA will need 4 states, of which 3 are accept states and 1 is a reject state. Every state will have 2 transitions, each leading to a different state. Since the empty string has an even number of 0's and an even number of 1's, the even/even state will be the initial state.
For making this into a regular expression: think about how you'd match an even number of 0's, then how you'd match an odd number of 1's. The language is just the union of these two.
Alternatively, as suggested by Willem, you can use an algorithm to convert any NFA to a regular expression. It has the advantage of being very general, but it's also more technical. Either way, it should lead to an equivalent regular expression.
What does a string with an even number of 0's look like? It might start with any number of 1's, but when we do find a 0 we had better find another one! There can be any number of 1's in between, but we only care about the 0's. Thus, we come up with the following regular expression:
1*(01*01*)*
You should be able to apply a similar logic to match an odd number of 1's. Finally, OR the two expressions to get the requested regular expression.

How to force jumps with std::next_permutation

Is it possible to manipulate the current permutation sequence in order to skip (in my case) "useless" sequences?
If that isn't possible, would a custom implementation of a permutation iteration be as fast as std::next_permutation?
An example:
1 2 3 4 5 6 7 ...
1 3 2 4 5 6 7 ...
Detecting that "2" at the 2nd position isn't valid, leads to skipping every permutation which begins with "1, 2".
You would have to write some custom rules for that. A smart way to do this is to write code that, whenever it detects a set of invalid permutations, jumps straight to the next permutation that can be valid.
For example, in the case above, knowing that 2 in the 2nd position is invalid, you could swap 2 and 3 and make sure the resulting permutation is the smallest one possible with 3 in that position, and so on.
Also, if you were writing your own implementation of next_permutation, make sure its internals stay as close as possible to those of std::next_permutation. You can read about it here: std::next_permutation Implementation Explanation
Yes, this is possible, and it's not hard once you understand how next_permutation works.
In short, there are N! permutations of N items. However, only one of them is sorted. 1 2 3 4 5 6 7 for instance is the first of 5040 permutations. Viewed as a string (lexicographically) the next permutation is the one that sorts directly after that: 1 2 3 4 5 7 6. In particular, the first 5 elements are unaltered. The next permutation alters the 5th element because the last 2 elements are in reverse order.
So, in your case, once you've found an illegal permutation, you'd have to explicitly calculate the next legal permutation by sort order. If 1 2 is the only illegal prefix then 1 3 2 4 5 6 7 is obviously the next valid permutation.
In general, look at your illegal pattern, determine which positions you have to increase because they violate constraints, and come up with the next valid values for those positions (if your pattern is small enough, you can brute-force this). Then fill in the remaining numbers in sorted order.
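For the common case where a whole prefix is known to be useless, one possible shortcut (a sketch, with skip_prefix being a name invented here) is to put the tail into descending order first. That turns the sequence into the last permutation with the current prefix, so a single std::next_permutation call then lands on the first permutation with a different prefix.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Skips every remaining permutation that starts with the current first
// prefix_len elements. Returns false when the sequence wraps around to
// sorted order, mirroring the return value of std::next_permutation.
bool skip_prefix(std::vector<int>& p, std::size_t prefix_len) {
    // Descending tail == lexicographically last permutation with this prefix.
    std::sort(p.begin() + static_cast<std::ptrdiff_t>(prefix_len), p.end(),
              std::greater<int>());
    return std::next_permutation(p.begin(), p.end());
}
```

Starting from 1 2 3 4 5 6 7 and calling skip_prefix(p, 2) gives 1 3 2 4 5 6 7, which matches the jump asked for in the question. The cost is one sort of the tail per skip, which is usually far cheaper than stepping through every permutation of that tail.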