Ambiguity in transition: How to process string in NFA? - regex

I have built a DFA from a given regular expression to match test strings. In some cases .* occurs (for example .*ab). Say the machine is in state 1. In the DFA, .* corresponds to a transition from state 1 to itself on every character, and there is another transition out of state 1 on 'a'. If the test string contains 'a', which transition is taken? From state 1 the machine could go to two states, which is not possible in a DFA.

I will start with the fundamentals and then use your example, so that others may find it helpful too.
Any class of automata can have two forms:
Deterministic
Non-Deterministic.
In a deterministic model we have only a single choice (in other words, no choice) when moving from one configuration to the next.
In the deterministic model of finite automata (DFA), every combination of a state (Q) and a language symbol (Σ) has a unique next state.
Definition of transition function for DFA: δ:Q×Σ → Q
δ(q0, a) → q1
           ^ only a single choice
So, in a DFA every move from one state to the next is definite.
Whereas,
In a non-deterministic model we can have more than one choice for the next configuration.
In the non-deterministic model of finite automata (NFA), the output for a combination of a state (Q) and a language symbol (Σ) is a set of states.
Definition of transition function for NFA: δ: Q×Σ → 2^Q (the power set of Q, so every result is a subset of Q)
δ(q0, a) → {q1, q2, q3}
           ^ a set, not a single state (more than one choice)
In an NFA we can have more than one choice for the next state. That is what you call ambiguity in the transitions of an NFA.
(your example)
Suppose the language symbols are Σ = {a, b} and the language/regular expression is (a + b)*ab. The finite automaton you drew for this language is probably like the one below:
Your question is: which state do we move to when there is more than one choice for the next state?
I will turn it into a more general question.
How to process string in NFA?
I am considering the automaton as an acceptor, which accepts a string if it belongs to the language of the automaton. (Note: an automaton can also act as a transducer.) Below is my answer, using the example above.
The NFA above is a 5-tuple with the following components:
1. Σ : {a, b}
2. Q : {q1, q2, q3}
3. q1: initial state
4. F : {q3} <--- F is the set of final states
5. δ : Transition rules in the above diagram:
δ(q1, a) → { q1, q2 }
δ(q1, b) → { q1 }
δ(q2, b) → { q3 }
The example finite automaton is actually an NFA because of the transition rule δ(q1, a) → { q1, q2 }: if we read the symbol a while the present state is q1, the next state can be either q1 or q2 (more than one choice). So when we process a string in this NFA, we get an extra path to follow wherever a symbol a is processed while the current state is q1.
A string is accepted by an NFA if there is some sequence of possible moves that puts the machine in a final state at the end of string processing. The set of all strings that have some path from the initial state to any final state in F is called the language of the NFA.
We can also write the answer to "what is the language defined by an NFA?" as:
L(nfa) = { w ∈ Σ* | δ*(q1, w) ∩ F ≠ ∅ }
When I was new to this, it looked too complex to me, but it really is not.
L(nfa) says: a string w made of language symbols (w ∈ Σ*) is in the language if (|) the set of states reached after processing w from the initial state (δ*(q1, w)) contains some state from the set of final states (hence its intersection with the final states is not empty: δ*(q1, w) ∩ F ≠ ∅). So while processing a string in Σ*, we need to keep track of all the possible paths.
Example-1: processing the string abab through the above NFA:
--►(q1)---a---►(q1)---b---►(q1)---a---►(q1)---b---►(q1)
     \                       \
      a                       a
       \                       \
        ▼                       ▼
       (q2)                    (q2)---b---►((q3))
        |
        b
        |
        ▼
       (q3)
        |
        a
        |
       halt
The above diagram shows how the string abab is processed in the NFA.
A halt means the string could not be processed completely on that path, so that path does not accept the string.
The string abab could be processed completely along two paths, so δ*(q1, w) = { q1, q3 },
and the intersection of δ*(q1, w) with the set of final states is {q3}:
{q1, q3} ∩ F
==> {q1, q3} ∩ {q3}
==> {q3} ≠ ∅
In this way, the string abab is in the language L(nfa).
Example-2: the string abba from Σ* is processed as follows:
--►(q1)---a---►(q1)---b---►(q1)---b---►(q1)---a---►(q1)
     \                                   \
      a                                   a
       \                                   \
        ▼                                   ▼
       (q2)                                (q2)
        |
        b
        |
        ▼
       (q3)
        |
        b
        |
       halt
For the string abba the set of reachable states is δ*(q1, w) = { q1, q2 }, and no state in this set is a final state. Its intersection with F is therefore ∅, the empty set, so the string abba is not accepted (it is not in the language).
This is the way we process a string in Non-deterministic Finite Automata.
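The procedure above (tracking every reachable state at once) can be sketched in a few lines of Python. This is only a minimal illustration under the assumption that the NFA is stored as a dictionary; the names DELTA, delta_star and accepts are made up for this sketch.

```python
# Transition table of the example NFA for (a + b)*ab.
# A missing entry means "no move", i.e. that path halts.
DELTA = {
    ("q1", "a"): {"q1", "q2"},
    ("q1", "b"): {"q1"},
    ("q2", "b"): {"q3"},
}
START = "q1"
FINAL = {"q3"}


def delta_star(word, start=START):
    """Return the set of states reachable after reading the whole word."""
    current = {start}
    for symbol in word:
        following = set()
        for state in current:
            # Follow every choice the NFA has for this state and symbol.
            following |= DELTA.get((state, symbol), set())
        current = following
    return current


def accepts(word):
    """A word is accepted if some reachable state is final."""
    return bool(delta_star(word) & FINAL)


print(delta_star("abab"), accepts("abab"))
print(delta_star("abba"), accepts("abba"))
```

Paths that halt simply drop out of the set, which is why no explicit backtracking is needed here.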
Some additional important notes:
For finite automata, the deterministic and non-deterministic models are equally capable. The non-deterministic model has no extra power to define languages.
Hence the scope of NFA and DFA is the same, namely the regular languages. (This is not the case for every class of automata; for example, the scope of PDA != NPDA.)
Non-deterministic models are more useful for theoretical purposes and are comparatively easy to draw, whereas for implementation we always want a deterministic model (minimized for efficiency). Fortunately, in the class of finite automata every non-deterministic model can be converted into an equivalent deterministic one. There is an algorithmic method to convert an NFA into a DFA.
The information represented by a single state in the DFA can be represented by a combination of NFA states, hence an NFA can usually get by with fewer states than an equivalent DFA. (It can be proved that numberOfStates(DFA) <= 2 power numberOfStates(NFA), because the DFA states are subsets of the NFA states, i.e. elements of the power set.)
The DFA for above regular language is as below:
Using this DFA you will always find a unique path from the initial state for any string in Σ*, and instead of a set you reach a single state. If that state belongs to the set of final states, the input string is accepted (it is in the language); otherwise it is not.
(Your expression .*ab and (a + b)*ab are the same. In theoretical computer science we usually do not use the . dot operator for anything other than concatenation.)
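The NFA-to-DFA conversion mentioned above is usually done with the subset construction. Below is a minimal sketch of it for the example NFA, assuming the same dictionary representation as before; subset_construction and dfa_accepts are hypothetical helper names, and the resulting DFA is not minimized.

```python
from collections import deque

NFA_DELTA = {
    ("q1", "a"): {"q1", "q2"},
    ("q1", "b"): {"q1"},
    ("q2", "b"): {"q3"},
}


def subset_construction(delta, start, final, alphabet):
    """Build a DFA whose states are sets (frozensets) of NFA states."""
    start_set = frozenset({start})
    dfa_delta = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # The DFA target is the union of all NFA moves from `current`.
            target = frozenset(
                s for q in current for s in delta.get((q, symbol), ())
            )
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_final = {s for s in seen if s & final}
    return start_set, dfa_delta, dfa_final


def dfa_accepts(word, start_set, dfa_delta, dfa_final):
    """Follow the single defined path through the DFA."""
    state = start_set
    for symbol in word:
        state = dfa_delta[(state, symbol)]
    return state in dfa_final


dfa_start, dfa_delta, dfa_final = subset_construction(NFA_DELTA, "q1", {"q3"}, "ab")
for (state, symbol), target in dfa_delta.items():
    print(sorted(state), symbol, "->", sorted(target))
```

For this example only the subsets {q1}, {q1, q2} and {q1, q3} are reachable, so the DFA stays well below the 2 power 3 bound.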

Matches with such regular expressions happen via backtracking. When there is an ambiguity about the next state, the evaluation takes the first choice and remembers it made the choice. If taking the first choice results in a failure to match, the evaluation backtracks to the last choice it made and tries the next available choice from that state.
I'm not sure such a mechanism maps to a strict DFA.
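To make the backtracking idea concrete, here is a minimal sketch that runs the NFA for (a + b)*ab by trying the first choice and falling back to the next one on failure. It is an illustration only, not how any particular regex engine is implemented; DELTA and backtrack_accepts are names chosen for this sketch.

```python
# Choices are kept in lists so that "first choice" is well defined.
DELTA = {
    ("q1", "a"): ["q1", "q2"],
    ("q1", "b"): ["q1"],
    ("q2", "b"): ["q3"],
}
FINAL = {"q3"}


def backtrack_accepts(word, state="q1", pos=0):
    """Depth-first search over the NFA's choices, one path at a time."""
    if pos == len(word):
        return state in FINAL
    for nxt in DELTA.get((state, word[pos]), []):
        # Take this choice; if it fails, the loop moves on to the next one.
        if backtrack_accepts(word, nxt, pos + 1):
            return True
    return False  # every choice from here failed: backtrack


print(backtrack_accepts("abab"))
print(backtrack_accepts("abba"))
```

Unlike a DFA, this may revisit the same input position several times, which is where the worst-case cost of backtracking matchers comes from.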

Related

How do you convert regular grammar into finite automata: S->aaB|aB|epsilon, B->bb|bS|aBB

How do I deal with aaB and aB? On getting the input aa I make three states, including the start state. Can I then add one more transition from the start state that leads to state B on getting a, or do I have to do something else?
For this question we first need to know about regular grammars.
A regular grammar is also known as a Type-3 grammar.
A regular grammar generates a regular language. Its productions have a single non-terminal on the left-hand side and a right-hand side consisting of a single terminal or a single terminal followed by a non-terminal.
The productions must be in the form:
A ⇢ xB
A ⇢ x
A ⇢ Bx
where A, B ∈ Variable(V) and x ∈ T* i.e. string of terminals.
Types of regular grammar:
Left linear grammar(LLG):
A grammar is left linear if all its productions are of the form
A ⇢ Bx
A ⇢ x
where A,B ∈ V and x ∈ T*
Right linear grammar(RLG):
A grammar is right linear if all its productions are of the form
A ⇢ xB
A ⇢ x
where A,B ∈ V and x ∈ T*
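The language generated by a type-3 grammar is a regular language, for which an FA can be designed. The FA can also be converted back into a type-3 grammar.
The given grammar is a right linear grammar, except for the production B ⇢ aBB, which has two non-terminals on its right-hand side and therefore does not fit the form A ⇢ xB.
[Image: finite automaton built from the regular grammar]
As there are two variables, S and B, we create one state for each. A production with more than one terminal, such as S ⇢ aaB, needs an extra intermediate state between the terminals.
Make S a final state, because S ⇢ ε.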

RegEx that matches "variable" strings/sequences? + backtracking?

I want to use a regex-like language to match against a variable string (in my case a sequence of characters|words|numbers stored in a graph DB).
I found a way to implement a regex engine:
https://deniskyashif.com/2019/02/17/implementing-a-regular-expression-engine/
The problem is that it matches against a static string. My case is what I call a variable string/sequence.
For example, let's say I have stored the following sequences:
who; why; when; where;
Keep in mind that I don't have the sequences available (so I cannot loop over them); they are deconstructed into a graph. (You can think of the interface to the sequences as a function which, given a prefix, predicts/returns the next character.)
If I match against the regex w*, it should match/return all of the strings one after another, as in backtracking.
If I use whe* => when, where
etc..
Is there a way to modify an NFA or DFA so that it accommodates a variable string?
I just started exploring how to implement an NFA and think the change has to be here:
function search(nfa, word) { .... }
It has to be a search that passes along the next expected regex symbol/state, i.e. given the previous string symbol, does the next predicted string symbol match the expected regex symbol?
The regex should 'drive' the match, rather than the string! It should be doable because the regex is deconstructed into finite states with transitions.
What do you think?
The sequences are stored as a tree in the graph DB. For example, they can be represented as:
lvl5: (where:.)
lvl4: (wher:e), (when:.), (whom:.),
lvl3: (whe:r), (whe:n), (who:m), (who:.), (why:.)
lvl2: (wh:y) , (wh:o), (wh:e)
lvl1: (w:h)
lvl0: w h y o .
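One way to read the idea of letting the pattern drive the walk is sketched below. It is only an illustration under several assumptions: next_chars stands in for the graph-DB interface described above (given a prefix, return the possible next characters, with '.' marking the end of a stored sequence), the pattern is treated as a literal prefix followed by "anything" (the way w* and whe* are used in this question, not the usual regex meaning of *), and WORDS is sample data that a real store would not expose as a list.

```python
WORDS = ["who", "why", "when", "where", "whom"]  # sample data for the stand-in


def next_chars(prefix):
    """Hypothetical graph-DB lookup: characters that may follow `prefix`."""
    out = set()
    for w in WORDS:
        if w.startswith(prefix):
            out.add(w[len(prefix)] if len(w) > len(prefix) else ".")
    return sorted(out)


def expand_prefix(prefix):
    """Yield every stored sequence that starts with `prefix`."""
    # The pattern drives the walk: each literal symbol must be offered
    # by the store as a possible next character.
    for k in range(len(prefix)):
        if prefix[k] not in next_chars(prefix[:k]):
            return
    stack = [prefix]
    while stack:
        current = stack.pop()
        for ch in reversed(next_chars(current)):
            if ch == ".":
                yield current  # a stored sequence ends here
            else:
                stack.append(current + ch)  # remembered choice point


print(list(expand_prefix("whe")))
print(list(expand_prefix("w")))
```

The explicit stack plays the role of the backtracking memory: each pushed prefix is a branch that is revisited after the current one is exhausted.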
I don’t understand your question, but this regex could be the answer:
<prefix>.*?\b
Where <prefix> is w or whe etc.
This will match all words in the input that start with the prefix.
In whatever language you’re using, there should be a way to loop over all matches found for a given input.
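In Python, for instance, looping over all matches of that regex could look like the sketch below. It assumes the sequences are available as one input string, which is the situation this answer addresses; words_with_prefix is a hypothetical helper name.

```python
import re


def words_with_prefix(prefix, text):
    """Return all matches of <prefix>.*?\\b in text."""
    # re.escape keeps characters of the prefix from being read as regex syntax.
    pattern = re.escape(prefix) + r".*?\b"
    return [m.group(0) for m in re.finditer(pattern, text)]


text = "who; why; when; where;"
print(words_with_prefix("whe", text))
print(words_with_prefix("w", text))
```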

A regex for maximal periodic substrings

This is a follow-up to A regex to detect periodic strings.
A period p of a string w is any positive integer p such that w[i]=w[i+p]
whenever both sides of this equation are defined. Let per(w) denote
the size of the smallest period of w . We say that a string w is
periodic iff per(w) <= |w|/2.
So informally a periodic string is just a string that is made up from another string repeated at least once. The only complication is that at the end of the string we don't require a full copy of the repeated string as long as it is repeated in its entirety at least once.
For example, consider the string x = abcab. per(abcab) = 3 as x[1] = x[1+3] = a, x[2]=x[2+3] = b and there is no smaller period. The string abcab is therefore not periodic. However, the string ababa is periodic as per(ababa) = 2.
As more examples, abcabca, ababababa and abcabcabc are also periodic.
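The definition can be checked directly with a small brute-force sketch in Python. The function names per and is_periodic follow the notation above but are otherwise made up for this illustration, and Python indices are 0-based while the text uses 1-based positions.

```python
def per(w):
    """Smallest positive p with w[i] == w[i + p] wherever both are defined."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    return n  # only reached for the empty string


def is_periodic(w):
    """w is periodic iff per(w) <= |w| / 2."""
    return len(w) > 0 and 2 * per(w) <= len(w)


for word in ["abcab", "ababa", "abcabca", "ababababa", "abcabcabc"]:
    print(word, per(word), is_periodic(word))
```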
#horcruz, amongst others, gave a very nice regex to recognize a periodic string. It is
\b(\w*)(\w+\1)\2+\b
I would like to find all maximal periodic substrings in a longer string. These are sometimes called runs in the literature.
Formally, a substring w = T[i..j] with smallest period p = per(w) is a maximal periodic substring if it is periodic and neither T[i-1] = T[i-1+p] nor T[j+1] = T[j+1-p] holds. Informally, the "run" cannot be contained in a larger "run"
with the same period.
The four maximal periodic substrings (runs) of string T = atattatt are T[4,5] = tt, T[7,8] = tt, T[1,4] = atat, T[2,8] = tattatt.
The string T = aabaabaaaacaacac contains the following 7 maximal periodic substrings (runs):
T[1,2] = aa, T[4,5] = aa, T[7,10] = aaaa, T[12,13] = aa, T[13,16] = acac, T[1,8] = aabaabaa, T[9,15] = aacaaca.
The string T = atatbatatb contains the following three runs. They are:
T[1, 4] = atat, T[6, 9] = atat and T[1, 10] = atatbatatb.
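For reference, the runs in the three examples above can be reproduced with a brute-force sketch that applies the definition to every substring. It is far too slow for long strings and is meant only as a specification to test a regex against; smallest_period and runs are names chosen for this sketch, and positions are reported 1-based to match the text.

```python
def smallest_period(w):
    """Smallest positive p with w[i] == w[i + p] wherever both are defined."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    return n


def runs(t):
    """Return (start, end, substring) for every maximal periodic substring."""
    n = len(t)
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            w = t[i:j + 1]
            p = smallest_period(w)
            if 2 * p > len(w):
                continue  # not periodic
            # Maximality: the period must break on both sides.
            extends_left = i > 0 and t[i - 1] == t[i - 1 + p]
            extends_right = j + 1 < n and t[j + 1] == t[j + 1 - p]
            if not extends_left and not extends_right:
                found.append((i + 1, j + 1, w))
    return found


for text in ["atattatt", "aabaabaaaacaacac", "atatbatatb"]:
    print(text, runs(text))
```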
Is there a regex (with backreferences) that will capture all maximal
periodic substrings?
I don't really mind which flavor of regex but if it makes a difference, anything that the Python module re supports. However I would even be happy with PCRE if that makes the problem solvable.
(This question is partly copied from https://codegolf.stackexchange.com/questions/84592/compute-the-maximum-number-of-runs-possible-for-as-large-a-string-as-possible . )
Let's extend the regex version to the very powerful https://pypi.python.org/pypi/regex . This supports variable length lookbehinds for example.
This should do it, using Python's re module:
(?<=(.))(?=((\w*)(\w*(?!\1)\w\3)\4+))
Fiddle: https://regex101.com/r/aA9uJ0/2
Notes:
You must precede the string being scanned by a dummy character; the # in the fiddle. If that is a problem, it should be possible to work around it in the regex.
Get captured group 2 from each match to get the collection of maximal periodic substrings.
Haven't tried it with longer strings; performance may be an issue.
Explanation:
(?<=(.)) - look-behind to the character preceding the maximal periodic substring; captured as group 1
(?=...) - look-ahead, to ensure overlapping patterns are matched; see How to find overlapping matches with a regexp?
(...) - captures the maximal periodic substring (group 2)
(\w*)(\w*...\w\3)\4+ - #horcruz's regex, as proposed by OP
(?!\1) - negative look-ahead to group 1 to ensure the periodic substring is maximal
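A minimal sketch of how this regex might be driven from Python's re module is shown below, following the notes above: a dummy character is prepended and group 2 is collected from every match. The helper name candidate_runs is made up, and the results are called candidates on purpose, since the sketch makes no claim that every returned string is a maximal run or that all runs are found.

```python
import re

RUN_REGEX = re.compile(r"(?<=(.))(?=((\w*)(\w*(?!\1)\w\3)\4+))")


def candidate_runs(text, dummy="#"):
    """Return (1-based start, substring) pairs captured as group 2."""
    # The dummy character gives the look-behind something to see at offset 0.
    padded = dummy + text
    # Because of the dummy, m.start() in `padded` is the 1-based start in `text`.
    return [(m.start(), m.group(2)) for m in RUN_REGEX.finditer(padded)]


for start, run in candidate_runs("atattatt"):
    print(start, run)
```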
As pointed out by #ClasG, the result of my regex may be incomplete. This happens when two runs start at the same offset. Examples:
aabaab has 3 runs: aabaab, aa and aa. The first two runs start at the same offset. My regex will fail to return the shortest one.
atatbatatb has 3 runs: atatbatatb, atat, atat. Same problem here; my regex will only return the first and third run.
This may well be impossible to solve within the regex. As far as I know, there is no regex engine that is capable of returning two different matches that start at the same offset.
I see two possible solutions:
Ignore the missing runs. If I am not mistaken, then they are always duplicates; an identical run will follow within the same encapsulating run.
Do some postprocessing on the result. For every run found (let's call this X), scan earlier runs trying to find one that starts with the same character sequence (let's call this Y). When found (and not already 'used'), add an entry with the same character sequence as X, but the offset of Y.
I think it is not possible. Regular expressions cannot do complex nondeterministic jobs, even with backreferences. You need an algorithm for this.
This kind of depends on your input criteria... There is no infinite string of characters.. using back references you will be able to create a suitable representation of the last amount of occurrences of the pattern you wish to match.
Personally I would define buckets of length of input and then fill them.
I would then use automata to find patterns in the buckets and then finally coalesce them into larger patterns.
It's not how fast the RegEx is going to be in this case it's how fast you are going to be able to recognize a pattern and eliminate the invalid criterion.

Regular Grammar to my Regex/DFA

I have following regular expression: ((abc)+d)|(ef*g?)
I have created a DFA (I hope it is correct) which you can see here
http://www.informatikerboard.de/board/attachment.php?attachmentid=495&sid=f4a1d32722d755bdacf04614424330d2
The task is to create a regular grammar (Chomsky hierarchy Type 3) and I don't get it. But I created a regular grammar, which looks like this:
S → aT
T → b
T → c
T → dS
S → eT
S → eS
T → ε
T → f
T → fS
T → gS
Best Regards
Patrick
Type 3 grammars in the Chomsky hierarchy are the class of regular grammars, restricted to rules of the following forms:
X -> aY
X -> a,
in which X is an arbitrary non-terminal and a an arbitrary terminal. The rule A -> eps is only allowed if A is not present in any of the right hand sides.
Construction
We notice that the regular expression consists of two possibilities, either (abc)+d or ef*g?, so our first rules will be S -> aT and S -> eP. These rules allow us to start creating one of the two possibilities. Note that the non-terminals are necessarily different, since these are completely disjoint paths in the corresponding automaton. Next we continue with both regexes separately:
(abc)+
We have at least one sequence abc followed by 0 or more further occurrences. It's not hard to see that we can model it like this:
S -> aT
T -> bU
U -> cV
V -> aT # repeat pattern
V -> d # finish word
ef*g?
Here we have an e followed by zero or more f characters and an optional g. Since we already have the first character (one of the first two rules gave us that), we continue like this:
S -> eP
S -> e # from the starting state we can simply add an 'e' and be done with it,
# this is an accepted word!
P -> fP # keep adding f chars to the word
P -> f # add f and stop, if optional g doesn't occur
P -> g # stop and add a 'g'
Conclusion
Put these together and they will form a grammar for the language. I tried to write down the train of thought so you could understand it.
As an exercise, try this regex: (a+b*)?bc(a|b|c)*
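A grammar like the one constructed above can be sanity-checked against the regular expression by brute force. The sketch below is one possible way to do it, assuming each rule X -> aY is read as a transition from state X to state Y on a, and each rule X -> a as a transition into an extra accepting state; GRAMMAR, ACCEPT, grammar_accepts and mismatches are names invented for this illustration.

```python
import re
from itertools import product

# Rules from the construction above: (terminal, next non-terminal or None).
GRAMMAR = {
    "S": [("a", "T"), ("e", "P"), ("e", None)],
    "T": [("b", "U")],
    "U": [("c", "V")],
    "V": [("a", "T"), ("d", None)],
    "P": [("f", "P"), ("f", None), ("g", None)],
}
ACCEPT = "ACCEPT"  # hypothetical extra state reached by rules of the form X -> a
REGEX = re.compile(r"((abc)+d)|(ef*g?)")


def grammar_accepts(word, start="S"):
    """Simulate the grammar as an NFA over its non-terminals."""
    current = {start}
    for symbol in word:
        following = set()
        for state in current:
            for terminal, target in GRAMMAR.get(state, []):
                if terminal == symbol:
                    following.add(ACCEPT if target is None else target)
        current = following
    return ACCEPT in current


def mismatches(max_len=5, alphabet="abcdefg"):
    """Words up to max_len on which the grammar and the regex disagree."""
    bad = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if grammar_accepts(word) != bool(REGEX.fullmatch(word)):
                bad.append(word)
    return bad


print(mismatches())
```

An empty list only shows agreement on short words, not a proof of equivalence, but it catches most slips in a hand-written grammar.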

Check if `LIKE` patterns intersects in Postgres

There are two strings in some request that are patterns used within LIKE expressions (with _ and % placeholders). I want to find out if these patterns intersect (i.e. there is some string that matches them both). Is there any way to do that?...
A “like pattern” corresponds to a finite or infinite set of strings. Each string in this set matches the given pattern. I want to check if the intersection of the string sets for two given patterns is not empty. Thus it is better to speak of a conjunction of patterns. In mathematical language:
S — set of strings
P — set of patterns (where each pattern has one or more string representation)
Sᵢ — subset of strings (Sᵢ ⊂ S) that match pᵢ pattern (where instead of i could be any index).
In equation form: “Sᵢ = {s | s ∈ S, s matches pᵢ, pᵢ ∈ P}”, which means: “Sᵢ is a set of elements that are strings and match pᵢ pattern”.
Or another notation: “Sᵢ ⊂ S, ∀pᵢ ∈ P ∀s ∈ S (s matches pᵢ ≡ s ∈ Sᵢ)”, which means: “Sᵢ is subset of strings and any string is element of Sᵢ if it matches pᵢ pattern”.
Let's define conjunction of patterns: “p₁ ∧ p₂ = p₃ ≡ S₁ ∩ S₂ = S₃” — that means: “Set of strings that match conjunction of patterns p₁ and p₂ is intersection of sets of strings that match p₁ pattern and that match p₂ pattern”.
For example:
ab_d and %cd — intersects
k%n and kl___ — intersects
I want to find if this patterns intersects (have some string that matches them both). Is there any way to do that?... (...) I want to check if intersection of string sets for two given patterns is not empty.
So, if I get this right, given two like patterns, p1 and p2, you're interested in whether there exists a (yet to be determined) string that matches p1 as well as p2.
E.g.:
select check_pattern('a%', 'b_'); -- false
select check_pattern('a%', '_b'); -- true ('ab')
Are you even sure there's a general solution to that problem in the first place?
Assuming there is, plain SQL isn't the right tool to find the solution imho, because you cannot readily express this in terms of "here's my (finite) set of data, join/filter them and yield a set based on it". To find the solution in SQL terms, you'd need to generate the set that stems from your data, and that's obviously not an option when the set in question is infinite.
Methinks you'd want to break up the problem into smaller parts and use a procedural language such as C, Perl, Lisp, whatever you fancy.
One potential solution might be this:
If both p1 and p2 are open on both ends or different ends, the answer is trivially yes: strings matching %foo% will intersect with those matching %bar%, just as strings matching foo% will intersect strings matching %bar.
If p1 yields a finite set (i.e. it contains no %), you could imagine iterating the entire set of potential matches for p1 using generate_series() or a for/while/whatever loop, and trying p2 on each string. It's ugly and inefficient, but it'll eventually work.
If p1 and p2 are both anchored (e.g. abc% and def% or %abc and %def), or reasonably anchored (e.g. _abc% and abcd%) the solution is trivial enough as well by considering the anchored part and proceeding as in the prior case.
I'll leave it to you to enumerate and solve the remaining cases if any...
The key, I think, will be to nail down the anchored parts of your patterns that yield a finite set of strings, and to stick to checking whether the (finite) set of strings they will match will intersect.
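As a sketch of what such a procedure could look like outside SQL, the Python function below walks both patterns at the same time and asks whether their ends can be reached together. This is a different technique from the case analysis above (a search over pairs of pattern positions), offered only as an illustration. It assumes plain LIKE syntax with _ and % only (no escape characters) and literal, case-sensitive comparison; like_patterns_intersect is a made-up name, not the check_pattern function from the example.

```python
from functools import lru_cache


def like_patterns_intersect(p, q):
    """Return True if some string matches both LIKE patterns p and q."""

    @lru_cache(maxsize=None)
    def reach(i, j):
        # (i, j): positions in p and q after matching a common prefix.
        if i == len(p) and j == len(q):
            return True
        # '%' may match the empty string, so either pattern can skip it.
        if i < len(p) and p[i] == "%" and reach(i + 1, j):
            return True
        if j < len(q) and q[j] == "%" and reach(i, j + 1):
            return True
        if i == len(p) or j == len(q):
            return False
        a, b = p[i], q[j]
        if a == "%" and b == "%":
            return False  # both skips were already tried above
        if a == "%":
            return reach(i, j + 1)  # '%' absorbs the character matched by b
        if b == "%":
            return reach(i + 1, j)  # '%' absorbs the character matched by a
        # Two single-character items agree if one is '_' or they are equal.
        return (a == "_" or b == "_" or a == b) and reach(i + 1, j + 1)

    return reach(0, 0)


print(like_patterns_intersect("ab_d", "%cd"))
print(like_patterns_intersect("a%", "b_"))
```

The memoized search visits each pair of positions at most once, so the work is bounded by the product of the two pattern lengths.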