How to detect deviations in a sequence from a pattern? - regex

i want to detect the passages where a sequence of action deviates from a given pattern and can't figure out a clever solution to do this, although the problem sound really simple.
The objective of the pattern is to describe somehow a normal sequence. More specific: "What actions should or shouldn't be contained in an action sequence and in which order?" Then i want to match the action sequence against the pattern and detect deviations and their locations.
My first approach was to do this with regular expressions. Here is an example:
Example 1:
Pattern: A.*BC
Sequence: AGDBC (matches)
Sequence: AGEDC (does not match)
Example 2:
Pattern: ABCD
Sequence: ABD (does not match)
Sequence: ABED (does not match)
Sequence: ABCED (does not match)
Example 3:
Pattern: ABCDEF
Sequence: ABXDXF (does not match)
With regular expressions it is simple to detect a error but not where it occurs. My approach was to successively remove the last regex block until I can find the pattern in the sequence. Then i will know the last correct actions and have at least found the first deviation. But this doesn't seem the best solution to me. Further i can't all deviations.
Other soultions in my mind are working with state machines, order tools like ANTLR. But I don't know if they can solve my problem.
I want to detect errors of omission and commission and give an user the possibility to create his own pattern. Do you know a good way to do this?

While matching your input, a regular expression engine has information on the location of a mismatch -- it may however not provide it in a way where you can easily access it.
Consider e.g. a DFA implementing the expression. It fetches characters sequentially, matching them with expectation, and you are interested at the point in the sequence where there is no valid match.
Other implementations may go back and forth, and you would be interested in the maximum address of any character that was fetched.
In Java, this could possibly be done by supplying a CharSequence implementation to
java.util.regex.Pattern.matches(String regex, CharSequence input)
where the accessor methods keep track of the maximum index.
I have not tried that, though. And it does not help you categorizing an error, either.

have you looked at markov chains? http://en.wikipedia.org/wiki/Markov_chain - it sounds like you want unexpected transitions. maybe also hidden markov models http://en.wikipedia.org/wiki/Hidden_Markov_Models

Find an open-source implementation of regexs, and add in a hook to return/set/print/save the index at which it fails, if a given comparison does not match. Alternately, write your own RE engine (not for the faint of heart) so it'll do exactly what you want!

Related

Regex issue with car submodels

I'm pulling car submodels from the DB and I'm building my regular expression on the fly.
Here is an example of a search string:
EX-L Sedan 4-Door
Here is my regular expression:
preg_match("/LX|EX|EX-L|LX-P|LX-S/Ui", $input_line, $output_array);
For some reason the output is EX and not EX-L as it supposed to be. Can someone explain why?
Your pattern is unanchored and thus the first alternative that matches a substring makes the regex engine stop processing the whole group. This is a common behavior with NFA regexes.
Also, there are no quantifiers in your pattern, thus the /U modifier is redundant.
So, you can use
/EX-L|LX-P|LX-S|LX|EX/i
It is a readable form. However, best practice with regexes is to make sure no alternative branch can match at the same location as another. That means you can use
/EX(-L)?|LX(-[PS])?/i
As others have pointed out, the reason for this undesired outcome is because the regex engine is happy to have the first alternative and run for the door since your pattern has no anchors (like: ^, $, and some other lesser known ones). This is the same short-circuiting behavior you'd see in php's if($x || $y) conditions; if $x is true there is no need to evaluate further. But enough about that...
I would like to offer some additional logic that I think is relevant to your case/question.
You say your regex is built on the fly, so I am assuming your method goes something like this:
A user identifies which substrings/keywords they want to search for.
$strings=array('LX','EX','EX-L','LX-P','LX-S');
// array of substrings in any order
As mentioned earlier, you need longer strings to precede shorter ones with identical starting characters.
rsort($strings);
// sort DESC, longer strings precede shorter strings when leading characters match
Pipe all strings into a single regex pattern with implode().
$piped_regex='/\b(?:'.implode('|',$array).')\b/i';
// word boundaries ensure the string is not part of a larger word; remove if not desired
// pattern: /\b(?:LX-S|LX-P|LX|EX-L|EX)\b/i
While programmatically condensing your similar strings into a concise pattern as Wiktor recommended is possible, it's probably not worth the effort with your on-the-fly patterns.
Finally run preg_match() as normal.
$input_line='EX-L Sedan 4-Door';
if(preg_match($piped_regex,$input_line,$output_array)){
var_export($output_array);
}
// output: array(0=>'EX-L')
I hope stepping out this method is helpful to you and future SO readers.

Rules of regex engines. Greediness, eagerness and laziness of regexes

As we all know, regex engine use two rules when it goes about its work:
Rule 1: The Match That Begins Earliest Wins or regular expressions
are eager.
Rule 2: Regular expressions are greedy.
These lines appear in tutorial:
The two of these rules go hand in hand.
It's eager to give you a result, so what it does is it tries to just
keep letting that first one do all the work.
While we're already in the middle of it, let's keep going, get to the
end of the string and then when it doesn't work out, then it will
backtrack and try another one.
It doesn't backtrack back to the beginning; it doesn't try all sorts
of other combinations.
It's still eager to get you a result, so it says, what if I just gave
back one?
Would that allow me to give a result back?
If it does, great, it's done. It's able to just finish there.
It doesn't have to keep backtracking further in the string, looking
for some kind of a better match or match that's further along.
I don't quite understand these lines (especially 2nd ("While we're...") and last ("It doesn't have to keep backtracking") sentences).
And these lines about lazy mode.
It still defers to the overall match just like the greedy one does
clearly.
I don't understand the following analogy:
It's not necessarily any faster or slower to choose a lazy strategy or
a greedy strategy, but it will probably match different things.
Now as far as is faster or slower, it's a little bit like saying, if
you've lost your car keys and your sunglasses inside your house, is it
better to start looking in the kitchen or to start looking in the
living room?
You don't know which one's going to yield the best result, and you
don't know which one's going to find the sunglasses first or the keys
first; it's just about different strategies of starting the search.
So you will likely get different results depending on where you start,
but it's not necessarily faster to start in one place or the other.
What 'faster or slower' means?
I'm going to draw scheme how it work (in both case). So I will contemplate this questions until I find out what's going on around here!)
I need understand it exactly and unambiguously.
Thanks.
Let's try by the exemple
for an input of this is input for test input on regex and a regex like /this.*input/
The match will be this is input for test input
What will be done is
starting to examine the string and it will get a match with this is input
But now its at the middle of the string, it will continue to see if it could match more on it (this is the While we're already in the middle of it, let's keep going )
It will match till this is input for test input and continue till the end of the string
at the end, there's things wich are not part of the match, so the interpreter "backtrack" to the last time it matches.
For the last part its more about the ored regexes
Consider input string as cdacdgabcdef and the regex (ab|a).*
A common mistake is thinking it will return the more precise one (in this case 'abcdef') but it will return 'acdgabcdef' because the a match is the first one to match.
what happens here is: There's something matching this part, let's continue to the next part of the pattern and forget about the other options in this part.
For the lazy and greedy questions, the link of #AvinashRaj is clear enough, I won't repeat it here.

Extracting static strings from a regular expression

I'm trying to efficiently extract static strings (strings that MUST be matched for a given regular expression to match). I've been able to do it in the simplest cases but I'm trying to discover a more robust solution.
Given a regex such as the one below
"fox jump(ed|ing|s)"
would give us
"fox,jumped,jumping,jumps"
Another example is
"fox jump(ed|ing|s)?"
which would give us
"fox,jump"
because of the optional operator
The algorithm I have is overly simple for now. It will start from the end of the regex and removes groups or a single character followed by these operators "* ?" as well as "explode" grouped OR operators "(|)". This has worked quite well but doesn't take into consideration the full syntax of a regex. You can think of it as kind of a minimal set generating process for a regex (the minimal set of strings that the regex can "generate/must match").
WHY?
I'm trying to match a bunch of text against a large set of regexes. If I can get a list of "keywords" for these regexes that is "required" I can do a quick text search for that keyword to filter the regexes I care about (ignore the ones I am guaranteed to not match or even skip that text entirely effectively not running any regexes on the text because we are guarenteed to not have a match within our set of regexes). I can organize this set of keywords in an efficient data structure (Binary Search/Trie/Aho-Corasick) to filter the set of regexes before I even try to run the text through the Finite Automata. There are extremely fast string matching algorithms that I can run as a filtering stage before I attempt to run a regular expression. I've been able to increase throughput many folds doing this simple process.
See the library Xeger which given a regular expression will give you all the possible strings that match.
You seem to only want to keep the common prefix of these strings (the part where you said to ignore optional operators) but if you do that you might capture stings that have that common prefix yet do not have the ending you want (such as "jumpy" in your example). If this is not a problem then just find the shortest string given by Xeger, assuming that optional operators occur only at the end of the regex.
If I understand your problem correctly, you are looking for a set of words such that all these words are (disjoint) substrings of any word accepted by the (given) regular expression.
My guess is that such a set will very often be empty, but nevertheless it can be found.
To find such a set, I propose the following algorithm:
Find the FA corresponding to your input regex.
Identify bridges ( https://en.wikipedia.org/wiki/Bridge_(graph_theory) ) between the starting state S and the accepting state F. This can for example be done by removing an edge E and asking whether a path exists from S to E in the FA with E removed - repeat this for all edges.
All edges that are bridges must be hit during an accepting run, and each edge corresponds to a letter of input.
You may now generate the required words by connecting subsequent bridge edges end-to-end.
I think it follows from the algorithm construction that an FA (and not a DFA) suffices for this to work. Again, a proof would be nice but I think it should work:)

Can one find out which input characters matched which part of a regex?

I'm trying to build a tool that uses something like regexes to find patterns in a string (not a text string, but that is not important right now). I'm familiar with automata theory, i.e. I know how to implement basic regex matching, and output true or false if the string matches my regex, by simulating an automaton in the textbook way.
Say I'm interested in all as that comes before bs, with no more as before the bs, so, this regex: a[^a]*b. But I don't just want to find out if my string contains such a part, I want to get as output the a, so that I can inspect it (remember, I'm not actually dealing with text).
In summary: Let's say I mark the a with parentheses, like so: (a)[^a]*b and run it on the input string bcadacb then I want the second a as output.
Or, more generally, can one find out which characters in the input string matches which part of the regex? How is it done in text editors? They at least know where the match started, because they can highlight the matches. Do I have to use a backtracking approach, or is there a smarter, less computationally expensive, way?
EDIT: Proper back references, i.e. capturing with parens and referencing with \1, etc. may not be necessary. I do know that back references do introduce the need for backtracking (or something similar) and make the problem (IIRC) NP-hard. My question, in essence, is: Is the capturing part, without the back referencing, less computationally expensive than proper back references?
Most text editors do this by using a backtracking algorithm, in which case recording the match locations is trivial to add.
It is possible to do with a direct NFA simulation too, by augmenting the state lists with parenthesis location information. This can be done in a way that preserves the linear time guarantee. See http://swtch.com/~rsc/regexp/regexp2.html#submatch.
Timos's answer is on the right track, but you cannot tag DFA states, because a DFA state corresponds to a collection of possible NFA states, and so one DFA state might represent the possibility of having passed a paren (but maybe something else too) and if that turns out not to be the case, it would be incorrect to record it as fact. You really need to work on the NFA simulation instead.
After you constructed your DFA for the matching, mark all states which correspond to the first state after an opening parenthesis in the regex. When you visit such a state, save the index of the current input character, when you visit a state which corresponds to a closing parenthesis, also save the index.
When you reach an accepting state, output the two indices. I am not sure if this is the algorithm used in text editors, but that's how I would do it.

Create regex from samples algorithm

AFAIK no one have implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I'd give my algorithm this two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give more than one regex or even no possible regex back and you might find 1000 reasons why such algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about regex itself also. Basically what I want is an algorithm that takes samples as the ones above and then finds a pattern in them that can be used to easily find the "kind" of text I want to find in a string without having to write any regex or code manually.
I don't have proof but I suspect no such discrete algorithm with a finite output could exist since you are asking for the creation of a regular language which could be "large" in respect to the input size.
With that, I suggest you peek at txt2re which can break down sample texts one-by-one and help you build regexes.
FlashFill a new feature of MS Excel 2013 would do exactly the task you want, but it does not give you the regular expression. It's a NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulation from multiple examples, Go Flash Fill official website and read a few papers. They have pseudo-code and demo. movies as well.
There are many such algorithm in fact. This is a research area called "Grammatical inference".
I know RPNI, for example. (you could also look on the probabilistic branch, alergia, MDI, DEES). These algorithms generate DSA (Deterministic State Automata). In fact you absolutely don't need to enter the strings in your example. Only substrings.
There are also some algorithms to generate directly Non deterministic automata.
Of course, get the regular expression from an Non Deterministic Automata is easy.
The main ideas are simple:
Generate a PTSA (Prefix Tree State Automata) from your sample.
Then, you have to try to "merge" some states. From these merge, will emerge loops (i.e. * in the regular expression). All the difficulty being to choose the right rule to merge.
Here you go:
re = '|'.join(substrings)
If you want anything more general, your algorithm is going to have to make educated guesses about what type of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for all possible sets of possible inputs without simply enumerating them all. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.