Regular expressions: search (instead of match) with DFA - regex

I have been wondering about theory behind search mode of regex matcher. So, say I have a regex that matches aab, what if instead of match at the beginning of the string I wanna be able to perform this match starting from any position of the string. What I mean - in match mode I can only verify that string aab is consistent with a regex, on the other hand with search this should work with say aaab producing corresponding span result.
So super specifically - is there any way to build DFA searcher, or this is fundamentally impossible since it would require additional memory which you cant have in DFSM. It is obvious though that you can in fact build searcher from matcher by reapplying matcher to input string in a for loop, but complexity of such approach is something like O(len_of_pattern * len_of_input).

A regex searcher is basically the same thing as a regex matcher with .* tacked onto the front of the expression.

I think you basically answered your own question. Search, like many other features in modern regex implementations, leverages the wonder of memory to do things unfeasible with a Finite Automata. DFAs traditionally cannot loop over strings or backtrack along input as this would require memory. Search requires the ability to find a match and then understand how that match fits in the string.

Related

Regex issue with car submodels

I'm pulling car submodels from the DB and I'm building my regular expression on the fly.
Here is an example of a search string:
EX-L Sedan 4-Door
Here is my regular expression:
preg_match("/LX|EX|EX-L|LX-P|LX-S/Ui", $input_line, $output_array);
For some reason the output is EX and not EX-L as it supposed to be. Can someone explain why?
Your pattern is unanchored and thus the first alternative that matches a substring makes the regex engine stop processing the whole group. This is a common behavior with NFA regexes.
Also, there are no quantifiers in your pattern, thus the /U modifier is redundant.
So, you can use
/EX-L|LX-P|LX-S|LX|EX/i
It is a readable form. However, best practice with regexes is to make sure no alternative branch can match at the same location as another. That means you can use
/EX(-L)?|LX(-[PS])?/i
As others have pointed out, the reason for this undesired outcome is because the regex engine is happy to have the first alternative and run for the door since your pattern has no anchors (like: ^, $, and some other lesser known ones). This is the same short-circuiting behavior you'd see in php's if($x || $y) conditions; if $x is true there is no need to evaluate further. But enough about that...
I would like to offer some additional logic that I think is relevant to your case/question.
You say your regex is built on the fly, so I am assuming your method goes something like this:
A user identifies which substrings/keywords they want to search for.
$strings=array('LX','EX','EX-L','LX-P','LX-S');
// array of substrings in any order
As mentioned earlier, you need longer strings to precede shorter ones with identical starting characters.
rsort($strings);
// sort DESC, longer strings precede shorter strings when leading characters match
Pipe all strings into a single regex pattern with implode().
$piped_regex='/\b(?:'.implode('|',$array).')\b/i';
// word boundaries ensure the string is not part of a larger word; remove if not desired
// pattern: /\b(?:LX-S|LX-P|LX|EX-L|EX)\b/i
While programmatically condensing your similar strings into a concise pattern as Wiktor recommended is possible, it's probably not worth the effort with your on-the-fly patterns.
Finally run preg_match() as normal.
$input_line='EX-L Sedan 4-Door';
if(preg_match($piped_regex,$input_line,$output_array)){
var_export($output_array);
}
// output: array(0=>'EX-L')
I hope stepping out this method is helpful to you and future SO readers.

How to invert an arbitrary Regex expression

This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill either because they restrict their question to a very specific example, or to a specific usercase (e.g: single chars only) or because you need substitution for a successful approach, or because you'd need to use a programming language (e.g: C#'s split, or Match().Value).
I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon", it would match The cow jumps and also match the moon.
That's only a simple example of course. The Regex could be something more messy such as "o.*?m", in which case the matches would be: The c, ps, and oon.
Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field which I was hoping to keep clear. Also, everything else is matched, but only a character by character basis instead of big chunks.
Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.
From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
The answer -
A match is Not Discontinuous, it is continuous !!
Each match is a continuous, unbroken substring. So, within each match there
is no skipping anything within that substring. Whatever matched the
regular expression is included in a particular match result.
So within a single Match, there is no inverting (i.e. match not this only) that can extend past
a negative thing.
This is a Tennant of Regular Expressions.
Further, in this case, since you only want all things NOT something, you have
to consume that something in the process.
This is easily done by just capturing what you want.
So, even with multiple matches, its not good enough to say (?:(?!\bover\b).)+
because even though it will match up to (but not) over, on the next match
it will match ver ....
There are ways to avoid this that are tedious, requiring variable length lookbehinds.
But, the easiest way is to match up to over, then over, then the rest.
Several constructs can help. One is \K.
Unfortunately, there is no magical recipe to negate a pattern.
As you mentioned it in your question when you have an efficient pattern you use with a match method, to obtain the complementary, the more easy (and efficient) way is to use a split method with the same pattern.
To do it with the pattern itself, workarounds are:
1. consuming the characters that match the pattern
"other content" is the content until the next pattern or the end of the string.
alternation + capture group:
(pattern)|other content
Then you must check if the capture group exists to know which part of the alternation succeeds.
"other content" can be for example described in this way: .*?(?=pattern|$)
With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:
pattern(*SKIP)(*FAIL)|other content
With this variant, you don't need to check anything after, since the first branch is forced to fail.
or without alternation:
((?:pattern)*)(other content)
variant in PCRE, Perl, or Ruby with the \K feature:
(?:pattern)*\Kother content
Where \K removes all on the left from the match result.
2. checking characters of the string one by one
(?:(?!pattern).)*
if this way is very simple to write (if the lookahead is available), it has the inconvenient to be slow since each positions of the string are tested with the lookahead.
The amount of lookahead tests can be reduced if you can use the first character of the pattern (lets say "a"):
[^a]*(?:(?!pattern)a[^a]*)*
3. list all that is not the pattern.
using character classes
Lets say your pattern is /hello/:
([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$))))*
This way becomes quickly fastidious when the number of characters is important, but it can be useful for regex flavors that haven't many features like POSIX regex.

Finding a string *and* its substrings in a haystack

Suppose you have a string (e.g. needle). Its 19 continuous substrings are:
needle
needl eedle
need eedl edle
nee eed edl dle
ne ee ed dl le
n e d l
If I were to build a regex to match, in a haystack, any of the substrings I could simply do:
/(needle|needl|eedle|need|eedl|edle|nee|eed|edl|dle|ne|ee|ed|dl|le|n|e|d|l)/
but it doesn't look really elegant. Is there a better way to create a regex that will greedly match any one of the substrings of a given string?
Additionally, what if I posed another constraint, wanted to match only substrings longer than a threshold, e.g. for substrings of at least 3 characters:
/(needle|needl|eedle|need|eedl|edle|nee|eed|edl|dle)/
note: I deliberately did not mention any particular regex dialect. Please state which one you're using in your answer.
As Qtax suggested, the expression
n(e(e(d(l(e)?)?)?)?)?|e(e(d(l(e)?)?)?)?|e(d(l(e)?)?)?|d(l(e)?)?|l(e)?|e
would be the way to go if you wanted to write an explicit regular expression (egrep syntax, optionally replace (...) by (?:...)). The reason why this is better than the initial solution is that the condensed version requires only O(n^2) space compared to O(n^3) space in the original version, where n is the length of the input. Try this with extraordinarily as input to see the difference. I guess the condensed version is also faster with many regexp engines out there.
The expression
nee(d(l(e)?)?)?|eed(l(e)?)?|edl(e)?|dle
will look for substrings of length 3 or longer.
As pointed out by vhallac, the generated regular expressions are a bit redundant and can be optimized. Apart from the proposed Emacs tool, there is a Perl package Regexp::Optimizer that I hoped would help here, but a quick check failed for the first regular expression.
Note that many regexp engines perform non-overlapping search by default. Check this with the requirements of your problem.
I have found elegant almostsolution, depending how badly you need only one regexp. For example here is the regexp, which finds common substring (perl) of length 7:
"$needle\0$heystack" =~ /(.{7}).*?\0.*\1/s
Matching string is in \1. Strings should not contain null character which is used as separator.
You should make a cycle which starters with length of the needle and goes downto treshold and tries to match the regexp.
Is there a better way to create a regex that will match any one of the
substrings of a given string?
No. But you can generate such expression easily.
Perhaps you're just looking for
.*(.{1,6}).*

The Greedy Option of Regex is really needed?

The Greedy Option of Regex is really needed?
Lets say I have following texts, I like to extract texts inside [Optionx] and [/Optionx] blocks
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with Regex Greedy Option, its give me
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Anybody need like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most application s we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
Regular expressions can potentially match multiple portion of a text.
For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abcabab(c)ccc
....
some regular expressions implementation are actually able to return the entire set of matches but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.
This is tipical of search and replace (a la grep) when with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[\".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[\". That way you don't have to care about greediness, just tell it what you don't want to see.
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
just use this algorithm which you can use in your fav language. No need regex.
flag=0
open file for reading
for each line in file :
if check "[/Option" in line:
flag=0
if check "[Option" in line:
flag=1
continue
if flag:
print line.strip()
# you can store the values of each option in this part

Regular expression listing all possibilities

Given a regular expression, how can I list all possible matches?
For example: AB[CD]1234, I want it to return a list like:
ABC1234
ABD1234
I searched the web, but couldn't find anything.
Exrex can do this:
$ python exrex.py 'AB[CD]1234'
ABC1234
ABD1234
The reason you haven't found anything is probably because this is a problem of serious complexity given the amount of combinations certain expressions would allow. Some regular expressions could even allow infite matches:
Consider following expressions:
AB[A-Z0-9]{1,10}1234
AB.*1234
I think your best bet would be to create an algorithm yourself based on a small subset of allowed patterns. In your specific case, I would suggest to use a more naive approach than a regular expression.
For some simple regular expressions like the one you provided (AB[CD]1234), there is a limited set of matches. But for other expressions (AB[CD]*1234) the number of possible matches are not limited.
One method for locating all the posibilities, is to detect where in the regular expression there are choices. For each possible choice generate a new regular expression based on the original regular expression and the current choice. This new regular expression is now a bit simpler than the original one.
For an expression like "A[BC][DE]F", the method will proceed as follows
getAllMatches("A[BC][DE]F")
= getAllMatches("AB[DE]F") + getAllMatches("AC[DE]F")
= getAllMatches("ABDF") + getAllMatches("ABEF")
+ getAllMatches("ACDF")+ getAllMatches("ACEF")
= "ABDF" + "ABEF" + "ACDF" + "ACEF"
It's possible to write an algorithm to do this but it will only work for regular expressions that have a finite set of possible matches. Your regexes would be limited to using:
Optional: ?
Characters: . \d \D
Sets: like [1a-c]
Negated sets: [^2-9d-z]
Alternations: |
Positive lookarounds
So your regexes could NOT use:
Repeaters: * +
Word patterns: \w \W
Negative lookarounds
Some zero-width assertions: ^ $
And there are some others (word boundaries, lazy & greedy quantifiers) I'm not sure about yet.
As for the algorithm itself, another user posted a link to this answer which describes how to create it.
Well you could convert the regular expression into an equivalent finite state machine (is relatively simple and can be done algorithmly) and then recursively folow every possible path through that fsm, outputting the followed paths through the machine. It's neither very hard nor computer intensive per output (you will normally get a HUGE amount of output however). You should however take care to disallow potentielly infinite passes (like .*). This can be done by having a maximum allowed path length, after which the tracing is aborted
A regular expression is intended to do nothing more than match to a pattern, that being said, the regular expression will never 'list' anything, only match. If you want to get a list of all matches I believe you will need to do it on your own.
Impossible.
Really.
Consider look ahead assertions. And what about .*, how will you generate all possible strings that match that regex?
It may be possible to find some code to list all possible matches for something as simple as you are doing. But most regular expressions you would not even want to attempt listing all possible matches.
For example AB.*1234 would be AB followed by absolutely anything and then 1234.
I'm not entirely sure this is even possible, but if it were, it would be so cpu/time intensive for many situations that it would not be useful.
For instance, try to make a list of all matches for A.*Z
There are sites that help with building a good regular expression though:
http://www.fileformat.info/tool/regex.htm
http://www.regular-expressions.info/javascriptexample.html
http://www.regextester.com/