Finding a string *and* its substrings in a haystack - regex

Suppose you have a string (e.g. needle). Its 19 continuous substrings are:
needle
needl eedle
need eedl edle
nee eed edl dle
ne ee ed dl le
n e d l
If I were to build a regex to match, in a haystack, any of the substrings I could simply do:
/(needle|needl|eedle|need|eedl|edle|nee|eed|edl|dle|ne|ee|ed|dl|le|n|e|d|l)/
but it doesn't look really elegant. Is there a better way to create a regex that will greedly match any one of the substrings of a given string?
Additionally, what if I posed another constraint, wanted to match only substrings longer than a threshold, e.g. for substrings of at least 3 characters:
/(needle|needl|eedle|need|eedl|edle|nee|eed|edl|dle)/
note: I deliberately did not mention any particular regex dialect. Please state which one you're using in your answer.

As Qtax suggested, the expression
n(e(e(d(l(e)?)?)?)?)?|e(e(d(l(e)?)?)?)?|e(d(l(e)?)?)?|d(l(e)?)?|l(e)?|e
would be the way to go if you wanted to write an explicit regular expression (egrep syntax, optionally replace (...) by (?:...)). The reason why this is better than the initial solution is that the condensed version requires only O(n^2) space compared to O(n^3) space in the original version, where n is the length of the input. Try this with extraordinarily as input to see the difference. I guess the condensed version is also faster with many regexp engines out there.
The expression
nee(d(l(e)?)?)?|eed(l(e)?)?|edl(e)?|dle
will look for substrings of length 3 or longer.
As pointed out by vhallac, the generated regular expressions are a bit redundant and can be optimized. Apart from the proposed Emacs tool, there is a Perl package Regexp::Optimizer that I hoped would help here, but a quick check failed for the first regular expression.
Note that many regexp engines perform non-overlapping search by default. Check this with the requirements of your problem.

I have found elegant almostsolution, depending how badly you need only one regexp. For example here is the regexp, which finds common substring (perl) of length 7:
"$needle\0$heystack" =~ /(.{7}).*?\0.*\1/s
Matching string is in \1. Strings should not contain null character which is used as separator.
You should make a cycle which starters with length of the needle and goes downto treshold and tries to match the regexp.

Is there a better way to create a regex that will match any one of the
substrings of a given string?
No. But you can generate such expression easily.

Perhaps you're just looking for
.*(.{1,6}).*

Related

Conditional regular expression with one section dependent on the result of another section of the regex

Is it possible to design a regular expression in a way that a part of it is dependent on another section of the same regular expression?
Consider the following example:
(ABCHEHG)[HGE]{5,1230}(EEJOPK)[DM]{5}
I want to continue this regex, and at some point I will have a section where the result of that section should depend on the result of [DM]{5}.
For example, D will be complemented by C, and M will be complemented by N.
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}[ACF]{1,1000}(BBBA)[CU]{2,5}[D'M']{5}
By D' I mean C, and by M' I mean N.
So a resulting string that matches the above regex, if it has DDDMM matching to the section [DM]{5}, it should necessarily have CCCNN matching to [D'M']{5}. Therefore, the result of [D'M']{5} always depends on [DM]{5}, or in other words, what matches to [DM]{5} always dictates what will match to [D'M']{5}.
Is it possible to do such a thing with regex?
Please note that, in this example I have extremely over-simplified the problem. The regex pattern I currently have is really much more complex and longer and my actual pattern includes about 5-6 of such dependent sections.
I cannot think of a way you can do this in pure regex. I would run 2 regex expressions. The first regex to extract the [DM]{5} string, such as
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}
And take the last 5 characters. Now replace the characters, for example in C# it would be result = result.Substring(result.Length - 5, 5).Replace('D', 'C').Replace('M', 'N'), and then concatenate like
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}[ACF]{1,1000}(BBBA)[CU]{2,5} + result
This is pretty easy to do in Perl:
m{
ABCHEHG
[HGHE]{5,1230}
EEJOPK
( [DM]{5} )
[ACF]{1,1000}
BBBA
[CU]{2,5}
(??{ $1 =~ tr/DM/CN/r })
}x
I've added the x modifier and whitespace for better readability. I've also removed the capturing groups around the fixed strings (they're fixed strings; you already know what they're going to capture).
The crucial part is that we capture the string that was actually matched by [DM]{5} (in $1), which we then use at the end to dynamically generate a subpattern by replacing all D by C and M by N in $1.
This sounds like bioinformatics in python. Do 2-stage filtering, at regex level and at app level.
Wildcard the DM portions, so the regex is permissive in what it accepts. Bury the regex in a token generator that yields several matching sections. Have your app iterate through the generator's results, discarding any result rejected by your business logic, such as finding that one token is not the complement of another token.
Alternatively, you might push some of that work down into a complex generated regex, which likely will perform worse and will be harder to debug. Your DDDMM example might be summarized as D+M+, or [DM]+, not sure if sequence matters. The complement might be C+N+, or [CN]+. Apparently there's two cases here. So start assembling a regex: stuff1 [DM]+ stuff2 [CN]+ stuff3. Then tack on '|' for alternation, and tack on the other case: stuff1 [CN]+ stuff2 [DM]+ stuff3 (or factor out suffix and prefix so alternation starts after stuff1). I can't imagine you'll be happy with such an approach, as the combinatorics get ugly, and the regex engine is forced to do lots of scanning and backtracking. And recompiling additional regexes on the fly doesn't come for free. Instead you should use the regex engine for the simple things that it's good at, and delegate complex business logic decisions to your app.

Regular expressions: search (instead of match) with DFA

I have been wondering about theory behind search mode of regex matcher. So, say I have a regex that matches aab, what if instead of match at the beginning of the string I wanna be able to perform this match starting from any position of the string. What I mean - in match mode I can only verify that string aab is consistent with a regex, on the other hand with search this should work with say aaab producing corresponding span result.
So super specifically - is there any way to build DFA searcher, or this is fundamentally impossible since it would require additional memory which you cant have in DFSM. It is obvious though that you can in fact build searcher from matcher by reapplying matcher to input string in a for loop, but complexity of such approach is something like O(len_of_pattern * len_of_input).
A regex searcher is basically the same thing as a regex matcher with .* tacked onto the front of the expression.
I think you basically answered your own question. Search, like many other features in modern regex implementations, leverages the wonder of memory to do things unfeasible with a Finite Automata. DFAs traditionally cannot loop over strings or backtrack along input as this would require memory. Search requires the ability to find a match and then understand how that match fits in the string.

Regex issue with car submodels

I'm pulling car submodels from the DB and I'm building my regular expression on the fly.
Here is an example of a search string:
EX-L Sedan 4-Door
Here is my regular expression:
preg_match("/LX|EX|EX-L|LX-P|LX-S/Ui", $input_line, $output_array);
For some reason the output is EX and not EX-L as it supposed to be. Can someone explain why?
Your pattern is unanchored and thus the first alternative that matches a substring makes the regex engine stop processing the whole group. This is a common behavior with NFA regexes.
Also, there are no quantifiers in your pattern, thus the /U modifier is redundant.
So, you can use
/EX-L|LX-P|LX-S|LX|EX/i
It is a readable form. However, best practice with regexes is to make sure no alternative branch can match at the same location as another. That means you can use
/EX(-L)?|LX(-[PS])?/i
As others have pointed out, the reason for this undesired outcome is because the regex engine is happy to have the first alternative and run for the door since your pattern has no anchors (like: ^, $, and some other lesser known ones). This is the same short-circuiting behavior you'd see in php's if($x || $y) conditions; if $x is true there is no need to evaluate further. But enough about that...
I would like to offer some additional logic that I think is relevant to your case/question.
You say your regex is built on the fly, so I am assuming your method goes something like this:
A user identifies which substrings/keywords they want to search for.
$strings=array('LX','EX','EX-L','LX-P','LX-S');
// array of substrings in any order
As mentioned earlier, you need longer strings to precede shorter ones with identical starting characters.
rsort($strings);
// sort DESC, longer strings precede shorter strings when leading characters match
Pipe all strings into a single regex pattern with implode().
$piped_regex='/\b(?:'.implode('|',$array).')\b/i';
// word boundaries ensure the string is not part of a larger word; remove if not desired
// pattern: /\b(?:LX-S|LX-P|LX|EX-L|EX)\b/i
While programmatically condensing your similar strings into a concise pattern as Wiktor recommended is possible, it's probably not worth the effort with your on-the-fly patterns.
Finally run preg_match() as normal.
$input_line='EX-L Sedan 4-Door';
if(preg_match($piped_regex,$input_line,$output_array)){
var_export($output_array);
}
// output: array(0=>'EX-L')
I hope stepping out this method is helpful to you and future SO readers.

The Greedy Option of Regex is really needed?

The Greedy Option of Regex is really needed?
Lets say I have following texts, I like to extract texts inside [Optionx] and [/Optionx] blocks
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with Regex Greedy Option, its give me
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Anybody need like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most application s we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
Regular expressions can potentially match multiple portion of a text.
For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abcabab(c)ccc
....
some regular expressions implementation are actually able to return the entire set of matches but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.
This is tipical of search and replace (a la grep) when with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[\".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[\". That way you don't have to care about greediness, just tell it what you don't want to see.
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
just use this algorithm which you can use in your fav language. No need regex.
flag=0
open file for reading
for each line in file :
if check "[/Option" in line:
flag=0
if check "[Option" in line:
flag=1
continue
if flag:
print line.strip()
# you can store the values of each option in this part

Regular expression listing all possibilities

Given a regular expression, how can I list all possible matches?
For example: AB[CD]1234, I want it to return a list like:
ABC1234
ABD1234
I searched the web, but couldn't find anything.
Exrex can do this:
$ python exrex.py 'AB[CD]1234'
ABC1234
ABD1234
The reason you haven't found anything is probably because this is a problem of serious complexity given the amount of combinations certain expressions would allow. Some regular expressions could even allow infite matches:
Consider following expressions:
AB[A-Z0-9]{1,10}1234
AB.*1234
I think your best bet would be to create an algorithm yourself based on a small subset of allowed patterns. In your specific case, I would suggest to use a more naive approach than a regular expression.
For some simple regular expressions like the one you provided (AB[CD]1234), there is a limited set of matches. But for other expressions (AB[CD]*1234) the number of possible matches are not limited.
One method for locating all the posibilities, is to detect where in the regular expression there are choices. For each possible choice generate a new regular expression based on the original regular expression and the current choice. This new regular expression is now a bit simpler than the original one.
For an expression like "A[BC][DE]F", the method will proceed as follows
getAllMatches("A[BC][DE]F")
= getAllMatches("AB[DE]F") + getAllMatches("AC[DE]F")
= getAllMatches("ABDF") + getAllMatches("ABEF")
+ getAllMatches("ACDF")+ getAllMatches("ACEF")
= "ABDF" + "ABEF" + "ACDF" + "ACEF"
It's possible to write an algorithm to do this but it will only work for regular expressions that have a finite set of possible matches. Your regexes would be limited to using:
Optional: ?
Characters: . \d \D
Sets: like [1a-c]
Negated sets: [^2-9d-z]
Alternations: |
Positive lookarounds
So your regexes could NOT use:
Repeaters: * +
Word patterns: \w \W
Negative lookarounds
Some zero-width assertions: ^ $
And there are some others (word boundaries, lazy & greedy quantifiers) I'm not sure about yet.
As for the algorithm itself, another user posted a link to this answer which describes how to create it.
Well you could convert the regular expression into an equivalent finite state machine (is relatively simple and can be done algorithmly) and then recursively folow every possible path through that fsm, outputting the followed paths through the machine. It's neither very hard nor computer intensive per output (you will normally get a HUGE amount of output however). You should however take care to disallow potentielly infinite passes (like .*). This can be done by having a maximum allowed path length, after which the tracing is aborted
A regular expression is intended to do nothing more than match to a pattern, that being said, the regular expression will never 'list' anything, only match. If you want to get a list of all matches I believe you will need to do it on your own.
Impossible.
Really.
Consider look ahead assertions. And what about .*, how will you generate all possible strings that match that regex?
It may be possible to find some code to list all possible matches for something as simple as you are doing. But most regular expressions you would not even want to attempt listing all possible matches.
For example AB.*1234 would be AB followed by absolutely anything and then 1234.
I'm not entirely sure this is even possible, but if it were, it would be so cpu/time intensive for many situations that it would not be useful.
For instance, try to make a list of all matches for A.*Z
There are sites that help with building a good regular expression though:
http://www.fileformat.info/tool/regex.htm
http://www.regular-expressions.info/javascriptexample.html
http://www.regextester.com/