Optimizing the regex "unrolling the loop" pattern [duplicate] - regex

Requirement : Two expressions, exp1 and exp2, we need to match one or more of both. So I came up with,
(exp1 | exp2)*
However in some places, I see the below being used,
(exp1 * (exp2 exp1*)*)
What is the difference between the two? When would you use one over the other?
Hopefully a fiddle will make this more clear,
var regex1 = /^"([\x00-!#-[\]-\x7f]|\\")*"$/;
var regex2 = /^"([\x00-!#-[\]-\x7f]*(\\"[\x00-!#-[\]-\x7f]*)*)"$/;
var str = '"foo \\"bar\\" baz"';
var r1 = regex1.exec(str);
var r2 = regex2.exec(str);
EDIT: It looks like there is a difference in behavior between the two apporaches when we capture the groups. The second approach captures the entire string while the first approach captures only the last matching group. See updated fiddle.

The difference between the two patterns is potential efficiency.
The (exp1 | exp2)* pattern contains an alternation that automatically disables some internal regex matching optimization. Also, this regex tries to match the pattern at each location in the string.
The (exp1 * (exp2 exp1*)*) expression is written acc. to the unroll-the-loop principle:
This optimisation thechnique is used to optimize repeated alternation of the form (expr1|expr2|...)*. These expression are not uncommon, and the use of another repetition inside an alternation may also leads to super-linear match. Super-linear match arise from the underterministic expression (a*)*.
The unrolling the loop technique is based on the hypothesis that in most case, you kown in a repeteated alternation, which case should be the most usual and which one is exceptional. We will called the first one, the normal case and the second one, the special case. The general syntax of the unrolling the loop technique could then be written as:
normal* ( special normal* )*
So, the exp1 in your example is normal part that is most common, and exp2 is expected to be less frequent. In that case, the efficiency of the unrolled pattern can be really, much higher than that of the other regex since the normal* part will grab the whole chunks of input without any need to stop and check each location.
Let's see a simple "([^"\\]|\\.)*" regex test against "some text here": there are 35 steps involved:
Unrolling it as "[^"\\]*(\\.[^"\\]*)*" gives a boost to 6 steps as there is much less backtracking.
NOTE that the number of steps at regex101.com does not directly mean one regex is more efficient than another, however, the debug table shows where backtracking occurs, and backtracking is resource consuming.
Let's then test the pattern efficiency with JS benchmark.js:
var suite = new Benchmark.Suite();
Benchmark = window.Benchmark;
suite
.add('Regular RegExp test', function() {
'"some text here"'.match(/"([^"\\]|\\.)*"/);
})
.add('Unrolled RegExp test', function() {
'"some text here"'.match(/"[^"\\]*(\\.[^"\\]*)*"/);
})
.on('cycle', function(event) {
console.log(String(event.target));
})
.on('complete', function() {
console.log('Fastest is ' + this.filter('fastest').map('name'));
})
.run({ 'async': true });
<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.13.1/lodash.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/platform/1.3.1/platform.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/benchmark/2.1.0/benchmark.js"></script>
Results:
Regular RegExp test x 9,295,393 ops/sec ±0.69% (64 runs sampled)
Unrolled RegExp test x 12,176,227 ops/sec ±1.17% (64 runs sampled)
Fastest is Unrolled RegExp test
Also, since unroll the loop concept is not language specific, here is an online PHP test (regular pattern yielding ~0.45, and unrolled one yielding ~0.22 results).
Also see Unroll Loop, when to use.

What is the difference between the two?
The difference between them is how they exactly match a particular given input. If you think of these as two functions in terms of input and output they are the equivalent, but how the function works to produce the output (match) is different. Both of these regular expressions (exp1 | exp2)* and (exp1 * (exp2 exp1*)*) will match the exact same input. In other-words you can say they are semantically equivalent in terms of the given input and a match (output).
When would you use one over the other?
Edit
The second regular expression (exp1 * (exp2 exp1*)*) is more optimal due to the loop unrolling technique. See #Wiktor Stribiżew's answer.
Proof
One way to prove if two regular expressions are equivalent is to see if they have the same DFA. Using this converter, here are the following DFAs of the regular expressions.
(Note: a = exp1 and b = exp2)
(a*(ba*)*)
(a|b)*
Notice that the first DFA is the same as the second one? The only difference is that the first one isn't minimized. Here is a crud fix to show the minimization of the first DFA:

Related

value of binding operator expression in perl

I have some doubt about the outcome of a binding operator expression in perl. I mean expression like
string =~ /pattern/
I have done some simple test
$ss="a1b2c3";
say $ss=~/a/; # 1
say $ss=~/[a-z]/g; # abc
#aa=$ss=~/[a-z]/g;say #aa; # abc
$aa=#aa;say $aa; # 3
$aa=$ss=~/[a-z]/g;say $aa; # 1
note the comment part above is the running result.
So here comes the question, what on earth is returned by $ss=~/[a-z]/g, it seems that it returned an array according to code line 3,4,5. But what about the last line, why it gives 1 instead of 3 which is the length of array?
The return of the match operator depends on the context: in list context it returns all captured matches, in scalar context the true/false. The say imposes list context, but in the first example nothing is captured in the regex so you only get "success."
Next, the behavior of /g modifier also differs across contexts. In list context, with it the string keeps being scanned with the given pattern until all matches are found, and a list with them is returned. These are your second and third examples.
But in scalar context its behavior is a bit specific: with it the search will continue from the position of the last match, the next time round. One typical use is in the loop condition
while (/(\w+)/g) { ... }
This is a bit of a tokenizer: after the body of the loop runs the next word is found, etc.
Then the last example doesn't really make sense; you are getting the "normal" scalar-context matching success/fail, and /g doesn't do anything -- until you match on $ss the next time
perl -wE'
$s=shift||q(abc);
for (1..2) { $m = $s=~/(.)/g; say "$m: $1"; }
'
prints lines 1:a and then 1:b.
Outside of iterative structures (like while condition) the /g in scalar context is usually an error, pointless at best or a quiet bug.
See "Global matching" under "Using regular expressions" in perlretut for /g.
See regex operators in perlop in general, and about /g as well. A useful tool to explore /g workings is pos.

+ is supposed to be greedy, so why am I getting a lazy result?

Why does the following regex return 101 instead of 1001?
console.log(new RegExp(/1(0+)1/).exec('101001')[0]);
I thought that + was greedy, so the longer of the two matches should be returned.
IMO this is different from Using javascript regexp to find the first AND longest match because I don't care about the first, just the longest. Can someone correct my definition of greedy? For example, what is the difference between the above snippet and the classic "oops, too greedy" example of new RegExp(/<(.+)>/).exec('<b>a</b>')[0] giving b>a</b?
(Note: This seems to be language-agnostic (it also happens in Perl), but just for ease of running it in-browser I've used JavaScript here.)
Regex always reads from left to right! It will not look for something longer. In the case of multiple matches, you have to re-execute the regex to get them, and compare their lengths yourself.
Greedy means up to the rightmost occurrence, it never means the longest in the input string.
Regex itself is not the correct tool to extract the longest match. You might get all the substrings that match your pattern, and get the longest one using the language specific means.
Since the string is parsed from left to right, 101 will get matched in 101001 first, and the rest (001) will not match (as the 101 and 1001 matches are overlapping). You might use /(?=(10+1))./g and then check the length of each Group 1 value to get the longest one.
var regex = /(?=(10+1))./g;
var str = "101001";
var m, res=[];
while ((m = regex.exec(str)) !== null) {
res.push(m[1]);
}
console.log(res); // => ["101", "1001"]
if (res.length>0) {
console.log("The longest match:", res.sort(function (a, b) { return b.length - a.length; })[0]);
} // => 1001

A regex for maximal periodic substrings

This is a follow up to A regex to detect periodic strings .
A period p of a string w is any positive integer p such that w[i]=w[i+p]
whenever both sides of this equation are defined. Let per(w) denote
the size of the smallest period of w . We say that a string w is
periodic iff per(w) <= |w|/2.
So informally a periodic string is just a string that is made up from a another string repeated at least once. The only complication is that at the end of the string we don't require a full copy of the repeated string as long as it is repeated in its entirety at least once.
For, example consider the string x = abcab. per(abcab) = 3 as x[1] = x[1+3] = a, x[2]=x[2+3] = b and there is no smaller period. The string abcab is therefore not periodic. However, the string ababa is periodic as per(ababa) = 2.
As more examples, abcabca, ababababa and abcabcabc are also periodic.
#horcruz, amongst others, gave a very nice regex to recognize a periodic string. It is
\b(\w*)(\w+\1)\2+\b
I would like to find all maximal periodic substrings in a longer string. These are sometimes called runs in the literature.
Formally a substring w is a maximal periodic substring if it is periodic and neither w[i-1] = w[i-1+p] nor w[j+1] = w[j+1-p]. Informally, the "run" cannot be contained in a larger "run"
with the same period.
The four maximal periodic substrings (runs) of string T = atattatt are T[4,5] = tt, T[7,8] = tt, T[1,4] = atat, T[2,8] = tattatt.
The string T = aabaabaaaacaacac contains the following 7 maximal periodic substrings (runs):
T[1,2] = aa, T[4,5] = aa, T[7,10] = aaaa, T[12,13] = aa, T[13,16] = acac, T[1,8] = aabaabaa, T[9,15] = aacaaca.
The string T = atatbatatb contains the following three runs. They are:
T[1, 4] = atat, T[6, 9] = atat and T[1, 10] = atatbatatb.
Is there a regex (with backreferences) that will capture all maximal
periodic substrings?
I don't really mind which flavor of regex but if it makes a difference, anything that the Python module re supports. However I would even be happy with PCRE if that makes the problem solvable.
(This question is partly copied from https://codegolf.stackexchange.com/questions/84592/compute-the-maximum-number-of-runs-possible-for-as-large-a-string-as-possible . )
Let's extend the regex version to the very powerful https://pypi.python.org/pypi/regex . This supports variable length lookbehinds for example.
This should do it, using Python's re module:
(?<=(.))(?=((\w*)(\w*(?!\1)\w\3)\4+))
Fiddle: https://regex101.com/r/aA9uJ0/2
Notes:
You must precede the string being scanned by a dummy character; the # in the fiddle. If that is a problem, it should be possible to work around it in the regex.
Get captured group 2 from each match to get the collection of maximal periodic substrings.
Haven't tried it with longer strings; performance may be an issue.
Explanation:
(?<=(.)) - look-behind to the character preceding the maximal periodic substring; captured as group 1
(?=...) - look-ahead, to ensure overlapping patterns are matched; see How to find overlapping matches with a regexp?
(...) - captures the maximal periodic substring (group 2)
(\w*)(\w*...\w\3)\4+ - #horcruz's regex, as proposed by OP
(?!\1) - negative look-ahead to group 1 to ensure the periodic substring is maximal
As pointed out by #ClasG, the result of my regex may be incomplete. This happens when two runs start at the same offset. Examples:
aabaab has 3 runs: aabaab, aa and aa. The first two runs start at the same offset. My regex will fail to return the shortest one.
atatbatatb has 3 runs: atatbatatb, atat, atat. Same problem here; my regex will only return the first and third run.
This may well be impossible to solve within the regex. As far as I know, there is no regex engine that is capable of returning two different matches that start at the same offset.
I see two possible solutions:
Ignore the missing runs. If I am not mistaken, then they are always duplicates; an identical run will follow within the same encapsulating run.
Do some postprocessing on the result. For every run found (let's call this X), scan earlier runs trying to find one that starts with the same character sequence (let's call this Y). When found (and not already 'used'), add an entry with the same character sequence as X, but the offset of Y.
I think it is not possible. Regular expressions cannot do complex nondeterministic jobs, even with backreferences. You need an algorithm for this.
This kind of depends on your input criteria... There is no infinite string of characters.. using back references you will be able to create a suitable representation of the last amount of occurrences of the pattern you wish to match.
\
Personally I would define buckets of length of input and then fill them.
I would then use automata to find patterns in the buckets and then finally coalesce them into larger patterns.
It's not how fast the RegEx is going to be in this case it's how fast you are going to be able to recognize a pattern and eliminate the invalid criterion.

Java Regular Expression Two Question marks (??)

I know that /? means the / is optional. so "toys?" will match both toy and toys. My understanding is that if I make it lazy and use "toys??" I will match both toy and toys and always return toy. So, a quick test:
private final static Pattern TEST_PATTERN = Pattern.compile("toys??", Pattern.CASE_INSENSITIVE);
public static void main(String[] args) {
for(String arg : args) {
Matcher m = TEST_PATTERN.matcher(arg);
System.out.print("Arg: " + arg);
boolean b = false;
while (m.find()) {
System.out.print(" {");
for (int i=0; i<=m.groupCount(); ++i) {
System.out.print("[" + m.group(i) + "]");
}
System.out.print("}");
}
System.out.println();
}
}
Yep, it looks like it works as expected
java -cp .. regextest.RegExTest toy toys
Arg: toy {[toy]}
Arg: toys {[toy]}
Now, change the regular expression to "toys??2" and it still matches toys2 and toy2. In both cases, it returns the entire string without the s removed. Is there any functional difference between searching for "toys?2" and "toys??2".
The reason I am asking is because I found an example like the following:
private final static Pattern TEST_PATTERN = Pattern.compile("</??tag(\\s+?.*?)??>", Pattern.CASE_INSENSITIVE);
and although I see no apparent reason for using ?? rather than ?, I thought that perhaps the original author (who is not known to me) might know something that I don't, I expect the later.
?? is lazy while ? is greedy.
Given (pattern)??, it will first test for empty string, then if the rest of the pattern can't match, it will test for pattern.
In contrast, (pattern)? will test for pattern first, then it will test for empty string on backtrack.
Now, change the regular expression to "toys??2" and it still matches toys2 and toy2. In both cases, it returns the entire string without the s removed. Is there any functional difference between searching for "toys?2" and "toys??2".
The difference is in the order of searching:
"toys?2" searches for toys2, then toy2
"toys??2" searches for toy2, then toys2
But for the case of these 2 patterns, the result will be the same regardless of the input string, since the sequel 2 (after s? or s??) must be matched.
As for the pattern you found:
Pattern.compile("</??tag(\\s+?.*?)??>", Pattern.CASE_INSENSITIVE)
Both ?? can be changed to ? without affecting the result:
/ and t (in tag) are mutually exclusive. You either match one or the other.
> and \s are also mutually exclusive. The at least 1 in \s+? is important to this conclusion: the result might be different otherwise.
This is probably micro-optimization from the author. He probably thinks that the open tag must be there, while the closing tag might be forgotten, and that open/close tags without attributes/random spaces appears more often than those with some.
By the way, the engine might run into some expensive backtracking attempt due to \\s+?.*? when the input has <tag followed by lots of spaces without > anywhere near.

Finding if a string matches a pattern

At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:
Hi there, John.
What a lovely day today!
Lovely sunset today, John, isn't it?
Will you be meeting Linda today, John?
Most (not all) of these strings are from pre-defined patterns as follows:
"Hi there, %s."
"What a lovely day today!"
"Lovely sunset today, %s, isn't it?"
"Will you be meeting %s today, %s?"
This library of patterns is ever-expanding (currently at around 1,500), but is manually maintained. The input strings though (the first group) is largely unpredictable. Though most of them will match one of the patterns, some of them will not.
So, here's my question: Given a string (from the first group) as input, I need to know which of the patterns (known second group) it matched. If nothing matched, it needs to tell me that.
I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.
Note: The strings I've given here are for illustration purposes. In reality, the strings aren't human generated, but are computer-generated human-friendly strings as shown above from systems I don't control. Since they aren't manually typed in, we don't need to worry about things like typos and other human errors. Just need to find which pattern it matches.
Note 2: I could modify the patterns library to be some other format, if that makes it easier to construct the regexes. The current structure, with the printf style %s, isn't set in stone.
I am looking at this as a parsing problem. The idea is that the parser function takes a string and determines if it is valid or not.
The string is valid if you can find it among the given patterns. That means you need an index of all the patterns. The index must be a full text index. Also it must match according to the word position. eg. it should short circuit if the first word of the input is not found among the first word of the patterns. It should take care of the any match ie %s in the pattern.
One solution is to put the patterns in an in memory database (eg. redis) and do a full text index on it. (this will not match according to word position) but you should be able to narrow down to the correct pattern by splitting the input into words and searching. The searches will be very fast because you have a small in memory database. Also note that you are looking for the closest match. One or more words will not match. The highest number of matches is the pattern you want.
An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave as a JavaScript object.
{
"Hi": { "there": {"%s": null}},
"What: {"a": {"lovely": {"day": {"today": null}}}},
"Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
"Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}
This index is recursive descending according to the word postion. So search for the first word, if found search for the next within the object returned by the first and so on. Same words at a given level will have only one key. You should also match the any case. This should be blinding fast in memory.
My first thought would be to have the regexp engine take all the trouble of handling this. They're usually optimised to handle large amounts of text so it shouldn't be that much of a performance hassle. It's brute force but the performance seems to be okay. And you could split the input into pieces and have multiple processes handle them. Here's my moderately tested solution (in Python).
import random
import string
import re
def create_random_sentence():
nwords = random.randint(4, 10)
sentence = []
for i in range(nwords):
sentence.append("".join(random.choice(string.lowercase) for x in range(random.randint(3,10))))
ret = " ".join(sentence)
print ret
return ret
patterns = [ r"Hi there, [a-zA-Z]+.",
r"What a lovely day today!",
r"Lovely sunset today, [a-zA-Z]+, isn't it?",
r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?"]
for i in range(95):
patterns.append(create_random_sentence())
monster_pattern = "|".join("(%s)"%x for x in patterns)
print monster_pattern
print "--------------"
monster_regexp = re.compile(monster_pattern)
inputs = ["Hi there, John.",
"What a lovely day today!",
"Lovely sunset today, John, isn't it?",
"Will you be meeting Linda today, John?",
"Goobledigoock"]*2000
for i in inputs:
ret = monster_regexp.search(i)
if ret:
print ".",
else:
print "x",
I've created a hundred patterns. This is the maximum limit of the python regexp library. 4 of them are your actual examples and the rest are random sentences just to stress performance a little.
Then I combined them into a single regexp with 100 groups. (group1)|(group2)|(group3)|.... I'm guessing you'll have to sanitise the inputs for things that can have meanings in regular expressions (like ? etc.). That's the monster_regexp.
Testing one string against this tests it against 100 patterns in a single shot. There are methods that fetch out the exact group which was matched. I test 10000 strings 80% of which should match and 10% which will not. It short cirtcuits so if there's a success, it will be comparatively quick. Failures will have to run through the whole regexp so it will be slower. You can order things based on the frequency of input to get some more performance out of it.
I ran this on my machine and this is my timing.
python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total
which is not too bad.
However, to run a pattern against such a large regexp and fail will take longer so I changed the inputs to have lots of randomly generated strings that won't match and then tried. 10000 strings none of which match the monster_regexp and I got this.
python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
Similar to Noufal's solution, but returns the matched pattern or None.
import re
patterns = [
"Hi there, %s.",
"What a lovely day today!",
"Lovely sunset today, %s, isn't it",
"Will you be meeting %s today, %s?"
]
def make_re_pattern(pattern):
# characters like . ? etc. have special meaning in regular expressions.
# Escape the string to avoid interpretting them as differently.
# The re.escape function escapes even %, so replacing that with XXX to avoid that.
p = re.escape(pattern.replace("%s", "XXX"))
return p.replace("XXX", "\w+")
# Join all the pattens into a single regular expression.
# Each pattern is enclosed in () to remember the match.
# This will help us to find the matched pattern.
rx = re.compile("|".join("(" + make_re_pattern(p) + ")" for p in patterns))
def match(s):
"""Given an input strings, returns the matched pattern or None."""
m = rx.match(s)
if m:
# Find the index of the matching group.
index = (i for i, group in enumerate(m.groups()) if group is not None).next()
return patterns[index]
# Testing with couple of patterns
print match("Hi there, John.")
print match("Will you be meeting Linda today, John?")
Python solution. JS should be similar.
>>> re2.compile('^ABC(.*)E$').search('ABCDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABCDDDDDDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABX') == None
True
>>>
The trick is to use ^ and $ to bound your pattern and making it a "template". Use (.*) or (.+) or whatever it is that you want to "search" for.
The main bottleneck for you, imho, will be iterating through a list of these patterns. Regex searches are computationally expensive.
If you want a "does any pattern match" result, build a massive OR based regex and let your regex engine handle the 'OR'ing for you.
Also, if you have only prefix patterns, check out the TRIE data structure.
This could be a job for sscanf, there is an implementation in js: http://phpjs.org/functions/sscanf/; the function being copied is this: http://php.net/manual/en/function.sscanf.php.
You should be able to use it without changing the prepared strings much, but I have doubts about the performances.
the problem isn't clear to me. Do you want to take the patterns and build regexes out of it?
Most regex engines have a "quoted string" option. (\Q \E). So you could take the string and make it
^\QHi there,\E(?:.*)\Q.\E$
these will be regexes that match exactly the string you want outside your variables.
if you want to use a single regex to match just a single pattern, you can put them in grouped patterns to find out which one matched, but that will not give you EVERY match, just the first one.
if you use a proper parser (I've used PEG.js), it might be more maintainable though. So that's another option if you think you might get stuck in regex hell