Good time of day!
I am reading a book about Perl: "Programming Perl" by Larry Wall, Tom Christiansen, and Jon Orwant. In this book I found several examples that the authors did not explain (or I simply don't get them).
The first:
This prints "hi" only ONCE:
"adfsfloglig"=~ /.*(?{print "hi"})f/;
But this prints "hi" TWICE?? How can that be explained?
"adfsfloglig"=~ /.*(?{print "hi"})log/;
And continuing to experiment makes things even worse:
"adfsfloglig"=~ /.*(?{print "hi"})sflog/;
The line of code above again prints this terrifying "hi" only ONCE!
After about a week I understood only one thing completely - I NEED HELP :)
So I am asking you to help me, please.
The second (this is a bomb!)
$_ = "lothiernbfj";
m/ (?{$i = 0; print "setting i to 0\n"})
(.(?{ local $i = $i + 1; print "\ti is $i"; print "\tWas founded $&\n" }))*
(?{print "\nchecking rollback\n"})
er
(?{ $result = $i; print "\nsetting result\n"})
/x;
print "final $result\n";
Here the $result finally printed on the screen equals the number of characters matched by .*, but again I don't get it.
With debug printing turned on (as shown above), I see that $i is incremented every time a new character is included in $& (the matched part of the string).
In the end $i equals 11 (the number of characters in the string); then there are 7 rollbacks, where .* gives back its match one character at a time (7 times) until the whole pattern matches.
But, damn magic, the result is set to the value of $i! And we never decremented this value anywhere! So $result should equal 11! But it does not. And the authors were right, I know.
Please, can you explain this strange Perl code that I was so happy to meet?
Thank you for any answer!
From the documentation at http://perldoc.perl.org/perlre.html :
"WARNING: This extended regular expression feature is considered experimental, and may be changed without notice. Code executed that has side effects may not perform identically from version to version due to the effect of future optimisations in the regex engine. The implementation of this feature was radically overhauled for the 5.18.0 release, and its behaviour in earlier versions of perl was much buggier, especially in relation to parsing, lexical vars, scoping, recursion and reentrancy."
Even on a failed match, if the regex engine gets to the point where it has to run the code, it will run the code. If the code involves only assigning to (local?) variables and whatever operations are allowed, backtracking will cause it to undo the operations, so the failed matches will have no effect. But print operations can't be undone, with the result that you can get strings printed from a failed match. This is why the documentation warns against embedding code with "side effects".
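The difference between an assignment that backtracking can undo and a side effect that it cannot can be seen in a small sketch. The variable names ($count, $final, @side_effects) are made up for illustration, and the code assumes perl 5.18 or later, where (?{ }) blocks were overhauled. The `local` assignment is unwound as the engine backtracks, while the push into the array is not.

```perl
use strict;
use warnings;

# Package variables are used because `local` cannot be applied to `my` variables.
# $count        : localized inside the loop, so backtracking restores old values
# @side_effects : a plain side effect, never undone by the regex engine
# $final        : copy of $count taken once the whole pattern has matched
our ($count, $final, @side_effects);

"lothiernbfj" =~ m/
    (?{ $count = 0 })
    (?:
        .
        (?{ local $count = $count + 1; push @side_effects, $count; })
    )*
    er
    (?{ $final = $count })
/x;

print "final count: $final\n";
print "code block ran ", scalar(@side_effects), " times\n";
```

The final count is 5, the number of characters in front of "er", even though the code block ran more often than that while the greedy loop first swallowed the whole string.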
I did some experimenting and am making this answer a community wiki, hoping that people will add to it. I tried to crack the simplest regexps and didn't dare to deal with "the bomb".
1. "adfsfloglig"=~ /.*(?{print "hi"})f/;
Here is the debug info for the regexp:
Final program:
1: STAR (3)
2: REG_ANY (0)
3: EVAL (5)
5: EXACT <f> (7)
7: END (0)
And the trace of execution with my comments:
#matches the whole string with .*
0 <> <adfsflogli> | 1:STAR(3)
REG_ANY can match 11 times out of 2147483647...
#splits the string into <adfs> and <floglig> and prints "hi".
#Why does it split here? Not sure; probably the engine knows about the f after the "hi" code
4 <adfs> <floglig> | 3: EVAL(5)
#tries to find f in 'floglig' - success
4 <adfs> <floglig> | 5: EXACT <f>(7)
#end
5 <adfsf> <loglig> | 7: END(0)
2. "adfsfloglig" =~ /.*(?{print "hi"})log/;
1: STAR (3)
2: REG_ANY (0)
3: EVAL (5)
5: EXACT <log> (7)
7: END (0)
Trace:
#matches the whole string with .*
0 <> <adfsflogli> | 1:STAR(3)
REG_ANY can match 11 times out of 2147483647...
#splits the string into <adfsflog> and <lig> and prints "hi".
#Probably it found the 'l' character after the code block
#and, being greedy, tries to capture up to the last 'l'
8 <adfsflog> <lig> | 3: EVAL(5)
#compares the 'lig' with 'log' - failed
8 <adfsflog> <lig> | 5: EXACT <log>(7)
failed...
#moves backwards, taking the previous 'l'
#prints 2-nd 'hi'
5 <adfsf> <loglig> | 3: EVAL(5)
#compares 'loglig' with 'log' - success
5 <adfsf> <loglig> | 5: EXACT <log>(7)
#end
8 <adfsflog> <lig> | 7: END(0)
3. "adfsfloglig"=~ /.*(?{print "hi"})sflog/;
1: STAR (3)
2: REG_ANY (0)
3: EVAL (5)
5: EXACT <sflog> (8)
8: END (0)
Trace:
#matches the whole string with .*
0 <> <adfsflogli> | 1:STAR(3)
REG_ANY can match 11 times out of 2147483647...
#splits the string to <adf> and <sfloglig> and prints "hi".
3 <adf> <sfloglig> | 3: EVAL(5)
#compares 'sfloglig' with 'sflog' - success
3 <adf> <sfloglig> | 5: EXACT <sflog>(8)
#end
8 <adfsflog> <lig> | 8: END(0)
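To count how often the code block runs without relying on printed output, a counter can replace the print. This is only a sketch: $runs and %runs_for are names chosen for illustration, and the exact counts depend on the perl version and its optimizer, so none are hard-coded here.

```perl
use strict;
use warnings;

our $runs;                       # incremented inside the (?{ }) block
my $str = "adfsfloglig";
my %runs_for;                    # literal after the code block => number of runs

$runs = 0; $str =~ /.*(?{ $runs++ })f/     and $runs_for{f}     = $runs;
$runs = 0; $str =~ /.*(?{ $runs++ })log/   and $runs_for{log}   = $runs;
$runs = 0; $str =~ /.*(?{ $runs++ })sflog/ and $runs_for{sflog} = $runs;

printf "%-6s code block ran %d time(s)\n", $_, $runs_for{$_} for qw(f log sflog);
```

The question reports 1, 2 and 1 runs for these three patterns. The traces suggest why: after .* has grabbed everything, the engine appears to stop only at positions where the next character could start the literal that follows, so the code block runs once per candidate position rather than once per character.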
Related
Some DS code systems don't readily support categories. Is this expression the most efficient way to programmatically combine the category with code name?
perl -ne '$data = $_ ; $cat = $1 if $data =~ /CAT (.*)/ ; $cde = $1 if $data =~ /CODE \d (.*)/ ; print "$cat, $cde\n" if /CODE \d /' 'Mario Kart DS (USA).mch'
Example 1 - melonDS, Mario Kart DS (USA).mch
CAT Mission 1 Codes
CODE 0 3 Star Rank - Mission 1-1
223D00C4 0000000F
CODE 0 3 Star Rank - Mission 1-2
223D00C5 0000000F
CAT Mission 2 Codes
CODE 0 3 Star Rank - Mission 2-1
223D00CD 0000000F
CAT Mission 3 Codes
CODE 0 3 Star Rank - Mission 3-1
223D00D6 0000000F
Output:
Mission 1 Codes, 3 Star Rank - Mission 1-1
Mission 1 Codes, 3 Star Rank - Mission 1-2
Mission 2 Codes, 3 Star Rank - Mission 2-1
Mission 3 Codes, 3 Star Rank - Mission 3-1
Regex can't capture the CAT and prepend it to CODE. This was the best expression I could come up with:
perl -0777 -pe 's/CAT (.*)(?s).+?(?-s)(?:CODE \d (.*)(?s).+?(?-s))+(?=CAT|CODE|\z)/\1, \2\n/gi' 'Mario Kart DS (USA).mch'
In order to search and replace, I have to capture each group of CODE preceded by CAT. perl -0777 and (?s)(?-s) allow me to slurp the input file and anchor CODE matches to the initial CAT match while stepping across line endings. I can repeat the CODE match as capture group 2, but it will only ever hold the last one.
The expression above reads like so:
For a line starting with 'CAT ' capture to end of line, step across lines in the least greedy way until we reach CODE. For every group that starts with 'CODE [number] ' capture to the end of line, then step across lines until reaching either CAT, CODE, or the end of file. Repeat the code group as many times as possible.
With example above, this is the output:
Mission 1 Codes, 3 Star Rank - Mission 1-2
Mission 2 Codes, 3 Star Rank - Mission 2-1
Mission 3 Codes, 3 Star Rank - Mission 3-1
Debating what is most efficient or not is perhaps not too interesting in this case. If you have a solution that works, that should perhaps suffice.
Here is another solution, based on paragraph mode.
-00 sets the input record separator to the empty string ($/ = ''), which enables paragraph mode: records are separated by one or more blank lines.
-l chomps each record automatically
-E enables say (since there is an interaction between print and -l)
Then just store the header if /^CAT/, else clean up and print.
$ perl -00 -nlwE'if (s/^CAT //) { $k = $_ } else { s/^CODE \d+ //; s/\n.*//; say "$k, $_"; }' mission.txt
Mission 1 Codes, 3 Star Rank - Mission 1-1
Mission 1 Codes, 3 Star Rank - Mission 1-2
Mission 2 Codes, 3 Star Rank - Mission 2-1
Mission 3 Codes, 3 Star Rank - Mission 3-1
As a file:
use strict;
use warnings;
use feature 'say';
$/ = '';
my $key;
while (<DATA>) {
    chomp;
    if (s/^CAT //) {
        $key = $_;
    } else {
        s/CODE \d+ //;
        s/\n.*//;
        say "$key, $_";
    }
}
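A self-contained variant of the paragraph-mode approach is sketched below. It reads from an in-memory string instead of a file, and it assumes that the records are separated by blank lines (that separation is my assumption about the input; paragraph mode depends on it). The variable names are illustrative.

```perl
use strict;
use warnings;

# Sample input with blank lines between records (assumed layout).
my $input = join("\n\n",
    "CAT Mission 1 Codes",
    "CODE 0 3 Star Rank - Mission 1-1\n223D00C4 0000000F",
    "CODE 0 3 Star Rank - Mission 1-2\n223D00C5 0000000F",
    "CAT Mission 2 Codes",
    "CODE 0 3 Star Rank - Mission 2-1\n223D00CD 0000000F",
) . "\n";

open my $fh, '<', \$input or die "cannot open in-memory file: $!";

my ($key, @out);
{
    local $/ = '';                  # paragraph mode: one record per blank-line-separated block
    while (my $rec = <$fh>) {
        chomp $rec;                 # in paragraph mode this strips all trailing newlines
        if ($rec =~ s/^CAT //) {    # a category header: remember it
            $key = $rec;
            next;
        }
        $rec =~ s/^CODE \d+ //;     # drop the "CODE n " prefix
        $rec =~ s/\n.*//s;          # keep only the first line of the record
        push @out, "$key, $rec";
    }
}
close $fh;

print "$_\n" for @out;
```

Localizing $/ inside a block keeps paragraph mode from leaking into other reads in the same program.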
To elaborate on the initial question, it's important to note that I know some regex and no Perl, so I don't know what an efficient Perl expression looks like. From my experience, regular expressions are great at capturing 'one this or one that' but we need 'one this and many that'.
If I were talking about the title of a book chapter and each subsequent paragraph, the goal would be to merge the title as the first sentence of each paragraph for each chapter.
A regular expression could capture the title and indent of each paragraph but must limit itself to one chapter at a time. The title becomes capture group 1 while the paragraphs are capture group 2. We can't have 'one and many'; 'one or the other' would return all chapters and paragraphs (as capture group 1 or 2) but wouldn't allow them to be merged together.
Perl allows this simply by storing the title in a variable that is added as part of the substitution for each paragraph. Since the title occurs first, and only once per chapter, it can easily be merged in a 'one this, many that' situation.
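The 'one this, many that' idea can be sketched as a plain line-by-line loop that remembers the most recent category. The data comes from Example 1; the variable names are illustrative and the loop is a sketch of the technique, not the only way to write it.

```perl
use strict;
use warnings;

my @lines = (
    "CAT Mission 1 Codes",
    "CODE 0 3 Star Rank - Mission 1-1",
    "223D00C4 0000000F",
    "CODE 0 3 Star Rank - Mission 1-2",
    "223D00C5 0000000F",
    "CAT Mission 2 Codes",
    "CODE 0 3 Star Rank - Mission 2-1",
    "223D00CD 0000000F",
);

my ($cat, @merged);
for my $line (@lines) {
    if ($line =~ /^CAT (.*)/) {
        $cat = $1;                      # the "one": remembered until the next CAT line
    }
    elsif ($line =~ /^CODE \d+ (.*)/) {
        push @merged, "$cat, $1";       # the "many": each code name gets the current category
    }
}

print "$_\n" for @merged;
```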
The initial example was flawed in that it was extracting information when it should have removed the categories and merged them with the code names. With that goal, an expression like this would suffice:
perl -pe '$cat = $1 if s/(?:^CAT ([^\v]+).*\n)// ; s/(^CODE \d )/$1$cat, /'
For the non-capture group (?:...) that starts with 'CAT ' store every character that doesn't match the end of line [vertical whitespace] ([^\v]+) up to the end of line .*\n (which captures all modern line endings for Win, MacOS X+, and Linux since each ends in \n or linefeed) and remove the entire match including the final linefeed //. This expression captures the category while removing the line.
The next expression (separated by semicolon) captures the phrase 'CODE # ' (^CODE \d ), for each line that matches, then repeats the phrase /$1$cat, / while adding the result of the category variable. This is the result for Example 1:
CODE 0 Mission 1 Codes, 3 Star Rank - Mission 1-1
223D00C4 0000000F
CODE 0 Mission 1 Codes, 3 Star Rank - Mission 1-2
223D00C5 0000000F
CODE 0 Mission 2 Codes, 3 Star Rank - Mission 2-1
223D00CD 0000000F
CODE 0 Mission 3 Codes, 3 Star Rank - Mission 3-1
223D00D6 0000000F
Unfortunately, the melonDS code format insists there be at least one category for the file to be read properly so we'd have to add something generic back in on the first line e.g., CAT Cheats.
A better use case would be a RetroArch formatted cheat file since it doesn't directly support categories. The cheat files that ship with the program use a trick to simulate this in the form of a numbered cheat description that lacks a subsequent code.
Example 2: RetroArch, Mario Kart DS (USA).cht
cheats = 514
cheat0_desc = "Misc Codes"
cheat1_desc = "Freeze Time"
cheat1_code = "621755FC+00000000+B21755FC+00000000+10000000+00000000+D2000000+00000000"
cheat1_enable = false
cheat2_desc = "Start for Final Lap"
cheat2_code = "94000130+FFF70000+023CDD3F+00000001+D2000000+00000000"
cheat2_enable = false
With this expression:
perl -0777 -pe 's|(cheat(\d+)_)desc(?=.*\n(?!cheat\2_code))|\1cat|gi' 'Mario Kart DS (USA).cht' | perl -pe '$cat = $1 if s/(?:^cheat\d+_cat = \"(.*)\".*\n)// ; s/(^cheat\d+_desc = \")/$1$cat, /'
The result is:
cheats = 514
cheat1_desc = "Misc Codes, Freeze Time"
cheat1_code = "621755FC+00000000+B21755FC+00000000+10000000+00000000+D2000000+00000000"
cheat1_enable = false
cheat2_desc = "Misc Codes, Start for Final Lap"
cheat2_code = "94000130+FFF70000+023CDD3F+00000001+D2000000+00000000"
cheat2_enable = false
At a high level, the expression slurps the input file, and each numbered cheat description (e.g. cheat0_desc) that is not immediately followed by the matching code line (cheat0_code) is renamed from cheat0_desc to cheat0_cat. The result is then piped to the next expression (basically a repeat of the one shown above), which replaces 'cheat#_desc = "' with itself plus the category.
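The two piped one-liners can also be sketched as a single script with two passes. The cheat codes below are shortened for readability (that abbreviation is mine), and the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = <<'END';
cheats = 514
cheat0_desc = "Misc Codes"
cheat1_desc = "Freeze Time"
cheat1_code = "621755FC+00000000"
cheat1_enable = false
cheat2_desc = "Start for Final Lap"
cheat2_code = "94000130+FFF70000"
cheat2_enable = false
END

# Pass 1 (whole text): a description whose next line is not its own
# cheatN_code line is really a category, so rename desc to cat.
$text =~ s/(cheat(\d+)_)desc(?=.*\n(?!cheat\2_code))/${1}cat/g;

# Pass 2 (line by line): drop category lines, remember their text,
# and prepend it to every following description.
my ($cat, @out);
for my $line (split /^/, $text) {
    if ($line =~ /^cheat\d+_cat = "(.*)"/) {
        $cat = $1;
        next;
    }
    $line =~ s/^(cheat\d+_desc = ")/$1$cat, / if defined $cat;
    push @out, $line;
}

print @out;
```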
I feel the question was valuable but poorly asked due to lack of knowledge and the continuing learning process.
I have quite a straightforward question. Where I work I see a lot of regular expressions come by. They are used in Perl to replace and/or get rid of some strings in text, e.g.:
$string=~s/^.+\///;
$string=~s/\.shtml//;
$string=~s/^ph//;
I understand that you cannot concatenate the first and last replacement, because you may only want to replace ph at the beginning of the string after you did the first replacement. However, I would put the first and second regex together with alternation: $string=~s/(^.+\/|\.shtml)//; Because we're processing thousands of files (500,000+), I was wondering which method is the most efficient.
Your expressions are not equivalent
This:
$string=~s/^.+\///;
$string=~s/\.shtml//;
replaces the text .shtml and everything up to and including the last slash.
This:
$string=~s/(^.+\/|\.shtml)//;
replaces either the text .shtml or everything up to and including the last slash.
This is one problem with combining regexes: a single complex regex is harder to write, harder to understand, and harder to debug than several simple ones.
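The difference is easy to see on a sample value. The file name below is made up for illustration; the /g variant at the end is my addition and is not part of the question.

```perl
use strict;
use warnings;

my $name = "dir/sub/page.shtml";    # hypothetical input

# Two separate substitutions: both the directory part and the extension go away.
my $two_step = $name;
$two_step =~ s/^.+\///;
$two_step =~ s/\.shtml//;

# One substitution with alternation: only the first match is replaced.
my $combined = $name;
$combined =~ s/(^.+\/|\.shtml)//;

# With /g the alternation is retried after the first match.
my $combined_g = $name;
$combined_g =~ s/(?:^.+\/|\.shtml)//g;

print "two-step:   $two_step\n";
print "combined:   $combined\n";
print "combined/g: $combined_g\n";
```

For this input the two-step version yields "page", while the single alternation yields "page.shtml" because it stops after removing the directory part.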
It probably doesn't matter which is faster
Even if your expressions were equivalent, using one or the other probably wouldn't have a significant impact on your program's speed. In-memory operations like s/// are significantly faster than file I/O, and you've indicated that you're doing a lot of file I/O.
You should profile your application with something like Devel::NYTProf to see if these particular substitutions are actually a bottleneck (I doubt they are). Don't waste your time optimizing things that are already fast.
Alternations hinder the optimizer
Keep in mind that you're comparing apples and oranges, but if you're still curious about performance, you can see how perl evaluates a particular regex using the re pragma:
$ perl -Mre=debug -e'$_ = "foobar"; s/^.+\///; s/\.shtml//;'
...
Guessing start of match in sv for REx "^.+/" against "foobar"
Did not find floating substr "/"...
Match rejected by optimizer
Guessing start of match in sv for REx "\.shtml" against "foobar"
Did not find anchored substr ".shtml"...
Match rejected by optimizer
Freeing REx: "^.+/"
Freeing REx: "\.shtml"
The regex engine has an optimizer. The optimizer searches for substrings that must appear in the target string; if these substrings can't be found, the match fails immediately, without checking the other parts of the regex.
With /^.+\//, the optimizer knows that $string must contain at least one slash in order to match; when it finds no slashes, it rejects the match immediately without invoking the full regex engine. A similar optimization occurs with /\.shtml/.
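The substrings the optimizer relies on can also be inspected from code with regmust from the core re module, instead of reading the debug trace. A small sketch (the hash keys are names I chose for illustration):

```perl
use strict;
use warnings;
use re qw(regmust);

my %patterns = (
    slash     => qr/^.+\//,
    extension => qr/\.shtml/,
    combined  => qr/(?:^.+\/|\.shtml)/,
);

my %must;
for my $name (sort keys %patterns) {
    # regmust returns the longest anchored and floating fixed strings
    # that the optimizer believes must appear in any matching string.
    my ($anchored, $floating) = regmust($patterns{$name});
    $must{$name} = { anchored => $anchored, floating => $floating };

    printf "%-10s anchored=%s floating=%s\n", $name,
        defined $anchored ? "'$anchored'" : 'none',
        defined $floating ? "'$floating'" : 'none';
}
```

If a pattern reports no required substring at all, the cheap "reject without running the engine" check has nothing to look for.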
Here's what perl does with the combined regex:
$ perl -Mre=debug -e'$_ = "foobar"; s/(?:^.+\/|\.shtml)//;'
...
Matching REx "(?:^.+/|\.shtml)" against "foobar"
0 <> <foobar> | 1:BRANCH(7)
0 <> <foobar> | 2: BOL(3)
0 <> <foobar> | 3: PLUS(5)
REG_ANY can match 6 times out of 2147483647...
failed...
0 <> <foobar> | 7:BRANCH(11)
0 <> <foobar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
1 <f> <oobar> | 1:BRANCH(7)
1 <f> <oobar> | 2: BOL(3)
failed...
1 <f> <oobar> | 7:BRANCH(11)
1 <f> <oobar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
2 <fo> <obar> | 1:BRANCH(7)
2 <fo> <obar> | 2: BOL(3)
failed...
2 <fo> <obar> | 7:BRANCH(11)
2 <fo> <obar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
3 <foo> <bar> | 1:BRANCH(7)
3 <foo> <bar> | 2: BOL(3)
failed...
3 <foo> <bar> | 7:BRANCH(11)
3 <foo> <bar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
4 <foob> <ar> | 1:BRANCH(7)
4 <foob> <ar> | 2: BOL(3)
failed...
4 <foob> <ar> | 7:BRANCH(11)
4 <foob> <ar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
5 <fooba> <r> | 1:BRANCH(7)
5 <fooba> <r> | 2: BOL(3)
failed...
5 <fooba> <r> | 7:BRANCH(11)
5 <fooba> <r> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
Match failed
Freeing REx: "(?:^.+/|\.shtml)"
Notice how much longer the output is. Because of the alternation, the optimizer doesn't kick in and the full regex engine is executed. In the worst case (no matches), each part of the alternation is tested against each character in the string. This is not very efficient.
So, alternations are slower, right? No, because...
It depends on your data
Again, we're comparing apples and oranges, but with:
$string = 'a/really_long_string';
the combined regex may actually be faster because with s/\.shtml//, the optimizer has to scan most of the string before rejecting the match, while the combined regex matches quickly.
You can benchmark this for fun, but it's essentially meaningless since you're comparing different things.
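If you do want to measure it, the core Benchmark module is enough for a rough comparison. In this sketch the sample names are made up, strip_separate and strip_combined are hypothetical helper names, and the combined variant uses /g so that both variants remove the same text for this data. Real conclusions should come from your own data.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Made-up sample names; replace with a sample of real input.
my @names = map { "dir$_/sub/page$_.shtml" } 1 .. 1000;

sub strip_separate { my ($s) = @_; $s =~ s/^.+\///; $s =~ s/\.shtml//; return $s }
sub strip_combined { my ($s) = @_; $s =~ s/(?:^.+\/|\.shtml)//g;       return $s }

# Sanity check: both variants must agree before timing them.
for my $n (@names) {
    die "variants disagree on $n" unless strip_separate($n) eq strip_combined($n);
}

# 'none' suppresses the per-test report; cmpthese prints a comparison table.
my $results = timethese(200, {
    separate => sub { strip_separate($_) for @names },
    combined => sub { strip_combined($_) for @names },
}, 'none');

cmpthese($results);
```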
How regex alternation works in Perl is fairly well explained in perldoc perlretut:
Matching this or that
We can match different character strings with the alternation
metacharacter '|' . To match dog or cat , we form the regex dog|cat .
As before, Perl will try to match the regex at the earliest possible
point in the string. At each character position, Perl will first try
to match the first alternative, dog . If dog doesn't match, Perl will
then try the next alternative, cat . If cat doesn't match either, then
the match fails and Perl moves to the next position in the string.
Some examples:
"cats and dogs" =~ /cat|dog|bird/; # matches "cat"
"cats and dogs" =~ /dog|cat|bird/; # matches "cat"
Even though dog is the first alternative in the second regex, cat is able to match
earlier in the string.
"cats" =~ /c|ca|cat|cats/; # matches "c"
"cats" =~ /cats|cat|ca|c/; # matches "cats"
Here, all the alternatives match at the first string position, so the
first alternative is the one that matches. If some of the alternatives
are truncations of the others, put the longest ones first to give them
a chance to match.
"cab" =~ /a|b|c/ # matches "c"
# /a|b|c/ == /[abc]/
The last example points out
that character classes are like alternations of characters. At a given
character position, the first alternative that allows the regexp match
to succeed will be the one that matches.
So this should explain the price you pay when using alternations in regex.
When you apply simple regexes one after another, you don't pay such a price. It's well explained in another related question on SO. When searching directly for a constant string, or for a set of characters as in the question, optimizations can be applied and no backtracking is needed, which means potentially faster code.
When defining regex alternations, just choosing a good order (putting the most common matches first) can influence performance. Choosing between two options is also not the same as choosing between twenty. As always, premature optimization is the root of all evil, and you should instrument your code (Devel::NYTProf) if there are problems or you want improvements. But as a general rule, alternations should be kept to a minimum and avoided if possible, since:
They easily make the regex too big and complex. We like simple regexes that are easy to understand, debug, and maintain.
They are variable and input-dependent. They can be an unexpected source of problems since they backtrack and can lead to an unexpected lack of performance depending on your input. As I understand it, there's no case in which they will be faster.
Conceptually you are trying to match two different things, so we could argue that two different statements are more correct and clear than just one.
Hope this answer gets closer to what you were expecting.
First, measure the various options on your real data, because no amount of theory will beat an experiment (if one can be done). There are many timing modules on CPAN that will help you.
Second, if you decide to optimize the regexes, do not squish them into one giant monster by hand; try to assemble the "master" regex with code. Otherwise no one will be able to decipher it.
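Assembling a pattern in code could look like the following sketch. The keyword list and sample lines are hypothetical; the point is that the list stays readable and the regex is derived from it.

```perl
use strict;
use warnings;

# Hypothetical list of statement keywords to skip.
my @keywords = qw(select update drop insert alter);

# Escape each alternative and put longer ones first, so that no
# alternative is shadowed by a shorter one that is its prefix.
my $alternatives = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @keywords;

my $skip_re = qr/^(?:$alternatives)\b/;    # compiled once, reused below

my @lines = (
    "select * from t",
    "-- a comment",
    "update t set x = 1",
    "selection criteria",
);
my @kept = grep { !/$skip_re/ } @lines;

print "$_\n" for @kept;
```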
Combination is not your best option
If you have three regexes that work perfectly well, there's no benefit in combining them. Not only does rewriting them open the door for bugs, it makes it harder for both programmer and engine to read the regex.
This page suggests that instead of an alternation like:
while (<FILE>) {
next if (/^(?:select|update|drop|insert|alter)\b/);
...
}
You should use:
while (<FILE>) {
next if (/^select/);
next if (/^update/);
...
}
You can further optimize your regexes
You can use regex objects, which will ensure that your regex is not being recompiled in a loop:
my $query = qr/foo$bar/i;   # compiled once here; flags go on the qr// itself
my @matches = ();
...
while (<FILE>) {
    push @matches, $_ if /$query/;
}
You may also be able to optimize the .+. It will eat up the rest of the string and then has to backtrack character by character until it finds a / so it can match. If there is only one / per string, try a negated character class like [^/] (escaped: [^\/]). Where do you expect to find the / in your input? Knowing that will allow your regex to become faster.
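A sketch of the alternatives to the greedy .+ follows. It assumes paths without empty components (no leading slash, no "//"); with such input the variants can differ. The sample path is made up.

```perl
use strict;
use warnings;

my $path = "a/b/c/file.shtml";    # hypothetical input

# Greedy .+ : runs to the end of the string, then backtracks to the last slash.
(my $greedy = $path) =~ s{^.+/}{};

# Negated character class: consumes one "component/" at a time and never
# has to give characters back across a slash.
(my $negated = $path) =~ s{^(?:[^/]+/)+}{};

# No regex at all: cut after the last slash (rindex returns -1 if there is none,
# so the whole string is kept in that case).
my $by_rindex = substr($path, rindex($path, '/') + 1);

print "$greedy $negated $by_rindex\n";
```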
The speed of replacement depends on other factors
If you have performance issues (currently, with the 3 regexes), it could be a different part of your program. While the processing speed of computers has grown exponentially, read and write speed has experienced little growth.
There may be faster engines for search and replacing text in a file
Perl uses an NFA engine, which is slower yet more powerful than the DFA engine sed has. An NFA backtracks (especially with alternations) and has a worst-case exponential run time. A DFA has linear execution time. Your patterns do not need an NFA engine, so you can use your regexes in a DFA engine, like sed, very easily.
According to here sed can do search and replace at a speed of 82.1 million characters processed per second (note that this test was writing to /dev/null, so hard disk write speed wasn't really a factor).
A bit off topic maybe, but if the actual replacements are rare relative to the number of comparisons (10%-20%?), you may gain some speed by using an index check first:
$string =~ s/\.shtml//
    if index($string, ".shtml") >= 0;
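One detail matters when guarding with index: it returns -1 when the substring is absent and 0 when the substring sits at the very start, so its result has to be compared with 0 explicitly rather than used as a bare truth value. A short sketch with made-up names:

```perl
use strict;
use warnings;

my %stripped;
for my $name ('page.shtml', '.shtml_at_start', 'no_extension') {
    my $copy = $name;

    # index: position of the substring, or -1 if it does not occur.
    # Position 0 is a valid hit, hence the explicit ">= 0".
    $copy =~ s/\.shtml// if index($copy, '.shtml') >= 0;

    $stripped{$name} = $copy;
}

printf "%-16s -> %s\n", $_, $stripped{$_} for sort keys %stripped;
```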
The second method, in which you put the first and second regex together with alternation, is best.
With that method Perl traverses the string once and checks both expressions.
With the first method Perl has to traverse the string separately for each expression.
Hence the number of passes is reduced with the second method.
I've come across the following materials:
Mastering Perl by brian d foy, chapter: Debugging Regular Expressions.
Debugging regular expressions, which mentions the re debug pragma (use re 'debug') for Perl
I've also tried various other techniques:
The re=debugcolor module, which highlights its output.
The following construction: (?{print "$1 $2\n"}).
But I still don't get how to read their output. I've also found other modules for debugging regular expressions here, but I haven't tried them yet. Can you please explain how to read the output of use re 'debug', or of another command used for debugging regular expressions in Perl?
EDIT in reply to Borodin:
1st example:
perl -Mre=debug -e' "foobar"=~/(.)\1/'
Compiling REx "(.)\1"
Final program:
1: OPEN1 (3)
3: REG_ANY (4)
4: CLOSE1 (6)
6: REF1 (8)
8: END (0)
minlen 1
Matching REx "(.)\1" against "foobar"
0 <> <foobar> | 1:OPEN1(3)
0 <> <foobar> | 3:REG_ANY(4)
1 <f> <oobar> | 4:CLOSE1(6)
1 <f> <oobar> | 6:REF1(8)
failed...
1 <f> <oobar> | 1:OPEN1(3)
1 <f> <oobar> | 3:REG_ANY(4)
2 <fo> <obar> | 4:CLOSE1(6)
2 <fo> <obar> | 6:REF1(8)
3 <foo> <bar> | 8:END(0)
Match successful!
Freeing REx: "(.)\1"
What do OPEN1, REG_ANY, CLOSE1 ... mean?
What do numbers like 1 3 4 6 8 mean?
What does the number in parentheses, as in OPEN1(3), mean?
Which output should I look at, Compiling REx or Matching REx?
2nd example:
perl -Mre=debugcolor -e' "foobar"=~/(.*)\1/'
Compiling REx "(.*)\1"
Final program:
1: OPEN1 (3)
3: STAR (5)
4: REG_ANY (0)
5: CLOSE1 (7)
7: REF1 (9)
9: END (0)
minlen 0
Matching REx "(.*)\1" against "foobar"
0 <foobar>| 1:OPEN1(3)
0 <foobar>| 3:STAR(5)
REG_ANY can match 6 times out of 2147483647...
6 <foobar>| 5: CLOSE1(7)
6 <foobar>| 7: REF1(9)
failed...
5 <foobar>| 5: CLOSE1(7)
5 <foobar>| 7: REF1(9)
failed...
4 <foobar>| 5: CLOSE1(7)
4 <foobar>| 7: REF1(9)
failed...
3 <foobar>| 5: CLOSE1(7)
3 <foobar>| 7: REF1(9)
failed...
2 <foobar>| 5: CLOSE1(7)
2 <foobar>| 7: REF1(9)
failed...
1 <foobar>| 5: CLOSE1(7)
1 <foobar>| 7: REF1(9)
failed...
0 <foobar>| 5: CLOSE1(7)
0 <foobar>| 7: REF1(9)
0 <foobar>| 9: END(0)
Match successful!
Freeing REx: "(.*)\1"
Why are the numbers descending (6 5 4 3 ...) in this example?
What does the failed keyword mean?
Regular expressions define finite state machines [1]. The debugger is more or less showing you how the state machine is progressing as the string is consumed character by character.
"Compiling REx" is the listing of instructions for that regular expression. The number in parentheses after each instruction is where to go once the step succeeds. In /(.*)\1/:
1: OPEN1 (3)
3: STAR (5)
4: REG_ANY (0)
5: CLOSE1 (7)
STAR (5) means compute STAR and once you succeed, go to instruction 5 CLOSE1.
"Matching REx" is the step-by-step execution of those instructions. The number on the left is the total number of characters that have been consumed so far. This number can go down if the matcher has to go backwards because something it tried didn't work.
To understand these instructions, it's important to understand how regular expressions "work." Finite state machines are usually visualized as a kind of flow chart. I have produced a crude one below for /(.)\1/. Because of the back reference to a capture group, I don't believe this regex is a strict finite state machine. The chart is useful none the less.
                Match
+-------+      Anything      +----------+
| Start +--------------------+ State 1  |
+---^---+                    +--+---+---+
    |                           |   |
    |                           |   | Matched same
    +---------------------------+   | character
         matched different          |
         character             +----+------+
                               |  Success  |
                               +-----------+
We start on Start. It's easy to advance to the first state, we just consume any one character (REG_ANY). The only other thing that could happen is end of input. I haven't drawn that here. The REG_ANY instruction is wrapped in the capture group instructions. OPEN1 starts recording all matched characters into the first capture group. CLOSE1 stops recording characters to the first capture group.
Once we consume a character, we sit on State 1 and consume the next char. If it matches the previous char we move to success! REF1 is the instruction that attempts to match capture group #1. Otherwise, we failed and need to move back to the Start to try again. Whenever the matcher says "failed..." it's telling you that something didn't work, so it's returning to an earlier state (that may or may not include 'unconsuming' characters).
The example with * is more complicated. * (which corresponds to STAR) tries to match the given pattern zero or more times, and it is greedy. That means it tries to match as many characters as it possibly can. Starting at the beginning of the string, it says "I can match up to 6 characters!" So it matches all 6 characters ("foobar"), closes the capture group, and tries to match "foobar" again. That doesn't work! It tries again with 5; that doesn't work either. And so on, until it tries matching zero characters. That means the capture group is empty, and matching the empty string always succeeds. So the match succeeds with \1 = "".
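What the two traces end up with can be checked directly by looking at the match variables. A minimal sketch (the @report array is just a place to collect the results):

```perl
use strict;
use warnings;

my @report;
for my $re (qr/(.)\1/, qr/(.*)\1/) {
    if ("foobar" =~ $re) {
        # $&    : the text matched by the whole pattern
        # $-[0] : the offset at which the match starts
        # $1    : the text captured by the first group
        push @report, { matched => $&, offset => $-[0], group1 => $1 };
        printf "%-12s matched '%s' at offset %d with \$1 = '%s'\n",
            $re, $&, $-[0], $1;
    }
}
```

The first pattern matches "oo" at offset 1; the second succeeds at offset 0 with an empty match and an empty capture, which is the end state of the descending trace above.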
I realize I've spent more time explaining regular expressions than I have Perl's regex debugger. But I think its output will become much more clear once you understand how regexes operate.
Here is a finite state machine simulator. You can enter a regex and see it executed. Unfortunately, it doesn't support back references.
[1] I believe some of Perl's regular expression features push it beyond this definition, but it's still useful to think about them this way.
The debug information contains a description of the bytecode. The numbers denote the node indices in the op tree. The numbers in round brackets tell the engine which node to jump to upon a match. The EXACT operator tells the regex engine to look for a literal string. REG_ANY means the . symbol. PLUS means the +. Code 0 is for the 'end' node. OPEN1 is a '(' symbol. CLOSE1 means ')'. STAR is a '*'. When the matcher reaches the end node, it returns a success code back to Perl, indicating that the entire regex has matched.
See more details at http://perldoc.perl.org/perldebguts.html#Debugging-Regular-Expressions and a more conceptual http://perl.plover.com/Rx/paper/
I have a huge file aab.txt whose contents are aaa...aab.
To my great surprise
perl -ne '/a*bb/' < aab.txt
runs (match failure) faster than
perl -ne '/a*b/' < aab.txt
(match success). Why???? Both should first gobble up all the a's, then the second one immediately succeeds, while the first will then have to backtrack over and over again, to fail.
Perl regexes are optimized to fail as early as possible rather than to succeed as fast as possible. This makes a lot of sense when grepping through a large log file.
There is an optimization that first looks for a constant part of the string, in this case, a “floating” b or bb. This can be checked rather efficiently without having to keep track of backtracking state. No bb is found, and the match aborts right there.
Not so with b. That floating substring is found, and the match constructed from there. Here is the debug output of the regex match (program is "aaab" =~ /a*b/):
Compiling REx "a*b"
synthetic stclass "ANYOF_SYNTHETIC[ab][]".
Final program:
1: STAR (4)
2: EXACT <a> (0)
4: EXACT <b> (6)
6: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF_SYNTHETIC[ab][] minlen 1
Guessing start of match in sv for REx "a*b" against "aaab"
Found floating substr "b" at offset 3...
start_shift: 0 check_at: 3 s: 0 endpos: 4 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "a*b" against "aaab"
Matching stclass ANYOF_SYNTHETIC[ab][] against "aaab" (4 bytes)
0 <> <aaab> | 1:STAR(4)
EXACT <a> can match 3 times out of 2147483647...
3 <aaa> <b> | 4: EXACT <b>(6)
4 <aaab> <> | 6: END(0)
Match successful!
Freeing REx: "a*b"
You can get such output with the debug option for the re pragma.
Finding the b or bb is unnecessary, strictly speaking, but it allows the match to fail much earlier.
/a*bb/
is basically
/^(?s:.*?)a*bb/
Note the two *. Optimizations aside, it's quadratic. In the worst-case scenario (a string of all a's), for a string of length N, it will check whether the current character is an a N*(N-1)/2 times. We call this O(N^2).
It's worth doing a scan of the string (O(N)) to see if it can possibly match before starting the match. It will take a little longer to match, but it will fail to match much faster. This is what Perl does.
When you run the following
perl -Mre=debug -e"'aaaaab' =~ /a*bb/"
You get information about the compilation of the pattern:
Compiling REx "a*bb"
synthetic stclass "ANYOF{i}[ab][{non-utf8-latin1-all}]".
Final program:
1: STAR (4)
2: EXACT <a> (0)
4: EXACT <bb> (6)
6: END (0)
floating "bb" at 0..2147483647 (checking floating) stclass ANYOF{i}[ab][{non-utf8-latin1-all}] minlen 2
The last line indicates it will search for bb in the input before starting to match.
You get information about the evaluation of the pattern:
Guessing start of match in sv for REx "a*bb" against "aaaaab"
Did not find floating substr "bb"...
Match rejected by optimizer
Here you see that check in action.
A beginner's question. In the code:
$a = 'aaagggaaa';
(@b) = ($a =~ /(a.+)(g.+)/);
print "$b[0]\n";
Why is $b[0] equal to aaagg and not aaa? In other words, why does the second group, (g.+), match only from the last g?
Because the first .+ is "greedy", which means that it will try to match as many characters as possible.
If you want to turn off this "greedy" behaviour, you may replace .+ with .+?, so /(a.+?)(g.+)/ will return ('aaa', 'gggaaa').
Maybe you wanted to write /(a+)(g+)/ (only 'a's in the first group and 'g's in the second one).
The regular expression you wrote:
($a =~ /(a.+)(g.+)/);
matches an "a" and as many further characters as it can, while still leaving one "g" followed by more characters. So the first group (a.+) matches "aaagg", and the second part of your regular expression, (g.+), matches "gaaa".
The @b array receives the two matches "aaagg" and "gaaa". So, $b[0] just prints "aaagg".
The problem is that the first .+ is causing the g to be matched as far to the right as possible.
To show you what is really happening I modified your code to output more illustrative debug information.
$ perl -Mre=debug -e'q[aaagggaaa] =~ /a.+[g ]/'
Compiling REx "a.+[g ]"
Final program:
1: EXACT <a> (3)
3: PLUS (5)
4: REG_ANY (0)
5: ANYOF[ g][] (16)
16: END (0)
anchored "a" at 0 (checking anchored) minlen 3
Guessing start of match in sv for REx "a.+[g ]" against "aaagggaaa"
Found anchored substr "a" at offset 0...
Guessed: match at offset 0
Matching REx "a.+[g ]" against "aaagggaaa"
0 <> <aaagggaaa> | 1:EXACT <a>(3)
1 <a> <aagggaaa> | 3:PLUS(5)
REG_ANY can match 8 times out of 2147483647...
9 <aaagggaaa> <> | 5: ANYOF[ g][](16)
failed...
8 <aaagggaa> <a> | 5: ANYOF[ g][](16)
failed...
7 <aaaggga> <aa> | 5: ANYOF[ g][](16)
failed...
6 <aaaggg> <aaa> | 5: ANYOF[ g][](16)
failed...
5 <aaagg> <gaaa> | 5: ANYOF[ g][](16)
6 <aaaggg> <aaa> | 16: END(0)
Match successful!
Freeing REx: "a.+[g ]"
Notice that the first .+ is capturing everything it can to start out with.
Then it has to backtrack until the g can be matched.
What you probably want is one of:
/( a+ )( g+ )/x;
/( a.+? )( g.+ )/x;
/( a+ )( g.+ )/x;
/( a[^g]+ )( g.+ )/x;
/( a[^g]+ )( g+ )/x;
# etc.
Without more information from you, it is impossible to know which regex you want.
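To see how the candidate patterns differ, they can be run side by side on more than one input. This is a sketch: the second test string is an extra input chosen for illustration, and %captures is just a place to collect the results.

```perl
use strict;
use warnings;

my @patterns = (
    qr/( a+ )( g+ )/x,
    qr/( a.+? )( g.+ )/x,
    qr/( a+ )( g.+ )/x,
    qr/( a[^g]+ )( g.+ )/x,
    qr/( a[^g]+ )( g+ )/x,
);

my %captures;    # string => list of [$1, $2] pairs, one per pattern
for my $str ('aaagggaaa', 'abagggbb') {
    print "input: $str\n";
    for my $re (@patterns) {
        my @groups = $str =~ $re;           # list context returns the captures
        push @{ $captures{$str} }, [@groups];
        printf "  %-28s -> (%s)\n", $re, join(', ', map { "'$_'" } @groups);
    }
}
```

All five patterns capture 'aaa' first for the original string; they only start to disagree once the input contains other letters between the a's and the g's.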
Really regular expressions are a language in their own right, that is more complicated than the rest of Perl.
Perl regular expressions normally match the longest string possible.
In your code it matches with the last g and returns the output aaagg. If you want to get the output as aaa, then you need to use the non-greedy behavior. Use this code:
$a = 'aaagggaaa';
(@b) = ($a =~ /(a.+?)(g.+)/);
print "$b[0]\n";
It will output:
aaa
Clearly, the use of the question mark makes the match ungreedy.
Usually a regex is greedy. You can turn that off using the ? character:
$a = 'aaagggaaa';
my @b = ($a =~ /(a.+)(g.+)/);
my @c = ($a =~ /(a.+?)(g.+)/);
print "@b\n";
print "@c\n";
Output:
aaagg gaaa
aaa gggaaa
But I'm not sure this is what you want! What about abagggbb? Do you need aba?