Regex explanation needed

My student guide contains this example.
I need to find every "tation" in "transportation" but not, for example, in "deportation".
I use a regex with a lookbehind assertion:
/(?<=transpor)tation/
There is also an equivalent:
/tation(?<=transportation)/
The guide says the first regex
is more efficient because it involves no backtracking: the engine doesn't
check "tation" a second time.
OK, that makes sense.
But then there is a phrase that is obscure to me:
Usually it would either match the text the first time or use a
lookahead to check that it match the second pattern instead of
backtracking through it a second time with a lookbehind.
I can't understand either hint (in particular the one about the lookahead assertion), that is, the parts before and after the "or".
I believe my puzzle will be clear to a native English speaker at least.
Thanks.

I wouldn't give much thought to it; the statements seem to be based on assumptions about the internal logic of regular expression engines which aren't universally valid.
To shed some light on this, I looked at the debugging info of Perl's regular expression compiler.
Code:
use re 'debug';
/(?<=transpor)tation/;
/tation(?<=transportation)/;
Output:
Compiling REx `(?<=transpor)tation'
size 11 Got 92 bytes for offset annotations.
first at 1
1: IFMATCH[-8](8)
3: EXACT <transpor>(6)
6: SUCCEED(0)
7: TAIL(8)
8: EXACT <tation>(11)
11: END(0)
anchored `tation' at 0 (checking anchored) minlen 6
Offsets: [11]
12[8] 0[0] 5[8] 0[0] 0[0] 12[0] 12[0] 14[6] 0[0] 0[0] 20[0]
Compiling REx `tation(?<=transportation)'
size 13 Got 108 bytes for offset annotations.
first at 1
1: EXACT <tation>(4)
4: IFMATCH[-14](13)
6: EXACT <transportation>(11)
11: SUCCEED(0)
12: TAIL(13)
13: END(0)
anchored `tation' at 0 (checking anchored) minlen 6
Offsets: [13]
1[6] 0[0] 0[0] 24[14] 0[0] 11[14] 0[0] 0[0] 0[0] 0[0] 24[0] 24[0] 26[0]
Freeing REx: `"(?<=transpor)tation"'
Freeing REx: `"tation(?<=transportation)"'
Here, the compiled expressions don't seem to be substantially different.
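To check that the two patterns really are equivalent in what they match, here is a minimal sketch. The helper name match_offsets and the sample text are my own, not from the guide; it only compares match positions and says nothing about speed.

```perl
use strict;
use warnings;

# Return the start offsets of every match of $re in $text.
sub match_offsets {
    my ($re, $text) = @_;
    my @offsets;
    while ($text =~ /$re/g) {
        push @offsets, $-[0];   # $-[0] is the start offset of the whole match
    }
    return @offsets;
}

# Sample text (an assumption for illustration): two hits and one decoy.
my $text   = 'transportation deportation transportation';
my @first  = match_offsets(qr/(?<=transpor)tation/, $text);
my @second = match_offsets(qr/tation(?<=transportation)/, $text);

print "first:  @first\n";    # offsets 8 and 35
print "second: @second\n";   # the same offsets; "deportation" is skipped
```

Both patterns report the same offsets, so any difference between them would be in how much work the engine does, not in the result.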

Related

How to read perl regular expression debugger

I've come across the following materials:
Mastering Perl by brian d foy, chapter: Debugging Regular Expressions.
Debugging regular expressions, which mentions the re::debug module for Perl.
I've also tried various other techniques:
The re=debugcolor module, which highlights its output.
The construction ?{print "$1 $2\n"}.
But I still don't understand how to read their output. I've also found other modules for debugging regular expressions here, but I haven't tried them yet. Can you please explain how to read the output of use re 'debug', or of another command used for debugging regular expressions in Perl?
EDIT in reply to Borodin:
1st example:
perl -Mre=debug -e' "foobar"=~/(.)\1/'
Compiling REx "(.)\1"
Final program:
1: OPEN1 (3)
3: REG_ANY (4)
4: CLOSE1 (6)
6: REF1 (8)
8: END (0)
minlen 1
Matching REx "(.)\1" against "foobar"
0 <> <foobar> | 1:OPEN1(3)
0 <> <foobar> | 3:REG_ANY(4)
1 <f> <oobar> | 4:CLOSE1(6)
1 <f> <oobar> | 6:REF1(8)
failed...
1 <f> <oobar> | 1:OPEN1(3)
1 <f> <oobar> | 3:REG_ANY(4)
2 <fo> <obar> | 4:CLOSE1(6)
2 <fo> <obar> | 6:REF1(8)
3 <foo> <bar> | 8:END(0)
Match successful!
Freeing REx: "(.)\1"
What do OPEN1, REG_ANY, CLOSE1 ... mean?
What do the numbers 1 3 4 6 8 mean?
What does the number in parentheses in OPEN1(3) mean?
Which output should I look at, Compiling REx or Matching REx?
2nd example:
perl -Mre=debugcolor -e' "foobar"=~/(.*)\1/'
Compiling REx "(.*)\1"
Final program:
1: OPEN1 (3)
3: STAR (5)
4: REG_ANY (0)
5: CLOSE1 (7)
7: REF1 (9)
9: END (0)
minlen 0
Matching REx "(.*)\1" against "foobar"
0 <foobar>| 1:OPEN1(3)
0 <foobar>| 3:STAR(5)
REG_ANY can match 6 times out of 2147483647...
6 <foobar>| 5: CLOSE1(7)
6 <foobar>| 7: REF1(9)
failed...
5 <foobar>| 5: CLOSE1(7)
5 <foobar>| 7: REF1(9)
failed...
4 <foobar>| 5: CLOSE1(7)
4 <foobar>| 7: REF1(9)
failed...
3 <foobar>| 5: CLOSE1(7)
3 <foobar>| 7: REF1(9)
failed...
2 <foobar>| 5: CLOSE1(7)
2 <foobar>| 7: REF1(9)
failed...
1 <foobar>| 5: CLOSE1(7)
1 <foobar>| 7: REF1(9)
failed...
0 <foobar>| 5: CLOSE1(7)
0 <foobar>| 7: REF1(9)
0 <foobar>| 9: END(0)
Match successful!
Freeing REx: "(.*)\1"
Why are the numbers descending (6 5 4 3 ...) in this example?
What does the failed keyword mean?

Regular expressions define finite state machines (see note 1 below). The debugger is more or less showing you how the state machine is progressing as the string is consumed character by character.
"Compiling REx" is the listing of instructions for that regular expression. The number in parentheses after each instruction is where to go once the step succeeds. In /(.*)\1/:
1: OPEN1 (3)
3: STAR (5)
4: REG_ANY (0)
5: CLOSE1 (7)
STAR (5) means compute STAR and once you succeed, go to instruction 5 CLOSE1.
"Matching REx" is the step-by-step execution of those instructions. The number on the left is the total number of characters that have been consumed so far. This number can go down if the matcher has to go backwards because something it tried didn't work.
To understand these instructions, it's important to understand how regular expressions "work." Finite state machines are usually visualized as a kind of flow chart. I have produced a crude one below for /(.)\1/. Because of the back reference to a capture group, I don't believe this regex is a strict finite state machine. The chart is useful nonetheless.
                Match
+-------+      Anything      +----------+
| Start +------------------->+ State 1  |
+---^---+                    +--+---+---+
    |                           |   |
    |                           |   | Matched same
    +---------------------------+   | character
         matched different          |
         character             +----+------+
                               |  Success  |
                               +-----------+
We start on Start. It's easy to advance to the first state, we just consume any one character (REG_ANY). The only other thing that could happen is end of input. I haven't drawn that here. The REG_ANY instruction is wrapped in the capture group instructions. OPEN1 starts recording all matched characters into the first capture group. CLOSE1 stops recording characters to the first capture group.
Once we consume a character, we sit on State 1 and consume the next char. If it matches the previous char we move to success! REF1 is the instruction that attempts to match capture group #1. Otherwise, we failed and need to move back to the Start to try again. Whenever the matcher says "failed..." it's telling you that something didn't work, so it's returning to an earlier state (that may or may not include 'unconsuming' characters).
The example with * is more complicated. * (which corresponds to STAR) tries to match the given pattern zero or more times, and it is greedy. That means it tries to match as many characters as it possibly can. Starting at the beginning of the string, it says "I can match up to 6 characters!" So, it matches all 6 characters ("foobar"), closes the capture group, and tries to match "foobar" again. That doesn't work! It tries again with 5, and that doesn't work either. And so on, until it tries matching zero characters. That means the capture group is empty, and matching the empty string always succeeds. So the match succeeds with \1 = "".
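The outcome of the two traces can be confirmed without reading the debug output at all. This is a small sketch; the helper name first_capture is made up for the example.

```perl
use strict;
use warnings;

# Return the text captured by group 1, or undef if the pattern does not match.
sub first_capture {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

my $single = first_capture(qr/(.)\1/,  'foobar');   # the doubled letter "o"
my $greedy = first_capture(qr/(.*)\1/, 'foobar');   # empty: backed off to zero characters

printf "(.)\\1  captured '%s'\n", $single;
printf "(.*)\\1 captured '%s'\n", $greedy;
```

The second pattern still matches, but only because the greedy group gave back every character it had taken, which is exactly the 6 5 4 3 2 1 0 countdown in the trace.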
I realize I've spent more time explaining regular expressions than I have Perl's regex debugger. But I think its output will become much more clear once you understand how regexes operate.
Here is a finite state machine simulator. You can enter a regex and see it executed. Unfortunately, it doesn't support back references.
1: I believe some of Perl's regular expression features push it beyond this definition but it's still useful to think about them this way.

The debug information contains a description of the bytecode. The numbers denote node indices in the op tree. The numbers in round brackets tell the engine which node to jump to upon a match. The EXACT operator tells the regex engine to look for a literal string. REG_ANY means the . symbol. PLUS means the +. Code 0 is for the 'end' node. OPEN1 is a '(' symbol. CLOSE1 means ')'. STAR is a '*'. When the matcher reaches the end node, it returns a success code back to Perl, indicating that the entire regex has matched.
See more details at http://perldoc.perl.org/perldebguts.html#Debugging-Regular-Expressions and a more conceptual http://perl.plover.com/Rx/paper/

Why does Perl backtracking match failure seem to take less time than match success?

I have a huge file aab.txt whose contents are aaa...aab.
To my great surprise
perl -ne '/a*bb/' < aab.txt
runs (match failure) faster than
perl -ne '/a*b/' < aab.txt
(match success). Why???? Both should first gobble up all the a's, then the second one immediately succeeds, while the first will then have to backtrack over and over again, to fail.

Perl regexes are optimized to fail as early as possible rather than to succeed as fast as possible. This makes a lot of sense when grepping through a large log file.
There is an optimization that first looks for a constant part of the string, in this case, a “floating” b or bb. This can be checked rather efficiently without having to keep track of backtracking state. No bb is found, and the match aborts right there.
Not so with b. That floating substring is found, and the match constructed from there. Here is the debug output of the regex match (program is "aaab" =~ /a*b/):
Compiling REx "a*b"
synthetic stclass "ANYOF_SYNTHETIC[ab][]".
Final program:
1: STAR (4)
2: EXACT <a> (0)
4: EXACT <b> (6)
6: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF_SYNTHETIC[ab][] minlen 1
Guessing start of match in sv for REx "a*b" against "aaab"
Found floating substr "b" at offset 3...
start_shift: 0 check_at: 3 s: 0 endpos: 4 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "a*b" against "aaab"
Matching stclass ANYOF_SYNTHETIC[ab][] against "aaab" (4 bytes)
0 <> <aaab> | 1:STAR(4)
EXACT <a> can match 3 times out of 2147483647...
3 <aaa> <b> | 4: EXACT <b>(6)
4 <aaab> <> | 6: END(0)
Match successful!
Freeing REx: "a*b"
You can get such output with the debug option for the re pragma.
Finding the b or bb is unnecessary, strictly speaking, but it allows the match to fail much earlier.

/a*bb/
is basically
/^(?s:.*?)a*bb/
Note the two *. Optimizations aside, it's quadratic. In the worst case (a string of all a), for a string of length N, it will check whether the current character is an a N*(N-1)/2 times. We call this O(N^2).
It's worth doing a scan of the string (O(N)) to see if it can possibly match before starting the match. It will take a little longer to match, but it will fail to match much faster. This is what Perl does.
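The idea of that pre-scan can be imitated by hand. The sketch below is only an illustration of the principle, not Perl's internal code; match_with_prescan is a hypothetical helper, and the literal 'bb' is hard-coded because it is the required substring of /a*bb/.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT what the engine literally does: reject early when
# the required literal 'bb' is absent, and only then run the full pattern.
sub match_with_prescan {
    my ($text) = @_;
    return 0 if index($text, 'bb') < 0;   # one linear scan, no backtracking state
    return $text =~ /a*bb/ ? 1 : 0;
}

my $no_match = 'a' x 1000 . 'b';    # shaped like aab.txt from the question
my $match    = 'a' x 1000 . 'bb';

print match_with_prescan($no_match), "\n";   # rejected by the scan alone
print match_with_prescan($match), "\n";      # scan passes, then the regex runs
```

On the failing input the regex is never started, which is the same effect as "Match rejected by optimizer" in the debug output.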
When you run the following
perl -Mre=debug -e"'aaaaab' =~ /a*bb/"
You get information about the compilation of the pattern:
Compiling REx "a*bb"
synthetic stclass "ANYOF{i}[ab][{non-utf8-latin1-all}]".
Final program:
1: STAR (4)
2: EXACT <a> (0)
4: EXACT <bb> (6)
6: END (0)
floating "bb" at 0..2147483647 (checking floating) stclass ANYOF{i}[ab][{non-utf8-latin1-all}] minlen 2
The last line indicates it will search for bb in the input before starting to match.
You get information about the evaluation of the pattern:
Guessing start of match in sv for REx "a*bb" against "aaaaab"
Did not find floating substr "bb"...
Match rejected by optimizer
Here you see that check in action.

Misunderstanding perl regexp evaluation

Good time of day!
I am reading a book about Perl: "Programming Perl" by Larry Wall, Tom Christiansen, Jon Orwant. In this book I found several examples that were not clarified by the authors (or I simply don't get them).
The first
This prints hi only ONCE.
"adfsfloglig"=~ /.*(?{print "hi"})f/;
But this prints "hi" TWICE?? how can it be explained?
"adfsfloglig"=~ /.*(?{print "hi"})log/;
And continuing to experiment makes things even worse:
"adfsfloglig"=~ /.*(?{print "hi"})sflog/;
The above line of code again prints this terrifying "hi" only ONCE!
After about a week I understood only one thing completely - I NEED HELP :)
SO I am asking you to help me, please.
The second (this is a bomb!)
$_ = "lothiernbfj";
m/ (?{$i = 0; print "setting i to 0\n"})
(.(?{ local $i = $i + 1; print "\ti is $i"; print "\tWas founded $&\n" }))*
(?{print "\nchecking rollback\n"})
er
(?{ $result = $i; print "\nsetting result\n"})
/x;
print "final $result\n";
Here the $result finally printed on the screen equals the number of characters matched by .*, but again I don't get it.
With the debug printing turned on (shown above), I see that $i is incremented every time a new character is added to $& (the matched part of the string).
In the end $i equals 11 (the number of characters in the string); then there are 7 rollbacks, where .* gives back one character at a time (7 times) until the whole pattern matches.
But, damn magic, the result is set to the value of $i! And we never decremented this value anywhere! So $result should equal 11! But it doesn't. And the authors are right, I know.
Please, can you explain this strange Perl code that I was happy to meet?
Thank you for any answer!

From the documentation at http://perldoc.perl.org/perlre.html :
"WARNING: This extended regular expression feature is considered experimental, and may be changed without notice. Code executed that has side effects may not perform identically from version to version due to the effect of future optimisations in the regex engine. The implementation of this feature was radically overhauled for the 5.18.0 release, and its behaviour in earlier versions of perl was much buggier, especially in relation to parsing, lexical vars, scoping, recursion and reentrancy."
Even on a failed match, if the regex engine gets to the point where it has to run the code, it will run the code. If the code involves only assigning to (local?) variables and whatever operations are allowed, backtracking will cause it to undo the operations, so the failed matches will have no effect. But print operations can't be undone, with the result that you can get strings printed from a failed match. This is why the documentation warns against embedding code with "side effects".
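To see the undo-on-backtrack effect without the print noise, here is a stripped-down sketch of "the bomb". It assumes Perl 5.18 or later and uses package variables, since local cannot be applied to my variables; the sub name chars_before_er is invented for the example.

```perl
use strict;
use warnings;

our ($i, $result);   # package variables, because local() needs them

# Count how many characters the starred group holds once "er" has matched.
# Each "local $i = $i + 1" is undone when the engine backtracks over it,
# so $i always reflects the characters currently held, not the peak.
sub chars_before_er {
    my ($text) = @_;
    undef $result;
    $text =~ m/
        (?{ $i = 0 })
        (?: . (?{ local $i = $i + 1 }) )*
        er
        (?{ $result = $i })
    /x;
    return $result;
}

print chars_before_er('lothiernbfj'), "\n";   # 5, the length of "lothi"
```

The counter climbs to 11 while the group swallows the whole string, but every step the group gives back also restores the previous value of $i, so by the time er matches the value is 5 and that is what gets copied into $result.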

I did some experimenting and am making this answer a community wiki, hoping that people will add to it. I tried to crack the simplest regexps and didn't dare to deal with "the bomb".
1. "adfsfloglig"=~ /.*(?{print "hi"})f/;
Here is the debug info for the regexp:
Final program:
1: STAR (3)
2: REG_ANY (0)
3: EVAL (5)
5: EXACT <f> (7)
7: END (0)
And the trace of execution with my comments:
#matches the whole string with .*
0 <> <adfsflogli> | 1:STAR(3)
REG_ANY can match 11 times out of 2147483647...
#splits the string to <adfs> and <floglig> and prints "hi".
#Why does it split? Not sure, probably, knows about the f after "hi" code
4 <adfs> <floglig> | 3: EVAL(5)
#tries to find f in 'floglig' - success
4 <adfs> <floglig> | 5: EXACT <f>(7)
#end
5 <adfsf> <loglig> | 7: END(0)
2. "adfsfloglig" =~ /.*(?{print "hi"})log/;
1: STAR (3)
2: REG_ANY (0)
3: EVAL (5)
5: EXACT <log> (7)
7: END (0)
Trace:
#matches the whole string with .*
0 <> <adfsflogli> | 1:STAR(3)
REG_ANY can match 11 times out of 2147483647...
#splits the string to <adfsflog> and <lig> and prints "hi".
#Probably, it found 'l' symbol after the code block
#and, being greedy, tries to capture up to the last 'l'
8 <adfsflog> <lig> | 3: EVAL(5)
#compares the 'lig' with 'log' - failed
8 <adfsflog> <lig> | 5: EXACT <log>(7)
failed...
#moves backwards, taking the previous 'l'
#prints 2-nd 'hi'
5 <adfsf> <loglig> | 3: EVAL(5)
#compares 'loglig' with 'log' - success
5 <adfsf> <loglig> | 5: EXACT <log>(7)
#end
8 <adfsflog> <lig> | 7: END(0)
3. "adfsfloglig"=~ /.*(?{print "hi"})sflog/;
1: STAR (3)
2: REG_ANY (0)
3: EVAL (5)
5: EXACT <sflog> (8)
8: END (0)
Trace:
#matches the whole string with .*
0 <> <adfsflogli> | 1:STAR(3)
REG_ANY can match 11 times out of 2147483647...
#splits the string to <adf> and <sfloglig> and prints "hi".
3 <adf> <sfloglig> | 3: EVAL(5)
#compares 'sfloglig' with 'sflog' - success
3 <adf> <sfloglig> | 5: EXACT <sflog>(8)
#end
8 <adfsflog> <lig> | 8: END(0)

How to match lines not ending (-1)\r\n

I'm trying to match the first 5 lines, and the last line, in this sample:
-- 2012-09-20 rep +6 = 184
1 12532070 (2)
2 12531806 (5)
2 12531806 (5)
-- 2012-09-21 rep +12 = 196
3 125xxxxx (-1)
3 125xxxxx (-1)
16 12557052 (2)
Leaving the following unmatched:
3 125xxxxx (-1)
3 125xxxxx (-1)
I've tried the following regular expressions:
^.*[^(-1)\r\n].*
^.*[^(-1)].*\r\n
^.*[^\(-1\)\r\n].*
^.*[^\(\-1\)\r\n].*
^.*[?!\(-1)\r\n].*
^(!?.*-1.*\r\n)
But none of them do what I want (mostly matching all lines).
My RegEx skills are not brilliant - can anybody point me in the right direction?

You can use a negative lookahead:
^(?!.*\(-1\)$).*$\r\n

Rather than trying to create a regular expression for this, I would just use the surrounding language to negate the sense of the match, and use a regex that only matches lines that end in '(-1)\r\n'. For instance:
Shell: grep -v '(-1)^M$'
Perl: !/\(-1\)\r\n/
Ed/Vi: v/(-1)^M$
etc.
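Both approaches can be tried side by side in Perl. This is a sketch under a few assumptions: the line terminators are stripped with split before matching (so the patterns need no \r\n), the sub names are invented, and the sample data is copied from the question.

```perl
use strict;
use warnings;

# Keep the lines that do NOT end in "(-1)"; accepts \r\n or \n line endings.
sub lines_without_minus_one {
    my ($text) = @_;
    my @lines = split /\r?\n/, $text;      # strip the line terminators first
    return grep { !/\(-1\)\z/ } @lines;    # negate a simple "ends in (-1)" match
}

# The same filter written with a negative lookahead instead of a negation.
sub lines_without_minus_one_lookahead {
    my ($text) = @_;
    my @lines = split /\r?\n/, $text;
    return grep { /^(?!.*\(-1\)\z)/ } @lines;
}

my $sample = join "\r\n",
    '-- 2012-09-20 rep +6 = 184',
    '1 12532070 (2)',
    '2 12531806 (5)',
    '2 12531806 (5)',
    '-- 2012-09-21 rep +12 = 196',
    '3 125xxxxx (-1)',
    '3 125xxxxx (-1)',
    '16 12557052 (2)';

print "$_\n" for lines_without_minus_one($sample);
```

The negated match is easier to read; the lookahead form is mainly useful when the tool only accepts a single pattern and offers no way to invert it.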

overlapping pattern matching in Perl

A beginner's question. In the code:
$a = 'aaagggaaa';
(@b) = ($a =~ /(a.+)(g.+)/);
print "$b[0]\n";
Why is $b[0] equal to aaagg and not aaa? In other words, why does the second group, (g.+), match only from the last g?

Because the first .+ is "greedy", which means that it will try to match as many characters as possible.
If you want to turn off this "greedy" behaviour, you may replace .+ by .+?, so /(a.+?)(g.+)/ will return ('aaa', 'gggaaa').
Maybe you wanted to write /(a+)(g+)/ (only 'a's in the first group, and only 'g's in the second one).

The regular expression you wrote:
($a =~ /(a.+)(g.+)/);
matches an "a" followed by as many characters as it can, while still leaving a "g" followed by more characters. So the first group, (a.+), matches "aaagg", and the second part of your regular expression, (g.+), matches "gaaa".
The @b array receives the two matches "aaagg" and "gaaa". So $b[0] prints "aaagg".

The problem is that the first .+ is causing the g to be matched as far to the right as possible.
To show you what is really happening I modified your code to output more illustrative debug information.
$ perl -Mre=debug -e'q[aaagggaaa] =~ /a.+[g ]/'
Compiling REx "a.+[g ]"
Final program:
1: EXACT <a> (3)
3: PLUS (5)
4: REG_ANY (0)
5: ANYOF[ g][] (16)
16: END (0)
anchored "a" at 0 (checking anchored) minlen 3
Guessing start of match in sv for REx "a.+[g ]" against "aaagggaaa"
Found anchored substr "a" at offset 0...
Guessed: match at offset 0
Matching REx "a.+[g ]" against "aaagggaaa"
0 <> <aaagggaaa> | 1:EXACT <a>(3)
1 <a> <aagggaaa> | 3:PLUS(5)
REG_ANY can match 8 times out of 2147483647...
9 <aaagggaaa> <> | 5: ANYOF[ g][](16)
failed...
8 <aaagggaa> <a> | 5: ANYOF[ g][](16)
failed...
7 <aaaggga> <aa> | 5: ANYOF[ g][](16)
failed...
6 <aaaggg> <aaa> | 5: ANYOF[ g][](16)
failed...
5 <aaagg> <gaaa> | 5: ANYOF[ g][](16)
6 <aaaggg> <aaa> | 16: END(0)
Match successful!
Freeing REx: "a.+[g ]"
Notice that the first .+ is capturing everything it can to start out with.
Then it has to backtrack until the g can be matched.
What you probably want is one of:
/( a+ )( g+ )/x;
/( a.+? )( g.+ )/x;
/( a+ )( g.+ )/x;
/( a[^g]+ )( g.+ )/x;
/( a[^g]+ )( g+ )/x;
# etc.
Without more information from you, it is impossible to know which regex you want.
Really, regular expressions are a language in their own right, one that is more complicated than the rest of Perl.
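To compare what each of the candidate patterns actually captures on the sample string, something like the following sketch can help. The helper name groups is made up; the patterns are the ones listed above plus the original one.

```perl
use strict;
use warnings;

# Return the captured groups joined by "|", or "no match".
sub groups {
    my ($re, $text) = @_;
    my @caps = $text =~ $re;
    return @caps ? join('|', @caps) : 'no match';
}

my $text = 'aaagggaaa';
my @patterns = (
    qr/(a.+)(g.+)/,        # original: aaagg|gaaa
    qr/(a+)(g+)/,          # aaa|ggg
    qr/(a.+?)(g.+)/,       # aaa|gggaaa
    qr/(a+)(g.+)/,         # aaa|gggaaa
    qr/(a[^g]+)(g.+)/,     # aaa|gggaaa
    qr/(a[^g]+)(g+)/,      # aaa|ggg
);

print groups($_, $text), "\n" for @patterns;
```

Several of these give the same answer on this input and would only differ on other strings, which is why it matters what the data really looks like.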

Perl quantifiers are greedy by default: .+ takes as many characters as it can while still allowing the rest of the pattern to match.
In your code the second group therefore starts at the last g, and the output is aaagg. If you want to get aaa as the output, you need the non-greedy behavior. Use this code:
$a = 'aaagggaaa';
(@b) = ($a =~ /(a.+?)(g.+)/);
print "$b[0]\n";
It will output:
aaa
The question mark after the quantifier is what makes the match non-greedy.

Quantifiers in a regex are greedy by default. You can turn that off by appending the ? character:
$a = 'aaagggaaa';
my @b = ($a =~ /(a.+)(g.+)/);
my @c = ($a =~ /(a.+?)(g.+)/);
print "@b\n";
print "@c\n";
Output:
aaagg gaaa
aaa gggaaa
But I'm not sure this is what you want! What about abagggbb? Do you need aba?
