A beginner's question. In the code:
$a = 'aaagggaaa';
(@b) = ($a =~ /(a.+)(g.+)/);
print "$b[0]\n";
Why is $b[0] equal to aaagg and not aaa? In other words, why does the second group, (g.+), match only from the last g?
Because the first .+ is "greedy", which means that it will try to match as many characters as possible.
If you want to turn off this "greedy" behaviour, you can replace .+ with .+?, so /(a.+?)(g.+)/ will return ('aaa', 'gggaaa').
Maybe you wanted to write /(a+)(g+)/ (only 'a's in the first group and only 'g's in the second).
The regular expression you wrote:
($a =~ /(a.+)(g.+)/);
matches an "a", then as many characters as it can while still leaving one "g" followed by at least one more character. So the first group, (a.+), matches "aaagg", and the second group, (g.+), matches what is left: "gaaa".
The @b array receives the two captures, "aaagg" and "gaaa", so printing $b[0] outputs "aaagg".
The problem is that the first .+ is causing the g to be matched as far to the right as possible.
To show you what is really happening I modified your code to output more illustrative debug information.
$ perl -Mre=debug -e'q[aaagggaaa] =~ /a.+[g ]/'
Compiling REx "a.+[g ]"
Final program:
1: EXACT <a> (3)
3: PLUS (5)
4: REG_ANY (0)
5: ANYOF[ g][] (16)
16: END (0)
anchored "a" at 0 (checking anchored) minlen 3
Guessing start of match in sv for REx "a.+[g ]" against "aaagggaaa"
Found anchored substr "a" at offset 0...
Guessed: match at offset 0
Matching REx "a.+[g ]" against "aaagggaaa"
0 <> <aaagggaaa> | 1:EXACT <a>(3)
1 <a> <aagggaaa> | 3:PLUS(5)
REG_ANY can match 8 times out of 2147483647...
9 <aaagggaaa> <> | 5: ANYOF[ g][](16)
failed...
8 <aaagggaa> <a> | 5: ANYOF[ g][](16)
failed...
7 <aaaggga> <aa> | 5: ANYOF[ g][](16)
failed...
6 <aaaggg> <aaa> | 5: ANYOF[ g][](16)
failed...
5 <aaagg> <gaaa> | 5: ANYOF[ g][](16)
6 <aaaggg> <aaa> | 16: END(0)
Match successful!
Freeing REx: "a.+[g ]"
Notice that the first .+ is capturing everything it can to start out with.
Then it has to backtrack until the g can be matched.
What you probably want is one of:
/( a+ )( g+ )/x;
/( a.+? )( g.+ )/x;
/( a+ )( g.+ )/x;
/( a[^g]+ )( g.+ )/x;
/( a[^g]+ )( g+ )/x;
# etc.
Without more information from you, it is impossible to know which regex you want.
Regular expressions really are a language in their own right, one that is more complicated than the rest of Perl.
Perl's quantifiers are greedy by default: each one matches as much as it can while still allowing the overall match to succeed.
In your code the first .+ runs up to the last g, so $b[0] is aaagg. If you want aaa instead, you need the non-greedy behavior. Use this code:
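One way to choose between the candidates listed above is to run each of them against the sample string and compare the captures. This is only a minimal sketch: the @candidates list, its labels, and the %captures hash are illustrative names that do not appear in the original code.

```perl
use strict;
use warnings;

my $str = 'aaagggaaa';

# The candidate patterns from the list above, each with a short label.
my @candidates = (
    [ 'a+ g+'      , qr/( a+ )( g+ )/x       ],
    [ 'a.+? g.+'   , qr/( a.+? )( g.+ )/x    ],
    [ 'a+ g.+'     , qr/( a+ )( g.+ )/x      ],
    [ 'a[^g]+ g.+' , qr/( a[^g]+ )( g.+ )/x  ],
    [ 'a[^g]+ g+'  , qr/( a[^g]+ )( g+ )/x   ],
);

# Collect the two captured groups for every candidate.
my %captures;
for my $candidate (@candidates) {
    my ($label, $re) = @$candidate;
    my @groups = $str =~ $re;
    $captures{$label} = [@groups];
    print "$label: @groups\n";
}
```

For this particular input some candidates give the same captures, so a string such as abagggbb is a better test of which behaviour is actually wanted.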
$a = 'aaagggaaa';
(@b) = ($a =~ /(a.+?)(g.+)/);
print "$b[0]\n";
It will output:
aaa
The question mark after the quantifier makes the match non-greedy.
Quantifiers in a regex are greedy by default. You can turn that off by adding a ? after the quantifier:
$a = 'aaagggaaa';
my @b = ($a =~ /(a.+)(g.+)/);
my @c = ($a =~ /(a.+?)(g.+)/);
print "@b\n";
print "@c\n";
Output:
aaagg gaaa
aaa gggaaa
But I'm not sure this is what you want! What about abagggbb? Do you need aba?
I've come across following materials:
Mastering Perl by brian d foy, chapter: Debugging Regular Expressions.
Debugging regular expressions which mentions re::debug module for perl
I've also tried various other techniques:
The re=debugcolor module, which highlights its output.
The construct (?{print "$1 $2\n"}).
But I still don't understand how to read their output. I've also found other modules for debugging regular expressions here, but I haven't tried them yet. Can you please explain how to read the output of use re 'debug', or of another tool for debugging regular expressions in Perl?
EDIT in reply to Borodin:
1st example:
perl -Mre=debug -e' "foobar"=~/(.)\1/'
Compiling REx "(.)\1"
Final program:
1: OPEN1 (3)
3: REG_ANY (4)
4: CLOSE1 (6)
6: REF1 (8)
8: END (0)
minlen 1
Matching REx "(.)\1" against "foobar"
0 <> <foobar> | 1:OPEN1(3)
0 <> <foobar> | 3:REG_ANY(4)
1 <f> <oobar> | 4:CLOSE1(6)
1 <f> <oobar> | 6:REF1(8)
failed...
1 <f> <oobar> | 1:OPEN1(3)
1 <f> <oobar> | 3:REG_ANY(4)
2 <fo> <obar> | 4:CLOSE1(6)
2 <fo> <obar> | 6:REF1(8)
3 <foo> <bar> | 8:END(0)
Match successful!
Freeing REx: "(.)\1"
What do OPEN1, REG_ANY, CLOSE1 ... mean?
What do numbers like 1 3 4 6 8 mean?
What does the number in parentheses in OPEN1(3) mean?
Which output should I look at, Compiling REx or Matching REx?
2nd example:
perl -Mre=debugcolor -e' "foobar"=~/(.*)\1/'
Compiling REx "(.*)\1"
Final program:
1: OPEN1 (3)
3: STAR (5)
4: REG_ANY (0)
5: CLOSE1 (7)
7: REF1 (9)
9: END (0)
minlen 0
Matching REx "(.*)\1" against "foobar"
0 <foobar>| 1:OPEN1(3)
0 <foobar>| 3:STAR(5)
REG_ANY can match 6 times out of 2147483647...
6 <foobar>| 5: CLOSE1(7)
6 <foobar>| 7: REF1(9)
failed...
5 <foobar>| 5: CLOSE1(7)
5 <foobar>| 7: REF1(9)
failed...
4 <foobar>| 5: CLOSE1(7)
4 <foobar>| 7: REF1(9)
failed...
3 <foobar>| 5: CLOSE1(7)
3 <foobar>| 7: REF1(9)
failed...
2 <foobar>| 5: CLOSE1(7)
2 <foobar>| 7: REF1(9)
failed...
1 <foobar>| 5: CLOSE1(7)
1 <foobar>| 7: REF1(9)
failed...
0 <foobar>| 5: CLOSE1(7)
0 <foobar>| 7: REF1(9)
0 <foobar>| 9: END(0)
Match successful!
Freeing REx: "(.*)\1"
Why are the numbers descending (6 5 4 3 ...) in this example?
What does the failed keyword mean?
Regular expressions define finite state machines [1]. The debugger is more or less showing you how the state machine is progressing as the string is consumed character by character.
"Compiling REx" is the listing of instructions for that regular expression. The number in parentheses after each instruction is where to go once the step succeeds. In /(.*)\1/:
1: OPEN1 (3)
3: STAR (5)
4: REG_ANY (0)
5: CLOSE1 (7)
STAR (5) means compute STAR and once you succeed, go to instruction 5 CLOSE1.
"Matching REx" is the step-by-step execution of those instructions. The number on the left is the total number of characters that have been consumed so far. This number can go down if the matcher has to go backwards because something it tried didn't work.
To understand these instructions, it's important to understand how regular expressions "work". Finite state machines are usually visualized as a kind of flow chart. I have produced a crude one below for /(.)\1/. Because of the back reference to a capture group, I don't believe this regex is a strict finite state machine. The chart is useful nonetheless.
           Match
+-------+  Anything   +----------+
| Start +------------>| State 1  |
+---^---+             +--+---+---+
    |                    |   |
    |                    |   | Matched same
    +--------------------+   | character
     matched different       |
     character          +----+------+
                        |  Success  |
                        +-----------+
We start on Start. It's easy to advance to the first state, we just consume any one character (REG_ANY). The only other thing that could happen is end of input. I haven't drawn that here. The REG_ANY instruction is wrapped in the capture group instructions. OPEN1 starts recording all matched characters into the first capture group. CLOSE1 stops recording characters to the first capture group.
Once we consume a character, we sit on State 1 and consume the next char. If it matches the previous char we move to success! REF1 is the instruction that attempts to match capture group #1. Otherwise, we failed and need to move back to the Start to try again. Whenever the matcher says "failed..." it's telling you that something didn't work, so it's returning to an earlier state (that may or may not include 'unconsuming' characters).
The example with * is more complicated. * (which corresponds to STAR) tries to match the given pattern zero or more times, and it is greedy. That means it tries to match as many characters as it possibly can. Starting at the beginning of the string, it says "I can match up to 6 characters!" So it matches all 6 characters ("foobar"), closes the capture group, and tries to match "foobar" again. That doesn't work! It tries again with 5, and that doesn't work either. And so on, until it tries matching zero characters. Now the capture group is empty, and matching the empty string always succeeds. So the match succeeds with \1 = "".
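The outcome described above can be checked without the debugger by looking at what was captured and where the match started. A minimal sketch follows; the (.+) variant is added here only for contrast and is not part of the question.

```perl
use strict;
use warnings;

my $str = 'foobar';

# Greedy (.*) backs off all the way to the empty string, so the
# match succeeds at offset 0 with nothing captured.
my ($star) = $str =~ /(.*)\1/;
my $star_start = $-[0];

# (.+) must capture at least one character, so the engine has to
# find a character that is really repeated.
my ($plus) = $str =~ /(.+)\1/;
my $plus_start = $-[0];

printf "(.*)\\1 captured '%s' at offset %d\n", $star, $star_start;
printf "(.+)\\1 captured '%s' at offset %d\n", $plus, $plus_start;
```

$-[0] holds the offset where the most recent successful match began, which is why it is copied right after each match.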
I realize I've spent more time explaining regular expressions than I have Perl's regex debugger. But I think its output will become much more clear once you understand how regexes operate.
Here is a finite state machine simulator. You can enter a regex and see it executed. Unfortunately, it doesn't support back references.
1: I believe some of Perl's regular expression features push it beyond this definition but it's still useful to think about them this way.
The debug information contains a description of the bytecode. The numbers on the left are node indices in the op tree. The numbers in round brackets tell the engine which node to jump to upon a match. The EXACT operator tells the regex engine to look for a literal string. REG_ANY means the . symbol. PLUS means +. Node 0 is the 'end' node. OPEN1 is a '(' symbol, CLOSE1 is ')', and STAR is '*'. When the matcher reaches the end node, it returns a success code back to Perl, indicating that the entire regex has matched.
See more details at http://perldoc.perl.org/perldebguts.html#Debugging-Regular-Expressions and a more conceptual http://perl.plover.com/Rx/paper/
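If the full listing is too noisy, the re pragma can be limited to one block and, as far as I understand its documentation, to the execution trace alone via its Debug mode. The sketch below assumes that reading of the pragma; the trace is written to STDERR, and $doubled is just an illustrative variable.

```perl
use strict;
use warnings;

my $doubled;
{
    # The pragma is lexically scoped, so only regexes inside this
    # block are traced. "Debug EXECUTE" asks for the matching trace
    # rather than the full compile-time listing.
    use re qw(Debug EXECUTE);
    ($doubled) = "foobar" =~ /(.)\1/;
}

# Regexes outside the block run without any debug output.
print "doubled character: $doubled\n";
```

Replacing EXECUTE with COMPILE should give the program listing instead, which helps when the question is about node numbers rather than backtracking.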
I have a file like this:
3107 0.9 0.0 0.0 chr1 29312346 29312694 (219937927) C L1HS LINE/L1 (4) 6151 5803 54360
8095 0.5 0.0 0.0 chr1 31040661 31041597 (218209024) + L1HS LINE/L1 5203 6139 (16) 57249
...
When the 9th column is C, I need to subtract column 14 from 13, and when the 9th column is +, I need to subtract column 12 from 13.
I understand I can create arrays, but how can I use a regex, such as ($line =~/(\w+)\s+(\w+)/), to solve this instead?
You can split on whitespace into the @F array (the first value being $F[0]), subtract the columns, and print the values separated by spaces.
perl -lane'
$F[12] -= $F[13] if $F[8] eq "C";
$F[12] -= $F[11] if $F[8] eq "+";
print "@F";
' file
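The same logic can be written as a small script, which is easier to test than a one-liner. This is a sketch only: adjust_line is a hypothetical helper name, and the two sample lines are the ones from the question.

```perl
use strict;
use warnings;

# Split one line on whitespace and replace column 13 with the
# difference described in the question (columns are 1-based there,
# array indices are 0-based here).
sub adjust_line {
    my ($line) = @_;
    my @F = split ' ', $line;
    if    ($F[8] eq 'C') { $F[12] -= $F[13] }   # column 13 minus column 14
    elsif ($F[8] eq '+') { $F[12] -= $F[11] }   # column 13 minus column 12
    return "@F";
}

my @lines = (
    '3107 0.9 0.0 0.0 chr1 29312346 29312694 (219937927) C L1HS LINE/L1 (4) 6151 5803 54360',
    '8095 0.5 0.0 0.0 chr1 31040661 31041597 (218209024) + L1HS LINE/L1 5203 6139 (16) 57249',
);

print adjust_line($_), "\n" for @lines;
```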
Since you wanted to use a regex, here is another solution. It is perhaps a bit imprecise, because you did not define your line format exactly and only gave two example lines, but it works for those. I commented the regex so that you can see which part of the expression matches which column, and which parts are captured.
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
while( <DATA> )
{
    if( $_ =~ /[0-9]+           # 1
               \s+
               [0-9.]+          # 2
               \s+
               [0-9.]+          # 3
               \s+
               [0-9.]+          # 4
               \s+
               [a-z0-9]+        # 5
               \s+
               [0-9]+           # 6
               \s+
               [0-9]+           # 7
               \s+
               \([a-z0-9]+\)    # 8
               \s+
               ([c+])           # 9 -> capture group 1
               \s+
               [a-z0-9]+        # 10
               \s+
               [a-z0-9\/]+      # 11
               \s+
               \(?([0-9]+)\)?   # 12 -> capture group 2
               \s+
               ([0-9]+)         # 13 -> capture group 3
               \s+
               \(?([0-9]+)\)?   # 14 -> capture group 4
               \s+
               [0-9]+?          # 15
              /ix )
    {
        say "Matched: $_";
        say "Operation: $1";
        if( $1 eq "+" )
        {
            # subtract column 12 from column 13
            say "$3 - $2 = ".( $3 - $2 );
        }
        elsif( $1 eq "C" )
        {
            # subtract column 14 from column 13
            say "$3 - $4 = ".( $3 - $4 );
        }
        else
        {
            say "Nothing to do here...";
        }
    }
}
exit;
#1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
__DATA__
3107 0.9 0.0 0.0 chr1 29312346 29312694 (219937927) C L1HS LINE/L1 (4) 6151 5803 54360
8095 0.5 0.0 0.0 chr1 31040661 31041597 (218209024) + L1HS LINE/L1 5203 6139 (16) 57249
Update:
As you can see in the Perl documentation, I used the x flag to allow comments in my regex. The i flag makes it case-insensitive.
Furthermore, I didn't just divide the line into columns by whitespace but also constrained each column by its type, which is an advantage of using a regular expression. While the \s+ expressions act as column separators here, allowing arbitrary amounts of whitespace, each column is specified more narrowly. That makes it possible to detect non-conforming lines. For example, by defining capture group $1 as ([c+]) I was able to restrict the characters that trigger an operation to C and + (and c, because of case-insensitivity).
Capturing a group into a variable is done with parentheses.
This way, I was able to only pick the columns I really need (see the comments).
Do not use a regex for a problem like this.
If you're just working with columns separated by whitespace, the proper tool is split.
my @cols = split ' ', $line;
I cannot get this regex to work:
"4. 182 ex" (number, period, 2 blank spaces, 3 numbers, blank space, 2 characters)
The regex should return "4182", removing the period, blank spaces, and characters.
Can you help me please?
EDIT!!!
Thanks everyone but I missed the key question:
a) The regex should only find the value (4182) when the same line contains a specific text, for example "Magic":
"Magic 4. 182 ex"
b) The regex should only find the value (4182) when the table contains a specific text, for example "Magic":
"Magic 4. 182 ex
Lisefeo 2. 123 fg
Nioos 3. 124 df"
Specific text = an exact match, or a string containing those characters.
This is the regex I've tried so far, but does it work for a whole table (not just a line)?
(Magic.*?(\d).\s\s(\d{3})\s\w\w)
Just remove all characters that are not digits:
Perl:
$string =~ s/\D+//g;
or
php:
$string = preg_replace('/\D+/', '', $string);
According to your updated question, you could do:
$string =~ s/^Magic\s+(\d+)\.\s+(\d{3})\b.*$/$1$2/;
or, with php:
$string = preg_replace('/^Magic\s+(\d+)\.\s+(\d{3})\b.*$/', '$1$2', $string);
For it to match exactly what you said, use:
(\d)\.\s\s(\d{3})\s\w\w
You'll get the result in two groups: the first holds the single leading digit and the second holds the three-digit group.
RegEx101 example
Regards.
^([\d]+)\.[\s]+([\d]+)[\s]..
Tested with perl:
> echo "4. 182 ex" | perl -lne 'print $1,$2 if(/^([\d]+)\.[\s]+([\d]+)[\s]../)'
4182
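For the table case in the edited question, one option is to go through the text line by line and only extract the number from a line that contains the keyword. The sketch below assumes the table is held in one string with newline-separated rows; magic_value is a hypothetical helper name, and the keyword is matched case-insensitively as a plain substring.

```perl
use strict;
use warnings;

# Return the joined digits ("4182") from the first row that contains
# $keyword, or undef if no such row matches the expected layout.
sub magic_value {
    my ($text, $keyword) = @_;
    for my $line (split /\n/, $text) {
        next unless $line =~ /\Q$keyword\E/i;      # row must mention the keyword
        return "$1$2" if $line =~ /(\d)\.\s+(\d{3})\s\w\w/;
    }
    return undef;
}

my $table = "Magic 4.  182 ex\nLisefeo 2.  123 fg\nNioos 3.  124 df\n";

my $value = magic_value($table, 'Magic');
print defined $value ? "$value\n" : "no match\n";
```

Using \s+ instead of \s\s between the period and the digits keeps the pattern working whether the input has one or two blanks there.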
I have a huge file aab.txt whose contents are aaa...aab.
To my great surprise
perl -ne '/a*bb/' < aab.txt
runs (match failure) faster than
perl -ne '/a*b/' < aab.txt
(match success). Why???? Both should first gobble up all the a's, then the second one immediately succeeds, while the first will then have to backtrack over and over again, to fail.
Perl regexes are optimized to fail as early as possible rather than to succeed as fast as possible. This makes a lot of sense when grepping through a large log file.
There is an optimization that first looks for a constant part of the string, in this case, a “floating” b or bb. This can be checked rather efficiently without having to keep track of backtracking state. No bb is found, and the match aborts right there.
Not so with b. That floating substring is found, and the match constructed from there. Here is the debug output of the regex match (program is "aaab" =~ /a*b/):
Compiling REx "a*b"
synthetic stclass "ANYOF_SYNTHETIC[ab][]".
Final program:
1: STAR (4)
2: EXACT <a> (0)
4: EXACT <b> (6)
6: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF_SYNTHETIC[ab][] minlen 1
Guessing start of match in sv for REx "a*b" against "aaab"
Found floating substr "b" at offset 3...
start_shift: 0 check_at: 3 s: 0 endpos: 4 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "a*b" against "aaab"
Matching stclass ANYOF_SYNTHETIC[ab][] against "aaab" (4 bytes)
0 <> <aaab> | 1:STAR(4)
EXACT <a> can match 3 times out of 2147483647...
3 <aaa> <b> | 4: EXACT <b>(6)
4 <aaab> <> | 6: END(0)
Match successful!
Freeing REx: "a*b"
You can get such output with the debug option for the re pragma.
Finding the b or bb is unnecessary, strictly speaking, but it allows the match to fail much earlier.
/a*bb/
is basically
/^(?s:.*?)a*bb/
Note the two *. Optimizations aside, it's quadratic. In the worst-case scenario (a string of all a's), for a string of length N, it will check whether the current character is an a N*(N-1)/2 times. We call this O(N^2).
It's worth doing a scan of the string (O(N)) to see if it can possibly match before starting the match. It will take a little longer to match, but it will fail to match much faster. This is what Perl does.
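The idea of scanning for a required literal before running the real match can be imitated by hand. This is only an illustration of the principle, not how Perl implements it internally; match_with_precheck is a hypothetical name.

```perl
use strict;
use warnings;

# Reject the string with a cheap linear scan for the literal "bb"
# before letting the regex engine try (and backtrack over) /a*bb/.
sub match_with_precheck {
    my ($str) = @_;
    return 0 if index($str, 'bb') < 0;   # no "bb" anywhere: cannot match
    return $str =~ /a*bb/ ? 1 : 0;
}

for my $str ('aaaaab', 'aaaabb') {
    printf "%s: %s\n", $str, match_with_precheck($str) ? 'match' : 'no match';
}
```

The pre-scan adds a little work to every successful match, but a failing input is rejected after a single pass over the string.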
When you run the following
perl -Mre=debug -e"'aaaaab' =~ /a*bb/"
You get information about the compilation of the pattern:
Compiling REx "a*bb"
synthetic stclass "ANYOF{i}[ab][{non-utf8-latin1-all}]".
Final program:
1: STAR (4)
2: EXACT <a> (0)
4: EXACT <bb> (6)
6: END (0)
floating "bb" at 0..2147483647 (checking floating) stclass ANYOF{i}[ab][{non-utf8-latin1-all}] minlen 2
The last line indicates it will search for bb in the input before starting to match.
You get information about the evaluation of the pattern:
Guessing start of match in sv for REx "a*bb" against "aaaaab"
Did not find floating substr "bb"...
Match rejected by optimizer
Here you see that check in action.
Good time of day!
I am reading a book about Perl: "Programming Perl" by Larry Wall, Tom Christiansen, and Jon Orwant. In this book I found several examples that were not explained by the authors (or I simply don't get them).
The first
This prints hi only ONCE.
"adfsfloglig"=~ /.*(?{print "hi"})f/;
But this prints "hi" TWICE?? how can it be explained?
"adfsfloglig"=~ /.*(?{print "hi"})log/;
And continuing to experiment makes things even worse:
"adfsfloglig"=~ /.*(?{print "hi"})sflog/;
The above line of code again prints this terrifying "hi" only ONCE!
After about a week I understood only one thing completely: I NEED HELP :)
So I am asking you to help me, please.
The second (this is a bomb!)
$_ = "lothiernbfj";
m/ (?{$i = 0; print "setting i to 0\n"})
(.(?{ local $i = $i + 1; print "\ti is $i"; print "\tWas founded $&\n" }))*
(?{print "\nchecking rollback\n"})
er
(?{ $result = $i; print "\nsetting result\n"})
/x;
print "final $result\n";
Here the $result finally printed on the screen is equal to the number of chars that were matched by .*, but again I don't get it.
When turning on the debug printing (shown above), I see that $i is incremented every time a new char is included in $& (the matched part of the string).
In the end $i is equal to 11 (the number of chars in the string); then there are 7 rollbacks, where .* gives back its match one char at a time (7 times), so that the whole pattern can match.
But, damn magic, the result is set to the value of $i! And we never decremented this value anywhere! So $result should be equal to 11! But it is not. And the authors were right, I know.
Please, can you explain this strange Perl code that I was happy to meet?
Thank you for any answer!
From the documentation at http://perldoc.perl.org/perlre.html :
"WARNING: This extended regular expression feature is considered experimental, and may be changed without notice. Code executed that has side effects may not perform identically from version to version due to the effect of future optimisations in the regex engine. The implementation of this feature was radically overhauled for the 5.18.0 release, and its behaviour in earlier versions of perl was much buggier, especially in relation to parsing, lexical vars, scoping, recursion and reentrancy."
Even on a failed match, if the regex engine gets to the point where it has to run the code, it will run the code. If the code involves only assigning to (local?) variables and whatever operations are allowed, backtracking will cause it to undo the operations, so the failed matches will have no effect. But print operations can't be undone, with the result that you can get strings printed from a failed match. This is why the documentation warns against embedding code with "side effects".
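The difference between a localized assignment and an ordinary side effect can be seen by counting both ways in the same match. The sketch below is a quiet variant of the example from the question; $j is an extra counter added here for contrast and is not part of the book's code.

```perl
use strict;
use warnings;

# Package variables, because local() only works on those.
# $i is localized on every iteration, so backtracking restores it.
# $j is incremented normally, so backtracking cannot undo it.
our ($i, $j, $result);

$_ = "lothiernbfj";
m/ (?{ $i = 0; $j = 0 })
   (?: . (?{ local $i = $i + 1; $j++ }) )*
   er
   (?{ $result = $i })
 /x;

print "result = $result, iterations run = $j\n";
```

$result ends up as the number of characters left in front of "er" once backtracking has finished, while $j still counts every iteration that was ever attempted, just as print keeps everything it has already printed.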
I did some experimenting and am making this answer a community wiki, hoping that people will add to it. I tried to crack the simplest regexps and didn't dare to deal with "the bomb".
1. "adfsfloglig"=~ /.*(?{print "hi"})f/;
Here is the debug info for the regexp:
Final program:
1: STAR (3)
2: REG_ANY (0)
3: EVAL (5)
5: EXACT <f> (7)
7: END (0)
And the trace of execution with my comments:
#matches the whole string with .*
0 <> <adfsflogli> | 1:STAR(3)
REG_ANY can match 11 times out of 2147483647...
#splits the string to <adfs> and <floglig> and prints "hi".
#Why does it split? Not sure, probably, knows about the f after "hi" code
4 <adfs> <floglig> | 3: EVAL(5)
#tries to find f in 'floglig' - success
4 <adfs> <floglig> | 5: EXACT <f>(7)
#end
5 <adfsf> <loglig> | 7: END(0)
2. "adfsfloglig" =~ /.*(?{print "hi"})log/;
1: STAR (3)
2: REG_ANY (0)
3: EVAL (5)
5: EXACT <log> (7)
7: END (0)
Trace:
#matches the whole string with .*
0 <> <adfsflogli> | 1:STAR(3)
REG_ANY can match 11 times out of 2147483647...
#splits the string to <adfsflog> and <lig> and prints "hi".
#Probably, it found 'l' symbol after the code block
#and, being greedy, tries to capture up to the last 'l'
8 <adfsflog> <lig> | 3: EVAL(5)
#compares the 'lig' with 'log' - failed
8 <adfsflog> <lig> | 5: EXACT <log>(7)
failed...
#moves backwards, taking the previous 'l'
#prints 2-nd 'hi'
5 <adfsf> <loglig> | 3: EVAL(5)
#compares 'loglig' with 'log' - success
5 <adfsf> <loglig> | 5: EXACT <log>(7)
#end
8 <adfsflog> <lig> | 7: END(0)
3. "adfsfloglig"=~ /.*(?{print "hi"})sflog/;
1: STAR (3)
2: REG_ANY (0)
3: EVAL (5)
5: EXACT <sflog> (8)
8: END (0)
Trace:
#matches the whole string with .*
0 <> <adfsflogli> | 1:STAR(3)
REG_ANY can match 11 times out of 2147483647...
#splits the string to <adf> and <sfloglig> and prints "hi".
3 <adf> <sfloglig> | 3: EVAL(5)
#compares 'sfloglig' with 'sflog' - success
3 <adf> <sfloglig> | 5: EXACT <sflog>(8)
#end
8 <adfsflog> <lig> | 8: END(0)