When does a substitution stop in a Perl regex?

I'm trying to translate some Perl code into Python and ran into a problem with a regex: I can't figure out what it does or why it stops where it does.
This is the regex:
$url =~ s/^.*\///;
I passed it some URLs to see what comes out,
and this is the result:
# string input
"http://perltest.my-mobile.org/c/test.cgi?u=USER&p=PASS"
# string output
"test.cgi?u=USER&p=PASS"
I really don't know why it is stopping at test. As far as I understand it,
it replaces any character at the beginning of the string with nothing,
so why does it stop at test?
And if you can help me write a regex in Python that does the same thing,
that would be cool.
Thanks in advance!

I really don't know why it is stopping at 'test'. As far as I understand it, it replaces any character at the beginning of the string with nothing, so why does it stop at test?
Because of the \/ being part of the pattern.
# V here
$url =~ s/^.*\///;
It would be clearer if the code was using a different quoting delimiter, which is possible in Perl. That way, there would not be the leaning toothpick syndrome here.
$url =~ s{^.*/}{};
Note that .* is greedy by default: it first grabs as much as it can and then backtracks only as far as needed for the / to match, so the match ends at the last slash in the string.
You can use the re pragma in debug mode to learn more about what the regex engine does under the hood.
use re 'debug';
my $url = "http://perltest.my-mobile.org/c/test.cgi?u=USER&p=PASS";
$url =~ s{^.*/}{};
This will output to STDERR.
Compiling REx "^.*/"
Final program:
1: SBOL /^/ (2)
2: STAR (4)
3: REG_ANY (0)
4: EXACT </> (6)
6: END (0)
floating "/" at 0..9223372036854775807 (checking floating) anchored(SBOL) minlen 1
Matching REx "^.*/" against "http://perltest.my-mobile.org/c/test.cgi?u=USER&p=PASS"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [0..54] gave 5
Found floating substr "/" at offset 5 (rx_origin now 0)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <http://per> | 0| 1:SBOL /^/(2)
0 <> <http://per> | 0| 2:STAR(4)
| 0| REG_ANY can match 54 times out of 2147483647...
31 <org/c> </test.cgi?> | 1| 4:EXACT </>(6)
32 <rg/c/> <test.cgi?u> | 1| 6:END(0)
Match successful!
Freeing REx: "^.*/"
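For the Python half of the question, here is a minimal sketch using the standard re module. The helper name strip_to_last_slash is made up for illustration, and count=1 is there only to mirror the fact that the Perl substitution has no /g flag.

```python
import re

def strip_to_last_slash(url):
    # Rough counterpart of Perl's  $url =~ s{^.*/}{};
    # Greedy .* runs to the end, then backtracks to the last "/".
    return re.sub(r"^.*/", "", url, count=1)

url = "http://perltest.my-mobile.org/c/test.cgi?u=USER&p=PASS"
print(strip_to_last_slash(url))   # test.cgi?u=USER&p=PASS

# The same result without a regex: split once from the right.
print(url.rsplit("/", 1)[-1])     # test.cgi?u=USER&p=PASS
```

If the string contains no slash at all, both variants leave it unchanged, which is also what the Perl substitution does.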

Related

Perl: Fastest match of anything?

I want to do this without using /s:
(?:.|\n)+
And I want it to be fast.
It is part of a bigger regexp, which is the reason I cannot use /s. I have tested:
perl -pe 's/(?:.|\n)+//' # 30 MBytes/s
perl -pe 's/[^\777]+//' # 124 MBytes/s
They are not as fast as /s:
perl -pe 's/.+//s' # 164 MBytes/s
Can I somehow get the same speed as /s?
Edit:
perl -pe 's/(?s).+(?-s)//' # 162 MBytes/s
perl -pe 's/[\d\D]+//' # 162 MBytes/s
perl -pe 's/[\s\S]+//' # 161 MBytes/s
These are all good options. Thanks.
If I create a 98MB file:
$ dd bs=1024 count=100000 </dev/urandom >file
100000+0 records in
100000+0 records out
102400000 bytes transferred in 0.558640 secs (183302305 bytes/sec)
Now compare some methods:
time perl -0777 -lne 's/(?:.|\n)+//' file
real 0m1.957s
user 0m1.902s
sys 0m0.050s
time perl -0777 -lne 's/[^\777]+//' file
real 0m0.130s
user 0m0.082s
sys 0m0.046s
time perl -0777 -lne 's/[\s\S]+//' file
real 0m0.065s
user 0m0.022s
sys 0m0.041s
time perl -0777 -lne 's/.+//s' file
real 0m0.064s
user 0m0.022s
sys 0m0.038s
As you can see, the character class of [\s\S] is optimized to /./s since 5.24.
$ perl -Mre=debug -e'qr/./s'
Compiling REx "."
Final program:
1: SANY (2)
2: END (0)
minlen 1
Freeing REx: "."
$ perl -Mre=debug -e'qr/[\s\S]/'
Compiling REx "[\s\S]"
Final program:
1: SANY (2)
2: END (0)
minlen 1
Freeing REx: "[\s\S]"
You do not need to use any workarounds.
You may use inline modifier flags, (?s) to enable dot to match across lines and (?-s) to disable this effect.
For example:
(?s).*PATTERN(?-s).*
where the first .* matches any text and the last .* only matches till the end of a line.
You can also use modifier groups:
(?s:.*)PATTERN(?-s:.*)
See more in Extended Patterns.
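The same modifier-group idea can be tried in Python as a rough illustration (this is Python's re module, not Perl; scoped flag groups need Python 3.6 or later, and the sample text is invented). Note that Python expects a bare (?s) at the very start of a pattern, so only the scoped form carries over.

```python
import re

text = "line1\nline2 PATTERN tail\nline3"

# (?s:.*) may cross newlines; (?-s:.*) stops at the end of the current line.
m = re.search(r"(?s:.*)PATTERN(?-s:.*)", text)
print(repr(m.group(0)))   # 'line1\nline2 PATTERN tail'
```

The match starts at the beginning of the text, crosses the first newline to reach PATTERN, and then stops before the newline that precedes line3.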
You can use (?s:PATTERN) to enable /s for just the subpattern in the parens.
(?s:.)+
or
(?s:.+)
For example,
$ perl -M5.010 -e'
say /^.(?s:.).\z/ ? "match" : "no match"
for "\n\n\n", "A\n\n", "\n\nA", "A\nA";
'
no match
no match
no match
match
There is no difference between . with /s and (?s:.), so you'll get exactly the same performance.
$ perl -Mre=debug -e'qr/(?s:.)/'
Compiling REx "(?s:.)"
Final program:
1: SANY (2)
2: END (0)
minlen 1
Freeing REx: "(?s:.)"
$ perl -Mre=debug -e'qr/./s'
Compiling REx "."
Final program:
1: SANY (2)
2: END (0)
minlen 1
Freeing REx: "."
Advantages over the other solutions:
(?s:.) restores the s flag state, while (?s).(?-s) turns it off.
(?s:.) is a fair bit shorter than (?s).(?-s).
(?s:.) is not longer than [\s\S] and [\d\D].
(?s:.) is not a workaround like [\s\S] and [\d\D].
That said, (?s:.), (?s).(?-s), [\s\S] and [\d\D] all result in exactly the same regex program (since 5.24), so they'll all perform identically.
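For readers porting this to Python, here is a small sketch of the same four-string check with Python's re module (Python 3.6+ for the scoped (?s:.) group). It only shows that the matching behaviour carries over; it says nothing about speed in Python.

```python
import re

# Same shape as the Perl test: plain dot, dot-all dot, plain dot.
pattern = re.compile(r"^.(?s:.).\Z")

samples = ["\n\n\n", "A\n\n", "\n\nA", "A\nA"]
results = ["match" if pattern.search(s) else "no match" for s in samples]
print(results)   # ['no match', 'no match', 'no match', 'match']

# [\s\S] is the class-based spelling of "any character, newline included".
crosses_newline = re.fullmatch(r"[\s\S]+", "a\nb") is not None
plain_dot_crosses = re.fullmatch(r".+", "a\nb") is not None
print(crosses_newline, plain_dot_crosses)   # True False
```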

perl: add to a path using substitution

My script takes in a filepath, and I want to append a directory to the end of the path. The issue is I want to be agnostic of whether the argument has a trailing slash or not. So for example:
$ perl myscript.pl /path/to/dir
/path/to/dir/new
$ perl myscript.pl /path/to/dir/
/path/to/dir/new
I tried $path =~ s/\/?$/\/new/g, but that results in a double /new if a slash is present:
$ perl myscript.pl /path/to/dir/
/path/to/dir/new/new
$ perl myscript.pl /path/to/dir
/path/to/dir/new
What's wrong?
Because /g is 'global' and will match multiple times:
#!/usr/bin/env perl
use strict;
use warnings;
#turn on debugging
use re 'debug';
my $path = '/path/to/dir/';
$path =~ s/\/?$/\/new/g;
print $path;
After the first replacement, the regex engine is positioned at the end of the string. There the optional / matches zero times and $ still matches, so the pattern matches a second time.
E.g.:
Compiling REx "/?$"
Final program:
1: CURLY {0,1} (5)
3: EXACT </> (0)
5: SEOL (6)
6: END (0)
floating ""$ at 0..1 (checking floating) minlen 0
Matching REx "/?$" against "/path/to/dir/"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [0..13] gave 13
Found floating substr ""$ at offset 13 (rx_origin now 12)...
(multiline anchor test skipped)
try at offset...
Intuit: Successfully guessed: match at offset 12
12 <path/to/dir> </> | 1:CURLY {0,1}(5)
EXACT </> can match 1 times out of 1...
13 <path/to/dir/> <> | 5: SEOL(6)
13 <path/to/dir/> <> | 6: END(0)
Match successful!
Matching REx "/?$" against ""
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [13..13] gave 13
Found floating substr ""$ at offset 13 (rx_origin now 13)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 13
13 <path/to/dir/> <> | 1:CURLY {0,1}(5)
EXACT </> can match 0 times out of 1...
13 <path/to/dir/> <> | 5: SEOL(6)
13 <path/to/dir/> <> | 6: END(0)
Match successful!
Matching REx "/?$" against ""
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [13..13] gave 13
Found floating substr ""$ at offset 13 (rx_origin now 13)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 13
13 <path/to/dir/> <> | 1:CURLY {0,1}(5)
EXACT </> can match 0 times out of 1...
13 <path/to/dir/> <> | 5: SEOL(6)
13 <path/to/dir/> <> | 6: END(0)
This is because $ is a zero-width anchor, and \/? is also zero-width when it matches nothing. Once the trailing / has been consumed and replaced, the regex engine carries on (because you told it to with /g) and finds the end of the string again, where an empty match followed by $ is still a valid match to replace.
But why not instead use File::Spec:
#!/usr/bin/env perl
use strict;
use warnings;
use File::Spec;
use Data::Dumper;
my $path = '/path/to/dir/';
my @dirs = File::Spec->splitdir($path);
print Dumper \@dirs;
$path = File::Spec->catdir(@dirs, "new");
print $path;
This provides you with a platform-independent way to split and join path elements, and it doesn't rely on regex matching, which can break in various ways (such as the one you found).
Drop the /g modifier:
$path =~ s/\/?$/\/new/
works fine.
You only want to add one "new" at the end, so the /g modifier makes no sense.
Also, note that you can use different delimiters for your regex:
$path =~ s{ /? $}{/new}x;
is a little bit clearer.
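The double match is not specific to Perl. As a rough illustration, the sketch below reproduces it with Python's re module (the doubled result assumes Python 3.7 or later, where re.sub also replaces an empty match adjacent to a previous match). posixpath.join stands in here for the File::Spec approach; it is a different library, not a translation of it.

```python
import re
import posixpath

path = "/path/to/dir/"

# The pattern matches twice: once consuming the slash, once as an empty
# match at the very end of the string.
spans = [m.span() for m in re.finditer(r"/?$", path)]
print(spans)                                   # [(12, 13), (13, 13)]

# Replacing every match reproduces the doubled suffix (Python 3.7+).
doubled = re.sub(r"/?$", "/new", path)
print(doubled)                                 # /path/to/dir/new/new

# Limiting to one replacement is the analogue of dropping /g.
single = re.sub(r"/?$", "/new", path, count=1)
print(single)                                  # /path/to/dir/new

# Joining path components avoids the regex entirely.
joined = posixpath.join(path, "new")
print(joined)                                  # /path/to/dir/new
```

Both the count=1 substitution and the join give the same answer whether or not the input has a trailing slash.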

Replace regex with captured group ONLY

I'm trying to understand why the following does not give me what I think (or want :)) should be returned:
sed -r 's/^(.*?)(Some text)?(.*)$/\2/' list_of_values
or Perl:
perl -lpe 's/^(.*?)(Some text)?(.*)$/$2/' list_of_values
So I want my result to be just the Some text, otherwise (meaning if there was nothing captured in $2) then it should just be EMPTY.
I did notice that with perl it does work if Some text is at the start of the line/string (which baffles me...). (Also noticed that removing ^ and $ has no effect)
Basically, I'm trying to get what grep would return with the --only-matching option as discussed here. Only I want/need to use sub/replace in the regex.
EDITED (added sample data)
Sample input:
$ cat -n list_of_values
1 Black
2 Blue
3 Brown
4 Dial Color
5 Fabric
6 Leather and Some text after that ....
7 Pearl Color
8 Stainless Steel
9 White
10 White Mother-of-Pearl Some text stuff
Desired output:
$ perl -ple '$_ = /(Some text)/ ? $1 : ""' list_of_values | cat -n
1
2
3
4
5
6 Some text
7
8
9
10 Some text
First of all, this shows how to duplicate grep -o using Perl.
You're asking why
foo Some text bar
012345678901234567
results in just an empty string instead of
Some text
Well,
At position 0, ^ matches 0 characters.
At position 0, (.*?) matches 0 characters.
At position 0, (Some text)? matches 0 characters.
At position 0, (.*) matches 17 characters.
At position 17, $ matches 0 characters.
Match succeeds.
You could use
s{^ .*? (?: (Some[ ]text) .* | $ )}{ $1 // "" }exs;
or
s{^ .*? (?: (Some[ ]text) .* | $ )}{$1}xs; # Warns if warnings are on.
Far simpler:
$_ = /(Some text)/ ? $1 : "";
I question your use of -p. Are you sure you want a line of output for each line of input? It seems to me you'd rather have
perl -nle'print $1 if /(Some text)/'
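The step-by-step trace above can be checked in any backtracking engine. Here is a minimal sketch with Python's re module; the helper only_matching is hypothetical and simply mirrors the "Far simpler" one-liner.

```python
import re

line = "foo Some text bar"

# The lazy (.*?) and the optional group both settle for nothing, so the
# trailing (.*) takes the whole line and group 2 never participates.
m = re.match(r"^(.*?)(Some text)?(.*)$", line)
print(m.groups())   # ('', None, 'foo Some text bar')

def only_matching(line, needle="Some text"):
    # Hypothetical helper: return the matched text, or "" when absent.
    found = re.search(re.escape(needle), line)
    return found.group(0) if found else ""

print(repr(only_matching(line)))      # 'Some text'
print(repr(only_matching("Black")))   # ''
```

When the line starts with Some text, the optional group gets its chance at position 0 and succeeds, which is why the original substitution appeared to work in that one case.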

Negative lookbehind assertion regex has unexpected result with grep -P

I'm testing the following negated lookbehind assertion and I want to understand the result:
echo "foo foofoo" | grep -Po '(?<!foo)foo'
it prints out
foo
foo
foo
I was expecting only the first two foo to be printed (the standalone foo and the first half of foofoo), but not the third one, because my assertion is supposed to mean: find 'foo' that is not preceded by 'foo'.
What am I missing? why is the third foo being matched?
Note: grep -P means interpret the regex as perl compatible regex. grep -o means print out only the matched string. My grep is version 2.5.1.
After a big discussion on this issue (that has been moved to the chat) I came to the conclusion that my understanding about the lookbehind negative assertion was correct:
echo "foo foofoo" | grep -Po '(?<!foo)foo'
Should return foo two times.
My version of grep, or the PCRE lib that it was compiled with, is buggy.
Some people tested this command on their machines with different versions of grep and they had different results. Some have seen two foo and others had three foo, like me.
I tested that regex with Perl and I had the expected result, foo two times.
grep man page states that -P option is experimental.
My lesson was: if you want PCRE that really works, use Perl.
I can't reproduce this - running the exact command, I only get two matches.
I'm using GNU grep 2.6.3
However, I find a useful trick for troubleshooting a regex is this - perl allows you to run regex debug:
#!/usr/bin/env perl
use strict;
use warnings;
#dump results
use Data::Dumper;
#set regex into debug mode
use re 'debug';
#iterate __DATA__ below
while ( <DATA> ) {
#apply regex to current line
my @matches = m/(?<!foo)(foo)/g;
print Dumper \@matches;
}
__DATA__
foo foofoo
This gives us output of:
Compiling REx "(?<!foo)(foo)"
Final program:
1: UNLESSM[-3] (7)
3: EXACT <foo> (5)
5: SUCCEED (0)
6: TAIL (7)
7: OPEN1 (9)
9: EXACT <foo> (11)
11: CLOSE1 (13)
13: END (0)
anchored "foo" at 0 (checking anchored) minlen 3
Matching REx "(?<!foo)(foo)" against "foo foofoo"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [0..10] gave 0
Found anchored substr "foo" at offset 0 (rx_origin now 0)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <foo foofoo> | 1:UNLESSM[-3](7)
0 <> <foo foofoo> | 7:OPEN1(9)
0 <> <foo foofoo> | 9:EXACT <foo>(11)
3 <foo> < foofoo> | 11:CLOSE1(13)
3 <foo> < foofoo> | 13:END(0)
Match successful!
Matching REx "(?<!foo)(foo)" against " foofoo"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [3..10] gave 4
Found anchored substr "foo" at offset 4 (rx_origin now 4)...
(multiline anchor test skipped)
try at offset...
Intuit: Successfully guessed: match at offset 4
4 <foo > <foofoo> | 1:UNLESSM[-3](7)
1 <f> <oo foofoo> | 3: EXACT <foo>(5)
failed...
4 <foo > <foofoo> | 7:OPEN1(9)
4 <foo > <foofoo> | 9:EXACT <foo>(11)
7 <foo foo> <foo> | 11:CLOSE1(13)
7 <foo foo> <foo> | 13:END(0)
Match successful!
Matching REx "(?<!foo)(foo)" against "foo"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [7..10] gave 7
Found anchored substr "foo" at offset 7 (rx_origin now 7)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 7
7 <foo foo> <foo> | 1:UNLESSM[-3](7)
4 <foo > <foofoo> | 3: EXACT <foo>(5)
7 <foo foo> <foo> | 5: SUCCEED(0)
subpattern success...
failed...
Match failed
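As a cross-check in another engine, here is a small sketch with Python's re module, which supports the same fixed-width negative lookbehind. It reports the start offset of each match, so it is easy to see which foo is rejected.

```python
import re

text = "foo foofoo"

# Keep the offset of each match to see which "foo" survives the lookbehind.
matches = [(m.start(), m.group(0)) for m in re.finditer(r"(?<!foo)foo", text)]
print(matches)   # [(0, 'foo'), (4, 'foo')]
```

The foo at offset 7 is missing because the three characters before it are foo, which agrees with the "Match failed" at the end of the Perl trace.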

POSIX Regular Expressions Limit Repetitions

I am trying to grep for a maximum number of repetitions allowed on my input string and can't seem to get it working.
The input file has three lines with 1, 5 and 7 repetitions of "pq" respectively. The >=3 and >=5 expressions work fine, but the "between 3 and 5" expression {3,5} shows the line with seven repetitions as well.
DEV /> cat input.txt
pq -- One occurance of pq
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
DEV /> grep "\(pq\)\{3,\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
DEV /> grep "\(pq\)\{5\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
DEV /> grep "\(pq\)\{3,5\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
Am I doing something wrong or is this the expected behavior?
If this is the expected behavior ( as the string with 7 PQs has between 3-5 PQs),
1) in what cases is the maximum repetitions applicable? What would be the difference between {3,5} and {3,} (greater than 3)?
2) I can anchor my regular expressions with "^", but what if my string does not end with "pq" and has more text?
If a line has seven repetitions of anything, it also therefore contains between 3–5 repetitions of that thing, and at several points, no less.
Use match anchors if you expect matches to be anchored. Otherwise, of course, they are not.
The practical difference between /X{3,}/ and /X{3,5}/ is how long a string it matches, that is, the extent (or span) of the match. If all you are looking for is a boolean yes/no response and there is nothing further in your pattern, it does not make much of a difference; in fact, a moderately clever regex engine will return early if it knows it is safe to do so.
One way to see the difference is with GNU grep’s ‑o or ‑‑only‐matching option. Watch:
$ echo 123456789 | egrep -o '[0-9]{3}'
123
456
789
$ echo 123456789 | egrep -o '[0-9]{3,}'
123456789
$ echo 123456789 | egrep -o '[0-9]{3,5}'
12345
6789
$ echo 123456789 | egrep -o '[0-9]{3,5}[2468]'
123456
$ echo 1234567890 | egrep -o '[0-9]{3,5}[13579]'
12345
6789
To understand how those last two work, it is useful to get a trace of the regex engine’s attempts, including backtracking steps. You can do this using Perl in this way:
$ perl -Mre=debug -le 'print $& while 1234567890 =~ /\d{3,5}[13579]/g'
Compiling REx "\d{3,5}[13579]"
Final program:
1: CURLY {3,5} (4)
3: DIGIT (0)
4: ANYOF[13579][] (15)
15: END (0)
stclass DIGIT minlen 4
Matching REx "\d{3,5}[13579]" against "1234567890"
Matching stclass DIGIT against "1234567" (7 chars)
0 <> <1234567890> | 1:CURLY {3,5}(4)
DIGIT can match 5 times out of 5...
5 <12345> <67890> | 4: ANYOF[13579][](15)
failed...
4 <1234> <567890> | 4: ANYOF[13579][](15)
5 <12345> <67890> | 15: END(0)
Match successful!
12345
Matching REx "\d{3,5}[13579]" against "67890"
Matching stclass DIGIT against "67" (2 chars)
5 <12345> <67890> | 1:CURLY {3,5}(4)
DIGIT can match 5 times out of 5...
10 <1234567890> <> | 4: ANYOF[13579][](15)
failed...
9 <123456789> <0> | 4: ANYOF[13579][](15)
failed...
8 <12345678> <90> | 4: ANYOF[13579][](15)
9 <123456789> <0> | 15: END(0)
Match successful!
6789
Freeing REx: "\d{3,5}[13579]"
When you have additional constraints about what comes after the match, then which type of repetition you choose can make a big difference. Here I’ll impose a constraint on where each match is allowed to finish, by saying it needs to end before an odd digit:
$ perl -le 'print $& while 1234567890 =~ /\d{3}(?=[13579])/g'
234
678
$ perl -le 'print $& while 1234567890 =~ /\d{3,5}(?=[13579])/g'
1234
5678
$ perl -le 'print $& while 1234567890 =~ /\d{3,}(?=[13579])/g'
12345678
So when you have things that have to come afterwards, it can make a great deal of difference. When you are just deciding whether the entire line matches something, it may not be as important.
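The lookahead examples above can be reproduced in Python as a quick sanity check. This is a sketch with the standard re module, not grep or Perl, but for these patterns the results are the same.

```python
import re

digits = "1234567890"

# Each match must end just before an odd digit.
exactly_three = re.findall(r"\d{3}(?=[13579])", digits)
three_to_five = re.findall(r"\d{3,5}(?=[13579])", digits)
three_or_more = re.findall(r"\d{3,}(?=[13579])", digits)

print(exactly_three)   # ['234', '678']
print(three_to_five)   # ['1234', '5678']
print(three_or_more)   # ['12345678']

# Without a trailing constraint the upper bound only limits the span.
print(re.findall(r"\d{3,5}", "123456789"))   # ['12345', '6789']
```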
This is expected behavior. The string "pqpqpqpqpqpqpq" does in fact have between three and five repetitions of "pq", and then a few more for good measure. You may want to try anchoring your regular expression, something like ^\(pq\)\{3,5\}$.
Edit to match edited question:
The maximum is applicable in all situations. What is happening is that grep is matching 5 of the 7 repetitions of "pq" (most likely the first five), and since it found a match it prints out the line.
You'll have to figure out a way to change your regex to match what you want and not match what you don't. For example, to match a line starting with 3–5 repetitions of "pq", you might do something like this with grep -E: ^(pq){3,5}($|[^p]|p$|p[^q]). That matches 3–5 "pq"s followed immediately by end-of-line or any-character-other-than-"p" or "p"-followed-by-end-of-line or "p"-followed-by-any-character-other-than-"q".
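To see the difference between the unanchored and the bounded pattern on the question's three lines, here is a rough sketch using Python's re module instead of grep. The alternation is the one described above, written with non-capturing groups; the list of lines is copied from the sample input.

```python
import re

lines = [
    "pq -- One occurance of pq",
    "pqpqpqpqpq -five occurances of pq",
    "pqpqpqpqpqpqpq -- seven occurances of pq",
]

# Unanchored: any run of 3 to 5 "pq" anywhere in the line is enough.
unanchored = re.compile(r"(?:pq){3,5}")

# Bounded: 3 to 5 "pq" at the start, and whatever follows must not be "pq".
bounded = re.compile(r"^(?:pq){3,5}(?:$|[^p]|p$|p[^q])")

hits_unanchored = [i for i, s in enumerate(lines) if unanchored.search(s)]
hits_bounded = [i for i, s in enumerate(lines) if bounded.search(s)]

print(hits_unanchored)   # [1, 2]
print(hits_bounded)      # [1]
```

The seven-repetition line drops out of the second result because after three, four or five repetitions the next two characters are always "pq".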