perl regex for extracting multiline blocks - regex

I have text like this:
00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
So, I don't have a block end, just a new block start.
I want to recursively get all blocks:
1 = 00:00 stuff
2 = 00:01 more stuff
multi line
and going
etc
The code below only gives me this:
$VAR1 = '00:00';
$VAR2 = '';
$VAR3 = '00:01';
$VAR4 = '';
$VAR5 = '00:02';
$VAR6 = '';
What am I doing wrong?
my $text = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
';
my @array = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;
print Dumper(@array);

Version 5.10.0 introduced named capture groups that are useful for matching nontrivial patterns.
(?'NAME'pattern)
(?<NAME>pattern)
A named capture group. Identical in every respect to normal capturing parentheses () but for the additional fact that the group can be referred to by name in various regular expression constructs (such as \g{NAME}) and can be accessed by name after a successful match via %+ or %-. See perlvar for more details on the %+ and %- hashes.
If multiple distinct capture groups have the same name then the $+{NAME} will refer to the leftmost defined group in the match.
The forms (?'NAME'pattern) and (?<NAME>pattern) are equivalent.
Named capture groups allow us to name subpatterns within the regex as in the following.
use 5.10.0; # named capture buffers
my $block_pattern = qr/
(?<time>(?&_time)) (?&_sp) (?<desc>(?&_desc))
(?(DEFINE)
# timestamp at logical beginning-of-line
(?<_time> (?m:^) [0-9][0-9]:[0-9][0-9])
# runs of spaces or tabs
(?<_sp> [ \t]+)
# description is everything through the end of the record
(?<_desc>
# s switch makes . match newline too
(?s: .+?)
# terminate before optional whitespace (which we remove) followed
# by either end-of-string or the start of another block
(?= (?&_sp)? (?: $ | (?&_time)))
)
)
/x;
Use it as in
my $text = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
';
while ($text =~ /$block_pattern/g) {
print "time=[$+{time}]\n",
"desc=[[[\n",
$+{desc},
"]]]\n\n";
}
Output:
$ ./blocks-demo
time=[00:00]
desc=[[[
stuff
]]]
time=[00:01]
desc=[[[
more stuff
multi line
and going
]]]
time=[00:02]
desc=[[[
still
have
]]]

This should do the trick. Beginning of next \d\d:\d\d is treated as block end.
use strict;
my $Str = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
00:03 still
have' ;
my @Blocks = ($Str =~ m#(\d\d:\d\d.+?(?:(?=\d\d:\d\d)|$))#gs);
print join "--\n", @Blocks;

Your problem is that .*? is non-greedy in the same way that .* is greedy. When it is not forced, it matches as little as possible, which in this case is the empty string.
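To see the effect in isolation, here is a minimal sketch on a single line (the sample string and variable names are made up for illustration). The only difference between the two patterns is the trailing $.

```perl
use strict;
use warnings;

my $line = "00:01 more stuff";

# Nothing follows the lazy group, so it is satisfied with zero characters.
my ($time1, $rest1) = $line =~ /^(\d\d:\d\d)(.*?)/;

# The trailing $ forces the lazy group to extend to the end of the string.
my ($time2, $rest2) = $line =~ /^(\d\d:\d\d)(.*?)$/;

print "unanchored: [$rest1]\n";
print "anchored:   [$rest2]\n";
```

The first capture comes back empty, the second holds " more stuff".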
So, you'll need something after the non-greedy match to anchor up your capture. I came up with this regex:
my @array = $text =~ m/\n?([0-9]{2}:[0-9]{2}.*?)(?=\n[0-9]{2}:|$)/gs;
As you see, I removed the /m option to accurately be able to match end of string in the look-ahead assertion.
You might also consider this solution:
my @array = split /(?=[0-9]{2}:[0-9]{2})/, $text;
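A possible variation on the split idea, offered only as a sketch: anchoring the look-ahead with ^ under /m is my own assumption, so that a timestamp-like string in the middle of a line would not start a new block. The sample text is the one from the question.

```perl
use strict;
use warnings;

my $text = "00:00 stuff\n00:01 more stuff\nmulti line\nand going\n00:02 still\nhave\n";

# Split just before every timestamp that begins a line.
# A zero-width match at the very start of the string yields no empty field.
my @blocks = split /^(?=[0-9]{2}:[0-9]{2})/m, $text;

print "[$_]\n" for @blocks;
```

Each element keeps its trailing newline; chomp the elements if that is not wanted.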

Get the second string of the URI with Perl regex

I need to get the second part of the URI, the possible URI are:
/api/application/v1/method
/web/application/v1/method
I can get "application" using:
([^\/api]\w*)
and
([^\/web]\w*)
But I know this is not the best approach. What would be a good way?
Thanks!
Edit: thank you all for the input. The goal was to set the second part of the URI into a header in Apache with rewrite rules.
A general regex (Perl or PCRE syntax) solution would be:
^/[^/]+/([^/]+)
Each section is delimited with /, so just capture as many non-/ characters as there are.
This is preferable to non-greedy regexes because it does not need to backtrack, and allows for whatever else the sections may contain, which can easily contain non-word characters such as - that won't be matched by \w.
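As a quick illustration of that pattern in Perl (the second sample path and the variable names are made up for this sketch), note that a component containing - is captured whole:

```perl
use strict;
use warnings;

my @seconds;
for my $path ('/api/application/v1/method', '/web/my-app') {
    # Skip the first component, capture the second.
    if ( my ($second) = $path =~ m{^/[^/]+/([^/]+)} ) {
        push @seconds, $second;
        print "$path -> $second\n";
    }
}
```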
There are many ways to do this, and I'm not sure which one is best, but it could be as simple as:
\/(.+?)\/(.+?)\/.*
where the desired output is in the second capturing group, $2.
Example
#!/usr/bin/perl -w
use strict;
use warnings;
use feature qw( say );
main();
sub main{
my $string = '/api/application/v1/method
/web/application/v1/method';
my $pattern = '\/(.+?)\/(.+?)\/.*';
my $match = replace($pattern, '$2', $string);
say $match , " is a match 💚💚💚 ";
}
sub replace {
my ($pattern, $replacement, $string) = @_;
$string =~s/$pattern/$replacement/gee;
return $string;
}
Output
application
application is a match 💚💚💚
Advice
zdim advises that:
A legitimate approach, notes:
(1) there is no need for the trailing .*
(2) Need /|$ (not just /), in case the path finishes without / (to
terminate the non-greedy pattern at the end of string, if there is no
/)
(3) note though that /ee can be vulnerable (even just to errors),
since the second evaluation (e) will run code if the first evaluation
results in code. And it may be difficult to ensure that that is always
done under full control. More to the point, for this purpose there is
no reason to run a substitution --- just match and capture is enough.
With all the regex, explicitly asked for, I'd like to bring up other approaches.
These also parse only a (URI style) path, like the regex ones, and return the second directory.
The most basic and efficient one, just split the string on /
my $dir = ( split /\//, $path )[2];
The split returns '' first (before the first /) thus we need the third element. (Note that we can use an alternate delimiter for the separator pattern, it being regex: split m{/}, $path.)
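A small sketch that makes the empty leading field visible (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $path   = '/api/application/v1/method';
my @fields = split m{/}, $path;

# The leading '/' produces an empty first field, so index 2 is the
# second directory.
print join('|', map { "'$_'" } @fields), "\n";
print "second directory: $fields[2]\n";
```

This prints ''|'api'|'application'|'v1'|'method' followed by the second directory, application.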
Use appropriate modules, for example URI
use URI;
my $dir = ( URI->new($path)->path_segments )[2];
or Mojo::Path
use Mojo::Path;
my $dir = Mojo::Path->new($path)->parts->[1];
What to use depends on the details of what you do: if you've got any other work with URLs and the web then you clearly want modules for that; otherwise they may (or may not) be overkill.
I've benchmarked these for a sanity check of what one is paying with modules.
The split either beats regex by up to 10-15% (the regex using negated character class and the one based on non-greedy .+? come around the same), or is about the same with them. They are faster than Mojo by about 30%, and only URI lags seriously, by a factor of 5 behind Mojo.
That's for paths typical for real-life URLs, with a handful of short components. With only two very long strings (10k chars), Mojo::Path (surprisingly for me) is a factor of six ahead of split (!), which is ahead of character-class regex by more than an order of magnitude.
The negated-character-class regex for such long strings beats the non-greedy (.+?) one by a factor of 3, good to know in its own right.
In all this the URI and Mojo objects were created once, ahead of time.
Benchmark code. I'd like to note that the details of these timings are far less important than the structure and quality of code.
use warnings;
use strict;
use feature 'say';
use URI;
use Mojo::Path;
use Benchmark qw(cmpthese);
my $runfor = shift // 3; #/
#my $path = '/' . 'a' x 10_000 . '/' . 'X' x 10_000;
my $path = q(/api/app/v1/method);
my $uri = URI->new($path);
my $mojo = Mojo::Path->new($path);
sub neg_cc {
my ($dir) = $path =~ m{ [^/]+ / ([^/]+) }x; return $dir; #/
}
sub non_greedy {
my ($dir) = $path =~ m{ .+? / (.+?) (?:/|$) }x; return $dir; #/
}
sub URI_path {
my $dir = ( $uri->path_segments )[2]; return $dir;
}
sub Mojo_path {
my $dir = $mojo->parts->[1]; return $dir;
}
sub just_split {
my $dir = ( split /\//, $path )[2]; return $dir;
}
cmpthese( -$runfor, {
neg_cc => sub { neg_cc($path) },
non_greedy => sub { non_greedy($path) },
just_split => sub { just_split($path) },
URI_path => sub { URI_path($path) },
Mojo_path => sub { Mojo_path($path) },
});
With a (10-second) run this prints, on a laptop with v5.16
Rate URI_path Mojo_path non_greedy neg_cc just_split
URI_path 146731/s -- -82% -87% -87% -89%
Mojo_path 834297/s 469% -- -24% -28% -36%
non_greedy 1098243/s 648% 32% -- -5% -16%
neg_cc 1158137/s 689% 39% 5% -- -11%
just_split 1308227/s 792% 57% 19% 13% --
One should keep in mind that the overhead of the function-call is very large for such a simple job, and in spite of Benchmark's work these numbers are probably best taken as a cursory guide.
Your pattern ([^\/api]\w*) consists of a capturing group that starts with a negated character class, which first matches exactly one character that is not /, a, p or i.
After that, zero or more word characters are matched. So the pattern could, for example, match just a single character that is not listed in the character class.
What you might do is use a capturing group and match \w+
^/(?:api|web)/(\w+)/v1/method
Explanation
^ Start of string
(?:api|web) Non capturing group with alternation. Match either api or web
(\w+) Capturing group 1, match 1+ word chars
/v1/method Match literally as in your example data.

Perl Regular Expression Needed

I am looking for a Perl Regex to extract all images from the following code:
slideshowimages("images/image1.jpg","images/image2.png","images/image3.jpg")
slideshowimages('images/image4.jpg','images/image5.png','images/image6.jpg')
The output would be an array with:
images/image1.jpg, images/image2.png, images/image3.jpg, images/image4.jpg,
images/image5.png, images/image6.jpg
The things to pay attention to are:
Image extension could be .jpg / .png
Image could be inside single or double quotes
What I've tried so far:
while ($html =~ /["|'|=](.*\.jpg|gif|png|bmp|swf).*"/g) {
my $item = $1;
$item =~ s/\"|\'|=//g;
push (@images, $item);
}
But this doesn't really work well.
Thank you for your help.
my @images = $html =~ /["'] (.+?) ["']/gx; # /g for multiple hits on a line
A simple regex could be:
["'][^'"]*[.](jpg|png)["']
It would fail though on paths including ' or " like slideshowimages("images/image'1'.jpg"). The expression that can handle this case would be:
('.*?\.(jpg|png)')|(".*?\.(jpg|png)")
/(["']).*?\.(jpg|gif|png|bmp|swf)\1/
You need to capture what you're using, a single or a double quote, and refer back to that.
Also, you need to group your choices for the extension. Otherwise, you're saying 'anything ending with jpg OR just plain gif (, png, and so on)'.
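The grouping point can be checked with a tiny sketch (the file name is made up and deliberately contains "gif" without being an image):

```perl
use strict;
use warnings;

my $name = "gifts.txt";

# Ungrouped: reads as (\.jpg) OR (gif), so a bare "gif" anywhere matches.
my $ungrouped = $name =~ /\.jpg|gif/ ? 1 : 0;

# Grouped: a literal dot followed by one of the extensions.
my $grouped = $name =~ /\.(?:jpg|gif)/ ? 1 : 0;

print "ungrouped=$ungrouped grouped=$grouped\n";
```

The ungrouped pattern matches this name, the grouped one does not.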
Adjusting @mpapec's answer:
my @images = $html =~ /(["'])[^"']*?\.(jpg|gif|png|bmp|swf)\1/g;
(Disallowing " and ' within the filename is probably a good idea too).
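One thing to watch with a back-referenced quote: in list context with /g the match returns every capture group, so the quote character ends up in the result too. A minimal sketch that sidesteps this by looping and keeping only the path (the extra capturing group around the path is my own addition, and the sample data is the question's):

```perl
use strict;
use warnings;

my $html = <<'EOF';
slideshowimages("images/image1.jpg","images/image2.png","images/image3.jpg")
slideshowimages('images/image4.jpg','images/image5.png','images/image6.jpg')
EOF

my @images;
while ( $html =~ /(["'])([^"']*?\.(?:jpg|gif|png|bmp|swf))\1/g ) {
    push @images, $2;    # $1 is the quote character, $2 the path
}

print join(", ", @images), "\n";
```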
my $html = <<EOF;
slideshowimages("images/image'1'.jpg","images/image2.png","images/image3.jpg")
slideshowimages('images/image4.jpg','images/image5.png','images/image6jpg')
EOF
my @images = ();
while ($html =~ s/(["'])(.+?\/.+?\.(?:jpg|png|gif))\1//) {
push @images, $2;
}
foreach my $image (@images) {
print "$image, ";
}
This is a bit more restrictive, so it does not pick up 'images/image6jpg' (which is missing the dot before the extension).
Does not fail on "images/image'1'.jpg"
(["']) - either a single or double quote (any character between the brackets) and capture
( - begin capture
.+?/ - one or more characters up to first slash
.+?\. - one or more characters up to first period
(?: - begin grouping without capture
jpg|png|gif - any of the strings
) - end grouping without capture
) - end capture
\1 - captured single or double quote

Get date in different format in Perl?

I need to get the date which could be in 3 possible format.
11/20/2012
11.20.2012
11-20-2012
How could I achieve this in Perl. I'm trying RegEx to get what I want. Here's my code.
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012"); #array values may vary in every run
foreach my $date (@dates){
$date =~ /[-.\/\d+]/g;
print "Date: $date \n";
}
I want the output to be. (code above doesn't print anything)
Date: 11/20/2012
Date: 2012.11.20
Date: 20-11-2012
Where am I wrong? Please Help. Thanks
Note: I want to achieve this without using any CPAN module as much as possible. I know there are a lot of CPAN modules that could provide what I want.
Your code almost produces what you want. I assume your input is a bit more complicated, or you have posted code that you are not actually running.
Either way, the problem is this
$date =~ /[-.\/\d+]/g;
First off, your plus multiplier is inside the character class: It should be after it. Second, it is just a pattern match, you need to use it in list context, and store its return value:
my ($match) = $date =~ /[-.\/\d]+/g;
print "Date: $match\n";
Then it will return the first of the strings found that contains one or more of dash, period, slash or a number. Be aware that it will match other things as well, as it is a rather unstrict regex.
Why does it work? Because a pattern match in list context returns a list of the matches when the global /g modifier is used.
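A short sketch of how context changes what the match returns (the sample string with two dates is made up for illustration):

```perl
use strict;
use warnings;

my $text = "from 11/20/2012 to 11-21-2012";

my @all     = $text =~ /[-.\/\d]+/g;   # list context: every match
my ($first) = $text =~ /[-.\/\d]+/g;   # list context, keep only the first
my $matched = $text =~ /[-.\/\d]+/;    # scalar context: true/false only

print "all: @all\n";
print "first: $first\n";
print "matched: ", ($matched ? "yes" : "no"), "\n";
```

The parentheses around $first are what put the match in list context; without them you only get the success flag.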
I highly recommend the DateTime::Format::Strptime module, which has a rich set of functionality. Think not only of parsing strings, but also of checking that the date is valid.
Why not search for the formats one at a time?
=~ m!(\d{2}/\d{2}/\d{4}|\d{4}\.\d{2}\.\d{2}|\d{2}-\d{2}-\d{4})!
should do the trick. Other than that, there's a module dealing with dates called DateTime.
Try matching the formats in turn. The regex below matches any of your permitted separators (/, ., or -) and then requires the same separator via backreference (\2 or \3). Otherwise, you have three possible separators times two possible positions for the year to make six alternatives in your pattern.
#! /usr/bin/env perl
use strict;
use warnings;
#array values may vary in every run
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012");
my $date_pattern = qr<
\b # begin on word boundary
(
(?: [0-9][0-9] ([-/.]) [0-9][0-9] \2 [0-9][0-9][0-9][0-9])
| (?: [0-9][0-9][0-9][0-9] ([-/.]) [0-9][0-9] \3 [0-9][0-9])
)
\b # end on word boundary
>x;
foreach my $date (@dates) {
if (my($match) = $date =~ /$date_pattern/) {
print "Date: $match\n";
}
}
Output:
Date: 11/20/2012
Date: 2012.11.20
Date: 20-11-2012
On my first try at the code above, I had \2 in the YYYY-MM-DD alternative where I should have had \3, which failed to match. To spare us counting parentheses, version 5.10.0 added named capture buffers.
Named Capture Buffers
It is now possible to name capturing parenthesis in a pattern and refer to the captured contents by name. The naming syntax is (?<NAME>....). It's possible to backreference to a named buffer with the \k<NAME> syntax. In code, the new magical hashes %+ and %- can be used to access the contents of the capture buffers.
Using this handy feature, the code above becomes
#! /usr/bin/env perl
use 5.10.0; # named capture buffers
use strict;
use warnings;
#array values may vary in every run
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012");
my $date_pattern = qr!
\b # begin on word boundary
(?<date>
(?: [0-9][0-9] (?<sep>[-/.]) [0-9][0-9] \k{sep} [0-9][0-9][0-9][0-9])
| (?: [0-9][0-9][0-9][0-9] (?<sep>[-/.]) [0-9][0-9] \k{sep} [0-9][0-9])
)
\b # end on word boundary
!x;
foreach my $date (@dates) {
if ($date =~ /$date_pattern/) {
print "Date: $+{date}\n";
}
}
and produces the same output.
The code above still contains a lot of repetition. Using the (DEFINE) special case combined with named captures, we can make the pattern much nicer.
#! /usr/bin/env perl
use 5.10.0;
use strict;
use warnings;
#array values may vary in every run
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012");
my $date_pattern = qr!
\b (?<date> (?&YMD) | (?&DMY)) \b
(?(DEFINE)
(?<SEP> [-/.])
(?<YYYY> [0-9][0-9][0-9][0-9])
(?<MM> [0-9][0-9])
(?<DD> [0-9][0-9])
(?<YMD> (?&YYYY) (?<sep>(?&SEP)) (?&MM) \k<sep> (?&DD))
(?<DMY> (?&DD) (?<sep>(?&SEP)) (?&MM) \k<sep> (?&YYYY))
)
!x;
foreach my $date (@dates) {
if ($date =~ /$date_pattern/) {
print "Date: $+{date}\n";
}
}
Yes, the subpattern named DMY also matches dates in MDY form. For now it suffices, and you ain't gonna need it.

Perl regular expression isn't greedy enough

I'm writing a regular expression in perl to match perl code that starts the definition of a perl subroutine. Here's my regular expression:
my $regex = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';
$regex matches code that starts a subroutine. I'm also trying to capture the name of the subroutine in $1 and any white space and comments between the subroutine name and the initial open brace in $2. It's $2 that is giving me a problem.
Consider the following perl code:
my $x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
$x = 2;
return;
}
When I put this perl code into a string and match it against $regex, $2 is "# This is comment 3.\n", not the three lines of comments that I want. I thought the regular expression would greedily put all three lines of comments into $2, but that seems not to be the case.
I would like to understand why $regex isn't working and to design a simple replacement. As the program below shows, I have a more complex replacement ($re3) that works. But I think it's important for me to understand why $regex doesn't work.
use strict;
use English;
my $code_string = <<END_CODE;
my \$x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
\$x = 2;
return;
}
END_CODE
my $re1 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';
my $re2 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{';
my $re3 = '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{';
print "\$code_string is '$code_string'\n";
if ($code_string =~ /$re1/) {print "For '$re1', \$2 is '$2'\n";}
if ($code_string =~ /$re2/) {print "For '$re2', \$2 is '$2'\n";}
if ($code_string =~ /$re3/) {print "For '$re3', \$2 is '$2'\n";}
exit 0;
__END__
The output of the perl script above is the following:
$code_string is 'my $x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
$x = 2;
return;
}
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{', $2 is '
# This is comment 1.
# This is comment 2.
# This is comment 3.
'
Look at only the part of your regex that captures $2. It is (\s*#.*\n). By itself, this can only capture a single comment line. You have an asterisk after it in order to capture multiple comment lines, and this works just fine. It captures multiple comment lines and puts each of them into $2, one by one, each time replacing the previous value of $2. So the final value of $2 when the regex is done matching is the last thing that the capturing group matched, which is the final comment line. Only. To fix it, you need to put the asterisk inside the capturing group. But then you need to put another set of parentheses (non-capturing, this time) to make sure the asterisk applies to the whole thing. So instead of (\s*#.*\n)*, you need ((?:\s*#.*\n)*).
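The same behavior shows up in a much smaller pattern. This sketch uses a made-up three-letter string to contrast a repeated capturing group with a capture wrapped around a repeated non-capturing group:

```perl
use strict;
use warnings;

my $s = "abc";

# The group repeats three times; only the last iteration is kept.
my ($last) = $s =~ /^(\w)*$/;

# Repetition inside the capture: the whole run is kept.
my ($all) = $s =~ /^((?:\w)*)$/;

print "last=$last all=$all\n";
```

This prints last=c all=abc.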
Your third regex works because you unwittingly surrounded the whole expression in parentheses so that you could put a question mark after it. This caused $2 to capture all the comments at once, and $3 to capture only the final comment.
When you are debugging your regex, make sure you print out the values of all the match variables you are using: $1, $2, $3, etc. You would have seen that $1 was just the name of the subroutine and $2 was only the third comment. This might have led you to wonder how on earth your regex skipped over the first two comments when there is nothing between the first and second capturing groups, which would eventually lead you in the direction of discovering what happens when a capturing group matches multiple times.
By the way, it looks like you are also capturing any whitespace after the subroutine name into $1. Is this intentional? (Oops, I messed up my mnemonics and thought \w was "w for whitespace".)
If you add repetition to a capturing group, it will only capture the final match of that group. This is why $regex only matches the final comment line.
Here is how I would rewrite your regex:
my $regex = '\s*sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)*)\s*\{';
This is very similar to your $re3, except for the following changes:
The white space and comment matching portion is now in a non-capturing group
I changed that portion of the regex from ((...)+)? to ((...)*) which is equivalent.
The problem is that by default the \n isn't part of the string. The regex stops matching at \n.
You need to use the s modifier for multi-line matches:
if ($code_string =~ /$re1/s) {print "For '$re1', \$2 is '$2'\n";}
Note the s after the regex.

Why can't I match a substring which may appear 0 or 1 time using /(subpattern)?/

The original string is like this:
checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19
The last part "fail1:19" may appear 0 or 1 time. And I tried to match the number after "fail1:", which is 19, using this:
($reg_suc, $reg_fail) = ($1, $2) if $line =~ /^checksession\s+ok:(\d+).*(fail1:(\d+))?/;
It doesn't work. The $2 variable is empty even if the "fail1:19" does exist. If I delete the "?", it can match only if the "fail1:19" part exists. The $2 variable will be "fail1:19". But if the "fail1:19" part doesn't exist, neither $1 nor $2 matches. This is incorrect.
How can I rewrite this pattern to capture the two numbers correctly? That means when the "fail1:19" part exists, two numbers will be recorded, and when it doesn't exist, only the number after "ok:" will be recorded.
First, the number in the fail field would end up in $3, as those variables are numbered according to their opening parentheses. Second, as codaddict shows, the .* construct in a regex is greedy, so it will eat even the fail... part. Third, you can avoid numbered variables like this:
my $line = "checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19";
if(my ($reg_suc, $reg_fail, $addend)
= $line =~ /^checksession\s+ok:(\d+).*?(fail1:(\d+))?$/
) {
warn "$reg_suc\n$reg_fail\n$addend\n";
}
Try the regex:
^checksession\s+ok:(\d+).*?(fail1:(\d+))?$
Changes made:
.* in the middle has been made non-greedy, and
$ (end anchor) has been added.
As a result of above changes .*? will try to consume as little as possible and the end anchor forces the regex to match till the end of the string, matching fail1:number if present.
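A minimal sketch that runs the pattern against both shapes of input. Making the outer fail1 group non-capturing and printing 'none' for a missing value are my own choices for the illustration, not part of the answer above.

```perl
use 5.010;    # for the // (defined-or) operator
use strict;
use warnings;

my @lines = (
    "checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19",
    "checksession ok:6178 avg:479 avgnet:480 MaxTime:18081",
);

my @results;
for my $line (@lines) {
    # Lazy .*? plus the end anchor lets the optional group see fail1 if present.
    if ( my ($ok, $fail) = $line =~ /^checksession\s+ok:(\d+).*?(?:fail1:(\d+))?$/ ) {
        push @results, [ $ok, $fail ];
        printf "ok=%s fail=%s\n", $ok, $fail // 'none';
    }
}
```

When the fail1 part is absent, the second capture is undef, so test it with defined before using it.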
I think this is one of the few cases where a split is actually more robust than a regex:
$bar[0]="checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19";
$bar[1]="checksession ok:6178 avg:479 avgnet:480 MaxTime:18081";
for $line (@bar){
(@fields) = split/ /,$line;
$reg_suc = $fields[1];
$reg_fail = $fields[5];
print "$reg_suc $reg_fail\n";
}
I try to avoid the non-greedy modifier. It often bites back. Kudos for suggesting split, but I'd go a step further:
my %rec = split /\s+|:/, ( $line =~ /^checksession (.*)/ )[0];
print "$rec{ok} $rec{fail1}\n";