Perl Regular Expression Needed - regex

I am looking for a Perl Regex to extract all images from the following code:
slideshowimages("images/image1.jpg","images/image2.png","images/image3.jpg")
slideshowimages('images/image4.jpg','images/image5.png','images/image6.jpg')
The output would be an array with:
images/image1.jpg, images/image2.png, images/image3.jpg, images/image4.jpg,
images/image5.png, images/image6.jpg
The things to pay attention to are:
Image extension could be .jpg / .png
Image could be inside single or double quotes
What I've tried so far:
while ($html =~ /["|'|=](.*\.jpg|gif|png|bmp|swf).*"/g) {
my $item = $1;
$item =~ s/\"|\'|=//g;
push(@images, $item);
}
But this doesn't really work well.
Thank you for your help.

my @images = $html =~ /["'] (.+?) ["']/gx; # /g for multiple hits on a line

A simple regex could be:
["'][^'"]*[.](jpg|png)["']
It would fail, though, on paths that include ' or ", like slideshowimages("images/image'1'.jpg"). An expression that can handle this case would be:
('.*?\.(jpg|png)')|(".*?\.(jpg|png)")

/(["']).*?\.(jpg|gif|png|bmp|swf)\1/
You need to capture what you're using, a single or a double quote, and refer back to that.
Also, you need to group your choices for the extension. Otherwise, you're saying 'anything ending with jpg OR just plain gif (, png, and so on)'.
Adjusting @mpapec's answer:
my @images = $html =~ /(["'])[^"']*?\.(jpg|gif|png|bmp|swf)\1/g;
(Disallowing " and ' within the filename is probably a good idea too).
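One thing to watch with the one-liner above: in list context a /g match returns every capture group, so with the quote and the extension captured you get quote/extension pairs rather than paths. A minimal sketch that keeps the backreference but collects the path from its own group, assuming the sample input from the question (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $html = <<'EOF';
slideshowimages("images/image1.jpg","images/image2.png","images/image3.jpg")
slideshowimages('images/image4.jpg','images/image5.png','images/image6.jpg')
EOF

my @images;
# Group 1 remembers the opening quote, group 2 holds the path,
# and the extension alternatives sit in a non-capturing group.
while ($html =~ /(["'])([^"']*?\.(?:jpg|gif|png|bmp|swf))\1/g) {
    push @images, $2;
}
print join(", ", @images), "\n";
```

Since quotes are disallowed inside the path here, the backreference mostly serves to reject a value that opens with one kind of quote and closes with the other.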

my $html = <<EOF;
slideshowimages("images/image'1'.jpg","images/image2.png","images/image3.jpg")
slideshowimages('images/image4.jpg','images/image5.png','images/image6jpg')
EOF
my @images = ();
while ($html =~ s/(["'])(.+?\/.+?\.(?:jpg|png|gif))\1//) {
push @images, $2;
}
foreach my $image (@images) {
print "$image, ";
}
This is a bit more restrictive, so it does not pick up "images/image6jpg"
Does not fail on "images/image'1'.jpg"
(["']) - either a single or double quote (any character between the brackets) and capture
( - begin capture
.+?\/ - one or more characters up to the first slash
.+?\. - one or more characters up to the first period that is followed by one of the extensions
(?: - begin grouping without capture
jpg|png|gif - any of the strings
) - end grouping without capture
) - end capture
\1 - captured single or double quote

Related

perl Regex replace for specific string length

I am using Perl to do some prototyping.
I need an expression to replace e by [ee] if the string is exactly 2 chars and finishes by "e".
le -> l [ee]
me -> m [ee]
elle -> elle : no change
I cannot test the length of the string, I need one expression to do the whole job.
I tried:
`s/(?=^.{0,2}\z).*e\z%/[ee]/g` but this is replacing the whole string
`s/^[c|d|j|l|m|n|s|t]e$/[ee]/g` same result (I listed the possible letters that could precede my "e")
`^(?<=[c|d|j|l|m|n|s|t])e$/[ee]/g` but I have no match, not sure I can use ^ on a positive look behind
EDIT
Guys you're amazing, hours of search on the web and here I get answers minutes after I posted.
I tried all your solutions and they are working perfectly directly in my script, i.e. this one:
my $test2="le";
$test2=~ s/^(\S)e$/\1\[ee\]/g;
print "test2:".$test2."\n";
-> test2:l[ee]
But I am loading these regex from a text file (using Perl for proto, the idea is to reuse it with any language implementing regex):
In the text file I store for example (I used % to split the line between match and replace):
^(\S)e$% \1\[ee\]
and then I parse and apply all regex like that:
my $test="le";
while (my $row = <$fh>) {
chomp $row;
if( $row =~ /%/){
my @reg = split /%/, $row;
#if no replacement, put empty string
if($#reg == 0){
push(@reg,"");
}
print "reg found, reg:".$reg[0].", replace:".$reg[1]."\n";
push @regs, [ @reg ];
}
}
print "orgine:".$test."\n";
for my $i (0 .. $#regs){
my $p=$regs[$i][0];
my $r=$regs[$i][1];
$test=~ s/$p/$r/g;
}
print "final:".$test."\n";
This technique is working well with my other regex, but not yet when I have a $1 or \1 in the replace... here is what I am obtaining:
final:\1\ee\
PS: you answered to initial question, should I open another post ?
Something like s/(?i)^([a-z])e$/$1\[ee\]/
Why aren't you using a capture group to do the replacement?
`s/^([c|d|j|l|m|n|s|t])e$/\1 [ee]/g`
If those are the characters you need and if it is indeed one word to a line with no whitespace before it or after it, then this will work.
Here's another option, depending on what you are looking for. It will match a two-character string consisting of one a-z character followed by one 'e' on its own line, with possible whitespace before or after. It will replace this with the single a-z character followed by ' [ee]'
`s/^\s*([a-z])e\s*$/\1 [ee]/`
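A quick way to check a pattern like this is to run it over the three sample words from the question. The sketch below uses $1 rather than \1 in the replacement, since Perl warns about \1 there under `use warnings`; the @results array is just a name chosen for this illustration.

```perl
use strict;
use warnings;

my @results;
for my $word (qw(le me elle)) {
    # Copy the word, then rewrite the copy only if it is exactly
    # one lowercase letter followed by "e".
    (my $out = $word) =~ s/^([a-z])e$/$1 [ee]/;
    push @results, "$word -> $out";
}
print "$_\n" for @results;
```

The space between $1 and [ee] also keeps Perl from reading the bracket as an array subscript; without the space, the bracket would need escaping.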
^(\S)e$
Try this. Replace by $1 [ee]. See demo.
https://regex101.com/r/hR7tH4/28
I'd do something like this
$word =~ s/^(\w{1})(e)$/$1$2e/;
You can use following regex which match 2 character and then you can replace it with $1\[$2$2\]:
^([a-zA-Z])([a-zA-Z])$
Demo :
$my_string =~ s/^([a-zA-Z])([a-zA-Z])$/$1\[$2$2\]/;
See demo https://regex101.com/r/iD9oN4/1

Sub-pattern in regex can't be dereferenced?

I have following Perl script to extract numbers from a log. It seems that the non-capturing group with ?: isn't working when I define the sub-pattern in a variable. It's only working when I leave out the grouping in either the regex-pattern or the sub-pattern in $number.
#!/usr/bin/perl
use strict;
use warnings;
my $number = '(:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)';
#my $number = '-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?';
open(FILE,"file.dat") or die "Exiting with: $!\n";
while (my $line = <FILE>) {
if ($line =~ m{x = ($number). y = ($number)}){
print "\$1= $1\n";
print "\$2= $2\n";
print "\$3= $3\n";
print "\$4= $4\n";
};
}
close(FILE);
The output for this code looks like:
$1= 12.15
$2= 12.15
$3= 3e-5
$4= 3e-5
for an input of:
asdf x = 12.15. y = 3e-5 yadda
Those doubled outputs aren't desired.
Is this because of the m{} style in contrast to the regular m// patterns for regex? I only know the former style to get variables (sub-strings) in my regex expressions. I just noticed this for the backreferencing so possibly there are other differences for metacharacters?
The delimiters you use for the regular expression aren't causing any problems but the following is:
(:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)
 ^^
Notice this isn't a capturing group, it is an optional colon :
Probably a typo, but it is causing the trouble.
Edit: It looks like it is not just the typo. I substituted the variables into the regex and got this:
x = ((:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)). y = ((:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?))
^^ first and second group ^^ ^^ third and fourth group ^^
As you can see the first and second capturing group are capturing exactly the same thing, the same is happening for the third and fourth capturing group.
You're going to kick yourself...
Your regexp reads out as:
capture {
    maybe-colon
    maybe-minus
    cluster { (?:(?:\d+\.?\d*)|(?:\.\d+))
        cluster { (?:\d+\.?\d*)
            1+ digits
            maybe-dot
            0+ digits
        }
        -or-
        cluster { (?:\.\d+)
            dot
            1+ digits
        }
    }
    maybe cluster {
        E or e
        maybe + or -
        1+ digits
    } (?:[Ee][+-]?\d+)?
}
... which is what you're looking for.
However, when you then do your actual regexp, you do:
$line =~ m{x = ($number). y = ($number)}
(the curly braces are a distraction.... you may use any \W if the m or s has been specified)
What this is asking is to capture whatever the regexp defined in $number is.... which is, itself, a capture.... hence $1 and $2 being the same thing.
Simply remove the capturing parentheses from either $number or the regexp line.
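A minimal sketch of that fix, keeping the parentheses in the match and dropping them from the sub-pattern. It assumes the sample line from the question; compiling the sub-pattern with qr// and escaping the literal dot after the first number are choices made for this illustration, not part of the original script.

```perl
use strict;
use warnings;

# The sub-pattern has no capturing parentheses of its own.
my $number = qr/-?(?:\d+\.?\d*|\.\d+)(?:[Ee][+-]?\d+)?/;

my $line = 'asdf x = 12.15. y = 3e-5 yadda';
my ($x, $y);
if ($line =~ m{x = ($number)\. y = ($number)}) {
    # Only the two groups written in the match itself capture.
    ($x, $y) = ($1, $2);
    print "\$1= $x\n";
    print "\$2= $y\n";
}
```

Interpolating a qr// object wraps it in a non-capturing group, so the capture numbering in the outer pattern stays predictable.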

I want to match anything not including '#' but yes including '\#'

I'm looking at a line of Perl code that may contain regexps and comments.
I need to capture everything up to a comment, so I want all characters until #, but I do want to capture \#.
for example, if the line was:
if ($line=/\#/) { #captures lines with '#'
I want to capture:
if ($line=/\#/) {
Give this a try:
use PPI;
my $ppi = PPI::Document->new('source.pl');
my $source = '';
for my $token ( @{ $ppi->find("PPI::Token") } ) {
last if $token->isa("PPI::Token::Comment");
$source .= $token;
}
print $source;
This should handle pretty much everything except here-docs. If you need to deal with those, start by copying PPI::Document::serialize and modify it to stop on the first comment.
Try this
^(?:[^#]|(?<=\\)#)+
See it here on Regexr
This will match anything from the start of the string (^), that is not a # ([^#]) OR
a # that is preceded by a backslash ((?<=\\)#)
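As a rough check, the pattern can be tried on the line from the question. The sketch below wraps it in one capturing group; the variable names are only for illustration, and it assumes the only # characters to keep are the backslash-escaped ones (a # inside a string literal or in $#array would still be cut, which is where a real parser such as PPI helps).

```perl
use strict;
use warnings;

# A single-quoted here-doc keeps the backslash exactly as typed.
my $line = <<'EOF';
if ($line=/\#/) { #captures lines with '#'
EOF
chomp $line;

# Take everything from the start up to the first unescaped #.
my ($code) = $line =~ /^((?:[^#]|(?<=\\)#)+)/;
print "[$code]\n";
```

The captured text keeps the space before the comment; trim it afterwards if that matters.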

perl regex for extracting multiline blocks

I have text like this:
00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
So, I don't have a block end, just a new block start.
I want to recursively get all blocks:
1 = 00:00 stuff
2 = 00:01 more stuff
multi line
and going
etc
The code below only gives me this:
$VAR1 = '00:00';
$VAR2 = '';
$VAR3 = '00:01';
$VAR4 = '';
$VAR5 = '00:02';
$VAR6 = '';
What am I doing wrong?
my $text = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
';
my @array = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;
print Dumper(@array);
Version 5.10.0 introduced named capture groups that are useful for matching nontrivial patterns.
(?'NAME'pattern)
(?<NAME>pattern)
A named capture group. Identical in every respect to normal capturing parentheses () but for the additional fact that the group can be referred to by name in various regular expression constructs (such as \g{NAME}) and can be accessed by name after a successful match via %+ or %-. See perlvar for more details on the %+ and %- hashes.
If multiple distinct capture groups have the same name then the $+{NAME} will refer to the leftmost defined group in the match.
The forms (?'NAME'pattern) and (?<NAME>pattern) are equivalent.
Named capture groups allow us to name subpatterns within the regex as in the following.
use 5.10.0; # named capture buffers
my $block_pattern = qr/
(?<time>(?&_time)) (?&_sp) (?<desc>(?&_desc))
(?(DEFINE)
# timestamp at logical beginning-of-line
(?<_time> (?m:^) [0-9][0-9]:[0-9][0-9])
# runs of spaces or tabs
(?<_sp> [ \t]+)
# description is everything through the end of the record
(?<_desc>
# s switch makes . match newline too
(?s: .+?)
# terminate before optional whitespace (which we remove) followed
# by either end-of-string or the start of another block
(?= (?&_sp)? (?: $ | (?&_time)))
)
)
/x;
Use it as in
my $text = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
';
while ($text =~ /$block_pattern/g) {
print "time=[$+{time}]\n",
"desc=[[[\n",
$+{desc},
"]]]\n\n";
}
Output:
$ ./blocks-demo
time=[00:00]
desc=[[[
stuff
]]]
time=[00:01]
desc=[[[
more stuff
multi line
and going
]]]
time=[00:02]
desc=[[[
still
have
]]]
This should do the trick. Beginning of next \d\d:\d\d is treated as block end.
use strict;
my $Str = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
00:03 still
have' ;
my @Blocks = ($Str =~ m#(\d\d:\d\d.+?(?:(?=\d\d:\d\d)|$))#gs);
print join "--\n", @Blocks;
Your problem is that .*? is non-greedy in the same way that .* is greedy. When it is not forced, it matches as little as possible, which in this case is the empty string.
So, you'll need something after the non-greedy match to anchor up your capture. I came up with this regex:
my @array = $text =~ m/\n?([0-9]{2}:[0-9]{2}.*?)(?=\n[0-9]{2}:|$)/gs;
As you see, I removed the /m option to accurately be able to match end of string in the look-ahead assertion.
You might also consider this solution:
my @array = split /(?=[0-9]{2}:[0-9]{2})/, $text;
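A small sketch of the split approach on the sample text. It adds a ^ anchor with the /m modifier, on the assumption that every timestamp starts a line, so a time that appears in the middle of a description would not start a new block; the names are only illustrative.

```perl
use strict;
use warnings;

my $text = <<'EOT';
00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
EOT

# Split at every line start that is followed by a timestamp.
# The lookahead is zero-width, so the timestamp stays in its block.
my @blocks = split /^(?=[0-9]{2}:[0-9]{2})/m, $text;

for my $i (0 .. $#blocks) {
    printf "%d = %s", $i + 1, $blocks[$i];
}
```

Each block keeps its trailing newline; chomp the elements if you want them without it.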

Perl regular expression isn't greedy enough

I'm writing a regular expression in perl to match perl code that starts the definition of a perl subroutine. Here's my regular expression:
my $regex = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';
$regex matches code that starts a subroutine. I'm also trying to capture the name of the subroutine in $1 and any white space and comments between the subroutine name and the initial open brace in $2. It's $2 that is giving me a problem.
Consider the following perl code:
my $x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
$x = 2;
return;
}
When I put this perl code into a string and match it against $regex, $2 is "# This is comment 3.\n", not the three lines of comments that I want. I thought the regular expression would greedily put all three lines of comments into $2, but that seems not to be the case.
I would like to understand why $regex isn't working and to design a simple replacement. As the program below shows, I have a more complex replacement ($re3) that works. But I think it's important for me to understand why $regex doesn't work.
use strict;
use English;
my $code_string = <<END_CODE;
my \$x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
\$x = 2;
return;
}
END_CODE
my $re1 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';
my $re2 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{';
my $re3 = '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{';
print "\$code_string is '$code_string'\n";
if ($code_string =~ /$re1/) {print "For '$re1', \$2 is '$2'\n";}
if ($code_string =~ /$re2/) {print "For '$re2', \$2 is '$2'\n";}
if ($code_string =~ /$re3/) {print "For '$re3', \$2 is '$2'\n";}
exit 0;
__END__
The output of the perl script above is the following:
$code_string is 'my $x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
$x = 2;
return;
} # sub zz
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{', $2 is '
# This is comment 1.
# This is comment 2.
# This is comment 3.
'
Look at only the part of your regex that captures $2. It is (\s*#.*\n). By itself, this can only capture a single comment line. You have an asterisk after it in order to capture multiple comment lines, and this works just fine. It captures multiple comment lines and puts each of them into $2, one by one, each time replacing the previous value of $2. So the final value of $2 when the regex is done matching is the last thing that the capturing group matched, which is the final comment line. Only. To fix it, you need to put the asterisk inside the capturing group. But then you need to put another set of parentheses (non-capturing, this time) to make sure the asterisk applies to the whole thing. So instead of (\s*#.*\n)*, you need ((?:\s*#.*\n)*).
Your third regex works because you unwittingly surrounded the whole expression in parentheses so that you could put a question mark after it. This caused $2 to capture all the comments at once, and $3 to capture only the final comment.
When you are debugging your regex, make sure you print out the values of all the match variables you are using: $1, $2, $3, etc. You would have seen that $1 was just the name of the subroutine and $2 was only the third comment. This might have led you to wonder how on earth your regex skipped over the first two comments when there is nothing between the first and second capturing groups, which would eventually lead you in the direction of discovering what happens when a capturing group matches multiple times.
By the way, it looks like you are also capturing any whitespace after the subroutine name into $1. Is this intentional? (Oops, I messed up my mnemonics and thought \w was "w for whitespace".)
If you add repetition to a capturing group, it will only capture the final match of that group. This is why $regex only matches the final comment line.
Here is how I would rewrite your regex:
my $regex = '\s*sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)*)\s*\{';
This is very similar to your $re3, except for the following changes:
The white space and comment matching portion is now in a non-capturing group
I changed that portion of the regex from ((...)+)? to ((...)*) which is equivalent.
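To see the difference, here is a short sketch that runs the rewritten pattern against the code from the question. It uses qr// and a single-quoted here-doc instead of the original string variables, which is only a convenience for this illustration.

```perl
use strict;
use warnings;

my $code_string = <<'END_CODE';
my $x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
$x = 2;
return;
}
END_CODE

# The repetition sits inside the capturing group, so $2 receives
# the whole run of comment lines instead of only the last one.
my $regex = qr/\s*sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)*)\s*\{/;

my ($name, $comments);
if ($code_string =~ $regex) {
    ($name, $comments) = ($1, $2);
    print "name: $name\ncomments:$comments";
}
```

The captured comment text starts with the newline that follows the subroutine name, because the inner \s* consumes it.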
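Another angle: by default . does not match \n, so each pass of (\s*#.*\n) stops at the end of one comment line, and $2 keeps only the last pass.
With the s modifier, . also matches \n, so a single pass of the group can span all the comment lines. Be aware that .* then runs greedily across lines, so this can overmatch when the code contains more than one opening brace at the start of a line: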
if ($code_string =~ /$re1/s) {print "For '$re1', \$2 is '$2'\n";}
Note the s after the regex.