I was doing some regex substitution operation with the html snippet using Perl.
This is how I match the wanted part: (class="p_hw"><a href=")(http://[^<>"]*?xxxx\.com\/[^<>"]*[=/])([^<>"]*)(">(?:<b>)?)(.*?)(?=<)
I need to replace the http:// with entry:// followed by certain parameter value of the http url($3 for that matter) if that value exists in a hash(%hw_f), or else the first word(or phrase) from $5 will be used when it exists in %hw_f. If all conditions are not matched, the snippet will stay unchanged.
I have tried the following:
s#(class="p_hw"><a href=")(http://[^<>"]*?xxxx\.com\/[^<>"]*[=/])([^<>"]*)(">(?:<b>)?)(.*?)(?=<)#
my #n = split(/\,|;/, $5);
my #m = map {s,^\s+|\s+$,,mgr} #n;
my $new = $3 =~ s/^\s+|\s+$//mgr;
my $new2 = $new =~ s/\+/ /mgr;
exists $hw_f{$new2} ? "$1entry://$new2$4$5" : (exists $hw_f{$m[0]} ? "$1entry://$m[0]$4$5" : "$1$2$3$4$5") #eg;
%hw_f is where all conditions will be matched against.
It gives the following error:
Use of uninitialized value $1 in concatenation (.) or string
I need to obtain a new value based on $3 within the substitution, continue with that new value. How could I do that?
I'm not going to try to really fix the logic of what you're trying to accomplish because it's rather ill advised. What I will do is offer some semantic and coding advice.
1: Use Regexp::Common and URI to deal with URLs. It is almost never worth it to write your own regexes. Parsing HTML with regex requires that you seriously know what you're doing. https://metacpan.org/search?q=regexp%3A%3Acommon
2: Always only use {} and // to wrap regex. (A 99% rule)
3: Always immediately copy the numbered variables into meaningfully named my() variables unless the expression is trivial.
4: Modify arrays inplace with postfix foreach.
5: Spread out the code formatting to make it visually appealing.
6: Use sprintf for complicated variable recombinations. It makes it a lot easier to see what variable is used where and for what.
HTH
# 1 2 3 4 5
s{(class="p_hw"><a href=\")(http://[^<>"]*?xxxx\.com/[^<>"]*[=/])([^<>\"]*)(\">(?:<b>)?)(.*?)(?=<)}{
my ($m1, $m2, $m3, $m4, $m5) = ($1, $2, $3, $4, $5);
my #n = split /[,|;]/, $m5;
s/^\s+|\s+$//mg foreach #n;
(my $new = $m3) =~ s/^\s+|\s+$//mg;
(my $new2 = $new) =~ s/\+/ /g;
exists $hw_f{$new2} ?
sprintf "%sentry://%s%s%s", $m1, $new2, $m4, $m5 :
exists $hw_f{$n[0]} ?
sprintf "%sentry://%s%s%s", $m1, $n[0], $m4, $m5 :
"$m1$m2$m3$m4$m5";
}ige;
Update:
while (<DICT>) {
s#(class="p_hw"><a href=")(http://[^<>"]*?wordinfo\.info\/[^<>"]*[=/])([^<>"]*)(">(?:<b>)?)(.*?)(?=<)#
my $one = $1;
my $two = $2;
my $three = $3;
my $four = $4;
my $five = $5;
my #n = split(/\,|;/, $5);
my #m = map {s,^\s+|\s+$,,mgr} #n;
my $new = $3 =~ s/^\s+|\s+$//mgr;
my $new2 = $new =~ s/\+/ /mgr;
exists $hw_f{$new2} ? $one."entry://$new2$four$five" : (exists $hw_f{$m[0]} ? $one."entry://$m[0]$four$five" : "$one$two$three$four$five") #eg;
print $FH $_;
}
Assigning all the capture variables before all the regex engine invocation as #DavidO in the comment mentioned, it finally works. Thanks.
from your post it is not obvious what you try to achieve. If you would describe the problem in following format it would be easier to understand
--- Example -----------------------
I extract from web page a snippet with <a href="http:\\....... which I would like to convert/transform into following format <a href="http:\\........
At least in this way we know what is INPUT and what OUTPUT expected.
--- End of the example ------------
When you apply regex with memory it is easier to store remembered values in an array or better hash
use strict;
use warnings;
use Data::Dumper;
my %href;
$data = shift;
if( $data =~ /<a href="(\w+):\\\\([\w\d\.]+)\\([\w\d\.]+)\\(.+)">([^<]+)</ ) {
#href{qw(protocol dns dir rest desc)} = ($1,$2,$3,$4,$5);
print Dumper(\%href);
} else {
print "No match found\n";
}
Related
How do you create a $scalar from the result of a regex match?
Is there any way that once the script has matched the regex that it can be assigned to a variable so it can be used later on, outside of the block.
IE. If $regex_result = blah blah then do something.
I understand that I should make the regex as non-greedy as possible.
#!/usr/bin/perl
use strict;
use warnings;
# use diagnostics;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Outlook';
my #Qmail;
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\\s\*owner \#/";
my $outlook = Win32::OLE->new('Outlook.Application')
or warn "Failed Opening Outlook.";
my $namespace = $outlook->GetNamespace("MAPI");
my $folder = $namespace->Folders("test")->Folders("Inbox");
my $items = $folder->Items;
foreach my $msg ( $items->in ) {
if ( $msg->{Subject} =~ m/^(.*test alert) / ) {
my $name = $1;
print " processing Email for $name \n";
push #Qmail, $msg->{Body};
}
}
for(#Qmail) {
next unless /$regex|^\s*description/i;
print; # prints what i want ie lines that start with owner and description
}
print $sentence; # prints ^\\s\*offense \ # not lines that start with owner.
One way is to verify a match occurred.
use strict;
use warnings;
my $str = "hello what world";
my $match = 'no match found';
my $what = 'no what found';
if ( $str =~ /hello (what) world/ )
{
$match = $&;
$what = $1;
}
print '$match = ', $match, "\n";
print '$what = ', $what, "\n";
Use Below Perl variables to meet your requirements -
$` = The string preceding whatever was matched by the last pattern match, not counting patterns matched in nested blocks that have been exited already.
$& = Contains the string matched by the last pattern match
$' = The string following whatever was matched by the last pattern match, not counting patterns matched in nested blockes that have been exited already. For example:
$_ = 'abcdefghi';
/def/;
print "$`:$&:$'\n"; # prints abc:def:ghi
The match of a regex is stored in special variables (as well as some more readable variables if you specify the regex to do so and use the /p flag).
For the whole last match you're looking at the $MATCH (or $& for short) variable. This is covered in the manual page perlvar.
So say you wanted to store your last for loop's matches in an array called #matches, you could write the loop (and for some reason I think you meant it to be a foreach loop) as:
my #matches = ();
foreach (#Qmail) {
next unless /$regex|^\s*description/i;
push #matches_in_qmail $MATCH
print;
}
I think you have a problem in your code. I'm not sure of the original intention but looking at these lines:
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\s*owner #/";
I'll step through that as:
Assign $regexto the string ^\s*owner #.
Assign $sentence to value of running a match within $regex with the regular expression /^s*owner $/ (which won't match, if it did $sentence will be 1 but since it didn't it's false).
I think. I'm actually not exactly certain what that line will do or was meant to do.
I'm not quite sure what part of the match you want: the captures, or something else. I've written Regexp::Result which you can use to grab all the captures etc. on a successful match, and Regexp::Flow to grab multiple results (including success statuses). If you just want numbered captures, you can also use Data::Munge
You can do the following:
my $str ="hello world";
my ($hello, $world) = $str =~ /(hello)|(what)/;
say "[$_]" for($hello,$world);
As you see $hello contains "hello".
If you have older perl on your system like me, perl 5.18 or earlier, and you use $ $& $' like codequestor's answer above, it will slow down your program.
Instead, you can use your regex pattern with the modifier /p, and then check these 3 variables: ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} for your matching results.
I need to write a perl regex to convert
site.company.com => dc=site,dc=company,dc=com
Unfortunately I am not able to remove the trailing "," using the regex I came with below. I could of course remove the trailing "," in the next statement but would prefer that to be handled as a part of the regex.
$data="site.company.com";
$data =~ s/([^.]+)\.?/dc=$1,/g;
print $data;
This above code prints:
dc=site,dc=company,dc=com,
Thanks in advance.
When handling urls it may be a good idea to use a module such as URI. However, I do not think it applies in this case.
This task is most easily solved with a split and join, I think:
my $url = "site.company.com";
my $string = join ",", # join the parts with comma
map "dc=$_", # add the dc= to each part
split /\./, $url; # split into parts
$data =~s/\./,dc=/g&&s/^/dc=/g;
tested below:
> echo "site.company.com" | perl -pe 's/\./,dc=/g&&s/^/dc=/g'
dc=site,dc=company,dc=com
Try doing this :
my $x = "site.company.com";
my #a = split /\./, $x;
map { s/^/dc=/; } #a;
print join",", #a;
just put like this,
$data="site.company.com";
$data =~ s/,dc=$1/dc=$1/g; #(or) $data =~ s/,dc/dc/g;
print $data;
I'm going to try the /ge route:
$data =~ s{^|(\.)}{
( $1 && ',' ) . 'dc='
}ge;
e = evaluate replacement as Perl code.
So, it says given the start of the string, or a dot, make the following replacement. If it captured a period, then emit a ','. Regardless of this result, insert 'dc='.
Note, that I like to use a brace style of delimiter on all my evaluated replacements.
Assuming that I must do this substitution using a single substitution, what is the preferred method to avoid this error:
Use of uninitialized value $2 in concatenation (.) or string at -e line 1.
With this Perl code:
perl -e 'use strict;use warnings;my $str="a";$str=~s/(a)|(b)/$1foo$2/gsmo;'
The goal here is to either print "afoo" or "foob" depending on what $str contains.
I can use no warnings; but then I am worried I will miss other "real" warnings. I also know that using one pattern makes this convoluted but my actual pattern is much more complicated.
If you care the actual replacements are closer to:
#!perl
my $search = q~(document\.domain.*?</script>)|(</head>)~;
my $search_re = qr/$search/smo;
my $replace = q("$1
<script src=\"/library.js\"></script>
$2");
while (<*.tmpl>) {
my $str = fead_file($_);
$str =~ s/$search_re/$replace/gee;
}
But even more complicated, basically the above code just reads from a DB to get the search & replace and then does them to the template. Having to run this script twice with every commit would introduce too much overhead, apparently... so says them...
You could:
my $replace = q("#{[$1||'']}
<script src=\"/library.js\"></script>
#{[$2||'']}");
(using // instead of || on 5.10+)
Still works with /g:
s/(a)|(b)/ ($1 // '') . 'foo' . ($2 // '') /ge
Well, you can't find both "a" and "b" when you specifically say OR (|). Also, you cannot concatenate the strings by placing the variable name next to the text, e.g. $1foo.
I'm not quite sure what you are saying about overhead, but you do need to check the match in order to do a correct replacement.
s/(a)/$1 . "foo"/ge || s/(b)/"foo" . $1/ge;
This might work. If the first one works, the second won't be executed (short circuit OR).
Similar to ikegami's solution, if you want to hold the replacement in a variable you can call a code reference in s///e passing it the captures.
#!perl
my $search = q~(document\.domain.*?</script>)|(</head>)~;
my $search_re = qr/$search/smo;
my $replace = sub {
my $one = shift || '';
my $two = shift || '';
return qq($one\n<script src="/library.js"></script>\n$two);
}
while (<*.tmpl>) {
my $str = fead_file($_);
$str =~ s/$search_re/$replace->($1, $2)/ge;
}
I want to be able to do a regex match on a variable and assign the results to the variable itself. What is the best way to do it?
I want to essentially combine lines 2 and 3 in a single line of code:
$variable = "some string";
$variable =~ /(find something).*/;
$variable = $1;
Is there a shorter/simpler way to do this? Am I missing something?
my($variable) = "some string" =~ /(e\s*str)/;
This works because
If the /g option is not used, m// in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, i.e., ($1, $2, $3 …).
and because my($variable) = ... (note the parentheses around the scalar) supplies list context to the match.
If the pattern fails to match, $variable gets the undefined value.
Why do you want it to be shorter? Does is really matter?
$variable = $1 if $variable =~ /(find something).*/;
If you are worried about the variable name or doing this repeatedly, wrap the thing in a subroutine and forget about it:
some_sub( $variable, qr/pattern/ );
sub some_sub { $_[0] = $1 if eval { $_[0] =~ m/$_[1]/ }; $1 };
However you implement it, the point of the subroutine is to make it reuseable so you give a particular set of lines a short name that stands in their place.
Several other answers mention a destructive substitution:
( my $new = $variable ) =~ s/pattern/replacement/;
I tend to keep the original data around, and Perl v5.14 has an /r flag that leaves the original alone and returns a new string with the replacement (instead of the count of replacements):
my $match = $variable =~ s/pattern/replacement/r;
Well, you could say
my $variable;
($variable) = ($variable = "find something soon") =~ /(find something).*/;
or
(my $variable = "find something soon") =~ s/^.*?(find something).*/$1/;
You can do substitution as:
$a = 'stackoverflow';
$a =~ s/(\w+)overflow/$1/;
$a is now "stack"
From Perl Cookbook 2nd ed
6.1 Copying and Substituting Simultaneously
$dst = $src;
$dst =~ s/this/that/;
becomes
($dst = $src) =~ s/this/that/;
I just assumed everyone did it this way, amazed that no one gave this answer.
Almost ....
You can combine the match and retrieve the matched value with a substitution.
$variable =~ s/.*(find something).*/$1/;
AFAIK, You will always have to copy the value though, unless you do not care to clobber the original.
$variable2 = "stackoverflow";
(my $variable1) = ($variable2 =~ /stack(\w+)/);
$variable1 now equals "overflow".
I do this:
#!/usr/bin/perl
$target = "n: 123";
my ($target) = $target =~ /n:\s*(\d+)/g;
print $target; # the var $target now is "123"
Also, to amplify the accepted answer using the ternary operator to allow you to specify a default if there is no match:
my $match = $variable =~ /(*pattern*).*/ ? $1 : *defaultValue*;
I thought this would have done it...
$rowfetch = $DBS->{Row}->GetCharValue("meetdays");
$rowfetch = /[-]/gi;
printline($rowfetch);
But it seems that I'm missing a small yet critical piece of the regex syntax.
$rowfetch is always something along the lines of:
------S
-M-W---
--T-TF-
etc... to represent the days of the week a meeting happens
$rowfetch =~ s/-//gi
That's what you need for your second line there. You're just finding stuff, not actually changing it without the "s" prefix.
You also need to use the regex operator "=~" for this.
Here is what your code presently does:
# Assign 'rowfetch' to the value fetched from:
# The function 'GetCharValue' which is a method of:
# An Value in A Hash Identified by the key "Row" in:
# Either a Hash-Ref or a Blessed Hash-Ref
# Where 'GetCharValue' is given the parameter "meetdays"
$rowfetch = $DBS->{Row}->GetCharValue("meetdays");
# Assign $rowfetch to the number of times
# the default variable ( $_ ) matched the expression /[-]/
$rowfetch = /[-]/gi;
# Print the number of times.
printline($rowfetch);
Which is equivalent to having written the following code:
$rowfetch = ( $_ =~ /[-]/ )
printline( $rowfetch );
The magic you are looking for is the
=~
Token instead of
=
The former is a Regex operator, and the latter is an assignment operator.
There are many different regex operators too:
if( $subject =~ m/expression/ ){
}
Will make the given codeblock execute only if $subject matches the given expression, and
$subject =~ s/foo/bar/gi
Replaces ( s/) all instances of "foo" with "bar", case-insentitively (/i), and repeating the replacement more than once(/g), on the variable $subject.
Using the tr operator is faster than using a s/// regex substitution.
$rowfetch =~ tr/-//d;
Benchmark:
use Benchmark qw(cmpthese);
my $s = 'foo-bar-baz-blee-goo-glab-blech';
cmpthese(-5, {
trd => sub { (my $a = $s) =~ tr/-//d },
sub => sub { (my $a = $s) =~ s/-//g },
});
Results on my system:
Rate sub trd
sub 300754/s -- -79%
trd 1429005/s 375% --
Off-topic, but without the hyphens, how will you know whether a "T" is Tuesday or Thursday?