Regular expression tag matching - regex

I have a very simple Perl function that returns the content of a tag in custom XML code I need to parse. However, if there are line returns inside of the tags, then it returns an empty value and I'm not sure how to fix it:
sub in_tag
{
my ($text, $tag) = @_;
my ($content) = $text =~ m/<$tag.*>(.*)<\/$tag>/;
$content = $content . "";
return $content;
}
# works
print in_tag("<item><creation type=\"date\">2014-01-03</creation><name type=\"word\">John Doe</name><id type=\"number\">67</id></item>", "name");
# doesnt work
print in_tag("<item><creation type=\"date\">2014-01-03</creation><name type=\"word\">John\nDoe</name><id type=\"number\">67</id></item>", "name");

To make the . regex metacharacter match a newline, you need to use the /s flag:
m/..../s;
You also want to use non-greedy quantifiers in your regular expression. Put a ? after the * to still match zero or more, but with the provision that it doesn't go beyond text that would match the next part of the pattern:
m/<$tag.*?>(.*?)<\/$tag>/
I don't mind this simple sort of extraction for quick programs or small, uncomplicated inputs, but beyond that I like XML::Twig. It takes a bit to get used to, but once you get the hang of it you'll be able to do all sorts of fancy things with almost no effort.
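Putting the two suggestions together, here is a minimal sketch of how in_tag might look. It is not the original poster's final code: the \Q...\E quoting of the tag name and the [^>]* for the attributes are my own additions, on the assumption that tag names should be treated as literal text.

```perl
use strict;
use warnings;

# Sketch of in_tag with /s (so "." matches newlines) and a non-greedy
# capture (so the match stops at the first closing tag).
sub in_tag {
    my ($text, $tag) = @_;
    # \Q...\E quotes the tag name; [^>]* skips any attributes (both are assumptions).
    my ($content) = $text =~ m/<\Q$tag\E\b[^>]*>(.*?)<\/\Q$tag\E>/s;
    return defined $content ? $content : "";
}

my $xml = qq{<item><name type="word">John\nDoe</name><id type="number">67</id></item>};
print in_tag($xml, "name"), "\n";   # prints "John" and "Doe" on separate lines
```

This is still only suitable for small, predictable inputs; nested tags of the same name will not be handled.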

Related

Make a regular expression in perl to grep value work on a string with different endings

I have this code in perl where I want to extract the value of 'EUR_AF', in this case '0.39'.
Sometimes 'EUR_AF' ends with ';', sometimes it doesn't.
Alternatively, 'EUR_AF' may end with '=0' instead of '=0.39;' or '=0.39'.
How do I make the code handle that? Can't seem to find it online...I could of course wrap everything in an almost endless if-elsif-else statement, but that seems overkill.
Example text:
AVGPOST=0.9092;AN=2184;RSQ=0.5988;ERATE=0.0081;AC=144;VT=SNP;THETA=0.0045;AA=A;SNPSOURCE=LOWCOV;LDAF=0.0959;AF=0.07;ASN_AF=0.05;AMR_AF=0.10;AFR_AF=0.11;EUR_AF=0.039
Code: $INFO =~ m/\;EUR\_AF\=(.*?)(;)/
I did find that: $INFO =~ m/\;EUR\_AF\=(.*?0)/ handles the cases of EUR_AF=0, but how to handle alternative scenarios efficiently?
Extract one value:
my ($eur_af) = $s =~ /(?:^|;)EUR_AF=([^;]*)/;
my ($eur_af) = ";$s" =~ /;EUR_AF=([^;]*)/;
Extract all values:
my %rec = split(/[=;]/, $s);
my $eur_af = $rec{EUR_AF};
This regex should work for you: (?<=EUR_AF=)\d+(\.\d+)?
It means
(?<=EUR_AF=) - look for a string preceded by EUR_AF=
\d+(\.\d+)? - consisting of one or more digits, optionally followed by a decimal point and more digits
EDIT: I originally wanted the whole regex to return the correct result, not only the capture group. If you want the correct capture group edit it to (?<=EUR_AF=)(\d+(?:\.\d+)?)
I have found the answer. The code:
$INFO =~ m/(?:^|;)EUR_AF=([^;]*)/
seems to handle the cases where EUR_AF=0 and EUR_AF=0.39, ending with or without ;. The captured value in $1 will be 0 or 0.39.
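For what it's worth, here is a small sketch that runs that pattern against the three endings mentioned in the question. The sub name eur_af and the shortened sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the EUR_AF value, or undef if the key is absent.
sub eur_af {
    my ($info) = @_;
    # (?:^|;) anchors the key at a field boundary; [^;]* stops at the next ";" or at end of string.
    my ($value) = $info =~ /(?:^|;)EUR_AF=([^;]*)/;
    return $value;
}

for my $info ('AF=0.07;EUR_AF=0.39;LDAF=0.0959', 'AF=0.07;EUR_AF=0.39', 'EUR_AF=0;AF=0.07') {
    my $value = eur_af($info);
    print defined $value ? $value : 'not found', "\n";
}
```

Because the list assignment leaves $value undefined when nothing matches, the caller can tell a missing key apart from a value of 0.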

Perl Regex negation for multiple words

I need to exclude some URLs for a jMeter test:
dont exclude:
http://foo/bar/is/valid/with/this
http://foo/bar/is/also/valid/with/that
exclude:
http://foo/bar/is/not/valid/with/?=action
http://foo/bar/is/not/valid/with/?=action
http://foo/bar/is/not/valid/with/specialword
Please help me?
My regex below isn't working:
foo/(\?=|\?action|\?form_action|specialword).*
First problem: / is the general delimiter so escape it with \/ or alter the delimiter.
Second Problem: It will match only foo/action and so on, you need to include a wildcard before the brackets: foo\/.*(\?=|\?action|\?form_action|specialword).*
So:
/foo\/.*(\?=|\?action|\?form_action|specialword).*/
Next problem is that this will match the opposite: Your excludes. You can either finetune your regex to do the inverse OR you can handle this in your language (i.e. if there is no match, do this and that).
Always pay attention to special characters in regex.
There are countless ways to shoot yourself in the foot with regular expressions. You could write some kind of "parser" using /g and /c in a loop, but why bother? It seems like you are already having trouble with the current regular expression.
Break the problem down into smaller parts and everything will be less complicated. You could write yourself some kind of filter for grep like:
sub filter {
my $u = shift;
my $uri = URI->new($u);
return undef if $uri->query;
return undef if grep { $_ eq 'specialword' } $uri->path_segments;
return $u;
}
say for grep {filter $_} @urls;
I wouldn't cling that hard to a regular expression, especially if others have to read the code too...
Change the regex delimiter to something other than '/' so you don't have to escape it in your matches. You might do:
m{//foo/.+(?:\?=action|\?form_action|specialword)$};
The ?: denotes grouping-only.
Using this, you could say:
print unless m{//foo/.+(?:\?=action|\?form_action|specialword)$};
Your alternation is wrong. foo/(\?=|\?action|\?form_action|specialword) matches any of
foo/?=
foo/?action
foo/?form_action
foo/specialword
so you need instead
m{foo/.*(?:\?=action|\?=form_action|specialword)}
The .* is necessary to account for the possible bar/is/valid/with/this after /foo/.
Note that I have changed your ( .. ) to the non-capturing (?: .. ) and I have used braces for the regex delimiter to avoid having to escape the slashes in the expression.
Finally, you need to write either
unless ($url =~ m{/foo/.*(?:\?=action|\?=form_action|specialword)}) { ... }
or
if ($url !~ m{/foo/.*(?:\?=action|\?=form_action|specialword)}) { ... }
since the regex matches URLs that are to be discarded.
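As a rough sketch of how that looks on a whole list, the pattern can be compiled once with qr and used with !~ inside grep. The sub name keep_urls is hypothetical, and the sample URLs are taken from the question.

```perl
use strict;
use warnings;

# The regex describes the URLs to discard, so we keep everything that does NOT match.
my $exclude = qr{/foo/.*(?:\?=action|\?=form_action|specialword)};

# Hypothetical helper: return only the URLs that should stay in the test.
sub keep_urls {
    return grep { $_ !~ $exclude } @_;
}

my @urls = (
    'http://foo/bar/is/valid/with/this',
    'http://foo/bar/is/not/valid/with/?=action',
    'http://foo/bar/is/not/valid/with/specialword',
);
print "$_\n" for keep_urls(@urls);   # prints only the first URL
```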

How to have a variable as regex in Perl

I think this question is repeated, but searching wasn't helpful for me.
my $pattern = "javascript:window.open\('([^']+)'\);";
$mech->content =~ m/($pattern)/;
print $1;
I want to have an external $pattern in the regular expression. How can I do this? The current one returns:
Use of uninitialized value $1 in print at main.pm line 20.
$1 was empty, so the match did not succeed. I'll make up a constant string in my example of which I know that it will match the pattern.
Declare your regular expression with qr, not as a simple string. Also, you're capturing twice, once in $pattern for the open call's parentheses, once in the m operator for the whole thing, therefore you get two results. Instead of $1, $2 etc. I prefer to assign the results to an array.
my $pattern = qr"javascript:window.open\('([^']+)'\);";
my $content = "javascript:window.open('something');";
my @results = $content =~ m/($pattern)/;
# expression return array
# (
# q{javascript:window.open('something');},
# 'something'
# )
When I compile that string into a regex, like so:
my $pattern = "javascript:window.open\('([^']+)'\);";
my $regex = qr/$pattern/;
I get just what I think I should get, following regex:
(?-xism:javascript:window.open('([^']+)');)
Notice that it is looking for a capture group and not an open paren at the end of 'open'. And in that capture group, the first thing it expects is a single quote. So it will match
javascript:window.open'fum';
but not
javascript:window.open('fum');
One thing you have to learn, is that in Perl, "\(" is the same thing as "(" you're just telling Perl that you want a literal '(' in the string. In order to get lasting escapes, you need to double them.
my $pattern = "javascript:window.open\\('([^']+)'\\);";
my $regex = qr/$pattern/;
Actually preserves the literal ( and yields:
(?-xism:javascript:window.open\('([^']+)'\);)
Which is what I think you want.
As for your question, you should always test the results of a match before using it.
if ( $mech->content =~ m/($pattern)/ ) {
print $1;
}
makes much more sense. And if you want to see it regardless, then it's already implicit in that idea that it might not have a value. i.e., you might not have matched anything. In that case it's best to put alternatives
$mech->content =~ m/($pattern)/;
print $1 || 'UNDEF!';
However, I prefer to grab my captures in the same statement, like so:
my ( $open_arg ) = $mech->content =~ m/($pattern)/;
print $open_arg || 'UNDEF!';
The parens around $open_arg puts the match into a "list context" and returns the captures in a list. Here I'm only expecting one value, so that's all I'm providing for.
Finally, one of the root causes of your problems is that you do not need to specify your expression in a string in order for your regex to be "portable". You can get perl to pre-compile your expression. That way, you only care what instructions the characters are to a regex and not whether or not you'll save your escapes until it is compiled into an expression.
A compiled regex will interpolate itself into other regexes properly. Thus, you get a portable expression that interpolates just as well as a string--and specifically correctly handles instructions that could be lost in a string.
my $pattern = qr/javascript:window.open\('([^']+)'\);/;
Is all that you need. Then you can use it, just as you did. Although, putting parens around the whole thing, would return the whole matched expression (and not just what's between the quotes).
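To see the difference side by side, here is a small sketch that runs the same text against the string version and the qr version. The helper first_capture and the sample content are invented for the demonstration.

```perl
use strict;
use warnings;

# In a double-quoted string "\(" collapses to "(", so the regex sees a capture group.
my $from_string = "javascript:window.open\('([^']+)'\);";
# qr// keeps the backslashes, so "\(" still means a literal parenthesis.
my $from_qr = qr/javascript:window.open\('([^']+)'\);/;

# Hypothetical helper: return the first capture, or undef when the match fails.
sub first_capture {
    my ($text, $re) = @_;
    my ($first) = $text =~ $re;
    return $first;
}

my $content = "javascript:window.open('something');";
my $s = first_capture($content, $from_string);
my $q = first_capture($content, $from_qr);
print defined $s ? "string: $s\n" : "string: no match\n";   # string: no match
print defined $q ? "qr: $q\n"     : "qr: no match\n";       # qr: something
```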
You do not need the parentheses in the match pattern. It will match the whole pattern and return that as $1, which I am guessing is not matching, but I am only guessing.
$mech->content =~ m/$pattern/;
or
$mech->content =~ m/(?:$pattern)/;
These are the clustering, non-capturing parentheses.
The way you are doing it is correct.
The solutions have been already given, I'd like to point out that the window.open call might have multiple parameters included in "" and grouped by comma like:
javascript:window.open("http://www.javascript-coder.com","mywindow","status=1,toolbar=1");
There might be spaces between the function name and parentheses, so I'd use a slightly different regex for that:
my $pattern = qr{
javascript:window.open\s*
\(
([^)]+)
\)
}x;
print $1 if $text =~ /$pattern/;
Now you have all parameters in $1 and can process them afterwards with split /,/, $stuff and so on.
It reports an uninitialized value because $1 is undefined. $1 is undefined because you have created a nested matching group by wrapping a second set of parentheses around the pattern. It will also be undefined if nothing matches your pattern.

How can I extract a varying number of groups of digits from a Perl string?

I am attempting to parse a string in Perl with the format:
Messages pushed to the Order Book queues 123691 121574 146343 103046 161253
I want to access the numbers at the end of the string so intend to do a match like
/(\d+)/s
My issue is that there is a variable number of values at the end of the string.
What is the best way to format the regexp to be able to access each of those numbers individually? I'm a C++ developer and am just learning Perl, so am trying to find the cleanest Perl way to accomplish this.
Thanks for your help.
Just use the /g flag to make the match operator perform a global match. In list context, the match operator returns all of the results as a list:
@result = $string =~ /(\d+)/g;
This works if there are no other numbers than the trailing ones.
You can use the match operator in a list context with the global flag to get a list of all your parenthetical captures. Example:
@list = ($string =~ /(\d+)/g);
Your list should now have all the digit groups in your string.
See the documentation on the match operator for more info.
"In scalar context, each execution of m//g finds the next match, returning true if it matches, and false if there is no further match" --(From perldoc perlop)
So you should be able to make a global regex loop, like so:
while ($string =~ /(\d+)/g) {
push @queuelist, $1;
}
I'd do something like this:
my @numbers;
if (m/Messages pushed to the Order Book queues ([\d\s]+)/) {
@numbers = split(/\s+/, $1);
}
}
No need to cram it into one regex.
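If the line might contain other numbers before the trailing ones, a two-step variant along these lines could help. This is only a sketch: the sub name trailing_numbers is made up, and it assumes the numbers you want are the whitespace-separated run at the very end of the line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the run of numbers at the end of a line as a list.
sub trailing_numbers {
    my ($line) = @_;
    # Step 1: isolate the trailing run of whitespace-separated digit groups.
    my ($tail) = $line =~ /((?:\s+\d+)+)\s*$/ or return;
    # Step 2: in list context, /g returns every capture as a list.
    return $tail =~ /(\d+)/g;
}

my $line = 'Messages pushed to the Order Book queues 123691 121574 146343 103046 161253';
my @numbers = trailing_numbers($line);
print scalar(@numbers), " values: @numbers\n";   # 5 values: 123691 121574 146343 103046 161253
```

Call it in list context, as above; in scalar context the /g match behaves differently and returns only true or false.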

Regular expression replace a word by a link

I want to write a regular expression that will replace the word Paris with a link, but only where the word is not already part of a link.
Example:
i'm living in Paris, near Paris Gare du Nord, i love Paris.
would become
i'm living.........near Paris..........i love Paris.
This is hard to do in one step. Writing a single regex that does that is virtually impossible.
Try a two-step approach.
Put a link around every "Paris" there is, regardless if there already is another link present.
Find all incorrectly nested links (<a href="...">Paris</a>), and eliminate the inner link.
Regex for step one is dead-simple:
\bParis\b
Regex for step two is slightly more complex:
(<a[^>]+>.*?(?!:</a>))<a[^>]+>(Paris)</a>
Use that one on the whole string and replace it with the content of match groups 1 and 2, effectively removing the surplus inner link.
Explanation of regex #2 in plain words:
Find every link (<a[^>]+>), optionally followed by anything that is not itself followed by a closing link (.*?(?!:</a>)). Save it into match group 1.
Now look for the next link (<a[^>]+>). Make sure it is there, but do not save it.
Now look for the word Paris. Save it into match group 2.
Look for a closing link (</a>). Make sure it is there, but don't save it.
Replace everything with the content of groups 1 and 2, thereby losing everything you did not save.
The approach assumes these side conditions:
Your input HTML is not horribly broken.
Your regex flavor supports non-greedy quantifiers (.*?) and zero-width negative look-ahead assertions ((?!:...)).
You wrap the word "Paris" only in a link in step 1, no additional characters. Every "Paris" becomes <a href="...">Paris</a>, or step two will fail (until you change the second regex).
BTW: regex #2 explicitly allows for constructs like this:
in the <b>capital of France</b>, <a href="">Paris</a>
The surplus link comes from step one, replacement result of step 2 will be:
in the <b>capital of France</b>, Paris
You could search for this regular expression:
(<a[^>]*>.*?</a>)|Paris
This regex matches a link, which it captures into the first (and only) capturing group, or the word Paris.
Replace the match with your link only if the capturing group did not match anything.
E.g. in C#:
resultString =
Regex.Replace(
subjectString,
"(<a[^>]*>.*?</a>)|Paris",
new MatchEvaluator(ComputeReplacement));
public String ComputeReplacement(Match m) {
if (m.Groups[1].Success) {
return m.Groups[1].Value;
} else {
return "Paris";
}
}
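The same idea can be sketched in Perl with s///e, where the replacement is code that checks whether the capture group took part in the match. The sub name link_word and the example URL are placeholders, not anything from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: link every occurrence of $word that is not already inside an <a> element.
sub link_word {
    my ($html, $word, $url) = @_;
    # Alternative 1 captures an existing link and is put back unchanged;
    # alternative 2 matches the bare word, leaving $1 undefined.
    $html =~ s{(<a\b[^>]*>.*?</a>)|\b\Q$word\E\b}{ defined($1) ? $1 : qq{<a href="$url">$word</a>} }gse;
    return $html;
}

my $text = q{I live in Paris, near <a href="/gare">Paris Gare du Nord</a>, I love Paris.};
print link_word($text, 'Paris', 'https://example.com/paris'), "\n";
```

As with the C# version, this assumes links are not nested and the HTML is reasonably well formed.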
Traditional answer for such question: use a real HTML parser. Because REs aren't really good at operating in a context. And HTML is complex, a 'a' tag can have attributes or not, in any order, can have HTML in the link or not, etc.
Regular expression:
!(<a.*</a>.*)*Paris!isU
Replacement:
$1Paris
$1 refers to the first sub-pattern (at least in PHP). Depending on the language you use it could be slightly different.
This should replace all occurrences of "Paris" with the link in the replacement. It just checks whether all opening a-Tags were closed before "Paris".
PHP example:
<?php
$s = 'i\'m living in Paris, near Paris Gare du Nord, i love Paris.';
$regex = '!(<a.*</a>.*)*Paris!isU';
$replace = '$1Paris';
$result = preg_replace( $regex, $replace, $s);
?>
Addition:
This is not the best solution. One situation where this regex won't work is when you have a img-Tag, which is not within an a-Element. When you set the title-Attribute of that image to "Paris", this "Paris" will be replaced, too. And that's not what you want.
Nevertheless I see no way to solve your problem completely with a simple regular expression.
If you weren't limited to using Regular expressions in this case, XSLT is a good choice for a language in which you can define this replacement, because it 'understands' XML.
You define two templates:
One template finds links and removes those links that don't have "Paris" as the body text. Another template finds everything else, splits it into words and adds tags.
$pattern = 'Paris';
$text = 'i\'m living in Paris, near Paris Gare du Nord, i love Paris.';
// 1. Define 2 arrays:
// $matches[1] - array of links with our keyword
// $matches[2] - array of keyword
preg_match_all('#(<a[^>]*?>[^<]*?'.$pattern.'[^<]*?</a>)|(?<!\pL)('.$pattern.')(?!\pL)#', $text, $matches);
// Exists keywords for replace? Define first keyword without tag <a>
$number = array_search($pattern, $matches[2]);
// Keyword exists, let's go rock
if ($number !== FALSE) {
// Replace all link with temporary value
foreach ($matches[1] as $k => $tag) {
$text = preg_replace('#(<a[^>]*?>[^<]*?'.$pattern.'[^<]*?</a>)#', 'KEYWORD_IS_ALREADY_LINK_'.$k, $text, 1);
}
// Replace our keywords with link
$text = preg_replace('/(?<!\pL)('.$pattern.')(?!\pL)/', ''.$pattern.'', $text);
// Return link
foreach ($matches[1] as $k => $tag) {
$text = str_replace('KEYWORD_IS_ALREADY_LINK_'.$k, $tag, $text);
}
// It's work!
echo $text;
}
Regexes don't replace. Languages do.
Languages and libraries would also read from the database or file that holds the list of words you care about, and associate a URL with their name. Here's the easiest substitution I can imagine possible with a single regex (Perl is used for the replacement syntax).
s/([a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/i
Proper names might work better:
s/([A-Z][a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/gi;
Of course "Baton Rouge" would become two links for:
Baton
Rouge
In Perl, you can do this:
# Longest names first, so that "Baton Rouge" is tried before "Baton".
my $barred_list_of_cities
= join( '|'
, sort { ( length($b) <=> length($a) ) || ( $a cmp $b ) } keys %url_for_city_of
);
s/($barred_list_of_cities)/<a href="$url_for_city_of{$1}">$1<\/a>/g;
But again, it's a language that implements a set of operations for regexes; regexes don't do a thing. (In reality, it's such a common application that I'd be surprised if there isn't a CPAN module out there somewhere that does this, and you just need to load the hash.)
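Here is a self-contained sketch of the hash-driven substitution. The city names and example.com URLs are invented sample data, link_cities is a hypothetical wrapper, and the quotemeta call and \b anchors are my own additions, on the assumption that the keys are plain text. This sketch sorts the keys longest first so a longer name wins over its own prefix.

```perl
use strict;
use warnings;

# Sample data (made up): city name => URL.
my %url_for_city_of = (
    'Baton Rouge' => 'https://example.com/Baton_Rouge',
    'Baton'       => 'https://example.com/Baton',
    'Paris'       => 'https://example.com/Paris',
);

# Build one alternation, longest names first, with regex metacharacters quoted.
my $barred_list_of_cities = join '|',
    map  { quotemeta }
    sort { ( length($b) <=> length($a) ) || ( $a cmp $b ) }
    keys %url_for_city_of;

# Hypothetical helper: wrap every known city name in a link.
sub link_cities {
    my ($text) = @_;
    $text =~ s/\b($barred_list_of_cities)\b/<a href="$url_for_city_of{$1}">$1<\/a>/g;
    return $text;
}

print link_cities('From Baton Rouge to Paris'), "\n";
```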