Don't understand my regex's matches - regex

I'm currently reading XML tags from a file, but I have reduced the problem to this simple example.
#!/usr/bin/perl
use strict;
use warnings;
my $str = '<tag x="20" y="7" x="15" z="14"/>';
if($str =~ /<tag.*(x|y|z)=\"(\d+)\".*(x|y|z)=\"(\d+)\".*(x|y|z)=\"(\d+)\".*\/>/){
print "$1-$2\n";
print "$3-$4\n";
print "$5-$6\n";
}
As I understand my regex, the first x should match the first group, the first y the third group and the second x the fifth group.
So I expect as output:
x-20
y-7
x-15
But I get
y-7
x-15
z-14
Could someone explain what's happening here?

Use ? to make the * and + quantifiers non-greedy. They are greedy by default (i.e. .* matches as many characters as it can while still letting the rest of the pattern match).
$str =~ /<tag.*?(x|y|z)=\"(\d+)\".*?(x|y|z)=\"(\d+)\".*?(x|y|z)=\"(\d+)\".*\/>/
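To see why the greedy version skips the first x, it may help to run both patterns side by side. The following is a minimal sketch; the helper name pairs and the variables $attr, $greedy and $lazy are made up for illustration and are not part of the original code.

```perl
use strict;
use warnings;

# One attribute: a name (x, y or z) followed by a quoted number.
my $attr = qr/(x|y|z)="(\d+)"/;

# Hypothetical helper: return the six captures as three "name-value" strings.
sub pairs {
    my ($str, $re) = @_;
    my @c = $str =~ $re or return;
    return map { "$c[2*$_]-$c[2*$_+1]" } 0 .. 2;
}

my $str    = '<tag x="20" y="7" x="15" z="14"/>';
my $greedy = qr/<tag.*$attr.*$attr.*$attr.*\/>/;     # .* grabs as much as it can
my $lazy   = qr/<tag.*?$attr.*?$attr.*?$attr.*\/>/;  # .*? grabs as little as it can

print join(", ", pairs($str, $greedy)), "\n";   # y-7, x-15, z-14
print join(", ", pairs($str, $lazy)),   "\n";   # x-20, y-7, x-15
```

The first greedy .* runs to the end of the string and then backs off only as far as needed to leave room for three attributes, which is why x="20" ends up inside it.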

Instead of .* use \s+, because you actually want to match a run of whitespace characters, not a run of arbitrary characters.
If this is really an assignment, you should do it properly, and a regular expression is not the proper tool for XML. Since it is an assignment, just write a parser. It's easier than you think.
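As a rough illustration of the \s+ idea, the attributes can be collected one at a time with a /g match in a loop. This is only a sketch for the simple tag in the question, not a general XML parser, and the helper name attributes is an assumption of mine.

```perl
use strict;
use warnings;

# Hypothetical helper: return every name="digits" attribute of a tag, in order.
sub attributes {
    my ($tag) = @_;
    my @found;
    # \s+ only consumes the whitespace before an attribute,
    # so no attribute can be swallowed the way .* swallows it.
    while ($tag =~ /\s+(\w+)="(\d+)"/g) {
        push @found, [$1, $2];
    }
    return @found;
}

print "$_->[0]-$_->[1]\n" for attributes('<tag x="20" y="7" x="15" z="14"/>');
```

Each element of the returned list is a [name, value] pair, so repeated names such as x are kept in document order.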

Related

In perl match a dot when there're at least three words before it

I'm using (?<=(?:(?:\w|,|'){1,20} ){2}(?:\w|,|'){1,20} ?)\.
But it's not working as expected:
use v5.35.2;
use warnings;
use strict;
my $str = shift // q{If you have to go. you go. That's no problem.};
my $regex = qr/(?<=(?:(?:\w|,|'){1,20} ){2}(?:\w|,|'){1,20} ?)\./;
my @all_parts = split $regex, $str;
say for @all_parts;
It should print out If you have to go and you go. That's no problem
Is there an easier way to achieve this?
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/say/;
my $str = shift // q{If you have to go. you go. That's no problem.};
my $regex = qr/(?:\b[\w,']+\s*){3}\K\./;
my @all_parts = split $regex, $str;
say for @all_parts;
splits like you want. Using \K to discard everything before the period from the actual match is the key bit. (There's probably tweaks that could be made to the RE to better account for edge cases you didn't provide in your example string).
split / [\w'] (?: [\s,]+ [\w']+ ){2} \K \. /x
Notes:
It's usually easier and more efficient to use \K instead of a lookbehind. It also has the advantage that it can look further back than the 255 chars a real variable-length lookbehind can look back. But it has the disadvantage that it can't "look behind" further than the end of the previous match. This isn't a problem here.
Feel free to remove the whitespace. If you do, you can also remove the x.
Adding a + after each existing + should make it a tiny bit faster.
You obviously consider a's to be one word, but the earlier answer can count it as two. For example, it considers the . to be preceded by three words in a's b. c.
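The word-counting difference between the two patterns above can be checked with a small test. This is a sketch only; the helper name finds_split_point and the variable names $first and $second are hypothetical.

```perl
use strict;
use warnings;

my $first  = qr/(?:\b[\w,']+\s*){3}\K\./;
my $second = qr/ [\w'] (?: [\s,]+ [\w']+ ){2} \K \. /x;

# Hypothetical helper: does the pattern find a period it would split on?
sub finds_split_point {
    my ($re, $str) = @_;
    return $str =~ $re ? 1 : 0;
}

for my $s ("a's b. c.", "If you have to go. you go.") {
    printf "%-30s first=%d second=%d\n", "'$s'",
        finds_split_point($first, $s), finds_split_point($second, $s);
}
```

For "a's b. c." the first pattern finds a split point, because \b lets it count a, s and b as separate words, while the second pattern finds none. Both patterns find one in the longer sentence.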

Make a regular expression in perl to grep value work on a string with different endings

I have this code in perl where I want to extract the value of 'EUR_AF', in this case '0.39'.
Sometimes 'EUR_AF' ends with ';', sometimes it doesn't.
Alternatively, 'EUR_AF' may end with '=0' instead of '=0.39;' or '=0.39'.
How do I make the code handle that? Can't seem to find it online...I could of course wrap everything in an almost endless if-elsif-else statement, but that seems overkill.
Example text:
AVGPOST=0.9092;AN=2184;RSQ=0.5988;ERATE=0.0081;AC=144;VT=SNP;THETA=0.0045;AA=A;SNPSOURCE=LOWCOV;LDAF=0.0959;AF=0.07;ASN_AF=0.05;AMR_AF=0.10;AFR_AF=0.11;EUR_AF=0.039
Code: $INFO =~ m/\;EUR\_AF\=(.*?)(;)/
I did find that: $INFO =~ m/\;EUR\_AF\=(.*?0)/ handles the cases of EUR_AF=0, but how to handle alternative scenarios efficiently?
Extract one value:
my ($eur_af) = $s =~ /(?:^|;)EUR_AF=([^;]*)/;
my ($eur_af) = ";$s" =~ /;EUR_AF=([^;]*)/;
Extract all values:
my %rec = split(/[=;]/, $s);
my $eur_af = $rec{EUR_AF};
This regex should work for you: (?<=EUR_AF=)\d+(\.\d+)?
It means
(?<=EUR_AF=) - look for a string preceded by EUR_AF=
\d+(\.\d+)? - consisting of one or more digits, optionally followed by a decimal point and more digits
EDIT: I originally wanted the whole regex to return the correct result, not only the capture group. If you want the correct capture group edit it to (?<=EUR_AF=)(\d+(?:\.\d+)?)
I have found the answer. The code:
$INFO =~ m/(?:^|;)EUR_AF=([^;]*)/
seems to handle the cases where EUR_AF=0 and EUR_AF=0.39, ending with or without ;. The captured value ($1) will be 0 or 0.39.
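A quick way to convince yourself that [^;]* copes with all the endings is to run the pattern over a few sample strings. This is a minimal sketch; the helper name eur_af and the shortened sample strings (including the made-up trailing field X=1) are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the EUR_AF value of an INFO string, or undef.
sub eur_af {
    my ($info) = @_;
    # [^;]* stops at the next ; or at the end of the string, whichever comes first
    my ($value) = $info =~ /(?:^|;)EUR_AF=([^;]*)/;
    return $value;
}

for my $info ('AF=0.07;EUR_AF=0.39;X=1', 'AF=0.07;EUR_AF=0.39', 'EUR_AF=0', 'AF=0.07') {
    my $v = eur_af($info);
    print defined $v ? $v : 'not found', "\n";
}
```

The (?:^|;) prefix also keeps the pattern from matching a longer key that merely ends in EUR_AF.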

Regular expression which matches a specific pattern

I want to find a regular expression in Perl which matches a pattern such as this:
my $sumthing = "people say
for -->";
Over here after say there is a single newline character. So I need to find a regular expression which could match such a pattern which includes a newline within a pattern. Please help me to find this as I'm new to Perl & regular expression.
The possible methods I tried were these:
if (($sumthing !~ (/\n+$/)) && ($sumthing !~ (/^\n+/m)))
Kindly help me find an expression that matches this kind of pattern; the attempts above do not give the desired output.
It's not clear what you want. Do you want to match that string exactly? If so, you could use
$sumthing =~ /^people say\nfor -->\z/
or
$sumthing eq "people say\nfor -->"
Or maybe what you need to know is that . matches any character including newline when /s is used?
/people .* -->/s
The following will check for anything then new line then anything. Not sure if I totally understood your question.
if($sumthing =~ m/.*\n.*/)
Have a look at the /s modifier, which causes . to match anything, including a newline.
my $str = "people say for\nsomething...";
$str =~ m{say(.*)}s and print "'$1'\n";
This would print:
' for
something...'
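The effect of /s can be seen by running the same pattern with and without it against the string from the question. This is just a sketch; the helper names matches_plain and matches_dotall are made up.

```perl
use strict;
use warnings;

my $sumthing = "people say\nfor -->";

# Hypothetical helpers: the same pattern without and with the /s modifier.
sub matches_plain  { return $_[0] =~ /people .* -->/  ? 1 : 0 }
sub matches_dotall { return $_[0] =~ /people .* -->/s ? 1 : 0 }

print "without /s: ", matches_plain($sumthing),  "\n";   # 0
print "with /s:    ", matches_dotall($sumthing), "\n";   # 1
```

Without /s the . stops at the newline after say, so the pattern can never reach the -->.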

Perl replace nth substring in a string

I have a scenario in which I need to replace the nth sub-string in a string.
s/sub-string/new-string/g; will replace all the sub strings, but I need to do for a particular occurrence.
Please help me with this.
For replacing the nth occurrence of a string using sed, you can use this command:
sed 's/find_string/replace_string/n'
For replacing the substring, we need to know what you want to replace. Give an example.
This question might be interesting: Perl regex replace count
You might do something like this:
use strict;
use warnings;
my $count = 3;
my $str = "blublublublublu";
$str =~ s/(lu)/--$count == 0 ? "LA":$1/ge;
print $str;
Try this:
s/((?:sub-string){2})sub-string/${1}new-string/
Adjust 2 according to your needs (it is your 'n' - 1). The inner group is needed so that {2} repeats the whole sub-string rather than only its last character. Note that this only works when nothing separates the occurrences: 'abcabcabc' would work but 'abcdefabcabc' won't.
You can also do it like this (note that this replaces every third occurrence, not only the third one):
my $i=0;
s/(old-string)/++$i%3 ? $1 : "new_string"/ge;
I'm really a believer that there's no point building extra complexity into a regular expression unless it's truly necessary to do so (or unless you're just having fun). In code I actually planned to use I would keep it simple, like this:
my $string = "one two three four five";
$string =~ m/\w+\s*/g for 1 .. 2;
substr( $string,pos($string) ) =~ s/(\w+)/3/;
print "$string\n";
Using m//g in scalar context causes it to match once per iteration of the for loop. On each iteration, pos() keeps track of the end of the most recent match on $string. Once you've gone through 'n' iterations (two in this case), you can plug pos() into substr(). Use substr($string, ...) as an lvalue: it constrains the regexp match to begin at whatever position you tell it in the second argument. We're plugging pos() in there, which makes the next match start wherever the last match left off.
This approach eliminates an explicit counter (though the for loop is essentially the same thing without naming a counter variable). This approach also scales better than a s//condition ? result : result/eg approach because it will stop after that third match is accomplished, rather than continuing to try to match until the end of a potentially large string is reached. In other words, the s///eg approach doesn't constrain the matching, it only deals conditionally with the outcome of an arbitrarily large number of successful matches.
In a previous question on the same topic I once embedded a counter in the left side of the s/// operator. While it worked for that specific case, it's not an ideal solution because it's prone to being thrown off by backtracking. That's another case where keeping it simple would have been the best approach. I mention it here so that you can avoid temptation to try such a trick (unless you want to have fun with backtracking).
The approach I've posted here, I believe is very clear; you look at it and know what's happening: match twice, keep track of last match position, now match a third time with substitution. You can have clever, you can have efficient, and you can have clear. But sometimes you can't have all three. Here you get efficient and clear.
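The same "stop as soon as the n-th match is found" idea can be wrapped in a small reusable routine. The following is a sketch under my own assumptions, not the code from the answer: the name replace_nth and its argument order are hypothetical, and it uses the @- and @+ offset arrays with four-argument substr instead of an lvalue substr.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the $n-th match of $re in $str with $new.
# Returns the string unchanged if there are fewer than $n matches.
sub replace_nth {
    my ($str, $re, $n, $new) = @_;
    my $seen = 0;
    while ($str =~ /$re/g) {
        next unless ++$seen == $n;
        # $-[0] and $+[0] are the start and end offsets of the last match
        substr($str, $-[0], $+[0] - $-[0], $new);
        last;    # stop scanning once the n-th match has been replaced
    }
    return $str;
}

print replace_nth("one two three four five", qr/\w+/, 3, "3"), "\n";
print replace_nth("blublublublublu", qr/lu/, 3, "LA"), "\n";
```

Because the loop exits right after the replacement, nothing past the n-th match is scanned.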
Try this:
s/((?:old-string.*?){2})old-string/${1}new-string/
See perlvar for @- and @+.
my $str= "one two three four five";
if ( $str =~ /(?: (\w+) .*? ){3}/x ) {
substr $str, $-[1], $+[1] - $-[1], "-c-e-n-s-o-r-e-d-";
}
print $str, "\n";
The regex finds the 3rd instance, and captures its start-word in $1.

Perl search and replace the last character occurrence

I have what I thought would be an easy problem to solve but I am not able to find the answer to this.
How can I find and replace the last occurrence of a character in a string?
I have a string: GE1/0/1 and I would like it to be: GE1/0:1 <- This can be variable length so no substrings please.
Clarification:
I am looking to replace the last / with a : no matter what comes before or after it.
use strict;
use warnings;
my $a = 'GE1/0/1';
(my $b = $a) =~ s{(.*)/}{$1:}xms;
print "$b\n";
I use the greedy behaviour of .*
Perhaps I have not understood the problem with variable length, but I would do the following:
You can match what you want with the regex :
(.+)/
So, this Perl script
my $text = 'GE1/0/1';
$text =~ s|(.+)/|$1:|;
print 'Result : '.$text;
will output :
Result : GE1/0:1
Because the '+' quantifier is greedy by default, .+ consumes as much as possible, so the / in the pattern matches only the last slash character.
Hope this is what you were asking.
This finds a slash and looks ahead to make sure there are no more slashes past it:
Raw regex:
/(?=[^/]*$)
I think the code would look something like this, but perl isn't my language:
$string =~ s!/(?=[^/]*$)!\:!g;
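The greedy approach and the lookahead approach from the answers above can be compared directly. This is a minimal sketch; the helper names last_slash_greedy and last_slash_lookahead are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: greedy .* runs to the final slash, which is then replaced.
sub last_slash_greedy {
    my ($s) = @_;
    $s =~ s{(.*)/}{$1:}s;
    return $s;
}

# Hypothetical helper: replace a slash that has no later slash after it.
sub last_slash_lookahead {
    my ($s) = @_;
    $s =~ s{/(?=[^/]*\z)}{:};
    return $s;
}

for my $s ('GE1/0/1', 'GE1/0/1/22', 'GE1') {
    print "$s -> ", last_slash_greedy($s), " / ", last_slash_lookahead($s), "\n";
}
```

Both helpers leave a string without any slash unchanged, and neither depends on the length of the parts around the last slash.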
"last occurrence in a string" is slightly ambiguous. The way I see it, you can mean either:
"Foo: 123, yada: GE1/0/1, Bar: null"
Meaning the last occurrence in the "word" GE1/0/1, or:
"GE1/0/1"
As a complete string.
In the latter case, it is a rather simple matter, you only have to decide how specific you can be in your regex.
$str =~ s{/(\d+)$}{:$1};
Is perfectly fine, assuming the last character(s) can only be digits.
In the former case, which I don't think you are referring to, but I'll include anyway, you'd need to be much more specific:
$str =~ s{(\byada:\s+\w+/\w+)/(\w+\b)}{$1:$2};