Perl replace nth substring in a string - regex

I have a scenario in which I need to replace the nth sub-string in a string.
s/sub-string/new-string/g; replaces all the sub-strings, but I need to replace only one particular occurrence.
Please help me with this.

To replace the nth occurrence of a string on each line using sed, put the number n (for example, 3) in the flag position:
sed 's/find_string/replace_string/n'
For replacing the substring, we need to know what you want to replace. Give an example.

This question might be interesting: Perl regex replace count
You might do something like this:
use strict;
use warnings;
my $count = 3;
my $str = "blublublublublu";
$str =~ s/(lu)/--$count == 0 ? "LA":$1/ge;
print $str;

Try this:
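s/((?:sub-string){2})sub-string/$1new-string/
Adjust 2 according to your needs (it's your 'n' - 1). The substring is wrapped in (?: ... ) so that the quantifier applies to the whole substring and not just its last character. Note that this only works when nothing separates the substrings: 'abcabcabc' would work, but 'abcdefabcabc' won't.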

You can also do like this
my $i=0;
s/(old-string)/++$i == 3 ? "new_string" : $1/ge;   # "== 3" replaces only the 3rd match; "% 3" would replace every 3rd
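With a counter inside s///ge, a modulus test and an equality test behave differently once the string holds more than n matches. A small sketch contrasting the two; the sample string and variable names are made up for illustration:

```perl
use strict;
use warnings;

my $s = "a-a-a-a-a-a";

# Modulus test: fires on the 3rd, 6th, 9th, ... match.
my $i = 0;
(my $every_third = $s) =~ s/(a)/++$i % 3 ? $1 : "X"/ge;

# Equality test: fires on the 3rd match only.
my $j = 0;
(my $third_only = $s) =~ s/(a)/++$j == 3 ? "X" : $1/ge;

print "$every_third\n$third_only\n";
```

The first line printed is a-a-X-a-a-X and the second is a-a-X-a-a-a.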

I'm really a believer that there's no point building extra complexity into a regular expression unless it's truly necessary to do so (or unless you're just having fun). In code I actually planned to use I would keep it simple, like this:
my $string = "one two three four five";
$string =~ m/\w+\s*/g for 1 .. 2;
substr( $string,pos($string) ) =~ s/(\w+)/3/;
print "$string\n";
Using the m//g in scalar context causes it to match one time per iteration of the for loop. On each iteration pos() keeps track of the end of the most recent match on $string. Once you've gone through 'n' iterations (two in this case), you can plug pos() into substr(). Use substr($string, ...) as an lvalue. It will constrain the regexp match to begin at whatever position you tell it in the second arg. We're plugging pos in there, which constrains it to take its next match wherever the last match left off.
This approach eliminates an explicit counter (though the for loop is essentially the same thing without naming a counter variable). This approach also scales better than a s//condition ? result : result/eg approach because it will stop after that third match is accomplished, rather than continuing to try to match until the end of a potentially large string is reached. In other words, the s///eg approach doesn't constrain the matching, it only deals conditionally with the outcome of an arbitrarily large number of successful matches.
In a previous question on the same topic I once embedded a counter in the left side of the s/// operator. While it worked for that specific case, it's not an ideal solution because it's prone to being thrown off by backtracking. That's another case where keeping it simple would have been the best approach. I mention it here so that you can avoid temptation to try such a trick (unless you want to have fun with backtracking).
I believe the approach I've posted here is very clear; you look at it and know what's happening: match twice, keep track of last match position, now match a third time with substitution. You can have clever, you can have efficient, and you can have clear. But sometimes you can't have all three. Here you get efficient and clear.
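The same stop-at-the-nth-match idea can be wrapped in a reusable sub. This is only a sketch: the name replace_nth and its argument order are invented here, and it uses an explicit counter with the @- and @+ offset arrays in place of the for loop and pos().

```perl
use strict;
use warnings;

# Hypothetical helper: replace only the $n-th match of $re in $str.
sub replace_nth {
    my ($str, $re, $n, $new) = @_;
    my $seen = 0;
    while ($str =~ /$re/g) {
        next unless ++$seen == $n;
        # $-[0] and $+[0] are the start and end offsets of the last match
        substr($str, $-[0], $+[0] - $-[0]) = $new;
        last;    # stop scanning once the n-th match is replaced
    }
    return $str;
}

print replace_nth("one two three four five", qr/\w+/, 3, "3"), "\n";
```

Matching stops as soon as the n-th occurrence has been replaced, so the rest of a long string is never scanned. If there are fewer than n matches, the string comes back unchanged.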

Try this (2 is n - 1; the text before the nth occurrence is captured and put back with $1):
s/((?:old-string.*?){2})old-string/$1new-string/

See perlvar for @- and @+.
my $str= "one two three four five";
if ( $str =~ /(?: (\w+) .*? ){3}/x ) {
substr $str, $-[1], $+[1] - $-[1], "-c-e-n-s-o-r-e-d-";
}
print $str, "\n";
The regex finds the 3rd instance, and captures its start-word in $1.
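To make the offsets concrete, here is a small sketch that prints what $-[1] and $+[1] hold after that match. The @span variable is introduced only for illustration.

```perl
use strict;
use warnings;

my $str = "one two three four five";
my @span;
if ($str =~ /(?: (\w+) .*? ){3}/x) {
    # $-[1] is the offset where group 1 starts, $+[1] the offset just past its end
    @span = ($-[1], $+[1]);
    printf "group 1 spans %d..%d: '%s'\n",
        @span, substr($str, $span[0], $span[1] - $span[0]);
}
```

For this string, group 1 ends up on "three", which starts at offset 8 and ends at offset 13.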

Related

Make a regular expression in perl to grep value work on a string with different endings

I have this code in perl where I want to extract the value of 'EUR_AF', in this case '0.39'.
Sometimes 'EUR_AF' ends with ';', sometimes it doesn't.
Alternatively, 'EUR_AF' may end with '=0' instead of '=0.39;' or '=0.39'.
How do I make the code handle that? Can't seem to find it online...I could of course wrap everything in an almost endless if-elsif-else statement, but that seems overkill.
Example text:
AVGPOST=0.9092;AN=2184;RSQ=0.5988;ERATE=0.0081;AC=144;VT=SNP;THETA=0.0045;AA=A;SNPSOURCE=LOWCOV;LDAF=0.0959;AF=0.07;ASN_AF=0.05;AMR_AF=0.10;AFR_AF=0.11;EUR_AF=0.039
Code: $INFO =~ m/\;EUR\_AF\=(.*?)(;)/
I did find that: $INFO =~ m/\;EUR\_AF\=(.*?0)/ handles the cases of EUR_AF=0, but how to handle alternative scenarios efficiently?
Extract one value:
my ($eur_af) = $s =~ /(?:^|;)EUR_AF=([^;]*)/;
my ($eur_af) = ";$s" =~ /;EUR_AF=([^;]*)/;
Extract all values:
my %rec = split(/[=;]/, $s);
my $eur_af = $rec{EUR_AF};
This regex should work for you: (?<=EUR_AF=)\d+(\.\d+)?
It means
(?<=EUR_AF=) - look for a string preceded by EUR_AF=
\d+(\.\d+)? - consisting of one or more digits, optionally followed by a decimal point and more digits
EDIT: I originally wanted the whole regex to return the correct result, not only the capture group. If you want the correct capture group, change it to (?<=EUR_AF=)(\d+(?:\.\d+)?)
I have found the answer. The code:
$INFO =~ m/(?:^|;)EUR_AF=([^;]*)/
seems to handle the cases where EUR_AF=0 and EUR_AF=0.39, ending with or without ;. The captured value in $1 will be 0 or 0.39.
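A quick way to check that pattern against the different endings is to run it over a few sample strings. The sketch below assumes a helper named eur_af (a made-up name) and shortened INFO strings:

```perl
use strict;
use warnings;

# Hypothetical helper: return the EUR_AF value from an INFO string, or undef.
sub eur_af {
    my ($info) = @_;
    # A match in list context returns the captures; no match leaves $value undef
    my ($value) = $info =~ /(?:^|;)EUR_AF=([^;]*)/;
    return $value;
}

for my $info ("AF=0.07;EUR_AF=0.39;X=1", "AF=0.07;EUR_AF=0.39", "EUR_AF=0;AF=0.07") {
    print eur_af($info), "\n";
}
```

The (?:^|;) prefix keeps a longer key that merely ends in EUR_AF from matching, and [^;]* stops at the next semicolon or at the end of the string, whichever comes first.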

How to grab certain number of characters before and after a perl regex match?

I am crafting regexes that match certain terms best within html code. I'm doing this in an iterative process to whittle down matches to exclude things I don't want. So I craft a regex, run it, and spit out data that I then look through to see how well my match is working. For example, if I am looking for the term "tema" (the name of a trade association that provides standards) I might notice that it also matches "sitemap" and alter my regex in some way to exclude the unwanted items.
To make this easier, I want to print out my match along with some context, say 20 characters before and after the match, rather than the entire line, to make it easier to scan through the results. This is proving frustratingly hard to accomplish in a simple fashion.
For example, I would think this would work:
$line =~ /(.{,20}tema.{,20})/i;
That is, I want to match up to 20 of anything before and after my keyword and include it in the "context" I print out for scanning.
But it doesn't. Am I missing something here? If a{,20} will match up to 20 'a' characters, why won't .{,20} match 20 of anything that '.' will match?
Scratching my head.
Syntax:
atom{n} (exactly n)
atom{n,} (n or more)
atom{n,m} (n or more, but no more than m)
So,
say $1 if $line =~ /(.{0,20}tema.{0,20})/i;
Or if you're using /g and might get overlapping matches:
say "$1$2$3" while $line =~ /(.{0,20})\K(tema)(?=(.{0,20}))/ig;
(Before Perl 5.34, a{,20} doesn't "match up to 20 a characters"; without a lower bound, the braces are taken literally.)
How about searching with m/^(.*)tema(.*)$/ and then using substr or similar to get the last characters of $1 and the first characters of $2?
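Another way to get the context is to take it with substr from the match offsets, so the pattern itself does not need any .{0,20} padding. A minimal sketch, assuming a helper named with_context (a made-up name) and a made-up sample line:

```perl
use strict;
use warnings;

# Hypothetical helper: return each match of $re in $line together with
# up to $width characters of context on either side.
sub with_context {
    my ($line, $re, $width) = @_;
    my @out;
    while ($line =~ /$re/g) {
        my $start = $-[0] - $width;          # $-[0]: where the match starts
        $start = 0 if $start < 0;            # clamp at the start of the line
        my $end = $+[0] + $width;            # $+[0]: just past the match
        $end = length $line if $end > length $line;
        push @out, substr($line, $start, $end - $start);
    }
    return @out;
}

print "$_\n" for with_context('<a href="/sitemap.xml">TEMA standards</a>', qr/tema/i, 5);
```

Because the context is not consumed by the regex, the windows of neighbouring matches are free to overlap.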

remove up to _ in perl using regex?

How would I go about removing all characters before a "_" in perl? So if I had a string that was "124312412_hithere" it would replace the string as just "hithere". I imagine there is a very simple way to do this using regex, but I am still new dealing with that so I need help here.
Remove all characters up to and including "_":
s/^[^_]*_//;
Remove all characters before "_":
s/^[^_]*(?=_)//;
Remove all characters before "_" (assuming the presence of a "_"):
s/^[^_]*//;
This is a bit more verbose than it needs to be, but it is probably more valuable for seeing what's going on:
my $astring = "124312412_hithere";
my $find = "^[^_]*_";
my $replace = "_";
$astring =~ s/$find/$replace/;
print $astring;
Also, there's a bit of conflicting requirements in your question. If you just want hithere (without the leading _), then change it to:
$astring =~ s/$find//;
I know it's slightly different than what was asked, but in cases like this (where you KNOW the character you are looking for exists in the string) I prefer to use split:
$str = '124312412_hithere';
$str = (split (/_/, $str, 2))[1];
Here I am splitting the string into parts, using the '_' as a delimiter, but to a maximum of 2 parts. Then, I am assigning the second part back to $str.
There's still a regex in this solution (the /_/) but I think this is a much simpler solution to read and understand than regexes full of character classes, conditional matches, etc.
You can try out this: -
$_ = "124312412_hithere";
s/^[^_]*_//;
print $_; # hithere
Note that this will also remove the _ (as I infer from your sample output). If you want to keep the _ (your first statement leaves it unclear which you want), you would probably need to use a look-ahead as in @ikegami's answer.
Also, just to make it a little more clear, any substitution and matching in regex is applied by default on $_. So, you don't need to bind it to $_ explicitly. That is implied.
So, s/^[^_]*_//; is essentially the same as $_ =~ s/^[^_]*_//;, but the latter is not really required.
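The variants in this thread differ mainly in whether the "_" survives and in what happens when the string has no "_" at all. A small sketch that puts them side by side; the sub names are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical helpers, named here for illustration only.
sub strip_through_underscore {      # removes the prefix and the "_"
    my ($s) = @_;
    $s =~ s/^[^_]*_//;
    return $s;
}

sub strip_before_underscore {       # removes the prefix, keeps the "_"
    my ($s) = @_;
    $s =~ s/^[^_]*(?=_)//;
    return $s;
}

sub second_part_by_split {          # undef when there is no "_"
    my ($s) = @_;
    my $second = (split /_/, $s, 2)[1];
    return $second;
}

for my $in ("124312412_hithere", "no-underscore") {
    printf "%s | %s | %s\n",
        strip_through_underscore($in),
        strip_before_underscore($in),
        second_part_by_split($in) // '(undef)';
}
```

Both substitutions leave a string without "_" untouched, while the split version yields undef, which is why that answer assumes the character is known to be present.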

How can I match everything that is after the last occurrence of some char in a perl regular expression?

For example, return the part of the string that is after the last x in axxxghdfx445 (should return 445).
my($substr) = $string =~ /.*x(.*)/;
From perldoc perlre:
By default, a quantified subpattern is "greedy", that is, it will match
as many times as possible (given a particular starting location) while
still allowing the rest of the pattern to match.
That's why .*x will match up to the last occurrence of x.
The simplest way would be to use /([^x]*)$/
The first answer is a good one, but when talking about "something that does not contain"... I like to use the regex that "matches" it:
my ($substr) = $string =~ /.*x([^x]*)$/;
Very useful in some cases.
The simplest way is not a regular expression, but a simple split() and getting the last element.
$string="axxxghdfx445";
@s = split /x/, $string;
print $s[-1];
Yet another way to do it. It's not as simple as a single regular expression, but if you're optimizing for speed, this approach will probably be faster than anything using regex, including split.
my $s = 'axxxghdfx445';
my $p = rindex $s, 'x';
my $match = $p < 0 ? undef : substr($s, $p + 1);
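For comparison, the greedy-regex and rindex approaches can be written as two subs with the same interface. This is a sketch under assumptions: the sub names are made up, and the separator is passed in as a plain string, hence the \Q...\E quoting.

```perl
use strict;
use warnings;

# Hypothetical helpers: text after the last $ch in $s, or undef if $ch is absent.
sub after_last_regex {
    my ($s, $ch) = @_;
    # Greedy .* consumes as much as it can, so the literal $ch that follows
    # can only match at its last occurrence.
    my ($tail) = $s =~ /.*\Q$ch\E(.*)/s;
    return $tail;
}

sub after_last_rindex {
    my ($s, $ch) = @_;
    my $p = rindex $s, $ch;         # -1 when $ch does not occur
    return $p < 0 ? undef : substr($s, $p + length $ch);
}

for my $s ('axxxghdfx445', 'abcx', 'abc') {
    my @got = map { $_ // '(undef)' }
        after_last_regex($s, 'x'), after_last_rindex($s, 'x');
    print "$s: @got\n";
}
```

Both return an empty string when the separator is the last character and undef when it does not occur at all.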
I'm surprised no one has mentioned the special variable that does this, $': "$'" returns everything after the matched string. (perldoc perlre)
my $str = 'axxxghdfx445';
$str =~ /.*x/;   # greedy .* makes the match end at the last x (a bare /x/ would stop at the first)
# $' contains '445';
print $';
However, there is a cost (emphasis mine):
WARNING: Once Perl sees that you need one of $&, "$`", or "$'" anywhere
in the program, it has to provide them for every pattern match. This
may substantially slow your program. Perl uses the same mechanism to
produce $1, $2, etc, so you also pay a price for each pattern that
contains capturing parentheses. (To avoid this cost while retaining
the grouping behaviour, use the extended regular expression "(?: ... )"
instead.) But if you never use $&, "$`" or "$'", then patterns without
capturing parentheses will not be penalized. So avoid $&, "$'", and
"$`" if you can, but if you can't (and some algorithms really
appreciate them), once you've used them once, use them at will, because
you've already paid the price. As of 5.005, $& is not so costly as the
other two.
But wait, there's more! You get two operators for the price of one, act NOW!
As a workaround for this problem, Perl 5.10.0 introduces
"${^PREMATCH}", "${^MATCH}" and "${^POSTMATCH}", which are equivalent
to "$`", $& and "$'", except that they are only guaranteed to be
defined after a successful match that was executed with the "/p"
(preserve) modifier. The use of these variables incurs no global
performance penalty, unlike their punctuation char equivalents, however
at the trade-off that you have to tell perl when you want to use them.
my $str = 'axxxghdfx445';
$str =~ /.*x/p;   # greedy .* makes the match end at the last x
# ${^POSTMATCH} contains '445';
print ${^POSTMATCH};
I would humbly submit that this route is the best and most straight-forward
approach in most cases, since it does not require that you do special things
with your pattern construction in order to retrieve the postmatch portion, and there
is no performance penalty.
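One thing to watch with the postmatch variables: they hold whatever follows the match, so where the match ends decides what you get back. A short sketch of the difference, using the same sample string (the variable names are illustrative):

```perl
use strict;
use warnings;

my $str = 'axxxghdfx445';

$str =~ /x/p;                       # matches the FIRST x
my $after_first = ${^POSTMATCH};

$str =~ /.*x/p;                     # greedy .* backs off only as far as the LAST x
my $after_last = ${^POSTMATCH};

print "$after_first\n$after_last\n";
```

This prints xxghdfx445 and then 445, so the pattern needs the leading .* when the goal is the text after the last occurrence.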
Regular Expression : /([^x]+)$/ #assuming x is not last element of the string.

How can I fix my regex to not match too much with a greedy quantifier? [duplicate]

I have the following line:
"14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)"
I parse this by using a simple regexp:
if($line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/) {
my($ts, $hash, $pid, $handle, $quote) = ($1, $2, $3, $4, $5);
}
But the ; at the end messes things up and I don't know why. Shouldn't the greedy operator handle "everything"?
The greedy operator tries to grab as much stuff as it can and still match the string. What's happening is the first one (after "say") grabs "0ed673079715c343281355c2a1fde843;2", the second one takes "laka", the third finds "hello " and the fourth matches the parenthesis.
What you need to do is make all but the last one non-greedy, so they grab as little as possible and still match the string:
(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)
(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)
should work better
Although a regex can easily do this, I'm not sure it's the most straight-forward approach. It's probably the shortest, but that doesn't actually make it the most maintainable.
Instead, I'd suggest something like this:
$x="14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";
if (($ts,$rest) = $x =~ /(\d+:\d+)\s+(.*)/)
{
my($command,$hash,$pid,$handle,$quote) = split /;/, $rest, 5;
print join ",", map { "[$_]" } $ts,$command,$hash,$pid,$handle,$quote
}
This results in:
[14:48],[say],[0ed673079715c343281355c2a1fde843],[2],[laka],[hello ;)]
I think this is just a bit more readable. Not only that, I think it's also easier to debug and maintain, because this is closer to how you would do it if a human were to attempt the same thing with pen and paper. Break the string down into chunks that you can then parse easier - have the computer do exactly what you would do. When it comes time to make modifications, I think this one will fare better. YMMV.
Try making the first 3 (.*) ungreedy (.*?)
If the values in your semicolon-delimited list cannot include any semicolons themselves, you'll get the most efficient and straightforward regular expression simply by spelling that out. If certain values can only be, say, a string of hex characters, spell that out. Solutions using a lazy or greedy dot will always lead to a lot of useless backtracking when the regex does not match the subject string.
(\d+:\d+)\ssay;([a-f0-9]+);(\d+);(\w+);([^;\r\n]+)
You could make * non-greedy by appending a question mark:
$line =~ /(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)/
or you can match everything except a semicolon in each part except the last:
$line =~ /(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)/
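To see what each variant actually captures, the three patterns from this thread can be run against the sample line side by side. A minimal sketch; the array names are made up:

```perl
use strict;
use warnings;

my $line = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# A match in list context returns the captured groups.
my @greedy  = $line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/;
my @lazy    = $line =~ /(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)/;
my @negated = $line =~ /(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)/;

print join(",", map { "[$_]" } @$_), "\n" for \@greedy, \@lazy, \@negated;
```

The greedy version puts "0ed673079715c343281355c2a1fde843;2" in the hash field and leaves only ")" for the quote, while the lazy and negated-class versions both yield the hash, 2, laka and "hello ;)".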