Perl regex: replace all domains except a particular one in a string

I have a string of domains like the one below:
$string = 'https://code.google.com && http://mycode.com/data && times.com && https://thehindu.com';
I want to replace all domains except mycode.com with mycode.com/unknown
So the output of above string after applying regex should be:
https://mycode.com/unknown && http://mycode.com/data && mycode.com/unknown && https://mycode.com/unknown
I have tried below regex, but it changes http://mycode.com/data also:
$string =~ s/(?<!mycode)[a-z\.]+?\.(com|org|net)/mycode\.com\/unknown/g;
How should I modify my regex so that it does not match mycode.com?

$s =~ s{
( [a-z.]+\.(?:com|org|net) )
(?![a-z.])
}{
$1 eq "mycode.com" ? $1 : "mycode.com/unknown"
}xeg;
or
$s =~ s{
(?<![a-z.])
(?! mycode\.com (?![a-z.]) )
[a-z.]+\.(?:com|org|net) (?![a-z.])
}{mycode.com/unknown}xg;
Both handle edge cases such as:
mycoder.com
mycode.combo.com
mycode.combo
notmycode.com
foo.combo
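A quick way to check those edge cases is a small test loop. This is only a sketch: rewrite_domains is a hypothetical wrapper name that does not appear in the answer, and the regex inside it is the callback version shown above.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the callback-style substitution.
sub rewrite_domains {
    my ($s) = @_;
    $s =~ s{
        ( [a-z.]+ \. (?:com|org|net) )   # a whole dotted name ending in a known TLD
        (?![a-z.])                       # ...that is not continued by more name characters
    }{
        $1 eq "mycode.com" ? $1 : "mycode.com/unknown"
    }xeg;
    return $s;
}

for my $in (qw( mycode.com mycoder.com notmycode.com mycode.combo.com mycode.combo foo.combo )) {
    printf "%-18s -> %s\n", $in, rewrite_domains($in);
}
```

With this helper, mycode.com, mycode.combo and foo.combo come back unchanged, while mycoder.com, notmycode.com and mycode.combo.com are rewritten to mycode.com/unknown.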

You're really close. Instead of a lookbehind, use a lookahead:
my $string = 'https://code.google.com && http://mycode.com/data && times.com && https://thehindu.com';
$string =~ s~(?<![a-z.])(?!mycode)[a-z.]+\.(?:com|org|net)~mycode.com/unknown~g;
say $string;
Output:
https://mycode.com/unknown && http://mycode.com/data && mycode.com/unknown && https://mycode.com/unknown

Try replacing the matching part of the regex with this:
(http:\/\/)?(?!(mycode\.|ycode\.|code\.|ode\.|de\.|e\.))[a-z\.]+?\.(com|org|net)

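Please try something like this (it only handles URLs that start with http:// or https://, and it keeps the scheme in both branches so that http://mycode.com/data is left intact):
$string =~ s{(https?://)([^/\s]+)}{ $2 eq 'mycode.com' ? "$1$2" : "$1mycode.com/unknown" }eg;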

(Ignoring URLs without http://... and using 3 non-scrolling lines)
my $s = 'https://code.google.com && ....';
$s =~ s!//(?=mycode\.com($|[^.\w]))!\cA!g; # // -> Ctrl-A
$s =~ s!//(\S+)!//mycode.com/unknown!g;
$s =~ s!\cA!//!g; # Ctrl-A -> //
Basic idea:
protect/mark/save the special cases
substitute the general situations
put-back the specials

Related

How to extract strings with any operator between them?

I have an array @arr = ( "a=b", "a>b", "a<b", "a!=b", "a-b" ). What is the best way to get a and b with any operator between them? I can extract them with
for($i=0; $i<=$#arr; $i++){
$str = $arr[$i];
if($str =~ m/^(.*?)(\s*=\s*)(.*)(;)/g){
my $d = $1;
my $e = $3;
}
}
followed by an if statement for each possible operator such as "!=", "<", etc., but this makes my code look messy. Is there a better solution?
You could try something like this one liner
perl -e '@a = ("a=b","a>b","a<b","a!=b","a-b"); for $l (@a) { $l =~ s/(.).*(.)/$1/; print "$1$2\n"};'
The key thing is the greedy match, i.e. "(.*)", between the two single-character matches, i.e. "(.)". To really make sure that you anchor at the start and end of the strings, you could use this
perl -e '@a = ("a=b","a>b","a<b","a!=b","a-b"); for $l (@a) { $l =~ s/^(.).*(.)$/$1/; print "$1$2\n"};'
A complete working example that demonstrates the whole thing would be
#!/usr/bin/perl
use strict;
use warnings;
my @expressions = ("a=b","a>b","a<b","a!=b","a-b");
for my $exp (@expressions) {
$exp =~ s/^(.).*(.)$/$1$2/;
print "$1$2 is the same as $exp\n";
};
A very simple regex might be
/^(\w+)\s*(\W+)\s*(\w+)$/
Or you enumerate possible operators
/^(\w+)\s*(=|!=|<|>|<=|>=|\+|-|\*|\/|==)\s*(\w+)$/
It depends whether the input can be trusted or not. If not, you might have to be more meticulous w.r.t. the identifiers, too. Here's a simpler loop, and no need to use m//g(lobal). Not sure about the semicolon - omitted it.
my @arr = ( "a=b", "a>b", "a<b", "a!=b", "a-b" );
for my $str (@arr){
if($str =~ /^(\w+)\s*(=|!=|<|>|<=|>=|\+|-|\*|\/|==)\s*(\w+)$/ ){
my $d = $1;
my $e = $3;
print "d=$d e=$e\n";
}
}
Later: if you enumerate the operators, you can also add word operators:
if($str =~ /^(\w+)\s*(=|!=|<|>|<=|>=|\+|-|\*|\/|==|x?or|and)\s*(\w+)$/ ){
...
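If the operator list keeps growing, one option is to build the alternation from an array instead of typing it inline. The following is a minimal sketch under some assumptions: @ops is an illustrative operator list, split_expr is a hypothetical helper name, and operands are assumed to be plain \w+ words as in the regex above. Sorting longest-first means "<=" is tried before "<", so the pattern does not have to rely on backtracking to pick the right operator.

```perl
use strict;
use warnings;

# Assumed operator list; extend as needed.
my @ops = ('=', '!=', '<', '>', '<=', '>=', '==', '+', '-', '*', '/');

# Longest first, so "<=" is tried before "<"; quotemeta escapes + * / etc.
my $op_re = join '|',
            map  { quotemeta($_) }
            sort { length($b) <=> length($a) } @ops;

# Returns (lhs, operator, rhs), or an empty list if the string does not fit.
sub split_expr {
    my ($str) = @_;
    return $str =~ /^(\w+)\s*($op_re)\s*(\w+)$/ ? ($1, $2, $3) : ();
}

for my $str ("a=b", "a>b", "a<b", "a!=b", "a-b") {
    my ($lhs, $op, $rhs) = split_expr($str) or next;
    print "lhs=$lhs op=$op rhs=$rhs\n";
}
```

Returning an empty list on failure lets the caller skip untrusted or malformed input with a simple `or next`.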
If there are always 'a' and 'b' at the beginning and the end, you could try:
my $str = 'a<b';
my( $op ) = $str =~ /^a(.*)b$/;
Not a well thought out answer. Will reconsider the problem.

A non-greedy Perl regular expression

I need to write a script which does the following:
$ cat testdata.txt
this is my file containing data
for checking pattern matching with a patt on the back!
only one line contains the p word.
$ ./mygrep5 pat th testdata.txt
this is my file containing data
for checking PATTERN MATCHING WITH a PATT ON THe back!
only one line contains the p word.
I have been able to print the line which is amended with the "a" capitalized as well. I have no idea how to only take what is needed.
I have been messing around (below is my script so far) and all I manage to return is the "PATT ON TH" part.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Data::Dump 'pp';
my ($f, $s, $t) = @ARGV;
my @output_lines;
open(my $fh, '<', $t);
while (my $line = <$fh>) {
if ($line =~ /$f/ && $line =~ /$s/) {
$line =~ s/($f.+?$s)/$1/g;
my $sub_phrase = uc $1;
$line =~ s/$1/$sub_phrase/g;
print $line;
}
#else {
# print $line;
#}
}
close($fh);
which returns: "for checking pattern matching with a PATT ON THe back!"
How can I fix this problem?
It sounds like you want to capitalize from pat to th except for instances of a surrounded by spaces. The easiest way is to uppercase the whole thing, and then fix any instances of A surrounded by spaces.
sub capitalize {
my $s = shift;
my $uc = uc($s);
$uc =~ s/ \s \K A (?=\s) /a/xg;
return $uc;
}
s{ ( \Q$f\E .* \Q$s\E ) }{ capitalize($1) }xseg;
The downside is that it will replace any existing A surrounded by spaces with a. The following is more complicated, but it doesn't suffer from that problem:
sub capitalize {
my $s = shift;
my @parts = $s =~ m{ \G ( \s+ | \S+ ) }xg;
for (@parts) {
$_ = uc($_) if $_ ne "a";
}
return join('', @parts);
}
s{ ( \Q$f\E .* \Q$s\E ) }{ capitalize($1) }xseg;
The rest of the code can be simplified:
#!/usr/bin/perl
use strict;
use warnings;
sub capitalize { ... }
my $f = shift;
my $s = shift;
while (<>) {
s{ ( \Q$f\E .* \Q$s\E ) }{ capitalize($1) }xseg;
print;
}
So, if you want to match each sequence that starts with pat and ends with th, non-greedily, and uppercase that sequence, you can simply use an expression on the right side of your substitution:
$line =~ s/($f.+?$s)/uc($1)/eg;
And that's it.
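To see why the non-greedy quantifier matters here, it can help to run both variants on the sample line from the question. This is a small sketch, with the pattern arguments hard-coded instead of read from the command line, and with \Q...\E added around them on the assumption that they are meant as literal text:

```perl
use strict;
use warnings;

my ($f, $s) = ('pat', 'th');   # the two arguments from the question
my $line = "for checking pattern matching with a patt on the back!";

# Greedy: a single match running from the first "pat" to the last "th".
(my $greedy = $line) =~ s/(\Q$f\E.+\Q$s\E)/uc($1)/eg;

# Non-greedy: each match stops at the nearest following "th".
(my $lazy = $line) =~ s/(\Q$f\E.+?\Q$s\E)/uc($1)/eg;

print "$greedy\n";  # for checking PATTERN MATCHING WITH A PATT ON THe back!
print "$lazy\n";    # for checking PATTERN MATCHING WITH a PATT ON THe back!
```

The non-greedy version produces two separate matches, which is why the standalone "a" between them stays lowercase.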

How to get the expression of matching capture in Perl

In Perl, how can I get the expression of a capture that has matched in a regex?
$s = 'aaazzz';
$s =~ s/(a+)|(b+)|(c+)/.../;
$s =~ s/(?<one>a+)|(?<two>b+)|(?<three>c+)/.../;
I mean the expression (e.g. a+), not the string aaa.
I need the expression of both numbered and named captures.
I'd do something like:
use strict;
use warnings;
my @regexes = (
qr/(a+)/,
qr/(b+)/,
qr/(c+)/,
);
my $string = 'aaazzz';
foreach my $re (@regexes) {
if ($string =~ $re) {
print "Used regex is $re\n";
}
}
Output:
Used regex is (?^:(a+))
You could assemble your regex from components and then test which groups matched. For demo purposes, I have only used the match operator and not the substitution operator s///, but the same principle applies.
$s = 'aaazzz';
$part1 = '(a+)';
if ( $s =~ /$part1|(b+)|(c+)/ ) {
if ($1) {
print("$part1 matched\n");
}
else {
print("$part1 did not match\n");
}
}
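The same idea extends to named captures: keep a table that maps each group name to its source pattern, build the alternation from that table, and look the name up in %+ after a match. This is a sketch only; @parts and matched_source are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical table of [ group name, source pattern ], in alternation order.
my @parts = ( [ one => 'a+' ], [ two => 'b+' ], [ three => 'c+' ] );

# Builds (?<one>a+)|(?<two>b+)|(?<three>c+)
my $alternation = join '|', map { sprintf '(?<%s>%s)', @$_ } @parts;
my $re = qr/$alternation/;

# Returns (name, source pattern, matched text) for the group that matched,
# or an empty list if nothing matched.
sub matched_source {
    my ($string) = @_;
    return unless $string =~ $re;
    for my $part (@parts) {
        my ($name, $source) = @$part;
        return ($name, $source, $+{$name}) if defined $+{$name};
    }
    return;
}

my ($name, $source, $text) = matched_source('aaazzz');
printf "group %s (%s) matched %s\n", $name, $source, $text;
```

For the sample string this reports group one with source pattern a+ and matched text aaa.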

How to remove the last two occurrences of "-" character of a string?

I have an array of strings that are formatted as such:
Ado-trastuzumab emtansine(HER2)02-22-2013
I would like to remove the last two "-" symbols only (from the date part of the original string) so that the name of the drug (Ado-trastuzumab emtansine) is not altered. Right now my regex removes all "-" symbols:
foreach my $string (@array) {
$string =~ tr/-//d;
}
I would like the output to instead be the following:
Ado-trastuzumab emtansine(HER2)02222013
Thanks for the help!
You can use substr as an lvalue to only apply the transliteration to a particular part of your string:
substr($string, -10, 10) =~ tr/-//d;
In this case, on the last 10 characters of the string.
foreach my $string (@array) {
$string =~ s/(\d{2})-(\d{2})-(\d{4})$/$1$2$3/;
}
To do what you say literally - remove the last two hyphens from a string - you could write this
$string =~ s/-([^-]*)-([^-]*)\z/$1$2/;
But in this case you could simply remove all hyphens that follow a digit:
$string =~ s/\d\K-//g;
If nothing should be done when there's only one -:
$s =~ s/-([^-]*)-([^-]*)\z/$1$2/;
$s = reverse($s);
$s =~ s/^([^-]*)-([^-]*)-/$1$2/;
$s = reverse($s);
$s = reverse( reverse($s) =~ s/^([^-]*)-([^-]*)-/$1$2/r ); # 5.14+
All these work even if there is only one -:
$s =~ s/-([^-]*)(?:-([^-]*))?\z/$1$2/;
$s =~ s/-([^-]*)\z/$1/ for 1..2;
$s =~ s/^.*\K-//s for 1..2;
$s = reverse($s);
$s =~ s/-// for 1..2;
$s = reverse($s);
$s = reverse($s);
$s =~ s/^([^-]*)-(?:([^-]*)-)?/$1$2/;
$s = reverse($s);
$s = reverse( reverse($s) =~ s/^([^-]*)-(?:([^-]*)-)?/$1$2/r ); # 5.14+
For long strings, the reverse solutions should be much faster. For the short strings, go for readability.
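As a quick sanity check, three of the approaches above can be run side by side on the sample string from the question. This is just a sketch; the variable names are made up for the comparison.

```perl
use strict;
use warnings;

my $input = 'Ado-trastuzumab emtansine(HER2)02-22-2013';

# 1. Transliterate only within the last 10 characters (the date part).
my $by_substr = $input;
substr($by_substr, -10, 10) =~ tr/-//d;

# 2. Remove exactly the last two hyphens.
(my $by_regex = $input) =~ s/-([^-]*)-([^-]*)\z/$1$2/;

# 3. Remove every hyphen that directly follows a digit (\K needs Perl 5.10+).
(my $by_digit = $input) =~ s/\d\K-//g;

print "$_\n" for $by_substr, $by_regex, $by_digit;
```

All three print Ado-trastuzumab emtansine(HER2)02222013 for this input. They differ on other inputs: the substr version assumes the date is always the last 10 characters, and the digit version would also strip a hyphen after a digit inside the drug name.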

Replace only up to N matches on a line

In Perl, how to write a regular expression that replaces only up to N matches per string?
I.e., I'm looking for a middle ground between s/aa/bb/; and s/aa/bb/g;. I want to allow multiple substitutions, but only up to N times.
I can think of three reliable ways. The first is to replace everything after the Nth match with itself.
my $max = 5;
$s =~ s/(aa)/ $max-- > 0 ? 'bb' : $1 /eg;
That's not very efficient if there are far more than N matches. For that, we need to move the loop out of the regex engine. The next two methods are ways of doing that.
my $max = 5;
my $out = '';
$out .= $1 . 'bb' while $max-- && $in =~ /\G(.*?)aa/gcs;
$out .= $1 if $in =~ /\G(.*)/gcs;
And this time, in-place:
my $max = 5;
my $replace = 'bb';
while ($max-- && $s =~ s/\G.*?\Kaa/$replace/s) {
pos($s) = $-[0] + length($replace);
}
You might be tempted to do something like
my $max = 5;
$s =~ s/aa/bb/ for 1..$max;
but that approach will fail for other patterns and/or replacement expressions.
my $max = 5;
$s =~ s/aa/ba/ for 1..$max; # XXX Turns 'aaaaaaaa'
# into 'bbbbbaaa'
# instead of 'babababa'
And of course, starting from the beginning of the string every time could be expensive.
What you want is not possible with a regular expression alone, but you can put the replacement in a for loop:
my $i;
my $aa = 'aaaaaaaaaaaaaaaaaaaa';
for ($i=0;$i<4;$i++) {
$aa =~ s/aa/bb/;
}
print "$aa\n";
result:
bbbbbbbbaaaaaaaaaaaa
You can use the /e flag which evaluates the right side as an expression:
my $n = 3;
$string =~ s/(aa)/$n-- > 0 ? "bb" : $1/ge;
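The counter-based substitution is easy to wrap in a reusable function. The following is a minimal sketch; replace_up_to and its argument order are hypothetical, and the replacement is treated as a plain string rather than an expression.

```perl
use strict;
use warnings;

# Hypothetical helper: replace at most $max matches of $re in $s.
sub replace_up_to {
    my ($s, $re, $replacement, $max) = @_;
    # Once the budget is spent, each further match is replaced by itself.
    $s =~ s/($re)/ $max-- > 0 ? $replacement : $1 /eg;
    return $s;
}

my $first  = replace_up_to('aa aa aa aa', qr/aa/, 'bb', 2);
my $second = replace_up_to('aaaaaaaa',    qr/aa/, 'ba', 5);

print "$first\n";   # bb bb aa aa
print "$second\n";  # babababa
```

Because the whole string is scanned in a single /g pass, replaced text is never matched again, which is what keeps the second example from turning into 'bbbbbaaa'.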
Here's a solution using the /e modifier, with which you can use Perl code to generate the replacement string:
my $count = 0;
$string =~ s{ $pattern }
{
$count++;
if ($count <= $limit) { # replace the first $limit matches
$replace;
} else {
$&; # faking a no-op, replacing with the original match.
}
}xeg;
With Perl 5.10 or later you can drop the $& (which has weird performance complications) and use ${^MATCH} via the /p modifier:
$string =~ s{ $pattern }
{
$count++;
if ($count <= $limit) { # replace the first $limit matches
$replace;
} else {
${^MATCH};
}
}xegp;
It's too bad you can't just do this, but you can't:
last if $count >= $limit;