Perl regular expression substitution/replacement using variables with special characters

Okay I've checked previous similar questions and I've been juggling with different variations of quotemeta but something's still not right.
I have a line with a word ID and two words - the first is the wrong word, the second is right.
And I'm using a regex to replace the wrong word with the right one.
$line = "ANN20021015_0104_XML_16_21 A$xAS A$xASA";
@splits = split("\t",$line);
$wrong_word = quotemeta $splits[1];
$right_word = quotemeta $splits[2];
print $right_word."\n";
print $wrong_word."\n";
$line =~ s/$wrong_word\t/$right_word\t/g;
print $line;
What's wrong with what I'm doing?
Edit
The problem is that I'm unable to retain the complete words - they get chopped off at the special characters. This code works perfectly fine for words without special characters.
The output I need for the above example is:
ANN20021015_0104_XML_16_21 A$xASA A$xASA
But what I get is
ANN20021015_0104_XML_16_21 A A
Because of the $ character.

ETA:
Since you get:
ANN20021015_0104_XML_16_21 A A
When you want:
ANN20021015_0104_XML_16_21 A$xASA A$xASA
My suspicions are as follows:
You are not intentionally interpolating the variables $xAS and $xASA, so because they are undefined, they just add the empty string to $line, which is visible in your output. E.g. "A$xAS" is expanded into "A" . undef.
You are not using use warnings, so you do not get information about this error.
Solution:
Use use strict; use warnings;. Always. They save you a lot of time.
When assigning, use single quotes instead to avoid interpolation of variables:
$line = 'ANN20021015_0104_XML_16_21 A$xAS A$xASA';
Old answer:
Since you do not say what goes wrong, it's just guesswork from my end.
I can see a possible accidental interpolation of the variables $xAS and $xASA, which you can solve by either escaping the dollar sign, or by using single quotes on that $line assignment.
You can also build your new string by using join, rather than a regex, e.g.:
$line = join "\t", @splits[0,2,2];

If you had used strict it would have told you that you must declare variables $xAS and $xASA.
If you had used warnings, it would have told you that you were concatenating an uninitialized variable.
Hence the common admonishment: "use strict, use warnings".
You simply need to either put the string in non-interpolated quotes ( '', q{} ) or escape the sigil ($) so that it doesn't try to interpolate what it thinks is a variable.
"" are quotes that will mess with your string
'' are quotes that won't
Lesson: use single quotes unless you want interpolation.

The problem isn't in your substitution; the problem is in the very first line of your code example.
$line = "ANN20021015_0104_XML_16_21 A$xAS A$xASA";
tries to interpolate the variables $xAS and $xASA into $line, and interpolates nothing because those variables are empty. Use single quotes instead of double quotes so that the string doesn't interpolate.
Had you turned on warnings it would have warned you about the fact that you're interpolating an uninitialized variable, and had you turned on strict 'vars' it wouldn't have let you use the undeclared $xAS and $xASA at all.
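Finally, only the left side of a substitution needs quotemeta. The replacement is inserted as a plain string, so running quotemeta on $right_word would put literal backslashes into the output.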
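Putting the points from these answers together, a minimal sketch might look like the following. It assumes the three fields are tab-separated (as the split on "\t" in the question suggests) and uses \Q...\E in place of a separate quotemeta call; the variable $fixed is only there for illustration.

```perl
use strict;
use warnings;

# Single quotes keep the dollar signs literal. "\t" is not expanded inside
# single quotes, so the tab-separated line is assembled with join instead.
my $line = join "\t", 'ANN20021015_0104_XML_16_21', 'A$xAS', 'A$xASA';

my ($id, $wrong_word, $right_word) = split /\t/, $line;

# \Q...\E escapes metacharacters on the pattern side only. The replacement
# side is inserted as-is, so $right_word is left unescaped.
(my $fixed = $line) =~ s/\t\Q$wrong_word\E\t/\t$right_word\t/;

print "$fixed\n";
```

With strict and warnings enabled, the original double-quoted assignment would have been rejected at compile time because $xAS and $xASA are undeclared.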

Related

Perl: Regex with Variables inside?

Is there a more elegant way of bringing a variable into a pattern than this (putting the pattern in a string first instead of using it directly in //)?
my $z = "1"; # variable
my $x = "foo1bar";
my $pat = "^foo".$z."bar\$";
if ($x =~ /$pat/)
{
print "\nok\n";
}
The qr operator does it
my $pat = qr/^foo${z}bar$/;
unless the delimiters are ', in which case it doesn't interpolate variables.
This operator is the best way to build patterns ahead of time, as it builds a proper regex pattern, accepting everything that can be used in a regex. It also takes modifiers, so, for example, the above can be written as
my $pat = qr/^foo $z bar$/x;
for a little extra clarity (but careful with omitting those {}).†
The initial description from the perlop documentation (examples and discussion follow there):
qr/STRING/msixpodualn
This operator quotes (and possibly compiles) its STRING as a regular expression. STRING is interpolated the same way as PATTERN in m/PATTERN/. If "'" is used as the delimiter, no variable interpolation is done. Returns a Perl value which may be used instead of the corresponding /STRING/msixpodualn expression. The returned value is a normalized version of the original pattern. It magically differs from a string containing the same characters: ref(qr/x/) returns "Regexp"; however, dereferencing it is not well defined (you currently get the normalized version of the original pattern, but this may change).
† Once a variable is interpolated into a pattern, it may be a good idea to escape any special characters hiding in it that could throw off the regex, using the quotemeta-based \Q ... \E
my $pat = qr/^foo \Q$z\E bar$/x;
If this is used then the variable name need not be delimited either
my $pat = qr/^foo\Q$z\Ebar$/;
Thanks to Håkon Hægland for bringing this up.
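To see why the \Q ... \E footnote matters, here is a small sketch that compares the two forms. The value '1.5' and the variable names $loose, $literal and @results are made up for the illustration.

```perl
use strict;
use warnings;

my $z = '1.5';    # hypothetical value containing a regex metacharacter

my $loose   = qr/^foo${z}bar$/;      # the "." in $z matches any character
my $literal = qr/^foo\Q$z\Ebar$/;    # the "." in $z matches only a dot

my @results;
for my $x ('foo1.5bar', 'foo1x5bar') {
    push @results, sprintf '%s loose=%d literal=%d',
        $x, ($x =~ $loose ? 1 : 0), ($x =~ $literal ? 1 : 0);
}
print "$_\n" for @results;
```

The second string is accepted by the unescaped pattern but rejected by the escaped one, which is usually what is wanted when the variable holds plain text.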

How can I remove the last comma from a string in Perl

I have a string coming in from raw data. I can't guarantee that there might or might not be an extra comma. I thought I might be able to remove it like this:
$value = "cat, dog, fish, ";
$value =~ s/,//r;
Sadly that doesn't work. Of course I could do a loop to check the last char of the string one by one, but I would like to learn how to do it with the Regex backslash method.
Can someone help me please?
Try this
$value =~ s/,\s*$//;
The pattern ,\s*$ matches a comma (,) followed by zero or more space-chars (\s*), followed by the end of the line/input ($).
s/,// removes the first comma. So,
$value = reverse(reverse($value) =~ s/,//r);
Not sure why you are specifying /r in your code but not using the return value. If in fact you are using it, add it back.
s/.*\K,//
Ah, if there may not be a trailing comma that you don't want, this won't work; it will always delete the last comma. Use Bart's answer then.
The accepted answer removes a comma followed by zero or more white space characters at the end of a string. But you asked about removing the last comma. Either is consistent with your example, but if you really want to remove the last comma, one way is:
$value =~ s/,([^,]*$)/$1/
This will, for example, change "foo,bar,baz" to "foo,barbaz", and in your example "cat, dog, fish, " to "cat, dog, fish " (leaving the trailing space).
The reverse trick in choruba's answer also works.
If nothing else, this shows the importance of a precise problem statement.
Using positive look ahead,
$value =~ s/,(?=[^,]*\z)//;
I suggest this pattern: ,*\s*$. It matches all commas (if any) and all white spaces (if any) and the end of the string.
A full example:
use 5.18.2;
use strict ;
use warnings ;
use Data::Dumper;
my $data = "cat, dog, fish,,,,,,,,,,,,, ";
$data =~ s/,*\s*$// ;
print $data;
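The difference between "remove a trailing comma" and "remove the last comma" is easiest to see side by side. The sketch below runs two of the patterns from the answers above on the question's string and on a variant without a trailing comma; it relies on the /r modifier, so it assumes Perl 5.14 or later, and the variable names are only illustrative.

```perl
use strict;
use warnings;

my $with_trailing    = 'cat, dog, fish, ';
my $without_trailing = 'cat, dog, fish';

# Removes a comma only when nothing but whitespace follows it.
my $t1 = $with_trailing    =~ s/,\s*$//r;
my $t2 = $without_trailing =~ s/,\s*$//r;

# Removes the last comma wherever it is, keeping what follows it.
my $l1 = $with_trailing    =~ s/,(?=[^,]*\z)//r;
my $l2 = $without_trailing =~ s/,(?=[^,]*\z)//r;

print "[$_]\n" for $t1, $t2, $l1, $l2;
```

On the second string the first pattern leaves the input alone, while the look-ahead version removes the comma after "dog", which is probably not what raw-data cleanup wants.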

QRegex look ahead/look behind

I have been pondering on this for quite awhile and still can't figure it out. The regex look ahead/behinds. Anyway, I'm not sure which to use in my situation, I am still having trouble grasping the concept. Let me give you an example.
I have a name....My Business,LLC (Milwaukee,WI)_12345678_12345678
What I want to do is if there is a comma in the name, no matter how many, remove it. At the same time, if there is not a comma in the name, still read the line. The one-liner I have is listed below.
s/(.*?)(_)(\d+_)(\d+$)/$1$2$3$4/gi;
I want to remove any comma from $1 (My Business,LLC (Milwaukee,WI)). I could match the comma literally in the regex, as in (.*?),(.*?),(.*?)(_.*?$), if it were this EXACT situation every time, but it is not.
I want it to omit commas and match 'My Business, LLC_12345678_12345678' or just 'My Business_12345678_12345678', even though there is no comma.
In any situation I want it to match the line, comma or not, and remove any commas(if any) no matter how many or where.
If someone can help me understand this concept, it will be a breakthrough!!
Use the /e modifier of Perl so that you can pass your function during the replace in s///
$str = 'My Business,LLC (Milwaukee,WI)_12345678_12345678';
## modified your regex as well using lookahead
$str =~ s/(.*?)(?=_\d+_\d+$)/funct($1)/ge;
print $str;
sub funct{
my $val = shift;
## replacing , with empty, use anything what you want!
$val =~ s/,//g;
return $val;
}
Using funct($1) in substitute you are basically calling the funct() function with parameter $1
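The same idea can also be written without a separate sub, by putting the comma removal directly in the /e replacement. This is only a sketch of that variation, run on the names from the question; anchoring with ^ and dropping /g are my own choices here, and @names and @cleaned are illustrative.

```perl
use strict;
use warnings;

my @names = (
    'My Business,LLC (Milwaukee,WI)_12345678_12345678',
    'My Business_12345678_12345678',
);

my @cleaned;
for my $str (@names) {
    # With /e the replacement is Perl code: copy the captured name,
    # delete its commas with tr, and return the cleaned copy.
    (my $copy = $str) =~ s{^(.*?)(?=_\d+_\d+$)}{ my $name = $1; $name =~ tr/,//d; $name }e;
    push @cleaned, $copy;
}
print "$_\n" for @cleaned;
```

A line with no commas still matches and passes through unchanged, which covers the "comma or not" requirement.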

remove up to _ in perl using regex?

How would I go about removing all characters before a "_" in perl? So if I had a string that was "124312412_hithere" it would replace the string as just "hithere". I imagine there is a very simple way to do this using regex, but I am still new dealing with that so I need help here.
Remove all characters up to and including "_":
s/^[^_]*_//;
Remove all characters before "_":
s/^[^_]*(?=_)//;
Remove all characters before "_" (assuming the presence of a "_"):
s/^[^_]*//;
This is a bit more verbose than it needs to be, but would be probably more valuable for you to see what's going on:
my $astring = "124312412_hithere";
my $find = "^[^_]*_";
my $replace = "_";
$astring =~ s/$find/$replace/;
print $astring;
Also, there's a bit of conflicting requirements in your question. If you just want hithere (without the leading _), then change it to:
$astring =~ s/$find//;
I know it's slightly different than what was asked, but in cases like this (where you KNOW the character you are looking for exists in the string) I prefer to use split:
$str = '124312412_hithere';
$str = (split (/_/, $str, 2))[1];
Here I am splitting the string into parts, using the '_' as a delimiter, but to a maximum of 2 parts. Then, I am assigning the second part back to $str.
There's still a regex in this solution (the /_/) but I think this is a much simpler solution to read and understand than regexes full of character classes, conditional matches, etc.
You can try out this: -
$_ = "124312412_hithere";
s/^[^_]*_//;
print $_; # hithere
Note that this will also remove the _ (as I infer from your sample output). If you want to keep the _ (your first sentence leaves it unclear which you want), you would need to use a look-ahead as in @ikegami's answer.
Also, just to make it a little clearer: substitution and matching are applied to $_ by default, so you don't need to bind to $_ explicitly.
So, s/^[^_]*_//; is essentially the same as $_ =~ s/^[^_]*_//;, but the latter form is not required.
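For comparison, here is a short sketch that applies the approaches from this thread to the same input. It uses the /r modifier (Perl 5.14 or later) so that the original string is left untouched; the variable names and the 'plain' input are only illustrative.

```perl
use strict;
use warnings;

my $input = '124312412_hithere';

my $without_underscore = $input =~ s/^[^_]*_//r;       # drops the "_" too
my $with_underscore    = $input =~ s/^[^_]*(?=_)//r;   # keeps the "_"
my $via_split          = (split /_/, $input, 2)[1];    # second of two parts

# A string with no "_" does not match, so it is returned unchanged.
my $plain     = 'plain';
my $unchanged = $plain =~ s/^[^_]*_//r;

print "$_\n" for $without_underscore, $with_underscore, $via_split, $unchanged;
```

Note that the split version yields undef for the second part when the input has no "_", so it is only safe when the delimiter is known to be present.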

How do I handle every ASCII character (including regex special characters) in a Perl regex?

I have the following code in Perl:
if (index ($retval, $_[2]) != -1) {
@fs = split ($_[2], $_[1]);
$_[2] is the delimiter variable and $_[1] is the string that the delimiter may exist in. ($_[0] is used elsewhere) You may have guessed that this code is in a subroutine by those variable names.
Anyway, onto my question, when my delimiter is something innocuous like 'a' or ':' the code works like it should. However, when it is something that would get parsed by Perl regex, like a '\' character, then it does not work like it is supposed to. This makes sense because in the split function Perl would see something like:
split (/\/, $_[1]);
which makes no sense to it at all because it would want this:
split (/\//, $_[1]);
So with all of that in mind my question, that I cannot answer, is this: "How do I make it so that any delimiter that I put into $_[2], or all the ASCII characters, gets treated as the character it is supposed to be and not interpreted as something else?"
Thanks in advance,
Robert
You can use quotemeta to escape $_[2] properly so it will work in the regex without getting mangled. This should do it:
my $quoted = quotemeta $_[2];
@fs = split( $quoted, $_[1] );
Alternatively, you can use \Q in your regex to escape it. See "Escape Sequences" in perlre.
split /\Q$_[2]/, $_[1]
As a side note, I suspect that the $_[1] and $_[2] variables refer to the @_ array that is automatically passed to a sub.
It would have saved you quite some explaining here and made your code easier to understand on its own, and it is common practice, to use something like the following at the beginning of the sub:
sub mysub {
my ($param1, $string, $delim) = @_;
# ...
}
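Combining the two suggestions (named parameters and \Q), a minimal sketch could look like this. The sub name split_on and the sample strings are hypothetical, and the sub takes only the string and the delimiter, unlike the three-argument sub in the question.

```perl
use strict;
use warnings;

# split_on is a hypothetical helper: it splits $string on $delim,
# treating every character of $delim literally.
sub split_on {
    my ($string, $delim) = @_;
    return split /\Q$delim\E/, $string;
}

my @by_backslash = split_on('a\\b\\c', '\\');   # the string is a\b\c
my @by_pipe      = split_on('x|y|z',   '|');
my @by_dot       = split_on('1.2.3',   '.');

print join(' ', @$_), "\n" for \@by_backslash, \@by_pipe, \@by_dot;
```

Without the \Q, a "|" delimiter would be read as alternation and a "." as "any character", so neither would split where intended.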