Today I came across two different syntaxes for a Perl regular expression match.
#I have a date string
my $time = '2012-10-29';
#Already familiar "m//":
$time =~ m/^(\d{4}-\d\d-\d\d)$/;
#Completely new to me m##.
$time =~ m#^(\d{4}-\d\d-\d\d)$#;
Now what is the difference between /expression/ and #expression#?
As everyone else said, you can use any delimiter after the m.
/ has one special feature: you can use it by itself, e.g.
$string =~ /regexp/;
is equivalent to:
$string =~ m/regexp/;
Perl allows you to use pretty much any characters to delimit strings, including regexes. This is especially useful if you need to match a pattern that contains a lot of slash characters:
$slashy =~ m/\/\//; #Bad
$slashy =~ m|//|; #Good
According to the documentation, the first of those is an example of "leaning toothpick syndrome".
Most but not all characters behave in the same way when escaping. There is an important exception: m?...? is a special case that only matches a single time between calls to reset().
Another exception: if single quotes are used for the delimiter, no variable interpolation is done. You still have to escape $, though, as it is a special character matching the end of the line.
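A minimal sketch of the single-quote exception, assuming a made-up variable $name and made-up sample strings. With ordinary delimiters the variable is interpolated; with m'...' it is not, and \$ matches a literal dollar sign.

```perl
use strict;
use warnings;

my $name = 'World';   # hypothetical variable, for illustration only

# Ordinary delimiters: $name is interpolated before matching.
my $interpolated = ('Hello World' =~ m/Hello $name/) ? 1 : 0;

# Single-quote delimiters: nothing is interpolated, and \$ matches a
# literal dollar sign, so the pattern looks for the text '$name'.
my $literal_hit  = ('Hello $name' =~ m'Hello \$name') ? 1 : 0;
my $literal_miss = ('Hello World' =~ m'Hello \$name') ? 1 : 0;

print "$interpolated $literal_hit $literal_miss\n";   # prints "1 1 0"
```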
Nothing except what you have to escape in the regex. You can use any pair of matched characters you like.
$string = "http://example.com/";
$string =~ m!http://!;
$string =~ m#http://#;
$string =~ m{http://};
$string =~ m/http:\/\//;
After the match or substitution operator (m and s, respectively) you can use almost any character as the delimiter, e.g. the # in your case. This also works with pairs of brackets: s{ abc (.*) def }{ DEF $1 ABC }x.
The advantage is that you don't have to escape the / (though you do have to escape whichever delimiter you chose, of course). It's often used for clarity, especially when dealing with things like paths or protocols.
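As a rough illustration of the readability gain, here is the same substitution written with slash delimiters and with brace delimiters. The path and the /opt/ replacement are invented for the example.

```perl
use strict;
use warnings;

my $path = '/usr/local/lib/perl5';   # made-up sample path

# Slash delimiters: every slash in pattern and replacement needs a backslash.
(my $with_slashes = $path) =~ s/^\/usr\/local\//\/opt\//;

# Brace delimiters: the slashes can stay as they are.
(my $with_braces = $path) =~ s{^/usr/local/}{/opt/};

print "$with_slashes\n";   # /opt/lib/perl5
print "$with_braces\n";    # /opt/lib/perl5
```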
There is no difference; the "/" and "#" characters are used as delimiters for the expression. They simply mark the "boundary" of the expression, but are not part of the expression. In theory you can use most non-alphanumeric characters as a delimiter. Here is a link to the PHP manual (it doesn't matter that it is the PHP manual, the regex syntax is largely the same; I just like it because it explains things well) on Perl-compatible regular expression syntax; read the part about delimiters.
Related
I was unable to decipher what this regex does:
$c =~ s^.*/^^g;
I don't have access to the input or the output.
Does anyone know what it does?
The default delimiter for s/// is the slash, but you can use almost any non-whitespace character as an alternative.
So
$c =~ s^.*/^^g
is equivalent to
$c =~ s/.*\///g
Note that using the conventional delimiter requires the slash within the pattern itself to be escaped.
Some options are better than others, and in the case where you're just trying to avoid escaping slashes within the pattern I think a pipe character | is better.
I wouldn't hope to learn too much from this programmer. As you have experienced, ^ is a poor and confusing choice. Also, the /g modifier is superfluous for a single-line string, because the greedy .* already consumes everything up to the last slash, and $c is a terrible choice for an identifier.
I would write
$c =~ s|.*/||
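To check that the three spellings really do the same thing, here is a small sketch run against an invented single-line path. The variable names are mine, not from the original code.

```perl
use strict;
use warnings;

my $path = '/var/log/app/error.log';   # made-up sample input

(my $caret = $path) =~ s^.*/^^g;    # the original, caret-delimited form
(my $slash = $path) =~ s/.*\///g;   # the same thing with slash delimiters
(my $pipe  = $path) =~ s|.*/||;     # pipe delimiters, without /g

# All three strip everything up to and including the last slash.
print "$caret $slash $pipe\n";   # error.log error.log error.log
```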
Here ^ is used as the delimiter.
We may use almost any non-whitespace character as a regex delimiter.
s^.*/^^g;
s/.*\///g;
Both regexes are equivalent.
A non-standard delimiter is mostly used to avoid the need to escape the delimiter character within a regex pattern. For example, given
$c = "this is a string with / slash";
with the default delimiter your regex has to be
$c =~ s/.*\///
          ^^
Here you are escaping the slash.
You can use whatever delimiter you want, as @simbabque mentioned in a comment.
s{foo}{bar}gs # here curly braces are the delimiters
s[some][same] # here square brackets are the delimiters
So we can pick whichever delimiter character is convenient.
To avoid escaping we can use other delimiters.
I’m using a variable to search and replace a string using Perl.
I want to replace the string 23.0 with 23.0.1, so I tried this:
my $old="23.0";
my $new="23.0.1";
$_ =~ s/$old/$new/g;
The problem is that it also replaced the string 2310, so I tried:
my $old="23\.0"
and also /ee.
But can’t get the correct syntax for it to work. Can someone show me the correct syntax?
There are two things that will help you here:
The quotemeta function, which escapes metacharacters, and the \Q and \E escape sequences, which apply the same quoting to whatever is interpolated between them.
print quotemeta "21.0";
Or:
my $old="23.0";
my $new="23.0.1";
my $str = "2310";
$str =~ s/\Q$old\E/$new/g;
print $str;
Just use single quotes and escape the dot.
my $old='23\.0';
To complement Sobrique's excellent answer, let me note that the reason your attempt with "23\.0" didn't work is that "23\.0" and "23.0" evaluate to the same string: in a double-quoted string literal, the backslash escape sequence \. simply evaluates to a plain dot (.).
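A quick way to see this for yourself is to compare the two string literals and try both as patterns. The variable names here are illustrative only.

```perl
use strict;
use warnings;

my $double = "23\.0";   # the backslash is consumed by the string literal
my $single = '23\.0';   # the backslash survives

print length($double), " ", length($single), "\n";   # 4 5

my $same = ($double eq "23.0") ? 1 : 0;

# Only the single-quoted version still escapes the dot inside a regex.
my $double_hits_2310 = ('2310' =~ /$double/) ? 1 : 0;
my $single_hits_2310 = ('2310' =~ /$single/) ? 1 : 0;

print "$same $double_hits_2310 $single_hits_2310\n";   # 1 1 0
```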
There are several things you could do to avoid this:
If you indeed want to match a fixed string, and don't need or want to include any special regexp metacharacters in it, you can do as Sobrique suggests and use quotemeta or \Q to escape them.
In particular, this is almost always the correct solution if the string to be matched comes from user input. If you do want to allow some limited set of non-literal metacharacters, you can unescape those after running the pattern through quotemeta. For a simple example, here's a quick-and-dirty way to turn a basic glob-like pattern (using the metacharacters ? and * for "any character" and "any string of characters" respectively) into an equivalent regexp:
my $regexp = "^\Q$glob\E\$"; # quote and anchor the pattern
$regexp =~ s/\\\?/./g; # replace "?" (escaped to "\?" by \Q) with "."
$regexp =~ s/\\\*/.*/g; # replace "*" (escaped to "\*" by \Q) with ".*"
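For a self-contained version of the same idea, the three lines can be wrapped in a small helper and tried on a few names. The sub name glob_to_regexp and the file names are hypothetical, chosen only for this sketch.

```perl
use strict;
use warnings;

# glob_to_regexp is a hypothetical helper name, not an existing function.
sub glob_to_regexp {
    my ($glob) = @_;
    my $regexp = "^\Q$glob\E\$";   # quote every metacharacter and anchor
    $regexp =~ s/\\\?/./g;         # "\?" (escaped "?") becomes "."
    $regexp =~ s/\\\*/.*/g;        # "\*" (escaped "*") becomes ".*"
    return qr/$regexp/;
}

my $re = glob_to_regexp('report-??.*');

for my $file ('report-01.txt', 'report-1.txt', 'reportX01.txt') {
    my $verdict = ($file =~ $re) ? 'match' : 'no match';
    print "$file: $verdict\n";
}
```

Only the first name should match: the second has a single character where the glob asks for two, and in the third the literal hyphen is missing.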
Conversely, if you want to have a literal regexp pattern in your code, without immediately matching it against something, you can use the qr// regexp-like quote operator, like this:
my $old = qr/\b23\.0(\.0)?\b/; # match 23.0 or 23.0.0 (but not 123.012!)
my $new = "23.0.1"; # just a literal string
s/$old/$new/g; # replace any string matching $old in $_ with $new
Note that qr// has other effects beyond just allowing you to use regexp syntax in a string literal: it actually pre-compiles the pattern into a special Regexp object, so that it doesn't need to be recompiled every time it's used later. In particular, as a side effect, the string representation of a qr// regexp literal will usually not exactly match the original content, although it will be equivalent as a regexp. For example, say qr/\b23\.0(\.0)?\b/ will, on my Perl version, output (?^u:\b23\.0(\.0)?\b).
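To make the reuse concrete, here is a sketch that applies one precompiled pattern to several invented lines. I deliberately don't show the stringified form, since it varies between Perl versions.

```perl
use strict;
use warnings;

my $old = qr/\b23\.0(\.0)?\b/;   # compiled once
my $new = '23.0.1';

print ref($old), "\n";   # Regexp

# Made-up sample lines: only the first two contain a real 23.0 version.
my @lines = ('version 23.0', 'version 23.0.0', 'build 123.012', 'id 2310');
s/$old/$new/g for @lines;

print "$_\n" for @lines;
```

The last two lines come out unchanged, because the \b anchors and the escaped dot rule out 123.012 and 2310.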
You could also just use a normal double-quoted string literal, and double any backslashes in it, but that's (usually) less efficient than using qr//, and also less readable due to leaning toothpick syndrome.
Using a single-quoted string literal would be slightly better, since backslashes in a single-quoted string are only special when followed by another backslash or a single quote. Even so, readability can still suffer if you happen to need to match any literal backslashes in your regexp, not to mention that it's easy to create subtle bugs if you forget to double a backslash in those rare places where it's still needed.
I have strings like this:
trn_425374_1_94_-
trn_12_1_200_+
trn_2003_2_198_+
And I want to split all after the first number, like this:
trn_425374
trn_12
trn_2003
I tried the following code:
$string =~ s/(?<=trn_\d)\d+//gi;
But it returns the same string as the input. I have been following examples from similar questions but I don't know what I'm doing wrong. Any suggestions?
If you are running Perl 5 version 10 or later then you have access to the \K ("keep") regular expression escape. Everything before the \K is excluded from the substitution, so this removes everything after the first sequence of digits (except newlines):
s/\d+\K.+//;
With earlier versions of Perl, you will have to capture the part of the string you want to keep, and put it back in the replacement:
s/(\D*\d+).+/$1/;
Note that neither of these will remove any trailing newline characters. If you want to strip those as well, then either chomp the string first, or add the /s modifier to the substitution, like this:
s/\d+\K.+//s;
or
s/(\D*\d+).+/$1/s;
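Both variants can be checked side by side on the sample strings from the question. The variable names below are my own.

```perl
use strict;
use warnings;

my @ids = ('trn_425374_1_94_-', 'trn_12_1_200_+', 'trn_2003_2_198_+');
my (@kept, @captured);

for my $id (@ids) {
    (my $keep    = $id) =~ s/\d+\K.+//;        # needs Perl 5.10 or later
    (my $capture = $id) =~ s/(\D*\d+).+/$1/;   # works on older Perls too
    push @kept,     $keep;
    push @captured, $capture;
}

print "@kept\n";       # trn_425374 trn_12 trn_2003
print "@captured\n";   # trn_425374 trn_12 trn_2003
```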
Use a capture group to save the first run of digits, and .* to delete everything from there to the end of the line:
#!/usr/bin/env perl
use warnings;
use strict;
while ( <DATA> ) {
s/(\d+).*$/$1/ && print;
}
__DATA__
trn_425374_1_94_-
trn_12_1_200_+
trn_2003_2_198_+
It yields:
trn_425374
trn_12
trn_2003
Your regex should be:
$string =~ s/(trn_\d+).*/$1/g;
It replaces the whole match with the text captured in $1 (which is the part of the string you want to preserve).
Use \K to preserve the part of the string you want to keep:
$string =~ s/trn_\d+\K.*//;
To quote the Perl documentation for \K:
\K
This appeared in perl 5.10.0. Anything matched left of \K is not
included in $& , and will not be replaced if the pattern is used in a
substitution.
How can I match the following lines?
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
I want to remove the "-repeated.text" part from the end, but only if it repeats the text just before the hyphen:
sometext_TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
My attempt:
use strictures;
my $text="sometext_TEXT1.xxx-TEXT1.xxx";
$text =~ s/(.*?)(.*)(\s*-\s*$2)/$1$2/;
print "$text\n";
prints
Use of uninitialized value $2 in regexp compilation at a line 3.
In other words, I'm looking for a better solution than the following split + match:
while(<DATA>) {
chomp;
my($first, $second) = split /\s*-\s*/;
s/\s*-\s*$second$// if ( $first =~ /$second$/ );
print "$_\n";
}
__DATA__
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
$text =~ s/(.*?)(.*)(\s*-\s*$2)/$1$2/;
This regex has various issues, but is on the right path.
Use \2 (or better: \g2 or \g{-1}) or something to reference the contents of a capture group. The $2 variable is interpolated when the Perl statement is executed. At that time, $2 is undefined, as there was no previous match. You get a warning as it is uninitialized. Even if it were defined, the pattern would be fixed during compilation.
You define three capture groups, but only need one. There is a trick with the \K ("keep") directive: it lets the regex engine forget the previously matched text, so that it won't be affected by the substitution. That is, s/(foo)b/$1/ is equivalent to s/foo\Kb//. The effect is similar to a variable-length lookbehind.
The (.*?)(.*) part is a bit of a backtracking nightmare. We can reduce the cost of your match by adding further conditions, e.g. by anchoring the pattern at start and end of line. Using the above modifications, we now have s/^.*?(.*)\K\s*-\s*\g1$//. But on second thought, we can just remove the ^.*? because this describes something the regex engine does anyway!
A short test:
while(<DATA>) {
s/(.*)\K\s*-\s*\g1$//;
print;
}
__DATA__
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
Output:
sometext_TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
A few words regarding your splitting solution: This will also shorten the line
sometext_TEXT1xyyy - 1.xyyy
because when you interpolate a variable into a regex, the contents aren't matched literally. Instead, they are interpreted as a pattern (where . matches any non-newline codepoint)! You can avoid this by quoting all metacharacters with the \Q...\E escape:
s/\s*-\s*\Q$second\E$// if $first =~ /\Q$second\E$/;
When you use $2 Perl will try to interpolate that variable, but the variable will only be set after the match has completed. What you want, is a backreference, for which you need to use \2:
$text =~ s/(.*?)(.*)(\s*-\s*\2)/$1$2/;
Note that, when the replacement part is evaluated, $1 and $2 have been set and can be interpolated as expected. Also you could make the pattern a bit more concise (and probably more efficient), by using:
$text =~ s/(.*)\s*-\s*\1/$1/;
There is no need to match the initial part (.*?) if it's arbitrary and you just write it back anyway. What you might want to do though, is anchor the pattern to the end of the string:
$text =~ s/(.*)\s*-\s*\1$/$1/;
Otherwise (with your initial attempt or mine), you'd turn something-thingelse into somethingelse.
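The effect of the anchor is easy to reproduce. This sketch runs the unanchored and the anchored pattern over one sample from the question plus the something-thingelse case; the variable and hash names are my own.

```perl
use strict;
use warnings;

my @samples = ('sometext_TEXT1.yyy-TEXT1.yyy', 'something-thingelse');
my (%loose, %anchored);

for my $s (@samples) {
    ($loose{$s}    = $s) =~ s/(.*)\s*-\s*\1/$1/;    # no anchor
    ($anchored{$s} = $s) =~ s/(.*)\s*-\s*\1$/$1/;   # anchored at the end
    print "$s => $loose{$s} | $anchored{$s}\n";
}
# sometext_TEXT1.yyy-TEXT1.yyy => sometext_TEXT1.yyy | sometext_TEXT1.yyy
# something-thingelse => somethingelse | something-thingelse
```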
if($title =~ s/(\s|^|,|\/|;|\|)$replace(\s|$|,|\/|;|\|)//ig)
$title can be a set of titles ranging from President, MD, COO, CEO,...
$replace can be (shareholder), (Owner) or the like.
I keep getting this error. I have checked for improperly balanced '(', ')', no dice :(
Unmatched ) in regex; marked by <-- HERE in m/(\s|^|,|/|;|\|)Owner) <-- HERE (\s|$|,|/|;|\|)/
If you could tell me what the regex does, that would be awesome. Does it strip those symbols? Thanks guys!
If the variable $replace can contain regex meta characters you should wrap it in \Q...\E
\Q$replace\E
To quote Jeffrey Friedl's Mastering Regular Expressions
Literal Text Span The sequence \Q "Quotes" regex metacharacters (i.e., puts a backslash in front of them) until the end of the string, or until a \E sequence.
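Here is a sketch of the substitution with \Q...\E applied. The title string is made up, and I've switched the groups to non-capturing ones since nothing reads $1 or $2; otherwise the pattern is the one from the question.

```perl
use strict;
use warnings;

my $title   = 'President (Owner), CEO';   # made-up sample title
my $replace = '(Owner)';

# \Q...\E makes the parentheses in $replace match literally instead of
# being parsed as regex grouping.
(my $stripped = $title)
    =~ s/(?:\s|^|,|\/|;|\|)\Q$replace\E(?:\s|$|,|\/|;|\|)//ig;

print "$stripped\n";   # President CEO

# An unbalanced value such as "Owner)" no longer breaks compilation.
my $unbalanced = 'Owner)';
my $compiled_ok = eval { my $re = qr/\Q$unbalanced\E/; 1 } ? 1 : 0;
print "$compiled_ok\n";   # 1
```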
As mentioned, the substitution strips the contents of $replace together with one separator (whitespace, comma, slash, semicolon or pipe, or the start or end of the string) on each side, and it's failing because $replace itself contains a mismatched parenthesis.
However, a few other general regex things: first, instead of ORing everything together (and this is just to simplify logic and typing) I'd keep the single characters together in a character class. Matching [\s,\/;|] is potentially less error-prone and more finger friendly. The anchors ^ and $ have to stay outside the class, though, because inside brackets they are literal characters rather than anchors.
Second, don't use a set of capturing parentheses () unless you really mean it. This places the captured string in capture buffers, and incurs overhead in the regex engine. Per perldoc perlre:
WARNING: Once Perl sees that you need one of $& , $` , or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses.
You can easily get around this by adding ?: inside the opening parenthesis:
(?:^|[\s,\/;|])
Edit: a bare character class would not need any grouping at all, but the group is needed here to hold the ^ alternative.
It appears that your variable $replace contains the string Owner), not (Owner).
$title = "Foo Owner Bar";
$replace = "Owner)";
if($title =~ s/(\s|^|,|\/|;|\|)$replace(\s|$|,|\/|;|\|)//ig) {
print $title;
}
Output:
Unmatched ) in regex; marked by <-- HERE in m/(\s|^|,|/|;|\|)Owner)<-- HERE (\s
|$|,|/|;|\|)/ at test.pl line 3.
$title = "Foo Owner Bar";
$replace = "(Owner)";
if($title =~ s/(\s|^|,|\/|;|\|)$replace(\s|$|,|\/|;|\|)//ig) {
print $title;
}
Output:
FooBar