Why does my Perl regex complain about "Unmatched ) in regex"? - regex

if($title =~ s/(\s|^|,|\/|;|\|)$replace(\s|$|,|\/|;|\|)//ig)
$title can be a set of titles ranging from President, MD, COO, CEO,...
$replace can be (shareholder), (Owner) or the like.
I keep getting this error. I have checked for improperly balanced '(', ')', no dice :(
Unmatched ) in regex; marked by <-- HERE in m/(\s|^|,|/|;|\|)Owner) <-- HERE (\s|$|,|/|;|\|)/
If you could tell me what the regex does, that would be awesome. Does it strip those symbols? Thanks guys!

If the variable $replace can contain regex meta characters you should wrap it in \Q...\E
\Q$replace\E
To quote Jeffrey Friedl's Mastering Regular Expressions
Literal Text Span The sequence \Q "Quotes" regex metacharacters (i.e., puts a backslash in front of them) until the end of the string, or until a \E sequence.

As mentioned, it'll strip those punctuation symbols, followed by the contents of $replace, then more punctuation symbols, and that it's failing because $replace itself contains a mismatched parenthesis.
However, a few other general regex things: first, instead of ORing everything together (and this is just to simplify logic and typing) I'd keep them together in a character class. matching [\s^,\/;\|] is potentially less error-prone and finger friendly.
Second, don't use grouping parenthesis a set of () unless you really mean it. This places the captured string in capture buffers, and incurs overhead in the regex engine. Per perldoc perlre:
WARNING: Once Perl sees that you need one of $& , $` , or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. Source
You can easily get around this by just changing it by adding ?: to the parenthesis:
(?:[\s^,\/;\|])
Edit: not that you need non-capturing grouping in that instance, but it's already in the original regex.

It appears that your variable $replace contains the string Owner), not (Owner).
$title = "Foo Owner Bar";
$replace = "Owner)";
if($title =~ s/(\s|^|,|\/|;|\|)$replace(\s|$|,|\/|;|\|)//ig) {
print $title;
}
Output:
Unmatched ) in regex; marked by <-- HERE in m/(\s|^|,|/|;|\|)Owner)<-- HERE (\s
|$|,|/|;|\|)/ at test.pl line 3.
$title = "Foo Owner Bar";
$replace = "(Owner)";
if($title =~ s/(\s|^|,|\/|;|\|)$replace(\s|$|,|\/|;|\|)//ig) {
print $title;
}
Output:
FooBar

Related

Perl remove text within () characters

I have a variable which may or may not contain text within brackets, e.g.
blah blah (soups up)
I want to remove anything within and including the brackets, so for this example I'd be left with:
blah blah
I tried the following substitution but it didn't work as expected:
$desc =~ s/(.*?)//gs;
print "fixed desc: $desc\n";
prints:
fixed desc:
As per the discussion, anything, including sub brackets within brackets should be blitz'd
e.g.
blah blah (soups up (tomato!) )
Matching balanced text is a classic hard regex problem. For example, how do you deal with keep (remove) keep (remove)? Fortunately it's gotten much easier. perlfaq4 covers it. You have two choices.
First is to use recursive regexes introduced in 5.10. (?R) says to recurse the whole pattern.
m{
\( # Open paren
(?>
[^()] | # No nested parens OR
(?R) # Recurse to check for balanced parens
)*
\) # Close paren
}x;
However, this doesn't deal with escapes like (this is \) all in parens).
Rather than go into the regex contortions necessary to handle escapes, use a module to build that regex for you. Regexp::Common::balanced and Regexp::Common::delimited can do that, and a lot of other hard regex problems, and it will handle escapes.
use v5.10;
use strict;
use warnings;
use Regexp::Common;
my $re = $RE{balanced}{-parens=>"()"};
my $s = "blah blah (soups up (tomato!\) )";
$s =~ s{$re}{};
say $s; # "blah blah"
Well the first thing to note in the most simple case, if you aren't yet worried about some of the edge cases mentioned above, is that the bracket characters are also used for grouping and backreferences in regexes. So you'll need to escape them in your match statement like so:
$desc =~ s/\(.*\)//gs;
Here's some more info on the topic:
http://perlmeme.org/faqs/regexp/metachar_regexp.html
Second question: What are you intending to do with the question mark in the match? The '*' will match from 0-n occurrences of the previous character, so I'm not sure the '?' is going to do much here.

Perl regex with grouping in LHS

How to can match the next lines?
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
want remove the - repetative.text from the end, but only if it repeats.
sometext_TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
my trying
use strictures;
my $text="sometext_TEXT1.xxx-TEXT1.xxx";
$text =~ s/(.*?)(.*)(\s*-\s*$2)/$1$2/;
print "$text\n";
prints
Use of uninitialized value $2 in regexp compilation at a line 3.
with other words, looking for better solution for the next split + match...
while(<DATA>) {
chomp;
my($first, $second) = split /\s*-\s*/;
s/\s*-\s*$second$// if ( $first =~ /$second$/ );
print "$_\n";
}
__DATA__
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
$text =~ s/(.*?)(.*)(\s*-\s*$2)/$1$2/;
This regex has various issues, but is on the right path.
Use \2 (or better: \g2 or \g{-1}) or something to reference the contents of a capture group. The $2 variable is interpolated when the Perl statement is executed. At that time, $2 is undefined, as there was no previous match. You get a warning as it is uninitialized. Even if it were defined, the pattern would be fixed during compilation.
You define three capture groups, but only need one. There is a trick with the \Keep directive: It let's the regex engine forget the previously matched text, so that it won't be affected by the substitution. That is, s/(foo)b/$1/ is equivalent to s/foo\Kb//. The effect is similar to a variable-length lookbehind.
The (.*?)(.*) part is a bit of an backtracking nightmare. We can reduce the cost of your match by adding further conditions, e.g. by anchoring the pattern at start and end of line. Using above modifications, we now have s/^.*?(.*)\K\s*-\s*\g1$//. But on second thought, we can just remove the ^.*? because this describes something the regex engine does anyway!
A short test:
while(<DATA>) {
s/(.*)\K\s*-\s*\g1$//;
print;
}
__DATA__
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
Output:
sometext_TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
A few words regarding your splitting solution: This will also shorten the line
sometext_TEXT1xyyy - 1.xyyy
because when you interpolate a variable into a regex, the contents aren't matched literally. Instead, they are interpreted as a pattern (where . matches any non-newline codepoint)! You can avoid this by quoting all metacharacters with the \Q...\E escape:
s/\s*-\s*\Q$second\E$// if $first =~ /\Q$second\E$/;
When you use $2 Perl will try to interpolate that variable, but the variable will only be set after the match has completed. What you want, is a backreference, for which you need to use \2:
$text =~ s/(.*?)(.*)(\s*-\s*\2)/$1$2/;
Note that, when the replacement part is evaluated, $1 and $2 have been set and can be interpolated as expected. Also you could make the pattern a bit more concise (and probably more efficient), by using:
$text =~ s/(.*)\s*-\s*\2/$1/;
There is no need to match the initial part (.*?) if it's arbitrary and you just write it back anyway. What you might want to do though, is anchor the pattern to the end of the string:
$text =~ s/(.*)\s*-\s*\1$/$1/;
Otherwise (with your initial attempt or mine), you'd turn something-thingelse into somethingelse.

Difference between Perl regular expression delimiters /.../ and #...#

Today I came across two different syntaxes for a Perl regular expression match.
#I have a date string
my $time = '2012-10-29';
#Already familiar "m//":
$t =~ m/^(\d{4}-\d\d-\d\d)$/
#Completely new to me m##.
$t =~ m#^(\d{4}-\d\d-\d\d)#/
Now what is the difference between /expression/ and #expression#?
As everone else said, you can use any delimiter after the m.
/ has one special feature: you can use it by itself, e.g.
$string =~ /regexp/;
is equivalent to:
$string =~ m/regexp/;
Perl allows you to use pretty much any characters to delimit strings, including regexes. This is especially useful if you need to match a pattern that contains a lot of slash characters:
$slashy =~ m/\/\//; #Bad
$slashy =~ m|//|; #Good
According to the documentation, the first of those is an example of "leaning toothpick syndrome".
Most but not all characters behave in the same way when escaping. There is an important exception: m?...? is a special case that only matches a single time between calls to reset().
Another exception: if single quotes are used for the delimiter, no variable interpolation is done. You still have to escape $, though, as it is a special character matching the end of the line.
Nothing except what you have to escape in the regex. You can use any pair of matched characters you like.
$string = "http://example.com/";
$string =~ m!http://!;
$string =~ m#http://!#;
$string =~ m{http://};
$string =~ m/http:\/\//;
After the match or search/replace operator (the m and s, respectively) you can use any character as the delimiter, e.g. the # in your case. This also works with pairs of parenthesis: s{ abc (.*) def }{ DEF $1 ABC }x.
Advantages are that you don't have to escape the / (but the actual delimiter characters, of course). It's often used for clarity, especially when dealing with things like paths or protocols.
There is no difference; the "/" and "#" characters are used as delimiters for the expression. They simply mark the "boundary" of the expression, but are not part of the expression. In theory you can use most non-alphanumeric characters as a delimiter. Here is a link to the PHP manual (It doesn't matter that it is the PHP manual, the Regex syntax is the same, I just like it because it explains well) on Perl compatible regular expression syntax; read the part about delimiters

How can I extract a substring enclosed in double quotes in Perl?

I'm new to Perl and regular expressions and I am having a hard time extracting a string enclosed by double quotes. Like for example,
"Stackoverflow is
awesome"
Before I extract the strings, I want to check if it is the end of the line of the whole text was in the variable:
if($wholeText =~ /\"$/) #check the last character if " which is the end of the string
{
$wholeText =~ s/\"(.*)\"/$1/; #extract the string, removed the quotes
}
My code didn't work; it is not getting inside of the if condition.
You need to do:
if($wholeText =~ /"$/)
{
$wholeText =~ s/"(.*?)"/$1/s;
}
. doesn't match newlines unless you apply the /s modifier.
There's no need to escape the quotes like you're doing.
The above poster who recommended using the "m" flag in the regular expression is correct, however the regex provided won't quite work. When you say:
$wholeText =~ s/\"(.*)\"/$1/m; #extract the string, removed the quotes
...the regular expression is too "greedy", which means the (.*) part will gobble up too much of the text. If you have a sample like this:
"The quick brown fox," he said, "jumped over the lazy dog."
...then the above regex will capture everything from "The" through "dog.", which is probably not what you intend. There are two ways to make the regex less greedy. Which one is better has everything to do with how you choose to handle extra " marks inside your string.
One:
$wholeText =~ s/\"([^"]*)\"/$1/m;
Two:
$wholeText =~ s/\"(.*?)\"/$1/m;
In One, the regex says "start with quote, then find everything that is not a quote and remember it, until you see another quote." In Two, the regex says "Start with quote, then find everything until you find another quote." The extra ? inside the ( ) tells the regex processor to not be greedy. Without considering quote escaping within the string, both regular expressions should behave the same.
By the way, this is a classic problem when parsing a CSV ("Comma Separated Values") file, by the way, so looking up some references on that may help you out.
If you want to anchor a match to the very end of the string (not line, entire string), use the \z anchor:
if( $wholeText =~ /"\z/ ) { ... }
You don't need a guard condition for this. Just use the right regex in the substitution. If it doesn't match the regex, nothing happens:
$wholeText =~ s/"(.*?)"\z/$1/s;
I think you really have a different question though. Why are you trying to anchor it to the end of the string? What problems are you trying to avoid?
For multi-line strings, you need to include the 'm' modifier with the search pattern.
if ($wholeText =~ m/\"$/m) # First m for match operator; second multi-line modifier
{
$wholeText =~ s/\"(.*?)\"/$1/s; #extract the string, removed the quotes
}
You will also need to consider whether you allow double quotes inside the string and if so, which convention to use. The primary ones are backslash and double quote (also backslash backslash), or double quote double quote in the string. These slightly complicate your regex.
The answer by #chaos uses 's' as a multi-line modifier. There's a small difference between the two:
m
Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.
s
Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
Assuming you have a single substring in quotes, this will extract it:
s/."(.?)".*/$1/
And the answer above (s/"(.*?)"/$1/s) will just remove quotes.
Test code:
my $text = "no \"need this\" again, no\n";
my $text2 = $text;
print $text;
$text2 =~ s/.*\"(.*?)\".*/$1/;
print $text2;
$text =~ s/"(.*?)"/$1/s;
print $text;
Output:
no "need this" again, no
need this
no need this again, no

How can I preserve whitespace when I match and replace several words in Perl?

Let's say I have some original text:
here is some text that has a substring that I'm interested in embedded in it.
I need the text to match a part of it, say: "has a substring".
However, the original text and the matching string may have whitespace differences. For example the match text might be:
has a
substring
or
has a substring
and/or the original text might be:
here is some
text that has
a substring that I'm interested in embedded in it.
What I need my program to output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
I also need to preserve the whitespace pattern in the original and just add the start and end markers to it.
Any ideas about a way of using Perl regexes to get this to happen? I tried, but ended up getting horribly confused.
Been some time since I've used perl regular expressions, but what about:
$match = s/(has\s+a\s+substring)/[$1]/ig
This would capture zero or more whitespace and newline characters between the words. It will wrap the entire match with brackets while maintaining the original separation. It ain't automatic, but it does work.
You could play games with this, like taking the string "has a substring" and doing a transform on it to make it "has\s*a\s*substring" to make this a little less painful.
EDIT: Incorporated ysth's comments that the \s metacharacter matches newlines and hobbs corrections to my \s usage.
This pattern will match the string that you're looking to find:
(has\s+a\s+substring)
So, when the user enters a search string, replace any whitespace in the search string with \s+ and you have your pattern. The, just replace every match with [match starts here]$1[match ends here] where $1 is the matched text.
In regexes, you can use + to mean "one or more." So something like this
/has\s+a\s+substring/
matches has followed by one or more whitespace chars, followed by a followed by one or more whitespace chars, followed by substring.
Putting it together with a substitution operator, you can say:
my $str = "here is some text that has a substring that I'm interested in embedded in it.";
$str =~ s/(has\s+a\s+substring)/\[match starts here]$1\[match ends here]/gs;
print $str;
And the output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
A many has suggested, use \s+ to match whitespace. Here is how you do it automaticly:
my $original = "here is some text that has a substring that I'm interested in embedded in it.";
my $search = "has a\nsubstring";
my $re = $search;
$re =~ s/\s+/\\s+/g;
$original =~ s/\b$re\b/[match starts here]$&[match ends here]/g;
print $original;
Output:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
You might want to escape any meta-characters in the string. If someone is interested, I could add it.
This is an example of how you could do that.
#! /opt/perl/bin/perl
use strict;
use warnings;
my $submatch = "has a\nsubstring";
my $str = "
here is some
text that has
a substring that I'm interested in, embedded in it.
";
print substr_match($str, $submatch), "\n";
sub substr_match{
my($string,$match) = #_;
$match =~ s/\s+/\\s+/g;
# This isn't safe the way it is now, you will need to sanitize $match
$string =~ /\b$match\b/;
}
This currently does anything to check the $match variable for unsafe characters.