I have a variable which may or may not contain text within brackets, e.g.
blah blah (soups up)
I want to remove anything within and including the brackets, so for this example I'd be left with:
blah blah
I tried the following substitution but it didn't work as expected:
$desc =~ s/(.*?)//gs;
print "fixed desc: $desc\n";
prints:
fixed desc:
As per the discussion, anything, including sub brackets within brackets should be blitz'd
e.g.
blah blah (soups up (tomato!) )
Matching balanced text is a classic hard regex problem. For example, how do you deal with keep (remove) keep (remove)? Fortunately it's gotten much easier. perlfaq4 covers it. You have two choices.
First is to use recursive regexes introduced in 5.10. (?R) says to recurse the whole pattern.
m{
\( # Open paren
(?>
[^()] | # No nested parens OR
(?R) # Recurse to check for balanced parens
)*
\) # Close paren
}x;
However, this doesn't deal with escapes like (this is \) all in parens).
Rather than go into the regex contortions necessary to handle escapes, use a module to build that regex for you. Regexp::Common::balanced and Regexp::Common::delimited can do that, and a lot of other hard regex problems, and it will handle escapes.
use v5.10;
use strict;
use warnings;
use Regexp::Common;
my $re = $RE{balanced}{-parens=>"()"};
my $s = "blah blah (soups up (tomato!\) )";
$s =~ s{$re}{};
say $s; # "blah blah"
Well the first thing to note in the most simple case, if you aren't yet worried about some of the edge cases mentioned above, is that the bracket characters are also used for grouping and backreferences in regexes. So you'll need to escape them in your match statement like so:
$desc =~ s/\(.*\)//gs;
Here's some more info on the topic:
http://perlmeme.org/faqs/regexp/metachar_regexp.html
Second question: What are you intending to do with the question mark in the match? The '*' will match from 0-n occurrences of the previous character, so I'm not sure the '?' is going to do much here.
Related
I have a (probably very basic) question about how to construct a (perl) regex, perl -pe 's///g;', that would find/replace multiple instances of a given character/set of characters in a specified string. Initially, I thought the g "global" flag would do this, but I'm clearly misunderstanding something very central here. :/
For example, I want to eliminate any non-alphanumeric characters in a specific string (within a larger text corpus). Just by way of example, the string is identified by starting with [ followed by #, possibly with some characters in between.
[abc#def"ghi"jkl'123]
The following regex
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;
will find the first " and if I run it three times I have all three.
Similarly, what if I want to replace the non-alphanumeric characters with something else, let's say an X.
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1X$2/g;
does the trick for one instance. But how can I find all of them in one go?
The reason your code doesn't work is that /g doesn't rescan the string after a substitution. It finds all non-overlapping matches of the given regex and then substitutes the replacement part in.
In [abc#def"ghi"jkl'123], there is only a single match (which is the [abc#def" part of the string, with $1 = '[abc#def' and $2 = ''), so only the first " is removed.
After the first match, Perl scans the remaining string (ghi"jkl'123]) for another match, but it doesn't find another [ (or #).
I think the most straightforward solution is to use a nested search/replace operation. The outer match identifies the string within which to substitute, and the inner match does the actual replacement.
In code:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;
Or to replace each match by X:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;
We match a prefix of [, followed by 0 or more characters that are not [ or ] or #, followed by #.
\K is used to mark the virtual beginning of the match (i.e. everything matched so far is not included in the matched string, which simplifies the substitution).
We match and capture 0 or more characters that are not [ or ].
Finally we match a suffix of ] in a look-ahead (so it's not part of the matched string either).
The replacement part is executed as a piece of code, not a string (as indicated by the /e flag). Here we could have used $1 =~ s/[^a-zA-Z0-9]//gr or $1 =~ s/[^a-zA-Z0-9]/X/gr, respectively, but since each inner match is just a single character, it's also possible to use a transliteration.
We return the modified string (as indicated by the /r flag) and use it as the replacement in the outer s operation.
So...I'm going to suggest a marvelously computationally inefficient approach to this. Marvelously inefficient, but possibly still faster than a variable-length lookbehind would be...and also easy (for you):
The \K causes everything before it to be dropped....so only the character after it is actually replaced.
perl -pe 'while (s/\[[^]]*#[^]]*\K[^]a-zA-Z0-9]//){}' file
Basically we just have an empty loop that executes until the search and replace replaces nothing.
Slightly improved version:
perl -pe 'while (s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//){}' file
The (?=) verifies that its content exists after the match without being part of the match. This is a variable-length lookahead (what we're missing going the other direction). I also made the *s lazy with the ? so we get the shortest match possible.
Here is another approach. Capture precisely the substring that needs work, and in the replacement part run a regex on it that cleans it of non-alphanumeric characters
use warnings;
use strict;
use feature 'say';
my $var = q(ah [abc#def"ghi"jkl'123] oh); #'
say $var;
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
(my $v = $1) =~ s{[^0-9a-zA-Z]}{}g;
$v
}ex;
say $var;
where the lone $v is needed so to return that and not the number of matches, what s/ operator itself returns. This can be improved by using the /r modifier, which returns the changed string and doesn't change the original (so it doesn't attempt to change $1, what isn't allowed)
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
$1 =~ s/[^0-9a-zA-Z]//gr;
}ex;
The \K is there so that all matches before it are "dropped" -- they are not consumed so we don't need to capture them in order to put them back. The /e modifier makes the replacement part be evaluated as code.
The code in the question doesn't work because everything matched is consumed, and (under /g) the search continues from the position after the last match, attempting to find that whole pattern again further down the string. That fails and only that first occurrence is replaced.
The problem with matches that we want to leave in the string can often be remedied by \K (used in all current answers), which makes it so that all matches before it are not consumed.
I'm new to both perl and using regex. I need to remove the white space from a string.
I found an example but its pretty opaque to me. Is this an accurate description of whats happening?
sub trim($)
{
my $string = shift;
$string =~ s/^\s+//;
# =~ : regex on variable string
# s/ : replace match with value
# ^\s+ : one or more white space characters at beginning
# // : no characters
$string =~ s/\s+$//;
# =~ : regex on variable $string
# s/ : replace match with value
# \s+$ : one or more white space characters until end of line
# // : no characters
return $string;
}
Yes it is.
Nothing else to say, actually.
Yes it is, as answered by sidyll. All your comments are accurate. Since these are basics you are asking, I would like to add a little.
You can do both in one expression, s/^\s+|\s+$//g (there are slight efficiency considerations).
Note that now you need /g ("global") modifier so that all \s+ are found. Otherwise the engine stops after it finds ^\s+ (if there are any) and you are left with trailing space (if any).
You can use spaces in your regex, for readability, by using /x modifier. In this case it isn't much but with more complex ones it can help a lot.
$string =~ s% ^\s+ | \s+$ %%gx;
You may use different delimiters -- as long as you don't use that inside the regex. I use % above to avoid the editor coloring everything red (I find % not very readable in fact, but I need | inside). This is sometimes very useful, for example when your regex has a lot of /. Then you can use a different delimiter so you don't have to escape them.
Complete resources are given by ThisSuitIsBlackNot in the comment.
I've seen people praise this a lot: regex Demo, where you can type in a regex and see how it works.
Consider the following string:
blah, foo(a,b), bar(c,d), yo
I want to extract a list of strings:
blah
foo(a,b)
bar(c,d)
yo
It seems to me that I should be able to use quote words here, but I'm struggling with the regex. Can someone help me out?
Perl has a little thing regex recursion, so you might be able to look for:
either a bare word like blah containing no parentheses (\w+)
a "call", like \w+\((?R)(, *(?R))*\)
The total regex is (\w+(\((?R)(, ?(?R))*\))?), which seems to work.
You can use the following regex to use in split:
\([^()]*\)(*SKIP)(*F)|\s*,\s*
With \([^()]*\), we match a ( followed with 0 or more characters other than ( or ) and then followed with ). We fail the match with (*SKIP)(*F) if that parenthetical construction is found, and then we only match the comma surrounded with optional whitespaces.
See demo
#!/usr/bin/perl
my $string= "blah, foo(a,b), bar(c,d), yo";
my #string = split /\([^()]*\)(*SKIP)(*F)|\s*,\s*/, $string;
foreach(#string) {
print "$_\n";
}
To account for commas inside nested balanced parentheses, you can use
my #string = split /\((?>[^()]|(?R))*\)(*SKIP)(*F)|\s*,\s*/, $string;
Here is an IDEONE demo
With \((?>[^()]|(?R))*\) we match all balanced ()s and fail the match if found with the verbs (*SKIP)(*F), and then we match a comma with optional whitespace around (so as not to manually trim the strings later).
For a blah, foo(b, (a,b)), bar(c,d), yo string, the result is:
blah
foo(b, (a,b))
bar(c,d)
yo
There is a solution given by Borodin for one of your question (which is similar to this question). A small change of regex will give you desire output: (this will not work for nested parentheses)
use strict;
use warnings;
use 5.010;
my $line = q<blah, foo(a,b), bar(c,d), yo>;
my #words = $line =~ / (?: \([^)]*\) | [^,] )+ /xg;
say for #words;
Output:
blah
foo(a,b)
bar(c,d)
yo
How to can match the next lines?
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
want remove the - repetative.text from the end, but only if it repeats.
sometext_TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
my trying
use strictures;
my $text="sometext_TEXT1.xxx-TEXT1.xxx";
$text =~ s/(.*?)(.*)(\s*-\s*$2)/$1$2/;
print "$text\n";
prints
Use of uninitialized value $2 in regexp compilation at a line 3.
with other words, looking for better solution for the next split + match...
while(<DATA>) {
chomp;
my($first, $second) = split /\s*-\s*/;
s/\s*-\s*$second$// if ( $first =~ /$second$/ );
print "$_\n";
}
__DATA__
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
$text =~ s/(.*?)(.*)(\s*-\s*$2)/$1$2/;
This regex has various issues, but is on the right path.
Use \2 (or better: \g2 or \g{-1}) or something to reference the contents of a capture group. The $2 variable is interpolated when the Perl statement is executed. At that time, $2 is undefined, as there was no previous match. You get a warning as it is uninitialized. Even if it were defined, the pattern would be fixed during compilation.
You define three capture groups, but only need one. There is a trick with the \Keep directive: It let's the regex engine forget the previously matched text, so that it won't be affected by the substitution. That is, s/(foo)b/$1/ is equivalent to s/foo\Kb//. The effect is similar to a variable-length lookbehind.
The (.*?)(.*) part is a bit of an backtracking nightmare. We can reduce the cost of your match by adding further conditions, e.g. by anchoring the pattern at start and end of line. Using above modifications, we now have s/^.*?(.*)\K\s*-\s*\g1$//. But on second thought, we can just remove the ^.*? because this describes something the regex engine does anyway!
A short test:
while(<DATA>) {
s/(.*)\K\s*-\s*\g1$//;
print;
}
__DATA__
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
Output:
sometext_TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
A few words regarding your splitting solution: This will also shorten the line
sometext_TEXT1xyyy - 1.xyyy
because when you interpolate a variable into a regex, the contents aren't matched literally. Instead, they are interpreted as a pattern (where . matches any non-newline codepoint)! You can avoid this by quoting all metacharacters with the \Q...\E escape:
s/\s*-\s*\Q$second\E$// if $first =~ /\Q$second\E$/;
When you use $2 Perl will try to interpolate that variable, but the variable will only be set after the match has completed. What you want, is a backreference, for which you need to use \2:
$text =~ s/(.*?)(.*)(\s*-\s*\2)/$1$2/;
Note that, when the replacement part is evaluated, $1 and $2 have been set and can be interpolated as expected. Also you could make the pattern a bit more concise (and probably more efficient), by using:
$text =~ s/(.*)\s*-\s*\2/$1/;
There is no need to match the initial part (.*?) if it's arbitrary and you just write it back anyway. What you might want to do though, is anchor the pattern to the end of the string:
$text =~ s/(.*)\s*-\s*\1$/$1/;
Otherwise (with your initial attempt or mine), you'd turn something-thingelse into somethingelse.
if($title =~ s/(\s|^|,|\/|;|\|)$replace(\s|$|,|\/|;|\|)//ig)
$title can be a set of titles ranging from President, MD, COO, CEO,...
$replace can be (shareholder), (Owner) or the like.
I keep getting this error. I have checked for improperly balanced '(', ')', no dice :(
Unmatched ) in regex; marked by <-- HERE in m/(\s|^|,|/|;|\|)Owner) <-- HERE (\s|$|,|/|;|\|)/
If you could tell me what the regex does, that would be awesome. Does it strip those symbols? Thanks guys!
If the variable $replace can contain regex meta characters you should wrap it in \Q...\E
\Q$replace\E
To quote Jeffrey Friedl's Mastering Regular Expressions
Literal Text Span The sequence \Q "Quotes" regex metacharacters (i.e., puts a backslash in front of them) until the end of the string, or until a \E sequence.
As mentioned, it'll strip those punctuation symbols, followed by the contents of $replace, then more punctuation symbols, and that it's failing because $replace itself contains a mismatched parenthesis.
However, a few other general regex things: first, instead of ORing everything together (and this is just to simplify logic and typing) I'd keep them together in a character class. matching [\s^,\/;\|] is potentially less error-prone and finger friendly.
Second, don't use grouping parenthesis a set of () unless you really mean it. This places the captured string in capture buffers, and incurs overhead in the regex engine. Per perldoc perlre:
WARNING: Once Perl sees that you need one of $& , $` , or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. Source
You can easily get around this by just changing it by adding ?: to the parenthesis:
(?:[\s^,\/;\|])
Edit: not that you need non-capturing grouping in that instance, but it's already in the original regex.
It appears that your variable $replace contains the string Owner), not (Owner).
$title = "Foo Owner Bar";
$replace = "Owner)";
if($title =~ s/(\s|^|,|\/|;|\|)$replace(\s|$|,|\/|;|\|)//ig) {
print $title;
}
Output:
Unmatched ) in regex; marked by <-- HERE in m/(\s|^|,|/|;|\|)Owner)<-- HERE (\s
|$|,|/|;|\|)/ at test.pl line 3.
$title = "Foo Owner Bar";
$replace = "(Owner)";
if($title =~ s/(\s|^|,|\/|;|\|)$replace(\s|$|,|\/|;|\|)//ig) {
print $title;
}
Output:
FooBar