perl regular expression take out text enclosed in parentheses - regex

How do I use Perl to get rid of text within parentheses? For example:
$str = "This is a (extra stuff) string."
to
$str = "This is a string."
I am currently using this, but it's not working:
$str =~ s/( ( [^)]+ ) )//;
Thanks!

You need to escape the parentheses, like:
s/\([^)]*\)//g
Update by popular demand:
To remove the space you can simply remove spaces before the parenthesis. This will work in most cases:
s/\s*\([^)]*\)//g
To handle nested parentheses you can use a recursive pattern, like so:
s/\s*\((?:[^()]+|(?R))*\)//g
You can read about (?R) and the like in perlre.
The last expression will work for string like aaa (foo(b,a,2*(3+4)) b) (c (c) c) ddd (x)., giving aaa ddd..
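To try the recursive substitution on the strings mentioned above, here is a minimal sketch. The sub name strip_parens is made up for illustration, and (?R) assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# strip_parens is a hypothetical helper name, not part of the answer above.
# It applies the recursive substitution; (?R) needs Perl 5.10 or later.
sub strip_parens {
    my ($text) = @_;
    $text =~ s/\s*\((?:[^()]+|(?R))*\)//g;
    return $text;
}

print strip_parens('This is a (extra stuff) string.'), "\n";
print strip_parens('aaa (foo(b,a,2*(3+4)) b) (c (c) c) ddd (x).'), "\n";
```

The pattern assumes balanced parentheses; an unmatched ( is simply left in place.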

The ( are special and must be escaped
s/\([^)]+\)//g

None of the solutions so far do what the OP asked.
The expression $str =~ s/\([^)]*\)//g;
Converts "This is a (extra stuff) string" to "This is a string", leaving two spaces between the "a" and "string".
Converts "This is a (doubly (nested)) string" to "This is a ) string".
Converts "This is a (doubly (no, (triply!) nested) expression) string" to "This is a nested) expression) string".
Similar problems exist with $str =~ s/[ ]?\(.*?\)[ ]?//g; And why use those square brackets? Aren't regular expressions hairy enough without unneeded stuff?
We're going to need something a bit hairier so we can eat multiply-nested parenthetical remarks and properly deal with keeping spacing where needed but discarding it otherwise. This does the trick:
1 while $str =~ s/(\w?)(\s*)\([^()]*\)(\s*)(\w?)
/($1&&$4)?($1.($2?$2:$3).$4):($1?$1:$4)/ex;
Edit
Test results:
'This string is OK as is.' -> 'This string is OK as is.'
'This is a (extra stuff) string.' -> 'This is a string.'
'(Preliminary remark) string' -> 'string'
'String (with end remark)' -> 'String'
'A string (remark before punctuation)!' -> 'A string!'
'A (doubly (nested)) string' -> 'A string'
'A (doubly (no, (triply!) nested)) string' -> 'A string'
Edit2
The exg qualification results in incorrect handling of "This (delete) (delete) is a string". All that is needed is ex.
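For experimenting with the test cases listed above, the loop can be wrapped in a small sub. This is only a sketch; remove_remarks is a hypothetical name chosen here.

```perl
use strict;
use warnings;

# remove_remarks is a hypothetical wrapper around the substitution above.
sub remove_remarks {
    my ($str) = @_;
    # Each pass deletes one (...) group that has no parentheses inside it,
    # so nested remarks disappear from the inside out.
    1 while $str =~ s/(\w?)(\s*)\([^()]*\)(\s*)(\w?)
                     /($1&&$4)?($1.($2?$2:$3).$4):($1?$1:$4)/ex;
    return $str;
}

print remove_remarks('A (doubly (no, (triply!) nested)) string'), "\n";
print remove_remarks('This (delete) (delete) is a string'), "\n";
```

One caveat of the $1&&$4 test: a word character of 0 on either side is false in Perl, so such input would need a defined/length check instead.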

This line should do what you need:
$str =~ s/[ ]?\(.*?\)[ ]?//g;
Do note that it won't work with nested brackets (like (this)), since the regex would have to be a lot more complicated for that type of functionality.

I convert special characters to hex so they are easy to use in my regexes:
/\x28([^\x29]+)\x29/

Hmm I had expected the "greedy" principle to apply, eating all the way to the close parenthesis even when nested. Perhaps a little brute force, using index and rindex functions, would be better.
But I still wonder, why doesn't
$str =~ s/[ ]?\(.*?\)[ ]?//g;
slurp it all the way to the last ')'?
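The difference comes from the ? after the *: .*? is non-greedy and stops at the first ) it can, whereas a plain .* is greedy and backs off from the end of the string to the last ). A minimal comparison, using a made-up input string:

```perl
use strict;
use warnings;

my $input = 'This is a (doubly (nested)) string';

(my $greedy = $input) =~ s/\(.*\)//;    # .* runs to the LAST ')'
(my $lazy   = $input) =~ s/\(.*?\)//;   # .*? stops at the FIRST ')'

print "greedy: $greedy\n";   # greedy: This is a  string
print "lazy:   $lazy\n";     # lazy:   This is a ) string
```

Note that the greedy form would also swallow everything between two separate remarks on the same line, so neither variant handles nesting properly.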

A split version. I kind of like split for this, because it is non-invasive, preserving the original format, and also, regexes tend to become... complicated. Though you need regex to trim it, of course.
You'd still need to work out the spacing. It is not a simple thing to predict whether extra space will appear in the front or end, and removing all double spaces will not preserve original format. This solution removes a single space in front of opening parens, and nothing else. Works in most cases, assuming the input has correct punctuation to begin with.
use warnings;
use strict;
while (<DATA>) {
    my @parts = split /\(/;
    print de_paren(@parts);
}
sub de_paren {
    my $return = shift;
    my @parts = @_;
    # defined() keeps the loop going over empty pieces, e.g. from "(("
    while (defined(my $word = shift @parts)) {
        next unless $word =~ /\)/;
        $word =~ s/^.*?\)// while ($word =~ /\)/);
        $return =~ s/ $//;
        $return .= $word;
    }
    return $return;
}
__DATA__
A (doubly (no, (triply!) nested)) string
This is a (extra stuff) string.
(Preliminary remark) string
String (with end remark) String (with end remark)
A string (remark before punctuation)!
A (doubly (nested)) string
Output is:
A string
This is a string.
string
String String
A string!
A string

Related

Telling regex search to only start searching at a certain index

Normally, a regex search will start searching for matches from the beginning of the string I provide. In this particular case, I'm working with a very large string (up to several megabytes), and I'd like to run successive regex searches on that string, but beginning at specific indices.
Now, I'm aware that I could use the substr function to simply throw away the part at the beginning I want to exclude from the search, but I'm afraid this is not very efficient, since I'll be doing it several thousand times.
The specific purpose I want to use this for is to jump from word to word in a very large text, skipping whitespace (regardless of whether it's simple space, tabs, newlines, etc). I know that I could just use the split function to split the text into words by passing \s+ as the delimiter, but that would make things far more complicated for me later on, as there are various other possible word delimiters such as quotes (ok, I'm using the term 'word' a bit generously here), so it would be easier for me if I could just hop from word to word using successive regex searches on the same string, always specifying the next index at which to start looking as I go. Is this doable in Perl?
So you want to match against the words of a body of text.
(The examples find words that contain i.)
You think having the starting positions of the words would help, but it isn't useful. The following illustrates what it might look like to obtain the positions and use them:
my @positions;
while ($text =~ /\w+/g) {
    push @positions, $-[0];
}
my @matches;
for my $pos (@positions) {
    pos($text) = $pos;
    push @matches, $1 if $text =~ /\G(\w*i\w*)/g;
}
It would be far simpler not to use the starting positions at all. Aside from being far simpler, we also remove the need for two different regex patterns to agree as to what constitutes a word. The result is the following:
my @matches;
while ($text =~ /\b(\w*i\w*)/g) {
    push @matches, $1;
}
or
my @matches = $text =~ /\b(\w*i\w*)/g;
A far better idea, however, is to extract the words themselves in advance. This approach allows for simpler patterns and more advanced definitions of "word"[1].
my @matches;
while ($text =~ /(\w+)/g) {
    my $word = $1;
    push @matches, $word if $word =~ /i/;
}
or
my @matches = grep { /i/ } $text =~ /\w+/g;
[1] For example, a proper tokenizer could be used.
In the absence of more information, I can only suggest the pos function.
When doing a global regex search, the engine saves the position where the previous match ended so that it knows where to start searching for the next iteration. The pos function gives access to that value and allows it to be set explicitly, so that a subsequent m//g will start looking at the specified position instead of at the start of the string.
This program gives an example. The string is searched for the first non-space character after each of a list of offsets, and displays the character found, if any.
Note that the global match must be done in scalar context, which is applied by if here, so that only the next match will be reported. Otherwise the global search will just run on to the end of the file and leave information about only the very last match.
use strict;
use warnings 'all';
use feature 'say';
my $str = 'a  b  c  d  e  f  g  h  i  j  k  l  m  n';
#          0123456789012345678901234567890123456789
#                    1         2         3
for ( 4, 31, 16, 22 ) {
pos($str) = $_;
say $1 if $str =~ /(\S)/g;
}
output
c
l
g
i
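For the word-hopping use case in the question, pos combines well with the \G anchor and the /c modifier, which keeps pos where it is when a match fails. The following is a rough sketch with a made-up sample string; a real tokenizer would need more rules for quotes and other delimiters.

```perl
use strict;
use warnings;

my $text = "alpha  beta\tgamma\n delta";   # sample input, invented for this sketch
my @words;                                  # list of [offset, word] pairs

pos($text) = 0;                             # any valid start index works here
while (1) {
    $text =~ /\G\s+/gc;                     # skip whitespace at the current position
    last unless $text =~ /\G(\S+)/gc;       # take the next run of non-whitespace
    push @words, [ $-[1], $1 ];             # $-[1] is the start offset of group 1
}

printf "%d:%s\n", @$_ for @words;
```

No substrings are copied along the way; each match simply resumes where the previous one stopped.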

Perl - Regex to extract only the comma-separated strings

I have a question I am hoping someone could help with...
I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).
The variable contains data such as these:
$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
The only bits I am interested in from the above examples are:
@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger")
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")
The problem I am having:
I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.
But what is the best way to make sure that I get the strings at the start (ie cat_dog) and end (ie chicken-pig) of the comma-separated list of animals as they are not prefixed/suffixed with a comma.
Also, as the variables will contain webpage content, it is inevitable that there may also be instances where a comma is immediately succeeded by a space and then another word, as that is the correct method of using commas in paragraphs and sentences...
For example:
Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.
                                                     ^        ^
                                                     |        |
                                     note the spaces here and here
I am not interested in any cases where the comma is followed by a space (as shown above).
I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)
I have a tried a number of ways of doing this but cannot work out the best way to go about constructing the regular expression.
How about
[^,\s]+(,[^,\s]+)+
which will match one or more characters that are not a space or comma [^,\s]+ followed by a comma and one or more characters that are not a space or comma, one or more times.
Further to comments
To match more than one sequence add the g modifier for global matching.
The following splits each match $& on a , and pushes the results to @matches.
my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;
while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
    push(@matches, split(/,/, $&));
}
print join("\n",@matches),"\n";
Though you can probably construct a single regex, a combination of regexes, split, grep and map looks decent:
my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;
Going from right to left:
Split the line on spaces (split)
Leave only elements having no comma at the either end but having one inside (grep)
Split each such element into parts (map and split)
That way you can easily change the parts e.g. to eliminate two consecutive commas add && !/,,/ inside grep.
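The three steps above can be tried on a sample line as follows. This is a sketch only: the input string is stitched together from the question's examples, and split is given the line explicitly instead of relying on $_.

```perl
use strict;
use warnings;

# Sample input assembled from the question's examples (an assumption).
my $line = "fdsf iiukui elephant,MOUSE_RAT,spider,lion-tiger hdsfds planet, however, sdf";

my @array = map  { split /,/ }                      # 3. break each list into items
            grep { !/^,/ && !/,$/ && /,/ }          # 2. keep words with inner commas only
            split ' ', $line;                       # 1. split the line on whitespace

print join("|", @array), "\n";   # elephant|MOUSE_RAT|spider|lion-tiger
```

Words such as "planet," are dropped by the grep because their comma sits at the end.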
I hope this is clear and suits your needs:
#!/usr/bin/perl
use warnings;
use strict;
my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
"fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
"dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
"Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
"Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");
my $regex = qr/
\s #From your examples, it seems as if every
#comma separated list is preceded by a space.
(
(?:
[^,\s]+ #Now, not a comma or a space for the
#terms of the list
, #followed by a comma
)+
[^,\s]+ #followed by one last term of the list
)
/x;
my #matches = map {
$_ =~ /$regex/;
if ($1) {
my $comma_sep_list = $1;
[split ',', $comma_sep_list];
}
else {
[]
}
} @strs;
$var =~ tr/ //s;
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
push (@arr, $&);
}
the regular expression matches three cases :
(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,) : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig

Perl regex - can I say 'if character/string matches, delete it and all to right of it'?

I have an array of strings, some of which contain the character '-'. I want to be able to search for it and for those strings that contain it I wish to delete all characters to the right of it.
So for example if I have:
$string1 = 'home - London';
$string2 = 'office';
$string3 = 'friend-Manchester';
or something as such, then the affected strings would become:
$string1 = 'home';
$string3 = 'friend';
I don't know if the white-space before the '-' would be included in the string afterwards (I don't want it as I will be comparing strings at a later point, although if it doesn't affect string comparisons then it doesn't matter).
I do know that I can search and replace specific strings/characters using something like:
$string1 =~ s/-//
or
$string1 =~ tr/-//
but I'm not very familiar with regular expressions in Perl so I'm not 100% sure of these. I've looked around and couldn't see anything to do with 'to the right of' in regex. Help appreciated!
You can delete anything after a hyphen - with this substitution:
s/-.*$//s
However, you will want to remove the whitespace prior to the hyphen and thus do
s/\s* - .* $//xs
The $ anchors the regex at the end of the string and the /s flag allows the dot to match newlines as well. While the $ is superfluous, it might add clarity.
Your substitution would just have removed the first -, and your transliteration would have removed all hyphens from the string.
Your regular expressions are just searching for the dash, so that's all they replace. You want to search for the dash, and anything after it.
$string =~ s/-.*//;
. represents any character, * means search for that character 0 or more times, and match as many as possible (i.e. to the end of the string if possible)
You can also search for an optional space before it.
$string =~ s/\s?-.*//;
(\s matches any whitespace character, not only a literal space)
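Applied to the three strings from the question, the substitution could be used like this. It is a minimal sketch; it uses \s* rather than \s? so that any amount of whitespace before the hyphen is removed.

```perl
use strict;
use warnings;

my @strings = ('home - London', 'office', 'friend-Manchester');

# In a statement-modifier for loop, $_ aliases each element,
# so the substitution changes the array in place.
s/\s*-.*//s for @strings;

print "$_\n" for @strings;
```

Strings without a hyphen, such as 'office', are left untouched because the pattern does not match.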
Using plain substr() and index() is possible as well.
my @strings = ("we are - so cool",
    "lonely",
    "friend-Manchester",
    "home - london",
    "home-new york",
    "home with-childeren-first episode");
local $/ = " ";
foreach (@strings) {
    $_ = substr($_,0,index($_,'-')) if (index($_,'-') != -1);
    chomp;
}
The other answers are good. However, in light of what you said:
...if it doesn't affect string comparisons then it doesn't matter
You don't need a separate step for this at all. Suppose you want to compare $string with another variable, $search_string. The following expression will check for an exact match, except that it ignores anything $string has after a dash:
if ($string =~ /^$search_string(\s*-|$)/) { print "Strings matched"; }
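If $search_string may contain regex metacharacters, it is safer to quote it with \Q...\E. A small sketch of the same comparison wrapped in a sub; same_before_dash is a hypothetical name used only here.

```perl
use strict;
use warnings;

# same_before_dash is a made-up helper: true when $string equals $search,
# ignoring an optional dash (and whatever follows it) in $string.
sub same_before_dash {
    my ($string, $search) = @_;
    return $string =~ /^\Q$search\E(?:\s*-|$)/ ? 1 : 0;
}

print same_before_dash('home - London', 'home'), "\n";   # 1
print same_before_dash('homework', 'home'), "\n";        # 0
```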
#Using Regex:
my @strings =
    ("we are - so cool",
    "lonely",
    "friend - Manchester",
    "home - london",
    "home - new york",
    "home with-childeren-first episode"
    );
foreach (@strings) {
    $_ =~ s/-\s*[a-zA-Z ]+\s*//g;
    print "NEW: ".$_."\n";
}

how do you match two strings in two different variables using regular expressions?

$a='program';
$b='programming';
if ($b=~ /[$a]/){print "true";}
this is not working
thanks every one i was a little confused
The [] in a regex denotes a character class, which matches any one of the characters listed inside it.
Your regex is equivalent to:
$b=~ /[program]/
which returns true as character p is found in $b.
To see whether the match happens you are printing true; printing the bareword true (without quotes) will not show anything. Print a quoted string instead.
But if you wanted to see if one string is present inside another you have to drop the [..] as:
if ($b=~ /$a/) { print 'true';}
If $a contains any regex metacharacters, the match above may fail or match the wrong thing. To fix that, place the variable between \Q and \E so that any metacharacters in it are escaped:
if ($b=~ /\Q$a\E/) { print 'true';}
Assuming either variable may come from external input, please quote the variables inside the regex:
if ($b=~ /\Q$a\E/){print "true";}
You then won't get burned when the pattern you'll be looking for will contain "reserved characters" like any of -[]{}().
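A short illustration of what can go wrong without the quoting; the variable names and sample strings are invented for this sketch.

```perl
use strict;
use warnings;

my $needle   = 'a.c';   # contains the metacharacter '.'
my $haystack = 'abc';   # does NOT contain the literal text 'a.c'

my $plain  = $haystack =~ /$needle/     ? 1 : 0;   # '.' acts as a wildcard
my $quoted = $haystack =~ /\Q$needle\E/ ? 1 : 0;   # '.' is matched literally

print "plain=$plain quoted=$quoted\n";   # plain=1 quoted=0
```

Other input, such as an unbalanced ( in the variable, would make the unquoted pattern fail to compile at all.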
(apart the missing semicolons:) Why do you put $a in square brackets? This makes it a list of possible characters. Try:
$b =~ /\Q${a}\E/
Update
To answer your remarks regarding = and =~:
=~ is the matching operator, and specifies the variable to which you are applying the regex ($b) in your example above. If you omit =~, then Perl will automatically use an implied $_ =~.
In list context, a match returns a list of the captured groups. You usually assign this to a list, such as in ($match1, $match2) = $b =~ /.../;. If, on the other hand, you assign the result to a scalar, the match is evaluated in scalar context and the scalar receives a true or false value (1 or the empty string) indicating whether the match succeeded.
So if you write $b = /\Q$a\E/, you'll end up with $b = $_ =~ /\Q$a\E/.
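What the match operator returns in each context can be checked with a few lines. This is only a sketch; it uses $string rather than $b to stay clear of the special sort variables.

```perl
use strict;
use warnings;

my $string = 'programming';

my ($first, $second) = $string =~ /(pro)(gram)/;   # list context: the captures
my $matched          = $string =~ /gram/;          # scalar context: 1 or ""
my $count            = () = $string =~ /m/g;       # idiom for counting matches

print "$first $second $matched $count\n";   # pro gram 1 2
```

The `() =` in the last line forces list context on the match, and assigning that list assignment to a scalar yields the number of elements.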
$a='program';
$b='programming';
if ( $b =~ /\Q$a\E/) {
print "match found\n";
}
If you're just looking for whether one string is contained within another and don't need to use any character classes, quantifiers, etc., then there's really no need to fire up the regex engine to do an exact literal match. Consider using index instead:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'program';
my $string = 'programming';
if (index($string, $target) > -1) {
print "target is in string\n";
}

Perl regular expression variables and matched pattern substitution

Can anyone explain regular expression text substitutions when the regular expression is held in a variable? I'm trying to process some text, Clearcase config specs actually, and substitute text as I go. The rules for the substitution are held in an array of hashes that have the regular expression to match and the text to substitute.
The input text looks something like this:
element /my_elem/releases/... VERSION_STRING.020 -nocheckout
Most of the substitutions are simply to remove lines that contain a specific text string, this works fine. In some cases I want to substitute the text, but re-use the VERSION_STRING text. I've tried using $1 in the substitution expression but it doesn't work. $1 gets the version string in the match, but the replacement of $1 doesn't work in the substitution.
In these cases the output should look something like this:
element -directory /my_elem/releases/... VERSION_STRING.020 -nocheckout
element /my_elem/releases/.../*.[ch] VERSION_STRING.020 -nocheckout
ie. One line input became two output and the version string has been re-used.
The code looks something like this. First the regular expressions and substitutions:
my @Special_Regex = (
{ regex => "\\s*element\\s*\/my_elem_removed\\s*\/main\/\\d+\$", subs => "# Line removed" },
{ regex => "\\s*element\\s*\/my_elem_changed\/releases\/\.\.\.\\s*\(\.\*\$\)",
subs => "element \-directory \/my_elem\/releases\/\.\.\. \\1\nelement \/my_elem\/releases\/\.\.\.\/\*\.\[ch\] \\1" }
);
In the second regex the variable $1 is defined in the portion (.*\$) and this is working correctly. The subs expression does not substitute it, however.
foreach my $line (<INFILE>)
{
chomp($line);
my $test = $line;
foreach my $hash (@Special_Regex)
{
my $regex = qr/$hash->{regex}/is;
if($test =~ s/$regex/$hash->{subs}/)
{
print "$test\n";
print "$line\n";
print "$1\n";
}
}
}
What am I missing? Thanks in advance.
The substitution string in your regex is only getting evaluated once, which transforms $hash->{subs} into its string. You need to evaluate it again to interpolate its internal variables. You can add the e modifier to the end of the regex which tells Perl to run the substitution through eval which can perform the second interpolation among other things. You can apply multiple e flags to evaluate more than once (if you have a problem that needs it). As tchrist helpfully points out, in this case, you need ee since the first eval will just expand the variable, the second is needed to expand the variables in the expansion.
You can find more detail in perlop about the s operator.
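The difference between one and two e flags can be seen in a reduced example. The pattern and replacement below are simplified stand-ins, not the rules from the question; note that the stored replacement has to be valid Perl source, here a double-quoted string.

```perl
use strict;
use warnings;

my $line  = 'element /my_elem/releases/... VERSION_STRING.020 -nocheckout';
my $regex = qr/(VERSION_STRING\.\d+)/;

# The replacement is stored as Perl source for a double-quoted string.
my $subs = '"[$1]"';

(my $one_e = $line) =~ s/$regex/$subs/e;    # one eval: inserts the stored text as-is
(my $two_e = $line) =~ s/$regex/$subs/ee;   # second eval runs it, interpolating $1

print "$one_e\n";
print "$two_e\n";
```

Because /ee executes the stored string as code, it should only be used with replacement rules that come from a trusted source.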
There is no compilation for a replace expression. So about the only thing you can do is exec or eval it with the e flag:
if($test =~ s/$regex/eval qq["$hash->{subs}"]/e ) { #...
worked for me after changing \\1 to \$1 in the replacement strings.
s/$regex/$hash->{subs}/
only replaces the matched part with the literal value stored in $hash->{subs} as the complete substitution. In order to get the substitution working, you have to force Perl to evaluate the string as a string, so that means you even have to add the dquotes back in in order to get the interpolating behavior you are looking for (because they are not part of the string.)
But that's kind of clumsy, so I changed the replace expressions into subs:
my @Special_Regex
= (
{ regex => qr{\s*element\s+/my_elem_removed\s*/main/\d+$}
, subs => sub { '#Line removed' }
}
, { regex => qr{\s*element\s+/my_elem_changed/releases/\.\.\.\s*(.*$)}
, subs => sub {
return "element -directory /my_elem/releases/... $1\n"
. "element /my_elem/releases/.../*.[ch] $1"
;
}
}
);
I got rid of a bunch of stuff that you don't have to escape in a substitution expression. Since what you want to do is interpolate the value of $1 into the replacement string, the subroutine does simply that. And because $1 will be visible until something else is matched, it will be the right value when we run this code.
So now the replacement looks like:
s/$regex/$hash->{subs}->()/e
Of course making it pass $1 makes it a little more bulletproof, because you're not depending on the global $1:
s/$regex/$hash->{subs}->( $1 )/e
Of course, you would change the sub like so:
subs => sub {
my $c1 = shift;
return "element -directory /my_elem/releases/... $c1\n"
. "element /my_elem/releases/.../*.[ch] $c1"
;
}
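Put together, the sub-based approach might look like the following reduced script. It is a sketch with a single rule and a hard-coded input line; the names @rules and $rest are chosen here for illustration.

```perl
use strict;
use warnings;

# One rule only, adapted from the second entry above.
my @rules = (
    {   regex => qr{^\s*element\s+/my_elem_changed/releases/\.\.\.\s*(.*)$},
        subs  => sub {
            my $rest = shift;    # the captured version string and options
            return "element -directory /my_elem/releases/... $rest\n"
                 . "element /my_elem/releases/.../*.[ch] $rest";
        },
    },
);

my $line = 'element /my_elem_changed/releases/... VERSION_STRING.020 -nocheckout';

for my $rule (@rules) {
    # /e evaluates the replacement as code, so the sub is called with $1.
    if ($line =~ s/$rule->{regex}/$rule->{subs}->($1)/e) {
        print "$line\n";
    }
}
```

Storing a compiled qr// pattern and a code reference avoids both the double escaping and the string eval.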
Just one last note: "\.\.\." didn't do what you think it did. You just ended up with '...' in the regex, which matches any three characters.