How do I use regex to replace something between parentheses? - regex

For example suppose I have the following string
"we went to the (big) zoo"
I would like to match and replace the text between the parentheses and also catch one of the extra white spaces to end up with
"we went to the zoo"
What is the syntax to do this? I can't seem to quite get it right

You need to do a global search for \s*\([^)]*\)\s*, replacing each occurrence with a single space. Exactly how you would code this depends on what language you are using.
In Perl it would look like
my $s = "we went to the (big) zoo";
$s =~ s/\s*\(.*?\)\s*/ /g;
print $s;
OUTPUT
we went to the zoo
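For comparison, here is a minimal Python sketch of the same substitution. The function name drop_parenthesized is only illustrative, and the pattern assumes the parentheses are not nested.

```python
import re

def drop_parenthesized(text: str) -> str:
    """Replace each "(...)" group and the whitespace around it with one space."""
    # \s*        optional whitespace before the group
    # \([^)]*\)  a literal "(", any run of characters other than ")", a literal ")"
    # \s*        optional whitespace after the group
    return re.sub(r"\s*\([^)]*\)\s*", " ", text)

print(drop_parenthesized("we went to the (big) zoo"))  # we went to the zoo
```

A group at the very start or end of the string leaves a single leading or trailing space behind, so a final strip() may be wanted in that case.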

In Vim you can type the command:
:%s/([[:alpha:]]*)\ //g
This does it everywhere in the file (g stands for global replacement; leave it off to replace only the first match on each line).
If you are using sed, it would be similar. Something along the lines of:
sed 's/([[:alpha:]]*) //g' input.txt
Note that here I've used [[:alpha:]], which matches alphabetic characters only. If you use .* instead, it will match any characters (including digits, white space, non-printable characters, etc.).

Generally, the regex will look something like this:
/\((.*?)\)/
But as the comments suggest, the language and application may affect this.

You can detect the parentheses with /\(.*?\)/ and remove them.
Which language are you using?

Related

Is there a regex engine that supports "for each captured group" in replacement strings?

Here's my example. If I want to use a regex to replace tabs in the code with spaces, but wanted to preserve tab characters in the middle or end of a line of code, I would use this as my search string to capture each tab character at the start of a line: ^(\t)+
Now, how could I write a search string that replaces each captured group with four spaces? I'm thinking there must be some way to do this with backreferences?
I've found I can work around this by running similar regex replacements (like s/^\t/    /g, s/^    \t/        /g, ...) multiple times until no more matches are found, but I wonder if there's a quicker way to do all the necessary replacements at once.
Note: I used sed format in my example, but I'm not sure if this is possible with sed. I'm wondering if sed supports this, and if not, is there a platform that does? (e.g., there's a Python/Java/bash extended regex lib that supports this.)
With perl and other languages that support this feature (Java, PCRE(PHP, R, libboost), Ruby, Python(the new regex module), .NET), you can use the \G anchor that matches the position after the last match or the start of the string:
s/(?:\G|^)\t/    /gm
This works in Perl. Maybe sed too, I don't know sed.
It relies on doing an eval, basically a callback.
It takes the length of $1, then repeats '    ' (four spaces) that many times.
Perl sample.
my $str = "
\t\t\tThree
\t\tTwo
\tOne
None";
$str =~ s/^(\t+)/ '    ' x length($1) /emg;
print "$str\n";
Output
            Three
        Two
    One
None
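Just another idea that came to me: this could also be solved with a positive lookbehind:
s/(?<=^[\t]*)\t/    /gm
It's ugly, but it works in engines that allow variable-length lookbehind (such as .NET); Perl and PCRE reject an unbounded lookbehind like this one.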
sed ':a
s/^\(\t*\)\t/\1    /
ta' YourFile
This loops a single substitution in sed until it no longer matches. It's a workaround.
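The callback approach used in the Perl answer can also be sketched in Python, where re.sub accepts a function as the replacement. The names expand_leading_tabs and TAB_WIDTH are illustrative, and the width of four comes from the question.

```python
import re

TAB_WIDTH = 4  # spaces per tab, as asked for in the question

def expand_leading_tabs(text: str, width: int = TAB_WIDTH) -> str:
    """Replace each tab in a line's leading run of tabs with `width` spaces."""
    # ^\t+ with re.MULTILINE matches the run of tabs at the start of every line;
    # tabs in the middle or at the end of a line are never part of a match.
    return re.sub(
        r"^\t+",
        lambda m: " " * (width * len(m.group())),
        text,
        flags=re.MULTILINE,
    )

sample = "\t\tTwo\n\tOne\tmid\nNone"
print(expand_leading_tabs(sample))
```

This does all the replacements in a single pass, so no looping until nothing matches is needed.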

Regex substituting opening parenthesis

As part of a parsing script I'm trying to convert strings like this:
<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">
into
<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">
The regex for the closing parenthesis works fine
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%29).)*)%29([^\"\']*[\"\'])~\1)\2~g" "$pageName".html
giving me
<a href="http://www.web.com/%20Special%20event%202013%20%282).pdf">
The problem arises with the equivalent regex for the opening parenthesis:
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(\2~g" "$pageName".html
just returns the two groups with nothing in between:
<a href="http://www.web.com/%20Special%20event%202013%202%29.pdf">
Escaping the ( in the substitution with a backslash (or two) has no effect. If I wrap it in some other characters (say ~\1#(#\2~g ) the parenthesis still disappears (giving me %20##2%29 ).
If, however, in a fit of desperation I add seven parentheses to the substitution, it works.
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(((((((\L\2~g" "$pageName".html
outputs
<a href="http://www.web.com/%20Special%20event%202013%20(2%29.pdf">
Can somebody please make sense of this?
Perhaps the following will be helpful or at least provide some direction. It will work on Perl 5.10 and above.
use strict;
use warnings;
use v5.10.0; # For regex \K
use URI::Escape;
my $string = '<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">';
$string =~ s/.+2013%20\K([^.]+)(?=\.pdf)/uri_unescape($1)/e;
print $string;
Output:
<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">
Left enough of the date and the space (%20) as an anchor, then used \K to *K*eep all of that. Then captured the URI encoded text, which is later decoded and used as the substitution text.
The pattern you have doesn't match the string you show at all. It matches something that looks like
<a href=/"../$i-xxxxxxxxxxxxxxx%29xxxxxxxxxx">
with literal dots, and whatever $i contains.
Also, a couple of points about your substitution:
Don't escape characters that don't need escaping. It may take some experience to know without checking which characters you need to escape, but the main point of using ~ as a delimiter is to avoid having to escape slashes in the regex, so at least you could have avoided that.
Don't use \1, \2 etc. in the replacement string. Perl tries very hard to make this work, but normally in Perl those sequences mean to insert the characters \x01 and \x02. Use $1 and $2.
So your regex could be written
s~(href=/?["']\.\./$i-(?:(?!%29).)*)%29([^"']*["'])~$1)$2~;
but it still doesn't "work fine" with the string you gave, which would have to look something like
<a href=/"../$i-xxxxxxxxxxxxxxx%282%29xxxxxxxxxx">
again, containing whatever is in $i. I don't understand at all the optional slash before the href attribute value: it is invalid HTML.
However, using a string that your first regex matches, your second one also works, replacing opening parentheses correctly, so I can't guess at what the problem may be.
There is often no need to verify the entire string. You can just replace the parts you're interested in. So I would write something like
s/(href="[^"]+)%28(\d+)%29(\.pdf")/$1($2)$3/;
which works fine on the string you gave, and replaces both open and close parentheses at once.
I had some problems understanding your regex, but this might work:
perl -pe "s~(href\s*=\s*\"[^\"]*)%28(.*?)%29~\$1(\$2)~g" input
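The idea of replacing only the parts you care about, inside the href value, can be sketched in Python as follows. The function name decode_parens_in_hrefs is illustrative, and the sketch assumes double-quoted href attributes as in the sample string.

```python
import re
from urllib.parse import unquote

def decode_parens_in_hrefs(html: str) -> str:
    """Decode %28 and %29 to ( and ) inside double-quoted href values only."""
    def fix_href(match: re.Match) -> str:
        # Only %28 and %29 are decoded; other escapes such as %20 stay as they are.
        return re.sub(r"%2[89]", lambda m: unquote(m.group()), match.group())

    return re.sub(r'href="[^"]*"', fix_href, html)

link = '<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">'
print(decode_parens_in_hrefs(link))
```

Both parentheses are handled by the same pass, so there is no need for two separate substitutions.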

Change delimiter of grep command

I am using grep to detect <a href="xxxx">something here</a>
This does not work when the link is split over two lines in the input. I want grep to keep reading until it reaches a </a>, but right now it only takes input up to the next newline.
So if the input is like <a href="xxxx">something here</a> it works, but if the input is like
<a href="xxxx">
something here /a>
, then it doesn't.
Any solutions?
I'd use awk rather than grep. This should work:
awk '/a href="xxxx">/,/\/a>/' filename
I think you would have much less trouble using some XSLT tool, but you can do it with sed, awk, or pcregrep, an extended version of grep that supports multiline patterns (-M).
I'd suggest folding the input so that opening and closing tags are on the same line, then checking the line against the pattern. An idiomatic approach using sed(1):
sed '/<[Aa][^A-Za-z]/{ :A
/<\/[Aa]>/ bD
N
bA
:D
/\n/ s// /g
}
# now try your pattern
/<[Aa][^A-Za-z] href="xxx"[^>]*>[^<]*something here[^<]*<\/[Aa]>/ !d'
This is probably a repeat question:
Grep search strings with line breaks
You can try it with the tr '\n' ' ' command, as explained in one of the answers, if all you need is to find the files and not the line numbers.
Consider egrep -3 '(<a|</a>)'
"-3" prints up to 3 surrounding lines around each regex match (3 lines before and 3 lines after the match). You can use -1 or -2 as well if that works better.
perl -e '$_=join("", <>); m#<a.*?>.*?<.*?/a>#s; print "$&\n";'
So the trick here is that the entire input is read into $_. Then a standard /.../ regex is run. I used the alternate syntax m#...# so that I do not have to backslash "/"s which are used in xml. Finally the "s" postfix makes multiline matches work by making "." also match newlines (note also option "m" which changes the meaning of ^ and $). "$&" is the matched string. It is the result you are looking for. If you want just the inner-text, you can put round brackets around that part and print $1.
I am assuming that you meant </a> rather than /a> as an xml closing delimiter.
Note the .*? is a non-greedy version of .* so for <a>1</a><a>2</a>, it only matches <a>1</a>.
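A rough Python equivalent of the one-liner above, assuming the whole input fits in memory and the links are not nested. The name first_anchor is illustrative.

```python
import re

def first_anchor(text: str):
    """Return the first <a ...>...</a> element, even if it spans several lines."""
    # re.DOTALL plays the role of Perl's /s flag: "." also matches newlines.
    # .*? is non-greedy, so the match stops at the first closing </a>.
    match = re.search(r"<a\b[^>]*>.*?</a>", text, flags=re.DOTALL)
    return match.group() if match else None

page = 'intro\n<a href="xxxx">\nsomething here </a>\n<a>2</a>'
print(first_anchor(page))
```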
Note that nested nodes may cause problems, e.g. <a><a></a></a>. This is the same as when trying to match nested brackets "(", ")" or "{", "}". This is a more interesting problem. Regexes are normally stateless, so by themselves they do not support tracking an unlimited bracket-nesting depth. When programming parsers, you normally use regexes for low-level string matching and something else, e.g. bison, for higher-level parsing of tokens. There are bison grammars for many languages and probably for XML. XSLT might even be better, but I am not familiar with it. But for a very simple use case, you can also handle nested blocks like this in Perl:
Nested bracket-handling code: (this could be easily adapted to handle nested xml blocks)
$_ = "a{b{c}e}f";
my($level)=(1);
s/.*?({|})/$1/; # throw away everything before first match
while(/{|}/g) {
    if($& eq "{") {
        ++$level;
    } elsif($& eq "}") {
        --$level;
        if($level == 1) {
            print "Result: ".$`.$&."\n";
            $_=$'; # reset searchspace to after the match
            last;
        }
    }
}
Result: {b{c}e}
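The same depth-counting idea can be sketched in Python without any regex. The function name first_balanced_block and its parameters are illustrative; it returns the first balanced block or None if there is none.

```python
def first_balanced_block(text: str, open_ch: str = "{", close_ch: str = "}"):
    """Return the first balanced open_ch...close_ch block in text, or None."""
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i  # remember where the outermost block begins
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # no block, or the block was never closed

print(first_balanced_block("a{b{c}e}f"))  # {b{c}e}
```

The counter is the state that a plain regex lacks, which is why this handles any nesting depth.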

How can I search and replace text that looks like Perl variables?

I'm really getting my butt kicked here. I can not figure out how to write a search and replace that will properly find this string.
String:
$QData{"OrigFrom"} $Text{"wrote"}:
Note: That is the actual STRING. Those are NOT variables. I didn't write it.
I need to replace that string with nothing. I've tried escaping the $, {, and }. I've tried all kinds of combinations but it just can't get it right.
Someone out there feel like taking a stab at it?
Thanks!
No one likes quotemeta? Let Perl figure it out so you don't strain your eyes with all those backslashes. :)
my $string = 'abc $QData{"OrigFrom"} $Text{"wrote"}: def';
my $escaped = quotemeta '$QData{"OrigFrom"} $Text{"wrote"}:';
$string =~ s/$escaped/Ponies!/;
print $string;
I originally thought that wrapping your regex in \Q/\E (the quotemeta start and end escapes) would be all that you needed to do, but it turns out that a literal $ (or @) cannot be written inside a \Q...\E sequence, because it is interpolated first (see http://search.cpan.org/perldoc/perlre#Escape_sequences).
So what you need to do is escape the $ characters separately, but you can wrap everything else in \Q ... \E:
/\$\QQData{"OrigFrom"} \E\$\QText{"wrote"}:\E/
The regex using the escape character \ would be
s/\$QData\{"OrigFrom"\} \$Text\{"wrote"\}://;
full test code:
#!/sw/bin/perl
$_='$QData{"OrigFrom"} $Text{"wrote"}:';
s/\$QData\{"OrigFrom"\} \$Text\{"wrote"\}://;
print $_."\n";
outputs nothing but newline.
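For readers outside Perl, Python's re.escape plays the role of quotemeta. A minimal sketch, where the names LITERAL and strip_literal are illustrative:

```python
import re

# The text to remove, taken verbatim from the question. It is data, not code.
LITERAL = '$QData{"OrigFrom"} $Text{"wrote"}:'

# re.escape backslash-escapes every regex metacharacter ($, {, }, ...),
# so the compiled pattern matches the literal text exactly.
PATTERN = re.compile(re.escape(LITERAL))

def strip_literal(text: str) -> str:
    """Remove every occurrence of LITERAL from text."""
    return PATTERN.sub("", text)

print(strip_literal('abc $QData{"OrigFrom"} $Text{"wrote"}: def'))
```

When no regex features are needed at all, text.replace(LITERAL, "") gives the same result without any escaping.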

Why doesn't this simple regex match what I think it should?

I have a data file that looks like the following example. I've added '%' in lieu of \t, the tab control character.
1234:56% Alice Worthington
alicew% Jan 1, 2010 10:20:30 AM% Closed% Development
Digg:
Reddit:
Update%% file-one.txt% 1.1% c:/foo/bar/quux
Add%% file-two.txt% 2.5.2% c:/foo/bar/quux
Remove%% file-three.txt% 3.4% c:/bar/quux
Update%% file-four.txt% 4.6.5.3% c:/zzz
... many more records of the above form
The records I'm interested in are the lines beginning with "Update", "Add", "Remove", and so on. I won't know what the lines begin with ahead of time, or how many lines precede them. I do know that they always begin with a string of letters followed by two tabs. So I wrote this regex:
generate-report-for 1234:56 | egrep "^[[:alpha:]]+\t\t.+"
But this matches zero lines. Where did I go wrong?
Edit: I get the same results whether I use '...' or "..." for the egrep expression, so I'm not sure it's a shell thing.
Apparently \t isn't a special character for egrep. You can either use grep -P to enable the Perl-compatible regex engine, or insert literal tabs with Ctrl-V Ctrl-I.
Even better, you could use the excellent ack
It looks like the shell is parsing "\t\t" before it is sent to egrep. Try "\\t\\t" or '\t\t' instead. That is 2 slashes in double quotes and one in single quotes.
The file might not be exactly what you see. Maybe there are control characters hidden. It happens, sometimes. My suggestion is that you debug this. First, reduce to the minimum regex pattern that matches, and then keep adding stuff one by one, until you find the problem:
egrep "[[:alpha:]]"
egrep "[[:alpha:]]+"
egrep "[[:alpha:]]+\t"
egrep "[[:alpha:]]+\t\t"
egrep "[[:alpha:]]+\t\t.+"
egrep "^[[:alpha:]]+\t\t.+"
There are variations on that sequence, depending on what you find out at each step. Also, the first step can really be skipped, but this is just for the sake of showing the technique.
You can use awk, which does understand \t in a regex:
awk '/^[[:alpha:]]+\t\t/' file
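If a scripting language is an option, the same filter can be sketched in Python, whose re module does treat \t as a tab. The names RECORD and change_lines are illustrative, and [A-Za-z] is an ASCII-only stand-in for [[:alpha:]].

```python
import re

# One or more letters at the start of the line, two tabs, then at least one character.
RECORD = re.compile(r"^[A-Za-z]+\t\t.+")

def change_lines(lines):
    """Keep only the lines that look like 'Update<TAB><TAB>file ...' records."""
    return [line for line in lines if RECORD.match(line)]

report = [
    "1234:56\t Alice Worthington",
    "Digg:",
    "Update\t\t file-one.txt\t 1.1\t c:/foo/bar/quux",
    "Add\t\t file-two.txt\t 2.5.2\t c:/foo/bar/quux",
]
for line in change_lines(report):
    print(line)
```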