How can I search and replace text that looks like Perl variables? - regex

I'm really getting my butt kicked here. I can not figure out how to write a search and replace that will properly find this string.
String:
$QData{"OrigFrom"} $Text{"wrote"}:
Note: That is the actual STRING. Those are NOT variables. I didn't write it.
I need to replace that string with nothing. I've tried escaping the $, {, and }. I've tried all kinds of combinations but I just can't get it right.
Someone out there feel like taking a stab at it?
Thanks!

No one likes quotemeta? Let Perl figure it out so you don't strain your eyes with all those backslashes. :)
my $string = 'abc $QData{"OrigFrom"} $Text{"wrote"}: def';
my $escaped = quotemeta '$QData{"OrigFrom"} $Text{"wrote"}:';
$string =~ s/$escaped/Ponies!/;
print $string;

I originally thought that wrapping your regex in \Q...\E (the quotemeta start and end escapes) would be all you needed, but it turns out that a literal $ (or @) cannot appear inside a \Q...\E sequence, because Perl treats it as the start of a variable to interpolate (see http://search.cpan.org/perldoc/perlre#Escape_sequences).
So you need to escape the $ characters separately, and you can wrap everything else in \Q...\E:
/\$\QQData{"OrigFrom"} \E\$\QText{"wrote"}:\E/
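Another way around the interpolation problem, as a minimal sketch: keep the literal text in a single-quoted string (so Perl does not try to interpolate $QData or $Text) and interpolate that variable inside \Q...\E. The helper name strip_literal is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every literal occurrence of $literal from $text.
sub strip_literal {
    my ($text, $literal) = @_;
    # \Q...\E quotes the contents of $literal after interpolation,
    # so its $, { and } are matched as plain characters.
    $text =~ s/\Q$literal\E//g;
    return $text;
}

# Single quotes keep the $ signs literal.
my $needle = '$QData{"OrigFrom"} $Text{"wrote"}:';
my $result = strip_literal('abc ' . $needle . ' def', $needle);
print "$result\n";
```

This is equivalent to calling quotemeta on the variable first; which one you use is mostly a matter of taste.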

regex using escape character \ would be
s/\$QData\{"OrigFrom"\} \$Text\{"wrote"\}://;
full test code:
#!/sw/bin/perl
$_='$QData{"OrigFrom"} $Text{"wrote"}:';
s/\$QData\{"OrigFrom"\} \$Text\{"wrote"\}://;
print $_."\n";
outputs nothing but newline.

Related

perl regex remove newlines in string

I have a Perl script which runs over a database dump in a plain text file, trying to remove all instances of newlines and possibly other odd characters when I see strings between quotes:
INSERT INTO ... VALUES ( "... these are the lines I'm interested in." )
I slurp in the file:
@file = <FILE>;
and:
foreach my $line (@file) {
$line =~ s/"[^"]*(\R)+[^"]*"//g;
# I want to get rid of newlines in strings
# And other odd characters I might come across
}
One character class I used instead of (\R) was:
([\r\n\t\v\f]+)
and I would try to:
$line =~ s/"[^"]+?([\r\n\t\v\f]+)[^"]*"//g;
I'm sure I'm missing something. I try to start matching with a literal double quote, scan past anything not a double quote (non-greedy, at least one match), reach the characters I want to get rid of, and keep scanning not double quote (any number of other characters not a double quote) until I reach the ending double quote.
So I wanted to replace $1 capture above with nothing.
I've tried on-line regex builders, and
/"[^"]*?([\r\n\t\f\v]+)[^"]*"/
worked with an on-line test, using a short paragraph with newlines and tabs in it, although it was in PHP pcre mode. I thought it would have worked with Perl.
Perhaps I'm not escaping some characters properly in the regex for Perl? Or the pattern is just not going to work the way I want it to, because it's wrong.
Thank you, any help appreciated.
The regex at regex101.com:
"[^"]*?([\r\n\f\t\v]+)[^"]*?"
matches for strings like this:
"This is
my\t test
string.
So there!"
I'm thoroughly puzzled now. :)
The real problem is that your pattern only finds one group of \R's, when there could be many groups between the quotes. The best approach is to match everything between the quotes in general, then use a callback (the /e modifier) to strip the \R's in the replacement.
Something like:
sub repl {
my ($content) = @_;
$content =~ s/\R+//g;
return $content;
}
$input =~ s/"([^"]*)"/ repl($1) /ge;
edit: If you're looking for only 1 linebreak cluster, you have to
exclude linebreaks leading up to it. For example: [^"\r\n]+
edit2: To slurp the file into $input, do a
$/ = undef;
my $input = <$fh>;
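Putting the pieces together, here is a rough sketch of the slurp-then-callback approach. It differs from the snippet above in one respect, which is my assumption about what is wanted: it puts the surrounding double quotes back, so the quoted strings stay quoted. The helper name strip_breaks_in_quotes and the sample data are invented, and an in-memory filehandle stands in for the real dump file.

```perl
use strict;
use warnings;

# Hypothetical helper: delete line-break clusters inside "..." strings only,
# keeping the surrounding quotes.
sub strip_breaks_in_quotes {
    my ($input) = @_;
    $input =~ s{"([^"]*)"}{
        my $content = $1;        # copy the capture before running another regex
        $content =~ s/\R+//g;    # needs Perl 5.10 or later
        '"' . $content . '"';
    }ge;
    return $input;
}

# Slurp from an in-memory filehandle standing in for the real dump file.
my $dump = qq{INSERT INTO t VALUES ( "This is\nmy test\r\nstring." );\n};
open my $fh, '<', \$dump or die "Cannot open in-memory file: $!";
my $input = do { local $/; <$fh> };   # local $/ restores line mode afterwards
close $fh;

print strip_breaks_in_quotes($input);
```

Using local $/ inside a do block limits the slurp mode to that one read, which is a bit safer than setting $/ = undef for the rest of the script.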

QRegex look ahead/look behind

I have been pondering this for quite a while and still can't figure it out: regex lookaheads and lookbehinds. I'm not sure which to use in my situation, and I'm still having trouble grasping the concept. Let me give you an example.
I have a name....My Business,LLC (Milwaukee,WI)_12345678_12345678
What I want to do is if there is a comma in the name, no matter how many, remove it. At the same time, if there is not a comma in the name, still read the line. The one-liner I have is listed below.
s/(.*?)(_)(\d+_)(\d+$)/$1$2$3$4/gi;
I want to remove any comma from $1 (My Business,LLC (Milwaukee,WI)). I could call out the comma in the regex as a literal string ((.*?),(.*?),(.*?)(_.*?$)) if it were this EXACT situation every time, but it is not.
I want it to omit commas and match 'My Business, LLC_12345678_12345678' or just 'My Business_12345678_12345678', even though there is no comma.
In any situation I want it to match the line, comma or not, and remove any commas(if any) no matter how many or where.
If someone can help me understand this concept, it will be a breakthrough!!
Use the /e modifier of Perl so that you can pass your function during the replace in s///
$str = 'My Business,LLC (Milwaukee,WI)_12345678_12345678';
## modified your regex as well using lookahead
$str =~ s/(.*?)(?=_\d+_\d+$)/funct($1)/ge;
print $str;
sub funct{
my $val = shift;
## replacing , with empty, use anything what you want!
$val =~ s/,//g;
return $val;
}
By using funct($1) in the substitution you are calling the funct() function with $1 as its parameter, and its return value becomes the replacement text.
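If the lookahead still feels opaque, a lookahead-free alternative is to capture the name and the numeric suffix separately and delete the commas from the name with tr. This is only a sketch: the helper name strip_name_commas is made up, and it assumes each line has already been chomped.

```perl
use strict;
use warnings;

# Hypothetical helper: drop commas from the name part of
# "<name>_<digits>_<digits>"; other lines are returned unchanged.
sub strip_name_commas {
    my ($line) = @_;
    if ($line =~ /\A(.*?)(_\d+_\d+)\z/) {
        my ($name, $ids) = ($1, $2);
        $name =~ tr/,//d;          # tr with /d deletes every comma
        return $name . $ids;
    }
    return $line;
}

print strip_name_commas('My Business,LLC (Milwaukee,WI)_12345678_12345678'), "\n";
print strip_name_commas('My Business_12345678_12345678'), "\n";
```

The lines match whether or not the name contains commas, because the pattern only requires the _digits_digits suffix.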

Regex substituting opening parenthesis

As part of a parsing script I'm trying to convert strings like this:
<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">
into
<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">
The regex for the closing parenthesis works fine
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%29).)*)%29([^\"\']*[\"\'])~\1)\2~g" "$pageName".html
giving me
<a href="http://www.web.com/%20Special%20event%202013%20%282).pdf">
The problem arises with the equivalent regex for the opening parenthesis:
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(\2~g" "$pageName".html
just returns the two groups with nothing in between:
<a href="http://www.web.com/%20Special%20event%202013%202%29.pdf">
Escaping the ( in the substitution with a backslash (or two) has no effect. If I wrap it in some other characters (say ~\1#(#\2~g ) the parenthesis still disappears (giving me %20##2%29 ).
If, however, in a fit of desperation I add seven parentheses into the substitution, it works.
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(((((((\L\2~g" "$pageName".html
outputs
<a href="http://www.web.com/%20Special%20event%202013%20(2%29.pdf">
Can somebody please make sense of this?
Perhaps the following will be helpful or at least provide some direction. It will work on Perl 5.10 and above.
use strict;
use warnings;
use v5.10.0; # For regex \K
use URI::Escape;
my $string = '<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">';
$string =~ s/.+2013%20\K([^.]+)(?=\.pdf)/uri_unescape($1)/e;
print $string;
Output:
<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">
I left enough of the date and the space (%20) as an anchor, then used \K to *K*eep all of that out of the replaced text. The pattern then captures the URI-encoded text, which is decoded and used as the substitution text.
The pattern you have doesn't match the string you show at all. It matches something that looks like
<a href=/"../$i-xxxxxxxxxxxxxxx%29xxxxxxxxxx">
with literal dots, and whatever $i contains.
Also, a couple of points about your substitution:
Don't escape characters that don't need escaping. It may take some experience to know without checking which characters you need to escape, but the main point of using ~ as a delimiter is to avoid having to escape slashes in the regex, so at least you could have avoided that.
Don't use \1, \2 etc. in the replacement string. Perl tries very hard to make this work, but normally in Perl those sequences mean to insert the characters \x01 and \x02. Use $1 and $2.
So your regex could be written
s~(href=/?["']\.\./$i-(?:(?!%29).)*)%29([^"']*["'])~$1)$2~;
but it still doesn't "work fine" with the string you gave, which would have to look something like
<a href=/"../$i-xxxxxxxxxxxxxxx%282%29xxxxxxxxxx">
again, containing whatever is in $i. I don't understand at all the optional slash before the href attribute value: it is invalid HTML.
However, using a string that your first regex matches, your second one also works, replacing opening parentheses correctly, so I can't guess at what the problem may be.
There is often no need to verify the entire string. You can just replace the parts you're interested in. So I would write something like
s/(href="[^"]+)%28(\d+)%29(\.pdf")/$1($2)$3/;
which works fine on the string you gave, and replaces both open and close parentheses at once.
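If a link can contain more than one pair of encoded parentheses, the same "only touch the part you care about" idea can be sketched with a callback: match each href attribute, then replace %28 and %29 inside it. The helper name decode_parens_in_href is hypothetical, and the sketch assumes the attribute value is always double-quoted.

```perl
use strict;
use warnings;

# Hypothetical helper: turn %28 and %29 into ( and ) inside href="..." only.
sub decode_parens_in_href {
    my ($html) = @_;
    $html =~ s{(href="[^"]*")}{
        my $attr = $1;           # copy the capture before running more regexes
        $attr =~ s/%28/(/g;
        $attr =~ s/%29/)/g;
        $attr;
    }ge;
    return $html;
}

my $tag = '<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">';
print decode_parens_in_href($tag), "\n";
```

Because the code lives in a script rather than in a double-quoted shell one-liner, there is no extra layer of shell quoting to get wrong.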
I had some problems understanding your regex, but this might work:
perl -pe "s~(href\s*=\s*\"[^\"]*)%28(.*?)%29~\$1(\$2)~g" input

How do I use regex to replace something between parentheses?

For example suppose I have the following string
"we went to the (big) zoo"
I would like to match and replace the text between the parentheses and also catch one of the extra white spaces to end up with
"we went to the zoo"
What is the syntax to do this? I can't seem to quite get it right
You need to do a global search for \s*\([^)]*\)\s*, replacing each occurrence with a single space. Exactly how you would code this depends on the language you are using.
In Perl it would look like
my $s = "we went to the (big) zoo";
$s =~ s/\s*\(.*?\)\s*/ /g;
print $s;
OUTPUT
we went to the zoo
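One difference between the two patterns is worth a quick sketch: [^)]* also matches across line breaks, while . does not match a newline unless you add the /s modifier. The helper name drop_parenthesised is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each parenthesised group, plus the whitespace
# around it, with a single space.
sub drop_parenthesised {
    my ($s) = @_;
    $s =~ s/\s*\([^)]*\)\s*/ /g;
    return $s;
}

print drop_parenthesised("we went to the (big) zoo"), "\n";
print drop_parenthesised("we went to the (very\nbig) zoo"), "\n";
```

Both calls print the same sentence; with \(.*?\) and no /s, the second string would be left untouched.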
In Vim you can type the command:
%s/([[:alpha:]]*)\ //g
This does it globally (g stands for global replacement; without it, only the first match on each line is replaced).
If you are using sed, it would be similar. Something along the lines of:
sed 's/([[:alpha:]]*) //g' input.txt
Note that I used [[:alpha:]] here to match alphabetic characters only. If you use .* instead, it will match any characters (including digits, whitespace, non-printable characters, etc.).
Generally, the regex will look something like this:
/\((.*?)\)/
But as the comments suggest, the language and application may affect this.
You can detect the parentheses (/\(.*?\)/) and remove them.
Which language are you writing in?

How do I handle every ASCII character (including regex special characters) in a Perl regex?

I have the following code in Perl:
if (index ($retval, $_[2]) != -1) {
@fs = split ($_[2], $_[1]);
$_[2] is the delimiter variable and $_[1] is the string that the delimiter may exist in. ($_[0] is used elsewhere) You may have guessed that this code is in a subroutine by those variable names.
Anyway, onto my question, when my delimiter is something innocuous like 'a' or ':' the code works like it should. However, when it is something that would get parsed by Perl regex, like a '\' character, then it does not work like it is supposed to. This makes sense because in the split function Perl would see something like:
split (/\/, $_[1]);
which makes no sense to it at all because it would want this:
split (/\\/, $_[1]);
So with all of that in mind my question, that I cannot answer, is this: "How do I make it so that any delimiter that I put into $_[2], or all the ASCII characters, gets treated as the character it is supposed to be and not interpreted as something else?"
Thanks in advance,
Robert
You can use quotemeta to escape $_[2] properly so it will work in the regex without getting mangled. This should do it:
my $quoted = quotemeta $_[2];
@fs = split( $quoted, $_[1] );
Alternatively, you can use \Q in your regex to escape it. See "Escape Sequences" in perlre.
split /\Q$_[2]/, $_[1]
As a side note, I suspect that the $_[1] and $_[2] variables refer to the automatically passed-in @_ array of a sub.
It is helpful, and common practice, to name the parameters at the beginning of the sub, something like the following. It would have saved you quite some explaining here and made your code more understandable by itself:
sub mysub {
my ($param1, $string, $delim) = @_;
# ...
}
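Combining both suggestions, a minimal sketch might look like this. The sub name split_on_literal and its parameter order are invented for the example, not taken from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: split $string on $delim taken literally,
# whatever regex metacharacters $delim contains.
sub split_on_literal {
    my ($string, $delim) = @_;
    return split /\Q$delim\E/, $string;
}

my @fs = split_on_literal('a\b\c', '\\');   # the delimiter is one backslash
print join('|', @fs), "\n";
```

The same call works unchanged for delimiters such as '.', '+' or '|', which would otherwise be read as regex syntax.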