Why does my regular expression fail with certain substitutions? - regex

I am new to perl and not sure how to achieve the following.
I am reading a file and putting the lines in a variable called $tline. Next, I am trying to replace some character from the $tline.
This substitution fails if $tline has some special characters like (, ?,= etc in it. How to escape the special characters from this variable $tline?
if ($tline ne "") {
$tline =~ s/\//\%;
}
EDIT
Sorry for the confusions. Here is what I am trying to do.
$tline =~ s/"\//"\<\%\=request\.getContextPath\(\)\%\>\//;
This is working for most of the cases. But when the input file has ? in it, it is failing.

How about:
$tline =~ s/\Q$var\E/;
That will cause quotemeta to be applied to contents of $var which is being used as the pattern.

This isn't a valid regex:
$tline =~ s/\//\%;
It gets read like this to perl
$tline =~ s/a/%;
Where a = /
What you wanted to do is replace a forward-slash with a percent sign you probably want
$tline =~ s/\//%/;
Which is better written like this:
$tline =~ s,/,%,;
You probably also want to replace more than just the first forward-slash, so you want the /g flag:
$tline =~ s,/,%,g;
And, this exactly what tr (transliteration) does:
$tline =~ tr,/,%,;
UPDATE I think what you want is a simple quotemeta() which takes your input, and regex-escapes the meta characters
$ perl -e'print quotemeta("</foo?>")'
\<\/foo\?\>

You could place all your special characters between square brackets (called a "character class"). The following will replace all left parentheses, question marks and equal signs in your string with percent signs:
my $tline = 'fo(?=o';
$tline =~ s/[(?=]/%/g;
print "$tline\n";
Prints:
fo%%%o

quotemeta is a good function for getting a exact literal with special characters into a regex. And \Q and \E are good operators for doing the same thing inside the regex.
However, you're search expression is not that complex. In your edit, you're simply looking for a double quote and a slash. In fact, I've quite simplified your expression so it contains not a single backslash. So it's not a problem for quotemeta nor for that matter \Q and \E.
Once pared down, I don't see anything in your revised substitution that would cause a problem with '?' in $tline.
Key to the simplification is that '.', '(', and ')' mean nothing special to the replacement section of your expression, so this is equivalent:
$tline =~ s/"\//"<%=request.getContextPath()%>\//;
Not to mention easier to read. Of course this is even easier:
$tline =~ s|"/|"<%=request.getContextPath()%>/|;
Because in Perl, you can choose the delimiter you wish with the s operator.
But with any of these, this works:
use Test::More tests => 1;
my $tline = '"/?"';
$tline =~ s|"/|"<%=request.getContextPath()%>/|;
ok( $tline =~ /getContextPath/ );
It passes the test. Perhaps you're having a problem with more than one substitution on a line. That can be fixed with:
$tline =~ s|"/|"<%=request.getContextPath()%>/|g;
That g is the global switch on the end, saying make this substitution for as many times as it occurs in the input.
However, since I can see what you are doing, I suggest an even tighter specification of what you want to search:
$tline =~ s~\b(href|link|src)="/~$1="<%=2request.getContextPath()%>/~g;
And when I run this:
use Test::More tests => 2;
my $tline = '"/?"';
$tline =~ s/"\//"<%=request.getContextPath()%>\//;
ok( $tline =~ /getContextPath/ );
$tline = 'src="/?/?/beer"';
ok( $tline =~ s~\b(href|link|src)="/~$1="<%=request.getContextPath()%>/~g
);
I get two successes.
Your true problem is yet unspecified.

Well, one way to do it is to put all the characters you want to replace in square brackets. Like so:
$string =~ s/[,?=\/]//; # This will remove the first ',', '?', '=', or '/' from your string.
If you want to remove all the '?' in a string, for example, use a g on the end of it like so:
$string =~ s/[?]//g;
I'm a little rusty, but I believe that you only need a '\' in front of \ or /, (and of course the other special characters like \n,\t, etc...). Like so:
$string =~ s/[\\]/[\/]/g; # Switch from DOS to Unix delimiters.
$string =~ s/[\n\t]//g; # Remove all newlines and tabs
As others have said, the code you've posted isn't going to work since you forgot the last /. That's another nice reason to keep the "weird" characters in a box.

Related

Search and replace a special character in perl

I want to search a character and replace it with a string. First, I search for ':' and replace it with 'to'. Next I want to search '$' and replace it with 'END'. This is the code that I've tried. In below code, it work for the first character but not the second character. I tried to use backslash to escape the special character '$' but it still did not work. What else can I do?
$string = "[9:8],
if ($string =~ /^.*:+/){
$stringreplaced =~ s/:/to/g;
}
elsif ($string =~ /^.*\$+/){
$stringreplaced =~ s/\$/END/g;
}
First of all, the code you posted doesn't even compile, yet you say it actually ran. Only post code that you've run.
Second, you're matching against the wrong string. You're checking if $string contains the character, but you replace the characters in $stringreplaced. ALWAYS use use strict; use warnings;. This would have caught this error.
Third, you only check if the character (: or $) is on the first line. This is because . doesn't match line feeds without /s.
Finally, You only check if the string contains $ if it doesn't contain : because you used elsif.
The following is all you need:
$string =~ s/:/to/g;
$string =~ s/\$/END/g;

Remove certain characters from a regex group

I have a string that looks like this (key":["value","value","value"])
"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]
and I use the following regex to select from the string. (the regex is setup in a way where it wont select a string that looks like this "key":[{"key":"value","key":"value"}] )
(?<=:\[").*?(?="])
Resulting Selection:
google.co.uk","google.com","google.com","google.com","google.co.uk
I want to remove the " in that select string, and i was wondering if there was an easy way to do this using the replace command. Desired result...
"emailDomains":["google.co.uk, google.com, google.com, google.com, google.co.uk"]
How do I solve this problem?
If your string indeed has the form "key":["v1", "v2", ... "vN"], you can split off the part that needs to be changed, replace "," by a space in it, and re-assemble:
my #parts = split / (\["\s* | \s*\"]) /x, $string; #"
$parts[2] =~ s/",\s*"/ /g;
my $processed = join '', #parts;
The regex pattern for the separator in split is captured since in that case the separators are also in the returned list, what is helpful here for putting the string back together. Then, we need to change the third element of the array.
In this approach, we have to change a specific element in the array so if your format varies, even a little, this may not (or still may) be suitable.
This should of course be processed as JSON, using a module. If the format isn't sure, as indicated in a comment, it would be best to try to ensure that you have JSON. Picking bits and pieces like above (or below) is a road to madness once requirements slowly start evolving.
The same approach can be used in a regex, and this may in fact have an advantage to be able to scoop up and ignore everything preceding the : (with split that part may end up with multiple elements if the format isn't exactly as shown, what then affects everything)
$string =~ s{ :\["\s*\K (.*?) ( "\] ) }{
my $e = $2;
my $n = $1 =~ s/",\s*"/ /gr;
$n.$e
}ex;
Here /e modifier makes it so that the replacement side is evaluated as code, where we do the same as with the split above. Notes on regex
Have to save away $2 first, since it gets reset in the next regex
The /r modifier†, which doesn't change its target but rather returns the changed string, is what allows us to use substitution operator on the read-only $1
If nothing gets captured for $2, and perhaps for $1, that means that there was no match and the outcome is simply that $string doesn't change, quietly. So if this substitution should always work then you may want to add handling of such unexpected data
Don't need a $n above, but can return ($1 =~ s/",\s*"/ /gr) . $e
Or, using lookarounds as attempted
$string =~ s{ (?<=:\[") (.+?) (?="\]) }{ $1 =~ s/",\s*"/ /gr }egx;
what does reduce the amount of code, but may be trickier to work with later.
While this is a direct answer to the question I think it's least maintainable.
†  This useful modifier, for "non-destructive substitution," appeared in v5.14. In earlier Perl versions we would copy the string and run regex on that, with an idiom
(my $n = $1) =~ s/",\s*"/ /g;
In the lookarounds-example we then need a little more
$string =~ s{...}{ (my $n = $1) =~ s/",\s*"/ /g; $n }gr
since s/ operator returns the number of substitutions made while we need $n to be returned from that whole piece of code in {} (the replacement side), to be used as the replacement.
You can use this \G based regex to start the match with :[" and further captures the values appropriately and replaces matched text so that only comma is retained and doublequotes are removed.
(:\[")|(?!^)\G([^"]+)"(,)"
Regex Demo
Your text is almost proper JSON, so it's really easy to go the final inch and make it so, and then process that:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say postderef/;
no warnings qw/experimental::postderef/;
use JSON::XS; # Install through your OS package manager or a CPAN client
my $str = q/"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]/;
my $json = JSON::XS->new();
my $obj = $json->decode("{$str}");
my $fixed = $json->ascii->encode({emailDomains =>
join(', ', $obj->{'emailDomains'}->#*)});
$fixed =~ s/^\{|\}$//g;
say $fixed;
Try Regex: " *, *"
Replace with: ,
Demo

How can I extract a substring up to the first digit?

How can I find the first substring until I find the first digit?
Example:
my $string = 'AAAA_BBBB_12_13_14' ;
Result expected: 'AAAA_BBBB_'
Judging from the tags you want to use a regular expression. So let's build this up.
We want to match from the beginning of the string so we anchor with a ^ metacharacter at the beginning
We want to match anything but digits so we look at the character classes and find out this is \D
We want 1 or more of these so we use the + quantifier which means 1 or more of the previous part of the pattern.
This gives us the following regular expression:
^\D+
Which we can use in code like so:
my $string = 'AAAA_BBBB_12_13_14';
$string =~ /^\D+/;
my $result = $&;
Most people got half of the answer right, but they missed several key points.
You can only trust the match variables after a successful match. Don't use them unless you know you had a successful match.
The $&, $``, and$'` have well known performance penalties across all regexes in your program.
You need to anchor the match to the beginning of the string. Since Perl now has user-settable default match flags, you want to stay away from the ^ beginning of line anchor. The \A beginning of string anchor won't change what it does even with default flags.
This would work:
my $substring = $string =~ m/\A(\D+)/ ? $1 : undef;
If you really wanted to use something like $&, use Perl 5.10's per-match version instead. The /p switch provides non-global-perfomance-sucking versions:
my $substring = $string =~ m/\A\D+/p ? ${^MATCH} : undef;
If you're worried about what might be in \D, you can specify the character class yourself instead of using the shortcut:
my $substring = $string =~ m/\A[^0-9]+/p ? ${^MATCH} : undef;
I don't particularly like the conditional operator here, so I would probably use the match in list context:
my( $substring ) = $string =~ m/\A([^0-9]+)/;
If there must be a number in the string (so, you don't match an entire string that has no digits, you can throw in a lookahead, which won't be part of the capture:
my( $substring ) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
$str =~ /(\d)/; print $`;
This code print string, which stand before matching
perl -le '$string=q(AAAA_BBBB_12_13_14);$string=~m{(\D+)} and print $1'
AAAA_BBBB_

What does s-/-- and s-/\Z-- in perl mean?

I am a beginner in perl and I have a query regarding pattern matching.
I came across a line in perl where it was written
$variable =~ s-/\Z--;
And as the code goes ahead some another variable was assigned
$variable1 =~ s-/--;
Can you please tell me what does these 2 lines do?
I want to know what does s-/\Z-- and s-/-- mean.
$variable =~ s-/\Z--;
- is used as a delimiter here. However, best practice suggests that you either use / or {} as delimiters.
It could be re-written as:
$variable =~ s{/\Z}{}; # remove a / at the end of a string
Consider:
$variable1 =~ s-/--;
Again, it could be re-written as:
$variable1 =~ s{/}{}; # remove the first /
The s/// operator in Perl is a substitution operation, which performs a search-and-replace on a string using a special kind of pattern called a regular expression. You can read more about regular expressions and Perl's pattern matching in the man pages that come with Perl:
man perlretut
man perlre
If you don't have these on your system, try searching Google for the same.
Applying a substitution to a variable is done with the =~ operator. So the following replaces all instances of 'foo' in the variable $var with 'bar'.
$var =~ s/foo/bar/;
All the Perl operators are documented on the 'perlop' man page.
Even though the most common separator character is a slash (hence s///), you can also use any other punctuation character as a separator. So in this case, the author has decided to use the dash (-) as the separator.
Here's the same line of code above using dash as a separator:
$var =~ s-foo-bar-;
In your case, the dash doesn't seem to add any clarity to the code, so it might be best to update it to use the conventional slashes instead.
The s/// search and replace function in perl can be used with different delimeters, which is what is done in this case. They have replaced / with the minus sign -, or dash.
The s-/-- removes the first / from the string.
The s-/\Z-- matches and removes a slash at the end of the line. I think this is better written: s{/$}{}.
$variable1 =~ s-/--;could be written as
$variable =~ s{/}{}xms;
or this
$variable =~ s/ \/ //xms;
It means delete the first / in the string.
Regarding s-/\Z--, it is usually written like this
$variable =~ s{/ \Z}{}xms;
or this
$variable =~ s/ \/ \Z //xms;
It means delete a / if it is at the end of the string (\Z).

In Perl, how can I correctly extract URLs that are enclosed in parentheses?

I've got two question about Regexp::Common qw/URI/ and Regex in Perl.
I use Regexp::Common qw/URI/ to parse URI in the strings and delete them. But I've got an error when a URI is between parentheses.
For example: (http://www.example.com)
The error is caused by ')', and when it try to parse the URI, the app crash. So I've thought two fixes:
Do a simple (or I thought so) that writes a whitespace between parentheses and ) characters
The Regexp::Common qw/URI/ has a function that implement a fix.
In my code I've tried to implement the Regex but the app freezes. The code that I've tried is this:
use strict;
use Regexp::Common qw/URI/;
my $str = "Hello!!, I love (http://www.example.com)";
while ($str =~ m/\)/){
$str =~ s/\)/ \)/;
}
my ($uri) = $str =~ /$RE{URI}{-keep}/;
print "$uri\n";
print $str;
The output that I want is: (http://www.example.com )
I'm not sure, but I think that the problem is in $str =~ s/\)/ \)/;
BTW, I've got a question about Regexp::Common qw/URI/. I've got two string type:
ablalbalblalblalbal http://www.example.com
asfasdfasdf http://www.example.com aasdfasdfasdf
I want to remove the URI if it is the last component (and save it). And, if not, save it without removing it from the text.
You don't have to first test for a match to be able to use the s/// operator correctly: If the string does not match the search pattern, it will not do anything.
#!/usr/bin/perl
use strict; use warnings;
my $str = "Hello!!, I love (GOOGLE)";
$str =~ s/\)/ )/g;
print "$str\n";
The general problem of detecting URLs correctly in text is error-prone. See for example Jeff's thoughts on this.
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/)/){
$str =~ s/)/ )/;
}
Your program goes into an infinite loop at this point. To see why, try printing the value of $str each time round the loop.
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/)/){
$str =~ s/)/ )/;
print $str, "\n";
}
The first time it prints "Hello!!, I love (GOOGLE )". The while loop condition is then evaluated again. Your string still matches your regular expression (it still contains a closing parenthesis) so the replacement is run again and this time it prints out "Hello!!, I love (GOOGLE )" with two spaces.
And so it goes on. Each time round the loop another space is added, but each time you still have a closing parenthesis, so another substitution is run.
The simplest solution I can see is to only match the closing parenthesis if it is preceded by a non-whitespace character (using \S).
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\S)/){
$str =~ s/)/ )/;
print $str, "\n";
}
In this case the loop is only executed once.
Why not just include the parentheses in the search? If the URLs will always be bracketed, then something like this:
#!/usr/bin/perl
use warnings;
use strict;
use Regexp::Common qw/URI/;
my $str = "Hello!!, I love (http://www.google.com)";
my ($uri) = $str =~ / \( ( $RE{URI} ) \) /x;
print "$uri\n";
The regex from Regex::Common can be used as part of a longer regex, it doesn't have to be used on its own. Also I've used the 'x' modifier on the regex to allow whitespace so you can see more clearly what is going on - the brackets with the backslashes are treated as characters to match, those without define what is to matched (presumably like the {-keep} - I've not used that before).
You could also make the brackets optional, with something like:
/ (?: \( ( $RE{URI} ) \) | ( $RE{URI} ) ) /
although that would result in two match variables, one undefined - so something like following would be needed:
my $uri = $1 || $2 || die "Didn't match a URL!";
There's probably a better way to do this, and also if you're not bothered about matching parentheses then you could simply make the brackets optional (via a '?') in the first regex...
To answer your second question about only matching URLs at the end of the line - have a look at Regex 'anchors' which can force a match against the beginning or end of a line: ^ and $ (or \A and \Z if you prefer). e.g. matching a URL at the end of a line only:
/$RE{URI}\Z/