Why can Perl match two places with '/$/g'? [duplicate]

I wrote some sample code like this:
$var="123\n123\n\n\n\n\n1\n";
$var=~s/$/___/g;
print $var;
It outputs this (blank lines omitted):
123
123
1___
___
Why can '/$/g' match in two places? I think one match is at the last "\n" and the other is at the end of the string, but I expected it to match only once, at the end of the last line.

Be careful with zero-width regular expressions. They often do not behave entirely the way you expect.
In this case, the $ anchor can match both directly before the final newline and directly after it, at the very end of the string. This is documented behavior: without the /m modifier, $ matches at the end of the string or just before a newline at the end of the string.
Therefore, the fix is to use the end-of-string assertion \z instead of $:
$var = "abc\n";
$var =~ s/\z/<foo>/g;
print "'$var'";
Outputs:
'abc
<foo>'

g is the global modifier, which is why you see a replacement at every position where $ can match. Without g, only the first match is replaced, so the output is (blank lines omitted):
123
123
1___
Also see: $ and Perl's global regular expression modifier
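A minimal sketch of the behavior described in the answers above, using a shorter string than the one in the question. The variable names and the show helper are illustrative only; the helper just makes newlines visible when printing.

```perl
use strict;
use warnings;

my $var = "123\n";

# Without /m, $ matches just before a trailing newline AND at the very
# end of the string, so /g performs two substitutions.
(my $dollar_global = $var) =~ s/$/___/g;

# Without /g, only the first of those two positions is replaced.
(my $dollar_once = $var) =~ s/$/___/;

# \z matches only at the very end of the string.
(my $z_global = $var) =~ s/\z/___/g;

# Illustrative helper: render newlines as a visible \n.
sub show { my ($s) = @_; $s =~ s/\n/\\n/g; return $s }

print show($dollar_global), "\n";   # 123___\n___
print show($dollar_once),   "\n";   # 123___\n
print show($z_global),      "\n";   # 123\n___
```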

Related

Why does the regex (aba?)+ not match with abab? [duplicate]

Given (aba?)+ as the regex and abab as the string, why does it only match aba?
Since the last a in the regex is optional, isn't abab a match as well?
Tested on https://regex101.com/
The reason (aba?)+ only matches aba out of abab is greedy matching: The optional a in the first loop is tested before the group is tested again, and matches. Therefore, the remaining string is b, which does not match (aba?) again.
If you want to turn off greedy matching for this optional a, use a??, or write your regex differently.
Since (aba?)+ is greedy, your pattern tries to match as much as possible. And since it first matches "aba", the remaining "b" is not matched.
Try the non-greedy version (it will match the first and second "ab"'s):
$ echo "abab" | grep -Po "(aba?)+"
aba
$ echo "abab" | grep -Po "(aba??)+"
abab
If the whole string has to match, anchor the pattern at both ends:
^(aba??)+$
and not (aba??)+ as discussed with #WiktorStribizew and YSelf.
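The same comparison can be sketched in Perl. The whole_match helper is hypothetical and simply returns the text matched by the entire pattern. Note that once the pattern is anchored, the greedy version also matches abab, because the failing $ forces the engine to backtrack and give up the optional a.

```perl
use strict;
use warnings;

# Illustrative helper: return the text matched by the whole pattern,
# or undef if there is no match.
sub whole_match {
    my ($str, $re) = @_;
    return $str =~ $re ? $& : undef;
}

my $greedy          = whole_match("abab", qr/(aba?)+/);     # optional a is taken first
my $lazy            = whole_match("abab", qr/(aba??)+/);    # optional a is skipped first
my $anchored_greedy = whole_match("abab", qr/^(aba?)+$/);   # $ forces backtracking
my $anchored_lazy   = whole_match("abab", qr/^(aba??)+$/);

print "$greedy\n";            # aba
print "$lazy\n";              # abab
print "$anchored_greedy\n";   # abab
print "$anchored_lazy\n";     # abab
```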

Overlapping text substitution with Perl regular expression

I have a text file that contains a bunch of sentences. The sentences contain white space (spaces, tabs, new lines) to separate words consisting of letters and/or digits.
I want to find the word "123" or "-123" and insert a dot (.) before the digits begin. So all occurrences of "123" and "-123" will be converted to ".123" and "-.123".
I was trying this with the following:
$line =~ s/(\s+-*123\s+)/getNewWord($1)/ge
Here $line contains a line read from the file, and the function getNewWord puts the dot (.) at the appropriate place in the matched word.
But it does not work when there are two consecutive "123" words, as in " 123 123 ". When the first "123" is replaced by " .123 ", the space following it has already been consumed by the match, so the second "123" is not matched: the regex engine has no preceding whitespace left to match for that word.
Can anyone help me with this? Thanks!
I agree with MRAB (and have +1'd his/her answer), but there's no real need for the getNewWord function. I'd change the entire statement to something like one of these:
$line =~ s/((?:^|\s)-?)(123)(?=\s|$)/$1.$2/g;
$line =~ s/(?:^|(?<=\s))(-?)(123)(?=\s|$)/$1.$2/g;
$line =~ s/(?:^|(?<=\s)|(?<=\s-))(?=123(?:\s|$))/./g;
It might be slightly faster (no explicit capture) and it allows a file without leading/trailing whitespace:
$ echo '123 -123 -123 123' | perl -pe's/(?:^|\s+)\K(?=-?123\b)/./g'
.123 .-123 .-123 .123
To put . after -:
$ echo '123 -123 -123 123' | perl -pe's/(?:^|\s+)-*\K(?=123\b)/./g'
.123 -.123 -.123 .123
Try using a positive lookahead like this: (\s+-*123)(?=\s).
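A small sketch of why the lookahead helps, using the " 123 123 " case from the question. The variable names are illustrative, and the substitution is written inline rather than with the asker's getNewWord function.

```perl
use strict;
use warnings;

my $text = " 123 123 ";

# Consuming version: the trailing \s+ eats the space that the second
# "123" would need as its leading whitespace, so it is skipped.
(my $consumed = $text) =~ s/(\s+-*)(123)(\s+)/$1.$2$3/g;

# Lookahead version: the trailing whitespace is checked but not
# consumed, so it is still available for the next match.
(my $lookahead = $text) =~ s/(\s+-*)(123)(?=\s)/$1.$2/g;

print "[$consumed]\n";    # [ .123 123 ]
print "[$lookahead]\n";   # [ .123 .123 ]
```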
This reminded me of this question: Search html file for random string using regex, where I found (was shown) a good use for negative lookaround assertions, i.e. matching optional delimiters and avoiding partial matches.
Matching -?123 is simple; the problems are:
1. Not matching partial strings
2. Avoiding start/end of line mismatches
3. Avoiding moving the \G anchor
4. Doing a lookbehind assertion of an optional dash -?
I did not manage to solve #4, as variable-length lookbehind assertions are not supported, so the fix is to use a capture group.
Do note that some of the other answers to this question do not address these problems.
Explanation:
Negative lookbehind assertion for non-whitespace matches both whitespace and beginning of string, and assures we do not match partial strings. Then follows an optional dash in a capture group. The end of the match is a nested lookahead, where we must match 123 followed by anything that is not non-whitespace.
Code:
use strict;
use warnings;
while(<DATA>) {
s/(?<!\S)(-?)(?=123(?!\S))/$1./g;
print;
}
__DATA__
r 123 z123 "123" -1233 d123 123-123
123 -123 -123 123 123
Output:
r .123 z123 "123" -1233 d123 123-123
.123 -.123 -.123 .123 .123
Or simply this? This does not bother about the whitespaces, and works on perl 5.8.
echo '123 -123 -123 123' | perl -pe's/(-)?(123)/$1.$2/g'

Space character in regex is not recognised

I'm writing a simple program - please see below for my code with comments. Does anyone know why the space character is not recognised in line 10? When I run the code, it finds the :: but does not replace it with a space.
1 #!/usr/bin/perl
2
3 # This program replaces :: with a space
4 # but ignores a single :
5
6 $string = 'this::is::a:string';
7
8 print "Current: $string\n";
9
10 $string =~ s/::/\s/g;
11 print "New: $string\n";
Try s/::/ /g instead of s/::/\s/g.
The \s is actually a character class representing all whitespace characters, so it only makes sense to have it in the regular expression (the first part) rather than in the replacement string.
Use s/::/ /g. \s only denotes whitespace on the matching side, on the replacement side it becomes s.
Replace the \s with a real space.
The \s is shorthand for a whitespace matching pattern. It isn't used when specifying the replacement string.
Replace string should be a literal space, i.e.:
$string =~ s/::/ /g;
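To see the difference side by side, here is a short sketch based on the string from the question. The $wrong and $right names are illustrative.

```perl
use strict;

my $string = 'this::is::a:string';

# \s in the replacement is not a whitespace class there; Perl passes the
# unknown escape through as a plain "s" (with warnings enabled it
# reports an unrecognized escape).
(my $wrong = $string) =~ s/::/\s/g;

# A literal space in the replacement does what was intended.
(my $right = $string) =~ s/::/ /g;

print "$wrong\n";   # thississa:string
print "$right\n";   # this is a:string
```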

Replace repeating characters with one with a regex

I need a regex to remove repeated occurrences of certain characters. If one of these characters occurs more than once in a row, replace the run with a single occurrence.
/[\s.'-,{2,0}]
These are the characters that, when repeated, should be replaced with a single instance of the same character.
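Is this the regex you're looking for?
/([\s.',-])\1+/
That will match it. Note that the hyphen is placed last in the character class: written as '-, it would form a range from ' to , (which also covers ( ) * +) and would not match a literal -. If you're using Perl, you can replace it using the following expression:
s/([\s.',-])\1+/$1/g
Edit: If you're using :ahem: PHP, then you would use this syntax:
$out = preg_replace('/([\s.\',-])\1+/', '$1', $in);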
The () group matches the character and the \1 means that the same thing it just matched in the parentheses occurs at least once more. In the replacement, the $1 refers to the match in first set of parentheses.
Note: this is Perl-Compatible Regular Expression (PCRE) syntax.
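A short Perl sketch of the backreference approach, assuming the intended set is whitespace, ., ', - and , (the hyphen is written last in the class so that it is literal). The sample string is made up for illustration. The last two checks show what happens when the hyphen sits between ' and , instead.

```perl
use strict;
use warnings;

my $text = "a  b....,,--c''d";

# Capture one character from the set, then require one or more repeats
# of that same character, and replace the whole run with the capture.
(my $collapsed = $text) =~ s/([\s.',-])\1+/$1/g;
print "$collapsed\n";   # a b.,-c'd

# Pitfall: '-, inside a class is a range from ' to , and that range
# contains ( ) * + but not the hyphen itself.
my $range_matches_hyphens = ("--" =~ /([\s.'-,])\1+/) ? 1 : 0;
my $range_matches_parens  = ("((" =~ /([\s.'-,])\1+/) ? 1 : 0;
print "$range_matches_hyphens $range_matches_parens\n";   # 0 1
```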
From the perlretut man page:
Matching repetitions
The examples in the previous section display an annoying weakness. We were only matching 3-letter words, or chunks of words of 4 letters or less. We'd like to be able to match words or, more generally, strings of any length, without writing out tedious alternatives like \w\w\w\w|\w\w\w|\w\w|\w.
This is exactly the problem the quantifier metacharacters ?, *, +, and {} were created for. They allow us to delimit the number of repeats for a portion of a regexp we consider to be a match. Quantifiers are put immediately after the character, character class, or grouping that we want to specify. They have the following meanings:
a? means: match 'a' 1 or 0 times
a* means: match 'a' 0 or more times, i.e., any number of times
a+ means: match 'a' 1 or more times, i.e., at least once
a{n,m} means: match at least "n" times, but not more than "m" times.
a{n,} means: match at least "n" or more times
a{n} means: match exactly "n" times
As others said, it depends on your regex engine, but here is a small example of how you could do this (the hyphen goes last in the class so that it is literal and does not form a range):
/([ _,.-])\1*/\1/g
With sed:
$ echo "foo , bar" | sed 's/\([ _,.-]\)\1*/\1/g'
foo , bar
$ echo "foo,. bar" | sed 's/\([ _,.-]\)\1*/\1/g'
foo,. bar
Using JavaScript, as mentioned in a comment, and assuming (it's not too clear from your question) that the characters you want to replace are space characters, ., ', -, and ,:
var str = 'a b....,,';
str = str.replace(/(\s){2}|(\.){2}|('){2}|(-){2}|(,){2}/g, '$1$2$3$4$5');
// Now str === 'a b..,'
If I understand correctly, you want to do the following: given a set of characters, replace any multiple occurrence of each of them with a single character. Here's how I would do it in perl:
perl -pi.bak -e "s/\.{2,}/\./g; s/\-{2,}/\-/g; s/'{2,}/'/g" text.txt
If, for example, text.txt originally contains:
Here is . and here are 2 .. that should become a single one. Here's
also a double -- that should become a single one. Finally here we have
three ''' which should be substituted with one '.
it is modified as follows:
Here is . and here are 2 . that should become a single one. Here's
also a double - that should become a single one. Finally here we have
three ' which should be substituted with one '.
I simply use the same replacement regex for each character in the set. For example,
s/\.{2,}/\./g;
replaces 2 or more occurrences of a dot character with a single dot. I concatenate several of these expressions, one for each character of your original set.
There may be more compact ways of doing this, but, I think this is simple and it works :)
I hope it helps.

How can I get my Perl regex not to use special characters from an interpolated variable? [duplicate]

Possible Duplicate:
How can I escape meta-characters when I interpolate a variable in Perl's match operator?
I am using the following regex to search for a string $word in the bigger string $referenceLine as follows :
$wordRefMatchCount =()= $referenceLine =~ /(?=\b$word\b)/g
The problem occurs when my $word substring contains characters such as (, because Perl treats them as part of the regex rather than as literal text to match, and gives the following error:
Unmatched ( in regex; marked by <-- HERE in
m/( <-- HERE ?=\b( darsheel safary\b)/
at ./bleu.pl line 119, <REFERENCE> line 1.
Can someone please tell me a solution to this? I think if I could somehow get Perl to treat the whole of $word literally, without evaluating it as a pattern, it might work.
Use
$wordRefMatchCount =()= $referenceLine =~ /(?=\b\Q$word\E\b)/g
to tell the regex engine to treat every character in $word as a literal character.
\Q marks the start, \E marks the end of a literal string in Perl regex.
Alternatively, you could do
$quote_word = quotemeta($word);
and then use
$wordRefMatchCount =()= $referenceLine =~ /(?=\b$quote_word\b)/g
One more thing (taken up here from the comments, where it's harder to find):
Your regex fails in your example case because of the word boundary anchor \b. This anchor matches between a word character and a non-word character. It only makes sense if placed around actual words, i.e. \bbar\b to ensure that only bar is matched, not foobar or barbaric. If you put it around non-words (as in \b( darsheel safary\b), it will cause the match to fail (unless there is a letter, digit or underscore right before the ().
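The points above can be sketched as follows. The sample strings and variable names are made up for illustration; only the pattern shape comes from the question.

```perl
use strict;
use warnings;

my $referenceLine = "foo( bar and foo( bar again";
my $word          = "foo( bar";   # contains an unbalanced "("

# Unquoted interpolation: the "(" is compiled as a group and the regex
# fails to compile at run time, which eval can catch.
my $ok    = eval { my $n = () = $referenceLine =~ /(?=\b$word\b)/g; 1 };
my $error = $@;

# \Q...\E and quotemeta both make every character of $word literal.
my $quoted_count = () = $referenceLine =~ /(?=\b\Q$word\E\b)/g;
my $quoted       = quotemeta($word);
my $quotemeta_count = () = $referenceLine =~ /(?=\b$quoted\b)/g;
print "$quoted_count $quotemeta_count\n";   # 2 2

# The \b pitfall: a word starting with "(" only has a word boundary in
# front of it when the preceding character is a word character.
my $paren_word   = "( darsheel safary";
my $after_space  = () = "name ( darsheel safary" =~ /(?=\b\Q$paren_word\E\b)/g;
my $after_letter = () = "name( darsheel safary"  =~ /(?=\b\Q$paren_word\E\b)/g;
print "$after_space $after_letter\n";       # 0 1
```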