Overlapping text substitution with Perl regular expression - regex

I have a text file that contains a bunch of sentences. The sentences contain white space (spaces, tabs, new lines) to separate out words consisting of letters and/or digits.
I want to find the word "123" or "-123" and insert a dot (.) before the digits begin. So all occurrences of "123" and "-123" will be converted to ".123" and "-.123".
I was trying this with the following:
$line =~ s/(\s+-*123\s+)/getNewWord($1)/ge
Where $line contains a line read from the file and the function getNewWord puts the dot (.) at the appropriate place in the matched word.
But it's not working for cases where there are two consecutive "123" like " 123 123 ". When the first "123" is replaced by " .123 ", the space following the word has already been consumed by that match, so the second "123" is not matched: the regex engine has no preceding space left to match in front of it.
Can anyone help me with this? Thanks!

I agree with MRAB (and have +1'd his/her answer), but there's no real need for the getNewWord function. I'd change the entire statement to something like one of these:
$line =~ s/((?:^|\s)-?)(123)(?=\s|$)/$1.$2/g;
$line =~ s/(?:^|(?<=\s))(-?)(123)(?=\s|$)/$1.$2/g;
$line =~ s/(?:^|(?<=\s)|(?<=\s-))(?=123(?:\s|$))/./g;
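To see why the lookahead matters, here is a small sketch that runs the pattern from the question and the first substitution above on the problem input. The names get_new_word, with_consumed_space and with_lookahead are made up for this illustration, and get_new_word is only a guess at what the asker's getNewWord does.

```perl
use strict;
use warnings;

# Assumed stand-in for the asker's getNewWord: put the dot right before the digits.
sub get_new_word {
    my ($word) = @_;
    $word =~ s/(?=123)/./;
    return $word;
}

# Pattern from the question: the trailing \s+ is consumed by each match,
# so a directly following "123" has no leading whitespace left to match.
sub with_consumed_space {
    my ($line) = @_;
    $line =~ s/(\s+-*123\s+)/get_new_word($1)/ge;
    return $line;
}

# First variant above: the trailing whitespace is only looked at, not consumed.
sub with_lookahead {
    my ($line) = @_;
    $line =~ s/((?:^|\s)-?)(123)(?=\s|$)/$1.$2/g;
    return $line;
}

print '[', with_consumed_space(' 123 123 -123 '), "]\n";   # [ .123 123 -.123 ]
print '[', with_lookahead(' 123 123 -123 '), "]\n";        # [ .123 .123 -.123 ]
```

The middle "123" is skipped by the first sub and handled by the second, which is exactly the overlap described in the question.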

It might be slightly faster (no explicit capture) and it allows a file without leading/trailing whitespace:
$ echo '123 -123 -123 123' | perl -pe's/(?:^|\s+)\K(?=-?123\b)/./g'
.123 .-123 .-123 .123
To put . after -:
$ echo '123 -123 -123 123' | perl -pe's/(?:^|\s+)-*\K(?=123\b)/./g'
.123 -.123 -.123 .123

Try using a positive lookahead like this: (\s+-*123)(?=\s).

This reminded me of this question: Search html file for random string using regex, where I found (was shown) a good use for negative lookaround assertions, i.e. matching optional delimiters and avoiding partial matches.
Matching -?123 is simple; the problems are:
1. Not matching partial strings
2. Avoiding start/end of line mismatches
3. Avoiding moving the \G anchor
4. Doing a lookbehind assertion of optional dash -?
I did not manage to solve #4, as variable length lookbehind assertions are not supported, so the fix is using a capture group.
Do note that some of the other answers to this question do not address these problems.
Explanation:
Negative lookbehind assertion for non-whitespace matches both whitespace and beginning of string, and assures we do not match partial strings. Then follows an optional dash in a capture group. The end of the match is a nested lookahead, where we must match 123 followed by anything that is not non-whitespace.
Code:
use strict;
use warnings;
while(<DATA>) {
s/(?<!\S)(-?)(?=123(?!\S))/$1./g;
print;
}
__DATA__
r 123 z123 "123" -1233 d123 123-123
123 -123 -123 123 123
Output:
r .123 z123 "123" -1233 d123 123-123
.123 -.123 -.123 .123 .123

Or simply this? This does not bother about the whitespaces, and works on perl 5.8.
echo '123 -123 -123 123' | perl -pe's/(-)?(123)/$1.$2/g'
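A quick check of what "does not bother about the whitespaces" means in practice. This is only a sketch: the sub name dot_everywhere is made up here, and it uses (-?) rather than (-)? so that $1 is never undefined under use warnings.

```perl
use strict;
use warnings;

sub dot_everywhere {
    my ($line) = @_;
    # No whitespace checks at all: every "123" gets a dot, wherever it sits.
    $line =~ s/(-?)(123)/$1.$2/g;
    return $line;
}

print dot_everywhere('123 -123 -123 123'), "\n";   # .123 -.123 -.123 .123
print dot_everywhere('z123 -1233'), "\n";          # z.123 -.1233
```

The second line shows the trade-off: partial words such as z123 and -1233 are rewritten as well, which may or may not be acceptable for the asker's data.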

Related

What is the regex to match exactly an alphanumeric 16 character string?

Here is a regex string I need to use but I only want it to match exactly 16 alphanumeric characters not the 16 within a longer string.
[A-Z]{6}[0-9]{2}[A-E,H,L,M,P,R-T][0-9]{2}[A-Z0-9]{5}
It matches PLDTLL47S04L424T and MRTMTT25D09F205Z perfectly. But what I don't want it to match is a 16-character run in the middle of a longer string like this one:
FA4127E57FE52E49BC1FEEECC32E1246530EE1C#BL2PRD9301MB014.024d.mgd.msft.net
Thanks in advance!
You didn't say which regex flavor you're using, but the issue is that you're missing start and end anchors.
Add ^ and $ to your regex as such:
^[A-Z]{6}[0-9]{2}[A-E,H,L,M,P,R-T][0-9]{2}[A-Z0-9]{5}$
^ means match at the start of a string, or the point after any newline in multiline mode.
$ means the opposite: the end of a string, or the point before the newline in multiline mode.
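As a rough illustration of the anchored version in Perl (the asker did not name a flavor, so Perl is an assumption here), the sketch below wraps the pattern in a hypothetical helper is_exact_code. It also drops the commas from the character class, since inside [...] a comma is a literal comma rather than a separator.

```perl
use strict;
use warnings;

# The asker's pattern with the stray commas removed from the character class.
my $code = qr/[A-Z]{6}[0-9]{2}[A-EHLMPR-T][0-9]{2}[A-Z0-9]{5}/;

# Returns 1 only if the whole string is one 16-character code.
sub is_exact_code {
    my ($s) = @_;
    return $s =~ /^$code$/ ? 1 : 0;
}

for my $s ('PLDTLL47S04L424T', 'MRTMTT25D09F205Z',
           'FA4127E57FE52E49BC1FEEECC32E1246530EE1C') {
    printf "%-40s %s\n", $s, is_exact_code($s) ? 'match' : 'no match';
}
```

In Perl, $ also matches just before a trailing newline; if that matters, \A and \z are the stricter start-of-string and end-of-string anchors.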
In addition to my predecessors:
assuming that you want to match if and only if the line starts with something that matches your pattern, both anchor ^ and word boundary \b will do.
Ending the pattern with the anchor $ and/or \b is, however, NOT correct under that assumption, because the line only has to start with something that matches.
See some example code:
#!/usr/bin/perl -w
my @tests = qw/
AAAAAA00A00AAAAA49BC1FEEECC32E1246530EE1C#BL2PRD9301MB014.024d.mgd.msft.net
0AAAAAA00A00AAAAA49BC1FEEECC32E1246530EE1C#BL2PRD9301MB014.024d.mgd.msft.net
/;
foreach my $test (@tests) {
    if ( $test =~ /^([A-Z]{6}[0-9]{2}[A-EHLMPR-T][0-9]{2}[A-Z0-9]{5})/ ) {
        print "$1 matches\n";
    } else {
        print "NO MATCH\n";
    }
}
generates output:
marc:tmp marc$ perl test.pl
AAAAAA00A00AAAAA matches
NO MATCH
If you change the pattern to
if ( $test =~ /^([A-Z]{6}[0-9]{2}[A-EHLMPR-T][0-9]{2}[A-Z0-9]{5}$)/ ) {
the result is:
marc:tmp marc$ perl test.pl
NO MATCH
NO MATCH
You can use boundary matchers to match the beginnings and endings of lines, strings, words or other things. What is available depends on your flavour of regex. The start and end of string/input matchers are pretty universal.
^[A-Z]{6}[0-9]{2}[A-E,H,L,M,P,R-T][0-9]{2}[A-Z0-9]{5}$
Again, depending on the flavour of regex you are using, you can also use POSIX character classes to match alphanumerics, such as \p{Alpha} and \p{Digit}. This will simplify your regex a bit.
You should use ^ and $ to bound the regex
You can use word boundaries \b for this purpose:
\b[A-Z]{6}[0-9]{2}[A-E,H,L,M,P,R-T][0-9]{2}[A-Z0-9]{5}\b
^                                                     ^
Edit: Word boundaries and not start ^ and end $ anchors because I am assuming you just want to avoid matches as a substring and your patterns are more like your sample string but with spaces
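A minimal sketch of the word-boundary approach, assuming Perl and assuming the codes sit in running text separated by spaces or punctuation. The sample text is made up from the strings in the question, and the character class is written without the commas.

```perl
use strict;
use warnings;

my $code = qr/[A-Z]{6}[0-9]{2}[A-EHLMPR-T][0-9]{2}[A-Z0-9]{5}/;

# Two standalone codes, plus the long string from the question as noise.
my $text = 'Codes PLDTLL47S04L424T and MRTMTT25D09F205Z, '
         . 'noise FA4127E57FE52E49BC1FEEECC32E1246530EE1C';

my @unbounded = $text =~ /($code)/g;      # also finds the embedded run
my @bounded   = $text =~ /\b($code)\b/g;  # only whole "words"

print "unbounded: @unbounded\n";
print "bounded:   @bounded\n";
```

Without \b the embedded run FEEECC32E1246530 is reported as a third hit; with \b on both sides only the two standalone codes remain.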
You may try this regex: ^(?=.*[0-9])(?=.*[a-zA-Z])[a-zA-Z0-9]{16}$

Match pattern with exceptions

I want to match a pattern using regular expressions, but I need some exceptions to the match. For instance, match every occurrence of "John Doe" except for those occurrences where "John Doe" is enclosed by bold tags, i.e. "<b>John Doe</b>".
Match: John Doe
Don't match: <b>John Doe</b>
How can I achieve this with regular expressions?
Clarification: I want to exclude everything between the bold tags. This excluded content may contain a wide variety of characters, line breaks and so on.
If your regex dialect allows lookarounds you may use a negative lookbehind and a negative lookahead to achieve that task:
(?<!<b>)John Doe(?!</b>)
You could use negative look-arounds for this:
(?<!<b>)John Doe(?!</b>)
That wouldn't match <b>John Doe or John Doe</b> either though.
If you only want to not match instances with both the opening and closing tag you could do something like:
John Doe(?!(?<=<b>John Doe)</b>)
Or slightly shorter (but less understandable - 8 is the length of John Doe):
John Doe(?!(?<=<b>.{8})</b>)
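The difference between the two patterns is easiest to see on the four possible tag combinations. The following Perl sketch is only an illustration; the variable names $either_tag and $both_tags are made up here.

```perl
use strict;
use warnings;

# Rejects the name if EITHER the opening or the closing tag touches it.
my $either_tag = qr{(?<!<b>)John Doe(?!</b>)};
# Rejects the name only if BOTH tags enclose it.
my $both_tags  = qr{John Doe(?!(?<=<b>John Doe)</b>)};

for my $s ('John Doe', '<b>John Doe</b>', '<b>John Doe', 'John Doe</b>') {
    printf "%-16s either: %-3s both: %s\n", $s,
        ($s =~ $either_tag ? 'yes' : 'no'),
        ($s =~ $both_tags  ? 'yes' : 'no');
}
```

Only the plain string matches the first pattern, while the second pattern rejects nothing but the fully enclosed <b>John Doe</b>.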
Using Perl you can use negative lookbehind:
$ echo "<b>John Doe</b>" | perl -ne 'print if /(?<!<b>)John Doe/'
(above prints nothing - does not match).
$ echo "John Doe" | perl -ne 'print if /(?<!<b>)John Doe/'
John Doe
(above matches).
The construct (?<!<b>) is a negative lookbehind: the pattern matches only if it is not preceded by what's inside of it (<b> in this case).
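None of the lookaround patterns above covers the asker's clarification that the bold tags may enclose arbitrary other content, including line breaks. One possible way around that, sketched here as a swapped-in technique rather than anything from the answers, is to let an alternation consume each whole bold span first and only count matches of the second branch. The sub name count_unbolded is hypothetical, and the sketch assumes plain, non-nested <b>...</b> tags without attributes.

```perl
use strict;
use warnings;

sub count_unbolded {
    my ($text, $name) = @_;
    my $count = 0;
    # First branch swallows a whole <b>...</b> span (/s lets . cross line
    # breaks); the second branch captures the name outside of such a span.
    while ($text =~ m{<b>.*?</b>|(\Q$name\E)}gs) {
        $count++ if defined $1;
    }
    return $count;
}

my $text = "John Doe met <b>Mr.\nJohn Doe</b> and John Doe";
print count_unbolded($text, 'John Doe'), "\n";   # 2
```

For real HTML with nesting or attributes, a proper parser is likely to be more robust than any regex.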

Perl regex that can match both positive and negative values

I have a list of data which I want to match:
0:1
0:3
0:-1
0:2
0:-4
What's the regex I can use to match all of them?
I tried this but it won't work:
$line =~ /0:(\w+)/
It only matches the positives.
\w is for word symbols: letters, digits and underscore. That means your regexp will match not only 0:34 but also something like 0:hello, and it won't match the minus symbol.
If you need only digits then /0:-?\d+/ should work. And if you need to match the whole string (to filter out strings like a0:-3b), you can use /^0:-?\d+$/.
How about $line =~ /0:[-]?[0-9]/
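A short sketch of the anchored pattern with a capture group, so the signed value can be pulled out as the asker's original (\w+) did. The helper name parse_value is made up for this example.

```perl
use strict;
use warnings;

# Returns the signed integer after "0:", or undef if the line has another shape.
sub parse_value {
    my ($line) = @_;
    return $line =~ /^0:(-?\d+)$/ ? $1 : undef;
}

for my $line ('0:1', '0:3', '0:-1', '0:2', '0:-4', '0:hello', 'a0:-3b') {
    my $value = parse_value($line);
    print "$line => ", (defined $value ? $value : 'no match'), "\n";
}
```

The five sample lines from the question yield 1, 3, -1, 2 and -4; the last two inputs are rejected.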

Regular expression for number search

I need a regular expression that will find any number(s) not inside parentheses.
Example: abcd 1 (35) (df)
It would only see the 1.
Is this very complex? I've tried and had no luck.
Thanks for any help
An easy solution is to first remove the unwanted values:
my $string = "abcd 12 (35) (df) 2311,22";
$string =~ s/\(\d+\)//g; # remove numbers within parens
my @numbers = $string =~ /\d+/g; # extract the numbers
This is quite hard but something like this will probably do:
^(?:\()(\d+)(?:[^)])|(?:[^(0-9]|^)(\d+)(?:[^)0-9]|^)|(?:[^(])(\d+)(?:\))$
The problem is to match (123, 123) and also to not match the string 123 as the number 2 between the non-parentheses characters 1 and 3. Also there are probably some edge cases for start of and end of string.
My suggestion is to not use a regex for this. Maybe a regex that matches numbers and then use the capture info to check if the surrounding characters are not parentheses.
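The suggestion of matching numbers first and checking the neighbours afterwards could look roughly like this in Perl. It is a sketch under a narrow assumption: a number counts as "inside parentheses" only when it is directly wrapped, as in (35). The sub name numbers_outside_parens is made up.

```perl
use strict;
use warnings;

sub numbers_outside_parens {
    my ($s) = @_;
    my @out;
    while ($s =~ /(\d+)/g) {
        # @- and @+ hold the start and end offsets of the capture group.
        my ($start, $end) = ($-[1], $+[1]);
        my $before = $start > 0          ? substr($s, $start - 1, 1) : '';
        my $after  = $end   < length($s) ? substr($s, $end, 1)       : '';
        next if $before eq '(' && $after eq ')';   # directly wrapped: skip
        push @out, $1;
    }
    return @out;
}

print join(' ', numbers_outside_parens('abcd 1 (35) (df)')), "\n";           # 1
print join(' ', numbers_outside_parens('abcd 12 (35) (df) 2311,22')), "\n";  # 12 2311 22
```

Numbers in a longer parenthesised group such as (123, 123) would still be reported; handling those needs tracking of open and close parentheses rather than a neighbour check.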
The regular expression would be:
^[a-z]+ ([0-9]+) \([0-9]+\) \([a-z]+\)$
The result is the first (and only) matching group of the regex.
You may want to remove the ^ and $ if the regex should also match when the text is only part of a line. You can also use [a-zA-Z] or [[:alpha:]] instead of [a-z]. This depends on the regular expression engine you use and, of course, the content you want to match.
Example perl code:
if (m/^[a-z]+ ([0-9]+) \([0-9]+\) \([a-z]+\)$/) {
print("$1\n");
}
Please note that your question does not contain enough information to make a good answer possible (you did not say anything about the general format of your expression, for example whether you want to match integers or floating-point numbers).
How about
/(?:^|[^\d(])(\d+)(?:[^\d)]|$)/
? This matches a string of digits (\d+) that is
preceded by the beginning of the string, or by a character that is not a digit or an open parenthesis ((?:^|[^\d(])), and
followed by the end of the string, or by a character that is not a digit or a close parenthesis ((?:[^\d)]|$))
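One caveat when this pattern is used with /g to collect several numbers: the neighbouring characters are consumed, so two numbers separated by a single character can shadow each other. A possible variation, not part of the answer above, expresses the same conditions with lookarounds so that nothing but the digits is consumed. The input string is made up for the comparison.

```perl
use strict;
use warnings;

my $s = '1 2 (3) 4';

# Pattern from the answer: the space after "1" is consumed, so "2" has
# no preceding character left to satisfy [^\d(].
my @consuming  = $s =~ /(?:^|[^\d(])(\d+)(?:[^\d)]|$)/g;

# Same conditions as zero-width assertions: neighbours are only inspected.
my @lookaround = $s =~ /(?<![\d(])(\d+)(?![\d)])/g;

print "consuming:  @consuming\n";    # 1 4
print "lookaround: @lookaround\n";   # 1 2 4
```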

Insertion with Regex to format a date (Perl)

Suppose I have a string 04032010.
I want it to be 04/03/2010. How would I insert the slashes with a regex?
To do this with a regex, try the following:
my $var = "04032010";
$var =~ s{ (\d{2}) (\d{2}) (\d{4}) }{$1/$2/$3}x;
print $var;
The \d matches a single digit, and {n} means the preceding element is matched n times. Combined, \d{2} matches two digits and \d{4} matches four digits. By surrounding each set in parentheses, the match is stored in a variable: $1, $2, $3, etc.
Some of the prior answers used a . to match; this is not a good thing because it'll match any character. The one we've built here is much more strict in what it'll accept.
You'll notice I used extra spacing in the regex; the x modifier tells the engine to ignore whitespace in the pattern. It can be quite helpful to make the regex a bit more readable.
Compare s{(\d{2})(\d{2})(\d{4})}{$1/$2/$3}x; vs s{ (\d{2}) (\d{2}) (\d{4}) }{$1/$2/$3}x;
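If the input is not guaranteed to be exactly eight digits, the same groups can be anchored and wrapped in a small function that refuses anything else. This is a sketch only; the name format_date is hypothetical, and it checks the shape of the string, not whether the date is valid.

```perl
use strict;
use warnings;

# Returns "DD/MM/YYYY"-style text for an 8-digit string, or undef otherwise.
sub format_date {
    my ($raw) = @_;
    return unless $raw =~ /\A (\d{2}) (\d{2}) (\d{4}) \z/x;
    return "$1/$2/$3";
}

for my $raw ('04032010', '0403201', 'ab032010') {
    my $out = format_date($raw);
    print "$raw => ", (defined $out ? $out : 'rejected'), "\n";
}
```

Only 04032010 is converted (to 04/03/2010); the other two inputs are rejected because \A and \z force the pattern to cover the entire string.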
Well, a regular expression just matches, but you can try something like this:
s/(..)(..)(....)/$1\/$2\/$3/
#!/usr/bin/perl
$var = "04032010";
$var =~ s/(..)(..)(....)/$1\/$2\/$3/;
print $var, "\n";
Works for me:
$ perl perltest
04/03/2010
I always prefer to use a different delimiter if / is involved so I would go for
s| (\d\d) (\d\d) |$1/$2/|x ;