Negative regex for Perl string pattern match

I have this regex:
if($string =~ m/^(Clinton|[^Bush]|Reagan)/i)
{print "$string\n"};
I want to match with Clinton and Reagan, but not Bush.
It's not working.

Your regex does not work because [] defines a character class, but what you want is a lookahead:
(?=) - Positive look ahead assertion foo(?=bar) matches foo when followed by bar
(?!) - Negative look ahead assertion foo(?!bar) matches foo when not followed by bar
(?<=) - Positive look behind assertion (?<=foo)bar matches bar when preceded by foo
(?<!) - Negative look behind assertion (?<!foo)bar matches bar when NOT preceded by foo
(?>) - Once-only subpatterns (?>\d+)bar Performance enhancing when bar not present
(?(x)) - Conditional subpatterns
(?(3)foo|fu)bar - Matches foo if 3rd subpattern has matched, fu if not
(?#) - Comment (?# Pattern does x y or z)
So try: (?!bush)
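A minimal sketch of how that negative lookahead could be applied, assuming the goal is to skip lines that start with "Bush". The helper name and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the lines that do NOT start with "Bush".
# ^(?!bush) asserts, at the start of the string, that "bush" does not follow;
# the lookahead itself consumes no characters.
sub lines_not_starting_with_bush {
    my @lines = @_;
    return grep { /^(?!bush)/i } @lines;
}

my @sample = ('Clinton said', 'Bush used crayons', 'Reagan forgot');
print "$_\n" for lines_not_starting_with_bush(@sample);
```

Note that this keeps every line that does not begin with "Bush", not only the Clinton and Reagan lines.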

Sample text:
Clinton said
Bush used crayons
Reagan forgot
Just omitting a Bush match:
$ perl -ne 'print if /^(Clinton|Reagan)/' textfile
Clinton said
Reagan forgot
Or if you really want to specify:
$ perl -ne 'print if /^(?!Bush)(Clinton|Reagan)/' textfile
Clinton said
Reagan forgot

Your regex says the following:
/^ - if the line starts with
( - start a capture group
Clinton - "Clinton"
| - or
[^Bush] - Any single character except "B", "u", "s" or "h"
| - or
Reagan) - "Reagan". End capture group.
/i - Make matches case-insensitive
So, in other words, your middle part of the regex is screwing you up. As it is a "catch-all" kind of group, it will allow any line that does not begin with any of the upper or lower case letters in "Bush". For example, these lines would match your regex:
Our president, George Bush
In the news today, pigs can fly
012-3123 33
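To check the character-class behaviour described above, here is a small sketch that runs the asker's original pattern and prints what group 1 captured. The helper name and the choice of test lines are assumptions for illustration.

```perl
use strict;
use warnings;

# The asker's original pattern, unchanged.
my $original = qr/^(Clinton|[^Bush]|Reagan)/i;

# Returns the text captured by group 1, or undef when the line does not match.
sub first_capture {
    my ($line) = @_;
    return $line =~ $original ? $1 : undef;
}

for my $line ('Our president, George Bush', '012-3123 33', 'Bush used crayons', 'Clinton said') {
    my $got = first_capture($line);
    printf "%-28s => %s\n", $line, defined $got ? "matched '$got'" : 'no match';
}
```

The unexpected matches capture a single character ("O", "0"), which shows that [^Bush] is consuming one character rather than excluding a word.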
You either make a negative look-ahead, as suggested earlier, or you simply make two regexes:
if( ($string =~ m/^(Clinton|Reagan)/i) and
($string !~ m/^Bush/i) ) {
print "$string\n";
}
As mirod has pointed out in the comments, the second check is quite unnecessary when using the caret (^) to match only beginning of lines, as lines that begin with "Clinton" or "Reagan" could never begin with "Bush".
However, it would be valid without the carets.

What's wrong with using two regexes (or three)? This makes your intentions clearer and may even improve performance:
if ($string =~ /^(Clinton|Reagan)/i && $string !~ /Bush/i) { ... }
if (($string =~ /^Clinton/i || $string =~ /^Reagan/i)
&& $string !~ /Bush/i) {
print "$string\n"
}

If my understanding is correct then you want to match any line which has Clinton and Reagan, in any order, but not Bush. As suggested by Stuck, here is a version with lookahead assertions:
#!/usr/bin/perl
use strict;
use warnings;
my $regex = qr/
^              # anchor, so the lookaheads always scan the whole line
(?=.*clinton)
(?!.*bush)
.*reagan
/ix;
while (<DATA>) {
chomp;
next unless (/$regex/);
print $_, "\n";
}
__DATA__
shouldn't match - reagan came first, then clinton, finally bush
first match - first two: reagan and clinton
second match - first two reverse: clinton and reagan
shouldn't match - last two: clinton and bush
shouldn't match - reverse: bush and clinton
shouldn't match - and then came obama, along comes mary
shouldn't match - to clinton with perl
Results
first match - first two: reagan and clinton
second match - first two reverse: clinton and reagan
As desired, it matches any line that has both Reagan and Clinton, in either order, and no Bush.
You may want to try reading how lookahead assertions work with examples at http://www252.pair.com/comdog/mastering_perl/Chapters/02.advanced_regular_expressions.html
they are very tasty :)
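One caveat worth checking: a lookahead only inspects the text to the right of the current position, so a pattern without a leading ^ can be retried from a later offset, past a "bush" that appears early in the line. A sketch comparing an unanchored and an anchored form (the variable names and the test line are illustrative, not from the answer):

```perl
use strict;
use warnings;

my $unanchored = qr/(?=.*clinton)(?!.*bush).*reagan/i;
my $anchored   = qr/^(?=.*clinton)(?!.*bush).*reagan/i;

# "bush" comes first here, so the line should be rejected.
my $line = 'bush, then clinton and reagan';

# Unanchored: the attempt at offset 0 fails, but the retry at offset 1
# sees "ush, then clinton and reagan", which no longer contains "bush".
print 'unanchored: ', ($line =~ $unanchored ? 'match' : 'no match'), "\n";

# Anchored: only offset 0 is tried, so the negative lookahead sees the whole line.
print 'anchored:   ', ($line =~ $anchored ? 'match' : 'no match'), "\n";
```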

Related

How to use regex in PowerShell to format a dynamic string into an array?

I have this string...
12345;#john, doe (io-124)[Company I work for], 8732;#jane, smith (dos-12)[my company], 902743;#jack, johnson (123-as), 1824;#sam, sampson (1235-oi), 089932;#jessie, jackson (1232-ahs)[top notch company], 2134;#last, one (123-fl)
I want this output in an array...
12345
john, doe (io-124)[Company I work for]
8732
jane, smith (dos-12)[my company]
902743
jack, johnson (123-as)
1824
sam, sampson (1235-oi)
089932
jessie, jackson (1232-ahs)[top notch company]
2134
last, one (123-fl)
I'm still learning regex, but I managed to find the expression "\d+;". That gives me the number at the beginning of each substring with a ";" on the end, which I can trim off, but I don't know how to extract it. If I could extract it, I would be left with the names, each with a "#" at the beginning, so I could split on those and then trim the spaces off the ends. Putting the values in 2 arrays would be fine too, maybe even better.
Hope this makes sense.
Thank you all in advance!
You might use a pattern with 2 capture groups and add the groups to an array
(\d+);#(.*?)(?=,\s+\d+;|$)
Explanation
(\d+) Capture 1+ digits in group 1
;# Match literally
(.*?) Capture group 2, match as few chars as possible (non greedy)
(?= Positive lookahead to assert that what is to the right is
,\s+\d+;|$ Match a comma, 1+ whitespace chars, 1+ digits and ; or assert the end of the string to also get the last item
) Close the lookahead
Regex demo and a Powershell demo
$regex = '(\d+);#(.*?)(?=,\s+\d+;|$)'
$items = [System.Collections.ArrayList]@()
Select-String $regex -input $str -AllMatches | Foreach-Object {$_.Matches} | Foreach-Object {
$items.Add($_.Groups[1].Value) | Out-Null
$items.Add($_.Groups[2].Value) | Out-Null
}
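For comparison, the same two-group pattern can be tried in Perl, where a /g match in list context returns the captures of every match as one flat list. This is only a sketch: the shortened input and the variable names are assumptions, and it is not a PowerShell replacement for the code above.

```perl
use strict;
use warnings;

# Shortened version of the input string from the question.
my $str = '12345;#john, doe (io-124)[Company I work for], '
        . '8732;#jane, smith (dos-12)[my company], '
        . '2134;#last, one (123-fl)';

# Group 1: the digits; group 2: everything up to the next ", <digits>;" or the end.
my @items = $str =~ /(\d+);#(.*?)(?=,\s+\d+;|$)/g;

print "$_\n" for @items;
```

Each number is followed by its name entry, so the list alternates in the order the question asks for.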
You can use
$result = $text -split '(?:,\s*)?(\d+);#?'
# Or, to also remove the empty items:
$result = $text -split '(?:,\s*)?(\d+);#?' | Where-Object {$_}
The regex matches
(?:,\s*)? - an optional sequence of a comma and then zero or more whitespaces
(\d+) - captures into Group 1 (and thus also outputs these values) one or more digits
;#? - a ; and an optional #.

Matching numbers with non-digits embedded

I am trying to match strings of digits that contain non-digits within them. Using the default text in http://regexr.com/, the following should match:
v2.1
-98.7
3.141
.6180
9,000
+42
555.123.4567
+1-(800)-555-2468
The following should not match:
0123456789
12345
I tried:
/[^\n\ ]{1,}\d+\S+\d/g
But it would not match +42 and it incorrectly matched 0123456789 and 12345, and it treated "555.123.4567 +1-(800)-555-2468" as one string.
I tried to alleviate it by putting a $ at the end but that matched nothing. Not sure what I am doing wrong.
You can use this regex to match any text with at least one non-digit:
/^\d*[^\d\n]+\d.*$/mg
RegEx Breakup:
^ - Start
\d* - Match 0 or more digits
[^\d\n]+ - Match 1 or more of any character that is not a digit and not a newline
\d - Match a digit
.* - Match 0 or more of any character
$ - End
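The breakdown above can be exercised with a short Perl sketch. It assumes one candidate per line; the sample list is a subset of the question's text plus the two digit-only strings.

```perl
use strict;
use warnings;

# One candidate per line; the last two are digits only and should be skipped.
my $text = join "\n", 'v2.1', '-98.7', '.6180', '9,000', '+42',
                      '555.123.4567', '0123456789', '12345';

# With /m, ^ and $ anchor at every line; with /g in list context
# (and no capture groups) each whole match is returned.
my @matched = $text =~ /^\d*[^\d\n]+\d.*$/mg;

print "$_\n" for @matched;
```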
Try this:
^(?=.*\d)(?=.*[^\d\s])\S+$
This means "at least one digit and one non-digit and no whitespace".
If no newlines were in your input, you could use the slightly simpler:
^(?=.*\d)(?=.*\D)\S+$
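Because ^ ... \S+$ requires the whole string to be a single whitespace-free token, one way to apply the lookahead version is to split the input on whitespace first. A hedged sketch (the helper name is invented here):

```perl
use strict;
use warnings;

# Hypothetical helper: true when the token has at least one digit,
# at least one non-digit, and no whitespace at all.
sub is_mixed_token {
    my ($token) = @_;
    return $token =~ /^(?=.*\d)(?=.*[^\d\s])\S+$/ ? 1 : 0;
}

# Two candidates share a line, as in the question.
my $text = "555.123.4567 +1-(800)-555-2468\n0123456789 +42";

for my $token (split ' ', $text) {
    printf "%-18s %s\n", $token, is_mixed_token($token) ? 'matches' : "doesn't match";
}
```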
Aren't you over-thinking this massively? What's wrong with using /\D/ to match a string that contains a non-digit?
I'm not sure what your exact requirements are, but if you're looking for a string that contains at least one digit and at least one non-digit, then the easiest approach is to use two regex matches - /\d/ && /\D/.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
while (<DATA>) {
chomp;
say "$_: " . (/\d/ && /\D/ ? 'matches' : 'doesn\'t match');
}
__DATA__
v2.1
-98.7
3.141
.6180
9,000
+42
555.123.4567
+1-(800)-555-2468
0123456789
12345
Looks like you want to dodge strings made up entirely of digits, or entirely of letters. So you can exclude those. That will also let in strings without any numbers, so also require a number.
my $exclude = qr/(?: [0-9]+ | [A-Za-z]+ )/x;
my @res = grep { not /^$exclude$/ and /\d/ } @strings;
If any other characters need be excluded (underscore?), add it to the list.
It is not clear how your input is coming, this takes a list of ready strings. Add word boundaries and/or /s, depending on the input. Or parse the input into a list of strings for this.
If input comes as a multi-line string, my @strings = split '\n|\s+', $text;.

Regex anchor string

I am working through a regex right now. My issue is that my string could have 2 or 3 names in it. I want to grab the first name and then the second and third as one string.
Here is the small powershell script:
$string = "ALDERS PAUL GERARD"
$string2 = "Alders Paul"
$pattern = '^(.*)\s(.*)$'
if($string -match $pattern){
$last = $Matches[1]
Write-Host "Success - $last"
}
if($string2 -match $pattern){
$last = $Matches[1]
Write-Host "Success - $last"
}
The results are Success - Alders Paul and Success - Alders
How can I make the regex anchor on the first space and not the second space in the line? So I get Success - Alders and Success - Alders
You need to use lazy matching with the first capturing group:
^(.*?)\s(.*)$
    ^
See Demo 1
From rexegg.com Lazy Quantifier Solution:
The lazy .*? guarantees that the quantified dot only matches as many characters as needed for the rest of the pattern to succeed.
Or, use a non-whitespace shorthand class \S (i.e. matching any character but whitespace characters):
^(\S*)\s(.*)$
Here is a second demo
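The difference between the greedy and lazy first group can also be seen outside PowerShell. This is a Perl sketch rather than a translation of the script above; the helper name is an assumption.

```perl
use strict;
use warnings;

# Returns the two captures for a given pattern, or an empty list on failure.
sub split_name {
    my ($pattern, $name) = @_;
    return $name =~ $pattern ? ($1, $2) : ();
}

my $greedy = qr/^(.*)\s(.*)$/;    # first group runs to the LAST space
my $lazy   = qr/^(.*?)\s(.*)$/;   # first group stops at the FIRST space

for my $name ('ALDERS PAUL GERARD', 'Alders Paul') {
    my ($g_first) = split_name($greedy, $name);
    my ($l_first, $l_rest) = split_name($lazy, $name);
    print "greedy: '$g_first' | lazy: '$l_first' + '$l_rest'\n";
}
```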

Trying to use /^\s*$/ match multiple blank lines and replace them failed and get a confusing result

Perl version : 5.16.01
I'm reading a book about regex which is based on Perl 5.8.
The book says that s/^\s*$/blabla/mg can match and replace multiple blank lines.
But when I practiced, I got a confusing result.
code:
$text = "c\n\n\n\n\nb";
$text =~ s/^\s*$/<p>/mg;
print "$text";
Here is the result:
C:\Users\Administrator\Desktop\regex>perl t2h.pl
c
<p><p>
b
I want to know why I got two <p> instead of a single one between 'c' and 'b'. Did Perl's /$/ change after 5.8?
The lesson here is be wary of regular expressions that will match a zero-width pattern, you could get unexpected results.
We can see what's happening here by showing the prematch, match and post match of both replacements:
use strict;
use warnings;
my $text = "c\n\n\n\nb";
$text =~ s{^\s*$}{
printf qq{<"%s" - "%s" - "%s">\n}, map s/\n/\\n/gr, ($`, $&, $');
"<p>"
}emg;
$text =~ s/\n/\\n/g;
print qq{Result: "$text"};
Outputs <"Prematch" - "Match" - "Postmatch">:
<"c\n" - "\n\n" - "\nb">
<"c\n\n\n" - "" - "\nb">
Result: "c\n<p><p>\nb"
Basically, the regex first matches from position 2 up to position 4, consuming 2 newline characters. After that replacement, it resumes searching at position 4 and matches a zero-width pattern, so it adds a second <p>.
One of the reasons this isn't intuitive is that our regex has already replaced the \n\n at positions 2 and 3 with a <p>. However, lookbehind assertions (of which ^ is a special variant) see the string as it originally was, not as modified by previous passes of a /g regex. Therefore, when matching at position 4, the regex sees c\n\n\n behind it instead of c\n<p> (as demonstrated in the output above), so ^ matches again, with $ matching immediately after it and nothing in between.
The solution to this is to not allow zero width patterns by using + in this instance instead of *.
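A small sketch comparing the two quantifiers on the same input. The helper name is hypothetical, and newlines are made visible in the result as \n.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a substitution pattern to a copy of the text
# and return the result with newlines made visible.
sub replace_blank_runs {
    my ($pattern, $text) = @_;
    (my $copy = $text) =~ s/$pattern/<p>/g;
    $copy =~ s/\n/\\n/g;
    return $copy;
}

my $text = "c\n\n\n\nb";

# The /m flag has to be on the qr// object itself: a compiled pattern
# keeps its own flags when it is interpolated into another regex.
print 'with *: ', replace_blank_runs(qr/^\s*$/m, $text), "\n";
print 'with +: ', replace_blank_runs(qr/^\s+$/m, $text), "\n";
```

With + the second, zero-width match is no longer possible, so only one <p> is inserted.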
Secondary Example
Another example of this is the following, simpler regex
my $text = "caab";
$text =~ s/a*/<p>/g;
print $text;
Outputs:
<p>c<p><p>b<p>
The positional breakdown of this matching is as follows:
0 c - match a zero width pattern
1 a - Match a 2 character pattern
2 a
3 b - Match a zero width pattern
4 $ - match a zero width pattern
Therefore, the final lesson is to simply be wary of regexes that will match a zero width pattern.
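The positional breakdown can be reproduced by recording where each match starts and how long it is. A sketch, with a helper name that is an assumption:

```perl
use strict;
use warnings;

# Collect [start offset, length] for every match of a* in the text.
sub match_spans {
    my ($text) = @_;
    my @spans;
    while ($text =~ /a*/g) {
        # $-[0] and $+[0] are the start and end offsets of the whole match.
        push @spans, [ $-[0], $+[0] - $-[0] ];
    }
    return @spans;
}

printf "offset %d, length %d\n", @$_ for match_spans('caab');
```

Three of the four matches have length 0, which is where the extra <p> tags come from.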
Quantifier * matches 0 or more times,
quantifier + matches 1 or more times.
So your regex should be written as s/^\s+$/<p>/mg
You can try this:
#!/usr/bin/perl
$text = "c\n\n\n\n\nb";
$text =~ s/[\r\n]//g;
print $text;
DEMO http://ideone.com/WmVFHz

Regex: Matching 4-Digits within words

I have a body of text I'm looking to pull repeat sets of 4-digit numbers out from.
For Example:
The first is 1234 2) The Second is 2098 3) The Third is 3213
Now I know I'm able to get the first set of digits out by simply using:
/\d{4}/
...returning 1234
But how do I match the second set of digits, or the third, and so on...?
edit: How do I return 2098, or 3213?
You don't appear to have a proper answer to your question yet.
The solution is to use the /g modifier on your regex. In list context it will find all of the numbers in your string at once, like this
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my @numbers = $str =~ /\b \d{4} \b/gx;
print "@numbers\n";
output
1234 2098 3213
Or you can iterate through them, using scalar context in a while loop, like this
while ($str =~ /\b (\d{4}) \b/gx) {
my $number = $1;
print $number, "\n";
}
output
1234
2098
3213
I have added the \b patterns to the regex so that it only matches whole four-digit numbers and doesn't, for example, find 1234 in 1234567. The /x modifier just allows me to add spaces so that the pattern is more intelligible.
See http://perldoc.perl.org/perlre.html for discussion on the use of the 'g' modifier which will cause your regular expression to match ALL occurrences of its pattern, not just the first.
If you want a pattern that finds the $n'th 4-digit group, this seems to work:
$pat = "^(?:.*?\\b(\\d{4})\\b){$n}";
if ($s =~ /$pat/) {
print "Found $1\n";
} else {
print "Not found\n";
}
I did this by building a string pattern because I couldn't get a variable interpolated into a quantifier {$n}.
This pattern finds 4-digit groups that are on word boundaries (the \b tests); I don't know if that meets your requirements. The pattern uses .*? to ensure that as few characters as possible are matched between each four-digit group. The pattern is matched $n times, and the capture group $1 is set to whatever it was in the last iteration, i.e. the $n'th one.
EDIT: When I just tried it again, it seemed to interpolate $n in a quantifier just fine. I don't know what I did differently that it didn't work last time. So maybe this will work:
if ($s =~ /^(?:.*?\b(\d{4})\b){$n}/) { ...
If not, see amon's comment about qr//.
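A sketch of the qr// route, assuming $n is a positive integer. The helper name is hypothetical, and the list-slice alternative at the end is a different technique from the quantified pattern, shown only for comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n-th standalone four-digit number, or undef.
sub nth_four_digit {
    my ($text, $n) = @_;
    # qr// compiles the pattern once, with $n interpolated into the quantifier.
    my $re = qr/^(?:.*?\b(\d{4})\b){$n}/;
    return $text =~ $re ? $1 : undef;
}

my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my $second = nth_four_digit($str, 2);
print "$second\n";

# Alternative: collect every match with /g and index into the list.
my $third = ($str =~ /\b(\d{4})\b/g)[2];
print "$third\n";
```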
If the regex is only matched once, then match all three in one regex and extract them using matched groups:
^.*\b(\d{4})\b.*\b(\d{4})\b.*\b(\d{4})\b.*$
The three 4-digit numbers will be captured in groups 1, 2 and 3.
Ajb's answer with "gx" is the best. If you know you will have three numbers, this straightforward line does the trick:
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my ($num1, $num2, $num3) = $str =~ /\b \d{4} \b/gx;
print "$num1, $num2, $num3\n";