Regex anchor string - regex

I am working through a regex right now. My issue is that my string could have 2 or 3 names in it. I want to grab the first name and then the second and third as one string.
Here is the small PowerShell script:
$string = "ALDERS PAUL GERARD"
$string2 = "Alders Paul"
$pattern = '^(.*)\s(.*)$'
if($string -match $pattern){
$last = $Matches[1]
Write-Host "Success - $last"
}
if($string2 -match $pattern){
$last = $Matches[1]
Write-Host "Success - $last"
}
The results are Success - ALDERS PAUL and Success - Alders
How can I make the regex anchor on the first space and not the second space in the line, so that I get Success - ALDERS and Success - Alders?

You need to use lazy matching with the first capturing group:
^(.*?)\s(.*)$
    ^
See Demo 1
From rexegg.com Lazy Quantifier Solution:
The lazy .*? guarantees that the quantified dot only matches as many characters as needed for the rest of the pattern to succeed.
Or, use a non-whitespace shorthand class \S (i.e. matching any character but whitespace characters):
^(\S*)\s(.*)$
Here is a second demo
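The difference between the greedy, lazy, and \S variants can be checked side by side. The following is a minimal sketch in Python's re module rather than PowerShell (an assumption of this example; for these three patterns the quantifiers behave the same way in both engines). The helper name split_first is hypothetical.

```python
import re

GREEDY = r'^(.*)\s(.*)$'      # original pattern: group 1 grabs as much as possible
LAZY = r'^(.*?)\s(.*)$'       # lazy group 1 stops at the first whitespace
NON_SPACE = r'^(\S*)\s(.*)$'  # \S* cannot cross whitespace at all


def split_first(pattern, text):
    """Return (group 1, group 2), or None if the pattern does not match."""
    m = re.match(pattern, text)
    return m.groups() if m else None


for name in ("ALDERS PAUL GERARD", "Alders Paul"):
    print(name)
    print("  greedy:   ", split_first(GREEDY, name))
    print("  lazy:     ", split_first(LAZY, name))
    print("  non-space:", split_first(NON_SPACE, name))
```

With the greedy pattern the first group runs up to the last space; with the other two it stops at the first space, so the remaining names end up together in group 2.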

Related

Powershell - Should take only set of numbers from file name

I have a script that reads a file name from a path and then takes only the numbers from it to do something with them. It works fine until I encounter this situation.
For example:
For the file name Patch_1348968.vip it takes the number 1348968.
If the file name is Patch_1348968_v1.zip it takes the number 13489681, which is wrong.
I am using this to fetch the numbers. In general the name always starts with patch_#####.vip with 7-8 digits, so I want to take only the digits before any sign like _ or -.
$PatchNumber = $file.Name -replace "[^0-9]" , ''
You can use
$PatchNumber = $file.Name -replace '.*[-_](\d+).*', '$1'
See the regex demo.
Details:
.* - any chars other than newline char as many as possible
[-_] - a - or _
(\d+) - Group 1 ($1): one or more digits
.* - any chars other than newline char as many as possible.
I suggest using -match instead, so you don't have to think in terms of what to remove:
if( $file.Name -match '\d+' ) {
$PatchNumber = $matches[0]
}
\d+ matches the first consecutive sequence of digits. The automatic variable $matches contains the full match at index 0, if the -match operator successfully matched the input string against the pattern.
If you want to be more specific, you could use a more complex pattern and extract the desired sub string using a capture group:
if( $file.Name -match '^Patch_(\d+)' ) {
$PatchNumber = $matches[1]
}
Here, the anchor ^ makes sure the match starts at the beginning of the input string, then Patch_ gets matched literally (-match is case-insensitive by default), followed by a group of consecutive digits, which is captured by the parentheses and can be extracted using $matches[1].
You can get an even more detailed explanation of the RegEx and the ability to experiment with it at regex101.com.
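To see why deleting every non-digit goes wrong and why matching the digits after the prefix works, here is a small sketch in Python's re module (used here only for illustration, not the PowerShell from the answers). The function names patch_number and strip_non_digits are hypothetical.

```python
import re


def strip_non_digits(file_name):
    """The original approach: delete everything that is not a digit."""
    return re.sub(r'[^0-9]', '', file_name)


def patch_number(file_name):
    """Return only the digits that directly follow 'Patch_', or None."""
    m = re.match(r'Patch_(\d+)', file_name, flags=re.IGNORECASE)
    return m.group(1) if m else None


for name in ("Patch_1348968.vip", "Patch_1348968_v1.zip"):
    print(name, strip_non_digits(name), patch_number(name))
```

The second file name shows the problem: the stripping approach glues the trailing 1 from _v1 onto the patch number, while the capture group stops at the first non-digit.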

REGEX - Extract OU from Distinguished Name

I need to extract the "OU" part from my Distinguished Name with a regex.
For example:
"CN=DAVID Jean Louis (a),OU=Coiffeur,OU=France,DC=Paris,DC=France"
"CN=PROVOST Franck,OU=Coiffeur,OU=France,DC=Paris,DC=France"
"CN=SZHARCOFF Michel (AB),OU=Coiffeur_Inter,OU=France,DC=Paris,DC=France"
I need to have
"OU=Coiffeur,OU=France"
"OU=Coiffeur,OU=France"
"OU=Coiffeur_Inter,OU=France"
I tried "CN=SZHARCOFF Michel (AB),OU=Coiffeur_Inter,OU=France,DC=Paris,DC=France" -match "^CN=[\w-()]*[\w]*"
but it doesn't succeed.
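You may match every OU= followed by one or more non-comma characters with the \bOU=[^,]+ regex and then join the matches with a comma:
# $ouMatches is used instead of $matches to avoid overwriting the automatic variable $Matches
$ouMatches = [regex]::Matches($s, '\bOU=[^,]+') | % { $_.Value }
$res = $ouMatches -join ','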
Output for the first string:
OU=Coiffeur,OU=France
Pattern details
\b - a word boundary to only match OU as a whole word
OU= - a literal substring
[^,]+ - 1 or more (+) characters other than (as [^...] is a negated character class) a comma.
See the regex demo.
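The "find all, then join" idea is easy to try on the three sample strings from the question. This is a minimal sketch in Python's re module rather than PowerShell (an assumption of this example); the function name extract_ous is hypothetical.

```python
import re


def extract_ous(dn):
    """Collect every OU=... component of a DN and join them with commas."""
    return ",".join(re.findall(r'\bOU=[^,]+', dn))


samples = [
    "CN=DAVID Jean Louis (a),OU=Coiffeur,OU=France,DC=Paris,DC=France",
    "CN=PROVOST Franck,OU=Coiffeur,OU=France,DC=Paris,DC=France",
    "CN=SZHARCOFF Michel (AB),OU=Coiffeur_Inter,OU=France,DC=Paris,DC=France",
]

for dn in samples:
    print(extract_ous(dn))
```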
This pattern will support DistinguishedName properties containing commas, and provides named groups for matches. I use this in PowerShell to parse an ADObject's parent DN, etc.
^(?:(?<cn>CN=(?<name>.*?)),)?(?<parent>(?:(?<path>(?:CN|OU).*?),)?(?<domain>(?:DC=.*)+))$
See Regexr demo: https://regexr.com/5bt64
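As a rough illustration of what the named groups capture, the pattern can be run against one of the sample DNs from the question. The sketch below uses Python's re module, so the .NET-style (?<name>...) groups are rewritten as (?P<name>...); that translation and the variable names DN_PATTERN and parts are assumptions of this example, not part of the original answer.

```python
import re

# Same pattern as above, with named groups in Python's (?P<name>...) syntax.
DN_PATTERN = re.compile(
    r'^(?:(?P<cn>CN=(?P<name>.*?)),)?'
    r'(?P<parent>(?:(?P<path>(?:CN|OU).*?),)?(?P<domain>(?:DC=.*)+))$'
)

m = DN_PATTERN.match("CN=PROVOST Franck,OU=Coiffeur,OU=France,DC=Paris,DC=France")
parts = m.groupdict() if m else {}

for key in ("cn", "name", "parent", "path", "domain"):
    print(key, "=", parts.get(key))
```

For this input the path group holds the OU chain and the domain group holds the DC components.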

PowerShell -replace to get string between two different characters

I am currently using split to get what I need, but I am hoping there is a better way in PowerShell.
Here is the string:
server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000
I want to get the server and database without the database= or the server= prefix.
Here is the method I am currently using:
$databaseserver = (($details.value).split(';')[0]).split('=')[1]
$database = (($details.value).split(';')[1]).split('=')[1]
This outputs to:
ss8.server.com
CSSDatabase
I would like it to be as simple as possible.
Thank you in advance
Replacing approach
You may use the following regex replace:
$s = 'server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000'
$dbserver = $s -replace '^server=([^;]+).*', '$1'
$db = $s -replace '^[^;]*;database=([^;]+).*', '$1'
The technique is to match and capture (with (...)) what we need and just match what we need to remove.
Pattern details:
^ - start of the line
server= - a literal substring
([^;]+) - Group 1 (what $1 refers to) matching 1+ chars other than ;
.* - any 0+ chars other than a newline, as many as possible
Pattern 2 is almost the same, the capturing group is shifted a bit to capture another detail, and some more literal values are added to match the right context.
Note: if the values you need to extract may appear anywhere in the string, replace ^ in the first one and ^[^;]*; pattern in the second one with .*?\b (any 0+ chars other than a newline, as few as possible followed with a word boundary).
Matching approach
With a -match, you may do it the following way:
$s -match '^server=(.+?);database=([^;]+)'
The $Matches[1] will contain the server details and $Matches[2] will hold the DB info:
Name                           Value
----                           -----
2                              CSSDatabase
1                              ss8.server.com
0                              server=ss8.server.com;database=CSSDatabase
Pattern details
^ - start of string
server= - literal substring
(.+?) - Group 1: any 1+ non-linebreak chars as few as possible
;database= - literal substring
([^;]+) - 1+ chars other than ;
Another solution with a RegEx and named capture groups, similar to Wiktor's Matching Approach.
$s = 'server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000'
$RegEx = '^server=(?<databaseserver>[^;]+);database=(?<database>[^;]+)'
if ($s -match $RegEx){
$Matches.databaseserver
$Matches.database
}
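The named-group approach carries over to other regex engines almost unchanged. Here is a minimal sketch in Python's re module (an assumption of this example; Python spells named groups (?P<name>...)). The second half, which splits the string into a dictionary, is an alternative added here for comparison and is not taken from the answers above; the names CONN and settings are hypothetical.

```python
import re

CONN = ('server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;'
        'pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000')

# Named groups: each value runs up to, but not including, the next ';'.
m = re.match(r'server=(?P<databaseserver>[^;]+);database=(?P<database>[^;]+)', CONN)
if m:
    print(m.group('databaseserver'))
    print(m.group('database'))

# Order-independent alternative: split on ';' and the first '=' of each item.
settings = dict(item.split('=', 1) for item in CONN.split(';'))
print(settings['server'], settings['database'])
```

The regex depends on server= and database= appearing first and in that order; the dictionary version does not, at the cost of parsing every key.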

Trying to use /^\s*$/ match multiple blank lines and replace them failed and get a confusing result

Perl version : 5.16.01
I'm reading a book about regex which is based on Perl 5.8.
The book says that s/^\s*$/blabla/mg can match and replace multiple blank lines.
But when I practiced, I got a confusing result.
code:
$text = "c\n\n\n\n\nb";
$text =~ s/^\s*$/<p>/mg;
print "$text";
Here is the result:
C:\Users\Administrator\Desktop\regex>perl t2h.pl
c
<p><p>
b
I want to know why I got two <p> instead of a single one between 'c' and 'b'. Did the behavior of $ in Perl change after 5.8?
The lesson here is to be wary of regular expressions that can match a zero-width pattern; you could get unexpected results.
We can see what's happening here by showing the prematch, match and post match of both replacements:
use strict;
use warnings;
my $text = "c\n\n\n\nb";
$text =~ s{^\s*$}{
printf qq{<"%s" - "%s" - "%s">\n}, map s/\n/\\n/gr, ($`, $&, $');
"<p>"
}emg;
$text =~ s/\n/\\n/g;
print qq{Result: "$text"};
Outputs <"Prematch" - "Match" - "Postmatch">:
<"c\n" - "\n\n" - "\nb">
<"c\n\n\n" - "" - "\nb">
Result: "c\n<p><p>\nb"
Basically, the regex first matches from position 2 up to position 4, consuming 2 newline characters. After that replacement it resumes searching at position 4 and matches a zero-width pattern there, so it adds a second <p>.
One reason this isn't intuitive is that our regex has already replaced the \n\n at positions 2 and 3 with a <p>. However, lookbehind-style assertions (of which ^ is a special variant) see the string as it originally was, not as modified by previous passes of a /g substitution. Therefore, when matching at position 4, the regex sees c\n\n\n behind it instead of c\n<p> (as demonstrated in the output above), so ^ matches again and $ matches at the same position with nothing in between.
The solution is to not allow zero-width matches, in this instance by using + instead of *.
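The same effect can be reproduced outside Perl. The sketch below uses Python's re module and assumes Python 3.7 or later, where re.sub also replaces an empty match that sits directly after a previous non-empty match; the variable names are hypothetical.

```python
import re

text = "c\n\n\n\nb"

# \s* may match nothing, so a second, empty match is found right after the first one.
with_star = re.sub(r'^\s*$', '<p>', text, flags=re.MULTILINE)

# \s+ needs at least one whitespace character, so the empty match cannot occur.
with_plus = re.sub(r'^\s+$', '<p>', text, flags=re.MULTILINE)

print(repr(with_star))
print(repr(with_plus))
```

With * the result contains two <p> tags, as in the Perl run above; with + only one remains.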
Secondary Example
Another example of this is the following, simpler regex
my $text = "caab";
$text =~ s/a*/<p>/g;
print $text;
Outputs:
<p>c<p><p>b<p>
The positional breakdown of this matching is as follows:
0 c - match a zero width pattern
1 a - Match a 2 character pattern
2 a
3 b - Match a zero width pattern
4 $ - match a zero width pattern
Therefore, the final lesson is to simply be wary of regexes that will match a zero width pattern.
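The positional breakdown can be checked by listing the span of every match. This is a small sketch in Python's re module, assuming Python 3.7 or later, whose handling of empty matches agrees with the Perl output for this input; the variable names are hypothetical.

```python
import re

# a* can match the empty string, so it also matches before "c", after "aa", and at the end.
star_result = re.sub(r'a*', '<p>', 'caab')

# a+ only matches real runs of "a".
plus_result = re.sub(r'a+', '<p>', 'caab')

# (start, end) of every a* match, including the zero-width ones.
star_spans = [m.span() for m in re.finditer(r'a*', 'caab')]

print(star_result)
print(plus_result)
print(star_spans)
```

Every span whose start equals its end is a zero-width match, and each of those still produces a replacement.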
The quantifier * matches 0 or more times,
the quantifier + matches 1 or more times.
So your regex should be written as s/^\s+$/<p>/mg
Alternatively, if you simply want to remove every line break, you can try this:
#!/usr/bin/perl
$text = "c\n\n\n\n\nb";
$text =~ s/[\r\n]//g;
print $text;
DEMO http://ideone.com/WmVFHz

Negative regex for Perl string pattern match

I have this regex:
if($string =~ m/^(Clinton|[^Bush]|Reagan)/i)
{print "$string\n"};
I want to match with Clinton and Reagan, but not Bush.
It's not working.
Your regex does not work because [] defines a character class, but what you want is a lookahead:
(?=) - Positive look ahead assertion foo(?=bar) matches foo when followed by bar
(?!) - Negative look ahead assertion foo(?!bar) matches foo when not followed by bar
(?<=) - Positive look behind assertion (?<=foo)bar matches bar when preceded by foo
(?<!) - Negative look behind assertion (?<!foo)bar matches bar when NOT preceded by foo
(?>) - Once-only subpatterns (?>\d+)bar Performance enhancing when bar not present
(?(x)) - Conditional subpatterns (?(3)foo|fu)bar matches foo if 3rd subpattern has matched, fu if not
(?#) - Comment (?# Pattern does x y or z)
So try: (?!bush)
Sample text:
Clinton said
Bush used crayons
Reagan forgot
Just omitting a Bush match:
$ perl -ne 'print if /^(Clinton|Reagan)/' textfile
Clinton said
Reagan forgot
Or if you really want to specify:
$ perl -ne 'print if /^(?!Bush)(Clinton|Reagan)/' textfile
Clinton said
Reagan forgot
Your regex says the following:
/^ - if the line starts with
( - start a capture group
Clinton - "Clinton"
| - or
[^Bush] - Any single character except "B", "u", "s" or "h"
| - or
Reagan) - "Reagan". End capture group.
/i - Make matches case-insensitive
So, in other words, your middle part of the regex is screwing you up. As it is a "catch-all" kind of group, it will allow any line that does not begin with any of the upper or lower case letters in "Bush". For example, these lines would match your regex:
Our president, George Bush
In the news today, pigs can fly
012-3123 33
You either make a negative look-ahead, as suggested earlier, or you simply make two regexes:
if( ($string =~ m/^(Clinton|Reagan)/i) and
($string !~ m/^Bush/i) ) {
print "$string\n";
}
As mirod has pointed out in the comments, the second check is quite unnecessary when using the caret (^) to match only beginning of lines, as lines that begin with "Clinton" or "Reagan" could never begin with "Bush".
However, it would be valid without the carets.
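The effect of the stray character class is easy to demonstrate. Below is a minimal sketch in Python's re module (an assumption of this example; character classes and alternation work the same way as in Perl here), run on sample lines taken from this thread. The names ORIGINAL, FIXED, original_hits and fixed_hits are hypothetical.

```python
import re

# [^Bush] is a character class: it matches ONE character that is not B, u, s or h.
ORIGINAL = re.compile(r'^(Clinton|[^Bush]|Reagan)', re.IGNORECASE)

# Listing only the wanted names is enough to exclude everything else.
FIXED = re.compile(r'^(Clinton|Reagan)', re.IGNORECASE)

lines = [
    "Clinton said",
    "Bush used crayons",
    "Reagan forgot",
    "In the news today, pigs can fly",
]

original_hits = [line for line in lines if ORIGINAL.search(line)]
fixed_hits = [line for line in lines if FIXED.search(line)]

print(original_hits)
print(fixed_hits)
```

The original pattern lets the "pigs can fly" line through because it starts with a character outside the class; the fixed pattern keeps only the Clinton and Reagan lines.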
What's wrong with using two regexes (or three)? This makes your intentions clearer and may even improve your performance:
if ($string =~ /^(Clinton|Reagan)/i && $string !~ /Bush/i) { ... }
if (($string =~ /^Clinton/i || $string =~ /^Reagan/i)
&& $string !~ /Bush/i) {
print "$string\n"
}
If my understanding is correct then you want to match any line which has Clinton and Reagan, in any order, but not Bush. As suggested by Stuck, here is a version with lookahead assertions:
#!/usr/bin/perl
use strict;
use warnings;
my $regex = qr/
(?=.*clinton)
(?!.*bush)
.*reagan
/ix;
while (<DATA>) {
chomp;
next unless (/$regex/);
print $_, "\n";
}
__DATA__
shouldn't match - reagan came first, then clinton, finally bush
first match - first two: reagan and clinton
second match - first two reverse: clinton and reagan
shouldn't match - last two: clinton and bush
shouldn't match - reverse: bush and clinton
shouldn't match - and then came obama, along comes mary
shouldn't match - to clinton with perl
Results
first match - first two: reagan and clinton
second match - first two reverse: clinton and reagan
As desired, it matches any line that has Reagan and Clinton in any order.
You may want to try reading how lookahead assertions work with examples at http://www252.pair.com/comdog/mastering_perl/Chapters/02.advanced_regular_expressions.html
they are very tasty :)
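For comparison, here is the same combination of lookaheads as a sketch in Python's re module (an assumption of this example), run on a few of the sample lines from the __DATA__ section. The pattern is additionally anchored with ^, which is a choice of this sketch and not part of the Perl version, so that the negative lookahead is always evaluated from the start of the line. The names BOTH_NO_BUSH, samples and matched are hypothetical.

```python
import re

# Both names must appear somewhere on the line; "bush" must not appear at all.
BOTH_NO_BUSH = re.compile(
    r"""
    ^(?=.*clinton)   # somewhere ahead: clinton
    (?!.*bush)       # nowhere ahead: bush
    .*reagan         # consume text up to and including reagan
    """,
    re.IGNORECASE | re.VERBOSE,
)

samples = [
    "shouldn't match - reagan came first, then clinton, finally bush",
    "first match - first two: reagan and clinton",
    "second match - first two reverse: clinton and reagan",
    "shouldn't match - last two: clinton and bush",
]

matched = [s for s in samples if BOTH_NO_BUSH.search(s)]
for line in matched:
    print(line)
```

Because lookaheads do not consume any text, the order in which clinton and reagan appear on the line does not matter.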