REGEX - Extract OU from Distinguished Name

I need to extract the "OU" part from my Distinguished Name with a regex.
For example:
"CN=DAVID Jean Louis (a),OU=Coiffeur,OU=France,DC=Paris,DC=France"
"CN=PROVOST Franck,OU=Coiffeur,OU=France,DC=Paris,DC=France"
"CN=SZHARCOFF Michel (AB),OU=Coiffeur_Inter,OU=France,DC=Paris,DC=France"
I need to have
"OU=Coiffeur,OU=France"
"OU=Coiffeur,OU=France"
"OU=Coiffeur_Inter,OU=France"
I tried "CN=SZHARCOFF Michel (AB),OU=Coiffeur_Inter,OU=France,DC=Paris,DC=France" -match "^CN=[\w-()]*[\w]*"
but it doesn't succeed.

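You can match every OU= followed by one or more non-comma characters with the \bOU=[^,]+ regex and then join the matches with a comma. The result is stored in $ous rather than $matches, because $Matches is an automatic variable that -match overwrites:
$ous = [regex]::matches($s, '\bOU=[^,]+') | % { $_.value }
$res = $ous -join ','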
Output for the first string:
OU=Coiffeur,OU=France
Pattern details
\b - a word boundary to only match OU as a whole word
OU= - a literal substring
[^,]+ - 1 or more (+) characters other than (as [^...] is a negated character class) a comma.
See the regex demo.
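For readers outside PowerShell, the same match-and-join idea can be sketched in Python. This is only an illustration and not part of the original answer; the helper name extract_ous is made up here. Note that Python's re is case-sensitive by default, unlike PowerShell's -match.

```python
import re

# Hypothetical helper: collect every "OU=..." component of a DN and rejoin them.
def extract_ous(dn: str) -> str:
    # \bOU= must start at a word boundary; [^,]+ runs up to the next comma.
    return ",".join(re.findall(r"\bOU=[^,]+", dn))

dn = "CN=SZHARCOFF Michel (AB),OU=Coiffeur_Inter,OU=France,DC=Paris,DC=France"
result = extract_ous(dn)
print(result)
```

Like the PowerShell version, this assumes no OU value contains an escaped comma.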

This pattern will support DistinguishedName properties containing commas, and provides named groups for matches. I use this in PowerShell to parse an ADObject's parent DN, etc.
^(?:(?<cn>CN=(?<name>.*?)),)?(?<parent>(?:(?<path>(?:CN|OU).*?),)?(?<domain>(?:DC=.*)+))$
See Regexr demo: https://regexr.com/5bt64
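A rough Python port of this pattern, assuming only that .NET's (?<name>...) groups are rewritten as (?P<name>...), may help show what each named group ends up holding. The sample DN with an escaped comma is invented for the illustration.

```python
import re

# Same pattern as above, with .NET-style (?<x>...) rewritten as Python's (?P<x>...).
DN_PATTERN = re.compile(
    r"^(?:(?P<cn>CN=(?P<name>.*?)),)?"
    r"(?P<parent>(?:(?P<path>(?:CN|OU).*?),)?(?P<domain>(?:DC=.*)+))$"
)

# Made-up sample: the CN itself contains an escaped comma.
dn = r"CN=Doe\, John,OU=Coiffeur,OU=France,DC=Paris,DC=France"
m = DN_PATTERN.match(dn)
parts = m.groupdict() if m else {}

for key in ("cn", "name", "parent", "path", "domain"):
    print(key, "=", parts.get(key))
```

Here the path group holds the OU chain and parent holds everything after the leading CN.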

How to use regex in PowerShell to format a dynamic string into an array?

I have this string...
12345;#john, doe (io-124)[Company I work for], 8732;#jane, smith (dos-12)[my company], 902743;#jack, johnson (123-as), 1824;#sam, sampson (1235-oi), 089932;#jessie, jackson (1232-ahs)[top notch company], 2134;#last, one (123-fl)
I want this output in an array...
12345
john, doe (io-124)[Company I work for]
8732
jane, smith (dos-12)[my company]
902743
jack, johnson (123-as)
1824
sam, sampson (1235-oi)
089932
jessie, jackson (1232-ahs)[top notch company]
2134
last, one (123-fl)
I'm still learning regex, but I managed to find the expression "\d+;", which gives me the number at the beginning of each substring followed by a ";" that I can trim off. I don't know how to extract it, though. If I could, I would be left with the names with a "#" in front of them, so I could split on those and then trim the spaces off the ends. Putting the results in 2 arrays would be fine too, maybe even better.
Hope this makes sense.
Thank you all in advance!
You might use a pattern with 2 capture groups and add the groups to an array
(\d+);#(.*?)(?=,\s+\d+;|$)
Explanation
(\d+) Capture 1+ digits in group 1
;# Match literally
(.*?) Capture group 2, match as few chars as possible (non-greedy)
(?= Positive lookahead to assert that what is to the right is
,\s+\d+;|$ Match a comma, 1+ whitespace chars, 1+ digits and ; or assert the end of the string to also get the last item
) Close the lookahead
Regex demo and a Powershell demo
$regex = '(\d+);#(.*?)(?=,\s+\d+;|$)'
$items = [System.Collections.ArrayList]@()
Select-String $regex -input $str -AllMatches | Foreach-Object {$_.Matches} | Foreach-Object {
    # Add the number (group 1) and then the name (group 2) to the list
    $items.Add($_.Groups[1].Value) | Out-Null
    $items.Add($_.Groups[2].Value) | Out-Null
}
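As a cross-check of the two-group pattern, here is a small Python sketch that applies the same regex with re.findall. The input is a shortened version of the string from the question, and the variable names are arbitrary.

```python
import re

# Shortened sample of the string from the question.
text = ("12345;#john, doe (io-124)[Company I work for], "
        "8732;#jane, smith (dos-12)[my company], "
        "2134;#last, one (123-fl)")

pattern = re.compile(r"(\d+);#(.*?)(?=,\s+\d+;|$)")

items = []
for number, name in pattern.findall(text):
    # Group 1 is the leading number, group 2 the lazily matched name.
    items.append(number)
    items.append(name)

print(items)
```

The lookahead is what lets the lazy group stop before the next ", digits;" without consuming it.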
You can use
$result = $text -split '(?:,\s*)?(\d+);#?'
# Or, to also remove the empty items:
$result = $text -split '(?:,\s*)?(\d+);#?' | Where-Object {$_}
See the regex demo
The regex matches
(?:,\s*)? - an optional sequence of a comma and then zero or more whitespaces
(\d+) - captures into Group 1 (and thus also outputs these values) one or more digits
;#? - a ; and an optional #.
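The split-with-a-capturing-group trick also exists in Python, where re.split keeps the text of capturing groups in its result. The following is a sketch on a shortened input, not the original PowerShell code.

```python
import re

# Shortened sample of the string from the question.
text = ("12345;#john, doe (io-124)[Company I work for], "
        "8732;#jane, smith (dos-12)[my company], "
        "2134;#last, one (123-fl)")

# The captured digits are kept in the output; the separators around them are dropped.
parts = re.split(r"(?:,\s*)?(\d+);#?", text)

# The first element is an empty string because the text starts with a separator.
result = [p for p in parts if p]
print(result)
```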

Renaming files with criteria

Need some advice. I'm trying to do something with regular expressions that might not be possible, and if it is possible it's over my head. I can't get anything to work. I'm trying to create a tagging system for my PDF files. So if I have this file name:
"csharp 8 in a nutshell[studying programming csharp ebooks].pdf"
I would like all the words inside the '[ ]' to have a '#' in front of them. So the above file name would look like this:
"csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf"
The problem is keeping the '#' inside the '[ ]'. For example I'd rather the 'csharp' at the very front of the file name not have the '#'.
Also, I'm using a bulk renamer called 'Bulk Rename Utility' to help me.
Can this be done?
If it can, any hints on how?
Thanks.
Bulk Rename Utility does not support replacing multiple matches, you can only match the whole file name and perform replacements using capturing groups/backreferences.
Since you are using Windows, I suggest using Powershell:
cd 'C:\YOUR_FOLDER\HERE'
Get-ChildItem -File | Rename-Item -NewName { $_.Name -replace '(?<=\[[^][]*?)\w+(?=[^][]*])','#$&' }
See this regex demo and the proof it works with .NET regex flavor.
(?<=\[[^][]*?) - right before this location, there must be a [ and then any amount of chars other than [ and ], as few as possible
\w+ - 1+ word chars
(?=[^][]*]) - right after this location, there must be any amount of chars other than [ and ], as many as possible, and then a ] char.
The replacement is # + the whole match value ($&).
Also, you may use
Get-ChildItem -File | Rename-Item -NewName { $_.Name -replace '(\G(?!\A)[^][\w]+|\[)(\w+)','$1#$2' }
See this regex demo and .NET regex test.
(\G(?!\A)[^][\w]+|\[) - Group 1 ($1): either the end of the previous match and 1+ chars other than ], [ and word chars, or a [ char
(\w+) - Group 2 ($2): one or more word chars.
If you only want to rename *.pdf files, replace Get-ChildItem -File with Get-ChildItem *.pdf.
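Python's built-in re module supports neither variable-length lookbehind nor \G, so the two patterns above cannot be copied over as they are. A different technique that gives the same result is to match each bracketed part and rewrite its words in a callback. The function name below is hypothetical, and the sketch only computes the new name; it does not rename anything on disk.

```python
import re

def tag_bracketed_words(name: str) -> str:
    """Prefix every word inside [...] with '#', leaving the rest untouched."""
    def add_hashes(match: re.Match) -> str:
        # match.group(1) is the text between the brackets.
        return "[" + re.sub(r"\w+", r"#\g<0>", match.group(1)) + "]"
    # [^\[\]]* = any run of characters that are not brackets.
    return re.sub(r"\[([^\[\]]*)\]", add_hashes, name)

new_name = tag_bracketed_words(
    "csharp 8 in a nutshell[studying programming csharp ebooks].pdf"
)
print(new_name)
```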
I assume there is at most one bracket-delimited substring.
You can replace zero-length matches of the following regular expression with '#' when using Perl, Ruby, Python's alternative regex engine, R with perl=TRUE, or languages that use the PCRE regex engine, which include PHP. With the exception of Ruby, the case-insensitive (/i) and global (/g) flags need to be set. Ruby only requires the case-insensitive flag.
r = /(?:^.*\[ *|\G(?<!^)|[a-z]+ +)\K(?<=\[| )(?=[a-z][^\[\]]*\])/
If using Ruby, for example, one would execute
str = "csharp 8 in a nutshell[studying programming csharp ebooks].pdf"
str.gsub(r,'#')
#=> "csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf"
I believe all of the languages I named above allow one to run a short script from the command line. (I provide a Ruby script below.)
The regex engine performs the following operations.
(?: : begin non-capture group
^.*\[ * : match beginning of string then 0+ characters then '['
then 0+ spaces
| : or
\G : asserts the position at the end of the previous match
or at the start of the string for the first match
(?<!^) : use a negative lookbehind to assert that the current
location is not the start of the string
| : or
[a-z]+ + : match 1+ letters then 1+ spaces
) : end non-capture group
\K : reset beginning of reported match to current location
and discard all previously-matched characters from match
to be returned
(?<= : begin positive lookbehind
\[|[ ] : match '[' or a space
) : end positive lookbehind
(?= : begin positive lookahead
[a-z][^\[\]]*\] : match a letter then 0+ characters other than '[' and ']'
then ']'
) : end positive lookahead
Another possibility (illustrated with Ruby) is to break the string into three pieces, modify the middle one, then rejoin the pieces:
first, mid, last = str.split /(?<=\[)|(?=\])/
#=> ["csharp 8 in a nutshell[",
# "studying programming csharp ebooks",
# "].pdf"]
first + mid.gsub(/(?<=\A| )(?! )/,'#') + last
#=> "csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf"
The regex used by split reads, "match a (zero-width) string that is preceded by '[' ((?<=\[) being a positive lookbehind) or is followed by ']' ((?=\]) being a positive lookahead)". By matching zero-width strings, split does not remove any characters.
gsub's regex reads, "match a zero-width string that is at the start of the string or is preceded by a space, and is followed by a character other than a space ((?! ) being a negative lookahead)". It could alternatively be written /(?<![^ ])(?! )/ ((?<![^ ]) being a negative lookbehind).
A variant:
first + mid.split.map { |s| '#' + s }.join(' ') + last
#=> "csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf"
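The three-piece idea carries over to other languages without any regex. As a hedged Python sketch (the function name tag_middle is invented, and at most one bracketed part per line is assumed):

```python
def tag_middle(line: str) -> str:
    """Split around the first [...] pair, tag the words inside, and rejoin."""
    first, opening, rest = line.partition("[")
    if not opening:
        return line  # no '[' at all: nothing to tag
    mid, closing, last = rest.partition("]")
    if not closing:
        return line  # unbalanced bracket: leave the line alone
    tagged = " ".join("#" + word for word in mid.split())
    return first + "[" + tagged + "]" + last

print(tag_middle("Little [Miss Muffet sat on her] tuffet"))
```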
I created a file named 'in' that contains the following two lines:
Little [Miss Muffet sat on her] tuffet
eating her [curds and] whey
Here is an example of a (Ruby) script that could be run from the command line to perform the necessary replacements.
ruby -e "File.open('out', 'w') do |fout|
  File.foreach('in') do |str|
    first, mid, last = str.split(/(?<=\[)|(?=\])/)
    fout.puts(first + mid.gsub(/(?<=\A| )(?! )/,'#') + last)
  end
end"
This produces a file named 'out' that contains these two lines:
Little [#Miss #Muffet #sat #on #her] tuffet
eating her [#curds #and] whey

Explode string with comma when comma is not inside any brackets

I have the string "xyz(text1,(text2,text3)),asd". I want to explode it on , with the only condition being that the split happens only on commas that are not inside any brackets (here, ()).
I saw many such solutions on Stack Overflow, but they didn't work with my pattern. (example1) (example2)
What is the correct regex for my pattern?
In my case xyz(text1,(text2,text3)),asd
result should be
xyz(text1,(text2,text3)) and asd.
You may use a matching approach using a regex with a subroutine:
preg_match_all('~\w+(\((?:[^()]++|(?1))*\))?~', $s, $m)
See the regex demo
Details
\w+ - 1+ word chars
(\((?:[^()]++|(?1))*\))? - an optional capturing group matching
\( - a (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1+ chars other than ( and )
| - or
(?1) - the whole Group 1 pattern
\) - a ).
PHP demo:
$rx = '/\w+(\((?:[^()]++|(?1))*\))?/';
$s = 'xyz(text1,(text2,text3)),asd';
if (preg_match_all($rx, $s, $m)) {
    print_r($m[0]);
}
Output:
Array
(
    [0] => xyz(text1,(text2,text3))
    [1] => asd
)
If the requirement is to split at , but only outside nested parentheses, another idea is to use preg_split and skip the parenthesized parts, also by means of a recursive pattern.
$res = preg_split('/(\((?>[^)(]*(?1)?)*\))(*SKIP)(*F)|,/', $str);
See this pattern demo at regex101 or a PHP demo at eval.in
The left side of the pipe character is used to match and skip what is inside the parentheses.
On the right side it will match the remaining commas that are left outside of the parentheses.
The pattern used is a variant of several common patterns for matching nested parentheses.
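Where recursive patterns are not available (Python's built-in re module has none), the same split can be done with a simple depth counter instead of a regex. This is an alternative technique, not a translation of the PCRE patterns above, and split_top_level is a made-up name.

```python
def split_top_level(text: str, sep: str = ",") -> list:
    """Split on sep, but only where the parenthesis nesting depth is zero."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate a stray ')'
        if ch == sep and depth == 0:
            # Top-level separator: close the current piece.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts

print(split_top_level("xyz(text1,(text2,text3)),asd"))
```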

PowerShell -replace to get string between two different characters

I am currently using split to get what I need, but I am hoping there is a better way in PowerShell.
Here is the string:
server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000
I want to get the server and the database without the database= or server= prefixes.
Here is the method I am currently using:
$databaseserver = (($details.value).split(';')[0]).split('=')[1]
$database = (($details.value).split(';')[1]).split('=')[1]
This outputs to:
ss8.server.com
CSSDatabase
I would like it to be as simple as possible.
Thank you in advance
Replacing approach
You may use the following regex replace:
$s = 'server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000'
$dbserver = $s -replace '^server=([^;]+).*', '$1'
$db = $s -replace '^[^;]*;database=([^;]+).*', '$1'
The technique is to match and capture (with (...)) what we need and just match what we need to remove.
Pattern details:
^ - start of the line
server= - a literal substring
([^;]+) - Group 1 (what $1 refers to) matching 1+ chars other than ;
.* - any 0+ chars other than a newline, as many as possible
Pattern 2 is almost the same, the capturing group is shifted a bit to capture another detail, and some more literal values are added to match the right context.
Note: if the values you need to extract may appear anywhere in the string, replace ^ in the first one and ^[^;]*; pattern in the second one with .*?\b (any 0+ chars other than a newline, as few as possible followed with a word boundary).
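The "match everything, keep only the captured group" technique is not specific to PowerShell. A minimal Python sketch with re.sub, using the same two patterns, could look like this (variable names are arbitrary):

```python
import re

s = ("server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;"
     "pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000")

# Each pattern consumes the whole line, so replacing the match with group 1
# leaves only the captured value.
db_server = re.sub(r"^server=([^;]+).*", r"\1", s)
db = re.sub(r"^[^;]*;database=([^;]+).*", r"\1", s)

print(db_server)
print(db)
```

One caveat of the replacing approach: when the pattern does not match, the input string comes back unchanged rather than empty.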
Matching approach
With a -match, you may do it the following way:
$s -match '^server=(.+?);database=([^;]+)'
The $Matches[1] will contain the server details and $Matches[2] will hold the DB info:
Name                           Value
----                           -----
2                              CSSDatabase
1                              ss8.server.com
0                              server=ss8.server.com;database=CSSDatabase
Pattern details
^ - start of string
server= - literal substring
(.+?) - Group 1: any 1+ non-linebreak chars as few as possible
;database= - literal substring
([^;]+) - 1+ chars other than ;
Another solution with a RegEx and named capture groups, similar to Wiktor's Matching Approach.
$s = 'server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000'
$RegEx = '^server=(?<databaseserver>[^;]+);database=(?<database>[^;]+)'
if ($s -match $RegEx){
    $Matches.databaseserver
    $Matches.database
}
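For comparison, named groups work much the same way in Python, where the syntax is (?P<name>...) instead of (?<name>...). The sketch below reuses the group names from the PowerShell pattern.

```python
import re

s = ("server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;"
     "pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000")

pattern = re.compile(
    r"^server=(?P<databaseserver>[^;]+);database=(?P<database>[^;]+)"
)

m = pattern.match(s)
if m:
    # Named groups are read by name, like $Matches.databaseserver above.
    print(m.group("databaseserver"))
    print(m.group("database"))
```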

Match a number in a string with letters and numbers

I need to write a Perl regex to match numbers in a word with both letters and numbers.
Example: test123. I want to write a regex that matches only the number part and capture it
I am trying this \S*(\d+)\S* and it captures only the 3 but not 123.
Regex atoms will match as much as they can.
Initially, the first \S* matched "test123", but the regex engine had to backtrack to allow \d+ to match. The result is:
 +------------------- Matches "test12"
 |      +-------------- Matches "3"
 |      |      +--------- Matches ""
 |      |      |
---   -----   ---
\S*   (\d+)   \S*
All you need is:
my ($num) = "test123" =~ /(\d+)/;
It'll try to match at position 0, then position 1, ... until it finds a digit, then it will match as many digits as it can.
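The backtracking behaviour described above is the same in most backtracking regex engines, so it can be reproduced outside Perl. A small Python sketch:

```python
import re

word = "test123"

# Greedy \S* grabs the whole word, then gives back just one character
# so that \d+ can match: only the last digit is captured.
greedy = re.search(r"\S*(\d+)\S*", word).group(1)

# Without the surrounding \S*, the engine scans to the first digit
# and \d+ takes every digit from there.
plain = re.search(r"(\d+)", word).group(1)

# A lazy prefix also leaves all the digits to the capture group.
lazy = re.search(r"\S+?(\d+)", word).group(1)

print(greedy, plain, lazy)
```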
The * quantifiers in your regex are greedy; that's why they also "eat" the digits. Exactly what @Marc said: you don't need them.
perl -e '$_ = "qwe123qwe"; s/(\d+)/$numbers=$1/e; print $numbers . "\n";'
"something122320" =~ /(\d+)/ will return 122320; this is probably what you're trying to do ;)
\S matches any non-whitespace characters, including digits. You want \d+:
my ($number) = 'test123' =~ /(\d+)/;
Were it a case where a non-digit was required (say before, per your example), you could use the following non-greedy expressions:
/\w+?(\d+)/ or /\S+?(\d+)/
(The second one is more in tune with your \S* specification.)
Your expression satisfies any condition with one or more digits, and that may be what you want. It could be a string of digits surrounded by spaces (" 123 "), because the border between the last space and the first digit satisfies zero-or-more non-space, same thing is true about the border between the '3' and the following space.
Chances are that you don't need any specification and capturing the first digits in the string is enough. But when it's not, it's good to know how to specify expected patterns.
I think parentheses signify capture groups, which is exactly what you don't want. Remove them. You're looking for /\d+/ or /[0-9]+/