match regex and replace bug with special characters - regex

I've built a script to read all Active Directory Group Memberships and save them to a file.
Problem is, the Get-ADPrincipalGroupMembership cmdlet outputs all groups like this:
CN=Group_Name,OU=Example Mail,OU=Example Management, DC=domain,DC=de
So I need to do a bit of a regex and/or replacement magic here to replace the whole line with just the first string beginning from "CN=" to the first ",".
The result would be like this:
Group_Name
So, there is one AD group that's not gonna be replaced. I already got an idea why tho, but I don't know how to work around this. In our AD there is a group with a special character, something like this:
CN=AD_Group_Name+up,OU=Example Mail,OU=Example Management, DC=domain,DC=de
So, because of the little "+" sign, the whole line doesn't even get touched.
Does anyone know why this is happening?
Import-Module ActiveDirectory
# Get Username
Write-Host "Please enter the Username you want to export the AD-Groups from."
$UserName = Read-Host "Username"
# Set Working-Dir and Output-File Block:
$WorkingDir = "C:\Users\USER\Desktop"
Write-Host "Working directory is set to $WorkingDir"
$OutputFile = $WorkingDir + "\" + $UserName + ".txt"
# Save Results to File
Get-ADPrincipalGroupMembership $UserName |
select -Property distinguishedName |
Out-File $OutputFile -Encoding UTF8
# RegEx-Block to find every AD-Group in Raw Output File and delete all
# unnecessary information:
[regex]$RegEx_mark_whole_Line = "^.*"
# The ^ matches the start of a line (in Ruby) and .* will match zero or more
# characters other than a newline
[regex]$RegEx_mark_ADGroup_Name = "(?<=CN=).*?(?=,)"
# This regex matches everything behind the first "CN=" in line and stops at
# the first "," in the line. Then it should jump to the next line.
# Replace-Block (line by line): Replace whole line with just the AD group
# name (distinguishedName) of this line.
foreach ($line in Get-Content $OutputFile) {
if ($line -like "CN=*") {
$separator = "CN=",","
$option = [System.StringSplitOptions]::RemoveEmptyEntries
$ADGroup = $line.Split($separator, $option)
(Get-Content $OutputFile) -replace $line, $ADGroup[0] |
Set-Content $OutputFile -Encoding UTF8
}
}

Your group name contains a character (+) that has a special meaning in a regular expression (one or more times the preceding expression). To disable special characters escape the search string in your replace operation:
... -replace [regex]::Escape($line), $ADGroup[0]
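To see the difference on the group name from the question, here is a minimal sketch that runs both variants against an in-memory string. The sample DN is shortened and the variable names are only for illustration:

```powershell
# Sample DN from the question (shortened); the "+" is the troublemaker.
$line    = 'CN=AD_Group_Name+up,OU=Example Mail,DC=domain,DC=de'
$ADGroup = 'AD_Group_Name+up'

# Unescaped: "e+" means "one or more e", so the pattern never matches the
# literal "+" and -replace passes the line through unchanged.
$unescaped = $line -replace $line, $ADGroup

# Escaped: every regex metacharacter in $line is matched literally.
$escaped = $line -replace [regex]::Escape($line), $ADGroup

$unescaped   # still the full DN
$escaped     # AD_Group_Name+up
```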
However, I fail to see what you need that replacement for in the first place. Basically you're replacing a line in the output file with a substring from that line that you already extracted before. Just write that substring to the output file and you're done.
$separator = 'CN=', ','
$option = [StringSplitOptions]::RemoveEmptyEntries
(Get-Content $OutputFile) | ForEach-Object {
$_.Split($separator, $option)[0]
} | Set-Content $OutputFile
Better yet, use the Get-ADObject cmdlet to expand the names of the group members:
Get-ADPrincipalGroupMembership $UserName |
Get-ADObject |
Select-Object -Expand Name

First off, depending on what you're doing here this might or might not be a good idea. The CN is /not/ immutable so if you're storing it somewhere as a key you're likely to run into problems down the road. The objectGUID property of the group is a good primary key, though.
As far as getting this value, I think you can simplify this a lot. The name property that the cmdlet outputs will always have your desired value:
Get-ADPrincipalGroupMembership <username> | select name

Ansgar's answer is much better in terms of using the regex, but I believe that in this case you could do a dirty workaround with the IndexOf function. In your if-statement you could do the following:
if ($line -like "CN=*") {
$ADGroup = $line.Substring(3, $line.IndexOf(',')-3)
}
The reason this works here is that you know the output will begin with CN=YourGroupName meaning that you know that the string you want begins at the 4th character. Secondly, you know that the group name will not contain any comma, meaning that the IndexOf(',') will always find the end of that string so you don't need to worry about the nth occurrence of a string in a string.
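A slightly more defensive version of the same idea could look like the sketch below. Get-FirstCn is a hypothetical helper name, not something from the original script, and it simply returns $null when the line does not have the expected shape:

```powershell
# Hypothetical helper (not part of the original script): returns the value of
# the leading "CN=" component, or $null if the string has an unexpected shape.
function Get-FirstCn {
    param([string]$DistinguishedName)

    $comma = $DistinguishedName.IndexOf(',')
    if ($DistinguishedName -like 'CN=*' -and $comma -gt 3) {
        # Skip the 3 characters of "CN=" and stop right before the first comma.
        return $DistinguishedName.Substring(3, $comma - 3)
    }
    return $null
}

$result = Get-FirstCn 'CN=AD_Group_Name+up,OU=Example Mail,OU=Example Management,DC=domain,DC=de'
$result   # AD_Group_Name+up
```

No regex is involved, so the "+" needs no special treatment here.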

Related

Powershell script to replace link:lalala.html[lalala] with xref:lalala.adoc[lalala] capture pattern and replace recursively

I have a folder full of text documents in .adoc format that have some text in them. The text is following: link:lalala.html[lalala]. I want to replace this text with xref:lalala.adoc[lalala]. So, basically, just replace link: with xref:, .html with .adoc, leave all the rest unchanged.
But the problem is that lalala can be anything from a word to ../topics/halva.html.
I definitely know that I need to use regex patterns, I previously used similar script. A replace directive wrapped in an object:
Get-ChildItem -Path *.adoc -file -recurse | ForEach-Object {
$lines = Get-Content -Path $PSItem.FullName -Encoding UTF8 -Raw
$patterns = @{
'(\[\.dfn \.term])#(.*?)#' = '$1_$2_' ;
}
$option = [System.Text.RegularExpressions.RegexOptions]::Singleline
foreach($k in $patterns.Keys){
$pat = [regex]::new($k, $option)
$lines = $pat.Replace($lines, $patterns.$k)
}
$lines | Set-Content -Path $PSItem.FullName -Encoding UTF8 -Force
}
Looks like I need a different script since the new task cannot be added as just another object. I could've just replaced each part separately, using two objects: replace link: with xref:, then replace .html with .adoc.
But this can interfere with other links that end with .html and don't start with link:. In the text, absolute links usually don't have link: in the beginning. They always start with http:// or https://. And they still may or may not end with .html. So the best idea is to take the whole string link:lalala.html[lalala] and try to replace it with xref:lalala.adoc[lalala].
I need the help of someone who knows regex and PowerShell, please this would save me.
As a pattern, you might use
\blink:(.+?)\.html(?=\[[^][]*])
\blink: Match link:
(.+?) Capture 1+ chars, as few as possible, in group 1
\.html match .html
(?=\[[^][]*]) Assert that an opening through closing square bracket follows directly to the right
In the replacement use group 1 using $1
xref:$1.adoc
Example
$Strings = @("link:lalala.html[lalala]", "link:../topics/halva.html[../topics/halva.html]")
$Strings -replace "\blink:(.+?)\.html(?=\[[^][]*])",'xref:$1.adoc'
Output
xref:lalala.adoc[lalala]
xref:../topics/halva.adoc[../topics/halva.html]
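Since the script in the question already loops over a hashtable of patterns, this pattern could presumably be dropped into that table as one more entry. A minimal sketch on an in-memory string instead of files follows. The sample text and the example.com URL are made up, and the brackets inside the character class are written with explicit escapes, which should be equivalent to [^][]:

```powershell
# One more key/value pair in the style of the question's $patterns table.
$patterns = @{
    '\blink:(.+?)\.html(?=\[[^\]\[]*\])' = 'xref:$1.adoc'
}
$option = [System.Text.RegularExpressions.RegexOptions]::Singleline

# Made-up sample: one relative link: macro and one absolute URL ending in .html.
$text = 'See link:lalala.html[lalala] and https://example.com/page.html[site].'

foreach ($k in $patterns.Keys) {
    $pat  = [regex]::new($k, $option)
    $text = $pat.Replace($text, $patterns[$k])
}

$text   # See xref:lalala.adoc[lalala] and https://example.com/page.html[site].
```

The absolute URL is left alone because it is not preceded by link:.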

PowerShell to split a text file on specific string

I am trying to split a large text file into several files based on a specific string. Every time I see the string ABCDE - 3 I want to cut and paste the content up to that string in a new text file. I also want to extract the last 4 of the social, last name and first name. The new text file needs be saved as first_name,last_name and last 4 of social.
See text file example and a bit of initial code. I would feel much more comfortable doing it in Python but PowerShell is the only option.
$my_text = Get-Content .\ab.txt
$ssn_pattern = '([0-8]\d{2})-(\d{2})-(\d{4})'
ForEach ($file in $my_text)
To get the firstname, lastname and the last 4 digits of the social, you could make use of capturing groups and use those groups when assembling the filename.
From your pattern, only the last 4 digits should be grouped.
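As a small illustration of that grouping, assume a line shaped like LASTNAME, FIRSTNAME followed by the number. The sample line and the underscore-separated file name are invented here, since the real file layout is not shown:

```powershell
# Invented sample line; adjust the pattern to the real file layout.
$line = 'ATTN: SMITH, JOHN 123-45-6789'

# Group 1 = last name, group 2 = first name, group 3 = last 4 digits only.
$pattern = '([^,\s]+),\s*(\S+)\s+[0-8]\d{2}-\d{2}-(\d{4})'

if ($line -match $pattern) {
    # $Matches is filled by -match; index 0 is the whole match.
    $fileName = '{0}_{1}_{2}.txt' -f $Matches[2], $Matches[1], $Matches[3]
}

$fileName   # JOHN_SMITH_6789.txt
```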
You could use a pattern to start the match with TO: and from the next line get the values for the names and the number.
Then match all lines that do not start with ABCDE - 3 using a negative lookahead (?!...).
You can adjust the pattern and the code to match your exact text.
(?m)^[^\S\r\n]+TO:.*\r?\n\s*ATTN:\s*[A-Z]{3} ([^,\r\n]+),[^\S\r\n]*(.+?)[^\S\r\n]*[0-8]\d{2}-\d{2}-(\d{4})(?:\r?\n(?![^\S\r\n]+ABCDE - 3).*)*\r?\n[^\S\r\n]+ABCDE - 3.*
I constructed a code snippet using Stack Overflow postings, so this might be improved. It basically comes down to loading a raw string and getting all the matches.
Then loop over all the matches and get the groups to assemble a filename and save the full match as the content.
If there are names which contain spaces and you don't want those to be in the filename, you could replace those with an empty string.
Example code:
$my_text = Get-Content -Raw ./Documents/stack-overflow/powershell/ab.txt
$pattern = "(?m)^[^\S\r\n]+TO:.*\r?\n\s*ATTN:\s*[A-Z]{3} ([^,\r\n]+),[^\S\r\n]*(.+?)[^\S\r\n]*[0-8]\d{2}-\d{2}-(\d{4})(?:\r?\n(?![^\S\r\n]+ABCDE - 3).*)*\r?\n[^\S\r\n]+ABCDE - 3.*"
Select-String $pattern -input $my_text -AllMatches |
ForEach-Object { $_.Matches } |
ForEach-Object {
$fileName = -join ($_.groups[2].Value, $_.groups[1].Value, $_.groups[3].Value)
Write-Host $fileName
Set-Content -Path "your-path-here/$fileName.txt" -Value $_.Value
}
When I run this, I get 2 files with the content for each match:
MIOTTISAREMO2222.txt
MIOTTSANREMO1111.txt

Powershell regex match sequence doesn't work although it matches in Sublime Text find and replace

I am trying to create a Powershell regex statement to remove the top five lines of this output from a git diff file that has already been modified with Powershell regex.
[1mdiff --git a/uk1.adoc b/uk2.adoc</span>+++
[1mindex b5d3bf7..90299b8 100644</span>+++
[1m--- a/uk1.adoc</span>+++
[1m+++ b/uk2.adoc</span>+++
[36m@@ -1,9 +1,9 @@</span>+++
= Heading
Body text
Image shown because binary code doesn't show in the text
The following statement matches the text so the '= Heading' line is placed at the top of the page if I replace with nothing.
^[^=]*.[+][\n]
But in Powershell, it isn't matching the text.
Get-Content "result2.adoc" | % { $_ -Replace '^[^=]*.[+][\n]', '' } | Out-File "result3.adoc";
Any ideas about why it doesn't work in Powershell?
My overall goal is to create a diff file of two versions of an AsciiDoc file and then replace the ASCII codes with HTML/CSS code to display the resulting AsciiDoc file with green/red track changes.
The simplest - and faster - approach is to read the input file as a single, multiline string with Get-Content -Raw and let the regex passed to -replace operate across multiple lines:
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)' |
Set-Content result3.adoc
(?s) activates in-line option s which makes . match newline (\n) characters too.
^.+?\n(?==) matches from the start of the string (^) any number of characters (including newlines) (.+), non-greedily (?)
until a newline (\n) followed by a = is found.
(?=...) is a look-ahead assertion, which matches = without consuming it, i.e., without considering it part of the substring that matched.
Since no replacement operand is passed to -replace, the entire match is replaced with the implied empty string, i.e., what was matched is effectively removed.
As for what you tried:
The -replace operator passes its LHS through if no match is found, so you cannot use it to filter out non-matching lines.
Even if you match an undesired line in full and replace it with '' (the empty string), it will show up as an empty line in the output when sent to Set-Content or Out-File (>).
As for your specific regex, ^[^=]*.[+][\n] (whether or not the first ^ is followed by an ESC (0x1b) char.):
[\n] (just \n would suffice) tries to match a newline char. after a literal + ([+]), yet lines read individually with Get-Content (without -Raw) by definition are stripped of their trailing newline, so the \n will never match; instead, use $ to match the end of a line.
Instead of % (the built-in alias for the ForEach-Object cmdlet) you could have used ? (the built-in alias for the Where-Object cmdlet) to perform the desired filtering:
Get-Content result2.adoc | ? { $_ -notmatch '^\e\[' }
$_ -notmatch '^\e\[' returns $True only for lines that don't start (^) with an ESC character (\e, whose code point is 0x1b) followed by a literal (\) [, thereby effectively filtering out the lines before the = Heading line.
However, the multi-line -replace command at the top is a more direct and faster expression of your intent.
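The three behaviors described above can be compared side by side on a shortened, in-memory stand-in for result2.adoc. This is only a sketch, and the variable names are illustrative:

```powershell
$esc = [string][char]0x1b   # ESC as a string; works in Windows PowerShell too

# Shortened stand-in for result2.adoc, one array element per line.
$lines = @(
    ($esc + '[1mdiff --git a/uk1.adoc b/uk2.adoc</span>+++'),
    ($esc + '[1mindex b5d3bf7..90299b8 100644</span>+++'),
    '= Heading',
    'Body text'
)

# Line by line: each string has no trailing newline, so "\n" can never match
# and every line passes through unchanged.
$perLine = $lines | ForEach-Object { $_ -replace '^[^=]*.[+][\n]', '' }

# As one multiline string: drop everything before the first line starting with "=".
$raw     = $lines -join "`n"
$trimmed = $raw -replace '(?s)^.+?\n(?==)'

# Filtering instead of replacing: keep lines that do not start with ESC + "[".
$kept = $lines | Where-Object { $_ -notmatch '^\e\[' }
```

With this input, $trimmed and $kept both contain only the = Heading and Body text lines, while $perLine is identical to $lines.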
Here is the code I ended up with after help from @mklement0. This Powershell script creates MS Word-style track changes for two versions of an AsciiDoc file. It creates the Diff file, uses regex to replace ASCII codes with HTML/CSS tags, removes the Diff header (thank you!), uses AsciiDoctor to create an HTML file and then PrinceXML to create a PDF file of the output that I can send to document reviewers.
git diff --color-words file1.adoc file2.adoc > result.adoc;
Get-Content "result.adoc" | % {
$_ -Replace '(=+ ?)([A-Za-z\s]+)(\[m)', '$1$2' `
-Replace '\[32m', '+++<span style="color: #00cd00;">' `
-Replace '\[31m', '+++<span style="color: #cd0000; text-decoration: line-through;">' `
-Replace '\[m', '</span>+++' } | Out-File -encoding utf8 "result2.adoc" ;
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)', '' | Out-File -encoding utf8 "result3.adoc" ;
asciidoctor result3.adoc -o result3.html;
prince result3.html --javascript -o result3.pdf;
Read-Host -Prompt "Press Enter to exit"
Here's a screenshot of the result using some text from Wikipedia:

Powershell: Pull URL out of String

I am pulling a string from a text file that looks like:
C:\Users\users\Documents\Firefox\tools\Install.ps1:37: Url = "https://somewebsite.com"
I need to somehow remove everything except the URL, so it should look like:
https://somewebsite.com
Here is what I have tried:
$Urlselect = Select-String -Path "$zipPath\tools\chocolateyInstall.ps1" -Pattern "url","Url"-List # Selects URL download path
$Urlselect = $Urlselect -replace ".*" ","" -replace ""*.","" # remove everything but the download link
but this didn't seem to do anything. I am thinking that it's going to have to do with regex, but I am not sure how to put it. Any help is appreciated. Thanks
I suggest using the switch statement with the -Regex and -File options:
$url = switch -regex -file "$zipPath\tools\chocolateyInstall.ps1" {
' Url = "(.*?)"' { $Matches[1]; break }
}
-file makes switch loop over all lines of the specified file.
-regex interprets the branch conditionals as regular expressions, and the automatic $Matches variable can be used in the associated script block ({ ... }) to access the results of the match, notably, what the 1st (and only) capture group in the regex ((...)) captured - the URL of interest.
break stops processing once the 1st match is found. (To continue matching, use continue).
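The same switch logic can be tried without a file by passing an array of lines instead of -file. The sample lines below are made up for illustration:

```powershell
# Made-up stand-in for the contents of chocolateyInstall.ps1.
$lines = @(
    '$packageName = "firefox"',
    '  Url = "https://somewebsite.com"',
    '  Url = "https://example.org/second"'
)

# Without -file, switch iterates over the elements of the array.
$url = switch -regex ($lines) {
    ' Url = "(.*?)"' { $Matches[1]; break }   # break: stop at the first match
}

$url   # https://somewebsite.com
```

Replacing break with continue would collect both URLs into an array.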
If you do want to use Select-String:
$url = Select-String -List ' Url = "(.*?)"' "$zipPath\tools\chocolateyInstall.ps1" |
ForEach-Object { $_.Matches.Groups[1].Value }
Note that the switch solution will perform much better.
As for what you tried:
Select-String -Path "$zipPath\tools\chocolateyInstall.ps1" -Pattern "url","Url"
Select-String is case-insensitive by default, so there's no need to specify case variations of the same string. (Conversely, you must use the -CaseSensitive switch to force case-sensitive matching).
Also note that Select-String doesn't output the matching line directly, as a string, but as match-information objects; to get the matching line, access the .Line property[1].
$Urlselect -replace ".*" ","" -replace ""*.",""
".*" " and ""*." result in syntax errors, because you forgot to escape the embedded " as `".
Alternatively, use '...' (single-quoted literal strings), which allows you to embed " as-is and is generally preferable for regexes and replacement operands, because there's no confusion over what parts PowerShell may interpret up front (string expansion).
Even with the escaping problem solved, however, your -replace operations wouldn't have worked, because .*" matches greedily and therefore up to the last "; here's a corrected solution with non-greedy matching, and with the replacement operand omitted (which makes it default to the empty string):
PS> 'C:\...ps1:37: Url = "https://somewebsite.com"' -replace '^.*?"' -replace '"$'
https://somewebsite.com
^.*?" non-greedily replaces everything up to the first ".
"$ replaces a " at the end of the string.
However, you can do it with a single -replace operation, using the same regex as with the switch solution at the top:
PS> 'C:\...ps1:37: Url = "https://somewebsite.com"' -replace '^.*?"(.*?)"', '$1'
https://somewebsite.com
$1 in the replacement operand refers to what the 1st capture group ((...)) captured, i.e. the bare URL; for more information, see this answer.
[1] Note that there's a green-lit feature suggestion - not yet implemented as of PowerShell Core 6.2.0 - to allow Select-String to emit strings directly, using the proposed -Raw switch - see https://github.com/PowerShell/PowerShell/issues/7713

How to move first 7 characters of a file name to the end using Powershell

My company has millions of old reports in pdf form. They are typically named in the format: 2018-09-18 - ReportName.pdf
The organization we need to submit these to is now requiring that we name the files in this format: Report Name - 2018-09.pdf
I need to move the first 7 characters of the file name to the end. I'm thinking there is probably an easy way to code this, but I cannot figure it out. Can anyone help me?
Thanks!
Caveat:
As jazzdelightsme points out, the desired renaming operation can result in name collisions, given that you're removing the day component from your dates; e.g., 2018-09-18 - ReportName.pdf and 2018-09-19 - ReportName.pdf would result in the same filename, Report Name - 2018-09.pdf.
Either way, I'm assuming that the renaming operation is performed on copies of the original files. Alternatively, you can create copies with new names elsewhere with Copy-Item while enumerating the originals, but the advantage of Rename-Item is that it will report an error in case of a name collision.
Get-ChildItem -Filter *.pdf | Rename-Item -NewName {
$_.Name -replace '^(\d{4}-\d{2})-\d{2} - (.*?)\.pdf$', '$2 - $1.pdf'
} -WhatIf
-WhatIf previews the renaming operation; remove it to perform actual renaming.
Add -Recurse to the Get-ChildItem call to process an entire directory subtree.
The use of -Filter is optional, but it speeds up processing.
A script block ({ ... }) is passed to Rename-Item's -NewName parameter, which enables dynamic renaming of each input file ($_) received from Get-ChildItem using a string-transformation (replacement) expression.
The -replace operator uses a regex (regular expression) as its first operand to perform string replacements based on patterns; here, the regex breaks down as follows:
^(\d{4}-\d{2}) matches something like 2018-09 at the start (^) of the name and - by virtue of being enclosed in (...) - captures that match in a so-called capture group, which can be referenced in the replacement string by its index, namely $1, because it is the first capture group.
(.*?) captures the rest of the filename excluding the extension in capture group $2.
The ? after .* makes the sub-expression non-greedy, meaning that it will give subsequent sub-expressions a chance to match too, as opposed to trying to match as many characters as possible (which is the default behavior, termed greedy).
\.pdf$ matches the filename extension (.pdf) at the end ($) - note that case doesn't matter. . is escaped as \., because it is meant to be matched literally here (without escaping, . matches any single character in a single-line string).
$2 - $1.pdf is the replacement string, which arranges what the capture groups captured in the desired form.
Note that any file whose name doesn't match the regex is quietly left alone, because the -replace operator passes the input string through if there is no match, and Rename-Item does nothing if the new name is the same as the old one.
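To check the regex and spot name collisions before touching any files, the same -replace can be run on plain strings first. This is a sketch with invented sample names:

```powershell
# Invented sample names: two that collide, one other report, one non-matching file.
$names = @(
    '2018-09-18 - ReportName.pdf',
    '2018-09-19 - ReportName.pdf',
    '2018-10-01 - Other.pdf',
    'notes.txt'
)

# Same regex and replacement string as in the Rename-Item call above.
$newNames = $names -replace '^(\d{4}-\d{2})-\d{2} - (.*?)\.pdf$', '$2 - $1.pdf'

# Any new name that occurs more than once would be a collision on rename.
$collisions = $newNames |
    Group-Object |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Name }

$newNames     # notes.txt is passed through unchanged
$collisions   # ReportName - 2018-09.pdf
```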
Get-ChildItem with some RegEx and Rename-Item can do it:
Get-ChildItem -Path "C:\temp" | foreach {
$newName = $_.Name -replace '(^.{7}).*?-\s(.*?)\.(.*$)','$2 - $1.$3'
$_ | Rename-Item -NewName $newName
}
The RegEx
'(^.{7}).*?-\s(.*?)\.(.*$)' / $2 - $1.$3
(^.{7}) matches the first 7 characters
.*?-\s matches any characters up to (and including) the first dash that is followed by whitespace (the "- " of the " - " separator)
(.*?)\. matches anything until the first found dot ( . )
(.*$) matches the file extension in this case
$2 - $1.$3 puts it all together in the changed order
This won't properly work if there are filenames with multiple dots ( . ) in it.
This should work (added some test data):
$test = '2018-09-18 - ReportName.pdf','2018-09-18 - Other name.pdf','other pattern.pdf','2018-09-18 - double.extension.pdf'
$test | % {
$match = [Regex]::Match($_, '(?<Date>\d{4}-\d\d)-\d\d - (?<Name>.+)\.pdf')
if ($match.Success) {
"$($match.Groups['Name'].Value) - $($match.Groups['Date'].Value).pdf"
} else {
$_
}
}
Something like this -
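Get-ChildItem -Path $path | Rename-Item -NewName { ($_.BaseName -split ' - ', 2)[-1] + ' - ' + $_.BaseName.Substring(0,7) + $_.Extension } -WhatIf
Explanation -
-split ' - ', 2 splits the base name of the file at the first " - " separator, and [-1] tells PowerShell to select the last of the resulting parts. The -split operator is used rather than the .Split(' - ') method because, in Windows PowerShell, the method treats its argument as a set of individual characters and would therefore also split on spaces inside the report name.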
SubString(0,7) will select 7 characters starting from the first character of the BaseName of the file.
Remove -WhatIf to apply the rename.