PowerShell - Regex match multiple lines from file

I am able to match and replace multiple lines if the text string is part of the PowerShell script:
$regex = @"
(?s)(--match from here--.*?
--up to here--)
"@
$text = @"
first line
--match from here--
other lines
--up to here--
last line
"@
$editedText = ($text -replace $regex, "")
$editedText | Set-Content ".\output.txt"
output.txt:
first line
last line
But if I instead read the text in from a file with Get-Content -Raw, the same regex fails to match anything.
$text = Get-Content ".\input.txt" -Raw
input.txt:
first line
--match from here--
other lines
--up to here--
last line
output.txt:
first line
--match from here--
other lines
--up to here--
last line
Why is this? What can I do to match the text read in from input.txt? Thanks in advance!

With a here-string, the pattern contains whatever newline characters the .ps1 file itself is saved with (CRLF or LF). It fails to match if the input file uses a different newline convention.
To remove this dependency, define a regex that uses \r?\n to match both CRLF and LF newlines:
$regex = "(?s)(--match from here--.*?\r?\n--up to here--)"
$text = Get-Content "input.txt" -Raw
$editedText = $text -replace $regex, ""
$editedText | Set-Content ".\output.txt"
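To see the dependency in isolation, here is a minimal sketch that uses two hypothetical in-memory strings in place of input.txt, one with CRLF and one with LF newlines. It compares a pattern that hard-codes CRLF (roughly what the here-string amounts to when the script is saved with CRLF) against the \r?\n version. All variable names are illustrative assumptions.

```powershell
# Hypothetical sample text standing in for input.txt, once per newline convention.
$crlfText = "first line`r`n--match from here--`r`nother lines`r`n--up to here--`r`nlast line"
$lfText   = $crlfText -replace "`r`n", "`n"

# Roughly what the here-string pattern contains when the .ps1 file uses CRLF.
$crlfOnly = "(?s)(--match from here--.*?`r`n--up to here--)"
# Newline-agnostic pattern from the answer.
$tolerant = '(?s)(--match from here--.*?\r?\n--up to here--)'

$crlfOnlyMatchesLf   = $lfText   -match $crlfOnly   # the LF text contains no CR
$tolerantMatchesLf   = $lfText   -match $tolerant
$tolerantMatchesCrlf = $crlfText -match $tolerant
```

Only the \r?\n pattern is expected to match both variants, which is why it is the safer choice when the newline style of the input file is not known in advance.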
Alternatively, you can use a switch-based solution, which allows simpler regex patterns:
$include = $true
& { switch -File 'input.txt' -RegEx {
'--match from here--' { $include = $false }
{ $include } { $_ } # Output line if $include equals $true
'--up to here--' { $include = $true }
}} | Set-Content 'output.txt'
The switch -File construct loops over all lines of the input file and passes each one to the match expressions.
When we find the 1st pattern we set an $include flag to $false, which causes the code to skip over all lines until after the 2nd pattern is found, which sets the $include flag back to $true.
Writing $_ on its own causes the current line to be output.
We pipe to Set-Content to reduce the memory footprint of the script. Instead of reading all lines into a variable in memory, we use a streaming approach where each processed line is immediately passed to Set-Content. Note that we can't pipe directly from a switch block, so as a workaround we wrap the switch inside a script block (& { ... } creates and calls the script block).
The idea has been adopted from this GitHub comment.
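For experimenting without touching any files, the same flag-toggling idea can be sketched against an in-memory array. The $lines and $kept variables are hypothetical; switch -Regex iterates over the array elements the same way switch -File iterates over the lines of a file.

```powershell
# Hypothetical in-memory stand-in for the lines of input.txt.
$lines = 'first line', '--match from here--', 'other lines', '--up to here--', 'last line'

$include = $true
$kept = & {
    switch -Regex ($lines) {
        '--match from here--' { $include = $false }   # start skipping
        { $include }          { $_ }                  # emit line while not skipping
        '--up to here--'      { $include = $true }    # resume after the end marker
    }
}
# $kept now holds the lines outside the marked block.
```

Because the end-marker clause comes after the output clause, the marker line itself is not emitted.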

Related

Cannot remove text between two strings with ReadLines

test.txt contents:
foo
[HKEY_USERS\S-1-5-18\Software\Microsoft]
bar
delete me!
[HKEY_other_key]
end-------------
Online regex matches the text to be removed correctly (starting from string delete until string [HKEY), but code written in PowerShell doesn't remove anything when I run it in PowerShell ISE:
$file = [System.IO.File]::ReadLines("test.txt")
$pattern = $("(?sm)^delete.*?(?=^\[HKEY)")
$file -replace $pattern, "" # returns original test.txt including line "delete me!" which should be removed
It seems to be a problem with ReadLines because when I use alternative Get-Content:
$file = Get-Content -Path test.txt -Raw
it removes the unwanted line correctly, but I don't want to use Get-Content.
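[System.IO.File]::ReadLines(..) enumerates the file line by line (and [System.IO.File]::ReadAllLines(..) reads all lines into a string array), so -replace is applied to each line separately and a pattern that spans multiple lines can never match.
Get-Content -Raw, like [System.IO.File]::ReadAllText(..), reads all the text in the file into a single string.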
[System.IO.File]::ReadAllText("$pwd\test.txt") -replace "(?sm)^delete.*?(?=^\[HKEY)"
Results in:
foo
[HKEY_USERS\S-1-5-18\Software\Microsoft]
bar
[HKEY_other_key]
end-------------
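The difference can be reproduced without a file. In this sketch, the hypothetical $lines array plays the role of the ReadLines result and the joined string plays the role of ReadAllText:

```powershell
# Hypothetical stand-ins: an array of lines versus one multi-line string.
$lines   = 'foo', 'delete me!', '[HKEY_other_key]'
$pattern = '(?sm)^delete.*?(?=^\[HKEY)'

# -replace on an array runs once per element, so the lookahead never sees the next line.
$perLine = $lines -replace $pattern, ''

# On a single string the pattern can span the newline.
$asText  = ($lines -join "`n") -replace $pattern, ''
```

Here $perLine is expected to still contain 'delete me!', while $asText should no longer contain it.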
In case you do need to read the file line-by-line due to, for example, high memory consumption, switch -File is an excellent built-in PowerShell alternative:
switch -Regex -File('test.txt') {
    '^delete' {            # if the line starts with `delete`
        $skip = $true      # set this var to `$true`
        continue           # go to next line
    }
    '^\[HKEY' {            # if the line starts with `[HKEY`
        $skip = $false     # set this var to `$false`
        $_                 # output this line
        continue           # go to next line
    }
    { $skip } { continue } # if this var is `$true`, go to next line
    Default { $_ }         # if none of the previous conditions were met, output this line
}

Powershell regex replace line that contains ONLY certain characters

I read a file with get-content -raw because of other operations I perform.
$c = get-content myfile.txt -raw
I want to replace the entirety of each line that contains ONLY the characters "*" or "=" with "hare"
I try
$c -replace "^[*=]*$","hare"
but that does not succeed. It works with simple string input but not with my string that contains CRLFs. (Other regex replace operations not involving character classes work fine.)
TEST:
given an input file of three lines
*=**
keep this line ***
***=
The output should be
hare
keep this line ***
hare
Tried many things, no luck.
You should use the (?m) (RegexOptions.Multiline) option to make ^ match at the start of each line and $ at the end of each line.
However, there is a caveat: with the multiline option, the $ anchor in a .NET regex matches only before a newline (LF, "`n") character, or at the very end of the string. With CRLF line endings you therefore need to allow an optional (or, if it is always there, obligatory) CR character before $.
You may use
$file -replace "(?m)^[*=]*\r?$", "hare"
Powershell test demo:
PS> $file = "*=**`r`nkeep this line ***`r`n***=`r`n***==Keep this line as is"
PS> $file -replace "(?m)^[*=]*\r?$", "hare"
hare
keep this line ***
hare
***==Keep this line as is
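To make the caveat concrete, the sketch below runs a pattern without CR handling against a hypothetical CRLF string, and then a variant that keeps the CR out of the match by using a lookahead. The lookahead form and the use of + instead of * (so that empty lines are not replaced) are my own assumptions, not part of the answer above; they are meant for cases where the original line endings should be preserved.

```powershell
# Hypothetical CRLF sample matching the question's test input.
$sample = "*=**`r`nkeep this line ***`r`n***="

# Without CR handling, $ cannot match before the CR, so only the last line
# (which ends at the end of the string) is replaced.
$naive = $sample -replace '(?m)^[*=]+$', 'hare'

# The lookahead checks for an optional CR before the line end without consuming it,
# so the CRLF sequences stay intact.
$fixed = $sample -replace '(?m)^[*=]+(?=\r?$)', 'hare'
```

With the answer's \r?$ form the CR becomes part of the match and is replaced along with the line, which is harmless for display but changes the line endings of the replaced lines.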
Try this:
$c = get-content "myfile.txt" -raw
$c -split [environment]::NewLine | % { if( $_ -match "^[*= ]+$" ) { "hare" } else { $_ } }

Powershell regex match sequence doesn't work although it matches in Sublime Text find and replace

I am trying to create a Powershell regex statement to remove the top five lines of this output from a git diff file that has already been modified with Powershell regex.
[1mdiff --git a/uk1.adoc b/uk2.adoc</span>+++
[1mindex b5d3bf7..90299b8 100644</span>+++
[1m--- a/uk1.adoc</span>+++
[1m+++ b/uk2.adoc</span>+++
[36m@@ -1,9 +1,9 @@</span>+++
= Heading
Body text
Image shown because binary code doesn't show in the text
The following statement matches the text so the '= Heading' line is placed at the top of the page if I replace with nothing.
^[^=]*.[+][\n]
But in Powershell, it isn't matching the text.
Get-Content "result2.adoc" | % { $_ -Replace '^[^=]*.[+][\n]', '' } | Out-File "result3.adoc";
Any ideas about why it doesn't work in Powershell?
My overall goal is to create a diff file of two versions of an AsciiDoc file and then replace the ASCII codes with HTML/CSS code to display the resulting AsciiDoc file with green/red track changes.
The simplest, and faster, approach is to read the input file as a single, multiline string with Get-Content -Raw and let the regex passed to -replace operate across multiple lines:
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)' |
Set-Content result3.adoc
(?s) activates in-line option s which makes . match newline (\n) characters too.
^.+?\n(?==) matches from the start of the string (^) any number of characters (including newlines) (.+), non-greedily (?)
until a newline (\n) followed by a = is found.
(?=...) is a look-ahead assertion, which matches = without consuming it, i.e., without considering it part of the substring that matched.
Since no replacement operand is passed to -replace, the entire match is replaced with the implied empty string, i.e., what was matched is effectively removed.
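As a quick check of the pattern walk-through above, here is a sketch on a hypothetical in-memory string that mimics a diff header followed by the AsciiDoc body (the sample content is an assumption, shortened from the question):

```powershell
# Hypothetical sample: two header lines, then the body starting with '= Heading'.
$diff = "diff --git a/uk1.adoc b/uk2.adoc`nindex b5d3bf7..90299b8 100644`n= Heading`nBody text"

# Remove everything from the start of the string up to and including the
# first newline that is directly followed by '='.
$body = $diff -replace '(?s)^.+?\n(?==)'
```

After this, $body should start with the = Heading line, because the lookahead leaves the = itself in place.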
As for what you tried:
The -replace operator passes its LHS through if no match is found, so you cannot use it to filter out non-matching lines.
Even if you match an undesired line in full and replace it with '' (the empty string), it will show up as an empty line in the output when sent to Set-Content or Out-File (>).
As for your specific regex, ^[^=]*.[+][\n] (whether or not the first ^ is followed by an ESC (0x1b) char.):
[\n] (just \n would suffice) tries to match a newline char. after a literal + ([+]), yet lines read individually with Get-Content (without -Raw) by definition are stripped of their trailing newline, so the \n will never match; instead, use $ to match the end of a line.
Instead of % (the built-in alias for the ForEach-Object cmdlet) you could have used ? (the built-in alias for the Where-Object cmdlet) to perform the desired filtering:
Get-Content result2.adoc | ? { $_ -notmatch '^\e\[' }
$_ -notmatch '^\e\[' returns $true only for lines that don't start (^) with an ESC character (\e, whose code point is 0x1b) followed by a literal [ (escaped as \[), thereby effectively filtering out the lines before the = Heading line.
However, the multi-line -replace command at the top is a more direct and faster expression of your intent.
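The line-filtering alternative can be tried on an in-memory array as well. In this sketch the $lines content is hypothetical, and the ESC character is built with [char]0x1b so that it also works where the "`e" string escape is not available:

```powershell
# Build the ESC character explicitly.
$esc = [char]0x1b

# Hypothetical sample: two colored diff header lines, then the document body.
$lines = "${esc}[1mdiff --git a/uk1.adoc b/uk2.adoc",
         "${esc}[36m@@ -1,9 +1,9 @@",
         '= Heading',
         'Body text'

# Keep only lines that do not start with ESC followed by '['.
$kept = $lines | Where-Object { $_ -notmatch '^\e\[' }
```

Note that this drops every line that begins with an escape sequence, not just the header, so it only fits input where body lines never start with a color code.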
Here is the code I ended up with after help from @mklement0. This Powershell script creates MS Word-style track changes for two versions of an AsciiDoc file. It creates the Diff file, uses regex to replace ASCII codes with HTML/CSS tags, removes the Diff header (thank you!), uses AsciiDoctor to create an HTML file and then PrinceXML to create a PDF file of the output that I can send to document reviewers.
git diff --color-words file1.adoc file2.adoc > result.adoc;
Get-Content "result.adoc" | % {
$_ -Replace '(=+ ?)([A-Za-z\s]+)(\[m)', '$1$2' `
-Replace '\[32m', '+++<span style="color: #00cd00;">' `
-Replace '\[31m', '+++<span style="color: #cd0000; text-decoration: line-through;">' `
-Replace '\[m', '</span>+++' } | Out-File -encoding utf8 "result2.adoc" ;
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)', '' | Out-File -encoding utf8 "result3.adoc" ;
asciidoctor result3.adoc -o result3.html;
prince result3.html --javascript -o result3.pdf;
Read-Host -Prompt "Press Enter to exit"
Here's a screenshot of the result using some text from Wikipedia:

Is there a way to optimise my Powershell function for removing pattern matches from a large file?

I've got a large text file (~20K lines, ~80 characters per line).
I've also got a largish array (~1500 items) of objects containing patterns I wish to remove from the large text file. Note, if the pattern from the array appears on a line in the input file, I wish to remove the entire line, not just the pattern.
The input file is CSVish with lines similar to:
A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;
The pattern in the array which I search each line in the input file for resemble the
XX000029
part of the line above.
My somewhat naïve function to achieve this goal looks like this currently:
function Remove-IdsFromFile {
param(
[Parameter(Mandatory=$true,Position=0)]
[string]$BigFile,
[Parameter(Mandatory=$true,Position=1)]
[Object[]]$IgnorePatterns
)
try{
$FileContent = Get-Content $BigFile
}catch{
Write-Error $_
}
$IgnorePatterns | ForEach-Object {
$IgnoreId = $_.IgnoreId
$FileContent = $FileContent | Where-Object { $_ -notmatch $IgnoreId }
Write-Host $FileContent.count
}
$FileContent | Set-Content "CleansedBigFile.txt"
}
This works, but is slow.
How can I make it quicker?
function Remove-IdsFromFile {
param(
[Parameter(Mandatory=$true,Position=0)]
[string]$BigFile,
[Parameter(Mandatory=$true,Position=1)]
[Object[]]$IgnorePatterns
)
# Create the pattern matches
$regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"
If(Test-Path $BigFile){
$reader = New-Object System.IO.StreamReader($BigFile)
$line=$reader.ReadLine()
while ($line -ne $null)
{
# Check if the line should be output to file
If($line -notmatch $regex){$line | Add-Content "CleansedBigFile.txt"}
# Attempt to read the next line.
$line=$reader.ReadLine()
}
$reader.close()
} Else {
Write-Error "Cannot locate: $BigFile"
}
}
StreamReader is one of the preferred ways to read large text files. We also build a single regex pattern string to match against by joining the patterns with |. [regex]::Escape() is applied to each pattern as a precaution in case regex control characters are present; I have to guess here, since we only see one sample pattern.
If $IgnorePatterns can easily be cast to strings, this should work as is. A small sample of what $regex looks like would be:
XX000029|XX000028|XX000027
If $IgnorePatterns is populated from a database you might have less control over this but since we are using regex you might be able to reduce that pattern set by actually using regex (instead of just a big alternative match) like in my example above. You could reduce that to XX00002[7-9] for instance.
I don't know whether the regex itself will provide a performance boost with 1500 alternatives. The StreamReader is supposed to be the focus here. However, I did muddy the waters by using Add-Content for the output, which does not win any awards for being fast either (a StreamWriter could be used in its place).
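The pattern-building step on its own can be sketched as follows. The $ignoreIds and $lines values are hypothetical samples modeled on the question; with the question's objects, the strings would presumably come from each object's IgnoreId property instead.

```powershell
# Hypothetical IDs to ignore.
$ignoreIds = 'XX000029', 'XX000028'

# Escape each ID and join them into one alternation, so every line is tested
# once instead of once per ID.
$regex = ($ignoreIds | ForEach-Object { [regex]::Escape($_) }) -join '|'

# Hypothetical input lines in the question's CSV-like format.
$lines = 'A;AAA-BBB;XXX;XX000029;WORD',
         'A;AAA-BBB;XXX;XX000030;WORD',
         'A;AAA-BBB;XXX;XX000028;WORD'

$kept = $lines | Where-Object { $_ -notmatch $regex }
```

Only the XX000030 line should remain in $kept. Whether a 1500-way alternation beats 1500 separate passes on real data would need to be measured.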
Reader and Writer
I still have to test this to be sure it works, but this version uses just StreamReader and StreamWriter. If it does work better, I am just going to replace the above code.
function Remove-IdsFromFile {
param(
[Parameter(Mandatory=$true,Position=0)]
[string]$BigFile,
[Parameter(Mandatory=$true,Position=1)]
[Object[]]$IgnorePatterns
)
# Create the pattern matches
$regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"
If(Test-Path $BigFile){
# Prepare the StreamReader
$reader = New-Object System.IO.StreamReader($BigFile)
#Prepare the StreamWriter
$writer = New-Object System.IO.StreamWriter("CleansedBigFile.txt")
$line=$reader.ReadLine()
while ($line -ne $null)
{
# Check if the line should be output to file
If($line -notmatch $regex){$writer.WriteLine($line)}
# Attempt to read the next line.
$line=$reader.ReadLine()
}
# Don't cross the streams!
$reader.Close()
$writer.Close()
} Else {
Write-Error "Cannot locate: $BigFile"
}
}
You might need some error prevention in there for the streams but it does appear to work in place.
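One possible form of that error prevention is a try/finally block, so both streams are closed even if reading or writing throws. This is only a sketch under assumptions: the function name Remove-MatchingLines and its parameters are hypothetical, and full paths are assumed because .NET resolves relative paths against the process working directory rather than the current PowerShell location.

```powershell
function Remove-MatchingLines {
    param(
        [Parameter(Mandatory=$true)][string]$InPath,   # full path of the input file
        [Parameter(Mandatory=$true)][string]$OutPath,  # full path of the output file
        [Parameter(Mandatory=$true)][string]$Regex     # lines matching this are dropped
    )
    $reader = $null
    $writer = $null
    try {
        $reader = New-Object System.IO.StreamReader($InPath)
        $writer = New-Object System.IO.StreamWriter($OutPath)
        # ReadLine() returns $null at the end of the file.
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line -notmatch $Regex) { $writer.WriteLine($line) }
        }
    }
    finally {
        # Runs on success and on error, so file handles are always released.
        if ($writer) { $writer.Dispose() }
        if ($reader) { $reader.Dispose() }
    }
}
```

It could be called as Remove-MatchingLines -InPath $in -OutPath $out -Regex $regex, where $regex is the alternation built from the ignore patterns.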

RegEx Match whole line with first occurrence from the bottom of the file, upwards

I'm trying to parse a file with error codes.
I would only like the first occurrence from the bottom of the file to be returned.
So far, I've got this regex searching for the error code numbers, and it returns the whole line with the Multiline option, but it returns all lines in the file, not just the last one.
^.*?\b(639|640|460|458|664|148)\b.*$
I'm using powershell, so if you have an example using powershell - that would be great.
Thank you.
Assuming your regex is correct for matching on a line then you should be able to do something like this:
$pattern = '^.*?\b(639|640|460|458|664|148)\b.*$'
$content = Get-Content c:\somefile.txt
for ($i = $content.Length - 1; $i -ge 0; $i--) {
if ($content[$i] -match $pattern) {
$matches[1] # the captured error code; use $matches[0] for the whole line
break
}
}
I'd use Select-String for this:
$filename = 'C:\path\to\input.txt'
$pattern = '\b(639|640|460|458|664|148)\b'
Get-Content $filename | Select-String $pattern | select -Last 1
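Select-String returns MatchInfo objects, so both the whole line and the captured code are available on the result. A small sketch with a hypothetical in-memory log in place of the file:

```powershell
# Hypothetical log lines standing in for the file content.
$log = 'step 1 ok', 'error 639 disk', 'step 2 ok', 'error 460 net', 'done'
$pattern = '\b(639|640|460|458|664|148)\b'

# Take the last matching line, i.e. the first occurrence when reading from the bottom.
$last = $log | Select-String -Pattern $pattern | Select-Object -Last 1

$lastLine = $last.Line                         # the whole matching line
$lastCode = $last.Matches[0].Groups[1].Value   # just the captured error code
```

This still scans the whole input from the top; for very large files, iterating backwards as in the loop-based answer may finish sooner.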