Cannot remove text between two strings with ReadLines

test.txt contents:
foo
[HKEY_USERS\S-1-5-18\Software\Microsoft]
bar
delete me!
[HKEY_other_key]
end-------------
An online regex tester matches the text to be removed correctly (from the string delete up to the string [HKEY), but the same pattern removes nothing when I run it in PowerShell ISE:
$file = [System.IO.File]::ReadLines("test.txt")
$pattern = $("(?sm)^delete.*?(?=^\[HKEY)")
$file -replace $pattern, "" # returns original test.txt including line "delete me!" which should be removed
It seems to be a problem with ReadLines because when I use alternative Get-Content:
$file = Get-Content -Path test.txt -Raw
it removes the unwanted line correctly, but I don't want to use Get-Content.

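[System.IO.File]::ReadLines(..) returns the file's lines one by one (ReadAllLines(..) returns the same lines as a string array), so -replace is applied to each line separately and a pattern that spans several lines can never match.
Get-Content -Raw, like [System.IO.File]::ReadAllText(..), reads all the text in the file into a single string, which is what a multi-line pattern needs: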
[System.IO.File]::ReadAllText("$pwd\test.txt") -replace "(?sm)^delete.*?(?=^\[HKEY)"
Results in:
foo
[HKEY_USERS\S-1-5-18\Software\Microsoft]
bar
[HKEY_other_key]
end-------------
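The difference between the two read modes can be reproduced without a file. The sketch below uses a made-up in-memory stand-in for test.txt (the names $lines, $perLine and $joined are illustrative, not from the question) and applies the same pattern once to an array of lines and once to a single joined string.

```powershell
# Hypothetical in-memory stand-in for the lines of test.txt
$lines   = 'foo', 'bar', 'delete me!', '[HKEY_other_key]', 'end'
$pattern = '(?sm)^delete.*?(?=^\[HKEY)'

# With an array on the left, -replace runs once per element, so the
# lookahead for a following "[HKEY" line never finds anything.
$perLine = $lines -replace $pattern, ''

# Joined into one string, the pattern can cross line boundaries.
$joined = ($lines -join "`n") -replace $pattern, ''
```

Here $perLine still contains 'delete me!', while $joined no longer does.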
In case you do need to read the file line-by-line due to, for example, high memory consumption, switch -File is an excellent built-in PowerShell alternative:
switch -Regex -File ('test.txt') {
    '^delete' {            # if the line starts with `delete`
        $skip = $true      # set this var to `$true`
        continue           # go to next line
    }
    '^\[HKEY' {            # if the line starts with `[HKEY`
        $skip = $false     # set this var to `$false`
        $_                 # output this line
        continue           # go to next line
    }
    { $skip } { continue } # if this var is `$true`, go to next line
    Default { $_ }         # if none of the previous conditions were met, output this line
}

Related

Powershell - Regex match multiple lines from file

I am able to match and replace multiple lines if the text string is part of the PowerShell script:
$regex = @"
(?s)(--match from here--.*?
--up to here--)
"@
$text = @"
first line
--match from here--
other lines
--up to here--
last line
"@
$editedText = ($text -replace $regex, "")
$editedText | Set-Content ".\output.txt"
output.txt:
first line
last line
But if I instead read the text in from a file with Get-Content -Raw, the same regex fails to match anything.
$text = Get-Content ".\input.txt" -Raw
input.txt:
first line
--match from here--
other lines
--up to here--
last line
output.txt:
first line
--match from here--
other lines
--up to here--
last line
Why is this? What can I do to match the text read in from input.txt? Thanks in advance!

With a here-string, the pattern depends on the kind of newline characters used by the .ps1 file itself. It won't match if those differ from the newline characters used by the input file.
To remove this dependency, define a RegEx that uses \r?\n, which matches both Windows (CRLF) and Unix (LF) newlines:
$regex = "(?s)(--match from here--.*?\r?\n--up to here--)"
$text = Get-Content "input.txt" -Raw
$editedText = $text -replace $regex, ""
$editedText | Set-Content ".\output.txt"
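To see that the \r?\n form is independent of the newline style, here is a small check on in-memory strings. The sample text is made up to mirror input.txt, and the variable names are only for illustration.

```powershell
$regex = '(?s)(--match from here--.*?\r?\n--up to here--)'

# Same sample text, once with CRLF and once with LF line endings
$crlf = "first line`r`n--match from here--`r`nother lines`r`n--up to here--`r`nlast line"
$lf   = $crlf -replace "`r", ''

# The block is removed in both cases; the line breaks around it stay,
# so an empty line is left where the block used to be.
$crlfResult = $crlf -replace $regex, ''
$lfResult   = $lf   -replace $regex, ''
```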
Alternatively, you may use a switch-based solution, which gets by with simpler RegEx patterns:
$include = $true
& { switch -File 'input.txt' -RegEx {
'--match from here--' { $include = $false }
{ $include } { $_ } # Output line if $include equals $true
'--up to here--' { $include = $true }
}} | Set-Content 'output.txt'
The switch -File construct loops over all lines of the input file and passes each one to the match expressions.
When we find the 1st pattern we set an $include flag to $false, which causes the code to skip over all lines until after the 2nd pattern is found, which sets the $include flag back to $true.
Writing $_ on its own causes the current line to be outputted.
We pipe to Set-Content to reduce the memory footprint of the script. Instead of reading all lines into a variable in memory, we use a streaming approach where each processed line is immediately passed to Set-Content. Note that we can't pipe directly from a switch block, so as a workaround we wrap the switch inside a script block (& { ... } creates and calls the script block).
The idea has been adopted from this GitHub comment.
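The flag-based filtering can also be tried without any files, because switch accepts an array as input. This is only a sketch on made-up sample lines; $lines and $kept are illustrative names.

```powershell
# Sample lines standing in for input.txt
$lines = 'first line', '--match from here--', 'other lines', '--up to here--', 'last line'

$include = $true
$kept = & {
    switch -Regex ($lines) {
        '--match from here--' { $include = $false }   # start skipping
        { $include }          { $_ }                  # emit only while the flag is set
        '--up to here--'      { $include = $true }    # resume after this line
    }
}
```

Because no clause uses break or continue, every clause is evaluated for every line, which is why the order of the three clauses matters: the end marker re-enables output only after it has itself been skipped.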

Powershell regex replace line that contains ONLY certain characters

I read a file with get-content -raw because of other operations I perform.
$c = get-content myfile.txt -raw
I want to replace the entirety of each line that contains ONLY the characters "*" or "=" with "hare"
I try
$c -replace "^[*=]*$","hare"
but that does not succeed. It works with simple string input but not with my string that contains CRLFs. (Other regex replace operations not involving character classes work fine.)
TEST:
given an input file of three lines
*=**
keep this line ***
***=
The output should be
hare
keep this line ***
hare
Tried many things, no luck.

You should use the (?m) (RegexOptions.Multiline) option to make ^ match at the start of a line and $ at the end of a line.
However, there is a caveat: with the multiline option, the $ anchor in a .NET regex matches only before a newline char (LF, "`n") or at the very end of the string, not before a CR. You need to allow for an optional (or, if it is always there, obligatory) CR symbol before $.
You may use
$file -replace "(?m)^[*=]*\r?$", "hare"
Powershell test demo:
PS> $file = "*=**`r`nkeep this line ***`r`n***=`r`n***==Keep this line as is"
PS> $file -replace "(?m)^[*=]*\r?$", "hare"
hare
keep this line ***
hare
***==Keep this line as is
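For comparison, the following sketch runs a pattern without the CR handling against the same demo string, plus a lookahead variant that is my own assumption rather than part of the answer. Both use + instead of * so that empty lines are not replaced as well, and the lookahead leaves the CR in place so the output keeps its CRLF line endings.

```powershell
$text = "*=**`r`nkeep this line ***`r`n***=`r`n***==Keep this line as is"

# Plain $ only matches before LF (or at the end of the string), so the CR
# in front of each LF prevents any match and the text comes back unchanged.
$plain = $text -replace '(?m)^[*=]+$', 'hare'

# Lookahead variant (an assumption, not from the answer): the optional CR is
# checked but not consumed, so the line endings are preserved.
$fixed = $text -replace '(?m)^[*=]+(?=\r?$)', 'hare'
```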

Try this (splitting on \r?\n so that both CRLF and LF files work, and matching only * and = as asked):
$c = get-content "myfile.txt" -raw
$c -split '\r?\n' | ForEach-Object { if( $_ -match '^[*=]+$' ) { "hare" } else { $_ } }

Splitting text in PowerShell using content delimiter as filename

I am trying to split a txt transcription into single files, one for each folio.
The file is marked as [c. 1r],[c. 1v] ... [c. 7v] and so on.
Using this example I was able to create a PowerShell script that does the magic with a regex that matches each page delimiter, but I seem totally unable to use the regex to give proper names to the pages. With this code
$InputFile = "input.txt"
$Reader = New-Object System.IO.StreamReader($InputFile)
$a = 1
while (($Line = $Reader.ReadLine()) -ne $null) {
    if ($Line -match "\[c\. .*?\]") {
        $OutputFile = "MySplittedFileNumber$a$Matches.txt"
        $a++
    }
    Add-Content $OutputFile $Line
}
all the files are named like MySplittedFileNumber1System.Collections.Hashtable.txt instead of using the match. With "$Matches[0]" I'm told that the variable does not exist or has been filtered by -Exclude.
All my attempts at setting the $regex before executing seem to go nowhere. Can someone point me to how to get the result filenames formatted as MySplittedFileNumber[c. 1r].txt?
Using just a partial match as \[(c\. .*?)\] would be even better, but once I know how to retrieve the match, I bet I can find the solution.
I can do the variable 1r 1v setting in $a, somehow, but I'd rather use the one inside the txt file, since some folio may have been misnumbered in the manuscript and I need to retain this.
Content of original input.txt:
[c. 1r]
Text paragraph
text paragraph
...
Text paragraph
[c. 1v]
Text paragraph
text paragraph
...
Text paragraph
[c. 2r]
Text paragraph
text paragraph
...
Text paragraph
Desired result:
Content of MySplittedFileNumber[c. 1r].txt:
[c. 1r]
Text paragraph
text paragraph
...
Text paragraph
Content of MySplittedFileNumber[c. 1v].txt:
[c. 1v]
Text paragraph
text paragraph
...
Text paragraph
Content of MySplittedFileNumber[c. 2r].txt:
[c. 2r]
Text paragraph
text paragraph
...
Text paragraph

I tried to reproduce it and with a little change it worked:
$InputFile = "input.txt"
$Reader = New-Object System.IO.StreamReader($InputFile)
$a = 1
While (($Line = $Reader.ReadLine()) -ne $null) {
    If ($Line -match "\[c\. .*?\]") {
        # $( ) is needed to index into $Matches inside a double-quoted string
        $OutputFile = "MySplittedFileNumber$a$($Matches[0]).txt"
        $a++
    }
    Out-File -LiteralPath "<yourFolder>\$OutputFile" -InputObject $Line -Append
}
$Reader.Close()   # release the input file when done
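To index into an array (or a hashtable such as $Matches) inside a double-quoted string, wrap the expression in a subexpression like this: $($array[number])
To write to the file, you should give the full path and not just the file name. Because the file name contains [ and ], which -Path treats as wildcard characters, it is passed via -LiteralPath.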
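The expansion rule is easy to check on a single made-up line. The variable names and the Page prefix below are illustrative only.

```powershell
$line = '[c. 1r]'

if ($line -match '\[(c\. .*?)\]') {
    # Only the variable itself is expanded, so the hashtable's type name is inserted
    $wrong = "Page$Matches.txt"

    # A subexpression evaluates the index first
    $whole = "Page$($Matches[0]).txt"   # entire match, brackets included
    $inner = "Page$($Matches[1]).txt"   # first capture group only
}
```

This also covers the partial match mentioned in the question: with a capture group, $Matches[1] holds the text without the brackets.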

From version 3 on, PowerShell's Get-Content cmdlet has the -Raw parameter, which reads a file as a whole into one string that you can then split into chunks with a regular expression (using a positive lookahead).
The very same RegEx can be used to grab the section name and insert it into the destination file name.
## Q:\Test\2018\07\19\SO_51421567.ps1
##
$RE = [RegEx]'(?=(\[c\. \d+[rv]\]))'
$Sections = (Get-Content '.\input.txt' -raw) -split $RE -ne ''
ForEach ($Section in $Sections){
    If ($Section -Match $RE){
        $Section | Out-File -LiteralPath ("MySplittedFileNumber{0}.txt" -f $Matches[1])
    }
}
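The lookahead split can be tried on an in-memory string first. This sketch deliberately differs from the code above in one respect: it uses a lookahead without a capture group for the split, since -split also returns any captured text as extra elements, and a second, anchored pattern to pull out the folio name. The sample text and variable names are assumptions for illustration.

```powershell
$text = "[c. 1r]`nalpha`n[c. 1v]`nbeta`n"

# Zero-width split: each delimiter stays at the start of its own section
$sections = @($text -split '(?=\[c\. \d+[rv]\])' -ne '')

# Build one file name per section from the folio marker it starts with
$names = foreach ($section in $sections) {
    if ($section -match '^\[(c\. \d+[rv])\]') {
        'MySplittedFileNumber[{0}].txt' -f $Matches[1]
    }
}
```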

Is there a way to optimise my Powershell function for removing pattern matches from a large file?

I've got a large text file (~20K lines, ~80 characters per line).
I've also got a largish array (~1500 items) of objects containing patterns I wish to remove from the large text file. Note, if the pattern from the array appears on a line in the input file, I wish to remove the entire line, not just the pattern.
The input file is CSVish with lines similar to:
A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;
The patterns in the array, which I search for in each line of the input file, resemble the
XX000029
part of the line above.
My somewhat naïve function to achieve this goal looks like this currently:
function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )
    try{
        $FileContent = Get-Content $BigFile
    }catch{
        Write-Error $_
    }
    $IgnorePatterns | ForEach-Object {
        $IgnoreId = $_.IgnoreId
        $FileContent = $FileContent | Where-Object { $_ -notmatch $IgnoreId }
        Write-Host $FileContent.count
    }
    $FileContent | Set-Content "CleansedBigFile.txt"
}
This works, but is slow.
How can I make it quicker?

function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )
    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"
    If(Test-Path $BigFile){
        $reader = New-Object System.IO.StreamReader($BigFile)
        $line=$reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$line | Add-Content "CleansedBigFile.txt"}
            # Attempt to read the next line.
            $line=$reader.ReadLine()
        }
        $reader.close()
    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}
StreamReader is one of the preferred ways to read large text files. We also combine all the patterns into a single regex alternation to match against. Each pattern is passed through [regex]::Escape() as a precaution in case regex control characters are present; I have to guess here, since we only see one pattern string.
If $IgnorePatterns can easily be cast to strings, this should work in place just fine. A small sample of what $regex looks like would be:
XX000029|XX000028|XX000027
If $IgnorePatterns is populated from a database you might have less control over this, but since we are using regex you might be able to reduce that pattern set by actually using regex (instead of just a big alternation) like in my example above. You could reduce that to XX00002[7-9], for instance.
I don't know if the regex itself will provide a performance boost with 1500 possibilities. The StreamReader is supposed to be the focus here. However, I did muddy the waters by using Add-Content for the output, which does not get any awards for being fast either (a stream writer could be used in its place).
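A quick way to see what the combined pattern looks like, and why the escaping matters, is to build it from a few sample IDs. The IDs and lines below are invented for illustration; 'A.B' is included only to show a regex control character being escaped.

```powershell
# Hypothetical IDs; the dot in 'A.B' would otherwise match any character
$ignoreIds = 'XX000029', 'XX000028', 'A.B'
$regex = ($ignoreIds | ForEach-Object { [regex]::Escape($_) }) -join '|'

$lines = 'A;AAA-BBB;XXX;XX000029;WORD',
         'A;AAA-BBB;XXX;XX000030;WORD',
         'A;AAA-BBB;XXX;AxB;WORD'

# One pass over the lines with a single combined pattern
$kept = @($lines | Where-Object { $_ -notmatch $regex })
```

Only the first line is dropped. The third one survives because the escaped pattern looks for a literal "A.B" rather than "A", any character, "B".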
Reader and Writer
I still have to test this to be sure it works but this just uses streamreader and streamwriter. If it does work better I am just going to replace the above code.
function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )
    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"
    If(Test-Path $BigFile){
        # Prepare the StreamReader
        $reader = New-Object System.IO.StreamReader($BigFile)
        # Prepare the StreamWriter
        $writer = New-Object System.IO.StreamWriter("CleansedBigFile.txt")
        $line=$reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$writer.WriteLine($line)}
            # Attempt to read the next line.
            $line=$reader.ReadLine()
        }
        # Don't cross the streams!
        $reader.Close()
        $writer.Close()
    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}
You might need some error prevention in there for the streams but it does appear to work in place.
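One possible shape for that error prevention is a try/finally around the streams, so both are released even if reading or writing throws. This is a sketch, not a tested replacement for the function above: Copy-FilteredLines is a made-up helper name, and the demo uses temporary files with full paths because .NET resolves relative paths against the process working directory rather than the PowerShell location.

```powershell
function Copy-FilteredLines {   # hypothetical helper, not part of the answer
    param(
        [Parameter(Mandatory=$true)][string]$Source,
        [Parameter(Mandatory=$true)][string]$Destination,
        [Parameter(Mandatory=$true)][string]$Pattern
    )
    $reader = $null
    $writer = $null
    try {
        $reader = New-Object System.IO.StreamReader($Source)
        $writer = New-Object System.IO.StreamWriter($Destination)
        while ($null -ne ($line = $reader.ReadLine())) {
            # Copy only the lines that do not match the pattern
            if ($line -notmatch $Pattern) { $writer.WriteLine($line) }
        }
    }
    finally {
        # Runs on success and on error, so the files are never left locked
        if ($null -ne $writer) { $writer.Dispose() }
        if ($null -ne $reader) { $reader.Dispose() }
    }
}

# Demo with temporary files (full paths)
$src = [System.IO.Path]::GetTempFileName()
$dst = [System.IO.Path]::GetTempFileName()
Set-Content -LiteralPath $src -Value 'a;XX000029;b', 'a;YY000001;b'
Copy-FilteredLines -Source $src -Destination $dst -Pattern 'XX000029'
$result = @(Get-Content -LiteralPath $dst)
Remove-Item -LiteralPath $src, $dst
```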

Powershell -creplace doesn't find Expression at end of line

I'm trying to find and replace some text at the end of line with Powershell. (ascii, txt, windows) I need to do this with a given script, which is already used for string replace:
$inputText = [system.IO.File]::ReadAllText("Text.txt")
$regex = '\\DE$|\DE_02'
$regex > test.txt
$th = [system.IO.File]::ReadAllText("test.txt")
foreach($expression in $th) {
    if ($expression -eq 'EOF') { break }
    $parts = $expression.Split("|")
    if ($parts.Count -eq 2) {
        $inputText = $InputText -creplace $parts
        echo $inputText | out-file "Text_neu.txt" -enc ascii
    }
}
The cmdlet works fine so far, but cannot match the end of line ($) -.-
I also tried `r`n instead of $ but didn't work...
When I try like this:
$inputText = [system.IO.File]::ReadAllText("Text.txt")
$inputText.Replace("\DE`r`n","\DE_02`r`n") | Out-File Text_neu.txt
it's all replaced correctly.
How can I change the existing script so that it will work also?

I am not sure I understand your script correctly, but I think the problem is that you are replacing on the whole text and not on single rows.
By default, $ does not match at the end of every row; it only matches at the end of the string (or before a final newline).
You can change this behaviour with the inline modifier (?m), which makes $ match at the end of each row. Note that in .NET, $ then matches just before the LF, so in a file with Windows line endings the CR still sits between \DE and $; a lookahead that allows an optional CR handles that.
Try
$regex = '(?m)\\DE(?=\r?$)|\DE_02'
as your regular expression.
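The effect can be checked on a short CRLF string before touching the real file. The sample paths are made up, and the expression used here is one possible form that tolerates a CR before the line end; it is split on | into pattern and replacement the same way the script does.

```powershell
# Made-up sample text with Windows line endings
$inputText = "C:\path\DE`r`nC:\other\DEX`r`n"

# Pattern and replacement in one string, separated by | as in the script
$expression = '(?m)\\DE(?=\r?$)|\DE_02'
$parts = $expression.Split('|')

# -creplace is case-sensitive; the lookahead checks for the line end
# without consuming the CR, so the CRLF endings are kept.
$result = $inputText -creplace $parts[0], $parts[1]
```

Only the \DE at the end of the first line is replaced; \DEX in the second line is left alone.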
