Powershell regex replace line that contains ONLY certain characters

I read a file with Get-Content -Raw because of other operations I perform:
$c = Get-Content myfile.txt -Raw
I want to replace the entirety of each line that contains ONLY the characters "*" or "=" with "hare".
I tried
$c -replace "^[*=]*$","hare"
but that does not work. It works with simple string input, but not with my string, which contains CRLFs. (Other regex replace operations not involving character classes work fine.)
TEST:
Given an input file of three lines:
*=**
keep this line ***
***=
The output should be
hare
keep this line ***
hare
Tried many things, no luck.

You should use the (?m) inline option (RegexOptions.Multiline) to make ^ match at the start of each line and $ at the end of each line.
However, there is a caveat: in a .NET regex with the multiline option, the $ anchor matches only before an LF ("`n") character or at the very end of the string. With CRLF line endings, you therefore need to allow an optional (or, if it is always there, obligatory) CR before $.
You may use
$file -replace "(?m)^[*=]*\r?$", "hare"
Powershell test demo:
PS> $file = "*=**`r`nkeep this line ***`r`n***=`r`n***==Keep this line as is"
PS> $file -replace "(?m)^[*=]*\r?$", "hare"
hare
keep this line ***
hare
***==Keep this line as is
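One side effect worth noting: because \r? is part of the match, the replacement also swallows the CR, so each replaced line ends in a bare LF. A minimal sketch of one way around that, assuming you want to keep the CRLF and also leave empty lines alone (hence + instead of *), is to move the CR into a lookahead. The sample string is made up for illustration:

```powershell
# Sample input with CRLF line endings (illustrative data, not a real file).
$text = "*=**`r`nkeep this line ***`r`n***=`r`n"

# (?=\r?$) asserts "optional CR, then end of line" without consuming it,
# so the original line ending is left untouched.
$result = $text -replace '(?m)^[*=]+(?=\r?$)', 'hare'

$result
```

Whether empty lines should also become "hare" depends on your data; with * instead of + they would.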

Try this:
$c = get-content "myfile.txt" -raw
$c -split [environment]::NewLine | % { if( $_ -match "^[*= ]+$" ) { "hare" } else { $_ } }
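A variation on the same idea, as a sketch only: splitting on \r?\n instead of [Environment]::NewLine avoids depending on the platform's newline convention, and -join turns the lines back into one string. The input string and the choice of CRLF for re-joining are assumptions for the example:

```powershell
# Illustrative input; in practice this would come from Get-Content -Raw.
$c = "*=**`r`nkeep this line ***`r`n***="

# Split on CRLF or LF, then replace lines made up only of '*' and '='.
$lines = $c -split '\r?\n' | ForEach-Object {
    if ($_ -match '^[*=]+$') { 'hare' } else { $_ }
}

# Re-join into a single string (CRLF chosen here as an assumption).
$result = $lines -join "`r`n"
$result
```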

Related

Powershell - Regex match multiple lines from file

I am able to match and replace multiple lines if the text string is part of the PowerShell script:
$regex = @"
(?s)(--match from here--.*?
--up to here--)
"@
$text = @"
first line
--match from here--
other lines
--up to here--
last line
"@
$editedText = ($text -replace $regex, "")
$editedText | Set-Content ".\output.txt"
output.txt:
first line
last line
But if I instead read the text in from a file with Get-Content -Raw, the same regex fails to match anything.
$text = Get-Content ".\input.txt" -Raw
input.txt:
first line
--match from here--
other lines
--up to here--
last line
output.txt:
first line
--match from here--
other lines
--up to here--
last line
Why is this? What can I do to match the text read in from input.txt? Thanks in advance!
When you use a here-string, the pattern contains whatever newline characters the .ps1 file itself uses. It won't match if those differ from the newline characters used by the input file.
To remove this dependency, define a regex that uses \r?\n to match both CRLF and LF newlines:
$regex = "(?s)(--match from here--.*?\r?\n--up to here--)"
$text = Get-Content "input.txt" -Raw
$editedText = $text -replace $regex, ""
$editedText | Set-Content ".\output.txt"
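As a quick check of that claim, the sketch below runs the same pattern against an LF-only and a CRLF copy of the sample text. The strings are built in memory purely for illustration:

```powershell
$regex = '(?s)(--match from here--.*?\r?\n--up to here--)'

# The same text with LF-only and with CRLF line endings (made-up sample data).
$lf   = "first line`n--match from here--`nother lines`n--up to here--`nlast line"
$crlf = $lf -replace "`n", "`r`n"

$fromLf   = $lf   -replace $regex, ''
$fromCrlf = $crlf -replace $regex, ''
```

Note that the newline after the closing marker is not part of the match, so an empty line is left where the block used to be.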
Alternatively, you may use a switch-based solution, which lets you use simpler regex patterns:
$include = $true
& { switch -File 'input.txt' -RegEx {
'--match from here--' { $include = $false }
{ $include } { $_ } # Output line if $include equals $true
'--up to here--' { $include = $true }
}} | Set-Content 'output.txt'
The switch -File construct loops over all lines of the input file and passes each one to the match expressions.
When we find the 1st pattern we set an $include flag to $false, which causes the code to skip over all lines until after the 2nd pattern is found, which sets the $include flag back to $true.
Writing $_ on its own causes the current line to be output.
We pipe to Set-Content to reduce the memory footprint of the script. Instead of reading all lines into a variable in memory, we use a streaming approach where each processed line is immediately passed to Set-Content. Note that we can't pipe directly from a switch block, so as a workaround we wrap the switch inside a script block (& { ... } creates and calls the script block).
The idea has been adopted from this GitHub comment.
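The same flag logic can be tried without any files. In this sketch the switch iterates over an in-memory array (made-up sample lines) and its output is captured in a variable instead of being piped to Set-Content:

```powershell
# Illustrative lines; switch -File would read these from a file instead.
$lines = 'first line', '--match from here--', 'other lines', '--up to here--', 'last line'

$include = $true
$kept = switch -Regex ($lines) {
    '--match from here--' { $include = $false }   # start skipping
    { $include }          { $_ }                  # emit the line while not skipping
    '--up to here--'      { $include = $true }    # resume after the end marker
}

$kept
```

The order of the clauses matters: the start marker clears the flag before the output clause runs, and the end marker sets it only after the output clause has been skipped, so neither marker line is emitted.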

Powershell regex match sequence doesn't work although it matches in Sublime Text find and replace

I am trying to create a Powershell regex statement to remove the top five lines of this output from a git diff file that has already been modified with Powershell regex.
[1mdiff --git a/uk1.adoc b/uk2.adoc</span>+++
[1mindex b5d3bf7..90299b8 100644</span>+++
[1m--- a/uk1.adoc</span>+++
[1m+++ b/uk2.adoc</span>+++
[36m@@ -1,9 +1,9 @@</span>+++
= Heading
Body text
Image shown because binary code doesn't show in the text
The following statement matches the text so the '= Heading' line is placed at the top of the page if I replace with nothing.
^[^=]*.[+][\n]
But in Powershell, it isn't matching the text.
Get-Content "result2.adoc" | % { $_ -Replace '^[^=]*.[+][\n]', '' } | Out-File "result3.adoc";
Any ideas about why it doesn't work in Powershell?
My overall goal is to create a diff file of two versions of an AsciiDoc file and then replace the ASCII codes with HTML/CSS code to display the resulting AsciiDoc file with green/red track changes.
The simplest, and faster, approach is to read the input file as a single, multiline string with Get-Content -Raw and let the regex passed to -replace operate across multiple lines:
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)' |
Set-Content result3.adoc
(?s) activates in-line option s which makes . match newline (\n) characters too.
^.+?\n(?==) matches from the start of the string (^) any number of characters (including newlines) (.+), non-greedily (?), until a newline (\n) followed by a = is found.
(?=...) is a look-ahead assertion, which matches = without consuming it, i.e., without considering it part of the substring that matched.
Since no replacement operand is passed to -replace, the entire match is replaced with the implied empty string, i.e., what was matched is effectively removed.
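To see the pattern at work without a file, here is a small sketch on an in-memory string. The string is a made-up, uncolored stand-in for the diff output:

```powershell
# Made-up stand-in for the diff output: two header lines, then the AsciiDoc body.
$diff = "diff --git a/uk1.adoc b/uk2.adoc`nindex b5d3bf7..90299b8 100644`n= Heading`nBody text`n"

# Remove everything up to (but not including) the first line that starts with '='.
$body = $diff -replace '(?s)^.+?\n(?==)'
$body
```

Because there is no (?m) option, ^ matches only at the very start of the string, so later lines that begin with = are not affected.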
As for what you tried:
The -replace operator passes its LHS through if no match is found, so you cannot use it to filter out non-matching lines.
Even if you match an undesired line in full and replace it with '' (the empty string), it will show up as an empty line in the output when sent to Set-Content or Out-File (>).
As for your specific regex, ^[^=]*.[+][\n] (whether or not the first ^ is followed by an ESC (0x1b) char.):
[\n] (just \n would suffice) tries to match a newline char. after a literal + ([+]), yet lines read individually with Get-Content (without -Raw) by definition are stripped of their trailing newline, so the \n will never match; instead, use $ to match the end of a line.
Instead of % (the built-in alias for the ForEach-Object cmdlet) you could have used ? (the built-in alias for the Where-Object cmdlet) to perform the desired filtering:
Get-Content result2.adoc | ? { $_ -notmatch '^\e\[' }
$_ -notmatch '^\e\[' returns $True only for lines that don't start (^) with an ESC character (\e, whose code point is 0x1b) followed by a literal [ (escaped as \[), thereby effectively filtering out the lines before the = Heading line.
However, the multi-line -replace command at the top is a more direct and faster expression of your intent.
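For completeness, a sketch of the line-by-line filter on made-up data. The ESC character is built explicitly because it does not survive copy and paste:

```powershell
$esc = [string][char]27   # the ESC character that starts each ANSI color sequence

# Made-up sample lines: two colored header lines followed by the document body.
$lines = @(
    ($esc + '[1mdiff --git a/uk1.adoc b/uk2.adoc')
    ($esc + '[1mindex b5d3bf7..90299b8 100644')
    '= Heading'
    'Body text'
)

# Keep only the lines that do not start with ESC followed by '['.
$kept = $lines | Where-Object { $_ -notmatch '^\e\[' }
$kept
```

This drops every line that starts with an escape sequence, including any such lines in the body, which is one more reason to prefer the multi-line -replace for this particular task.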
Here is the code I ended up with after help from @mklement0. This Powershell script creates MS Word-style track changes for two versions of an AsciiDoc file. It creates the Diff file, uses regex to replace ASCII codes with HTML/CSS tags, removes the Diff header (thank you!), uses AsciiDoctor to create an HTML file and then PrinceXML to create a PDF file of the output that I can send to document reviewers.
git diff --color-words file1.adoc file2.adoc > result.adoc;
Get-Content "result.adoc" | % {
$_ -Replace '(=+ ?)([A-Za-z\s]+)(\[m)', '$1$2' `
-Replace '\[32m', '+++<span style="color: #00cd00;">' `
-Replace '\[31m', '+++<span style="color: #cd0000; text-decoration: line-through;">' `
-Replace '\[m', '</span>+++' } | Out-File -encoding utf8 "result2.adoc" ;
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)', '' | Out-File -encoding utf8 "result3.adoc" ;
asciidoctor result3.adoc -o result3.html;
prince result3.html --javascript -o result3.pdf;
Read-Host -Prompt "Press Enter to exit"

How to use regex to remove everything except certain "key"/"character containing"

Running my code gives me this output in a txt file:
19:27:28.636 ASSOS\032AB5601\0223-\032312DEEE8EB423._http._tcp.local. can
be reached at ASSOS-032DEEE8EB423.local.:80 (interface 1)
So I just want to parse out the string "ASSOS-032DEEE8EB423.local" and remove everything else from the txt file. I can't figure out how to use a regex to remove everything except the string containing ASSOS-. The string will always contain ASSOS-, but the rest keeps changing to different numbers, so I want to always get ASSOS-XXXXXXXXXXX.local
This is what I have tried:
$string = 'Get-Content C:\MyFile.Txt'
$pattern = ''
$string -replace $pattern, ' '
I don't know much about regex, so I'm not sure how to write a pattern that picks out the string containing "ASSOS-" and removes everything after ASSOS-XXXXXXXXXXX.local
I would pipe the file content to Select-String and return the values of matches for a string starting with "ASSOS-", ending with "local" and having whatever non-whitespace characters in between:
Get-Content test.txt | Select-String -Pattern "ASSOS-\S*local" | ForEach-Object {$_.Matches.Value}
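If the text is already in a variable, the same extraction can be sketched with [regex]::Matches. Escaping the dot and making the quantifier lazy are small tweaks of mine rather than part of the answer above, and the sample text is copied from the question:

```powershell
# Sample text from the question, joined with a newline.
$text = '19:27:28.636 ASSOS\032AB5601\0223-\032312DEEE8EB423._http._tcp.local. can' + "`n" +
        'be reached at ASSOS-032DEEE8EB423.local.:80 (interface 1)'

# Collect every "ASSOS-<non-whitespace>.local" occurrence.
$names = [regex]::Matches($text, 'ASSOS-\S*?\.local') | ForEach-Object { $_.Value }
$names
```

The first line contains ASSOS followed by a backslash rather than a hyphen, so only the host name on the second line matches.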
A possible solution:
$str = "19:27:28.636 ASSOS\032AB5601\0223-\032312DEEE8EB423._http._tcp.local. can
be reached at **ASSOS-032DEEE8EB423.local**.:80 (interface 1)"
$str -replace '.*\*\*(.*?)\*\*.*', '$1'
The RegEx .*\*\*(.*?)\*\*.* captures all characters within **...**. The * have to be escaped by a \ to make it work.

Extract certain values from string in .txt files with PowerShell

I'm trying to extract certain values from multiple lines inside a .txt file with PowerShell. I'm currently using multiple replace and remove calls, but that doesn't work as expected and is a bit too complex.
Is there a simpler way to do this?
My script:
$file = Get-Content "C:\RS232_COM2*"
foreach($line in $file){
$result1 = $file.replace(" <<< [NAK]#99","")
$result2 = $result1.remove(0,3) #this only works for the first line for some reason...
$result3 = $result2.replace("\(([^\)]+)\)", "") #this should remove the string within parentheses but doesn't work
.txt file:
29 09:10:16.874 (0133563471) <<< [NAK]#99[CAR]0998006798[CAR]
29 09:10:57.048 (0133603644) <<< [NAK]#99[CAR]0998019022[CAR]
29 09:59:56.276 (0136542798) <<< [NAK]#99[CAR]0998016987[CAR]
29 10:05:36.728 (0136883233) <<< [NAK]#99[CAR]0998050310[CAR]
29 10:55:36.792 (0139883179) <<< [NAK]#99[CAR]099805241D[CAR]0998028452[CAR]
29 11:32:16.737 (0142083132) <<< [NAK]#99[CAR]0998050289[CAR]0998031483[CAR]
29 11:34:16.170 (0142202566) <<< [NAK]#99[CAR]0998034787[CAR]
29 12:01:56.317 (0143862644) <<< [NAK]#99[CAR]0998005147[CAR]
The output I expect:
09:10:16.874 [CAR]0998006798[CAR]
09:10:57.048 [CAR]0998019022[CAR]
09:59:56.276 [CAR]0998016987[CAR]
10:05:36.728 [CAR]0998050310[CAR]
10:55:36.792 [CAR]099805241D[CAR]0998028452[CAR]
11:32:16.737 [CAR]0998050289[CAR]0998031483[CAR]
11:34:16.170 [CAR]0998034787[CAR]
12:01:56.317 [CAR]0998005147[CAR]
Or, more simply:
$Array = @()
foreach ($line in $file)
{
$Array += $line -replace '^..\s' -replace '\s\(.*\)' -replace '<<<.*#\d+'
}
$Array
Another option is to just grab the parts of a line you need with one regex and concat them:
$input_path = 'c:\data\in.txt'
$output_file = 'c:\data\out.txt'
$regex = '(\d+(?::\d+)+\.\d+).*?\[NAK]#99(.*)'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { [string]::Format("{0} {1}", $_.Groups[1].Value, $_.Groups[2].Value) } > $output_file
The regex is
(\d+(?::\d+)+\.\d+).*?\[NAK]#99(.*)
Details:
(\d+(?::\d+)+\.\d+) - Group 1: one or more digits followed with 1+ sequences of : and one or more digits, then . and again 1+ digits
.*?\[NAK]#99 - any 0+ chars other than newline as few as possible up to the first [NAK]#99 literal char sequence
(.*) - Group 2: the rest of the line
After we get all matches, $_.Groups[1].Value concatenated with $_.Groups[2].Value yields the expected output.
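The two groups can be checked on a single line with the -match operator and the automatic $Matches variable. This is a minimal sketch using the first line of the sample file:

```powershell
$regex = '(\d+(?::\d+)+\.\d+).*?\[NAK]#99(.*)'
$line  = '29 09:10:16.874 (0133563471) <<< [NAK]#99[CAR]0998006798[CAR]'

if ($line -match $regex) {
    # $Matches[1] is the timestamp, $Matches[2] the rest of the line after [NAK]#99.
    $out = '{0} {1}' -f $Matches[1], $Matches[2]
}
$out
```

The leading 29 is skipped because group 1 requires at least one colon-separated part after its first run of digits.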
Multiple issues.
Inside the loop you reference $file rather than $line. In the last operation, you're using the String.Replace() method with a regex pattern - something that method doesn't understand - use the -replace operator instead:
$file = Get-Content "C:\RS232_COM2*"
foreach($line in $file){
$line = $line.Replace(" <<< [NAK]#99","")
$line = $line.Remove(0,3)
# now use the -replace operator and output the result
$line -replace "\(([^\)]+)\)",""
}
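The difference between the two is easy to see on a small made-up string. A minimal sketch:

```powershell
$line = 'a (b) c'

# String.Replace() treats its first argument literally: the text '\(b\)' does
# not occur in $line, so nothing changes.
$literal = $line.Replace('\(b\)', '')

# The -replace operator interprets the same text as a regex that matches '(b)'.
$viaRegex = $line -replace '\(b\)', ''
```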
You could do it all in one regular expression replacement:
$line -replace '\(\d{10}\)\ <<<\s+\[NAK]\#99',''

Powershell Regex acting per-line rather than on entire string

Given foo.txt
this is a file
it has some text
the text has three lines
The following regex replacement
(get-content -raw foo.txt) -replace ".*", "hello" | write-output
produces the output
hellohello
hellohello
hellohello
rather than the desired
hello
My understanding was that get-content returns the content as an array of strings, one per line. The -raw flag replaces this behavior with returning the contents as a single string. As far as I know, ".*" should match the entire string, but instead it matches twice on each line.
Please advise.
Use the inline (?s) (dotall) modifier which forces . to span across newlines.
(Get-Content .\foo.txt -Raw) -replace "(?s).+", "hello"
Example:
PS> $data = Get-Content .\foo.txt -Raw
PS> $data
this is a file
it has some text
the text has three lines
PS> $data -replace "(?s).+", "hello"
hello
I can't explain it other than to say that . appears not to be matching newline characters so you get one match for each complete line then one match for the zero characters at the end of each line.
This also explains why .+ produces hello once per line.
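The doubling can be reproduced on a tiny in-memory string (made up for illustration), which makes the empty matches easier to see:

```powershell
# '.' does not match LF, so '.*' first matches each line's text and then matches
# again as an empty string just before the LF (and at the very end of the input).
$doubled = "ab`ncd" -replace '.*', 'X'

# '.+' requires at least one character, so the empty matches disappear.
$single = "ab`ncd" -replace '.+', 'X'
```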
You can "fix" this by using a better pattern that does match the newline characters.
(Get-Content -raw .\foo.txt) -replace "(.|\r|\n)+", "hello"
From https://stackoverflow.com/a/13674250/1252649,
The trick around DotAll mode is to use [\s\S] instead of .. This character class matches any character ...
Of course, this raises the question as to what exactly . is supposed to match other than 'any character'.
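To that last question: in .NET regexes, . without any options matches every character except LF (\n); with (?s) it matches LF as well. The sketch below compares the three variants on an in-memory copy of the sample text, with CRLF line endings assumed:

```powershell
$text = "this is a file`r`nit has some text`r`nthe text has three lines"

$plain  = $text -replace '.+', 'hello'        # '.' stops at each LF
$dotall = $text -replace '(?s).+', 'hello'    # '.' also matches LF
$class  = $text -replace '[\s\S]+', 'hello'   # the class matches any character
```

In the plain case the CR is matched by . and therefore replaced, which is why only bare LFs remain between the three hellos.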