Using powershell to search for a pattern - regex

I am trying to write a powershell script to search for a pattern in a text file. Specifically I am looking at reading a file line by line and returning any line that has a space at the 32nd character position.
I have this so far but it just returns all lines that have white space. I need to narrow it down to the 32nd position
Get-Content -path C:\VM_names.txt | Where-Object {$_ -match "\s+"}

Use this pattern:
-match '^.{31} '
Explanation:
^ - beginning of the string
. - any character
{31} - repeated 31 times
(space character) - a literal space

This is actually really easy to do. By default, Get-Content reads a text file as an array of strings (individual lines), unless you use the -Raw parameter, which reads it as a single string. You can use the -match PowerShell operator to "match" the lines that meet your regular expression.
(Get-Content -Path c:\VM_names.txt) -match '^.{31}\s'
The result of the above command is an array of lines that match the desired regular expression.
NOTE: The call to Get-Content must be wrapped in parentheses, otherwise the PowerShell parser will think that -match is a parameter on that command.
NOTE2: As a good practice, use single quotes around strings, unless you specifically know that you need double quotes. You'll save yourself from accidental interpolation.
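As a quick check of the pattern, here is a minimal sketch that runs it against a few in-memory lines instead of a file. The sample strings and variable names are made up for illustration; only the regex comes from the answers above.

```powershell
# Hypothetical sample lines; only the second one has a space as its 32nd character.
$lines = @(
    ('A' * 40)
    (('B' * 31) + ' tail')
    'short line with spaces'
)

# With an array on the left-hand side, -match acts as a filter
# and returns the matching elements rather than a Boolean.
$hits = $lines -match '^.{31} '

$hits
```

Swapping the trailing space for `\s` would also accept a tab or other whitespace at that position.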

Related

Regex to remove enter from line starting with specific character in Powershell

I have a huge CSV file with data, and some of the lines are broken: they contain stray line breaks (Enter characters). When the file is imported into Excel, I need to correct hundreds of lines manually. I have a regex that works in Notepad++ and removes line breaks before lines that do not start with a specific string, in this case ";". However, the same regex does not work in a PowerShell script.
Example of input
;BP;7165378;XX_RAW;200SSS952;EU-PL;PL02;PL02;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
15:00:00;;;;Jhon Name;;;;;;;;9444253;;;;;;;;;;;;;"Jhon Name";;;;;;;;;;Jhon Name;;;;;;;;Final Check Approved;;;;;;;;;09.01.2023;;;;;Approve;;;;;;12077;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
How it should look:
;BP;7165378;XX_RAW;200SSS952;EU-PL;PL02;PL02;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;15:00:00;;;;Jhon Name;;;;;;;;9444253;;;;;;;;;;;;;"Jhon Name";;;;;;;;;;Jhon Name;;;;;;;;Final Check Approved;;;;;;;;;09.01.2023;;;;;Approve;;;;;;12077;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Code:
$content = Get-Content -path "C:\Users\TUF17\Desktop\File\Fix\xx_fix_temp.csv"
$content -Replace '"\R(?!;)"', ' ' | Out-File "C:\Users\TUF17\Desktop\File\Fix\xx_noenters.csv"
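The regex never gets to see the line breaks: without -Raw, Get-Content returns the file as an array of lines whose newlines have already been stripped, and -replace is then applied to each line separately.
Add -Raw to get the content of the file as a single string rather than an array of strings, so that the replacement can span lines.
I'm assuming it's a .csv file you are using.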
$content = Get-Content -Path "C:\Users\TUF17\Desktop\File\Fix\xx_fix_temp.csv" -Raw
$content -Replace '(?m)(^[^;].*)\r?\n(?!;)', '$1 ' | Out-File "C:\Users\TUF17\Desktop\File\Fix\xx_noenters.csv"
Building on the helpful comments on the question:
In order to perform replacements across lines of a text file, you need to either read the file in full - with Get-Content -Raw - or perform stateful line-by-line processing, such as with the -File parameter of a switch statement.
Note: While you could also do stateful line-by-line processing by combining Get-Content (without -Raw) with a ForEach-Object call, such a solution would be much slower - see this answer.
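The stateful line-by-line alternative mentioned above could look roughly like this. It is only a sketch under assumptions: the temporary file, its three sample lines, and the rule "a line starting with ; begins a new record" are illustrative stand-ins for the real CSV.

```powershell
# Hypothetical temp file with one record broken across two lines.
$inFile = [System.IO.Path]::GetTempFileName()
Set-Content -LiteralPath $inFile -Value ';A;1;', 'broken;tail', ';B;2;'

$joined = & {
    $buffer = $null
    switch -File $inFile {
        default {
            if ($_.StartsWith(';')) {
                # A new record starts: emit the previous one, if any.
                if ($null -ne $buffer) { $buffer }
                $buffer = $_
            } else {
                # Continuation line: glue it onto the record being built.
                $buffer += $_
            }
        }
    }
    # Emit the last record.
    if ($null -ne $buffer) { $buffer }
}

Remove-Item -LiteralPath $inFile
$joined
```

Only one record is held in memory at a time, which is the main reason to prefer this over -Raw for very large files.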
Your regex, '"\R(?!;)"', has two problems:
It accidentally uses embedded " quoting. Use only '...' quoting. PowerShell has no special syntax for regex literals - it simply uses strings.
To avoid confusion with PowerShell's own up-front string interpolation, it is better to use verbatim '...' strings rather than expandable (interpolating) "..." strings - see the conceptual about_Quoting_Rules help topic.
\R is an unsupported regex escape sequence; you presumably meant \r, i.e. a CR char. (CARRIAGE RETURN, U+000D)
If you instead want to match CRLF, a Windows-format newline sequence, use \r\n
If you want to match LF (LINE FEED, U+000A) alone (a Unix-format newline), use \n
If you want to match both newline formats, use \r?\n
As an aside: While use of CR alone is rare in practice, PowerShell treats stand-alone CR characters as newlines as well, which is why Get-Content without -Raw, which reads line by line (as you've tried), wouldn't work: the newlines are stripped before the regex ever sees them.
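The points about newline escapes can be checked with a small sketch; the sample strings and variable names below are made up for illustration.

```powershell
# Two hypothetical inputs: one with Windows (CRLF) and one with Unix (LF) newlines.
$crlf = "a`r`nb"
$lf   = "a`nb"

# '\r?\n' handles both newline formats.
$partsCrlf = $crlf -split '\r?\n'
$partsLf   = $lf   -split '\r?\n'

# '\R' is not a recognized escape in .NET regexes, so using it raises an error.
$rThrows = $false
try { $null = 'x' -match '\R' } catch { $rThrows = $true }

"CRLF parts: $($partsCrlf -join ',')"
"LF parts:   $($partsLf -join ',')"
"\R throws:  $rThrows"
```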
Get-Content -Raw solution (easier and faster than switch -File, but requires the whole file to fit into memory twice):
# Adjust the '\r' part as needed (see above).
(Get-Content -Raw -LiteralPath $inFile) -replace '\r(?!;)' |
Set-Content -NoNewLine -Encoding utf8 -LiteralPath $outFile
Note:
By not specifying a substitution operand to -replace, the command removes all newlines not followed by a ; ((?!;)), effectively joining the line that follows the CR directly to the previous line, which is the desired behavior based on your sample output.
For saving text, Set-Content is a bit faster than Out-File (it'll make no appreciable difference here, given that only a single, large string is written).
-NoNewLine prevents a(n additional) trailing newline from getting appended to the file.
-Encoding utf8 specifies the output character encoding. Note that PowerShell never preserves the input character encoding, so unless you use -Encoding on output, you'll get the respective cmdlet's default character encoding, which in Windows PowerShell varies from cmdlet to cmdlet; in PowerShell (Core) 7+, the consistent default is now BOM-less UTF-8. Note that in Windows PowerShell -Encoding utf8 always creates a file with a BOM; see this answer for background information and workarounds.
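To see the effect without touching a file, here is a minimal sketch on an in-memory string. The sample data is invented, and the pattern is a variation rather than the exact one above: it uses \r?\n to cover both newline formats and adds \z to the lookahead so that a trailing newline at the very end of the input is kept.

```powershell
# Hypothetical sample: the first record is broken before '15:00:00'.
$text = ";A;1;`r`n15:00:00;x`r`n;B;2;`r`n"

# Remove every newline that is followed neither by ';' nor by the end of the input.
# No replacement operand is given, so the matched newlines are simply deleted.
$fixed = $text -replace '\r?\n(?!;|\z)'

$fixed
```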

Replace text + optional newline in file

I've been through other similar questions and tried their advice, but it wouldn't help.
I'm trying to delete a specific line of text in a text file.
My code which works
(Get-Content -Path "MyPath.txt" -Raw).Replace('this is the line', '') | Set-Content "MyPath.txt" -Encoding UTF8
Now this works but leaves an ugly empty line in the text file. I wanted to also replace an optional newline character by adding this regex at the end of the line
\n?
and this wouldn't work. The other threads made other recommendations and I've tried all combinations but just can't match. I'm using windows style ending (CRLF)
Both using -Raw and not using it
\n
\r\n
`n
`r`n
I haven't even added the regex question mark at the end (or non-capturing group in case it needs the \r\n syntax).
The [string] type's .Replace() method doesn't support regexes (regular expressions), whereas PowerShell's -replace operator does.
However, the simplest solution in this case is to take advantage of the fact that the -ne operator acts as a filter with an array-valued LHS (as other comparison operators do):
@(Get-Content -Path MyPath.txt) -ne 'this is the line' |
Set-Content MyPath.txt -Encoding UTF8
Note how Get-Content is called without -Raw in order to return an array of lines, from which -ne then filters out the line of (non)-interest; @(...), the array-subexpression operator, ensures that the output is an array even if the file happens to contain just one line.
The assumption is that string 'this is the line' matches the whole line (case-insensitively).
If that is not the case, instead of -ne you could use -notlike with a wildcard expression or -notmatch with a regex (e.g., -notmatch 'this is the line' or -notlike '*this is the line*').
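The difference between these approaches can be sketched on in-memory data. The sample lines and variable names are made up; the last statement shows the regex route the question was originally attempting, under the assumption that the file uses CRLF or LF newlines.

```powershell
# Hypothetical file content as an array of lines.
$lines = 'keep me', 'this is the line', 'prefix: this is the line (suffix)', 'keep me too'

$exact     = @($lines) -ne 'this is the line'            # drops only whole-line matches
$substring = @($lines) -notmatch 'this is the line'      # drops any line containing the text
$wildcard  = @($lines) -notlike '*this is the line*'     # same, with a wildcard pattern

# Regex route on a single string: -replace (unlike .Replace()) understands
# the optional trailing newline.
$raw      = "first`r`nthis is the line`r`nlast"
$viaRegex = $raw -replace 'this is the line(\r?\n)?'

"exact: $($exact.Count), substring: $($substring.Count), wildcard: $($wildcard.Count)"
```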

PowerShell regex does not match near newline

I have an exe output in form
Compression : CCITT Group 4
Width : 3180
and try to extract CCITT Group 4 to $var with PowerShell script
$var = [regex]::match($exeoutput,'Compression\s+:\s+([\w\s]+)(?=\n)').Groups[1].Value
The tester at http://regexstorm.net/tester says the regex Compression\s+:\s+([\w\s]+)(?=\n) is correct, but it does not match in PowerShell. How can I write the regex correctly?
You want to get all text from some specific pattern till the end of the line. So, you do not even need the lookahead (?=\n), just use .+, because . matches any char but a newline (LF) char:
$var = [regex]::match($exeoutput,'Compression\s+:\s+(.+)').Groups[1].Value
Or, you may use a -match operator and after the match is found access the captured value using $matches[1]:
$exeoutput -match 'Compression\s*:\s*(.+)'
$var = $matches[1]
Wiktor Stribiżew's helpful answer simplifies your regex and shows you how to use PowerShell's -match operator as an alternative.
Your follow-up comment about piping to Out-String fixing your problem implies that your problem was that $exeOutput contained an array of lines rather than a single, multiline string.
This is indeed what happens when you capture the output from a call to an external program (*.exe): PowerShell captures the stdout output lines as an array of strings (the lines without their trailing newline).
As an alternative to converting array $exeOutput to a single, multiline string with Out-String (which, incidentally, is slow[1]), you can use a switch statement to operate on the array directly:
# Stores 'CCITT Group 4' in $var
$var = switch -regex ($exeOutput) { 'Compression\s+:\s+(.+)' { $Matches[1]; break } }
Alternatively, given the specific format of the lines in $exeOutput, you could leverage the ConvertFrom-StringData cmdlet, which can perform parsing the lines into key-value pairs for you, after having replaced the : separator with =:
$var = ($exeoutput -replace ':', '=' | ConvertFrom-StringData).Compression
[1] Use of a cmdlet is generally slower than using an expression; with a string array $array as input, you can achieve what $array | Out-String does more efficiently with $array -join "`n", though note that Out-String also appends a trailing newline.
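The array-versus-string distinction can be reproduced with a short sketch. The $exeOutput array below simulates captured program output using the two sample lines from the question; the other variable names are illustrative.

```powershell
# Simulated capture of external-program output: an array of lines.
$exeOutput = 'Compression : CCITT Group 4', 'Width : 3180'

# With an array on the left, -match filters and returns the matching lines;
# it does not populate $Matches.
$filtered = $exeOutput -match 'Compression\s*:\s*(.+)'

# Joining the lines yields a single multiline string, so -match returns a
# Boolean and the capture group becomes available in $Matches.
$joinedText = $exeOutput -join "`n"
$var = if ($joinedText -match 'Compression\s*:\s*(.+)') { $Matches[1] }

$var
```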

Powershell regex match sequence doesn't work although it matches in Sublime Text find and replace

I am trying to create a Powershell regex statement to remove the top five lines of this output from a git diff file that has already been modified with Powershell regex.
[1mdiff --git a/uk1.adoc b/uk2.adoc</span>+++
[1mindex b5d3bf7..90299b8 100644</span>+++
[1m--- a/uk1.adoc</span>+++
[1m+++ b/uk2.adoc</span>+++
[36m@@ -1,9 +1,9 @@</span>+++
= Heading
Body text
Image shown because binary code doesn't show in the text
The following statement matches the text so the '= Heading' line is placed at the top of the page if I replace with nothing.
^[^=]*.[+][\n]
But in Powershell, it isn't matching the text.
Get-Content "result2.adoc" | % { $_ -Replace '^[^=]*.[+][\n]', '' } | Out-File "result3.adoc";
Any ideas about why it doesn't work in Powershell?
My overall goal is to create a diff file of two versions of an AsciiDoc file and then replace the ASCII codes with HTML/CSS code to display the resulting AsciiDoc file with green/red track changes.
The simplest - and faster - approach is to read the input file as a single, multiline string with Get-Content -Raw and let the regex passed to -replace operate across multiple lines:
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)' |
Set-Content result3.adoc
(?s) activates in-line option s which makes . match newline (\n) characters too.
^.+?\n(?==) matches from the start of the string (^) any number of characters (including newlines) (.+), non-greedily (?), until a newline (\n) followed by a = is found.
(?=...) is a look-ahead assertion, which matches = without consuming it, i.e., without considering it part of the substring that matched.
Since no replacement operand is passed to -replace, the entire match is replaced with the implied empty string, i.e., what was matched is effectively removed.
As for what you tried:
The -replace operator passes its LHS through if no match is found, so you cannot use it to filter out non-matching lines.
Even if you match an undesired line in full and replace it with '' (the empty string), it will show up as an empty line in the output when sent to Set-Content or Out-File (>).
As for your specific regex, ^[^=]*.[+][\n] (whether or not the first ^ is followed by an ESC (0x1b) char.):
[\n] (just \n would suffice) tries to match a newline char. after a literal + ([+]), yet lines read individually with Get-Content (without -Raw) by definition are stripped of their trailing newline, so the \n will never match; instead, use $ to match the end of a line.
Instead of % (the built-in alias for the ForEach-Object cmdlet) you could have used ? (the built-in alias for the Where-Object cmdlet) to perform the desired filtering:
Get-Content result2.adoc | ? { $_ -notmatch '^\e\[' }
$_ -notmatch '^\e\[' returns $True only for lines that don't start (^) with an ESC character (\e, whose code point is 0x1b) followed by a literal [ (escaped as \[), thereby effectively filtering out the lines before the = Heading line.
However, the multi-line -replace command at the top is a more direct and faster expression of your intent.
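Both approaches can be tried on an in-memory string. This is a sketch: the shortened diff header below is a made-up stand-in for the real file, with the ESC characters rebuilt explicitly because they don't survive copy and paste.

```powershell
$esc = [char]0x1b

# Hypothetical, shortened stand-in for result2.adoc.
$raw = @(
    "${esc}[1mdiff --git a/uk1.adoc b/uk2.adoc"
    "${esc}[36m@@ -1,9 +1,9 @@"
    '= Heading'
    'Body text'
) -join "`n"

# Whole-string approach: remove everything up to the first line that starts with '='.
$body = $raw -replace '(?s)^.+?\n(?==)'

# Line-by-line approach: keep only the lines that don't start with ESC + '['.
$kept = ($raw -split "`n") -notmatch '^\e\['

$body
```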
Here is the code I ended up with after help from @mklement0. This PowerShell script creates MS Word-style track changes for two versions of an AsciiDoc file. It creates the Diff file, uses regex to replace ASCII codes with HTML/CSS tags, removes the Diff header (thank you!), uses AsciiDoctor to create an HTML file and then PrinceXML to create a PDF file of the output that I can send to document reviewers.
git diff --color-words file1.adoc file2.adoc > result.adoc;
Get-Content "result.adoc" | % {
$_ -Replace '(=+ ?)([A-Za-z\s]+)(\[m)', '$1$2' `
-Replace '\[32m', '+++<span style="color: #00cd00;">' `
-Replace '\[31m', '+++<span style="color: #cd0000; text-decoration: line-through;">' `
-Replace '\[m', '</span>+++' } | Out-File -encoding utf8 "result2.adoc" ;
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)', '' | Out-File -encoding utf8 "result3.adoc" ;
asciidoctor result3.adoc -o result3.html;
prince result3.html --javascript -o result3.pdf;
Read-Host -Prompt "Press Enter to exit"
Here's a screenshot of the result using some text from Wikipedia:

Regex in Powershell fails to check for newlines

I'm trying to get the first block of releasenotes...
(See sample content in the code)
Whenever I use something simple it works; it only breaks when I try to search across multiple lines (\n). I'm using (Get-Content $changelog | Out-String) because that gives back 1 string instead of an array from each line.
$changelog = 'C:\Source\VSTS\AcmeLab\AcmeLab Core\changelog.md'
$regex = '([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
(Get-Content $changelog | Out-String) | Select-String -Pattern $regex -AllMatches
<#
SAMPLE:
------
v1.0.23
- Adds an IContainer API.
- Bugfixes.
v1.0.22
- Hotfix: Language operators.
v1.0.21
- Support duplicate query parameters.
v1.0.20
- Splitting up the ICommand interface.
- Fixing the referrer header empty field value.
#>
The result I need is:
v1.0.23
- Adds an IContainer API.
- Bugfixes.
Update:
Using options..
$changelog = 'C:\Source\VSTS\AcmeLab\AcmeLab Core\changelog.md'
$regex = '(?smi)([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
Get-Content -Path $changelog -Raw | Select-String -Pattern $regex -AllMatches
I also get nothing.. (no matter if I use \n or \r\n)
Unless you're stuck with PowerShell v2, it's simpler and more efficient to use Get-Content -Raw to read an entire file as a single string; besides, Out-String adds an extra newline to the string.[1]
Since you're only looking for the first match, you can use the -match operator - no need for Select-String's -AllMatches switch.
Note: While you could use Select-String without it, it is more efficient to use the -match operator, given that you've read the entire file into memory already.
Regex matching is case-insensitive by default in PowerShell, consistent with PowerShell's overall case-insensitivity.
Thus, the following returns the first block, if any:
if ((Get-Content -Raw $changelog) -match '(?m)^v\d+\.\d+\.\d+.*(\r?\n-\s?.*)+') {
# Match found - output it.
$Matches[0]
}
(?m) turns on inline regex option m (multi-line), which causes anchors ^ and $ to match the beginning and end of individual lines rather than the overall string's.
\r?\n matches both CRLF and LF-only newlines.
You could make the regex slightly more efficient by making the (...) subexpression non-capturing, given that you're not interested in what it captured: (?:...).
Note that -match itself returns a Boolean (with a scalar LHS), but information about the match is recorded in the automatic $Matches hashtable variable, whose 0 entry contains the overall match.
As for what you tried:
'([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
doesn't work, because by default $ only matches at the very end of the input string, at the end of the last line (though possibly before a final newline).
To make $ match the end of each line, you'd have to turn on the multiline regex option (which you did in your 2nd attempt).
As a result, nothing matches.
'(?smi)([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
doesn't work as intended, because by using option s (single-line) you've made . match newlines too, so that a greedy subexpression such as .* will match the remainder of the string, across lines.
As a result, everything from the first block on matches.
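The three behaviors described above can be compared on a shortened copy of the sample changelog. This is a sketch under assumptions: the text is built in memory with LF-only newlines and has no trailing newline, and the variable names are illustrative. With CRLF input, `.*` would also pick up the CR at the end of the last matched line.

```powershell
# Shortened sample changelog, joined with LF newlines (an assumption of this sketch).
$changelog = @(
    'v1.0.23'
    '- Adds an IContainer API.'
    '- Bugfixes.'
    'v1.0.22'
    '- Hotfix: Language operators.'
) -join "`n"

# 1st attempt: without option m, ^ only matches at the start of the string, so nothing matches.
$naive = $changelog -match '([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'

# 2nd attempt: option s lets the greedy .* run across lines, past the first block.
$tooMuch = if ($changelog -match '(?smi)([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+') { $Matches[0] }

# Suggested pattern, with the group made non-capturing.
$firstBlock = if ($changelog -match '(?m)^v\d+\.\d+\.\d+.*(?:\r?\n-\s?.*)+') { $Matches[0] }

$firstBlock
```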
[1] This problematic behavior is discussed in GitHub issue #14444.