Looking for a function that removes all comments from a script [duplicate] - regex

I'm looking for a way to strip all comments from a file. There are various ways to write comments, but I'm only interested in the simple # form. The reason is that I only use <# #> for in-function .SYNOPSIS blocks, which are functional code as opposed to mere comments, so I want to keep those.
EDIT: I have updated this question using the helpful answers below.
So there are only a couple of scenarios that I need:
a) whole-line comments with # at the start of the line (or possibly with whitespace before it); a regex of ^\s*# seems to work.
b) lines with some code at the start and then a comment at the end of the line.
I want to avoid stripping lines that have e.g. Write-Host "#####" but I think this is covered in the code that I have.
I was able to remove end-of-line comments with a split as I couldn't work out how to do it with regex, does anyone know a way to achieve that with regex?
The split was not ideal, as a <# on a line would be removed by the -split, but I've fixed that by splitting on " #". This is not perfect but might be good enough; maybe a more reliable regex-based way exists?
When I run the below against my 7,000-line script, it works(!) and strips a huge number of comments, BUT the output file almost doubles in size(!?), from 400 KB to about 700 KB. Does anyone understand why that happens and how to prevent it? Is it something to do with BOMs or Unicode? Out-File seems to really balloon the file size!
$x = Get-Content ".\myscript.ps1" # $x is an array, not a string
$out = ".\myscript.ps1"
$x = $x -split "[\r\n]+" # Split on runs of consecutive line-breaks, in any format; '-split "\r?\n|\r"' would instead split line by line
$x = $x | ? { $_ -notmatch "^\s*$" } # Remove empty lines
$x = $x | ? { $_ -notmatch "^\s*#" } # Remove all lines starting with ; including with whitespace before
$x = $x | % { ($_ -split " #")[0] } # Remove end of line comments
$x = $x | % { $_.Trim() } # Remove whitespace only at start and end of each line
$x | Out-File $out
# $x | more

Honestly, the best approach to identify and process all comments is to use PowerShell's language parser or one of the AST classes. I don't know offhand which AST node contains comments, so here is an uglier way that will filter out both block and line comments.
$code = Get-Content file.txt -Raw
$comments = [System.Management.Automation.PSParser]::Tokenize($code,[ref]$null) |
Where Type -eq 'Comment' | Select -Expand Content
$regex = ( $comments |% { [regex]::Escape($_) } ) -join '|'
# Output to remove all empty lines
$code -replace $regex -split '\r?\n' -notmatch '^\s*$'
# Output that removes only beginning and ending blank lines
($code -replace $regex).Trim()

Do the inverse of your example: Only emit lines that do NOT match:
## Output to console
Get-Content .\file.ps1 | Where-Object { $_ -notmatch '#' }
## Output to file
Get-Content .\file.ps1 | Where-Object { $_ -notmatch '#' } | Out-File .\newfile.ps1 -Append

Based on @AdminOfThings' helpful answer using the PSParser tokenizer approach, but avoiding any regular expressions:
$Code = Get-Content file.txt -Raw # Or any ScriptBlock
$Code = $Code.ToString() # Prepare any ScriptBlock for the Substring method
$Tokens = [System.Management.Automation.PSParser]::Tokenize($Code, [ref]$null)
-Join $Tokens.Where{ $_.Type -ne 'Comment' }.ForEach{ $Code.Substring($_.Start, $_.Length) }

As for the incidental problem of the size of the output file being roughly double that of the input file:
As AdminOfThings points out, Out-File in Windows PowerShell defaults to UTF-16LE ("Unicode") encoding, where characters are represented by (at least) two bytes, whereas ANSI encoding, as used by Set-Content in Windows PowerShell by default, encodes all (supported) characters in a single byte. Similarly, UTF-8-encoded files use only one byte for characters in the ASCII range (note that PowerShell (Core) 7+ now consistently defaults to (BOM-less) UTF-8). Use the -Encoding parameter as needed.
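For instance, a minimal sketch using the file names from the question (encoding behavior as described above for Windows PowerShell):
$x | Out-File $out -Encoding utf8   # UTF-8 (written with a BOM in Windows PowerShell)
# or rely on Set-Content's single-byte ANSI default:
$x | Set-Content $out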
A regex-based solution to your problem is never fully robust, even if you try to limit the comment removal to single-line comments.
For full robustness, you must indeed use PowerShell's language parser, as noted in the other answers.
However, care must be taken when reconstructing the original source code with the comments removed:
AdminOfThings's answer risks removing too much, given the subsequent global regex-based processing with -replace: while the scenario may be unlikely, if a comment is repeated inside a string, it would mistakenly be removed from there too.
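A contrived sketch of that risk: if the same comment text also occurs inside a string, the global -replace strips it there too:
# Hypothetical input line; the comment token's text is the trailing '# note':
Write-Host "this string also says # note" # note
# After $code -replace $regex, the line would be left as roughly:
Write-Host "this string also says "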
iRon's answer risks syntax errors by joining the tokens without spaces, so that . .\foo.ps1 would turn into ..\foo.ps1, for instance. Blindly putting a space between tokens is not an option, because the property-access syntax would break (e.g. $host.Name would turn into $host . Name, but whitespace between a value and the . operator isn't allowed)
The following solution avoids these problems, while trying to preserve the formatting of the original code as much as possible, but this has limitations, because intra-line whitespace isn't reported by the parser:
This means that you can't tell whether whitespace between tokens on a given line is made up of tabs, spaces, or a mix of both. The solution below replaces any tab characters with 2 spaces before processing; adjust as needed.
To somewhat compensate for the removal of comments occupying their own line(s), more than 2 consecutive blank or empty lines are folded into a single empty one. It is possible to remove blank/empty lines altogether, but that could hurt readability.
# Tokenize the file content.
# Note that tabs, if any, are replaced by 2 spaces first; adjust as needed.
$tokens = $null
$null = [System.Management.Automation.Language.Parser]::ParseInput(
  ((Get-Content -Raw .\myscript.ps1) -replace '\t', '  '),
  [ref] $tokens,
  [ref] $null
)

# Loop over all tokens while omitting comments, and rebuild the source code
# without them, trying to preserve the original formatting as much as possible.
$sb = [System.Text.StringBuilder]::new()
$prevExtent = $null; $numConsecNewlines = 0
$tokens.
  Where({ $_.Kind -ne 'Comment' }).
  ForEach({
    $startColumn = if ($_.Extent.StartLineNumber -eq $prevExtent.StartLineNumber) { $prevExtent.EndColumnNumber }
                   else { 1 }
    if ($_.Kind -eq 'NewLine') {
      # Fold multiple blank or empty lines into a single empty one.
      if (++$numConsecNewlines -ge 3) { return }
    } else {
      $numConsecNewlines = 0
      $null = $sb.Append(' ' * ($_.Extent.StartColumnNumber - $startColumn))
    }
    $null = $sb.Append($_.Text)
    $prevExtent = $_.Extent
  })

# Output the result.
# Pipe to Set-Content as needed.
$sb.ToString()

Related

PowerShell regex does not match near newline

I have an exe output in the form
Compression : CCITT Group 4
Width : 3180
and I'm trying to extract CCITT Group 4 into $var with a PowerShell script:
$var = [regex]::match($exeoutput,'Compression\s+:\s+([\w\s]+)(?=\n)').Groups[1].Value
The tester at http://regexstorm.net/tester says the regex Compression\s+:\s+([\w\s]+)(?=\n) is correct, but PowerShell does not match. How can I write the regex correctly?
You want to get all text from some specific pattern till the end of the line. So, you do not even need the lookahead (?=\n), just use .+, because . matches any char but a newline (LF) char:
$var = [regex]::match($exeoutput,'Compression\s+:\s+(.+)').Groups[1].Value
Or, you may use a -match operator and after the match is found access the captured value using $matches[1]:
$exeoutput -match 'Compression\s*:\s*(.+)'
$var = $matches[1]
Wiktor Stribiżew's helpful answer simplifies your regex and shows you how to use PowerShell's -match operator as an alternative.
Your follow-up comment about piping to Out-String fixing your problem implies that your problem was that $exeOutput contained an array of lines rather than a single, multiline string.
This is indeed what happens when you capture the output from a call to an external program (*.exe): PowerShell captures the stdout output lines as an array of strings (the lines without their trailing newline).
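A quick way to verify this (a minimal sketch; any external program will do):
$exeOutput = cmd /c 'echo Compression : CCITT Group 4 & echo Width : 3180'
$exeOutput.GetType().Name   # -> Object[], an array of lines, not a single String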
As an alternative to converting array $exeOutput to a single, multiline string with Out-String (which, incidentally, is slow[1]), you can use a switch statement to operate on the array directly:
# Stores 'CCITT Group 4' in $var
$var = switch -regex ($exeOutput) { 'Compression\s+:\s+(.+)' { $Matches[1]; break } }
Alternatively, given the specific format of the lines in $exeOutput, you could leverage the ConvertFrom-StringData cmdlet, which can parse the lines into key-value pairs for you, once the : separator has been replaced with =:
$var = ($exeoutput -replace ':', '=' | ConvertFrom-StringData).Compression
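Note, though, that -replace ':', '=' swaps every colon, which would corrupt values that themselves contain colons (e.g. a hypothetical Time : 12:30:00 line); a variant that replaces only the first colon on each line:
$var = ($exeoutput -replace '^(.*?):', '$1=' | ConvertFrom-StringData).Compression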
[1] Use of a cmdlet is generally slower than using an expression; with a string array $array as input, you can achieve what $array | Out-String does more efficiently with $array -join "`n", though note that Out-String also appends a trailing newline.

PowerShell regex match sequence doesn't work although it matches in Sublime Text find and replace

I am trying to create a PowerShell regex statement to remove the top five lines of this output from a git diff file that has already been modified with PowerShell regex.
[1mdiff --git a/uk1.adoc b/uk2.adoc</span>+++
[1mindex b5d3bf7..90299b8 100644</span>+++
[1m--- a/uk1.adoc</span>+++
[1m+++ b/uk2.adoc</span>+++
[36m## -1,9 +1,9 ##</span>+++
= Heading
Body text
Image shown because binary code doesn't show in the text
The following statement matches the text so the '= Heading' line is placed at the top of the page if I replace with nothing.
^[^=]*.[+][\n]
But in Powershell, it isn't matching the text.
Get-Content "result2.adoc" | % { $_ -Replace '^[^=]*.[+][\n]', '' } | Out-File "result3.adoc";
Any ideas about why it doesn't work in PowerShell?
My overall goal is to create a diff file of two versions of an AsciiDoc file and then replace the ANSI escape codes with HTML/CSS code to display the resulting AsciiDoc file with green/red track changes.
The simplest, and fastest, approach is to read the input file as a single, multiline string with Get-Content -Raw and let the regex passed to -replace operate across multiple lines:
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)' |
Set-Content result3.adoc
(?s) activates in-line option s which makes . match newline (\n) characters too.
^.+?\n(?==) matches from the start of the string (^) any number of characters, including newlines (.+), non-greedily (?), until a newline (\n) followed by a = is found.
(?=...) is a look-ahead assertion, which matches = without consuming it, i.e., without considering it part of the substring that matched.
Since no replacement operand is passed to -replace, the entire match is replaced with the implied empty string, i.e., what was matched is effectively removed.
As for what you tried:
The -replace operator passes its LHS through if no match is found, so you cannot use it to filter out non-matching lines.
Even if you match an undesired line in full and replace it with '' (the empty string), it will show up as an empty line in the output when sent to Set-Content or Out-File (>).
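A quick demonstration of that second point:
# Even a fully matched line replaced with '' still emits one (now empty) string:
'some line' -replace '^.+$'   # -> '', which Set-Content / Out-File write as an empty line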
As for your specific regex, ^[^=]*.[+][\n] (whether or not the first ^ is followed by an ESC (0x1b) char.):
[\n] (just \n would suffice) tries to match a newline char. after a literal + ([+]), yet lines read individually with Get-Content (without -Raw) by definition are stripped of their trailing newline, so the \n will never match; instead, use $ to match the end of a line.
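You can verify that the lines carry no newline for \n to match:
(Get-Content result2.adoc)[0].Contains("`n")   # -> $False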
Instead of % (the built-in alias for the ForEach-Object cmdlet) you could have used ? (the built-in alias for the Where-Object cmdlet) to perform the desired filtering:
Get-Content result2.adoc | ? { $_ -notmatch '^\e\[' }
$_ -notmatch '^\e\[' returns $True only for lines that don't start (^) with an ESC character (\e, whose code point is 0x1b) followed by a literal [ (\[), thereby effectively filtering out the lines before the = Heading line.
However, the multi-line -replace command at the top is a more direct and faster expression of your intent.
Here is the code I ended up with after help from @mklement0. This PowerShell script creates MS Word-style track changes for two versions of an AsciiDoc file. It creates the diff file, uses regex to replace the ANSI escape codes with HTML/CSS tags, removes the diff header (thank you!), uses AsciiDoctor to create an HTML file, and then PrinceXML to create a PDF file of the output that I can send to document reviewers.
git diff --color-words file1.adoc file2.adoc > result.adoc;
Get-Content "result.adoc" | % {
$_ -Replace '(=+ ?)([A-Za-z\s]+)(\[m)', '$1$2' `
-Replace '\[32m', '+++<span style="color: #00cd00;">' `
-Replace '\[31m', '+++<span style="color: #cd0000; text-decoration: line-through;">' `
-Replace '\[m', '</span>+++' } | Out-File -encoding utf8 "result2.adoc" ;
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)', '' | Out-File -encoding utf8 "result3.adoc" ;
asciidoctor result3.adoc -o result3.html;
prince result3.html --javascript -o result3.pdf;
Read-Host -Prompt "Press Enter to exit"
Here's a screenshot of the result using some text from Wikipedia:

How can I make this PowerShell script more efficient?

I am trying to make a script that takes an XML file, looks for a matching condition, and if it finds it, adds a new line of asterisks; then, when done going through the file, it strips out all the XML tags and leaves the data in a plain text file.
The script has been tested on a small input XML file and works fine, but when I pass a large XML file to it, it takes forever (not actually sure how long, as I ran it for over an hour with no result, so I just stopped it).
I'm guessing I must be performing the work in an extremely inefficient manner, hoping you guys can help me make it fast and efficient.
Here is the script below:
# Takes input XML File, cleans up XML elements, outputs plain text file
$FileName = "C:\Users\someguy\Desktop\input.xml"
$Pattern = "ProcessSpecifier = ""true"""
$FileOriginal = Get-Content $FileName
[String[]] $FileModified = @()
Foreach ($Line in $FileOriginal)
{
$FileModified += $Line
if ($Line -match $Pattern)
{
#Add Lines after the selected pattern
$FileModified += "*************isActive=true*****************"
}
}
$FileModified -replace "<[^>]+>", "" | Out-File C:\Users\someguy\Desktop\Output.txt
Let's go with a lookbehind and a bunch of regex to speed things up here. Also, I'm not going to store the whole thing in memory; I'm just going to pass it down the pipeline, which should help. I remove whitespace from the beginning and ends of lines, and filter out blank lines, but you can remove that bit if you want.
# Takes input XML File, cleans up XML elements, outputs plain text file
$FileName = "C:\Users\someguy\Desktop\input.xml"
$Pattern = '(?<=^.*ProcessSpecifier = "true".*$)'
(Get-Content $FileName) -replace $Pattern, "`n*************isActive=true*****************" -replace '<[^>]+?>' -replace '^\s*|\s*$' | ?{$_} | Set-Content C:\Users\someguy\Desktop\Output.txt
So, the main thing here is that I use a lookbehind to find your pattern text, and then add a newline and the asterisk line to that line, so that the line
<SomeTag>ProcessSpecifier = "true"</SomeTag>
becomes:
<SomeTag>ProcessSpecifier = "true"</SomeTag>`n*************isActive=true*****************
When used inside double quotes, a backtick ` followed by n creates a new line, so the '*************isActive=true*****************' ends up on its own line immediately following your search-pattern line. Past that I remove the XML tags, and then any leading or trailing whitespace from each line.
After the RegEx replacements I pass the result to a Where statement that removes blank lines, and then pass the remaining lines to Set-Content which I've seen better performance out of than Out-File.
Variation of TheMadTechnician's answer:
# Takes input XML File, cleans up XML elements, outputs plain text file
$FileName = "C:\Users\someguy\Desktop\input.xml"
$Pattern = '(?<=^.*ProcessSpecifier = "true".*$)'
Set-Content -Path C:\Users\someguy\Desktop\Output.txt -Value (((Get-Content $FileName) -replace $Pattern, "`n*************isActive=true*****************" -replace '<[^>]+?>' -replace '^\s*|\s*$').Where{$_})
I actually try to avoid the pipeline; it is rather slow, afaik. Of course, you will run into problems with memory consumption if the files are very large.
Note that the ().Where construct doesn't work on all PowerShell versions (it requires version 4+, iirc).
This is a guess; I am not sure whether this is actually faster than TheMadTechnician's. I'd be curious about the result :)

replace thousands separators in csv with regex

I'm running into problems trying to pull the thousands separators out of some currency values in a set of files. The "bad" values are delimited with commas and double quotes. There are other values in there that are < $1000 that present no issue.
Example of existing file:
"12,345.67",12.34,"123,456.78",1.00,"123,456,789.12"
Example of desired file (thousands separators removed):
"12345.67",12.34,"123456.78",1.00,"123456789.12"
I found a regex expression for matching the numbers with separators that works great, but I'm having trouble with the -replace operator. The replacement value is confusing me. I read about $& and I'm wondering if I should use that here. I tried $_, but that pulls out ALL my commas. Do I have to use $matches somehow?
Here's my code:
$Files = Get-ChildItem *input.csv
foreach ($file in $Files)
{
$file |
Get-Content | #assume that I can't use -raw
% {$_ -replace '"[\d]{1,3}(,[\d]{3})*(\.[\d]+)?"', ("$&" -replace ',','')} | #this is my problem
out-file output.csv -append -encoding ascii
}
Tony Hinkle's comment is the answer: don't use regex for this (at least not directly on the CSV file).
Your CSV is valid, so you should parse it as such, work on the objects (change the text if you want), then write a new CSV.
Import-Csv -Path .\my.csv | ForEach-Object {
$_ | ForEach-Object {
$_ -replace ',',''
}
} | Export-Csv -Path .\my_new.csv
(this code needs work, specifically the middle as the row will have each column as a property, not an array, but a more complete version of your CSV would make that easier to demonstrate)
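A fuller sketch of that idea: the sample row has no header line, so hypothetical column names are supplied via -Header, and every property of each row object is scrubbed:
# c1..c5 are made-up header names; adjust them to the real data.
Import-Csv -Path .\my.csv -Header c1, c2, c3, c4, c5 | ForEach-Object {
    foreach ($prop in $_.PSObject.Properties) {
        $prop.Value = $prop.Value -replace ','   # strip the thousands separators
    }
    $_
} | Export-Csv -Path .\my_new.csv -NoTypeInformation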
You can try with this regex:
,(?=(\d{3},?)+(?:\.\d{1,3})?")
See Live Demo or in powershell:
% {$_ -replace ',(?=(\d{3},?)+(?:\.\d{1,3})?")','' }
But it's more about the challenge that regex can bring. For proper work, use @briantist's answer, which is the clean way to do this.
I would use a simpler regex, and use capture groups instead of the entire capture.
I have tested the following regular expression against your input and found no issues.
% {$_ -replace '([\d]),([\d])','$1$2' }
I.e., find all commas with a digit before and after (so that the weird mix of quoted and unquoted values doesn't matter) and remove the comma entirely.
This would only have problems if your input contains digit,digit sequences that are not thousands separators.

PowerShell multiple string replacement efficiency

I'm trying to replace 600 different strings in a very large (30 MB+) text file. I'm currently building a script that does this, following this question:
Script:
$string = gc $filePath
$string = $string | % {
$_ -replace 'something0','somethingelse0' `
-replace 'something1','somethingelse1' `
-replace 'something2','somethingelse2' `
-replace 'something3','somethingelse3' `
-replace 'something4','somethingelse4' `
-replace 'something5','somethingelse5' `
...
(600 More Lines...)
...
}
$string | ac "C:\log.txt"
But as this will check each line 600 times, and there are well over 150,000 lines in the text file, there's a lot of processing time.
Is there a better alternative to doing this that is more efficient?
Combining the hash technique from Adi Inbar's answer, and the match evaluator from Keith Hill's answer to another recent question, here is how you can perform the replace in PowerShell:
# Build hashtable of search and replace values.
$replacements = @{
'something0' = 'somethingelse0'
'something1' = 'somethingelse1'
'something2' = 'somethingelse2'
'something3' = 'somethingelse3'
'something4' = 'somethingelse4'
'something5' = 'somethingelse5'
'X:\Group_14\DACU' = '\\DACU$'
'.*[^xyz]' = 'oO{xyz}'
'moresomethings' = 'moresomethingelses'
}
# Join all (escaped) keys from the hashtable into one regular expression.
[regex]$r = @($replacements.Keys | foreach { [regex]::Escape( $_ ) }) -join '|'
[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
# Return replacement value for each matched value.
$matchedValue = $matchInfo.Groups[0].Value
$replacements[$matchedValue]
}
# Perform replace over every line in the file and append to log.
Get-Content $filePath |
foreach { $r.Replace( $_, $matchEval ) } |
Add-Content 'C:\log.txt'
So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?
Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.
The Method:
Construct a hash where the keys are the somethings and the values are the somethingelses.
Join the keys of the hash with the | symbol, and use it as a match group in the regex.
In the replacement, interpolate an expression that retrieves a value from the hash using the match variable for the capture group
The Problem:
Frustratingly, PowerShell doesn't expose the match variables outside the regex replace call. It doesn't work with the -replace operator and it doesn't work with [regex]::replace.
In Perl, you can do this, for example:
$string =~ s/(1|2|3)/@{[$1 + 5]}/g;
This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
However, in PowerShell, both of these fail:
$string -replace '(1|2|3)',"$($1 + 5)"
[regex]::replace($string,'(1|2|3)',"$($1 + 5)")
In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can escape the $ before the number in a double-quoted string, so it will evaluate to the corresponding match group, but that defeats the purpose - it can't participate in an expression.
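(Note, though, that the script-block match evaluator used in the first answer above does give [regex]::Replace access to the match object, so the add-5 example is expressible in PowerShell after all; a minimal sketch:)
# A script block converts to a System.Text.RegularExpressions.MatchEvaluator:
[regex]::Replace('1224526123 [2] [6]', '[123]', { param($m) [int]$m.Value + 5 })
# -> '6774576678 [7] [6]'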
The Solution:
[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]
If using another language is acceptable to you, the following Perl script works like a charm:
$filePath = $ARGV[0]; # Or hard-code it or whatever
open INPUT, "< $filePath";
open OUTPUT, '> C:\log.txt';
%replacements = (
'something0' => 'somethingelse0',
'something1' => 'somethingelse1',
'something2' => 'somethingelse2',
'something3' => 'somethingelse3',
'something4' => 'somethingelse4',
'something5' => 'somethingelse5',
'X:\Group_14\DACU' => '\\DACU$',
'.*[^xyz]' => 'oO{xyz}',
'moresomethings' => 'moresomethingelses'
);
foreach (keys %replacements) {
push @strings, qr/\Q$_\E/;
$replacements{$_} =~ s/\\/\\\\/g;
}
$pattern = join '|', @strings;
while (<INPUT>) {
s/($pattern)/$replacements{$1}/g;
print OUTPUT;
}
close INPUT;
close OUTPUT;
It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:
The foreach loop goes through all the elements of the hash and creates an array called @strings that contains the keys of the %replacements hash, with metacharacters quoted using \Q and \E, and the result of that quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.
while (<INPUT>) reads the file one line at a time. A lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
I simplified @{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so @{[ ]} is used as a workaround - it creates a literal array of one element containing the expression. But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.
I also have no idea how to solve this in PowerShell, but I do know how to solve it in Bash, using a tool called sed. Luckily, there is also sed for Windows. If all you want to do is replace "something#" with "somethingelse#" everywhere, then this command will do the trick for you:
sed -i "s/something([0-9]+)/somethingelse\1/g" c:\log.txt
In Bash you'd actually need to escape a couple of those characters with backslashes, but I'm not sure you need to on Windows. If the first command complains, you can try
sed -i "s/something\([0-9]\+\)/somethingelse\1/g" c:\log.txt
I would use the PowerShell switch statement:
$string = gc $filePath
$string | % {
switch -regex ($_) {
'something0' { 'somethingelse0' }
'something1' { 'somethingelse1' }
'something2' { 'somethingelse2' }
'something3' { 'somethingelse3' }
'something4' { 'somethingelse4' }
'something5' { 'somethingelse5' }
'pattern(?<a>\d+)' { $matches['a'] } # sample of more complex logic
...
(600 More Lines...)
...
default { $_ }
}
} | ac "C:\log.txt"