I'm looking for a way to strip all comments from a file. There are various ways to do comments, but I'm only interested in the simple # form comments. Reason is that I only use <# #> for in-function .SYNOPSIS which is functional code as opposed to just a comment so I want to keep those).
EDIT: I have updated this question using the helpful answers below.
So there are only a couple of scenarios that I need:
a) whole line comments with # at start of line (or possibly with white-space before. i.e. regex of ^\s*# seems to work.
b) with some code at start of line then a command at the end of the line.
I want to avoid stripping lines that have e.g. Write-Host "#####" but I think this is covered in the code that I have.
I was able to remove end-of-line comments with a split as I couldn't work out how to do it with regex, does anyone know a way to achieve that with regex?
The split was not ideal as a <# on a line would be removed by the -split but I've fixed that by splitting on " #". This is not perfect but might be good enough - maybe a more reliable way with regex might exist?
When I do the below against my 7,000 line long script, it works(!) and strips a huge amount of comments, BUT, the output file is almost doubled in size(!?) from 400kb to about 700kb. Does anyone understand why that happens and how to prevent that (is it something to do with BOM's or Unicode or things like that? Out-File seems to really balloon the file-size!)
$x = Get-Content ".\myscript.ps1" # $x is an array, not a string
$out = ".\myscript.ps1"
$x = $x -split "[\r\n]+" # Remove all consecutive line-breaks, in any format '-split "\r?\n|\r"' would just do line by line
$x = $x | ? { $_ -notmatch "^\s*$" } # Remove empty lines
$x = $x | ? { $_ -notmatch "^\s*#" } # Remove all lines starting with ; including with whitespace before
$x = $x | % { ($_ -split " #")[0] } # Remove end of line comments
$x = ($x -replace $regex).Trim() # Remove whitespace only at start and end of line
$x | Out-File $out
# $x | more
Honestly, the best approach to identify and process all comments is to use PowerShell's language parser or one of the Ast classes. I apologize that I don't know which Ast contains comments; so this is an uglier way that will filter out block and line comments.
$code = Get-Content file.txt -Raw
$comments = [System.Management.Automation.PSParser]::Tokenize($code,[ref]$null) |
Where Type -eq 'Comment' | Select -Expand Content
$regex = ( $comments |% { [regex]::Escape($_) } ) -join '|'
# Output to remove all empty lines
$code -replace $regex -split '\r?\n' -notmatch '^\s*$'
# Output that Removes only Beginning and Ending Blank Lines
($code -replace $regex).Trim()
Do the inverse of your example: Only emit lines that do NOT match:
## Output to console
Get-Content .\file.ps1 | Where-Object { $_ -notmatch '#' }
## Output to file
Get-Content .\file.ps1 | Where-Object { $_ -notmatch '#' } | Out-file .\newfile.ps1 -Append
Based on #AdminOfThings helpful answer using the Abstract Syntax Tree (AST) Class parser approach but avoiding any regular expressions:
$Code = $Code.ToString() # Prepare any ScriptBlock for the substring method
$Tokens = [System.Management.Automation.PSParser]::Tokenize($Code, [ref]$null)
-Join $Tokens.Where{ $_.Type -ne 'Comment' }.ForEach{ $Code.Substring($_.Start, $_.Length) }
As for the incidental problem of the size of the output file being roughly double that of the input file:
As AdminOfThings points out, Out-File in Windows PowerShell defaults to UTF-16LE ("Unicode") encoding, where characters are represented by (at least) two bytes, whereas ANSI encoding, as used by Set-Content in Windows PowerShell by default, encodes all (supported) characters in a single byte. Similarly, UTF-8-encoded files use only one byte for characters in the ASCII range (note that PowerShell (Core) 7+ now consistently defaults to (BOM-less) UTF-8). Use the -Encoding parameter as needed.
A regex-based solution to your problem is never fully robust, even if you try to limit the comment removal to single-line comments.
For full robustness, you must indeed use PowerShell's language parser, as noted in the other answers.
However, care must be taken when reconstructing the original source code with the comments removed:
AdminOfThings's answer risks removing too much, given the subsequent global regex-based processing with -replace: while the scenario may be unlikely, if a comment is repeated inside a string, it would mistakenly be removed from there too.
iRon's answer risks syntax errors by joining the tokens without spaces, so that . .\foo.ps1 would turn into ..\foo.ps1, for instance. Blindly putting a space between tokens is not an option, because the property-access syntax would break (e.g. $host.Name would turn into $host . Name, but whitespace between a value and the . operator isn't allowed)
The following solution avoids these problems, while trying to preserve the formatting of the original code as much as possible, but this has limitations, because intra-line whitespace isn't reported by the parser:
This means that you can't tell whether whitespace between tokens on a given line is made up of tabs, spaces, or a mix of both. The solution below replaces any tab characters with 2 spaces before processing; adjust as needed.
To somewhat compensate for the removal of comments occupying their own line(s), more than 2 consecutive blank or empty lines are folded into a single empty one. It is possible to remove blank/empty lines altogether, but that could hurt readability.
# Tokenize the file content.
# Note that tabs, if any, are replaced by 2 spaces first; adjust as needed.
$tokens = $null
$null = [System.Management.Automation.Language.Parser]::ParseInput(
((Get-Content -Raw .\myscript.ps1) -replace '\t', ' '),
[ref] $tokens,
[ref] $null
)
# Loop over all tokens while omitting comments, and rebuild the source code
# without them, trying to preserve the original formatting as much as possible.
$sb = [System.Text.StringBuilder]::new()
$prevExtent = $null; $numConsecNewlines = 0
$tokens.
Where({ $_.Kind -ne 'Comment' }).
ForEach({
$startColumn = if ($_.Extent.StartLineNumber -eq $prevExtent.StartLineNumber) { $prevExtent.EndColumnNumber }
else { 1 }
if ($_.Kind -eq 'NewLine') {
# Fold multiple blank or empty lines into a single empty one.
if (++$numConsecNewlines -ge 3) { return }
} else {
$numConsecNewlines = 0
$null = $sb.Append(' ' * ($_.Extent.StartColumnNumber - $startColumn))
}
$null = $sb.Append($_.Text)
$prevExtent = $_.Extent
})
# Output the result.
# Pipe to Set-Content as needed.
$sb.ToString()
I am trying to add quote characters around two fields in a file of comma separated lines. Here is one line of data:
1/22/2018 0:00:00,0000000,001B9706BE,1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
which I would like to become this:
1/22/2018 0:00:00,"0000000","001B9706BE",1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
I began developing my regular expression in a simple PowerShell script, and soon I have the following:
$strData = '1/29/2018 0:00:00,0000000,001B9706BE,1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0'
$strNew = $strData -replace "([^,]*),([^,]*),([^,]*),(.*)",'$1,"$2","$3",$4'
$strNew
which gives me this output:
1/29/2018 0:00:00,"0000000","001B9706BE",1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
Great! I'm all set. Extend this example to the general case of a file of similar lines of data:
Get-Content test_data.csv | Where-Object -FilterScript {
$_ -replace "([^,]*),([^,]*),([^,]*),(.*)", '$1,"$2","$3",$4'
}
This is a listing of test_data.csv:
1/29/2018 0:00:00,0000000,001B9706BE,1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104938428,0016C4C483,1,45,0,1,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104943875,0016C4B0BC,1,31,0,1,0,0,0,0,0,0,0,0,0,0,25,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104948067,0016C4834D,1,33,0,1,0,0,0,0,0,0,0,0,0,0,23,0,1,0,0,0,0,0,0,0,0,0,0
This is the output of my script:
1/29/2018 0:00:00,0000000,001B9706BE,1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104938428,0016C4C483,1,45,0,1,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104943875,0016C4B0BC,1,31,0,1,0,0,0,0,0,0,0,0,0,0,25,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104948067,0016C4834D,1,33,0,1,0,0,0,0,0,0,0,0,0,0,23,0,1,0,0,0,0,0,0,0,0,0,0
I have also tried this version of the script:
Get-Content test_data.csv | Where-Object -FilterScript {
$_ -replace "([^,]*),([^,]*),([^,]*),(.*)", "`$1,`"`$2`",`"`$3`",$4"
}
and obtained the same results.
My simple test script has convinced me that the regex is correct, but something happens when I use that regex inside a filter script in the Where-Object cmdlet.
What simple, yet critical, detail am I overlooking here?
Here is my PSVerion:
Major Minor Build Revision
----- ----- ----- --------
5 0 10586 117
You're misunderstanding how Where-Object works. The cmdlet outputs those input lines for which the -FilterScript expression evaluates to $true. It does NOT output whatever you do inside that scriptblock (you'd use ForEach-Object for that).
You don't need either Where-Object or ForEach-Object, though. Just put Get-Content in parentheses and use that as the first operand for the -replace operator. You also don't need the 4th capturing group. I would recommend anchoring the expression at the beginning of the string, though.
(Get-Content test_data.csv) -replace '^([^,]*),([^,]*),([^,]*)', '$1,"$2","$3"'
This seems to work here. I used ForEach-Object to process each record.
Get-Content test_data.csv |
ForEach-Object { $_ -replace "([^,]*),([^,]*),([^,]*),(.*)", '$1,"$2","$3",$4' }
This also seems to work. Uses the ? to create a reluctant (lazy) capture.
Get-Content test_data.csv |
ForEach-Object { $_ -replace '(.*?),(.*?),(.*?),(.*)', '$1,"$2","$3",$4' }
I would just make a small change to what you have in order for this to work. Simply change the script to the following, noting that I changed the -FilterScript to a ForEach-Object and fixed a minor typo that you had on the last item in the regular expression with the quotes:
Get-Content c:\temp\test_data.csv | ForEach-Object {
$_ -replace "([^,]*),([^,]*),([^,]*),(.*)", "`$1,`"`$2`",`"`$3`",`"`$4"
}
I tested this with the data you provided and it adds the quotes to the correct columns.