Regex to remove enter from line starting with specific character in Powershell - regex

I have a huge CSV file, and some of its lines are broken because they contain line breaks (enters). When the file is imported into Excel, I have to correct hundreds of lines manually. I have a regex that works in Notepad++ and removes the line break before any line that does not start with a specific string, in this case ";". However, the same regex does not work in a PowerShell script.
Example of input
;BP;7165378;XX_RAW;200SSS952;EU-PL;PL02;PL02;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
15:00:00;;;;Jhon Name;;;;;;;;9444253;;;;;;;;;;;;;"Jhon Name";;;;;;;;;;Jhon Name;;;;;;;;Final Check Approved;;;;;;;;;09.01.2023;;;;;Approve;;;;;;12077;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
How it should look:
;BP;7165378;XX_RAW;200SSS952;EU-PL;PL02;PL02;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;15:00:00;;;;Jhon Name;;;;;;;;9444253;;;;;;;;;;;;;"Jhon Name";;;;;;;;;;Jhon Name;;;;;;;;Final Check Approved;;;;;;;;;09.01.2023;;;;;Approve;;;;;;12077;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Code:
$content = Get-Content -path "C:\Users\TUF17\Desktop\File\Fix\xx_fix_temp.csv"
$content -Replace '"\R(?!;)"', ' ' | Out-File "C:\Users\TUF17\Desktop\File\Fix\xx_noenters.csv"

It has to do with line continuation \ in your ps script.
I would also suggest adding -Raw if you want to get content of file as single string, rather than an array of strings, for easier replacing.
I'm assuming it's a .csv file you are using.
$content = Get-Content -Path "C:\Users\TUF17\Desktop\File\Fix\xx_fix_temp.csv" -Raw
$content -Replace '(?m)(^[^;].*)\r?\n(?!;)', '$1 ' | Out-File "C:\Users\TUF17\Desktop\File\Fix\xx_noenters.csv"

Building on the helpful comments on the question:
In order to perform replacements across lines of a text file, you need to either read the file in full - with Get-Content -Raw - or perform stateful line-by-line processing, such as with the -File parameter of a switch statement.
Note: While you could also do stateful line-by-line processing by combining Get-Content (without -Raw) with a ForEach-Object call, such a solution would be much slower - see this answer.
Your regex, '"\R(?!;)"', has two problems:
It accidentally uses embedded " quoting. Use only '...' quoting. PowerShell has no special syntax for regex literals - it simply uses strings.
To avoid confusion with PowerShell's own up-front string interpolation, it is better to use verbatim '...' strings rather than expandable (interpolating) "..." strings - see the conceptual about_Quoting_Rules help topic.
\R is an unsupported regex escape sequence; you presumably meant \r, i.e. a CR char. (CARRIAGE RETURN, U+000D)
If you instead want to match CRLF, a Windows-format newline sequence, use \r\n
If you want to match LF (LINE FEED, U+000A) alone (a Unix-format newline), use \n
If you want to match both newline formats, use \r?\n
As an aside: While use of CR alone is rare in practice, PowerShell treats stand-alone CR characters as newlines as well, which is why Get-Content without -Raw, which reads line by line (as you've tried) wouldn't work.
Get-Content -Raw solution (easier and faster than switch -File, but requires the whole file to fit into memory twice):
# Adjust the '\r' part as needed (see above).
(Get-Content -Raw -LiteralPath $inFile) -replace '\r(?!;)' |
Set-Content -NoNewLine -Encoding utf8 -LiteralPath $outFile
Note:
By not specifying a substitution operand to -replace, the command removes all newlines not followed by a ; ((?!;)), effectively joining the line that follows the CR directly to the previous line, which is the desired behavior based on your sample output.
For saving text, Set-Content is a bit faster than Out-File (it'll make no appreciable difference here, given that only a single, large string is written).
-NoNewLine prevents a(n additional) trailing newline from getting appended to the file.
-Encoding utf8 specifies the output character encoding. Note that PowerShell never preserves the input character encoding, so unless you use -Encoding on output, you'll get the respective cmdlet's default character encoding, which in Windows PowerShell varies from cmdlet to cmdlet; in PowerShell (Core) 7+, the consistent default is now BOM-less UTF-8. Note that in Windows PowerShell -Encoding utf8 always creates a file with a BOM; see this answer for background information and workarounds.
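As a quick check of the newline-matching advice above, here is a minimal sketch that runs the replacement on an in-memory string rather than on a file. The sample text and the variable names ($sample, $joined) are made up for illustration, and \r?\n is used on the assumption that the file may contain either CRLF or LF newlines.

```powershell
# Hypothetical sample: a record that was broken across two lines (CRLF newlines),
# followed by an intact record that starts with ';'.
$sample = ";BP;7165378`r`n15:00:00;;Jhon Name`r`n;BP;7165379;;Other Name"

# Remove every newline (CRLF or LF) that is NOT followed by ';'.
# No substitution operand is given, so each match is replaced with nothing.
$joined = $sample -replace '\r?\n(?!;)'

$joined
```

Note that a newline at the very end of the input would be removed too, because nothing (and therefore no ;) follows it.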

Related

Looking for a function that removes all comments from a script [duplicate]

I'm looking for a way to strip all comments from a file. There are various ways to do comments, but I'm only interested in the simple # form comments. (The reason is that I only use <# #> for in-function .SYNOPSIS, which is functional code as opposed to just a comment, so I want to keep those.)
EDIT: I have updated this question using the helpful answers below.
So there are only a couple of scenarios that I need:
a) whole-line comments with # at the start of the line (possibly with whitespace before it), i.e. a regex of ^\s*# seems to work.
b) some code at the start of the line, then a comment at the end of the line.
I want to avoid stripping lines that have e.g. Write-Host "#####" but I think this is covered in the code that I have.
I was able to remove end-of-line comments with a split as I couldn't work out how to do it with regex, does anyone know a way to achieve that with regex?
The split was not ideal as a <# on a line would be removed by the -split but I've fixed that by splitting on " #". This is not perfect but might be good enough - maybe a more reliable way with regex might exist?
When I do the below against my 7,000 line long script, it works(!) and strips a huge amount of comments, BUT, the output file is almost doubled in size(!?) from 400kb to about 700kb. Does anyone understand why that happens and how to prevent that (is it something to do with BOM's or Unicode or things like that? Out-File seems to really balloon the file-size!)
$x = Get-Content ".\myscript.ps1" # $x is an array, not a string
$out = ".\myscript.ps1"
$x = $x -split "[\r\n]+" # Remove all consecutive line-breaks, in any format '-split "\r?\n|\r"' would just do line by line
$x = $x | ? { $_ -notmatch "^\s*$" } # Remove empty lines
$x = $x | ? { $_ -notmatch "^\s*#" } # Remove all lines starting with #, including with whitespace before
$x = $x | % { ($_ -split " #")[0] } # Remove end of line comments
$x = ($x -replace $regex).Trim() # Remove whitespace only at start and end of line
$x | Out-File $out
# $x | more
Honestly, the best approach to identify and process all comments is to use PowerShell's language parser or one of the Ast classes. I apologize that I don't know which Ast contains comments; so this is an uglier way that will filter out block and line comments.
$code = Get-Content file.txt -Raw
$comments = [System.Management.Automation.PSParser]::Tokenize($code,[ref]$null) |
Where Type -eq 'Comment' | Select -Expand Content
$regex = ( $comments |% { [regex]::Escape($_) } ) -join '|'
# Output to remove all empty lines
$code -replace $regex -split '\r?\n' -notmatch '^\s*$'
# Output that Removes only Beginning and Ending Blank Lines
($code -replace $regex).Trim()
Do the inverse of your example: Only emit lines that do NOT match:
## Output to console
Get-Content .\file.ps1 | Where-Object { $_ -notmatch '#' }
## Output to file
Get-Content .\file.ps1 | Where-Object { $_ -notmatch '#' } | Out-file .\newfile.ps1 -Append
Based on @AdminOfThings's helpful answer using the Abstract Syntax Tree (AST) Class parser approach but avoiding any regular expressions:
$Code = $Code.ToString() # Prepare any ScriptBlock for the substring method
$Tokens = [System.Management.Automation.PSParser]::Tokenize($Code, [ref]$null)
-Join $Tokens.Where{ $_.Type -ne 'Comment' }.ForEach{ $Code.Substring($_.Start, $_.Length) }
As for the incidental problem of the size of the output file being roughly double that of the input file:
As AdminOfThings points out, Out-File in Windows PowerShell defaults to UTF-16LE ("Unicode") encoding, where characters are represented by (at least) two bytes, whereas ANSI encoding, as used by Set-Content in Windows PowerShell by default, encodes all (supported) characters in a single byte. Similarly, UTF-8-encoded files use only one byte for characters in the ASCII range (note that PowerShell (Core) 7+ now consistently defaults to (BOM-less) UTF-8). Use the -Encoding parameter as needed.
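To see the size difference without writing any files, here is a small sketch that counts the bytes each encoding needs for the same text. The sample string and variable names are hypothetical; the counts exclude any BOM.

```powershell
# Hypothetical one-line script text, ASCII characters only.
$text = 'Write-Host "hello"'

# Bytes needed per encoding (BOM not included).
$utf16Bytes = [System.Text.Encoding]::Unicode.GetByteCount($text)  # UTF-16LE
$utf8Bytes  = [System.Text.Encoding]::UTF8.GetByteCount($text)

"UTF-16LE: $utf16Bytes bytes, UTF-8: $utf8Bytes bytes"
```

For ASCII-only text the UTF-16LE count is twice the UTF-8 count, which is consistent with a script roughly doubling in size when it is rewritten with Out-File in Windows PowerShell.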
A regex-based solution to your problem is never fully robust, even if you try to limit the comment removal to single-line comments.
For full robustness, you must indeed use PowerShell's language parser, as noted in the other answers.
However, care must be taken when reconstructing the original source code with the comments removed:
AdminOfThings's answer risks removing too much, given the subsequent global regex-based processing with -replace: while the scenario may be unlikely, if a comment is repeated inside a string, it would mistakenly be removed from there too.
iRon's answer risks syntax errors by joining the tokens without spaces, so that . .\foo.ps1 would turn into ..\foo.ps1, for instance. Blindly putting a space between tokens is not an option, because the property-access syntax would break (e.g. $host.Name would turn into $host . Name, but whitespace between a value and the . operator isn't allowed)
The following solution avoids these problems, while trying to preserve the formatting of the original code as much as possible, but this has limitations, because intra-line whitespace isn't reported by the parser:
This means that you can't tell whether whitespace between tokens on a given line is made up of tabs, spaces, or a mix of both. The solution below replaces any tab characters with 2 spaces before processing; adjust as needed.
To somewhat compensate for the removal of comments occupying their own line(s), more than 2 consecutive blank or empty lines are folded into a single empty one. It is possible to remove blank/empty lines altogether, but that could hurt readability.
# Tokenize the file content.
# Note that tabs, if any, are replaced by 2 spaces first; adjust as needed.
$tokens = $null
$null = [System.Management.Automation.Language.Parser]::ParseInput(
((Get-Content -Raw .\myscript.ps1) -replace '\t', '  '),
[ref] $tokens,
[ref] $null
)
# Loop over all tokens while omitting comments, and rebuild the source code
# without them, trying to preserve the original formatting as much as possible.
$sb = [System.Text.StringBuilder]::new()
$prevExtent = $null; $numConsecNewlines = 0
$tokens.
  Where({ $_.Kind -ne 'Comment' }).
  ForEach({
    $startColumn = if ($_.Extent.StartLineNumber -eq $prevExtent.StartLineNumber) { $prevExtent.EndColumnNumber }
                   else { 1 }
    if ($_.Kind -eq 'NewLine') {
      # Fold multiple blank or empty lines into a single empty one.
      if (++$numConsecNewlines -ge 3) { return }
    } else {
      $numConsecNewlines = 0
      $null = $sb.Append(' ' * ($_.Extent.StartColumnNumber - $startColumn))
    }
    $null = $sb.Append($_.Text)
    $prevExtent = $_.Extent
  })
# Output the result.
# Pipe to Set-Content as needed.
$sb.ToString()

How do i specify a specific file name in powershell? [duplicate]

I am very new to PowerShell. I have a CSV file in which I want to find and replace some text. After some searching, this seems simple to do, but I still seem to be having problems with the code:
$csv = get-content .\test.csv
$csv = $csv -replace "|", "$"
$csv | out-file .\test.csv
My file is located here: C:\Users\CB1\test.csv
How do I specify that location in powershell?
I've tried this but it doesn't work:
$csv = get-content C:\Users\CB1\test.csv
$csv = $csv -replace "|", "$"
$csv | out-file C:\Users\CB1\test.csv
The problem isn't whether you're using relative or absolute paths (assuming your relative paths are relative to the right directory).
Rather, the problem is that the -replace operator is regex-based, and that | is therefore interpreted as a regex metacharacter (representing alternation).
Therefore, you need to escape such metacharacters, using \ (or, if you were to do this programmatically, you could use the [regex]::Escape() method).
Additionally, since your replacement operation isn't line-specific, you can speed up your operation by reading the file into memory as a whole, using the -Raw switch.
That, in turn, requires that you use the -NoNewLine switch when (re)writing the file.
Also, with text input, Set-Content is preferable to Out-File for performance reasons.
To put it all together:
(Get-Content -Raw .\test.csv) -replace '\|', '$' | Set-Content -NoNewLine .\test.csv
Note: Use the -Encoding parameter as needed, as the input file's encoding will not be honored:
In Windows PowerShell, Out-File produces UTF-16LE ("Unicode") files by default, whereas Set-Content uses the system's ANSI code page.
In PowerShell (Core) 7+, BOM-less UTF-8 is the consistently applied default.
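The difference between the unescaped and escaped patterns can be sketched on an in-memory string. The sample line and variable names are hypothetical; the last line uses the [string] type's .Replace() method as a literal (non-regex) alternative.

```powershell
$line = 'a|b|c'   # hypothetical sample line

# Unescaped: '|' is regex alternation between two empty patterns,
# so it matches the empty string at every position and '|' itself is kept.
$wrong = $line -replace '|', '$'

# Escaped by hand, or programmatically via [regex]::Escape().
$byHand  = $line -replace '\|', '$'
$escaped = $line -replace [regex]::Escape('|'), '$'

# The [string] type's .Replace() method is literal, so no escaping is needed.
$literal = $line.Replace('|', '$')

$wrong, $byHand, $escaped, $literal
```

Only the first result still contains the | characters; the other three replace each | with $.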

Replace text + optional newline in file

I've been through other similar questions and tried their advice, but it wouldn't help.
I'm trying to delete a specific line of text in a text file.
My code which works
(Get-Content -Path "MyPath.txt" -Raw).Replace('this is the line', '') | Set-Content "MyPath.txt" -Encoding UTF8
Now this works but leaves an ugly empty line in the text file. I wanted to also replace an optional newline character by adding this regex at the end of the line
\n?
and this wouldn't work. The other threads made other recommendations and I've tried all combinations but just can't match. I'm using windows style ending (CRLF)
Both using -Raw and not using it
\n
\r\n
`n
`r`n
I haven't even added the regex question mark at the end (or non-capturing group in case it needs the \r\n syntax).
The [string] type's .Replace() method doesn't support regexes (regular expressions), whereas PowerShell's -replace operator does.
However, the simplest solution in this case is to take advantage of the fact that the -ne operator acts as a filter with an array-valued LHS (as other comparison operators do):
@(Get-Content -Path MyPath.txt) -ne 'this is the line' |
Set-Content MyPath.txt -Encoding UTF8
Note how Get-Content is called without -Raw in order to return an array of lines, from which -ne then filters out the line of (non)-interest; @(...), the array-subexpression operator, ensures that the output is an array even if the file happens to contain just one line.
The assumption is that string 'this is the line' matches the whole line (case-insensitively).
If that is not the case, instead of -ne you could use -notlike with a wildcard expression or -notmatch with a regex (e.g.,
-notmatch 'this is the line' or -notlike '*this is the line*').
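The filtering behavior of these operators can be tried on a plain array, without touching a file. This is a minimal sketch; the $lines array is a hypothetical stand-in for what Get-Content (without -Raw) would return.

```powershell
# Hypothetical file content, as Get-Content (without -Raw) would return it.
$lines = @('first line', 'this is the line', 'last line')

# Whole-line match: with an array on the LHS, -ne returns the elements that differ.
$exact = $lines -ne 'this is the line'

# Substring match: -notmatch (regex) and -notlike (wildcard) filter the same way.
$byRegex    = $lines -notmatch 'is the'
$byWildcard = $lines -notlike '*is the*'

$exact
```

All three variables end up holding 'first line' and 'last line'; no empty line is left behind, because the unwanted element is dropped rather than replaced.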

Powershell regex match sequence doesn't work although it matches in Sublime Text find and replace

I am trying to create a Powershell regex statement to remove the top five lines of this output from a git diff file that has already been modified with Powershell regex.
[1mdiff --git a/uk1.adoc b/uk2.adoc</span>+++
[1mindex b5d3bf7..90299b8 100644</span>+++
[1m--- a/uk1.adoc</span>+++
[1m+++ b/uk2.adoc</span>+++
[36m@@ -1,9 +1,9 @@</span>+++
= Heading
Body text
Image shown because binary code doesn't show in the text
The following statement matches the text so the '= Heading' line is placed at the top of the page if I replace with nothing.
^[^=]*.[+][\n]
But in Powershell, it isn't matching the text.
Get-Content "result2.adoc" | % { $_ -Replace '^[^=]*.[+][\n]', '' } | Out-File "result3.adoc";
Any ideas about why it doesn't work in Powershell?
My overall goal is to create a diff file of two versions of an AsciiDoc file and then replace the ASCII codes with HTML/CSS code to display the resulting AsciiDoc file with green/red track changes.
The simplest - and faster - approach is to read the input file as a single, multiline string with Get-Content -Raw and let the regex passed to -replace operate across multiple lines:
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)' |
Set-Content result3.adoc
(?s) activates in-line option s which makes . match newline (\n) characters too.
^.+?\n(?==) matches from the start of the string (^) any number of characters (including newlines) (.+), non-greedily (?)
until a newline (\n) followed by a = is found.
(?=...) is a look-ahead assertion, which matches = without consuming it, i.e., without considering it part of the substring that matched.
Since no replacement operand is passed to -replace, the entire match is replaced with the implied empty string, i.e., what was matched is effectively removed.
As for what you tried:
The -replace operator passes its LHS through if no match is found, so you cannot use it to filter out non-matching lines.
Even if you match an undesired line in full and replace it with '' (the empty string), it will show up as an empty line in the output when sent to Set-Content or Out-File (>).
As for your specific regex, ^[^=]*.[+][\n] (whether or not the first ^ is followed by an ESC (0x1b) char.):
[\n] (just \n would suffice) tries to match a newline char. after a literal + ([+]), yet lines read individually with Get-Content (without -Raw) by definition are stripped of their trailing newline, so the \n will never match; instead, use $ to match the end of a line.
Instead of % (the built-in alias for the ForEach-Object cmdlet) you could have used ? (the built-in alias for the Where-Object cmdlet) to perform the desired filtering:
Get-Content result2.adoc | ? { $_ -notmatch '^\e\[' }
$_ -notmatch '^\e\[' returns $True only for lines that don't start (^) with an ESC character (\e, whose code point is 0x1b) followed by a literal (\) [, thereby effectively filtering out the lines before the = Heading line.
However, the multi-line -replace command at the top is a more direct and faster expression of your intent.
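Both approaches can be compared on an in-memory string. This is only a sketch: the sample text is a shortened, hypothetical stand-in for the diff output, the ESC characters are inserted explicitly via [char] 27, and -split '\n' is used here merely to imitate the line-by-line output of Get-Content.

```powershell
$esc = [char] 27   # the ESC character (0x1b)

# Hypothetical, shortened stand-in for the diff output (LF newlines).
$text = "${esc}[1mdiff --git a/uk1.adoc b/uk2.adoc`n${esc}[36m@@ -1,9 +1,9 @@`n= Heading`nBody text"

# Multi-line approach: drop everything before the first line that starts with '='.
$trimmed = $text -replace '(?s)^.+?\n(?==)'

# Line-by-line approach: keep only lines that don't start with ESC followed by '['.
$filtered = ($text -split '\n') -notmatch '^\e\['

$trimmed
```

In this sample both variants leave the = Heading line and the body text; the first yields a single string, the second an array of lines.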
Here is the code I ended up with after help from @mklement0. This Powershell script creates MS Word-style track changes for two versions of an AsciiDoc file. It creates the Diff file, uses regex to replace ASCII codes with HTML/CSS tags, removes the Diff header (thank you!), uses AsciiDoctor to create an HTML file and then PrinceXML to create a PDF file of the output that I can send to document reviewers.
git diff --color-words file1.adoc file2.adoc > result.adoc;
Get-Content "result.adoc" | % {
$_ -Replace '(=+ ?)([A-Za-z\s]+)(\[m)', '$1$2' `
-Replace '\[32m', '+++<span style="color: #00cd00;">' `
-Replace '\[31m', '+++<span style="color: #cd0000; text-decoration: line-through;">' `
-Replace '\[m', '</span>+++' } | Out-File -encoding utf8 "result2.adoc" ;
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)', '' | Out-File -encoding utf8 "result3.adoc" ;
asciidoctor result3.adoc -o result3.html;
prince result3.html --javascript -o result3.pdf;
Read-Host -Prompt "Press Enter to exit"
Here's a screenshot of the result using some text from Wikipedia:

Using powershell to search for a pattern

I am trying to write a powershell script to search for a pattern in a text file. Specifically I am looking at reading a file line by line and returning any line that has a space at the 32nd character position.
I have this so far but it just returns all lines that have white space. I need to narrow it down to the 32nd position
Get-Content -path C:\VM_names.txt | Where-Object {$_ -match "\s+"}
Use this pattern:
-match '^.{31} '
Explanation:
^ - beginning of the string
. - any character
{31} - repeated 31 times
' ' - a space
This is actually really easy to do. By default, Get-Content reads a text file as an array of strings (individual lines), unless you use the -Raw parameter, which reads it as a single string. You can use the -match PowerShell operator to "match" the lines that meet your regular expression.
(Get-Content -Path c:\VM_names.txt) -match '^.{31}\s'
The result of the above command is an array of lines that match the desired regular expression.
NOTE: The call to Get-Content must be wrapped in parentheses, otherwise the PowerShell parser will think that -match is a parameter on that command.
NOTE2: As a good practice, use single quotes around strings, unless you specifically know that you need double quotes. You'll save yourself from accidental interpolation.
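To check the pattern without a file, here is a minimal sketch on a hypothetical array of lines (the sample strings and the $lines and $hits names are made up for illustration). It uses the literal-space variant of the pattern; \s would additionally accept tabs and other whitespace.

```powershell
# Hypothetical sample lines; only the first has a space as its 32nd character.
$lines = @(
    ('a' * 31) + ' tail'    # space at position 32
    ('a' * 32) + ' tail'    # space at position 33
    'short line'            # contains a space, but has fewer than 32 characters
)

# With an array on the LHS, -match returns the matching elements.
$hits = $lines -match '^.{31} '

$hits
```

Only the first sample line is returned, since ^.{31} requires exactly 31 characters before the space.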