replace thousands separators in csv with regex

I'm running into problems trying to pull the thousands separators out of some currency values in a set of files. The "bad" values are delimited with commas and double quotes. There are other values in there that are < $1000 that present no issue.
Example of existing file:
"12,345.67",12.34,"123,456.78",1.00,"123,456,789.12"
Example of desired file (thousands separators removed):
"12345.67",12.34,"123456.78",1.00,"123456789.12"
I found a regex for matching the numbers with separators that works great, but I'm having trouble with the -replace operator. The replacement value is confusing me. I read about $& and I'm wondering if I should use that here. I tried $_, but that pulls out ALL my commas. Do I have to use $matches somehow?
Here's my code:
$Files = Get-ChildItem *input.csv
foreach ($file in $Files)
{
    $file |
        Get-Content | # assume that I can't use -raw
        % {$_ -replace '"[\d]{1,3}(,[\d]{3})*(\.[\d]+)?"', ("$&" -replace ',','')} | # this is my problem
        out-file output.csv -append -encoding ascii
}

Tony Hinkle's comment is the answer: don't use regex for this (at least not directly on the CSV file).
Your CSV is valid, so you should parse it as such, work on the objects (change the text if you want), then write a new CSV.
Import-Csv -Path .\my.csv | ForEach-Object {
    $_ | ForEach-Object {
        $_ -replace ',',''
    }
} | Export-Csv -Path .\my_new.csv
(this code needs work, specifically the middle, since each row exposes its columns as properties rather than as an array; a more complete version of your CSV would make that easier to demonstrate)
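For instance, a minimal sketch of that idea, assuming the file has no header row (the column names C1-C5 are invented here purely so Import-Csv can build objects):
# Hypothetical header names; the question's CSV has no header row.
Import-Csv -Path .\my.csv -Header C1,C2,C3,C4,C5 | ForEach-Object {
    foreach ($prop in $_.PSObject.Properties) {
        # Strip the thousands separators from each field's value.
        $prop.Value = $prop.Value -replace ',',''
    }
    $_
} | Export-Csv -Path .\my_new.csv -NoTypeInformation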

You can try with this regex:
,(?=(\d{3},?)+(?:\.\d{1,3})?")
In PowerShell:
% {$_ -replace ',(?=(\d{3},?)+(?:\.\d{1,3})?")','' }
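Dropped into the question's pipeline, that would look like this (a sketch assuming the question's file names):
Get-Content .\input.csv |
    % {$_ -replace ',(?=(\d{3},?)+(?:\.\d{1,3})?")','' } |
    Out-File .\output.csv -Append -Encoding ascii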
But this is more about the challenge of doing it with regex. For proper work, use @briantist's answer, which is the clean way to do this.

I would use a simpler regex, and use capture groups instead of the entire capture.
I have tested the following regular expression with your input and found no issues.
% {$_ -replace '([\d]),([\d])','$1$2' }
E.g. find all commas with a digit before and after (so that the weird mix of quoted and unquoted values doesn't matter) and remove the comma entirely.
This would have problems if your input ever placed two unquoted numeric fields side by side; it relies on that odd mixing of quotes and no quotes, where every field-separating comma has a quote on at least one side.
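With that caveat, a sketch of the full pipeline using the question's file names:
Get-Content .\input.csv |
    % {$_ -replace '([\d]),([\d])','$1$2' } |
    Out-File .\output.csv -Append -Encoding ascii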

Related

Looking for a function that removes all comments from a script [duplicate]

I'm looking for a way to strip all comments from a file. There are various ways to do comments, but I'm only interested in the simple # form of comments. The reason is that I only use <# #> for in-function .SYNOPSIS blocks, which are functional code as opposed to just comments, so I want to keep those.
EDIT: I have updated this question using the helpful answers below.
So there are only a couple of scenarios that I need:
a) whole-line comments with # at the start of the line (or possibly with whitespace before it); a regex of ^\s*# seems to work for these.
b) lines with some code at the start and then a comment at the end of the line.
I want to avoid stripping lines that have e.g. Write-Host "#####" but I think this is covered in the code that I have.
I was able to remove end-of-line comments with a split, as I couldn't work out how to do it with regex. Does anyone know a way to achieve that with regex?
The split was not ideal, as a <# on a line would be removed by the -split, but I've fixed that by splitting on " #". This is not perfect but might be good enough; maybe a more reliable way with regex exists?
When I run the below against my 7,000-line script, it works(!) and strips a huge amount of comments, BUT the output file almost doubles in size(!?) from 400kb to about 700kb. Does anyone understand why that happens and how to prevent it (is it something to do with BOMs or Unicode? Out-File seems to really balloon the file size!)
$x = Get-Content ".\myscript.ps1" # $x is an array, not a string
$out = ".\myscript.ps1"
$x = $x -split "[\r\n]+" # Remove all consecutive line-breaks; '-split "\r?\n|\r"' would just split line by line
$x = $x | ? { $_ -notmatch "^\s*$" } # Remove empty lines
$x = $x | ? { $_ -notmatch "^\s*#" } # Remove all lines starting with #, including with whitespace before
$x = $x | % { ($_ -split " #")[0] } # Remove end-of-line comments
$x = ($x -replace $regex).Trim() # Remove whitespace at start and end of each line ($regex is built from the parser output, as in the answers below)
$x | Out-File $out
# $x | more
Honestly, the best approach to identify and process all comments is to use PowerShell's language parser or one of the Ast classes. I apologize that I don't know which Ast contains comments, so this is an uglier way that will filter out block and line comments.
$code = Get-Content file.txt -Raw
$comments = [System.Management.Automation.PSParser]::Tokenize($code,[ref]$null) |
    Where Type -eq 'Comment' | Select -Expand Content
$regex = ( $comments | % { [regex]::Escape($_) } ) -join '|'

# Output with all empty lines removed
$code -replace $regex -split '\r?\n' -notmatch '^\s*$'

# Output with only beginning and ending blank lines removed
($code -replace $regex).Trim()
Do the inverse of your example: Only emit lines that do NOT match:
## Output to console
Get-Content .\file.ps1 | Where-Object { $_ -notmatch '#' }
## Output to file
Get-Content .\file.ps1 | Where-Object { $_ -notmatch '#' } | Out-file .\newfile.ps1 -Append
Based on @AdminOfThings' helpful answer using the Abstract Syntax Tree (AST) Class parser approach, but avoiding any regular expressions:
$Code = $Code.ToString() # Prepare any ScriptBlock for the substring method
$Tokens = [System.Management.Automation.PSParser]::Tokenize($Code, [ref]$null)
-Join $Tokens.Where{ $_.Type -ne 'Comment' }.ForEach{ $Code.Substring($_.Start, $_.Length) }
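For example, a usage sketch (the file names here are assumptions) that writes the stripped code to a new file:
$Code = Get-Content .\myscript.ps1 -Raw
$Tokens = [System.Management.Automation.PSParser]::Tokenize($Code, [ref]$null)
$stripped = -join $Tokens.Where{ $_.Type -ne 'Comment' }.ForEach{ $Code.Substring($_.Start, $_.Length) }
$stripped | Set-Content .\myscript_stripped.ps1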
As for the incidental problem of the size of the output file being roughly double that of the input file:
As AdminOfThings points out, Out-File in Windows PowerShell defaults to UTF-16LE ("Unicode") encoding, where characters are represented by (at least) two bytes, whereas ANSI encoding, as used by Set-Content in Windows PowerShell by default, encodes all (supported) characters in a single byte. Similarly, UTF-8-encoded files use only one byte for characters in the ASCII range (note that PowerShell (Core) 7+ now consistently defaults to (BOM-less) UTF-8). Use the -Encoding parameter as needed.
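For example (a sketch; choose whichever encoding your downstream tooling expects):
$x | Out-File $out -Encoding ascii    # single-byte characters from Out-File in Windows PowerShell
$x | Set-Content $out                 # ANSI ("Default") encoding in Windows PowerShell
$x | Out-File $out -Encoding utf8     # UTF-8, written with a BOM in Windows PowerShell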
A regex-based solution to your problem is never fully robust, even if you try to limit the comment removal to single-line comments.
For full robustness, you must indeed use PowerShell's language parser, as noted in the other answers.
However, care must be taken when reconstructing the original source code with the comments removed:
AdminOfThings's answer risks removing too much, given the subsequent global regex-based processing with -replace: while the scenario may be unlikely, if a comment is repeated inside a string, it would mistakenly be removed from there too.
iRon's answer risks syntax errors by joining the tokens without spaces, so that . .\foo.ps1 would turn into ..\foo.ps1, for instance. Blindly putting a space between tokens is not an option either, because property-access syntax would break: $host.Name would turn into $host . Name, and whitespace between a value and the . operator isn't allowed.
The following solution avoids these problems, while trying to preserve the formatting of the original code as much as possible, but this has limitations, because intra-line whitespace isn't reported by the parser:
This means that you can't tell whether whitespace between tokens on a given line is made up of tabs, spaces, or a mix of both. The solution below replaces any tab characters with 2 spaces before processing; adjust as needed.
To somewhat compensate for the removal of comments occupying their own line(s), more than 2 consecutive blank or empty lines are folded into a single empty one. It is possible to remove blank/empty lines altogether, but that could hurt readability.
# Tokenize the file content.
# Note that tabs, if any, are replaced by 2 spaces first; adjust as needed.
$tokens = $null
$null = [System.Management.Automation.Language.Parser]::ParseInput(
    ((Get-Content -Raw .\myscript.ps1) -replace '\t', '  '),
    [ref] $tokens,
    [ref] $null
)

# Loop over all tokens while omitting comments, and rebuild the source code
# without them, trying to preserve the original formatting as much as possible.
$sb = [System.Text.StringBuilder]::new()
$prevExtent = $null; $numConsecNewlines = 0
$tokens.
    Where({ $_.Kind -ne 'Comment' }).
    ForEach({
        $startColumn = if ($_.Extent.StartLineNumber -eq $prevExtent.StartLineNumber) { $prevExtent.EndColumnNumber }
                       else { 1 }
        if ($_.Kind -eq 'NewLine') {
            # Fold multiple blank or empty lines into a single empty one.
            if (++$numConsecNewlines -ge 3) { return }
        } else {
            $numConsecNewlines = 0
            $null = $sb.Append(' ' * ($_.Extent.StartColumnNumber - $startColumn))
        }
        $null = $sb.Append($_.Text)
        $prevExtent = $_.Extent
    })

# Output the result.
# Pipe to Set-Content as needed.
$sb.ToString()

How Do I change a string in a specific line contained in a file preserving all other lines?

I have a file that contains this information:
Type=OleDll
Reference=*\G{00020430-0000-0000-C000-000000000046}#2.0#0#..\..\..\..\..\..\..\Windows\SysWOW64\stdole2.tlb#OLE Automation
Reference=*\G{7C0FFAB0-CD84-11D0-949A-00A0C91110ED}#1.0#0#..\..\..\..\..\..\..\Windows\SysWOW64\msdatsrc.tlb#Microsoft Data Source Interfaces for ActiveX Data Binding Type Library
Reference=*\G{26C4A893-1B44-4616-8684-8AC2FA6B0610}#1.0#0#..\..\..\..\..\..\..\Windows\SysWow64\Conexion_NF.dll#Zeus Data Access Library 1.0 (NF)
Reference=*\G{9668818B-3228-49FD-A809-8229CC8AA40F}#1.0#0#..\packages\ZeusMaestrosContabilidad.19.3.0\lib\native\ZeusMaestrosContabilidad190300.dll#Zeus Maestros Contables Des (Contabilidad)
I need to change the data between the {} characters on line 5 using PowerShell, and save the change while preserving all other information in the file.
You can use the -replace operator to perform a regex match and string replacement.
If there is only one pair of {} per line, you can do the following, where .*? matches any non-newline character as few times as possible. Since Get-Content by default creates an object that is an array of lines, you can access each line by index, with [4] being line 5.
$content = Get-Content File.txt
$content[4] = $content[4] -replace '{.*?}','{new data}'
$content | Set-Content File.txt
If there could be multiple {} pairs per line, you will need to be more specific with your regex. A positive lookbehind assertion (?<=) will do.
$content = Get-Content File.txt
$content[4] = $content[4] -replace '(?<=Reference=\*\\G){.*?}','{newest data}'
$content | Set-Content File.txt
For the case when you don't know which line contains the data you want to replace, you will need to be more specific about the data you are replacing.
(Get-Content File.txt) -replace '\{9668818B-3228-49FD-A809-8229CC8AA40F\}','{New Data}' | Set-Content File.txt
If there are any encoding requirements, consider using the -Encoding parameter on the Get-Content and Set-Content commands.
Try Regex: (?<=(?:.*\n){4}Reference=\*\\G\{)[\w-]+
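Applied from PowerShell, that might look like this (a sketch; the file name is assumed, and \A is added to anchor the four-line count to the start of the file so only line 5 matches). The lookbehind counts physical lines, so the file must be read as a single string with -Raw:
(Get-Content File.txt -Raw) -replace '(?<=\A(?:.*\n){4}Reference=\*\\G\{)[\w-]+', 'NEW-DATA-HERE' |
    Set-Content File.txt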
If the content of the {} is always the same, you can match the literal GUID directly:
(Get-Content $yourfile) -replace '\{9668818B-3228-49FD-A809-8229CC8AA40F\}', '{YOURNEWVALUE}' | Set-Content $yourfile
One solution:
$Content = Get-Content "C:\temp\test.txt"
$Row5Split = $Content[4].Split("{}".ToCharArray())
$Content[4] = "{0}{1}{2}" -f $Row5Split[0], "{YOURNEWVALUE}", $Row5Split[2]
$Content | Out-File "C:\temp\test2.txt"
One approach would be to find
(.*Reference=\*\\G{)[^\r\n}]+
and replace with
$1any_thing_you_like_to_replace_with

Regular expression seems not to work in Where-Object cmdlet

I am trying to add quote characters around two fields in a file of comma separated lines. Here is one line of data:
1/22/2018 0:00:00,0000000,001B9706BE,1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
which I would like to become this:
1/22/2018 0:00:00,"0000000","001B9706BE",1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
I began developing my regular expression in a simple PowerShell script, and soon I have the following:
$strData = '1/29/2018 0:00:00,0000000,001B9706BE,1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0'
$strNew = $strData -replace "([^,]*),([^,]*),([^,]*),(.*)",'$1,"$2","$3",$4'
$strNew
which gives me this output:
1/29/2018 0:00:00,"0000000","001B9706BE",1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
Great! I'm all set. Extend this example to the general case of a file of similar lines of data:
Get-Content test_data.csv | Where-Object -FilterScript {
    $_ -replace "([^,]*),([^,]*),([^,]*),(.*)", '$1,"$2","$3",$4'
}
This is a listing of test_data.csv:
1/29/2018 0:00:00,0000000,001B9706BE,1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104938428,0016C4C483,1,45,0,1,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104943875,0016C4B0BC,1,31,0,1,0,0,0,0,0,0,0,0,0,0,25,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104948067,0016C4834D,1,33,0,1,0,0,0,0,0,0,0,0,0,0,23,0,1,0,0,0,0,0,0,0,0,0,0
This is the output of my script:
1/29/2018 0:00:00,0000000,001B9706BE,1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104938428,0016C4C483,1,45,0,1,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104943875,0016C4B0BC,1,31,0,1,0,0,0,0,0,0,0,0,0,0,25,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104948067,0016C4834D,1,33,0,1,0,0,0,0,0,0,0,0,0,0,23,0,1,0,0,0,0,0,0,0,0,0,0
I have also tried this version of the script:
Get-Content test_data.csv | Where-Object -FilterScript {
    $_ -replace "([^,]*),([^,]*),([^,]*),(.*)", "`$1,`"`$2`",`"`$3`",$4"
}
and obtained the same results.
My simple test script has convinced me that the regex is correct, but something happens when I use that regex inside a filter script in the Where-Object cmdlet.
What simple, yet critical, detail am I overlooking here?
Here is my PSVersion:
Major  Minor  Build  Revision
-----  -----  -----  --------
5      0      10586  117
You're misunderstanding how Where-Object works. The cmdlet outputs those input lines for which the -FilterScript expression evaluates to $true. It does NOT output whatever you do inside that scriptblock (you'd use ForEach-Object for that).
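To see why every line passes through unchanged: the -replace expression evaluates to a non-empty string, and Where-Object coerces that to $true, so it emits the original line (a quick illustration, using a shortened version of the question's first record):
# The scriptblock's result is a non-empty string (truthy), so Where-Object
# passes the ORIGINAL line through and the replaced text is discarded.
[bool]('1/29/2018 0:00:00,0000000,001B9706BE' -replace '([^,]*),([^,]*),([^,]*)', '$1,"$2","$3"')   # True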
You don't need either Where-Object or ForEach-Object, though. Just put Get-Content in parentheses and use that as the first operand for the -replace operator. You also don't need the 4th capturing group. I would recommend anchoring the expression at the beginning of the string, though.
(Get-Content test_data.csv) -replace '^([^,]*),([^,]*),([^,]*)', '$1,"$2","$3"'
This seems to work here. I used ForEach-Object to process each record.
Get-Content test_data.csv |
    ForEach-Object { $_ -replace "([^,]*),([^,]*),([^,]*),(.*)", '$1,"$2","$3",$4' }
This also seems to work. It uses ? to make the quantifiers reluctant (lazy).
Get-Content test_data.csv |
    ForEach-Object { $_ -replace '(.*?),(.*?),(.*?),(.*)', '$1,"$2","$3",$4' }
I would just make a small change to what you have in order for this to work. Simply change the script to the following, noting that I changed the -FilterScript to a ForEach-Object and fixed a minor typo in the last item of the replacement string (the unescaped $4 was being interpolated as an empty PowerShell variable inside the double-quoted string):
Get-Content c:\temp\test_data.csv | ForEach-Object {
    $_ -replace "([^,]*),([^,]*),([^,]*),(.*)", "`$1,`"`$2`",`"`$3`",`$4"
}
I tested this with the data you provided and it adds the quotes to the correct columns.

Remove Extra Lines from CSV

I have a CSV file from which I am trying to remove extra lines (not sure how many there will be) from the top, and also lines in the middle of the CSV that say SourceIP, DestinationIP, etc.
I tried the following:
$m = gc D:\Script\textfile.txt
Select-String D:\Script\my.csv -pattern $m -Match
And textfile.txt has
*.*.*.*
But I get an error:
Select-String : A parameter cannot be found that matches parameter name 'Match'.
How do I even match the strings I want (or don't want)? I'd like the resulting CSV to contain only the data rows.
Use the Import-Csv cmdlet:
Import-Csv YourFileLocation -Header SourceIP, DestinationIP, Application |
    Where-Object { $_.SourceIP -match "^[0-9]+" } | Export-Csv OutputFile.csv
It allows you to set custom header names, and then you can do a regex search on the SourceIP header and keep only rows that start with a digit. Once that's done, Export-Csv spits it out.
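Note that in Windows PowerShell, Export-Csv also prepends a #TYPE information line by default; a sketch with that suppressed:
Import-Csv YourFileLocation -Header SourceIP, DestinationIP, Application |
    Where-Object { $_.SourceIP -match '^[0-9]+' } |
    Export-Csv OutputFile.csv -NoTypeInformation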

Using powershell, in a csv doc, need to iterate and insert a character

So my csv file looks something like:
J|T|W
J|T|W
J|T|W
I'd like to iterate through, most likely using a regex, so that after the two pipes and their content (something like \|.+{2}), I can insert a tab character `t.
I'm assuming I'd use get-content to loop through, but I'm unsure of where to go from there.
Also...just thought of this, it is possible that the line will overrun to the next line, and therefore the two pipes will be on different lines, which I'm pretty sure makes a difference.
-Thanks
Ok, I'll move the comment discussion to an answer since it seems like it is a potentially valid solution:
Import-csv .\test.csv -Delimiter '|' -Header 'One', 'two', 'three' | %{$_.Three = "`t$($_.Three)"; $_} | Export-CSV .\test_result.csv
This works for a file that is known to have 3 fields. For a more generic solution, if you have the ability to determine the number of fields initially being exported to CSV, then:
Import-csv .\test.csv -Delimiter '|' -Header (1..$fieldCount) | %{$_.$fieldCount = "`t$($_.$fieldCount)"; $_} | Export-CSV .\test_result.csv
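If the field count isn't known up front, one way to derive it is from the first line (a sketch that assumes no embedded pipes inside quoted fields):
# Count the pipe-delimited fields on the first line of the file.
$fieldCount = (Get-Content .\test.csv -TotalCount 1).Split('|').Count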
In PowerShell you can use the -replace operator with a regex e.g.:
$c = Get-Content foo.csv | Foreach {$_ -replace '<regex_here>','new_string'}
$c | Out-File foo.csv -encoding ascii
Note that in new_string you can refer to capture groups using $1 but you'll want to put that string in single quotes so PowerShell won't try to interpret $1 as a variable reference.
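For this particular file, a sketch of that approach (file name assumed): capture everything through the second pipe and reinsert it followed by a literal tab, mirroring what the Import-Csv answer above does.
# 'J|T|W' becomes 'J|T|<tab>W': `$1 keeps the captured text through the
# second pipe, and `t appends an actual tab character after it.
$c = Get-Content foo.csv | Foreach {$_ -replace '^([^|]*\|[^|]*\|)', "`$1`t"}
$c | Out-File foo.csv -encoding ascii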