Regex replace contents of file and delete lines that don't match

I have a large log file where I want to extract certain types of lines. I have created a working regex to match these lines. How can I now use this regex to extract the lines and nothing else? I have tried
cat .\file | %{
if($_ -match "..."){
$_ -replace "...", '...'
}
else{
$_ -replace ".*", ""
}
}
Which almost works, but the lines that are not of interest still remain as blank lines (meaning the lines of interest are spaced VERY far apart).

The best way is to remove the else clause altogether. If you do that, then no object will be returned from that iteration of the ForEach-Object block.
cat .\file | %{
if($_ -match "..."){
$_ -replace "...", '...'
}
}

Just to add to briantist's answer: you don't even need the loop structure. -match and -replace also work as array operators, which removes the need for both the if and ForEach-Object.
(Get-Content .\file) -match "..." -replace "...","..."
(cat is an alias for Get-Content.)
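A minimal sketch of the array-operator behavior, using a hypothetical in-memory array and made-up patterns in place of the log file. One assumption worth stating: if Get-Content returns only a single line, the left-hand side is a scalar string and -match returns a Boolean instead of filtering, so wrapping the input in @(...) is the safer form.

```powershell
# Hypothetical sample lines standing in for the contents of the log file.
$lines = 'INFO start', 'ERROR disk full', 'INFO ok', 'ERROR timeout'

# With an array on the left, -match keeps only the matching elements,
# and -replace is then applied to each element that was kept.
$result = @($lines) -match '^ERROR' -replace '^ERROR\s+', 'E: '

$result
```

Non-matching lines never reach -replace, so no blank lines are produced.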

Related

Issues finding and replacing strings in PowerShell

I'm rather new to PowerShell and I'm trying to write a PowerShell script to convert some statements in VBScript to Microsoft JScript. Here is my code:
$vbs = 'C:\infile.vbs'
$js = 'C:\outfile.js'
(Get-Content $vbs | Set-Content $js)
(Get-Content $js) |
Foreach-Object { $_ -match "Sub " } | Foreach-Object { "$_()`n`{" } | Foreach-Object { $_ -replace "Sub", "function" } | Out-File $js
Foreach-Object { $_ -match "End Sub" } | Foreach-Object { $_ -replace "End Sub", "`}" } | Out-File $js
Foreach-Object { $_ -match "Function " } | Foreach-Object { "$_()`n`{" } | Foreach-Object { $_ -replace "Function", "function" } | Out-File $js
Foreach-Object { $_ -match "End Function" } | Foreach-Object { $_ -replace "End Function", "`}" } | Out-File $js
What I want is for my PowerShell program to take the code from the VBScript input file infile.vbs, convert it, and output it to the JScript output file outfile.js. Here is an example of what I want it to do:
Input file:
Sub HelloWorld
(Code Here)
End Sub
Output File:
function HelloWorld()
{
(Code Here)
}
Something similar would happen with regard to functions. From there, I would tweak the code manually to convert it. When I run my program in PowerShell v5.1, it does not show any errors. However, when I open outfile.js, I see only one line:
False
So really, I have two questions. 1. Why is this happening? 2. How can I fix this program so that it behaves how I want it to (as detailed above)?
Thanks,
Gabe
You could also do this with the switch statement. Like so:
$vbs = 'C:\infile.vbs'
$js = 'C:\outfile.js'
Get-Content $vbs | ForEach-Object {
switch -Regex ($_) {
'Sub '{
'function {0}(){1}{2}' -f $_.Remove($_.IndexOf('Sub '),4).Trim(),[Environment]::NewLine,'{'
}
'End Sub'{
'}'
}
'Function ' {
'function {0}(){1}{2}' -f $_.Remove($_.IndexOf('Function '),9).Trim(),[Environment]::NewLine,'{'
}
'End Function' {
'}'
}
default {
$_
}
}
} | Out-File $js
As for question #2 (How can I fix this program [...]?):
Kirill Pashkov's helpful answer offers an elegant solution based on the switch statement.
Note, however, that his solution:
is predicated on Sub <name> / Function <name> statement parts not being on the same line as the matching End Sub / End Function parts - while this is typically the case, it isn't a syntactical requirement; e.g., Sub Foo() WScript.Echo("hi") End Sub - on a single line - works too.
in line with your own solution attempt, blindly appends () to Sub / Function definitions, which won't work with input procedures / functions that already have parameter declarations (e.g., Sub Foo (bar, baz)).
The following solution:
also works with single-line Sub / Function definition
correctly preserves parameter declarations
Get-Content $vbs | ForEach-Object {
$_ -replace '\b(?:sub|function)\s+(\w+)\s*(\(.*?\))', 'function $1$2 {' `
-replace '\bend\s+(?:sub|function)\b', '}'
} | Out-File $js
The above relies heavily on regexes (regular expressions) to transform the input; for specifics on how regex matching results can be referred to in the -replace operator's replacement-string operand, see this answer.
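Caveats: The pattern above only matches declarations that include a parenthesized parameter list, so a line such as Sub HelloWorld (without parentheses) passes through unchanged. There are also many other syntax differences between VBScript and JScript that your approach doesn't cover, notably that VBScript has no return statement and instead uses <funcName> = ... to return values from functions.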
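For input like the question's Sub HelloWorld, which has no parentheses, a variant with an optional parameter list may be useful. This is only a sketch: the helper name Convert-VbsLine is hypothetical, and it makes the parenthesized part optional while capturing only the text between the parentheses, so an absent list becomes ().

```powershell
# Hypothetical helper, not part of the answer above.
function Convert-VbsLine([string]$Line) {
    # $1 = procedure name, $2 = text inside the parentheses (empty if absent).
    $Line -replace '\b(?:sub|function)\s+(\w+)\s*(?:\((.*?)\))?', 'function $1($2) {' `
          -replace '\bend\s+(?:sub|function)\b', '}'
}

$noParams   = Convert-VbsLine 'Sub HelloWorld'
$withParams = Convert-VbsLine 'Sub Foo (bar, baz)'
$endLine    = Convert-VbsLine 'End Sub'
```

Here $noParams is 'function HelloWorld() {', $withParams is 'function Foo(bar, baz) {', and $endLine is '}'.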
As for question #1:
However, when I open outfile.js, I see only one line:
False
[...]
1. Why is this happening?
All but the first ForEach-Object cmdlet call run in separate statements, because the initial pipeline ends with the first call to Out-File $js.
The subsequent ForEach-Object calls each start a new pipeline, and since each pipeline ends with Out-File $js, each such pipeline writes to file $js - and thereby overwrites whatever the previous one wrote.
Therefore, it is the last pipeline that determines the ultimate contents of file $js.
A ForEach-Object that starts a pipeline receives no input. However, its associated script block ({...}) is still entered once in this case, with $_ being $null[1]:
The last pipeline starts with Foreach-Object { $_ -match "End Function" }, so its output is the equivalent of $null -match "End Function", which yields $False, because -match with a scalar LHS (a single input object) outputs a Boolean value that indicates whether a match was found or not.
Therefore, given that the middle pipeline segment (Foreach-Object { $_ -replace "End Function", "}" }) is an effective no-op ($False is stringified to 'False', and the -replace operator therefore finds no match to replace and passes the stringified input out unmodified), Out-File $js receives string 'False' and writes just that to output file $js.
Even if you transformed your separate commands into a single pipeline with a single Out-File $js segment at the very end, your command wouldn't work, however:
Given that Get-Content sends the input file's lines through the pipeline one by one, something like $_ -match "Sub " will again produce a Boolean result - indicating whether the line at hand ($_) matched string "Sub " - and pass that on.
While you could turn -match into a filter by making the LHS an array - by enclosing it in the array-subexpression operator @(...); e.g., @($_) -match "Sub " - that would:
pass lines that contain the substring Sub through as a whole, and
omit lines that don't.
In other words: This wouldn't work as intended, because:
lines that do not contain a matching substring would be omitted from the output, and
the lines that do match are reflected in full in $_ in the next pipeline segment - not just the matched part.
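The scalar and array behaviors of -match described above can be checked in isolation. The variable names in this sketch are illustrative only:

```powershell
# Scalar LHS: -match returns a Boolean. $null does not match, so this is $false.
$scalarResult = $null -match 'End Function'

# -replace stringifies its input; nothing matches, so 'False' passes through.
$fileContent = $scalarResult -replace 'End Function', '}'

# Array LHS: -match acts as a filter and returns the matching elements whole.
$kept    = @('Sub Foo') -match 'Sub '
$dropped = @('x = 1') -match 'Sub '
```

$fileContent ends up as the string 'False', $kept contains the full line 'Sub Foo', and $dropped is empty.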
[1] Strictly speaking, $_ will retain whatever value it had in the current scope, but that will only be non-$null if you explicitly assigned a value to $_ - given that $_ is an automatic variable that is normally controlled by PowerShell itself, however, doing so is ill-advised - see this GitHub discussion.
OK, there are a few things wrong with this script.
ForEach-Object, otherwise known as %, is used to iterate over every item in a pipeline.
An example is
@(1..10) | %{ "This is Array Item $_"}
This will output 10 lines, one per array item. In your current script you are using this where a Where-Object, also known as ?, should be.
@(1..10) | ?{ $_ -gt 5 }
This will output all numbers greater than 5.
An example of what you are trying to go for is something like
function ConvertTo-JS([string]$InputFilePath,[string]$SaveAs){
Get-Content $InputFilePath |
# Replace the End statements first, so the Sub/Function replacements below cannot alter them
%{$_ -replace "End Sub", "}"} |
%{$_ -replace "End Function", "}"} |
%{$_ -replace "Sub", "function"} |
%{$_ -replace "Function", "function"} |
Out-File $SaveAs
}
ConvertTo-JS -InputFilePath "C:\TEST\TEST.vbs" -SaveAs "C:\TEST\TEST.JS"
This doesn't take into account adding a { at the beginning of a function or adding the () either. But with the information provided, hopefully that puts you on the right track.

Regular expression seems not to work in Where-Object cmdlet

I am trying to add quote characters around two fields in a file of comma separated lines. Here is one line of data:
1/22/2018 0:00:00,0000000,001B9706BE,1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
which I would like to become this:
1/22/2018 0:00:00,"0000000","001B9706BE",1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
I began developing my regular expression in a simple PowerShell script, and soon I have the following:
$strData = '1/29/2018 0:00:00,0000000,001B9706BE,1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0'
$strNew = $strData -replace "([^,]*),([^,]*),([^,]*),(.*)",'$1,"$2","$3",$4'
$strNew
which gives me this output:
1/29/2018 0:00:00,"0000000","001B9706BE",1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
Great! I'm all set. Extend this example to the general case of a file of similar lines of data:
Get-Content test_data.csv | Where-Object -FilterScript {
$_ -replace "([^,]*),([^,]*),([^,]*),(.*)", '$1,"$2","$3",$4'
}
This is a listing of test_data.csv:
1/29/2018 0:00:00,0000000,001B9706BE,1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104938428,0016C4C483,1,45,0,1,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104943875,0016C4B0BC,1,31,0,1,0,0,0,0,0,0,0,0,0,0,25,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104948067,0016C4834D,1,33,0,1,0,0,0,0,0,0,0,0,0,0,23,0,1,0,0,0,0,0,0,0,0,0,0
This is the output of my script:
1/29/2018 0:00:00,0000000,001B9706BE,1,21,0,1,0,0,0,0,0,0,0,0,0,0,13,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104938428,0016C4C483,1,45,0,1,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104943875,0016C4B0BC,1,31,0,1,0,0,0,0,0,0,0,0,0,0,25,0,1,0,0,0,0,0,0,0,0,0,0
1/29/2018 0:00:00,104948067,0016C4834D,1,33,0,1,0,0,0,0,0,0,0,0,0,0,23,0,1,0,0,0,0,0,0,0,0,0,0
I have also tried this version of the script:
Get-Content test_data.csv | Where-Object -FilterScript {
$_ -replace "([^,]*),([^,]*),([^,]*),(.*)", "`$1,`"`$2`",`"`$3`",$4"
}
and obtained the same results.
My simple test script has convinced me that the regex is correct, but something happens when I use that regex inside a filter script in the Where-Object cmdlet.
What simple, yet critical, detail am I overlooking here?
Here is my PSVersion:
Major Minor Build Revision
----- ----- ----- --------
5 0 10586 117
You're misunderstanding how Where-Object works. The cmdlet outputs those input lines for which the -FilterScript expression evaluates to $true. It does NOT output whatever you do inside that scriptblock (you'd use ForEach-Object for that).
You don't need either Where-Object or ForEach-Object, though. Just put Get-Content in parentheses and use that as the first operand for the -replace operator. You also don't need the 4th capturing group. I would recommend anchoring the expression at the beginning of the string, though.
(Get-Content test_data.csv) -replace '^([^,]*),([^,]*),([^,]*)', '$1,"$2","$3"'
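The difference between the two cmdlets can be seen side by side in a small sketch. The two sample lines and the shortened pattern are made up for illustration:

```powershell
# Hypothetical sample lines standing in for the CSV file's contents.
$lines = 'a,b,c,d', 'e,f,g,h'

# Where-Object: the script block's result is only used as a true/false test.
# A non-empty string counts as $true, so every input line passes through unchanged.
$filtered = $lines | Where-Object { $_ -replace '^([^,]*),([^,]*)', '$1,"$2"' }

# ForEach-Object: the script block's result is what gets output.
$mapped = $lines | ForEach-Object { $_ -replace '^([^,]*),([^,]*)', '$1,"$2"' }
```

$filtered holds the original lines, while $mapped holds 'a,"b",c,d' and 'e,"f",g,h'.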
This seems to work here. I used ForEach-Object to process each record.
Get-Content test_data.csv |
ForEach-Object { $_ -replace "([^,]*),([^,]*),([^,]*),(.*)", '$1,"$2","$3",$4' }
This also seems to work. Uses the ? to create a reluctant (lazy) capture.
Get-Content test_data.csv |
ForEach-Object { $_ -replace '(.*?),(.*?),(.*?),(.*)', '$1,"$2","$3",$4' }
I would just make a small change to what you have in order for this to work. Simply change the script to the following, noting that I changed Where-Object -FilterScript to ForEach-Object and escaped the $4 in the replacement string with a backtick; inside double quotes, PowerShell would otherwise expand an unescaped $4 as a (typically empty) variable before -replace ever sees it:
Get-Content c:\temp\test_data.csv | ForEach-Object {
$_ -replace "([^,]*),([^,]*),([^,]*),(.*)", "`$1,`"`$2`",`"`$3`",`$4"
}
I tested this with the data you provided and it adds the quotes to the correct columns.

replace thousands separators in csv with regex

I'm running into problems trying to pull the thousands separators out of some currency values in a set of files. The "bad" values are delimited with commas and double quotes. There are other values in there that are < $1000 that present no issue.
Example of existing file:
"12,345.67",12.34,"123,456.78",1.00,"123,456,789.12"
Example of desired file (thousands separators removed):
"12345.67",12.34,"123456.78",1.00,"123456789.12"
I found a regex expression for matching the numbers with separators that works great, but I'm having trouble with the -replace operator. The replacement value is confusing me. I read about $& and I'm wondering if I should use that here. I tried $_, but that pulls out ALL my commas. Do I have to use $matches somehow?
Here's my code:
$Files = Get-ChildItem *input.csv
foreach ($file in $Files)
{
$file |
Get-Content | #assume that I can't use -raw
% {$_ -replace '"[\d]{1,3}(,[\d]{3})*(\.[\d]+)?"', ("$&" -replace ',','')} | #this is my problem
out-file output.csv -append -encoding ascii
}
Tony Hinkle's comment is the answer: don't use regex for this (at least not directly on the CSV file).
Your CSV is valid, so you should parse it as such, work on the objects (change the text if you want), then write a new CSV.
Import-Csv -Path .\my.csv | ForEach-Object {
$_ | ForEach-Object {
$_ -replace ',',''
}
} | Export-Csv -Path .\my_new.csv
(this code needs work, specifically the middle as the row will have each column as a property, not an array, but a more complete version of your CSV would make that easier to demonstrate)
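One way the middle part might be filled in is to loop over each row's properties. This is a sketch under assumptions: the sample data has no header row, so the column names F1..F5 are hypothetical, and the sample line is parsed from a string with ConvertFrom-Csv instead of being read from a file.

```powershell
# Sample row from the question; the header names are made up for illustration.
$text = '"12,345.67",12.34,"123,456.78",1.00,"123,456,789.12"'
$rows = @($text | ConvertFrom-Csv -Header F1, F2, F3, F4, F5)

foreach ($row in $rows) {
    # Each column is a property on the row object, not an array element.
    foreach ($prop in $row.PSObject.Properties) {
        $prop.Value = $prop.Value -replace ',', ''
    }
}

# ConvertTo-Csv emits a header line first; skip it to get the data rows only.
$rows | ConvertTo-Csv -NoTypeInformation | Select-Object -Skip 1
```

Note that the CSV cmdlets decide the quoting on output themselves, so the result may quote fields that were unquoted in the input.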
You can try with this regex:
,(?=(\d{3},?)+(?:\.\d{1,3})?")
See Live Demo or in powershell:
% {$_ -replace ',(?=(\d{3},?)+(?:\.\d{1,3})?")','' }
But that's more about the challenge that regex can bring. For proper work, use @briantist's answer, which is the clean way to do this.
I would use a simpler regex, and use capture groups instead of the entire capture.
I have tested the following regular expression with your input and found no issues.
% {$_ -replace '([\d]),([\d])','$1$2' }
I.e., find every comma that has a digit directly before and after it (so that the mix of quoted and unquoted fields doesn't matter) and remove it.
This would cause problems if two unquoted numeric fields are adjacent (e.g., 12.34,1.00), because the field separator between them would be removed as well.

Powershell replace exact string

I want to replace a simple string "WEEK." (with a dot) in a text file with the string "TEST"
$LOG= "C:\FILE.TXT"
$A= "TEST"
(Get-Content $LOG) | Foreach { $_ -Replace "WEEK.", $A } | Set-Content $LOG;
The problem is that my file has this content:
WEEK_A WEEK.
And when I run my script the result is:
TESTA TEST
and the result that I want is:
WEEK_A TEST
I tried with ^ "WEEK." and "^WEEK.$" but neither worked.
Can you help me with the regexp? Thanks
====== EDIT ==================
OK. I tried with
$LOG= "C:\FILE.TXT"
$A= "TEST"
(Get-Content $LOG) | Foreach { $_ -Replace "WEEK\.", $A } | Set-Content $LOG;
and it seems to work.
This happened because you used the pattern WEEK. with an unescaped dot. In regular expressions, the dot means "any character". That's why it replaced both WEEK_ and WEEK. (the dot matched the underscore as well as the literal dot).
When you added the backslash, the dot was escaped, i.e., it lost its special meaning and matched only a literal dot. That is what made it work.
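If the search text should always be taken literally, two alternatives avoid escaping by hand. This is a minimal sketch using the sample line from the question:

```powershell
$line = 'WEEK_A WEEK.'

# [regex]::Escape turns 'WEEK.' into the pattern 'WEEK\.', a literal dot.
$pattern  = [regex]::Escape('WEEK.')
$viaRegex = $line -replace $pattern, 'TEST'

# The string's .Replace() method does no regex processing at all
# (note that, unlike -replace, it is case-sensitive).
$viaMethod = $line.Replace('WEEK.', 'TEST')
```

Both $viaRegex and $viaMethod are 'WEEK_A TEST'.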

Powershell: Leave item alone if regex doesn't match

I have a list of pdf files (from daily processing), some with date stamps of various formatting, some without.
Example:
$f = @("testLtr06-09-02.pdf", "otherletter.pdf","WelcomeLtr043009.pdf")
I am trying to remove the datestamp by stripping out dashes, then replacing any consecutive group of numbers (4 or more, I may change this to 6) with the string "DATESTAMP".
So far I have this:
$d = $f | foreach {$_ -replace "-", ""} | foreach { $_ -replace ([regex]::Matches($_ , "\d{4,}")), "DATESTAMP"}
echo $d
The output:
testLtrDATESTAMP.pdf
DATESTAMPoDATESTAMPtDATESTAMPhDATESTAMPeDATESTAMPrDATESTAMPlDATESTAMPeDATESTAMPtDATESTAMPtDATESTAMPeDATESTAMPrDATESTAMP.DATESTAMPpDATESTAMPdDATESTAMPfDATESTAMP
WelcomeLtrDATESTAMP.pdf
It works fine if the file has a datestamp but it seems to be freaking out the -replace and inserting DATESTAMP after every character. Is there a way to fix this? I tried to change it to a foreach loop but I couldn't figure out how to get true/false from regex.
Thanks in advance.
You can simply do:
PS > $f -replace "(\d{2}-){2}\d{2}|\d{4,}","DATESTAMP"
testLtrDATESTAMP.pdf
otherletter.pdf
WelcomeLtrDATESTAMP.pdf
$_ -replace ([regex]::Matches($_ , "\d{4,}")), "DATESTAMP"
means: in $_, replace every match of ([regex]::Matches($_ , "\d{4,}")) with "DATESTAMP".
In a filename with no datestamp (that is, without at least 4 consecutive digits) there is no match, so the expression evaluates to "" (an empty string).
Thus every empty string gets replaced with DATESTAMP, and such an empty string "" sits at the start of the string and after every character.
That's why you get this long string with every character surrounded by DATESTAMP.
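The effect of an empty pattern is easy to reproduce on its own. In this sketch the short input 'abc' is made up, and the second statement shows that passing the pattern itself to -replace leaves non-matching names alone:

```powershell
# An empty pattern matches at every position, including the start and the end.
$emptyPattern = 'abc' -replace '', 'X'

# Passing the pattern directly: names without 4+ consecutive digits are untouched.
$names   = 'otherletter.pdf', 'WelcomeLtr043009.pdf'
$guarded = $names -replace '\d{4,}', 'DATESTAMP'
```

$emptyPattern is 'XaXbXcX', and $guarded contains 'otherletter.pdf' and 'WelcomeLtrDATESTAMP.pdf'.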
To check whether your string contains a \d{4,} match at all, you should be able to use
[regex]::IsMatch($_, "\d{4,}")
I'm no PowerShell user, but this line alone should do the job. I'm not sure, though, whether an if can be used in a pipeline like this, or whether the assignment and the echo $d are needed.
$f | foreach-object {$_ -replace "-", ""} | foreach-object {if ($_ -match "\d{4,}") { $_ -replace "\d{4,}", "DATESTAMP"} else { $_ }}