Remove Extra Lines from CSV - regex

I have a CSV file from which I am trying to remove extra lines at the top (I'm not sure how many there will be), as well as lines in the middle of the CSV that say SourceIP, DestinationIP, etc.
I tried the following:
$m = gc D:\Script\textfile.txt
Select-String D:\Script\my.csv -pattern $m -Match
And textfile.txt contains
*.*.*.*
But I get the error:
Select-String : A parameter cannot be found that matches parameter name 'Match'.
How do I even match the strings I want (or don't want), because I'd like the resulting CSV to be

Use the Import-Csv cmdlet:
Import-Csv YourFileLocation -Header SourceIP, DestinationIP, Application |
where {$_.SourceIP -match "^[0-9]+"} | Export-Csv OutputFile.csv
It lets you set custom header names, so you can run a regex match against the SourceIP column and keep only the rows whose value starts with a digit. Once that's done, you can use Export-Csv to write the result out.
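The same filter can be tried on in-memory data before touching a real file. This is a minimal sketch: the sample lines are made up, and ConvertFrom-Csv / ConvertTo-Csv stand in for Import-Csv / Export-Csv so that nothing is read from or written to disk.

```powershell
# Hypothetical sample: a junk line on top and a repeated header row in the middle.
$lines = @(
    'Report generated on some date'
    'SourceIP,DestinationIP,Application'
    '10.0.0.1,10.0.0.2,http'
    'SourceIP,DestinationIP,Application'
    '192.168.1.5,10.0.0.9,dns'
)

# Assign our own header names, then keep only rows whose first field starts with a digit.
$rows = $lines |
    ConvertFrom-Csv -Header SourceIP, DestinationIP, Application |
    Where-Object { $_.SourceIP -match '^[0-9]+' }

# Show the surviving rows as CSV text.
$rows | ConvertTo-Csv -NoTypeInformation
```

Only the two data rows survive, because neither the junk line nor the repeated header starts with a digit.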

Related

replace thousands separators in csv with regex

I'm running into problems trying to pull the thousands separators out of some currency values in a set of files. The "bad" values are delimited with commas and double quotes. There are other values in there that are < $1000 that present no issue.
Example of existing file:
"12,345.67",12.34,"123,456.78",1.00,"123,456,789.12"
Example of desired file (thousands separators removed):
"12345.67",12.34,"123456.78",1.00,"123456789.12"
I found a regex expression for matching the numbers with separators that works great, but I'm having trouble with the -replace operator. The replacement value is confusing me. I read about $& and I'm wondering if I should use that here. I tried $_, but that pulls out ALL my commas. Do I have to use $matches somehow?
Here's my code:
$Files = Get-ChildItem *input.csv
foreach ($file in $Files)
{
$file |
Get-Content | #assume that I can't use -raw
% {$_ -replace '"[\d]{1,3}(,[\d]{3})*(\.[\d]+)?"', ("$&" -replace ',','')} | #this is my problem
out-file output.csv -append -encoding ascii
}
Tony Hinkle's comment is the answer: don't use regex for this (at least not directly on the CSV file).
Your CSV is valid, so you should parse it as such, work on the objects (change the text if you want), then write a new CSV.
Import-Csv -Path .\my.csv | ForEach-Object {
$_ | ForEach-Object {
$_ -replace ',',''
}
} | Export-Csv -Path .\my_new.csv
(this code needs work, specifically the middle as the row will have each column as a property, not an array, but a more complete version of your CSV would make that easier to demonstrate)
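One way the middle part could be filled in, as a rough sketch only: the sample line comes from the question, the column names A to E are invented (the sample has no header row), and ConvertFrom-Csv / ConvertTo-Csv are used in place of the file cmdlets to keep the example self-contained.

```powershell
# Sample row from the question; the header names below are hypothetical.
$line    = '"12,345.67",12.34,"123,456.78",1.00,"123,456,789.12"'
$headers = 'A', 'B', 'C', 'D', 'E'

# Parse the line as CSV so the quoted commas are treated as data, not delimiters.
$row = $line | ConvertFrom-Csv -Header $headers

# Each column is a property on the row object, so walk the properties
# and strip the thousands separators from every value.
foreach ($prop in $row.PSObject.Properties) {
    $prop.Value = $prop.Value -replace ',', ''
}

# Convert back to CSV text (use Export-Csv to write a file instead).
$row | ConvertTo-Csv -NoTypeInformation
```

Note that the CSV cmdlets decide on quoting themselves, so the output may quote fields that were unquoted in the input.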
You can try with this regex:
,(?=(\d{3},?)+(?:\.\d{1,3})?")
See Live Demo or in powershell:
% {$_ -replace ',(?=(\d{3},?)+(?:\.\d{1,3})?")','' }
But it's more about the challenge that regex can bring. For proper work, use @briantist's answer, which is the clean way to do this.
I would use a simpler regex, and use capture groups instead of the entire match.
I have tested the following regular expression with your input and found no issues.
% {$_ -replace '([\d]),([\d])','$1$2' }
That is, find every comma that has a digit directly before and after it (so the mix of quoted and unquoted fields doesn't matter) and remove the comma entirely.
This would break if your input ever has two unquoted numeric fields next to each other (for example 12.34,1.00), because the delimiter between them would be removed as well.
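On the original question about `$&`: the parenthesised expression in the attempted code is evaluated once by PowerShell, before -replace sees any match, so it cannot operate on the matched text. If a regex is used anyway, one option is to pass a script block as a .NET MatchEvaluator, which is called once per match. A minimal sketch using the sample line and a slightly simplified version of the pattern from the question:

```powershell
# Sample line from the question.
$line    = '"12,345.67",12.34,"123,456.78",1.00,"123,456,789.12"'

# Quoted number with thousands separators and an optional decimal part.
$pattern = '"\d{1,3}(,\d{3})*(\.\d+)?"'

# The script block runs once per match and returns the replacement text,
# here the matched value with its commas removed.
$evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
    param($m)
    $m.Value -replace ',', ''
}

$clean = [regex]::Replace($line, $pattern, $evaluator)
$clean
```

Commas outside the quoted numbers are never part of a match, so the field delimiters are left alone.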

Retain carriage returns in text filtered through a regular expression

I need to search though a folder of logs and retrieve the most recent logs. Then I need to filter each log, pull out the relevant information and save to another file.
The problem is the regular expression I use to filter the log is dropping the carriage return and the line feed so the new file just contains a jumble of text.
$Reg = "(?ms)\*{6}\sBEGIN(.|\n){98}13.06.2015(.|\n){104}00000003.*(?!\*\*)+"
get-childitem "logfolder" -filter *.log |
where-object {$_.LastAccessTime -gt [datetime]$Test.StartTime} |
foreach {
$a=get-content $_;
[regex]::matches($a,$reg) | foreach {$_.groups[0].value > "MyOutFile"}
}
Log structure:
******* BEGIN MESSAGE *******
<Info line 1>
Date 18.03.2010 15:07:37 18.03.2010
<Info line 2>
File Number: 00000003
<Info line 3>
*Variable number of lines*
******* END MESSAGE *******
Basically capture everything between the BEGIN and END where the dates and file numbers are a certain value. Does anyone know how I can do this without losing the line feeds? I also tried using Out-File | Select-String -Pattern $reg, but I've never had success with using Select-String on a multiline record.
As @Matt pointed out, you need to read the entire file as a single string if you want to do multiline matches. Otherwise your (multiline) regular expression would be applied to single lines one after the other. There are several ways to get the content of a file as a single string:
(Get-Content 'C:\path\to\file.txt') -join "`r`n"
Get-Content 'C:\path\to\file.txt' | Out-String
Get-Content 'C:\path\to\file.txt' -Raw (requires PowerShell v3 or newer)
[IO.File]::ReadAllText('C:\path\to\file.txt')
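The difference between the two reading modes can be seen with a small throwaway file. This is only a sketch: the file name and its three lines are made up, and -Raw assumes PowerShell v3 or newer.

```powershell
# Hypothetical three-line file in the temp folder.
$path = Join-Path ([IO.Path]::GetTempPath()) 'multiline_demo.log'
Set-Content -Path $path -Value 'BEGIN', 'line 1', 'END'

$asLines  = Get-Content $path        # array of strings, one per line
$asString = Get-Content $path -Raw   # one string with the line breaks intact

# A pattern that has to span several lines.
$re = '(?s)BEGIN.*?END'

# Applied line by line, no single line contains both BEGIN and END.
$lineHits   = @($asLines | Where-Object { $_ -match $re }).Count

# Applied to the whole text, the record is found once.
$stringHits = [regex]::Matches($asString, $re).Count

"line-by-line hits: $lineHits, single-string hits: $stringHits"
Remove-Item $path
```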
Also, I'd modify the regular expression a little. Most of the time log messages may vary in length, so matching fixed lengths may fail if the log message changes. It's better to match on invariant parts of the string and leave the rest as variable length matches. And personally I find it a lot easier to do this kind of content extraction in several steps (makes for simpler regular expressions). In your case I would first separate the log entries from each other, and then filter the content:
$date = [regex]::Escape('13.06.2015')
$fnum = '00000003'
$re1 = "(?ms)\*{7} BEGIN MESSAGE \*{7}\s*([\s\S]*?)\*{7} END MESSAGE \*{7}"
$re2 = "(?ms)[\s\S]*?Date\s+$date[\s\S]*?File Number:\s+$fnum[\s\S]*"
Get-ChildItem 'C:\log\folder' -Filter '*.log' | ? {
$_.LastAccessTime -gt [DateTime]$Test.StartTime
} | % {
Get-Content $_.FullName -Raw |
Select-String -Pattern $re1 -AllMatches |
select -Expand Matches |
% {
$_.Groups[1].Value |
Select-String -Pattern $re2 |
select -Expand Matches |
select -Expand Groups |
select -Expand Value
}
} | Set-Content 'C:\path\to\output.txt'
BTW, don't use the redirection operator (>) inside a loop. It would overwrite the output file's content with each iteration. If you must write to a file inside a loop use the append redirection operator instead (>>). However, performance-wise it's usually better to put writing to output files at the end of the pipeline (see above).
I wanted to see if I could improve that regex, but for now: if you are using those regex modes, you should read your text file in as a single string, which helps a lot.
$a=get-content $_ -Raw
or if you don't have PowerShell 3.0
$a=(get-content $_) -join "`r`n"
I had to solve the problem of disappearing newlines in a completely different context. What you get when you do a get-content of a text file is an array of records, where each record is a line of text.
The only way I found to put the newline back in after some transformation was to use the automatic variable $OFS (output field separator). The default value is space, but if you set it to carriage return line feed, then you get separate records on separate lines.
So try this (it might work):
$OFS = "`r`n"
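To see what $OFS does, here is a small sketch with two made-up lines. When an array is converted to a string, PowerShell joins the elements with $OFS, or with a single space if the variable is not set.

```powershell
# Two hypothetical lines, as Get-Content would return them.
$lines = 'first', 'second'

# With $OFS unset, the array is joined with a space when converted to a string.
$default = "$lines"

# With $OFS set to CR+LF, the same conversion keeps one record per line.
$OFS = "`r`n"
$joined = "$lines"

# Remove the variable again so later string conversions use the default.
Remove-Variable OFS

$default
$joined
```

Because $OFS affects every array-to-string conversion in its scope, an explicit -join "`r`n" is usually the less surprising choice.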

grep string between two other strings as delimiters

I have to do a report on how many times a certain CSS class appears in the content of our pages (over 10k pages). The trouble is, the header and footer contains that class, so a grep returns every single page.
So, how do I grep for content?
EDIT: I want to know whether a page has list-unstyled between <main> and </main>
So do I use a regular expression for that grep? or do I need to use PowerShell to have more functionality?
I have grep at my disposal and PowerShell, but I could use a portable software if that is my only option.
Ideally, I would get a report (.txt or .csv) with pages and line numbers where the class shows up, but just a list of the pages themselves would suffice.
EDIT: Progress
I now have this in PowerShell
$files = get-childitem -recurse -path w:\test\york\ -Filter *.html
foreach ($file in $files)
{
$htmlfile=[System.IO.File]::ReadAllText($file.fullName)
$regex="(?m)<main([\w\W]*)</main>"
if ($htmlfile -match $regex) {
$middle=$matches[1]
[regex]::Matches($middle,"list-unstyled")
Write-Host $file.fullName has matches in the middle:
}
}
Which I run with this command .\FindStr.ps1 | Export-csv C:\Tools\text.csv
It outputs the filename and path with the string in the console, but does not add anything to the CSV. How can I get that added in?
What Ansgar Wiechers' answer says is good advice: don't string-search HTML files. I don't have a problem with doing it, but it is worth noting that not all HTML files are the same and regex searches can produce flawed results. If tools exist that are aware of the file's content structure, you should use them.
I would like to take a simple approach that reports all files that have enough occurrences of the text list-unstyled in all html files in a given directory. You expect there to be 2? So if more than that show up then there is enough. I would have done a more complicated regex solution but since you want the line number as well I came up with this compromise.
$pattern = "list-unstyled"
Get-ChildItem C:\temp -Recurse -Filter *.html |
Select-String $pattern |
Group-Object Path |
Where-Object{$_.Count -gt 2} |
ForEach-Object{
$props = @{
File = $_.Group | Select-Object -First 1 -ExpandProperty Path
PatternFound = ($_.Group | Select-Object -ExpandProperty LineNumber) -join ";"
}
New-Object -TypeName PSCustomObject -Property $props
}
Select-String is a grep-like tool that can search files for strings. It reports the line number of each match in the file, which is why we are using it here.
You should get output that looks like this on your PowerShell console.
File PatternFound
---- ------------
C:\temp\content.html 4;11;54
Here 4, 11 and 54 are the line numbers where the text was found. The code filters out results where the count of lines is less than 3. So if you expect it once in the header and once in the footer, those results should be excluded.
You can create a regex that matches across multiple lines. The regex "(?m)<!-- main content -->([\w\W]*)<!-- end content -->" matches the content delimited by your comments. The group ([\w\W]*) matches everything between them, newlines included, which is what lets the match span lines (the (?m) option only changes how ^ and $ behave, so it is not strictly needed here). The group also lets you query $matches[1], which will contain your "main text" without headers and footers.
$htmlfile=[System.IO.File]::ReadAllText($fileToGrep)
$regex="(?m)<!-- main content -->([\w\W]*)<!-- end content -->"
if ($htmlfile -match $regex) {
$middle=$matches[1]
[regex]::Matches($middle,"list-unstyled")
}
This is only an example of how you could parse the file. Populate $fileToGrep with the name of the file you want to parse, then run this snippet to get the matches for list-unstyled in the middle of that file.
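On the empty CSV from the question: Write-Host writes to the console rather than to the pipeline, so the file names never reach Export-Csv. Emitting one object per page should fix that. A sketch under stated assumptions: the two page strings and the property names File and Hits are invented, and in-memory text stands in for the real files.

```powershell
# Hypothetical page contents keyed by file name.
$pages = @{
    'a.html' = "<header class='list-unstyled'></header><main><ul class='list-unstyled'></ul></main>"
    'b.html' = "<header class='list-unstyled'></header><main><p>text</p></main>"
}

$report = foreach ($name in $pages.Keys) {
    # (?s) lets . match newlines; .*? stops at the first closing tag.
    if ($pages[$name] -match '(?s)<main\b(.*?)</main>') {
        $count = [regex]::Matches($Matches[1], 'list-unstyled').Count
        if ($count -gt 0) {
            # Emit an object instead of calling Write-Host,
            # so the result travels down the pipeline.
            [PSCustomObject]@{ File = $name; Hits = $count }
        }
    }
}

# Objects can now be turned into CSV (or piped to Export-Csv).
$report | ConvertTo-Csv -NoTypeInformation
```

Only a.html is reported, because b.html has the class in its header but not inside <main>.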
Don't use string matches for something like this. Analyze the DOM instead. That should allow you to exclude headers and footers by selecting the appropriate root element.
$ie = New-Object -COM 'InternetExplorer.Application'
$url = '...'
$classname = 'list-unstyled'
$ie.Navigate($url)
do { Start-Sleep -Milliseconds 100 } until ($ie.ReadyState -eq 4)
$root = $ie.Document.getElementById('content-element-id')
$hits = $root.getElementsByTagName('*') | ? { $_.ClassName -eq $classname }
$hits.Count # number of occurrences of $classname below content element

Using powershell, in a csv doc, need to iterate and insert a character

So my csv file looks something like:
J|T|W
J|T|W
J|T|W
I'd like to iterate through the file, most likely using a regex (something like \|.+{2}), and insert a tab character `t after the two pipes and their content.
I'm assuming I'd use get-content to loop through, but I'm unsure of where to go from there.
Also...just thought of this, it is possible that the line will overrun to the next line, and therefore the two pipes will be on different lines, which I'm pretty sure makes a difference.
-Thanks
Ok, I'll move the comment discussion to an answer since it seems like it is a potentially valid solution:
Import-csv .\test.csv -Delimiter '|' -Header 'One', 'two', 'three' | %{$_.Three = "`t$($_.Three)"; $_} | Export-CSV .\test_result.cs
This works for a file that is known to have 3 fields. For a more generic solution, if you have the ability to determine the number of fields initially being exported to CSV, then:
Import-csv .\test.csv -Delimiter '|' -Header (1..$fieldCount) | %{$_.$fieldCount = "`t$($_.$fieldCount)"; $_} | Export-CSV .\test_result.cs
In PowerShell you can use the -replace operator with a regex e.g.:
$c = Get-Content foo.csv | Foreach {$_ -replace '<regex_here>','new_string'}
$c | Out-File foo.csv -encoding ascii
Note that in new_string you can refer to capture groups using $1 but you'll want to put that string in single quotes so PowerShell won't try to interpret $1 as a variable reference.
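For the three-field lines in the question, the placeholder regex could look like the following. This is a sketch that assumes each record sits on a single line, so it does not cover the case where a record wraps onto the next line.

```powershell
# Sample line from the question.
$line = 'J|T|W'

# Capture everything up to and including the second pipe.
$pattern = '^([^|]*\|[^|]*\|)'

# '$1' stays in single quotes so PowerShell leaves it for the regex engine;
# the tab comes from a double-quoted string, where `t is expanded.
$result = $line -replace $pattern, ('$1' + "`t")
$result
```

The same -replace expression can be dropped into the Foreach block above in place of the placeholder.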

Open a file and filter it using a regular expression

I have a large logfile and I want to extract (write to a new file) certain rows. The problem is I need a certain row and the row before. So the regex should be applied on more than one row. Notepad++ is not able to do that and I don't want to write a script for that.
I assume I can do that with Powershell and a one-liner, but I don't know where to start ...
The regular expression is not the problem; it will be something like ^#\d+.*?\n.*?Failed.*?$
So, how can I open a file in PowerShell, pass in the regex, and get back the rows that fit my expression?
Look at Select-String and -context parameter:
If you only need to display the matching line and the line before, use
(for a test I use my log file and my regex - the date there)
Get-Content c:\Windows\System32\LogFiles\HTTPERR\httperr2.log |
Select-String '2011-05-13 06:16:10' -context 1,0
If you need to manipulate it further, store the result in a variable and use the properties:
$line = Get-Content c:\Windows\System32\LogFiles\HTTPERR\httperr2.log |
Select-String '2011-05-13 06:16:10' -context 1
# for all the members try this:
$line | Get-Member
#line that matches the regex:
$line.Line
$line.Context.PreContext
If there are more lines that match the regex, access them with brackets:
$line = Get-Content c:\Windows\System32\LogFiles\HTTPERR\httperr2.log |
Select-String '2011-05-13 06:16:10' -context 1
$line[0] # first match
$line[1] # second match
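The same idea can be tried without a log file. In this sketch the four log lines are invented to resemble the question's format, and the array is piped to Select-String in place of Get-Content.

```powershell
# Hypothetical log lines: a header row followed by a result row.
$log = '#1 job started', 'step A Failed', '#2 job started', 'step B ok'

# -Context 1,0 keeps one line before each match and none after.
$hits = @($log | Select-String -Pattern 'Failed' -Context 1,0)

foreach ($hit in $hits) {
    $hit.Context.PreContext   # the line before the match
    $hit.Line                 # the matching line itself
}
```

Redirecting or piping that output to Set-Content would give the new file the question asks for.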