Exporting Hash Table Using Property Dictionary to CSV - regex

I can't seem to figure out how to export formatted information to a CSV without iterating through each item in the object and writing to the CSV line by line, which takes forever. I can export plain values to the CSV instantly; it's only when using the properties dictionary that I run into issues.
The TestCSV file is formatted with a column that has IP addresses.
Here's what I have:
$CSV = "C:\TEMP\OutputFile.csv"
$RX = "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.|dot|\[dot\]|\[\.\])){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
$TestCSV = "C:\TEMP\FileWithIPs.csv"
$spreadsheetDataobject = import-csv $TestCSV
$Finding = $spreadsheetDataObject | Select-String $RX
$Props = @{ # create a properties dictionary
LineNumber = $finding.LineNumber
Matches = $finding.Matches.Value
}
$OBJ = New-Object -TypeName psobject -Property $Props
$OBJ | Select-Object Matches,LineNumber | Export-Csv -Path $CSV -Append -NoTypeInformation

This isn't going to work as written. You are using Import-Csv, which creates an array of objects with properties, but Select-String expects strings as input, not objects. If you want to use Select-String, you would simply specify the file name, or use Get-Content on the file and pass that to Select-String. If what you want is the line number and the IP, I think this will work just as well, if not better, for you:
$CSV = "C:\TEMP\OutputFile.csv"
$RX = "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.|dot|\[dot\]|\[\.\])){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
$TestCSV = "C:\TEMP\FileWithIPs.csv"
$spreadsheetDataobject = import-csv $TestCSV
$spreadsheetDataobject |
Where{$_.IP -match $RX} |
Select-Object @{l='Matches';e={$_.IP}},@{l='LineNumber';e={[array]::IndexOf($spreadsheetDataobject,$_)+1}} |
Export-Csv -Path $CSV -Append -NoTypeInformation
Edit: wOxxOm is quite right, this answer has considerably more overhead than parsing the text directly like he does. Though, for somebody who's new to PowerShell it's probably easier to understand.
In regards to $_.IP, since you use Import-CSV you create an array of objects. Each object has properties associated with it based on the header of the CSV file. IP was listed in the header as one of your columns, so each object has a property of IP, and the value of that property is whatever was in the IP column for that record.
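To illustrate (the file contents here are made up, assuming only that the CSV has an IP column in its header):

```powershell
# Hypothetical contents of C:\TEMP\FileWithIPs.csv:
#   IP,Host
#   10.0.0.1,web01
#   10.0.0.2,web02
$rows = Import-Csv 'C:\TEMP\FileWithIPs.csv'
$rows[0].IP    # each header column becomes a property: here '10.0.0.1'
$rows[1].Host  # 'web02'
```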
Let me explain the Select line for you, and then you'll see that it's easy to add your source path as another column.
What I'm doing is defining properties with a hashtable. For my examples I will refer to the first one shown above. Since it is a hashtable it starts with @{ and ends with }. Inside there are two key/value pairs:
l='Matches'
e={$_.IP}
Essentially 'l' is short for Label, and 'e' is short for Expression. The label determines the name of the property being defined (which equates to the column header when you export). The expression defines the value assigned to the property. In this case I am really just renaming the IP column to Matches, since the value that I assign for each row is whatever is in the IP field. If you open the CSV in Excel, copy the entire IP column, paste it in at the end, and change the header to Matches, that is basically all I'm doing. So to add the file path as a column we can add one more hashtable to the Select line with this:
@{
l='FilePath'
e={$CSV}
}
That adds a third property, where the name is FilePath, and the value is whatever is stored in $CSV. That updated Select line would look like this:
Select-Object @{l='Matches';e={$_.IP}},@{l='LineNumber';e={[array]::IndexOf($spreadsheetDataobject,$_)+1}},@{l='FilePath';e={$CSV}} |

Any code based on the built-in CSV cmdlets is extremely slow because objects are created for every field on every line, and it's noticeable on large files (for example, the code from the other answer takes 900 seconds to process a 9 MB file with 100k lines).
If your input CSV file is simple, you can process it as text in under a second for a 100k-line file:
$CSV = .......
$RX = .......
$TestCSV = .......
$line = 0 # header line doesn't count
$lastMatchPos = 0
$text = [IO.File]::ReadAllText($TestCSV) -replace '"http.+?",', ','
$out = New-Object Text.StringBuilder
ForEach ($m in ([regex]"(?<=,""?)$RX(?=""?,)").Matches($text)) {
$line += $m.index - $lastMatchPos -
$text.substring($lastMatchPos, $m.index-$lastMatchPos).Replace("`n",'').length
$lastMatchPos = $m.Index + $m.length
$out.AppendLine('' + $line + ',' + $m.value) >$null
}
if (!(Test-Path $CSV)) {
'LineNumber,IP' | Out-File $CSV -Encoding ascii
}
$out.ToString() | Out-File $CSV -Encoding ascii -Append
The code zaps quoted URL fields just in the unlikely but possible case that those contain a matching IP.
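To see what that -replace does, here is a one-line example (the sample input is invented):

```powershell
# A quoted URL field containing an IP-like string is blanked out
# before the IP regex ever runs
'a,"http://10.1.2.3/x",b' -replace '"http.+?",', ','
# -> a,,b
```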

CSV only contains last entry from file

The CSV file only contains a partial entry from the last regex match.
I've used the ISE debugger and can verify it's finding matches.
$h = @{}
$a = @()
Get-ChildItem C:\Users\speterson\Documents\script\*.kiy | foreach {
Get-Content $_ | foreach {
if ($_ -match 'IF Ingroup\s+\(\s+\"(..+?)\"\s+\)') {
$h.Group = $matches[1]
}
if ($_ -match 'use\s+([A-Za-z]):"(\\\\..*?\\..*)\"') {
$h.DriveLetter = $matches[1].ToUpper()
$h.Path = $matches[2]
}
}
$a += New-Object PSCustomObject -Property $h
}
$a | Export-Csv c:\temp\Whatever.csv -NoTypeInfo
The input files look like this, but have 1000+ lines in them:
IF Ingroup ( "RPC3WIA01NT" )
use v: /del
ENDIF
IF Ingroup ( "JWA03KRONOSGLOBAL" )
use v:"\\$homesrvr\$dept"
ENDIF
IF Ingroup ( "P-USERS" )
use p:'\\PServer\PDRIVE
ENDIF
CSV file only shows:
GROUP
P-USERS
I want to ignore the drive letters with the /del.
I'm trying to get a CSV file that shows
Group              Drive  Path
JWA03KRONOSGLOBAL  V      \\$homesrvr\$dept
P-USERS            P      \\PServer\PDRIVE
Your code has two loops, one nested inside the other. The outer loop processes each file from the Get-ChildItem call; the inner loop processes the content of the current file. However, since you're creating your objects after the inner loop finishes, you only get the last result from each processed file. Move the object creation into the inner loop to get all results from all files.
I'd also recommend not re-using a hashtable. Re-using objects always bears the risk of having data carried over somewhere undesired. Hashtable creation is so inexpensive that running that risk is never worth it.
On top of that, your processing of the files' content is flawed: the inner loop processes the content one line at a time, but your two conditionals match on different lines and are not linked to each other. If you created a new object on every iteration, you would get incorrect results. Read each file as a whole and then use Select-String with a multiline regex to extract the desired information.
Another thing to avoid is appending to an array in a loop (that's a slow operation because it involves re-creating the array and copying elements over and over). Since you're using ForEach-Object you can pipe directly into Export-Csv.
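To illustrate the array-append problem (a sketch only; the loop body and numbers are arbitrary):

```powershell
# Slow: += allocates a new array and copies every element on each iteration
$a = @()
1..10000 | ForEach-Object { $a += $_ * 2 }

# Fast: let the pipeline collect the output into an array for you
$a = 1..10000 | ForEach-Object { $_ * 2 }
```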
Something like this should work:
$re = 'IF Ingroup\s+\(\s+"(.+?)"\s+\)\s+' +
"use\s+([a-z]):\s*[`"'](\\\\[^`"'\s]+)"
Get-ChildItem 'C:\Users\speterson\Documents\script\*.kiy' | ForEach-Object {
Get-Content $_.FullName |
Out-String |
Select-String $re -AllMatches |
Select-Object -Expand Matches |
ForEach-Object {
New-Object -Type PSObject -Property @{
'Group' = $_.Groups[1].Value
'DriveLetter' = $_.Groups[2].Value
'Path' = $_.Groups[3].Value
}
}
} | Export-Csv 'C:\path\to\output.csv' -NoType

Parsing Data in powershell, with the format of Label:Data

I am doing an Invoke-WebRequest in PowerShell to a URL that does not contain any HTML, just text. I need to pick out a specific part of this data, which is in the format Label:Data. Each piece of data is on its own separate line. I'm looking for some ideas on how to accomplish this. Here is a sample of the $Response.Content data below. I am looking to isolate the speed-over-ground:0.0
rate-of-turn:0.0
course-over-ground:293.0
speed-over-ground:0.0
heading-true:243.0
hdop:1.0
active-waypoint-name:
bearing-to-waypoint:
distance-to-waypoint:
cross-track-error:0
cross-track-error-limit:
cross-track-error-scale:0
lateral-speed-bow:0.09
lateral-speed-stern:-0.05
longitudinal-speed:-0.05
I guess it's a single string rather than an array of lines. So, split it into lines:
$Response.Content -split "`r?`n"
Find the one which says speed-over-ground
$line = $Response.Content -split "`r?`n" | Where-Object { $_ -match 'speed-over-ground' }
Split the text from the number, using the : separator, and take the second item, converted from text to a number if appropriate:
[decimal]$speedOverGround = $line.Split(':')[1]
Although, I might try to turn all of them into an object in a bulk transform. Complexity varies with the exact possible inputs, but this tries to convert numbers to numbers and leave empty ones as nulls:
$data = New-Object -TypeName PSCustomObject
$Response.Content -split "`r?`n" -replace ':\s*$', ':$null' |
ForEach-Object {
$name, $value = $_.Split(':').Trim()
$decimalValue = 0
if ([decimal]::TryParse($value, [ref]$decimalValue))
{
$value = $decimalValue
}
$data | Add-Member -NotePropertyName $name -NotePropertyValue $value
}
# Then you can do:
$data.'speed-over-ground'

Export data to CSV different row using regex

I used a regular expression to extract a string from a file and export it to CSV. I can't figure out how to extract each match value to a different row; the results all end up in a single cell:
{ 69630e4574ec6798, 78630e4574ec6798, 68630e4574ec6798}
I need it to be in different rows in CSV as below:
69630e4574ec6798
78630e4574ec6798
68630e4574ec6798
$Regex = [regex]"\s[a-f0-9]{16}"
Select-Object @{Name="Identity";Expression={$Regex.Matches($_.Textbody)}} |
Format-Table -Wrap |
Export-Csv -Path c:\temp\Inbox.csv -NoTypeInformation -Append
Edit:
I have been trying to split the data in my CSV, but I am having difficulty splitting the "id" output data onto separate rows, as the values all come in one cell: "{56415465456489944,564544654564654,46565465}".
In the screenshot below the first couple lines are the source input and the highlighted lines in the second group is the output that I am trying to get.
Change your regular expression so that it has the hexadecimal substrings in a capturing group (to exclude the leading whitespace):
$Regex = [regex]"\s([a-f0-9]{16})"
then extract the first group from each match:
$Regex.Matches($_.Textbody) | ForEach-Object {
$_.Groups[1].Value
} | Set-Content 'C:\temp\a.txt'
Use Set-Content rather than Out-File, because in Windows PowerShell the latter creates the output file in Unicode (UTF-16) format by default, whereas the former defaults to the system's ANSI encoding (both cmdlets allow overriding the default via their -Encoding parameter).
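If you need a specific encoding either way, both cmdlets accept -Encoding (the paths and encoding values here are just illustrations):

```powershell
# Explicit encodings override the differing defaults of the two cmdlets
'some text' | Set-Content 'C:\temp\a.txt' -Encoding UTF8
'some text' | Out-File    'C:\temp\b.txt' -Encoding ascii
```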
Edit:
To split the data from you id column and create individual rows for each ID you could do something like this:
Import-Csv 'C:\path\to\input.csv' | ForEach-Object {
$row = $_
$row.id -replace '[{}]' -split ',' | ForEach-Object {
$row | Select-Object -Property *,@{n='id';e={$_}} -ExcludeProperty id
}
} | Export-Csv 'C:\path\to\output.csv' -NoType

grep string between two other strings as delimiters

I have to do a report on how many times a certain CSS class appears in the content of our pages (over 10k pages). The trouble is, the header and footer contain that class, so a grep returns every single page.
So, how do I grep for content?
EDIT: I am looking for whether a page has list-unstyled between <main> and </main>
So do I use a regular expression for that grep? or do I need to use PowerShell to have more functionality?
I have grep at my disposal and PowerShell, but I could use a portable software if that is my only option.
Ideally, I would get a report (.txt or .csv) with pages and line numbers where the class shows up, but just a list of the pages themselves would suffice.
EDIT: Progress
I now have this in PowerShell
$files = get-childitem -recurse -path w:\test\york\ -Filter *.html
foreach ($file in $files)
{
$htmlfile=[System.IO.File]::ReadAllText($file.fullName)
$regex="(?m)<main([\w\W]*)</main>"
if ($htmlfile -match $regex) {
$middle=$matches[1]
[regex]::Matches($middle,"list-unstyled")
Write-Host $file.fullName has matches in the middle:
}
}
I run it with this command: .\FindStr.ps1 | Export-Csv C:\Tools\text.csv
It outputs the filename and path with the string to the console, but does not add anything to the CSV. How can I get that added in?
What Ansgar Wiechers' answer says is good advice: don't string-search HTML files. I don't have a problem with it myself, but it is worth noting that not all HTML files are the same, and regex searches can produce flawed results. If tools exist that are aware of the file's content structure, you should use them.
I would take a simple approach that reports every HTML file in a given directory with more occurrences of the text list-unstyled than expected. You expect there to be 2 (header and footer), so anything beyond that counts as a hit in the page content. I would have done a more complicated regex solution, but since you want the line numbers as well, I came up with this compromise.
$pattern = "list-unstyled"
Get-ChildItem C:\temp -Recurse -Filter *.html |
Select-String $pattern |
Group-Object Path |
Where-Object{$_.Count -gt 2} |
ForEach-Object{
$props = @{
File = $_.Group | Select-Object -First 1 -ExpandProperty Path
PatternFound = ($_.Group | Select-Object -ExpandProperty LineNumber) -join ";"
}
New-Object -TypeName PSCustomObject -Property $props
}
Select-String is a grep-like tool that can search files for strings. It reports the line number of each match, which is why we are using it here.
You should get output that looks like this on your PowerShell console.
File PatternFound
---- ------------
C:\temp\content.html 4;11;54
Here 4, 11, and 54 are the lines where the text was found. The code filters out results where the count of matching lines is less than 3, so if you expect the class once in the header and once in the footer, those results are excluded.
You can create a regexp suitable for multiline matching. The regexp "(?m)<!-- main content -->([\w\W]*)<!-- end content -->" matches multiline content delimited by your comments, with the (?m) part enabling the multiline option. The group ([\w\W]*) matches everything between your comments, and also lets you query $matches[1], which will contain your "main text" without headers and footers.
$htmlfile=[System.IO.File]::ReadAllText($fileToGrep)
$regex="(?m)<!-- main content -->([\w\W]*)<!-- end content -->"
if ($htmlfile -match $regex) {
$middle=$matches[1]
[regex]::Matches($middle,"list-unstyled")
}
This is only an example of how you should parse the file. You populate $fileToGrep with the name of the file you want to parse, then run this snippet to receive all the list-unstyled matches in the middle of that file.
Don't use string matches for something like this. Analyze the DOM instead. That should allow you to exclude headers and footers by selecting the appropriate root element.
$ie = New-Object -COM 'InternetExplorer.Application'
$url = '...'
$classname = 'list-unstyled'
$ie.Navigate($url)
do { Start-Sleep -Milliseconds 100 } until ($ie.ReadyState -eq 4)
$root = $ie.Document.getElementById('content-element-id')
$hits = $root.getElementsByTagName('*') | ? { $_.ClassName -eq $classname }
$hits.Count # number of occurrences of $classname below content element

Remove Extra Lines from CSV

I have a CSV file from which I am trying to remove an unknown number of extra lines from the top, as well as lines in the middle of the file that repeat the header (SourceIP, DestinationIP, etc.).
I tried the following:
$m = gc D:\Script\textfile.txt
Select-String D:\Script\my.csv -pattern $m -Match
And textfile.text has
*.*.*.*
But I get this error:
Select-String : A parameter cannot be found that matches parameter name 'Match'.
How do I even match the strings I want (or don't want), because I'd like the resulting CSV to be
Use Import-Csv cmdlet:
Import-Csv YourFileLocation -Header SourceIP, DestinationIP, Application |
where {$_.SourceIP -match "^[0-9]+"} | Export-Csv OutputFile.csv
It allows you to set custom header names, and then you can do a regex search against the SourceIP header and take only the rows that start with a digit. Once that's done, Export-Csv writes it out.
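A sketch of that idea extended to also drop repeated header rows mid-file (the file paths and the literal header text are assumptions based on the question):

```powershell
# Keep only records whose SourceIP field starts with a digit and is not
# the literal header text repeated in the middle of the file
Import-Csv 'D:\Script\my.csv' -Header SourceIP, DestinationIP, Application |
    Where-Object { $_.SourceIP -ne 'SourceIP' -and $_.SourceIP -match '^[0-9]+' } |
    Export-Csv 'D:\Script\clean.csv' -NoTypeInformation
```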