Regex for searching SSNs in Excel files

I've been given the task of searching for SSNs (and other PII, so we can remove it) across our entire file structure. Fun, I know. So far this script searches through all .xlsx files in a given directory, but no matter what I try, I cannot for the life of me get the $SearchText variable to work. I have tried many variations of the regex currently displayed; the only search string that works is straight question marks, "???????????", but that returns entries I'm not looking for.
Any help would be very much appreciated.
Thanks!
$SourceLocation = "C:\Users\nick\Documents\ScriptingTest"
$SearchText2 = "^(?!(000|666|9))\d{3}-(?!00)\d{2}-(?!0000)\d{4}$"
$SearchText = "*"
$FileNames = Get-ChildItem -Path $SourceLocation -Recurse -Include *.xlsx
Function Search-Excel {
$Excel = New-Object -ComObject Excel.Application
$Workbook = $Excel.Workbooks.Open($File)
ForEach ($Worksheet in @($Workbook.Sheets)) {
$Found = $WorkSheet.Cells.Find($SearchText)
If ($Found.Text -match "SearchText2") {
$BeginAddress = $Found.Address(0,0,1,1)
[pscustomobject]@{
WorkSheet = $Worksheet.Name
Column = $Found.Column
Row =$Found.Row
Text = $Found.Text
Address = $File
}
Do {
$Found = $WorkSheet.Cells.FindNext($Found)
$Address = $Found.Address(0,0,1,1)
If ($Address -eq $BeginAddress) {
BREAK
}
[pscustomobject]@{
WorkSheet = $Worksheet.Name
Column = $Found.Column
Row =$Found.Row
Text = $Found.Text
Address = $File
}
} Until ($False)
}
}
}
$workbook.close($false)
[void][System.Runtime.InteropServices.Marshal]::ReleaseComObject([System.__ComObject]$excel)
[gc]::Collect()
[gc]::WaitForPendingFinalizers()
Remove-Variable excel -ErrorAction SilentlyContinue
foreach ($File in $FileNames)
{
Search-Excel
}
EDIT: Turns out Excel accepts only a very limited range of regex-like patterns (see "Acceptable Excel Regex"), so I changed the first $SearchText variable to just "*" and changed the first If statement to match the regex outside of Excel's search. Now I just need to come up with a crafty regex pattern to filter what I want. The next problem is filtering for:
No letters.
Valid SSNs with dashes.
Valid SSNs without dashes. (This part is stumping me: how do I search for something that can have dashes but, if it doesn't, must be exactly 9 characters long?)

It definitely doesn't appear to be regex, but this did work with the dashes. The issues I see with your code are:
You define the search text outside of the function and don't pass it in.
Same with the file names.
Your workbook close, COM release, GC, etc. are outside of your function, so they won't do anything (except maybe error?).
Here is what I got to work with your code. If other text matches the pattern of 3 chars, dash, 2 chars, dash, 4 chars, you can easily filter it out afterwards with regex or whatever you like.
$SourceLocation = "C:\Users\nick\Documents\ScriptingTest"
$SearchText = "???-??-????"
$FileNames = Get-ChildItem -Path $SourceLocation -Recurse -Include *.xlsx
Function Search-Excel {
[cmdletbinding()]
Param($File,$SearchText)
$Excel = New-Object -ComObject Excel.Application
$Workbook = $Excel.Workbooks.Open($File)
ForEach ($Worksheet in @($Workbook.Sheets)) {
$Found = $WorkSheet.Cells.Find($SearchText)
If ($Found) {
$BeginAddress = $Found.Address(0,0,1,1)
[pscustomobject]@{
WorkSheet = $Worksheet.Name
Column = $Found.Column
Row =$Found.Row
Text = $Found.Text
Address = $File
}
Do {
$Found = $WorkSheet.Cells.FindNext($Found)
$Address = $Found.Address(0,0,1,1)
If ($Address -eq $BeginAddress) {
BREAK
}
[pscustomobject]@{
WorkSheet = $Worksheet.Name
Column = $Found.Column
Row =$Found.Row
Text = $Found.Text
Address = $File
}
} Until ($False)
}
}
$workbook.close($false)
[void][System.Runtime.InteropServices.Marshal]::ReleaseComObject([System.__ComObject]$excel)
[gc]::Collect()
[gc]::WaitForPendingFinalizers()
Remove-Variable excel -ErrorAction SilentlyContinue
}
foreach ($File in $FileNames)
{
Write-Host processing $File.fullname
Search-Excel -File $File.fullname -SearchText $SearchText
}
Output from test file
WorkSheet : Sheet1
Column : 2
Row : 5
Text : 123-12-5555
Address : C:\temp\excel2.xlsx
WorkSheet : Sheet1
Column : 3
Row : 21
Text : 586-99-3844
Address : C:\temp\excel2.xlsx
WorkSheet : Sheet1
Column : 7
Row : 28
Text : 987-65-4321
Address : C:\temp\excel2.xlsx
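The "filter afterwards" step could look something like the sketch below. It is only a sketch under assumptions: the pattern is built from the rules in the question's first regex (no 000, 666 or 9xx area, no 00 group, no 0000 serial) and extended with a backreference so the two separators must either both be dashes or both be absent. Test-SsnLike is a hypothetical helper name, not part of the original script.

```powershell
# Hypothetical helper: returns $true when the text looks like a valid SSN,
# either dashed (123-45-6789) or as 9 bare digits (123456789).
function Test-SsnLike {
    param([string]$Text)

    # (-?) captures the first separator (a dash or nothing); \1 forces the
    # second separator to be identical, so "123-456789" is rejected.
    # \d plus the ^ and $ anchors also take care of the "no letters" rule.
    $pattern = '^(?!000|666|9)\d{3}(-?)(?!00)\d{2}\1(?!0000)\d{4}$'
    return $Text -match $pattern
}

# Sample cell texts, e.g. the Text property of the objects Search-Excel emits.
$samples = '123-12-5555', '123125555', '123-125555', '000-12-3456', 'abc-de-fghi', '987-65-4321'

# Keep only the values that pass the stricter regex check.
$samples | Where-Object { Test-SsnLike $_ }
```

With the real script this could be applied as Search-Excel ... | Where-Object { Test-SsnLike $_.Text }. Note that a value such as 987-65-4321 is found by the "???-??-????" wildcard but rejected here, because the question's own rule excludes area numbers starting with 9.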

Related

Find Pattern (Not exact string) in .XLSX with Powershell

I can find exact strings, but I can't seem to find the correct function or syntax to find a pattern, for example [0-9], in an .xlsx. I can find that exact string but not matches for the pattern, which is supposed to be just a digit between 0 and 9. I believe the reason is that I am using the Find function, which matches exact strings. I know it is possible to find a pattern, but I just can't seem to figure it out. I have to open the files through Excel because my initial Get-ChildItem script does not work with .xlsx files. Below is the code. Any help or ideas will be greatly appreciated. I have put 3 asterisks around the line where I think the issue is, but I just can't see what the solution is.
$SearchText = '[0-9]'
$path = "C:\users\username\desktop"
$output = "c:\users\username\desktop\results.txt"
$files = Get-Childitem $path -Include *.xlsx, *.xlsm, *.xlsb -Recurse
Function Search-Excel {
$Excel = New-Object -ComObject Excel.Application
ForEach($file in $files)
{
$Workbook = $Excel.Workbooks.Open($file)
ForEach ($Worksheet in @($Workbook.Sheets)) {
***$Found = $WorkSheet.Cells.Find($SearchText)***
If ($Found) {
$BeginAddress = $Found.Address(0,0,1,1)
[pscustomobject]@{
FilePath = $Workbook.Path
FileName = $Workbook.Name
WorkSheet = $Worksheet.Name
Column = $Found.Column
Row = $Found.Row
Text = $Found.Text
Address = $BeginAddress
}
Do {
$Found = $WorkSheet.Cells.FindNext($Found)
$Address = $Found.Address(0,0,1,1)
If ($Address -eq $BeginAddress) {
BREAK
}
[pscustomobject]@{
FilePath = $Workbook.Path
FileName = $Workbook.Name
WorkSheet = $Worksheet.Name
Column = $Found.Column
Row = $Found.Row
Text = $Found.Text
Address = $Address
}
} Until ($False)
$workbook.Close($false) }
Else {
Write-Warning "[$($WorkSheet.Name)] Nothing Found!"
}}
}
[void][System.Runtime.InteropServices.Marshal]::ReleaseComObject([System.__ComObject]$excel)
[gc]::Collect()
[gc]::WaitForPendingFinalizers()
Remove-Variable excel -ErrorAction SilentlyContinue
}
Search-Excel | Out-File $output -Append
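One possible way around this, sketched under assumptions rather than offered as a definitive fix: Excel's Find only understands its own wildcards, so the regex test can be done in PowerShell instead by walking the used cells of each worksheet and applying -match to each cell's displayed text. Search-ExcelRegex is a hypothetical function name; the sketch needs Excel installed and will be slow on very large sheets because it visits every used cell.

```powershell
# Sketch: scan every used cell and apply a .NET regex to its displayed text.
function Search-ExcelRegex {
    param(
        [Parameter(Mandatory)][string]$Path,     # full path to one workbook
        [Parameter(Mandatory)][string]$Pattern   # .NET regex, e.g. '[0-9]'
    )

    $Excel = New-Object -ComObject Excel.Application
    $Excel.Visible = $false
    $Excel.DisplayAlerts = $false
    try {
        # Arguments: file name, UpdateLinks = 0, ReadOnly = $true
        $Workbook = $Excel.Workbooks.Open($Path, 0, $true)
        foreach ($Worksheet in @($Workbook.Worksheets)) {
            foreach ($Cell in $Worksheet.UsedRange.Cells) {
                # The regex is evaluated by PowerShell, not by Excel's Find.
                if ($Cell.Text -match $Pattern) {
                    [pscustomobject]@{
                        FilePath  = $Workbook.Path
                        FileName  = $Workbook.Name
                        WorkSheet = $Worksheet.Name
                        Column    = $Cell.Column
                        Row       = $Cell.Row
                        Text      = $Cell.Text
                        Address   = $Cell.Address($false, $false)
                    }
                }
            }
        }
        $Workbook.Close($false)
    }
    finally {
        # Always shut Excel down, even if opening or scanning failed.
        $Excel.Quit()
        [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($Excel)
    }
}
```

It could be called once per file, for example: $files | ForEach-Object { Search-ExcelRegex -Path $_.FullName -Pattern $SearchText } | Out-File $output -Append. Starting Excel once per file keeps the sketch simple; reusing one Excel instance across files would be faster.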

Skip Header Row in a High Performance Powershell Regex Script Block

I received some amazing help from Stack Overflow ... however ... it was so amazing that I need a little more help to get closer to the finish line. I'm parsing multiple enormous 4GB files twice per month. I need to be able to skip the header and count the total lines, the matched lines, and the unmatched lines. I'm sure this is super-simple for a PowerShell superstar, but at my newbie PS level my skills are not yet strong. Perhaps a little help from you would save the week. :)
Data Sample:
ID FIRST_NAME LAST_NAME COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE MOUSE COLUMN VALUE LONGSTARTS
10000000002MICKLE ROONEY MOUSE COLUMN VALUE LONGSTARTS
Code Block (based on this answer):
#$match_regex matches each fixed length field by length; the () specifies that each matched field be stored in a capture group:
[regex]$match_regex = '^(.{10})(.{50})(.{50})(.{50})(.{50})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{4})(.{25})(.{2})(.{10})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{10})(.{10})(.{10})(.{2})(.{10})(.{50})(.{50})(.{50})(.{50})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{4})(.{2})(.{4})(.{10})(.{38})(.{38})(.{15})(.{1})(.{10})(.{2})(.{10})(.{10})(.{10})(.{10})(.{38})(.{38})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})$'
Measure-Command {
& {
switch -File $infile -Regex {
$match_regex {
# Join what all the capture groups matched with a tab char.
$Matches[1..($Matches.Count-1)].Trim() -join "`t"
}
}
} | Out-File $outFile
}
You only need to keep track of two counts (matched and unmatched lines) and a Boolean to indicate whether you've already passed the first line:
$first = $false
$matched = 0
$unmatched = 0
. {
switch -File $infile -Regex {
$match_regex {
if($first){
# Join what all the capture groups matched with a tab char.
$Matches[1..($Matches.Count-1)].Trim() -join "`t"
$matched++
}
$first = $true
}
default{
$unmatched++
# you can remove this, if the pattern always matches the header
$first = $true
}
}
} | Out-File $outFile
$total = $matched + $unmatched
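To see the counters in action, here is a small self-contained sketch of the same idea. The sample file, the two-field pattern, and the variable name $headerSeen are assumptions made for the demo; the real pattern is of course much longer.

```powershell
# Build a tiny fixed-width sample file: header, two good rows, one short row.
$infile = Join-Path ([System.IO.Path]::GetTempPath()) 'fixed_width_demo.txt'
Set-Content -Path $infile -Value @(
    'ID   NAME '
    '00001MINNI'
    'bad'
    '00002MICKE'
)

# Two 5-character fields; a stand-in for the real, much longer pattern.
[regex]$match_regex = '^(.{5})(.{5})$'

$headerSeen = $false
$matched    = 0
$unmatched  = 0

# Dot-sourcing (. { }) runs the block in the current scope, so the counter
# increments are still visible afterwards; & { } would use a child scope.
$rows = . {
    switch -File $infile -Regex {
        $match_regex {
            if ($headerSeen) {
                # Join what the capture groups matched with a tab char.
                $Matches[1..($Matches.Count - 1)].Trim() -join "`t"
                $matched++
            }
            $headerSeen = $true
        }
        default {
            $unmatched++
            $headerSeen = $true
        }
    }
}

$total = $matched + $unmatched
"Matched: $matched  Unmatched: $unmatched  Total (without header): $total"
Remove-Item $infile
```

The switch from & { } in the question to . { } in the answer matters: with & { } the $matched++ and $unmatched++ statements would update copies in a child scope and the outer counters would stay at 0.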
Using System.IO.StreamReader reduced the processing time to about 20% of what it had been. This was absolutely needed for my requirement.
I added logic and counters without sacrificing much on performance. The field counter and row by row comparison is particularly helpful in finding bad records.
This is a copy/paste of actual code but I shortened some things, made some things slightly pseudo code, so you may have to play with it to get things working just so for yourself.
Function Get-Regx-Data-Format() {
Param ([String] $filename)
if ($filename -eq 'FILE NAME') {
[regex]$match_regex = '^(.{10})(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{1})'
}
return $match_regex
}
Foreach ($file in $cutoff_files) {
$starttime_for_file = (Get-Date)
$source_file = $file + '_' + $proc_yyyymm + $source_file_suffix
$source_path = $source_dir + $source_file
$parse_file = $file + '_' + $proc_yyyymm + '_load' +$parse_target_suffix
$parse_file_path = $parse_target_dir + $parse_file
$error_file = $file + '_err_' + $proc_yyyymm + $error_target_suffix
$error_file_path = $error_target_dir + $error_file
[regex]$match_data_regex = Get-Regx-Data-Format $file
Remove-Item -path "$parse_file_path" -Force -ErrorAction SilentlyContinue
Remove-Item -path "$error_file_path" -Force -ErrorAction SilentlyContinue
[long]$matched_cnt = 0
[long]$unmatched_cnt = 0
[long]$loop_counter = 0
[boolean]$has_header_row=$true
[int]$field_cnt=0
[int]$previous_field_cnt=0
[int]$array_length=0
$parse_minutes = Measure-Command {
try {
$stream_log = [System.IO.StreamReader]::new($source_path)
$stream_in = [System.IO.StreamReader]::new($source_path)
$stream_out = [System.IO.StreamWriter]::new($parse_file_path)
$stream_err = [System.IO.StreamWriter]::new($error_file_path)
while ($null -ne ($line = $stream_in.ReadLine())) { # compare to $null so an empty line does not end the loop early
if ($line -match $match_data_regex) {
#if matched and it's the header, parse and write to the beg of output file
if (($loop_counter -eq 0) -and $has_header_row) {
$stream_out.WriteLine(($Matches[1..($array_length)].Trim() -join "`t"))
} else {
$previous_field_cnt = $field_cnt
#add year month to line start, trim and join every captured field w/tabs
$stream_out.WriteLine("$proc_yyyymm`t" + `
($Matches[1..($array_length)].Trim() -join "`t"))
$matched_cnt++
$field_cnt=$Matches.Count
if (($previous_field_cnt -ne $field_cnt) -and $loop_counter -gt 1) {
write-host "`nError on line $($loop_counter + 1). `
The field count does not match the previous correctly `
formatted (non-error) row."
}
}
} else {
if (($loop_counter -eq 0) -and $has_header_row) {
#if the header, write to the beginning of the output file
$stream_out.WriteLine($line)
} else {
$stream_err.WriteLine($line)
$unmatched_cnt++
}
}
$loop_counter++
}
} finally {
$stream_in.Dispose()
$stream_out.Dispose()
$stream_err.Dispose()
$stream_log.Dispose()
}
} | Select-Object -Property TotalMinutes
write-host "`n$file_list_idx. File $file parsing results....`nMatched Count =
$matched_cnt UnMatched Count = $unmatched_cnt Parse Minutes = $parse_minutes`n"
$file_list_idx++
$endtime_for_file = (Get-Date)
write-host "`nEnded processing file at $endtime_for_file"
$TimeDiff_for_file = (New-TimeSpan $starttime_for_file $endtime_for_file)
$Hrs_for_file = $TimeDiff_for_file.Hours
$Mins_for_file = $TimeDiff_for_file.Minutes
$Secs_for_file = $TimeDiff_for_file.Seconds
write-host "`nElapsed Time for file $file processing:
$Hrs_for_file`:$Mins_for_file`:$Secs_for_file"
}
$endtime = (Get-Date -format "HH:mm:ss")
$TimeDiff = (New-TimeSpan $starttime $endtime)
$Hrs = $TimeDiff.Hours
$Mins = $TimeDiff.Minutes
$Secs = $TimeDiff.Seconds
write-host "`nTotal Elapsed Time: $Hrs`:$Mins`:$Secs"

Matching Something Against Array List Using Where Object

I've found multiple examples of what I'm trying here, but for some reason it's not working.
I have a list of regular expressions that I'm checking against a single value and I can't seem to get a match.
I'm attempting to match domains. e.g. gmail.com, yahoo.com, live.com, etc.
I am importing a csv to get the domains and have debugged this code to make sure the values are what I expect. e.g. "gmail.com"
Regular expression examples AKA $FinalWhiteListArray
(?i)gmail\.com
(?i)yahoo\.com
(?i)live\.com
Code
Function CheckDirectoryForCSVFilesToSearch {
$global:CSVFiles = Get-ChildItem $Global:Directory -recurse -Include *.csv | % {$_.FullName} #removed -recurse
}
Function ImportCSVReports {
Foreach ($CurrentChangeReport in $global:CSVFiles) {
$global:ImportedChangeReport = Import-csv $CurrentChangeReport
}
}
Function CreateWhiteListArrayNOREGEX {
$Global:FinalWhiteListArray = New-Object System.Collections.ArrayList
$WhiteListPath = $Global:ScriptRootDir + "\" + "WhiteList.txt"
$Global:FinalWhiteListArray= Get-Content $WhiteListPath
}
$Global:ScriptRootDir = Split-Path -Path $psISE.CurrentFile.FullPath
$Global:Directory = $Global:ScriptRootDir + "\" + "Reports to Search" + "\" #Where to search for CSV files
CheckDirectoryForCSVFilesToSearch
ImportCSVReports
CreateWhiteListArrayNOREGEX
Foreach ($Global:Change in $global:ImportedChangeReport){
If (-not ([string]::IsNullOrEmpty($Global:Change.Previous_Provider_Contact_Email))){
$pos = $Global:Change.Provider_Contact_Email.IndexOf("@")
$leftPart = $Global:Change.Provider_Contact_Email.Substring(0, $pos)
$Global:Domain = $Global:Change.Provider_Contact_Email.Substring($pos+1)
$results = $Global:FinalWhiteListArray | Where-Object { $_ -match $global:Domain}
}
}
Thanks in advance for any help with this.
the problem with your current code is that you put the regex on the left side of the -match operator. [grin] swap that and your code otta work.
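A minimal sketch of the swapped comparison, using the whitelist entries from the question and a made-up sample domain (plain variables stand in for the question's $Global: ones):

```powershell
# Whitelist patterns as in the question; $Domain is a sample value.
$FinalWhiteListArray = '(?i)gmail\.com', '(?i)yahoo\.com', '(?i)live\.com'
$Domain = 'GMail.com'

# -match expects the string to test on the left and the regex on the right.
$results = $FinalWhiteListArray | Where-Object { $Domain -match $_ }
$results
```

With the original order ($_ -match $Domain) the domain is treated as the pattern and the whitelist entry as the text to search, so nothing comes back.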
taking into account what LotPings pointed out about case sensitivity and using a regex OR symbol to make one test per URL, here's a demo of some of that. the \b is for word boundaries, the | is the regex OR symbol. the $RegexURL_WhiteList section builds that regex pattern from the 1st array. if i haven't made something clear, please ask ...
$URL_WhiteList = @(
'gmail.com'
'yahoo.com'
'live.com'
)
$RegexURL_WhiteList = -join @('\b' ,(@($URL_WhiteList |
ForEach-Object {
[regex]::Escape($_)
}) -join '|\b'))
$NeedFiltering = @(
'example.com/this/that'
'GMail.com'
'gmailstuff.org/NothingElse'
'NotReallyYahoo.com'
'www.yahoo.com'
'SomewhereFarAway.net/maybe/not/yet'
'live.net'
'Live.com/other/another'
)
foreach ($NF_Item in $NeedFiltering)
{
if ($NF_Item -match $RegexURL_WhiteList)
{
'[ {0} ] matched one of the test URLs.' -f $NF_Item
}
}
output ...
[ GMail.com ] matched one of the test URLs.
[ www.yahoo.com ] matched one of the test URLs.
[ Live.com/other/another ] matched one of the test URLs.

PowerShell to use regex to find character position

I want to add " after the third comma and " before the fifth comma. How can this be done in PowerShell?
My idea is to use a regex function to find the positions of the third and fifth commas, then add " at those positions with something like
$s.Insert(4,'-') (in case the regex returns position 4)
example data
04642583,3,HC Mobile,O213,Inc,SIS Services,KR,Non Payroll Relevant,KR50
Output
04642583,3,HC Mobile,"O213,Inc",SIS Services,KR,Non Payroll Relevant,KR50
This is the code I tried, but it failed with 'An empty pipe element is not allowed'. How do I fix it?
$source = "D:\Output\MoreComma.csv"
$FinalFile = "D:\Output\MoreComma_Corrected.csv"
$content = Get-Content $source
foreach ($line in $content)
{
$items = $line.split(',');
$items[3] = '"'+$items[3]
$items[4] = $items[4]+'"';
$items -join ','
} | Set-Content $FinalFile
If you know the format (e.g. you know that it's always comma-separated in this fashion) and you're only trying to achieve this, you can simply split the line, add the quotes, and join the line again.
Example:
$data = "04642583,3,HC Mobile,O213,Inc,SIS Services,KR,Non Payroll Relevant,KR50";
$items = $data.split(',');
$items[3] = '"'+$items[3]
$items[4] = $items[4]+'"';
$items -join ','
This will produce the line:
04642583,3,HC Mobile,"O213,Inc",SIS Services,KR,Non Payroll Relevant,KR50
Given you've stored this in a CSV file:
$file = "C:\tmp\test.csv";
$lines = (get-content $file);
$newLines=($lines|foreach-object {
$items = $_.split(',');
$items[3] = '"'+$items[3]
$items[4] = $items[4]+'"';
$items -join ','
})
You can then output the result in a new file if you want
$newLines|Set-content C:\tmp\test2.csv
This will "mess up" your CSV file though (the quoted part will be treated as one column, effectively merging two columns), but I'm guessing this is what you're trying to achieve?
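Since the question asked about regex, here is a sketch of the same transformation done with -replace instead of split/join. The pattern is an assumption based on the sample line (no quoted fields or embedded commas in the first three columns), and $lines stands in for the content read from the file.

```powershell
# Sample data from the question; in practice this would come from Get-Content.
$lines = @(
    '04642583,3,HC Mobile,O213,Inc,SIS Services,KR,Non Payroll Relevant,KR50'
)

# Group 1: the first three fields, each with its trailing comma.
# Group 2: the fourth field, the comma after it, and the fifth field.
$pattern = '^((?:[^,]*,){3})([^,]*,[^,]*)'

# Re-emit group 1 unchanged and wrap group 2 in double quotes.
$fixed = $lines | ForEach-Object { $_ -replace $pattern, '$1"$2"' }
$fixed
```

Because ForEach-Object is a cmdlet, its output can be piped straight on, e.g. ... | Set-Content $FinalFile. The error in the question comes from piping directly out of a foreach statement, which is a language statement rather than a pipeline command.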

Get index of regex in filename in powershell

I'm trying to get the starting position for a regexmatch in a folder name.
dir c:\test | where {$_.fullname.psiscontainer} | foreach {
$indexx = $_.fullname.Indexofany("[Ss]+[0-9]+[0-9]+[Ee]+[0-9]+[0-9]")
$thingsbeforeregexmatch.substring(0,$indexx)
}
Ideally, this should work but since indexofany doesn't handle regex like that I'm stuck.
You can use the Regex.Match() method to perform a regex match. It'll return a Match object that has an Index property you can use:
Get-ChildItem c:\test | Where-Object {$_.PSIsContainer} | ForEach-Object {
# Test if folder's Name matches pattern
$match = [regex]::Match($_.Name, '[Ss]+[0-9]+[0-9]+[Ee]+[0-9]+[0-9]')
if($match.Success)
{
# Grab Index from the [regex]::Match() result
$Index = $Match.Index
# Substring using the index we obtained above
$ThingsBeforeMatch = $_.Name.Substring(0, $Index)
Write-Host $ThingsBeforeMatch
}
}
Alternatively, use the -match operator and the $Matches variable to grab the matched string and use that as an argument to IndexOf() (using RedLaser's sweet regex optimization):
if($_.Name -match 's+\d{2,}e+\d{2,}')
{
$Index = $_.Name.IndexOf($Matches[0])
$ThingsBeforeMatch = $_.Name.Substring(0,$Index)
}
You can use the Index property of the Match object. Example:
# Used regex from @RedLaser's comment
$regEx = [regex]'(?i)[s]+\d{2}[e]+\d{2}'
$testString = 'abcS00E00b'
$match = $regEx.Match($testString)
if ($match.Success)
{
$startingIndex = $match.Index
Write-Host "Match. Start index = $startingIndex"
}
else
{
Write-Host 'No match found'
}