Find Pattern (Not exact string) in .XLSX with Powershell - regex

I can find exact strings but I can't seem to find the correct function or syntax to find a pattern, for example [0-9] in an .xlsx. I can find that exact string but not matches for that pattern, which is supposed to be just a digit between 0 and 9. The reason for this is because I am using the Find function and that matches exact strings. I know this is possible to find a pattern but just cant seem to figure it out. I have to call the open of Excel due to my initial Get-ChildItem script does not work with .xlsx files. Below is the code. Any help or ideas will be greatly appreciated. I have put 3 asterisks where I think the issue is but I just can't see what the solution is.
$SearchText = '[0-9]'
$path = "C:\users\username\desktop"
$output = "c:\users\username\desktop\results.txt"
$files = Get-Childitem $path -Include *.xlsx, *.xlsm, *.xlsb -Recurse
Function Search-Excel {
$Excel = New-Object -ComObject Excel.Application
ForEach($file in $files)
{
$Workbook = $Excel.Workbooks.Open($file)
ForEach ($Worksheet in #($Workbook.Sheets)) {
***$Found = $WorkSheet.Cells.Find($SearchText)***
If ($Found) {
$BeginAddress = $Found.Address(0,0,1,1)
[pscustomobject]#{
FilePath = $Workbook.Path
FileName = $Workbook.Name
WorkSheet = $Worksheet.Name
Column = $Found.Column
Row = $Found.Row
Text = $Found.Text
Address = $BeginAddress
}
Do {
$Found = $WorkSheet.Cells.FindNext($Found)
$Address = $Found.Address(0,0,1,1)
If ($Address -eq $BeginAddress) {
BREAK
}
[pscustomobject]#{
FilePath = $Workbook.Path
FileName = $Workbook.Name
WorkSheet = $Worksheet.Name
Column = $Found.Column
Row = $Found.Row
Text = $Found.Text
Address = $Address
}
} Until ($False)
$workbook.Close($false) }
Else {
Write-Warning "[$($WorkSheet.Name)] Nothing Found!"
}}
}
[void][System.Runtime.InteropServices.Marshal]::ReleaseComObject([System.__ComObject]$excel)
[gc]::Collect()
[gc]::WaitForPendingFinalizers()
Remove-Variable excel -ErrorAction SilentlyContinue
}
Search-Excel | Out-File $output -Append

Related

Regex for searching SSNs in excel files

I've been given the task of searching for SSNs (and other PII so we can remove it) in our entire file structure, fun I know. So far this script will search thru all .xlsx files in a given directory, but no matter what I try, I cannot for the life of me get the $SearchText variable to work. I have tried so many different deviations of the regex currently displayed, the only regex string that works is straight question marks; "???????????", but that returns entires I'm not looking for.
Any help would be very much appreciated.
Thanks!
$SourceLocation = "C:\Users\nick\Documents\ScriptingTest"
$SearchText2 = "^(?!(000|666|9))\d{3}-(?!00)\d{2}-(?!0000)\d{4}$"
$SearchText = "*"
$FileNames = Get-ChildItem -Path $SourceLocation -Recurse -Include *.xlsx
Function Search-Excel {
$Excel = New-Object -ComObject Excel.Application
$Workbook = $Excel.Workbooks.Open($File)
ForEach ($Worksheet in #($Workbook.Sheets)) {
$Found = $WorkSheet.Cells.Find($SearchText)
If ($Found.Text -match "SearchText2") {
$BeginAddress = $Found.Address(0,0,1,1)
[pscustomobject]#{
WorkSheet = $Worksheet.Name
Column = $Found.Column
Row =$Found.Row
Text = $Found.Text
Address = $File
}
Do {
$Found = $WorkSheet.Cells.FindNext($Found)
$Address = $Found.Address(0,0,1,1)
If ($Address -eq $BeginAddress) {
BREAK
}
[pscustomobject]#{
WorkSheet = $Worksheet.Name
Column = $Found.Column
Row =$Found.Row
Text = $Found.Text
Address = $File
}
} Until ($False)
}
}
}
$workbook.close($false)
[void][System.Runtime.InteropServices.Marshal]::ReleaseComObject([System.__ComObject]$excel)
[gc]::Collect()
[gc]::WaitForPendingFinalizers()
Remove-Variable excel -ErrorAction SilentlyContinue
foreach ($File in $FileNames)
{
Search-Excel
}
EDIT: Turns out excel has a very limited range of acceptable regex: Acceptable Excel Regex,
so I modified the first $Searchtext viarable to just be "*", and the first if statement to match regex outside of excel's search. Now I just need to come up with a crafty regex pattern to filter what I want. The next problem is filtering:
No letters.
Valid SSNs with dashes.
Valid SSNs without dashes. (this part is stumping me, how to search for something that can have dashes, but if it doesn't, it can only be 9 characters long)
It definitely doesn't appear to be regex, but this did work with the dashes. The issues I see with your code is
You define the searchtext outside of the function and don't pass it in
Same with the file names
Your workbook close, com release, gc, etc is outside of your function, so it won't do anything. (except maybe error?)
Here is what I got to work with your code. Now if you have other text that matches the pattern of 3 chars dash 2 chars dash 4 chars, you can easily filter those out afterwards with regex or whatever you like.
$SourceLocation = "C:\Users\nick\Documents\ScriptingTest"
$SearchText = "???-??-????"
$FileNames = Get-ChildItem -Path $SourceLocation -Recurse -Include *.xlsx
Function Search-Excel {
[cmdletbinding()]
Param($File,$SearchText)
$Excel = New-Object -ComObject Excel.Application
$Workbook = $Excel.Workbooks.Open($File)
ForEach ($Worksheet in #($Workbook.Sheets)) {
$Found = $WorkSheet.Cells.Find($SearchText)
If ($Found) {
$BeginAddress = $Found.Address(0,0,1,1)
[pscustomobject]#{
WorkSheet = $Worksheet.Name
Column = $Found.Column
Row =$Found.Row
Text = $Found.Text
Address = $File
}
Do {
$Found = $WorkSheet.Cells.FindNext($Found)
$Address = $Found.Address(0,0,1,1)
If ($Address -eq $BeginAddress) {
BREAK
}
[pscustomobject]#{
WorkSheet = $Worksheet.Name
Column = $Found.Column
Row =$Found.Row
Text = $Found.Text
Address = $File
}
} Until ($False)
}
}
$workbook.close($false)
[void][System.Runtime.InteropServices.Marshal]::ReleaseComObject([System.__ComObject]$excel)
[gc]::Collect()
[gc]::WaitForPendingFinalizers()
Remove-Variable excel -ErrorAction SilentlyContinue
}
foreach ($File in $FileNames)
{
Write-Host processing $File.fullname
Search-Excel -File $File.fullname -SearchText $SearchText
}
Output from test file
WorkSheet : Sheet1
Column : 2
Row : 5
Text : 123-12-5555
Address : C:\temp\excel2.xlsx
WorkSheet : Sheet1
Column : 3
Row : 21
Text : 586-99-3844
Address : C:\temp\excel2.xlsx
WorkSheet : Sheet1
Column : 7
Row : 28
Text : 987-65-4321
Address : C:\temp\excel2.xlsx

Skip Header Row in a High Performance Powershell Regex Script Block

I received some amazing help from Stack Overflow ... however ... it was so amazing I need a little more help to get to closer to the finish line. I'm parsing multiple enormous 4GB files 2X per month. I need be able to be able to skip the header, count the total lines, matched lines, and the not matched lines. I'm sure this is super-simple for a PowerShell superstar, but at my newbie PS level my skills are not yet strong. Perhaps a little help from you would save the week. :)
Data Sample:
ID FIRST_NAME LAST_NAME COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE MOUSE COLUMN VALUE LONGSTARTS
10000000002MICKLE ROONEY MOUSE COLUMN VALUE LONGSTARTS
Code Block (based on this answer):
#$match_regex matches each fixed length field by length; the () specifies that each matched field be stored in a capture group:
[regex]$match_regex = '^(.{10})(.{50})(.{50})(.{50})(.{50})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{4})(.{25})(.{2})(.{10})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{10})(.{10})(.{10})(.{2})(.{10})(.{50})(.{50})(.{50})(.{50})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{4})(.{2})(.{4})(.{10})(.{38})(.{38})(.{15})(.{1})(.{10})(.{2})(.{10})(.{10})(.{10})(.{10})(.{38})(.{38})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})$'
Measure-Command {
& {
switch -File $infile -Regex {
$match_regex {
# Join what all the capture groups matched with a tab char.
$Matches[1..($Matches.Count-1)].Trim() -join "`t"
}
}
} | Out-File $outFile
}
You only need to keep track of two counts - matched, and unmatched lines - and then a Boolean to indicate whether you've skipped the first line
$first = $false
$matched = 0
$unmatched = 0
. {
switch -File $infile -Regex {
$match_regex {
if($first){
# Join what all the capture groups matched with a tab char.
$Matches[1..($Matches.Count-1)].Trim() -join "`t"
$matched++
}
$first = $true
}
default{
$unmatched++
# you can remove this, if the pattern always matches the header
$first = $true
}
}
} | Out-File $outFile
$total = $matched + $unmatched
Using System.IO.StreamReader reduced the processing time to about 20% of what it had been. This was absolutely needed for my requirement.
I added logic and counters without sacrificing much on performance. The field counter and row by row comparison is particularly helpful in finding bad records.
This is a copy/paste of actual code but I shortened some things, made some things slightly pseudo code, so you may have to play with it to get things working just so for yourself.
Function Get-Regx-Data-Format() {
Param ([String] $filename)
if ($filename -eq 'FILE NAME') {
[regex]$match_regex = '^(.{10})(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{1})'
}
return $match_regex
}
Foreach ($file in $cutoff_files) {
$starttime_for_file = (Get-Date)
$source_file = $file + '_' + $proc_yyyymm + $source_file_suffix
$source_path = $source_dir + $source_file
$parse_file = $file + '_' + $proc_yyyymm + '_load' +$parse_target_suffix
$parse_file_path = $parse_target_dir + $parse_file
$error_file = $file + '_err_' + $proc_yyyymm + $error_target_suffix
$error_file_path = $error_target_dir + $error_file
[regex]$match_data_regex = Get-Regx-Data-Format $file
Remove-Item -path "$parse_file_path" -Force -ErrorAction SilentlyContinue
Remove-Item -path "$error_file_path" -Force -ErrorAction SilentlyContinue
[long]$matched_cnt = 0
[long]$unmatched_cnt = 0
[long]$loop_counter = 0
[boolean]$has_header_row=$true
[int]$field_cnt=0
[int]$previous_field_cnt=0
[int]$array_length=0
$parse_minutes = Measure-Command {
try {
$stream_log = [System.IO.StreamReader]::new($source_path)
$stream_in = [System.IO.StreamReader]::new($source_path)
$stream_out = [System.IO.StreamWriter]::new($parse_file_path)
$stream_err = [System.IO.StreamWriter]::new($error_file_path)
while ($line = $stream_in.ReadLine()) {
if ($line -match $match_data_regex) {
#if matched and it's the header, parse and write to the beg of output file
if (($loop_counter -eq 0) -and $has_header_row) {
$stream_out.WriteLine(($Matches[1..($array_length)].Trim() -join "`t"))
} else {
$previous_field_cnt = $field_cnt
#add year month to line start, trim and join every captured field w/tabs
$stream_out.WriteLine("$proc_yyyymm`t" + `
($Matches[1..($array_length)].Trim() -join "`t"))
$matched_cnt++
$field_cnt=$Matches.Count
if (($previous_field_cnt -ne $field_cnt) -and $loop_counter -gt 1) {
write-host "`nError on line $($loop_counter + 1). `
The field count does not match the previous correctly `
formatted (non-error) row."
}
}
} else {
if (($loop_counter -eq 0) -and $has_header_row) {
#if the header, write to the beginning of the output file
$stream_out.WriteLine($line)
} else {
$stream_err.WriteLine($line)
$unmatched_cnt++
}
}
$loop_counter++
}
} finally {
$stream_in.Dispose()
$stream_out.Dispose()
$stream_err.Dispose()
$stream_log.Dispose()
}
} | Select-Object -Property TotalMinutes
write-host "`n$file_list_idx. File $file parsing results....`nMatched Count =
$matched_cnt UnMatched Count = $unmatched_cnt Parse Minutes = $parse_minutes`n"
$file_list_idx++
$endtime_for_file = (Get-Date)
write-host "`nEnded processing file at $endtime_for_file"
$TimeDiff_for_file = (New-TimeSpan $starttime_for_file $endtime_for_file)
$Hrs_for_file = $TimeDiff_for_file.Hours
$Mins_for_file = $TimeDiff_for_file.Minutes
$Secs_for_file = $TimeDiff_for_file.Seconds
write-host "`nElapsed Time for file $file processing:
$Hrs_for_file`:$Mins_for_file`:$Secs_for_file"
}
$endtime = (Get-Date -format "HH:mm:ss")
$TimeDiff = (New-TimeSpan $starttime $endtime)
$Hrs = $TimeDiff.Hours
$Mins = $TimeDiff.Minutes
$Secs = $TimeDiff.Seconds
write-host "`nTotal Elapsed Time: $Hrs`:$Mins`:$Secs"

Add capture group values in a PowerShell replace loop

Needing to replace a string in multiple text files with the same string , except with capture group 2 replaced by the sum of itself and capture group 4.
String: Total amount $11.39 | Change $0.21
Desired Result: Total amount $11.60 | Change $0.21
I have attempted several methods. Here is my last attempt which seems to run without error, but without any changes to the string .
$Originalfolder = "$ENV:userprofile\Documents\folder\"
$Originalfiles = Get-ChildItem -Path "$Originalfolder\*"
$RegexPattern = '\b(Total\s\amount\s\$)(\d?\d?\d?\d?\d\.?\d?\d?)(\s\|\sChange\s\$)(\d?\d?\d\.?\d?\d?)\b'
$Substitution = {
Param($Match)
$Result = $GP1 + $Sumtotal + $GP3 + $Change
$GP1 = $Match.Groups[1].Value
$Total = $Match.Groups[2].Value
$GP3 = $Match.Groups[3].Value
$Change = $Match.Groups[4].Value
$Sumtotal = ($Total + $Change)
return [string]$Result
}
foreach ($file in $Originalfiles) {
$Lines = Get-Content $file.FullName
$Lines | ForEach-Object {
[Regex]::Replace($_, $RegexPattern, $Substitution)
} | Set-Content $file.FullName
}
For one thing, your regular expression doesn't even match what you're trying to replace, because you escaped the a in amount:
\b(Total\s\amount\s\$)(\d?\d?\d?...
# ^^
\a is an escape sequence that matches the "alarm" or "bell" character \u0007.
Also, if you want to calculate the sum of two captures you need to convert them to numeric values first, otherwise the + operator would just concatenate the two strings.
$Total = $Match.Groups[2].Value
$Change = $Match.Groups[4].Value
$Sumtotal = $Total + $Change # gives 11.390.21
$Sumtotal = [double]$Total + [double]$Change # gives 11.6
And you need to build $Result after you defined the other variables, otherwise the replacement function would just return an empty string.
Change this:
$RegexPattern = '\b(Total\s\amount\s\$)(\d?\d?\d?\d?\d\.?\d?\d?)(\s\|\sChange\s\$)(\d?\d?\d\.?\d?\d?)\b'
$Substitution = {
param ($Match)
$Result = $GP1 + $Sumtotal + $GP3 + $Change
$GP1 = $Match.Groups[1].Value
$Total = $Match.Groups[2].Value
$GP3 = $Match.Groups[3].Value
$Change = $Match.Groups[4].Value
$Sumtotal = ($Total + $Change)
return [string]$Result
}
into this:
$RegexPattern = '\b(Total\samount\s\$)(\d?\d?\d?\d?\d\.?\d?\d?)(\s\|\sChange\s\$)(\d?\d?\d\.?\d?\d?)\b'
$Substitution = {
Param($Match)
$GP1 = $Match.Groups[1].Value
$Total = [double]$Match.Groups[2].Value
$GP3 = $Match.Groups[3].Value
$Change = [double]$Match.Groups[4].Value
$Sumtotal = ($Total + $Change)
$Result = $GP1 + $Sumtotal + $GP3 + $Change
return [string]$Result
}
and the code will mostly do what you want. "Mostly", because it will not format the calculated number to double decimals. You need to do that yourself. Use the format operator (-f) and change your replacement function to something like this:
$Substitution = {
Param($Match)
$GP1 = $Match.Groups[1].Value
$Total = [double]$Match.Groups[2].Value
$GP3 = $Match.Groups[3].Value
$Change = [double]$Match.Groups[4].Value
$Sumtotal = $Total + $Change
return ('{0}{1:n2}{2}{3:n2}' -f $GP1, $Sumtotal, $GP3, $Change)
}
As a side note: the sub-expression \d?\d?\d?\d?\d\.?\d?\d? could be shortened to \d+(?:\.\d+)? (one or more digit, optionally followed by a period and one or more digits) or, more exactly, to \d{1,4}(?:\.\d{0,2})? (one to four digits, optionally followed by a period and up to 2 digits).
here's how I'd do it: this is pulled out of a larger script that regularly scans a directory for files, then does a similar manipulation, and I've changed variables quickly to obfuscate, so shout if it doesn't work and I'll take a more detailed look tomorrow.
It takes a backup of each file as well, and works on a temp copy before renaming.
Note it also sends an email alert (code at the end) to say if any processing was done - this is because it's designed to run as as scheduled task in the original
$backupDir = "$pwd\backup"
$stringToReplace = "."
$newString = "."
$files = #(Get-ChildItem $directoryOfFiles)
$intFiles = $files.count
$tmpExt = ".tmpDataCorrection"
$DataCorrectionAppend = ".DataprocessBackup"
foreach ($file in $files) {
$content = Get-Content -Path ( $directoryOfFiles + $file )
# Check whether there are any instances of the string
If (!($content -match $stringToReplace)) {
# Do nothing if we didn't match
}
Else {
#Create another blank temporary file which the corrected file contents will be written to
$tmpFileName_DataCorrection = $file.Name + $tmpExt_DataCorrection
$tmpFile_DataCorrection = $directoryOfFiles + $tmpFileName_DataCorrection
New-Item -ItemType File -Path $tmpFile_DataCorrection
foreach ( $line in $content ) {
If ( $line.Contains("#")) {
Add-Content -Path $tmpFile_DataCorrection -Value $line.Replace($stringToReplace,$newString)
#Counter to know whether any processing was done or not
$processed++
}
Else {
Add-Content -Path $tmpFile_DataCorrection -Value $line
}
}
#Backup (rename) the original file, and rename the temp file to be the same name as the original
Rename-Item -Path $file.FullName -NewName ($file.FullName + $DataCorrectionAppend) -Force -Confirm:$false
Move-Item -Path ( $file.FullName + $DataCorrectionAppend ) -Destination backupDir -Force -Confirm:$false
Rename-Item -Path $tmpFile_DataCorrection -NewName $file.FullName -Force -Confirm:$false
# Check to see if anything was done, then populate a variable to use in final email alert if there was
If (!$processed) {
#no message as did nothing
}
Else {
New-Variable -Name ( "processed" + $file.Name) -Value $strProcessed
}
} # Out of If loop
}

Powershell search matching string in word document

I have a simple requirement. I need to search a string in Word document and as result I need to get matching line / some words around in document.
So far, I could successfully search a string in folder containing Word documents but it returns True / False based on whether it could find search string or not.
#ERROR REPORTING ALL
Set-StrictMode -Version latest
$path = "c:\MORLAB"
$files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
$output = "c:\wordfiletry.txt"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "CRHPCD01"
Function getStringMatch
{
# Loop through all *.doc files in the $path directory
Foreach ($file In $files)
{
$document = $application.documents.open($file.FullName,$false,$true)
$range = $document.content
$wordFound = $range.find.execute($findText)
if($wordFound)
{
"$file.fullname has $wordfound" | Out-File $output -Append
}
}
$document.close()
$application.quit()
}
getStringMatch
#ERROR REPORTING ALL
Set-StrictMode -Version latest
$path = "c:\Temp"
$files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
$output = "c:\temp\wordfiletry.csv"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "First"
$charactersAround = 30
$results = #{}
Function getStringMatch
{
# Loop through all *.doc files in the $path directory
Foreach ($file In $files)
{
$document = $application.documents.open($file.FullName,$false,$true)
$range = $document.content
If($range.Text -match ".{$($charactersAround)}$($findtext).{$($charactersAround)}"){
$properties = #{
File = $file.FullName
Match = $findtext
TextAround = $Matches[0]
}
$results += New-Object -TypeName PsCustomObject -Property $properties
}
}
If($results){
$results | Export-Csv $output -NoTypeInformation
}
$document.close()
$application.quit()
}
getStringMatch
import-csv $output
There are a couple of ways to get what you want. A simple approach is since you have the text of the document already lets perform a regex match on it and return the results and more. This helps in trying to address getting some words around in document.
We have the variable $charactersAround which sets the number of characters to match around the $findtext. Also I though the output was a better fit for a CSV file so I used $results to capture a hashtable of properties that, in the end, are output to a csv file.
Be sure to change the variables for your own testing. Now that we are using regex to locate the matches this opens up a world of possibilities.
Sample Output
Match TextAround File
----- ---------- ----
First dley Air Services Limited dba First Air meets or exceeds all term C:\Temp\20120315132117214.docx
Thanks! You provided a great solution to use PowerShell regex expressions to look for information in a Word document. I needed to modify it to meet my needs. Maybe, it will help someone else. It reads each line of the word document, and then uses the regex expression to determine if the line is a match. The output could easily be modified or dumped to a log file.
Set-StrictMode -Version latest
$path = "c:\Temp\pii"
$files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "[0-9]" #regex
Function getStringMatch
{
# Loop through all *.doc files in the $path directory
Foreach ($file In $files) {
$document = $application.documents.open($file.FullName,$false,$true)
$arrContents = $document.content.text.split()
$varCounter = 0
ForEach ($line in $arrContents) {
$varCounter++
If($line -match $findtext) {
"File: $file Found: $line Line: $varCounter"
}
}
$document.close()
}
$application.quit()
}
getStringMatch
Good answer from #Matt.
I improved it a little (new PowerShell version have problems with the given array. And to search big amount of documents it runs out of memory.
Here is my improved version:
#ERROR REPORTING ALL
Set-StrictMode -Version latest
$path = "c:\Temp"
$files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
$output = "c:\temp\wordfiletry.csv"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "First"
$charactersAround = 30
$results = #{}
Function getStringMatch
{
# Loop through all *.doc files in the $path directory
Foreach ($file In $files)
{
$document = $application.documents.open($file.FullName,$false,$true)
$range = $document.content
If($range.Text -match ".{$($charactersAround)}$($findtext).{$($charactersAround)}"){
$properties = #{
File = $file.FullName
Match = $findtext
TextAround = $Matches[0]
}
$results += #(New-Object -TypeName PsCustomObject -Property $properties)
}
$document.close()
}
If($results){
$results | Export-Csv $output -NoTypeInformation
}
$application.quit()
}
getStringMatch
import-csv $output
Use the function like this:
PS> WordGrep -File ./Myfile.docx -Grep one, two, three
function WordGrep{
param(
[string]$File,
[string[]]$Grep,
[switch]$WordMode,
[switch]$EscapeMode
)
$WordApp = New-Object -comobject word.application
$WordApp.visible = $False
try {
$document = $WordApp.documents.open($File, $false, $true)
$arrContents = $document.content.text.split()
$found = $false
foreach ($line in $arrContents) {
foreach ($pattern in $Grep) {
if ($EscapeMode) {
$pattern = [Regex]::Escape($pattern)
}
if ($WordMode) {
$pattern = "\b${pattern}\b"
}
if ($line -imatch $pattern) {
write-host -ForegroundColor Cyan -NoNewLine "$file`:"
write-host " $line"
break;
}
}
}
$document.close()
}
finally {
$WordApp.quit()
}
}

Use Powershell to print out line number of code matching a RegEx

I think we have a bunch of commented out code in our source, and rather than delete it immediately, we've just left it. Now I would like to do some cleanup.
So assuming that I have a good enough RegEx to find comments (the RegEx below is simple and I could expand on it based on our coding standards), how do I take the results of the file that I read up and output the following:
Filename
Line Number
The actual line of code
I think I have the basis of an answer here, but I don't know how to take the file that I've read up and parsed with RegEx and spit it out in this format.
I'm not looking for the perfect solution - I just want to find big blocks of commented out code. By looking at the result and seeing a bunch of files with the same name and sequential line numbers, I should be able to do this.
$Location = "c:\codeishere"
[regex]$Regex = "//.*;" #simple example - Will expand on this...
$Files = get-ChildItem $Location -include *cs -recurse
foreach ($File in $Files) {
$contents = get-Content $File
$Regex.Matches($contents) | WHAT GOES HERE?
}
You could do:
dir c:\codeishere -filter *.cs -recurse | select-string -Pattern '//.*;' | select Line,LineNumber,Filename
gci c:\codeishere *.cs -r | select-string "//.*;"
The select-string cmdlet already does exactly what you're asking for, though the filename displayed is a relative path.
I would go personally even further. I would like to compute number of consecutive following lines. Then print the file name, count of lines and the lines itself. You may sort the result by count of lines (candidates for delete?).
Note that my code doesn't count with empty lines between commented lines, so this part is considered as two blocks of commented code:
// int a = 10;
// int b = 20;
// DoSomething()
// SomethingAgain()
Here is my code.
$Location = "c:\codeishere"
$occurences = get-ChildItem $Location *cs -recurse | select-string '//.*;'
$grouped = $occurences | group FileName
function Compute([Microsoft.PowerShell.Commands.MatchInfo[]]$lines) {
$local:lastLineNum = $null
$local:lastLine = $null
$local:blocks = #()
$local:newBlock = $null
$lines |
% {
if (!$lastLineNum) { # first line
$lastLineNum = -2 # some number so that the following if is $true (-2 and lower)
}
if ($_.LineNumber - $lastLineNum -gt 1) { #new block of commented code
if ($newBlock) { $blocks += $newBlock }
$newBlock = $null
}
else { # two consecutive lines of commented code
if (!$newBlock) {
$newBlock = '' | select File,StartLine,CountOfLines,Lines
$newBlock.File, $newBlock.StartLine, $newBlock.CountOfLines, $newBlock.Lines = $_.Filename,($_.LineNumber-1),2, #($lastLine,$_.Line)
}
else {
$newBlock.CountOfLines += 1
$newBlock.Lines += $_.Line
}
}
$lastLineNum=$_.LineNumber
$lastLine = $_.Line
}
if ($newBlock) { $blocks += $newBlock }
$blocks
}
# foreach GroupInfo objects from group cmdlet
# get Group collection and compute
$result = $grouped | % { Compute $_.Group }
#how to print
$result | % {
write-host "`nFile $($_.File), line $($_.StartLine), count of lines: $($_.CountOfLines)" -foreground Green
$_.Lines | % { write-host $_ }
}
# you may sort it by count of lines:
$result2 = $result | sort CountOfLines -desc
$result2 | % {
write-host "`nFile $($_.File), line $($_.StartLine), count of lines: $($_.CountOfLines)" -foreground Green
$_.Lines | % { write-host $_ }
}
If you have any idea how to improve the code, post it! I have a feeling that I could do it using some standard cmdlets and the code could be shorter..
I would look at doing something like:
dir $location -inc *.cs -rec | `
%{ $file = $_; $n = 0; get-content $_ } | `
%{ $_.FileName = $file; $_.Line = ++$n; $_ } | `
?{ $_ -match $regex } | `
%{ "{0}:{1}: {2}" -f ($_.FileName, $_.Line, $_)}
I.e. add extra properties to the string to specify the filename and line number, which can be carried through the pipeline after the regex match.
(Using ForEach-Object's -begin/-end script blocks should be able to simplify this.)