Powershell search matching string in word document - regex

I have a simple requirement: I need to search for a string in a Word document and get back the matching line, or a few words of surrounding text, as the result.
So far I can search a folder of Word documents for a string, but the script only returns True / False depending on whether the search string was found.
#ERROR REPORTING ALL
Set-StrictMode -Version latest
$path = "c:\MORLAB"
$files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
$output = "c:\wordfiletry.txt"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "CRHPCD01"
Function getStringMatch
{
# Loop through all *.doc files in the $path directory
Foreach ($file In $files)
{
$document = $application.documents.open($file.FullName,$false,$true)
$range = $document.content
$wordFound = $range.find.execute($findText)
if($wordFound)
{
"$($file.FullName) has $wordFound" | Out-File $output -Append
}
# Close each document before moving on to the next one
$document.close()
}
$application.quit()
}
getStringMatch

#ERROR REPORTING ALL
Set-StrictMode -Version latest
$path = "c:\Temp"
$files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
$output = "c:\temp\wordfiletry.csv"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "First"
$charactersAround = 30
$results = @()
Function getStringMatch
{
# Loop through all *.doc files in the $path directory
Foreach ($file In $files)
{
$document = $application.documents.open($file.FullName,$false,$true)
$range = $document.content
If($range.Text -match ".{$($charactersAround)}$($findtext).{$($charactersAround)}"){
$properties = @{
File = $file.FullName
Match = $findtext
TextAround = $Matches[0]
}
$results += New-Object -TypeName PsCustomObject -Property $properties
}
}
If($results){
$results | Export-Csv $output -NoTypeInformation
}
$document.close()
$application.quit()
}
getStringMatch
import-csv $output
There are a couple of ways to get what you want. Since you already have the text of the document, a simple approach is to run a regex match against it and return the match together with some surrounding text. This addresses the requirement of getting some words around the hit.
The variable $charactersAround sets the number of characters to match on either side of $findtext. I also thought the output was a better fit for a CSV file, so each match is built from a hashtable of properties, collected in $results, and written to a CSV file at the end.
Be sure to change the variables for your own testing. Now that we are using regex to locate the matches, this opens up a world of possibilities.
Sample Output
Match TextAround File
----- ---------- ----
First dley Air Services Limited dba First Air meets or exceeds all term C:\Temp\20120315132117214.docx
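The context-window pattern in the answer above requires exactly $charactersAround characters on both sides of the hit, so a match close to the start or end of the text is missed. The sketch below is one possible variation, not the answerer's code: the function name Get-TextAround and its parameters are made up for illustration, and it runs on a plain string so it can be tried without Word.

```powershell
# Hypothetical helper: returns the search term with up to $Around characters
# of context on each side, or nothing when the term does not occur.
function Get-TextAround {
    param(
        [string]$Text,
        [string]$FindText,
        [int]$Around = 30
    )
    # Escape the term so characters such as '.' or '(' are matched literally.
    $escaped = [regex]::Escape($FindText)
    # {0,N} accepts fewer than N characters, so hits near either end still match.
    $pattern = ".{0,$Around}" + $escaped + ".{0,$Around}"
    if ($Text -match $pattern) {
        $Matches[0]
    }
}

Get-TextAround -Text 'Bradley Air Services Limited dba First Air meets all terms' -FindText 'First' -Around 10
```

In the Word scripts, $range.Text would be passed as -Text. Drop the [regex]::Escape call if $findtext is meant to be a regex rather than literal text.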

Thanks! You provided a great solution to use PowerShell regex expressions to look for information in a Word document. I needed to modify it to meet my needs. Maybe, it will help someone else. It reads each line of the word document, and then uses the regex expression to determine if the line is a match. The output could easily be modified or dumped to a log file.
Set-StrictMode -Version latest
$path = "c:\Temp\pii"
$files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "[0-9]" #regex
Function getStringMatch
{
# Loop through all *.doc files in the $path directory
Foreach ($file In $files) {
$document = $application.documents.open($file.FullName,$false,$true)
$arrContents = $document.content.text.split()
$varCounter = 0
ForEach ($line in $arrContents) {
$varCounter++
If($line -match $findtext) {
"File: $file Found: $line Line: $varCounter"
}
}
$document.close()
}
$application.quit()
}
getStringMatch
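One detail worth knowing about the script above: Split() with no arguments breaks on every whitespace character, so $line holds a single word rather than a full line. Word separates paragraphs in Content.Text with a carriage return, so splitting on that character keeps each paragraph together. The sketch below uses a made-up string as a stand-in for $document.content.text; the variable names are illustrative only.

```powershell
# Stand-in for $document.Content.Text: Word separates paragraphs with a
# carriage return (char 13). The sample text is made up for illustration.
$text = "Invoice 42 is due`rNo digits here`rCall 555 0100"

# Split() with no arguments breaks on every whitespace character,
# so each item is a single word rather than a line.
$words = $text.Split()

# Splitting on the carriage return keeps each paragraph intact.
$paragraphs = $text -split "`r"

$lineNumber = 0
$hits = foreach ($paragraph in $paragraphs) {
    $lineNumber++
    if ($paragraph -match '[0-9]') {
        "Line ${lineNumber}: $paragraph"
    }
}
$hits
```

Whether word-level or paragraph-level matching is the better fit depends on what needs to be reported; for something like a phone number that contains spaces, the paragraph form keeps the whole value in one match.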

Good answer from @Matt.
I improved it a little: newer PowerShell versions have problems with the way the results array is built, and when searching a large number of documents the original runs out of memory.
Here is my improved version:
#ERROR REPORTING ALL
Set-StrictMode -Version latest
$path = "c:\Temp"
$files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
$output = "c:\temp\wordfiletry.csv"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "First"
$charactersAround = 30
$results = @()
Function getStringMatch
{
# Loop through all *.doc files in the $path directory
Foreach ($file In $files)
{
$document = $application.documents.open($file.FullName,$false,$true)
$range = $document.content
If($range.Text -match ".{$($charactersAround)}$($findtext).{$($charactersAround)}"){
$properties = @{
File = $file.FullName
Match = $findtext
TextAround = $Matches[0]
}
$results += @(New-Object -TypeName PsCustomObject -Property $properties)
}
$document.close()
}
If($results){
$results | Export-Csv $output -NoTypeInformation
}
$application.quit()
}
getStringMatch
import-csv $output

Use the function like this:
PS> WordGrep -File ./Myfile.docx -Grep one, two, three
function WordGrep{
param(
[string]$File,
[string[]]$Grep,
[switch]$WordMode,
[switch]$EscapeMode
)
$WordApp = New-Object -comobject word.application
$WordApp.visible = $False
try {
$document = $WordApp.documents.open($File, $false, $true)
$arrContents = $document.content.text.split()
$found = $false
foreach ($line in $arrContents) {
foreach ($pattern in $Grep) {
if ($EscapeMode) {
$pattern = [Regex]::Escape($pattern)
}
if ($WordMode) {
$pattern = "\b${pattern}\b"
}
if ($line -imatch $pattern) {
write-host -ForegroundColor Cyan -NoNewLine "$file`:"
write-host " $line"
break;
}
}
}
$document.close()
}
finally {
$WordApp.quit()
}
}

Related

Compare size of multiple subdirectory before and after a break in a Powershell script

I'm a beginner with Powershell, also forgive my English which isn't the best.
I have a directory with several subdirectories like this
Directory
My goal is to identify the directories that are updated while the script is running, so I wrote this script:
$path = "C:\Users\s611284\Desktop\archive"
#Check that the directories have the correct name and calculate their size
Get-ChildItem -Force $path -ErrorAction SilentlyContinue -Directory | foreach {
$Size = (Get-ChildItem $_.fullname -Recurse -ErrorAction SilentlyContinue | Measure-Object -Property Length -Sum).Sum/ 1Kb
$FolderName = $_.BaseName -match '1B(\d{6})_LEAP 1A version aout2021_(\d{4})-(\d{2})-(\d{2})T(\d{2})h(\d{2})m(\d{2})s_S(\d{6})' -or $_.BaseName -match '1B(\d{6})_SML 10_LEAP 1A version aout2021_(\d{4})-(\d{2})-(\d{2})T(\d{2})h(\d{2})m(\d{2})s_S(\d{5})'
$Folder = $_.BaseName
if ($FolderName -eq "true") {
write-host(" name $Folder is correct, $Size Kb")
}
else {
write-host( "name $Folder is incorrect")
}
}
#Break
Start-Sleep -Seconds 30
write-host( "end of break")
#directory size calculation after the break
Get-ChildItem -Force $path -ErrorAction SilentlyContinue -Directory | foreach {
$Size1 = (Get-ChildItem $_.fullname -Recurse -ErrorAction SilentlyContinue | Measure-Object -Property Length -Sum).Sum/ 1Kb
$FolderName1 = $_.BaseName -match '1B(\d{6})_LEAP 1A version aout2021_(\d{4})-(\d{2})-(\d{2})T(\d{2})h(\d{2})m(\d{2})s_S(\d{6})' -or $_.BaseName -match '1B(\d{6})_SML 10_LEAP 1A version aout2021_(\d{4})-(\d{2})-(\d{2})T(\d{2})h(\d{2})m(\d{2})s_S(\d{5})'
$Folder1 = $_.BaseName
if ($FolderName1 -eq "true") {
write-host("name $Folder1 is correct, $Size1 Kb")
}
else {
write-host( "name $Folder1 is incorrect")
}
}
All of this works well.
Now I want to compare the size of the subdirectories before and after the break, to find out which ones have been updated.
I tried
if ( $FolderSize -eq $FolderSize1 )
{
Write-Output $True
}
Else
{
Write-Output $False
}
at the end of my second block, but it isn't working.
I also tried Compare-Object, but I don't think that command will help in my case.
I hope you can understand my post and help me.
Thanks!
If I understood correctly, you want to find the folders whose size has changed after 30 seconds. If that's the case, you could use a function so that you don't need to repeat your code. The function can return a hash table where the keys are the folders' absolute paths and the values are their calculated sizes. Once you have both results (before and after the 30 seconds), you can compare the two hash tables and output a new object with the folder's absolute path, size before and size after, only for those folders whose calculated size has changed.
function GetFolderSize {
[cmdletbinding()]
param($path)
$map = @{}
Get-ChildItem $path -Directory -Force | ForEach-Object {
$Size = (Get-ChildItem $_.Fullname -Recurse | Measure-Object -Property Length -Sum).Sum / 1Kb
$FolderName = $_.BaseName -match '1B(\d{6})_LEAP 1A version aout2021_(\d{4})-(\d{2})-(\d{2})T(\d{2})h(\d{2})m(\d{2})s_S(\d{6})' -or $_.BaseName -match '1B(\d{6})_SML 10_LEAP 1A version aout2021_(\d{4})-(\d{2})-(\d{2})T(\d{2})h(\d{2})m(\d{2})s_S(\d{5})'
if ($FolderName) {
$map[$_.FullName] = $size
}
}
if($map) { $map }
}
$path = "C:\Users\s611284\Desktop\archive"
$before = GetFolderSize $path -ErrorAction SilentlyContinue
Start-Sleep -Seconds 30
$after = GetFolderSize $path -ErrorAction SilentlyContinue
foreach($key in $after.PSBase.Keys) {
if($before[$key] -ne $after[$key]) {
# this is a folder with a different size than before
[PSCustomObject]@{
FullName = $key
SizeBefore = $before[$key]
SizeAfter = $after[$key]
}
}
}
Not to take away from Santiago's helpful answer, but to provide an alternate solution, here's my take:
$path = "C:\Users\s611284\Desktop\archive"
$count = 0
$hashMap = @{}
While ($count -lt 2) {
Get-ChildItem -Path $path -Directory |
ForEach-Object -Begin {
$count++
$toMatch = "1B(\d{6})_LEAP 1A version aout2021_(\d{4})-(\d{2})-(\d{2})T(\d{2})h(\d{2})m(\d{2})s_S(\d{6})|1B(\d{6})_SML 10_LEAP 1A version aout2021_(\d{4})-(\d{2})-(\d{2})T(\d{2})h(\d{2})m(\d{2})s_S(\d{5})"
} -Process {
$folderMatch = $_.BaseName -match $toMatch
if ($folderMatch) {
$size = (Get-ChildItem -Path $_.FullName -Recurse -EA 0 | Measure-Object -Property "Length" -Sum).Sum / 1kb
if (-not$hashMap.ContainsKey($_.BaseName)) {
$hashMap.Add($_.BaseName,$size)
}
if ($count -ge 2) {
if ($hashMap[$_.BaseName] -ne $size) {
$_.BaseName + " " + "is a different size"
}
}
}
} -End {
if ($count -ne 2) {
Start-Sleep -Seconds 30
}
}
}
Personally, I dislike repeating code, and I feel something can always be done about it if you find yourself copying and pasting.
My question to you is:
Aren't those folder names pretty unique?
Could you not substitute a wildcard expression for the regex?
i.e.: $_.BaseName -like '*1A version aout2021*'
All in the name of "clean" code. lol
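To make the trade-off behind that suggestion concrete, here is a small sketch comparing the wildcard with a stricter pattern. The folder names are invented in the shape described in the question, and $strict is a condensed, anchored variant of the question's two patterns; accepting 5 or 6 trailing digits for either form is an assumption, not something the question states.

```powershell
# Made-up folder names in the shape described in the question.
$names = @(
    '1B123456_LEAP 1A version aout2021_2021-08-30T10h15m20s_S123456',
    '1B654321_SML 10_LEAP 1A version aout2021_2021-08-30T11h00m00s_S12345',
    'notes_LEAP 1A version aout2021_draft'
)

# Assumed condensed form of the two patterns, anchored at both ends.
$strict = '^1B\d{6}_(SML 10_)?LEAP 1A version aout2021_\d{4}-\d{2}-\d{2}T\d{2}h\d{2}m\d{2}s_S\d{5,6}$'

# The wildcard is shorter but accepts anything containing the fixed text.
$byWildcard = @($names | Where-Object { $_ -like '*1A version aout2021*' })

# The regex also validates the serial number and timestamp layout.
$byRegex = @($names | Where-Object { $_ -match $strict })

"Wildcard matched $($byWildcard.Count) names, regex matched $($byRegex.Count)"
```

If every folder in the archive is known to follow the naming scheme, the wildcard is probably enough; the regex earns its length only when badly named folders have to be filtered out.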

Powershell - Searching for strings (in list) in a word document

I found some code for searching for strings in a Word Document. I altered it to suit my needs (I need to search from a very long list of strings). Unfortunately, I am getting a weird error.
While the script is running, it opens the word document, searches the word document and here is where it gets weird, instead of closing the document and opening the next, it presents me with a 'save as' dialog box and the script hangs until I cancel out of it. When I cancel out of it, my script continues.
Here is the script I'm using, would anyone see where I'm going south?
$results = @{}
Write-Host "Loading getStringMatch into memory" -ForegroundColor DarkMagenta
Function getStringMatch
{
# Loop through all *.doc files in the $path directory
Foreach ($file In $files)
{
Write-Host "Searching In ... $($File.FullName) " -ForegroundColor DarkYellow
$document = $application.documents.open($file.FullName,$false,$true)
$range = $document.content
If($range.Text -match ".{$($charactersAround)}$($findtext).{$($charactersAround)}"){
$properties = @{
File = $file.FullName
Match = $findtext
TextAround = $Matches[0]
}
$results += @(New-Object -TypeName PsCustomObject -Property $properties)
}
$document.close()
Write-Host "Closing Document ... $($File.FullName) " -ForegroundColor Red
}
#If($results){
# $results | Export-Csv $output -NoTypeInformation
#}
$application.quit()
}
$searchWords=Get-Content "C:\Temp\USDA_Search_For.txt"
Foreach ($sw in $searchWords)
{
Write-Host "Setting Variables ..." -ForegroundColor DarkMagenta
Set-StrictMode -Version latest
$path = "C:\Temp"
$files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
$output = "C:\Temp\Found.csv"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "First"
$charactersAround = 30
#$results = @{}
$findtext = $sw
Write-Host "Searching For ... $findtext" -ForegroundColor Green
getStringMatch
#clean up stuff
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($application) | Out-Null
Remove-Variable -Name application
[gc]::collect()
[gc]::WaitForPendingFinalizers()
}
If($results){
$results | Export-Csv $output -NoTypeInformation
}
import-csv $output

Regular expression in PowerShell to convert characters between double quotes to upper case

I want to write a PowerShell script that finds each string between double quotes in a file and converts it to upper case.
The files are placed in different folders.
I am able to extract the string between the double quotes and convert it to upper case, but I am not able to put it back in the correct position.
Ex : This is the input string.
"e" //&&'i&&
The output should be
"E" //&&'i&&
This is what I have tried. It also does not replace the content of the file.
$items = Get-ChildItem * -recurse
# enumerate the items array
foreach ($item in $items)
{
# if the item is a directory, then process it.
if ($item.Attributes -ne "Directory")
{
(Get-Content $item.FullName ) |
Foreach-Object {
if (($_ -match '\"'))
{
$str = $_
$ext = [regex]::Matches($str, '".*?"').Value -replace '"'
$ext = $ext.ToUpper()
Write-Host $ext
$_ = $ext
}
else { }
} |
Set-Content $item.FullName
}
}
This can do it. Really I wasn't following your code so I stripped it and modified the regex.
$items = Get-ChildItem "C:\Users\UsernameHere\Desktop\Folder123\*.txt"
# enumerate the items array
foreach ($item in $items){
# only process files, not directories
if ($item.Attributes -ne "Directory"){
# String.Replace() is literal, so use [regex]::Replace with a script block
# that upper-cases each quoted match
$content = Get-Content $item.FullName | ForEach-Object {
[regex]::Replace($_, '"[^"]*"', { param($m) $m.Value.ToUpper() })
}
$content | Set-Content $item.FullName
}
}
If you have PowerShell 6 or 7, -replace accepts a script block:
'"hi"' -replace '".*"', { $_.value.toupper() }
"HI"
'"e" //&&''i&&' -replace '".*"', { $_.value.toupper() }
"E" //&&'i&&
I am able to print the upper-case characters with the code below, but the file is not getting updated; it still has the old characters. How do I update the file with the new contents?
$items = Get-ChildItem *.txt -recurse
# enumerate the items array
foreach ($item in $items)
{
# if the item is a directory, then process it.
if ($item.Attributes -ne "Directory")
{
(Get-Content $item.FullName ) |
Foreach-Object {
$str = $_
$_ = [regex]::Replace($_, '"[^"]*"', { param($m) $m.Value.ToUpper() })
Write-Host $_
} |
Set-Content $item.FullName
}
}
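A likely reason the file is not updated by the script above: Write-Host prints to the console but sends nothing down the pipeline, and assigning to $_ does not emit anything either, so Set-Content receives no input. The sketch below shows the same replacement with the value emitted from the script block. It works on a throwaway file in the temp folder (the file name is made up) so it can be run safely.

```powershell
# Work on a throwaway file so the sketch can be run safely.
$file = Join-Path ([System.IO.Path]::GetTempPath()) 'uppercase-demo.txt'
Set-Content -Path $file -Value '"e" //&&''i&&', 'no quotes here'

# The parentheses make Get-Content finish reading before Set-Content opens the file.
(Get-Content -Path $file) |
    ForEach-Object {
        # Emitting the value (instead of Write-Host) passes it down the pipeline.
        [regex]::Replace($_, '"[^"]*"', { param($m) $m.Value.ToUpper() })
    } |
    Set-Content -Path $file

Get-Content -Path $file
```

Lines without quotes pass through unchanged because [regex]::Replace returns the input when nothing matches, so no separate if/else branch is needed.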

How to stream text with powershell and regex match on multiline

I have a text file that an application constantly errors to. I want to monitor this file with Powershell and log every error to another source.
Problem to solve: how do I pass multiline text when using -Wait? Get-Content passes the text line by line, as individual strings.
$File = 'C:\Windows\Temp\test.txt'
$content = Get-Content -Path $file
# get stream of text
Get-Content $file -wait -Tail 0 | ForEach-Object {
if ($_ -match '(<ACVS_T>)((.|\n)*)(<\/ACVS_T>)+'){
write-host 'match found!'
}
}
Example of the text chunks that get dropped:
<ACVS_T>
<ACVS_D>03/01/2017 17:24:03.602</ACVS_D>
<ACVS_TI>bf37ba1c9,iSTAR Server Compone</ACVS_TI>
<ACVS_C>ClusterPort</ACVS_C>
<ACVS_S>SoftwareHouse.NextGen.HardwareInterface.Nantucket.Framework.ClusterPort.HandleErrorState( )
</ACVS_S>
<ACVS_M>
ERROR MESSAGE FROM APP
</ACVS_M>
<ACVS_ST>
</ACVS_ST>
</ACVS_T>
solved it!
$File = 'D:\Program Files (x86)\Tyco\CrossFire\Logging\SystemTrace.Log'
$content = Get-Content -Path $file
# get stream of text
$text = ''
Get-Content $file -wait -Tail 0 | ForEach-Object {
$text +=$_
if ($text -match '(<ACVS_T>)((.|\n)*)(<\/ACVS_T>)+'){
[xml]$XML = "<Root>" + $text + "</Root>"
$text='' #clear it for next one
$XML.Root.ACVS_T | ForEach-Object {
$Obj = '' | Select-Object -Property ACVS_D, ACVS_TI, ACVS_C, ACVS_S, ACVS_M, ACVS_ST
$Obj.ACVS_D = $_.ACVS_D
$Obj.ACVS_TI = $_.ACVS_TI
$Obj.ACVS_C = $_.ACVS_C
$Obj.ACVS_S = $_.ACVS_S
$Obj.ACVS_M = $_.ACVS_M
$Obj.ACVS_ST = $_.ACVS_ST
write-host "`n`n$($Obj.ACVS_M)"
}
}
}
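The accumulate-until-closing-tag idea in the solution above can also be packaged as a small pipeline function, which keeps the buffering separate from the XML handling. This is only a sketch under assumptions: the name Join-TraceBlock is hypothetical, each block is assumed to end on a line containing </ACVS_T>, and the sample lines are a shortened copy of the log excerpt in the question.

```powershell
# Hypothetical helper: collects streamed lines and emits one string per
# complete <ACVS_T>...</ACVS_T> block.
function Join-TraceBlock {
    param(
        [Parameter(ValueFromPipeline)]
        [string]$Line
    )
    begin { $buffer = New-Object System.Collections.Generic.List[string] }
    process {
        $buffer.Add($Line)
        if ($Line -match '</ACVS_T>') {
            # Closing tag seen: emit the whole block and start over.
            $buffer.ToArray() -join "`n"
            $buffer.Clear()
        }
    }
}

# Shortened sample of the log format from the question.
$sample = '<ACVS_T>', '<ACVS_C>ClusterPort</ACVS_C>', '<ACVS_M>',
          'ERROR MESSAGE FROM APP', '</ACVS_M>', '</ACVS_T>'

$blocks = @($sample | Join-TraceBlock)
$xml = [xml]$blocks[0]
$xml.ACVS_T.ACVS_M.Trim()
```

With the live log, Get-Content $file -Wait -Tail 0 would take the place of $sample on the left of the pipe, and each emitted block could be cast to [xml] in a following ForEach-Object.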

Powershell search regular expression in many Microsoft Word documents

I have a powershell script to search a string from word files as follows:
$searchStr = "ABC/2014/N/123"
$files = gci -path "c:\doc","d:\doc" -include "*.doc*","*.tp?" -recurse
$word = new-object -ComObject "word.application"
foreach ($file in $files) {
$doc = $word.documents.open($file.fullname)
if ($doc.content.find.execute($searchStr)) {
echo $file.fullname
}
$doc.close()
}
I want to enhance it to enable regex and insert it before the .execute() as this:
set $doc.content.find.text = "[A-Z]{2-5}\/[0-9]{4}\/N\/[0-9]{3}"
set $doc.content.find.matchwildcards = $true
However, it complains that the properties are read-only.
So I tried passing them as parameters to .execute():
PS C:\> $doc.content.find.execute
OverloadDefinitions
-------------------
bool Execute (Variant, Variant, Variant, Variant, Variant, Variant, Variant,
Variant, Variant, Variant, Variant, Variant, Variant, Variant, Variant)
How can I do it like this?
$doc.content.find.execute(Text:="[A-Z]{2-5}\/[0-9]{4}\/N\/[0-9]{3}", matchwildcards:=$true)
Many thanks.
It works now, thank you.
$searchStr = "CIS\/[0-9]{4}\/N\/[0-9]{3}"
$isRegexp = $true
$files = gci -path "g:\spc","h:\spc" -include "*.doc*","*.tp?" -recurse
$default = [Type]::Missing
$word = new-object -ComObject "word.application"
foreach ($file in $files) {
$doc = $word.documents.open($file.fullname)
# expression.Execute(FindText, MatchCase, MatchWholeWord, MatchWildcards,
# MatchSoundsLike, MatchAllWordForms, Forward, Wrap, Format, ReplaceWith,
# Replace, MatchKashida, MatchDiacritics, MatchAlefHamza, MatchControl)
if ($doc.content.find.execute($searchStr,$default,$default,$isRegexp)) {
echo $file.fullname
}
$doc.close()
}