List All Matches Found in Word Doc - regex

I'm able to find the first match in each document that I'm searching, but am unable to list all matches found in each document when there are multiple matches. I've tried multiple ways of iterating through the matches hash table, but can't seem to get it right. Is there a way to do this?
$RX = "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.|dot|\[dot\]|\[\.\])){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
$WordFiles = Get-ChildItem $Directory -include *.doc, *.docx -recurse
$Directory = "c:\temp"
$objWord = New-Object -Com Word.Application
foreach ($fileSearched in $WordFiles) {
$objWord.Visible = $false
$objWord.DisplayAlerts = "wdAlertsNone"
$objDocument = $objWord.Documents.Open("$fileSearched")
if ($objdocument.Content.Text -match $RX){
Foreach ($found in $_.Matches) { #| ForEach-Object {$_.Value}
$file2.WriteLine("{0},{1}",$matches[$_], $filesearched.fullname)
write-host $_.matches
write-host $_.value
write-host $found
}
}
$file2.close()
}
$objWord.Quit()

PowerShell's -match operator returns only the first match (it populates the automatic $Matches variable with that single match), and as far as I know there is no way to make it match globally.
You can, however, switch to the [regex] class's Matches method, which returns every match by default.
([regex]::matches($objdocument.Content.Text, $RX))
UPDATE
I believe you will also need to read $_.Value rather than $_.Matches on each result, per the examples here.
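A minimal sketch of the difference, using a made-up sample string in place of $objDocument.Content.Text (the sample text and the $first/$all names are assumptions for illustration, not part of the original script). Note that [regex]::Matches is case-sensitive by default, whereas -match is not.

```powershell
# Hypothetical sample text standing in for $objDocument.Content.Text
$text = 'Hosts: 10.1.1.1, 192[.]168[.]0[.]5 and 172dot16dot0dot9.'
$RX = '(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.|dot|\[dot\]|\[\.\])){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# -match stops at the first hit and fills the automatic $Matches hashtable
$first = $null
if ($text -match $RX) { $first = $Matches[0] }

# [regex]::Matches returns a MatchCollection holding every hit
$all = @([regex]::Matches($text, $RX) | ForEach-Object { $_.Value })

"First match only: $first"
"All matches: $($all -join ', ')"
```

Each element of the collection is a Match object, so $_.Value is the matched text and $_.Index its position in the input.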

I reviewed the link provided by cchamberlain and came up with:
$CSV = "c:\temp\output.csv"
$RX = "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.|dot|\[dot\]|\[\.\])){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
$Directory = "c:\temp" # must be set before Get-ChildItem uses it
$WordFiles = Get-ChildItem $Directory -include *.doc, *.docx -recurse
$objWord = New-Object -Com Word.Application
$file2 = new-object System.IO.StreamWriter($CSV,$true) # append to the file, or create it
$file2.WriteLine('Matches,File_Path') # write header
foreach ($fileSearched in $WordFiles) {
$objWord.Visible = $false
$objWord.DisplayAlerts = "wdAlertsNone"
$objDocument = $objWord.Documents.Open($fileSearched.FullName)
# [regex]::Matches returns every match, not just the first
$words = ([regex]::matches($objDocument.Content.Text,$RX) | %{$_.value})
foreach ($word in $words){
$file2.WriteLine("{0},{1}",$word, $fileSearched.FullName)
}
$objDocument.Close()
}
# close the writer and Word once, after all files are processed
$file2.close()
$objWord.Quit()

Related

PowerShell to match multiple lines with regex pattern

I wrote a PowerShell script and regex to search two config text files for management VLAN entries. For example, each text file has two management VLANs configured as below:
Config1.txt
123 MGMT_123_VLAN
234 MGMT_VLAN_234
Config2.txt
890 MGMT_VLAN_890
125 MGMT_VLAN_USERS
Below is my script. It has several problems.
First, if I run the script with $Mgmt_vlan = Select-String -Path $File -Pattern $String -AllMatches, the screen output shows the expected four (4) management VLANs, but the CSV file shows the following:
Filename Mgmt_vlan
Config1.txt System.Object[]
Config2.txt System.Object[]
That is, the console shows exactly the four (4) management VLANs I expected, but the CSV file shows only System.Object[] in the Mgmt_vlan column.
Second, if I run the script with $Mgmt_vlan = Select-String -Path $File -Pattern $String | Select -First 1,
then the CSV shows the following:
Filename Mgmt_vlan
Config1.txt 123 MGMT_123_VLAN
Config2.txt 890 MGMT_VLAN_890
The second method, Select -First 1, appears to select only the first match in the file. When I change it to Select -First 2, the CSV again shows the Mgmt_Vlan column as System.Object[].
The output on screen shows exactly four (4) management VLANs, as expected.
$folder = "c:\config_folder"
$files = Get-childitem $folder\*.txt
Function find_management_vlan($Text)
{
$Vlan = @()
foreach($file in files) {
Mgmt_Vlan = Select-String -Path $File -Pattern $Text -AllMatches
if($Mgmt_Vlan) # if there is a match
{
$Vlan += New-Object -PSObject -Property @{'Filename' = $File; 'Mgmt_vlan' = $Mgmt_vlan}
$Vlan | Select 'Filename', 'Mgmt_vlan' | export-csv C:\documents\Mgmt_vlan.csv
$Mgmt_Vlan # test to see if it shows correct matches on screen and yes it did
}
else
{
$Vlan += New-Object -PSObject -Property @{'Filename' = $File; 'Mgmt_vlan' = "Mgmt Vlan Not Found"}
$Vlan | Select 'Filename', 'Mgmt_vlan' | Export-CSV C:\Documents\Mgmt_vlan.csv
}
}
}
find_management_vlan "^\d{1,3}\s.MGMT_"
Regex correction
First of all, there are a lot of mistakes in this code.
So this is probably not code that you actually used.
Secondly, that pattern will not match your strings. "^\d{1,3}\s.MGMT_" matches one to three digits, one whitespace character (\s is equivalent to [\r\n\t\f\v ]), any single character (except line terminators), and then the literal MGMT_. The extra . requires a character between the whitespace and MGMT_ that your lines do not have, so this is not really what you want. In your case you can use, for example, ^\d{1,3}\sMGMT_, or \s+ instead of \s to allow more than one whitespace character.
Code Correction
Now back to your code. You create the array $Vlan; that's OK.
After that, you try to get all matching lines (in your case two per file) and create a PSObject whose two properties hold complex objects: one is a FileInfo (from System.IO), the other is the array of results returned by Select-String. Export-Csv calls .ToString() on every property of each object it processes. Calling .ToString() on an array (here Mgmt_vlan) returns "System.Object[]", the default implementation. So you need a collection of "flat" objects if you want to produce a CSV from it.
The second big mistake is writing a function with more than one responsibility. Your function both gathers the data and exports it. That's a big no-no. Repair your code and move the export somewhere else. You can use, for example, something like this (I used Get-Content because I like it more, but you can use whatever you want to get your collection of strings):
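A small sketch of that flattening problem, using ConvertTo-Csv so nothing is written to disk. The $row object and its values are hypothetical, modelled on the Config1.txt example above.

```powershell
# Hypothetical row whose Mgmt_vlan property holds an array, as in the question
$row = [pscustomobject]@{
    Filename  = 'Config1.txt'
    Mgmt_vlan = @('123 MGMT_123_VLAN', '234 MGMT_VLAN_234')
}

# The array property is stringified with .ToString(), giving System.Object[]
$nested = @($row | ConvertTo-Csv -NoTypeInformation)

# One flat object per value gives one usable CSV row per match
$flat = @($row.Mgmt_vlan | ForEach-Object {
    [pscustomobject]@{ Filename = $row.Filename; Mgmt_vlan = $_ }
} | ConvertTo-Csv -NoTypeInformation)

$nested
$flat
```

Joining the values into one string (for example with -join '; ') is another way to keep a single row per file, at the cost of a less regular CSV.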
function Get-ManagementVlans($pattern, $files)
{
$Vlans = @()
foreach ($file in $files)
{
# use $found rather than $matches, which is an automatic variable set by -match
$found = (Get-Content $file.FullName -Encoding UTF8).Where({$_ -imatch $pattern})
if ($found)
{
# one flat object per matching line, so Export-Csv gets plain strings
$Vlans += $found | % { New-Object -TypeName PSObject -Property @{'Filename' = $file.Name; 'Mgmt_vlan' = $_.Trim()} }
}
else
{
$Vlans += New-Object -TypeName PSObject -Property @{'Filename' = $file.Name; 'Mgmt_vlan' = "Mgmt Vlan Not Found"}
}
}
return $Vlans
}
function Export-ManagementVlans($path, $data)
{
# exporting is kept separate from gathering
$data | Select Filename,Mgmt_vlan | Export-Csv "$path\Mgmt_vlan.csv" -Encoding UTF8 -NoTypeInformation
}
$folder = "C:\temp\soHelp"
$files = dir "$folder\*.txt"
$Vlans = Get-ManagementVlans -pattern "^\d{1,3}\sMGMT_" -files $files
$Vlans
Export-ManagementVlans -path $folder -data $Vlans
Summary
In my opinion, though, writing something like this is overengineering for this case. You can easily do it in a one-liner (although you then get no row when a file contains no match). This is the power of PowerShell:
$pattern = "^\d{1,3}\s+MGMT_"
$path = "C:\temp\soHelp\"
dir $path -Filter *.txt -File | Get-Content -Encoding UTF8 | ? {$_ -imatch $pattern} | select @{l="FileName";e={$_.PSChildName}},@{l="Mgmt_vlan";e={$_}} | Export-Csv -Path "$path\Report.csv" -Encoding UTF8 -NoTypeInformation
or with Select-String:
dir $path -Filter *.txt -File | Select-String -Pattern $pattern -AllMatches | select FileName,@{l="Mgmt_vlan";e={$_.Line}} | Export-Csv -Path "$path\Report.csv" -Encoding UTF8 -NoTypeInformation

Powershell: using regex to retrieve substrings from text/file

I have a bunch of log files that need to be parsed, with some information extracted from them.
A sample line (which, unfortunately, looks like XML after trimming sensitive data):
<SerialNumber>xxxxxxxxx</SerialNumber><IP>X.X.X.X</IP><UserID>user@domain.com</UserID><NumOfFiles>1</NumOfFiles><LocaleID>ENU</LocaleID><Vendor>POLYCOM</Vendor><Model>VVX311</Model><Revision>Rev-A</Revision><CurrentTime>2018-03-12T02:42:59</CurrentTime><CurrentModule><FileName>cpe.nbt</FileName><FileVersion>
I want to get the IP (in the IP tags) and the user's e-mail address (between the UserID tags).
My current "solver":
$regex = "<UserID>"
$files = Get-ChildItem -path 'c:\path\*.log'
foreach ($infile in $files) {
$res = select-string -Path $infile -Pattern $regex -AllMatches {
$txt = $res[$res.count-1]
# get user
$pos1= $txt.line.IndexOf("<UserID>")
$pos2= $txt.line.IndexOf("</UserID>")
$Puser = $txt.Line.Substring($pos1+8,$pos2-$pos1-8)
....
}
It works, but I wonder whether a different approach would be better. I would like to see how this could be done with
select-string -pattern ...
I tried several "GUI" regex builders, but I can't figure out how to select what's needed.
Thanks
PS:
Result after
$regex = '<IP>(.*)</IP>'
$res = select-string -Path $infile -Pattern $regex
$res
0312092535|cfg |4|00|DevUpdt|[LyncDeviceUpdateC::prepareAndSendRequest] '<?xml version="1.0" encoding="utf-8"?><Request><DeviceType>3PIP</DeviceType><MacAddress>11-11-11-11-11-11</MacAddress><SerialNumber>111111111111</SerialNumber><IP>10.1.1.1</IP><UserID>user@domain.com</UserID><NumOfFiles>1</NumOfFiles><LocaleID>ENU</LocaleID><Vendor>POLYCOM</Vendor><Model>VVX311</Model><Revision>Rev-A</Revision><CurrentTime>2018-03-12T09:25:35</CurrentTime><CurrentModule><FileName>cpe.nbt</FileName><FileVersion><Major>5</Major><M
Sample of log file (100Kb+)
0312104211|nisvc|2|00|Invoker's nCommands,CurrentKey:2,(106)Responder
0312104211|nisvc|2|00|Response(-1)nisvc,(-1),(-1)app,(22),(Expiry,TransactionId,Time,Type):(-1,-1,1520844131,1)IndicationCode:(400)
0312104211|app1 |5|00|[CWPADServiceEwsRsp::execute] PAC file failed with ''
0312104301|cfg |4|00|DevUpdt|[LyncDeviceUpdateC::prepareAndSendRequest] '<?xml version="1.0" encoding="utf-8"?><Request><DeviceType>3PIP</DeviceType><MacAddress>11-11-11-11-11-11</MacAddress><SerialNumber>64167F2A8451</SerialNumber><IP>10.1.1.1</IP><UserID>user@domain.com</UserID><NumOfFiles>1</NumOfFiles><LocaleID>ENU</LocaleID><Vendor>POLYCOM</Vendor><Model>VVX311</Model><Revision>Rev-A</Revision><CurrentTime>2018-03-12T10:43:00</CurrentTime><CurrentModule><FileName>cpe.nbt</FileName><FileVersion><Major>5</Major><Minor>
0312104301|nisvc|2|00|Request(-1)nisvc,(701)NIServiceHttpReqMsgKey,(-1)proxy,(1001)AuthRsp,(Expiry,TransactionId,Time,Type):(45000,1306758696,1520844181,0)IndicationLevel:(200)
This code will get all the files, read each file line by line and create objects with a user and ip and put them in an array.
[regex]$ipUserReg = '(?<=<IP>)(.*)(?:<\/IP><UserID>)(.*)(?=<\/UserID>)'
$files = Get-ChildItem $path -filter *.log
$users = @(
foreach ($fileToSearch in $files) {
$file = [System.IO.File]::OpenText($fileToSearch.FullName)
while (!$file.EndOfStream) {
$text = $file.ReadLine()
# match once per line and reuse the result
$m = $ipUserReg.Match($text)
if ($m.Success) {
New-Object psobject -Property @{
IP = $m.Groups[1].Value
User = $m.Groups[2].Value
}
}
}
$file.Close()
})
To build my regexes I often use regexr.com, but keep in mind that PowerShell uses the .NET regex engine, which differs in some details from the flavors such sites test against.
Edit: Here is an example using select-string rather than reading line by line:
[regex]$ipUserReg = '(?<=<IP>)(.*)(?:<\/IP><UserID>)(.*)(?=<\/UserID>)'
$files = Get-ChildItem $path -filter *.log
$users = @(
foreach ($fileToSearch in $files) {
Select-String -Path $fileToSearch.FullName -Pattern $ipUserReg -AllMatches | ForEach-Object {
$_.Matches | ForEach-Object{
New-Object psobject -property @{
IP = $_.Groups[1].Value
User = $_.Groups[2].Value
}
}
}
}
)
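To try the extraction without any log files, here is a minimal sketch that runs against a single in-memory line. The sample line is a shortened, hypothetical copy of the log entry above, and the lazy quantifiers and named groups are my own variation on the answer's pattern rather than its exact form.

```powershell
# Hypothetical, shortened log line modelled on the sample in the question
$line = '0312104301|cfg |4|00|DevUpdt|<Request><IP>10.1.1.1</IP><UserID>user@domain.com</UserID><NumOfFiles>1</NumOfFiles>'

# Lazy (.*?) stops at the first closing tag; named groups avoid index bookkeeping
$pattern = '<IP>(?<ip>.*?)</IP><UserID>(?<user>.*?)</UserID>'

$result = $null
$m = [regex]::Match($line, $pattern)
if ($m.Success) {
    $result = [pscustomobject]@{
        IP   = $m.Groups['ip'].Value
        User = $m.Groups['user'].Value
    }
}
$result
```

The same pattern can be passed to Select-String -Pattern, in which case the groups are read from each entry in $_.Matches.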

Powershell Regex match and not match in a Foreach If-then not working

Hows that for a title?
I have this script I've been working on that does two basic things: a) uses Get-NTFSAccess to pull the security for a folder, and b) uses the output to look up the members of the groups that have access.
$Outfile2 = "C:\Users\local\Documents\GroupMembers.csv"
$Header2 = "GroupName,Member"
Add-Content -Value $Header2 -Path $Outfile2
$RootPath = "p:\city\Department\building"
$Folders = get-childitem2 -directory -recurse -path $RootPath
foreach ($Folder in $Folders){
$ACLs = Get-NTFSAccess $Folder.fullname
Foreach ($ACL in $ACLs){
If ($Acl.accounttype -match 'group' -and $acl.Account.accountname -notmatch '^builtin|^NT AUTHORITY\\|^Creator|^AD\\Domain')
{
$members = Get-ADGroupMember $acl.Account.accountname.TrimStart("AD\\")
}
Foreach ($member in $members) {
$OutInfo = $ACL.Account.AccountName + "," + $member.samaccountname
Add-Content -Value $OutInfo -Path $OutFile2
}
}}
I'd like to be able to filter the output of Get-NTFSAccess. I want to look up only 'groups', and only groups that aren't the base groups (like builtin, domain admins, etc.), but my -match and -notmatch aren't working in the script. If I take that exact same line and run it from the prompt, it works.
PS C:\Windows\system32> $acl.Account.accountname -notmatch '^builtin|^NT AUTHORITY\\|^Creator|^AD\\Domain'
True
When run as part of the script, it doesn't work. My output includes all of the domain base groups and users. I'd also like to eventually add -unique to get only unique groups, but this part has me stumped.
Thanks in advance...!
I did this with success:
((dir)[0] | get-acl).access | % { $_.IdentityReference } | ? { $_ -notmatch 'builtin|nt authority' }
I cannot test with ntfsaccess at the moment but get-acl's returned IdentityReference is most likely the same field you are attempting to parse on. You might just try removing your '^'s. I also tested with "myDomain\\Domain Admins" and that worked as expected.
So I figured it out.
Three main things:
1. TrimStart wasn't stripping the 'AD\' prefix the way I expected, no matter how I tried to 'escape' the backslash
2. I had to use Get-ADGroup and pipe its output to Get-ADGroupMember
3. The If block was scoped wrong: the Foreach that writes each result sat outside it, so results were written on every iteration through $ACLs
$Outfile2 = "C:\Users\local\Documents\GroupMembers.csv"
$Header2 = "GroupName,Member"
Add-Content -Value $Header2 -Path $Outfile2
$RootPath = "p:\city\Department\building"
$Folders = get-childitem2 -directory -recurse -path $RootPath
foreach ($Folder in $Folders){
$ACLs = Get-NTFSAccess $Folder.fullname
Foreach ($ACL in $ACLs){
If ($Acl.accounttype -match 'group' -and $acl.Account.accountname -notmatch '^builtin|^NT AUTHORITY\\|^Creator|^AD\\Domain')
{$members = Get-adgroup $acl.Account.accountname.substring(3) | Get-ADGroupMember
Foreach ($member in $members) {
$OutInfo = $ACL.account.AccountName + "," + $member.samaccountname
Add-Content -Value $OutInfo -Path $OutFile2
}}}}
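One likely reason TrimStart misbehaved: it treats its argument as a set of characters to strip, not as a prefix, and backslash is not an escape character in PowerShell strings. A small sketch with a hypothetical group name ('AD\Accounting' is an assumption for illustration):

```powershell
$name = 'AD\Accounting'   # hypothetical account name

# TrimStart removes ANY leading characters from the set {A, D, \},
# so it also eats the leading "A" of the group name itself
$trimmed = $name.TrimStart([char[]]'AD\')

# An anchored regex replace removes exactly the prefix (\\ is a literal backslash in regex)
$replaced = $name -replace '^AD\\', ''

"TrimStart: $trimmed"
"-replace:  $replaced"
```

Substring(3), as used in the working script, also removes exactly three characters, but it assumes every account name starts with the three-character 'AD\' prefix.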

Find all occurrences of a wildcard in a Microsoft Word document with PowerShell

I need to get all the occurrences of a wildcard (regex) pattern in a Microsoft Word document with PowerShell.
I found this solution, but it only gets the first occurrence of the wildcard:
"How do I make powershell search a Word document for wildcards and return the word it found?"
Can you help me?
In the end I used the code from the related question with a few changes.
In the loop that searches the document content for each wildcard ($findTexts), I use a temporary copy of the content ($tempContent), initialized with the value of $document.Content.Text. Each time a wildcard match is found, I remove all occurrences of it from the temporary content, and I repeat the process (while) for as long as the wildcard (regex) is still found.
$filePath = "C:\files\"
$textPath = "C:\strings.txt"
$outputPath = "C:\output.txt"
$findTexts = (Get-Content $textPath)
$docs = Get-childitem -path $filePath -Recurse -Include *.docx
$application = New-Object -comobject word.application
Foreach ($doc in $docs)
{
$document = $application.documents.open("$doc", $false, $true)
$application.visible = $False
$matchCase = $false
$matchWholeWord = $false
$matchWildCards = $true
$matchSoundsLike = $false
$matchAllWordForms = $false
$forward = $true
$wrap = 1
$range = $document.content
$null = $range.movestart()
Foreach ($findtext in $findTexts)
{
#Set tempContent to the content of the document
$tempContent=$document.Content.Text
while ($tempContent -match "\b$($findText)\w+\b")
{
#Remove all occurrences of the match found from tempContent (escaped so it is treated as literal text)
$tempContent=$tempContent -replace [regex]::Escape($matches[0]),''
$docName = $doc.Name
"$($matches[0])`t$docName" | Out-File -append $outputPath
}
} #end foreach $findText
$document.close()
} #end foreach $doc
$application.quit()
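As an alternative to the while/-replace loop, [regex]::Matches can collect every occurrence in one pass. This is a sketch under assumptions: $content stands in for $document.Content.Text and 'INV' for one line of strings.txt, both made up for illustration.

```powershell
# Hypothetical stand-ins for $document.Content.Text and one entry of $findTexts
$content  = 'Orders INV1001 and INV2002 were paid; INV1001 was refunded.'
$findText = 'INV'

# 'IgnoreCase' mirrors the case-insensitive behaviour of -match
$found = [regex]::Matches($content, "\b$findText\w+\b", 'IgnoreCase')

# Keep each distinct value once, as the loop above does
$hits = @($found | ForEach-Object { $_.Value } | Select-Object -Unique)
$hits
```

Dropping Select-Object -Unique reports every occurrence, including repeats, which the remove-and-retry loop cannot do.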

Powershell 'where' statement -notcontains

I have a simple excerpt from a larger script. Basically, I'm trying to do a recursive file search through sub-directories while skipping an excluded directory (and any child of it).
clear
$Exclude = "T:\temp\Archive\cst"
$list = Get-ChildItem -Path T:\temp\Archive -Recurse -Directory
$list | where {$_.fullname -notlike $Exclude} | ForEach-Object {
Write-Host "--------------------------------------"
$_.fullname
Write-Host "--------------------------------------"
$files = Get-ChildItem -Path $_.fullname -File
$files.count
}
At the moment this script will exclude the T:\temp\Archive\cst directory, but not the T:\temp\Archive\cst\artwork directory. I'm struggling to overcome this simple thing.
I've tried the -notlike (which I didn't really expect to work) but also the -notcontains which I was hopeful of.
Can anyone offer any advice? I'm thinking it would require a regex match, which I'm reading up on now but am not very familiar with.
In the future the $exclude variable will be an array of strings (directories), but at the moment I'm just trying to get it working with a single string.
Try:
where {$_.fullname -notlike "$Exclude*"}
You could also try
where {$_.fullname -notmatch [regex]::Escape($Exclude) }
but the -notlike approach is easier.
When used without wildcards the -like operator does the same as the -eq operator. If you want to exclude a folder T:\temp\Archive\cst and everything below it, you need something like this:
$Exclude = 'T:\temp\Archive\cst'
Get-ChildItem -Path T:\temp\Archive -Recurse -Directory | ? {
$_.FullName -ne $Exclude -and
$_.FullName -notlike "$Exclude\*"
} | ...
-notlike "$Exclude\*" would only exclude subfolders of $Exclude, not the folder itself, and -notlike "$Exclude*" would also exclude folders like T:\temp\Archive\cstring, which may be undesired.
The -contains operator is used to check if a list of values contains a particular value. It doesn't check if a string contains a particular substring.
See Get-Help about_Comparison_Operators for further information.
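The difference between the two patterns can be checked on plain strings, with no file system involved. The path list below is hypothetical and chosen to include the cstring case mentioned above.

```powershell
$Exclude = 'T:\temp\Archive\cst'

# Hypothetical directory names, including a sibling that merely starts with "cst"
$paths = 'T:\temp\Archive\cst',
         'T:\temp\Archive\cst\artwork',
         'T:\temp\Archive\cstring',
         'T:\temp\Archive\as'

# Loose: drops everything that starts with the excluded path, cstring included
$loose = @($paths | Where-Object { $_ -notlike "$Exclude*" })

# Strict: drops only the folder itself and what lies below it
$strict = @($paths | Where-Object { $_ -ne $Exclude -and $_ -notlike "$Exclude\*" })

"Loose:  $($loose -join ', ')"
"Strict: $($strict -join ', ')"
```

If an excluded path could contain wildcard characters such as [ or ], it would need escaping before being used with -like.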
Try changing
$Exclude = "T:\temp\Archive\cst"
To:
$Exclude = "T:\temp\Archive\cst\*"
This will still return the folder CST as it is a child item of Archive, but will exclude anything under cst.
Or:
$Exclude = "T:\temp\Archive\cst*"
But that will also exclude any files that start with "cst" under Archive. The same goes for Graimer's answer; just be aware of the trailing \ and whether it's important to what you are doing.
For those looking for a similar answer, what I ended up going with (to parse an array of paths for a wildcard match):
# Declare variables
[string]$rootdir = "T:\temp\Archive"
[String[]]$Exclude = "T:\temp\Archive\cst", "T:\temp\archive\as"
[int]$days = 90
# Create Directory list minus excluded directories and their children
$list = Get-ChildItem -Path $rootdir -Recurse -Directory | where {$path = $_.fullname; -not @($exclude | ? {$path -like $_ -or $path -like "$_\*" }) }
Provides what I needed.
Thought I would add to this, as I recently had a similar problem answered. You can use the -notcontains condition, but the counterintuitive part is that the $exclude array needs to be at the start of the expression.
Here is an example.
If I perform the following no items are excluded and it returns "a","b","c","d"
$result = @()
$ItemArray = @("a","b","c","d")
$exclusionArray = @("b","c")
$ItemArray | Where-Object { $_ -notcontains $exclusionArray }
If I switch the variables around in the expression then it works and returns "a","d".
$result = @()
$ItemArray = @("a","b","c","d")
$exclusionArray = @("b","c")
$ItemArray | Where-Object { $exclusionArray -notcontains $_ }
I am not sure why the arrays have to be this way around to work. If anyone else can explain that would be great.
EDITED 12/12/20 - I now know that the other operation to use is "-in" as in
$_ -notin $exclusionArray
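The operand order follows from how the operators are defined: -contains and -notcontains expect the collection on the left and the single value on the right, while -in and -notin take them the other way round. A short sketch of all three forms, reusing the arrays above (the $wrong, $right and $alsoRight names are mine):

```powershell
$ItemArray      = @('a', 'b', 'c', 'd')
$exclusionArray = @('b', 'c')

# Value on the left, collection on the right: nothing is filtered out
$wrong = @($ItemArray | Where-Object { $_ -notcontains $exclusionArray })

# Collection on the left, value on the right: b and c are removed
$right = @($ItemArray | Where-Object { $exclusionArray -notcontains $_ })

# -notin is the mirrored operator: value on the left, collection on the right
$alsoRight = @($ItemArray | Where-Object { $_ -notin $exclusionArray })

"wrong:     $($wrong -join ',')"
"right:     $($right -join ',')"
"alsoRight: $($alsoRight -join ',')"
```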