Regex works on Regex101 but not inside PowerShell

I have this Text file:
[Tabs]
MAILBOXSEND=1
MAILBOX=8
USERS=6
DOCUMENTS_Q=9
MED_WEBSERVCALLS_LOA=3
FCLMNA=1
INCZOOMFORM=1
USERSB=1
USERSB_ONE=1
DATAPRIV=1
MED_WEBSERVCALLS=2
TINVOICES=1
PORDERS=9
PORDERSTOTAL=1
LOGPART=1
LOGCOUNTERS=1
PARTMSG=1
[External Mail]
Send=Y
Hostname=Server
Domain=Domain
Myemail=My@email.com
MyName=My Test
Port=25
SSL=0
[Search]
SUPPLIERS=5,1
StartButton=1
Ignore Case=0
PART=6,1
I'm trying to capture all the text between [External Mail] and the next bracketed section header.
I have this regex, which does the job when tested on Regex101, but after all my testing I found it does not work inside PowerShell:
$Text = Get-Content c:\text.txt
$Text -match '(?s)(?<=\[External Mail\]).*?(?=\[.*?\])'
or:
$Text | Select-String '(?s)(?<=\[External Mail\]).*?(?=\[.*?\])'
Nothing is returned.
Do you have any idea what I'm missing?
Thanks

It looks like you are parsing an .INI file. Don't reinvent the wheel; take advantage of existing code. This solution reads the .INI file into nested hash tables that are easy to work with.
In case of link rot, here's the function from Scripting Guys archive:
function Get-IniContent ($filePath)
{
    $ini = @{}
    switch -regex -file $FilePath
    {
        "^\[(.+)\]" # Section
        {
            $section = $matches[1]
            $ini[$section] = @{}
            $CommentCount = 0
        }
        "^(;.*)$" # Comment
        {
            $value = $matches[1]
            $CommentCount = $CommentCount + 1
            $name = "Comment" + $CommentCount
            $ini[$section][$name] = $value
        }
        "(.+?)\s*=(.*)" # Key
        {
            $name,$value = $matches[1..2]
            $ini[$section][$name] = $value
        }
    }
    return $ini
}
# Sample usage:
$i = Get-IniContent c:\temp\test.ini
$i["external mail"]
Name                           Value
----                           -----
Domain                         Domain
SSL                            0
Hostname                       Server
Send                           Y
MyName                         My Test
Port                           25
Myemail                        My@email.com
$i["external mail"].hostname
Server

Since you are trying to get a multiline regex match, you need to work against a single multiline string. That is the difference between your two cases, Regex101 and PowerShell. Get-Content returns an array of strings, one per line, so your regex matched nothing because it was tested against each line of the file individually.
PowerShell 2.0
$Text = Get-Content c:\text.txt | Out-String
PowerShell 3.0 or higher
$Text = Get-Content c:\text.txt -Raw
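To see the difference without touching a file, here is a minimal sketch. The sample lines are a shortened, hypothetical stand-in for the question's file, and the joined string plays the role of what Get-Content -Raw would return:

```powershell
# Shortened stand-in for the file content (assumption: not the full file)
$lines = @('[Tabs]', 'PARTMSG=1', '[External Mail]', 'Send=Y', 'Port=25', '[Search]', 'PART=6,1')
$pattern = '(?s)(?<=\[External Mail\]).*?(?=\[.*?\])'

# Array input: -match tests each line on its own, so nothing matches
$perLineHits = @($lines -match $pattern)

# Single multiline string: the lookbehind and lookahead can span lines
$Text = $lines -join "`n"
if ($Text -match $pattern) {
    $section = $Matches[0].Trim()
}
$section
```

Note that the lookahead requires a following bracketed header, so this pattern would find nothing if [External Mail] were the last section in the file.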
As I said in my comments, you don't really need regex for this type of string extraction. Scripts that parse INI content already exist. If you intend to replace content, you would have to find the partner cmdlet, Out-INIContent, assuming it exists, but I am sure someone has written it. vonPryz's answer contains more information on the cmdlet.

Related

Extract "Keywords" from a pdf plus the next 200 characters from the keyword in Windows Powershell

I have a PowerShell script that searches PDF documents for a keyword. What I need is the keyword plus the next 200 characters.
The keyword in the script below is "Address", and a regex is used to find it. I have tried several approaches, but I am no expert in this.
Also, the script currently writes its output to the PowerShell console. Is there a way to get the output in CSV format?
Below is the code:
$pdflist = Get-ChildItem -Path "C:\Users\U6013303\Desktop\Muni Refresh\DOC\old\4295479598" -Filter "*.pdf"
foreach ($pdff in $pdflist){
Add-Type -Path "C:\Users\U6013303\Desktop\Muni Refresh\Archives\itextsharp.dll"
$pdffile = $pdff.Name
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "C:\Users\U6013303\Desktop\Muni Refresh\DOC\old\4295479598\$pdffile"
Write-Host "Reading file $pdffile" -BackgroundColor Black -ForegroundColor Green
for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
$strategy = new-object 'iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy'
$currentText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy);
[string[]]$Text += [system.text.Encoding]::UTF8.GetString([System.Text.ASCIIEncoding]::Convert( [system.text.encoding]::default , [system.text.encoding]::UTF8, [system.text.Encoding]::Default.GetBytes($currentText)));
}
$Text
[regex]::matches( $text, '(Address)' ) | select *
$reader.Close()
}
Thanks,
Garry
Simple change to your regex:
$results = [regex]::matches($text, 'Address.{200}')
Export to CSV:
$results | ConvertTo-Csv
# or
$results | Export-Csv -Path "c:\your-path\results.csv"
Or if you want just the actual values:
$results | select -ExpandProperty Value
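One caveat, shown as a hedged sketch: `.{200}` needs exactly 200 characters after the keyword, and `.` stops at a line break by default. Assuming the extracted text can be shorter than that or can contain newlines, a more forgiving variant is `(?s)Address.{0,200}`. The sample text here is hypothetical:

```powershell
# Hypothetical sample text; in the question this comes from the PDF pages
$text = "Intro line`nAddress: 12 Example Street`nSpringfield"

# Strict version: finds nothing because fewer than 200 characters follow
$strictCount = [regex]::Matches($text, 'Address.{200}').Count

# (?s) lets . match newlines; {0,200} accepts up to 200 trailing characters
$results = [regex]::Matches($text, '(?s)Address.{0,200}')
$values = @($results | ForEach-Object { $_.Value })
$values
```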
Take a look at this example to see the change you need to make.
$Data = "
Match Sequence using RegEx After a Specified Character ...
https://stackoverflow.com/questions/10768924/match...
You have the correct regex only the tool you're using is highlighting the entire match and not just your capture group. Hover over the match.
"
[regex]::matches( $Data, 'You').value
# Results
<#
You
#>
[regex]::matches( $Data, 'You.{50}').value
# Results
<#
You have the correct regex only the tool you're using
#>
[regex]::matches( $Data, 'You.{100}').value
# Results
<#
You have the correct regex only the tool you're using is highlighting the entire match and not just you
#>
Notice the '.Value' property: "[regex]::Matches" does not return a single string but a collection of match objects, so you must read the Value property to get the matched text.

Powershell: Sending an email if output of command differs from the array of 3 strings that is expected

This is what I have so far.
If the command produces a fourth line of output, the regex shouldn't match, but it does for me.
Any ideas?
$From = "cody@tech.com"
$To = "cody@tech.com"
$Subject = "hello"
$Body = "body"
$SMTPServer = "smtp.office365.com"
$SMTPPort = "587"
$Attachment = "c:\checkpointlog.txt"
$RE = [regex]"(?smi)^Checkpointing to checkpoint.*^MD5 \(checkpoint.*^Rotating D:\\Perforce\\Server\\journal to journal.*?$"
$output = p4d -jc
$output | add-content c:\checkpointlog.txt
if ($output -notmatch $RE) {
    $param = @{
        From = $From
        To = $To
        Subject = $Subject
        Body = $Body
        SmtpServer = $SMTPServer
        port = $SMTPPort
        Usessl = $True
        Credential = $cred
        Attachments = $Attachment
    }
    Send-MailMessage @param
    Write-Host 'unexpected value, email sent'
    exit
}
else {
    Write-Host 'continuing script'
}
The output of the Perforce Helix Server command p4d -jc should always look like the following 3 lines:
Checkpointing to checkpoint.22...
MD5 (checkpoint.22) = F561234wer2B8E5123456767745645616D
Rotating D:\Perforce\Server\journal to journal.21...
I would like to use a regex in an if statement so that if the output doesn't match the 3 line string below then an email is sent with the log file for us to inspect.
Checkpointing to checkpoint.*
MD5 (checkpoint.*
Rotating D:\Perforce\Server\journal to journal.*
I am hoping to use a wild card * to account for the incremental number
Any ideas would be great!
Your regex tries to match multiple lines, and therefore needs a single multi-line string as input.
Capturing an external program's output returns an array of strings (lines).
Using an array of string as the LHS of the -match operator causes PowerShell to match the regex against each string individually.
Therefore, join the lines output by p4d with newlines to form a single multi-line string, so that your regex can match multiple lines:
$output = (p4d -jc) -join "`n"
Additionally, if you want to make sure that your regex matches the entire input, not just a substring, restructure your regex as follows:
Remove in-line option m (multi-line) so that ^ and $ truly only match the very start and end of the multi-line string
Remove in-line option s, so that . doesn't match newlines, and match newlines explicitly with \n (instead of ^).
$RE = '(?i)^Checkpointing to checkpoint.*\nMD5 \(checkpoint.*\nRotating D:\\Perforce\\Server\\journal to journal.*$'
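As a quick check of the restructured regex, here is a small sketch. The arrays are simulated stand-ins for what p4d -jc prints (an assumption; no Perforce server is called), and the fourth line is made up to show the failure case:

```powershell
$RE = '(?i)^Checkpointing to checkpoint.*\nMD5 \(checkpoint.*\nRotating D:\\Perforce\\Server\\journal to journal.*$'

# Simulated command output, one array element per line
$good = @(
    'Checkpointing to checkpoint.22...'
    'MD5 (checkpoint.22) = F561234wer2B8E5123456767745645616D'
    'Rotating D:\Perforce\Server\journal to journal.21...'
)
# Same output with a hypothetical unexpected fourth line
$bad = $good + 'Unexpected fourth line'

# Join into one multiline string before matching
$goodMatches = ($good -join "`n") -match $RE
$badMatches  = ($bad -join "`n") -match $RE

"three lines match: $goodMatches"
"four lines match:  $badMatches"
```

Because `.` no longer matches newlines and `$` is anchored to the end of the whole string, the extra line makes the match fail, which is the condition that should trigger the email.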
Build (and test) your regular expression on a website like regex101.com.
I also suggest using splatting:
$RE = [regex]"(?smi)^Checkpointing to checkpoint.*^MD5 \(checkpoint.*^Rotating D:\\Perforce\\Server\\journal to journal.*?$"
$output = p4d -jc
$output | add-content c:\checkpointlog.txt
if ($output -notmatch $RE) {
    $param = @{
        From = $From
        To = $To
        Subject = $Subject
        Body = $Body
        SmtpServer = $SMTPServer
        port = $SMTPPort
        Usessl = $True
        Credential = $cred
        Attachments = 'c:\checkpointlog.txt'
    }
    Send-MailMessage @param
    exit
}
I know you wanted to use RegEx but maybe this would help? It might not be the greatest but it should work.
$file='checkpointlog.txt'
$line1Expected="Checkpointing to checkpoint*"
$line2Expected="MD5 (checkpoint*"
$line3Expected="Rotating D:\Perforce\Server\journal to journal*"
$line1Actual=Get-Content($file) | Select -Index 0
$line2Actual=Get-Content($file) | Select -Index 1
$line3Actual=Get-Content($file) | Select -Index 2
if($line1Actual -like $line1Expected`
-and $line2Actual -like $line2Expected`
-and $line3Actual -like $line3Expected){
}else{
echo 'send mail'
}

PowerShell to get a DLL name out of its full path

I have a string "..\..\xyz\abc\0.0\abc.def.ghi.jkl.dll" and am trying to get the value "abc.def.ghi.jkl.dll" into a variable using PowerShell.
I am totally new to regex and PowerShell and am confused about how to get this done. I have read various posts about regex, but I am unable to get anything to work.
Here is my code,
$str = "..\..\xyz\abc\0.0\abc.def.ghi.jkl.dll"
$regex = [regex] '(?is)(?<=\b\\b).*?(?=\b.dll\b)'
$result = $regex.Matches($str)
Write-Host $result
I would like to get "abc.def.ghi.jkl.dll" into $result. Could someone please help me out?
You can use the following regex:
(?is)(?<=\\)[^\\]+\.dll\b
See regex demo
There is no need to use Matches; just use -match (or Match).
Explanation:
(?<=\\) - make sure there is a \ right before the current position in string
[^\\]+ - match 1 or more characters other than \
\.dll\b - match a . symbol followed by 3 letters dll that are followed by a trailing word boundary.
Powershell:
$str = "..\..\xyz\abc\0.0\abc.def.ghi.jkl.dll"
[regex]$regex = "(?is)(?<=\\)[^\\]+\.dll\b"
$match = $regex.match($str)
$result = ""
if ($match.Success)
{
$result = $match.Value
Write-Host $result
}
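For comparison, the same extraction with the -match operator mentioned above might look like the sketch below. It relies on the automatic $Matches variable, which PowerShell fills in after a successful -match against a single string:

```powershell
$str = "..\..\xyz\abc\0.0\abc.def.ghi.jkl.dll"
$result = ""

# -match is case-insensitive by default, so the (?i) option is not needed here
if ($str -match '(?<=\\)[^\\]+\.dll\b') {
    # $Matches[0] holds the full text of the match
    $result = $Matches[0]
}
Write-Host $result
```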

Is there a way to optimise my Powershell function for removing pattern matches from a large file?

I've got a large text file (~20K lines, ~80 characters per line).
I've also got a largish array (~1500 items) of objects containing patterns I wish to remove from the large text file. Note that if a pattern from the array appears on a line in the input file, I wish to remove the entire line, not just the pattern.
The input file is CSVish with lines similar to:
A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;
The patterns in the array, which I search for on each line of the input file, resemble the
XX000029
part of the line above.
My somewhat naïve function to achieve this goal looks like this currently:
function Remove-IdsFromFile {
param(
[Parameter(Mandatory=$true,Position=0)]
[string]$BigFile,
[Parameter(Mandatory=$true,Position=1)]
[Object[]]$IgnorePatterns
)
try{
$FileContent = Get-Content $BigFile
}catch{
Write-Error $_
}
$IgnorePatterns | ForEach-Object {
$IgnoreId = $_.IgnoreId
$FileContent = $FileContent | Where-Object { $_ -notmatch $IgnoreId }
Write-Host $FileContent.count
}
$FileContent | Set-Content "CleansedBigFile.txt"
}
This works, but is slow.
How can I make it quicker?
function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )
    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"
    If(Test-Path $BigFile){
        $reader = New-Object System.IO.StreamReader($BigFile)
        $line=$reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$line | Add-Content "CleansedBigFile.txt"}
            # Attempt to read the next line.
            $line=$reader.ReadLine()
        }
        $reader.close()
    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}
StreamReader is one of the preferred ways to read large text files. We also build a single regex pattern string to match against. When building it we use [regex]::Escape() as a precaution in case regex control characters are present. I have to guess here, since we only see one pattern string.
If the items in $IgnorePatterns can easily be cast to strings, this should work as is. A small sample of what $regex looks like would be:
XX000029|XX000028|XX000027
If $IgnorePatterns is populated from a database you might have less control over this, but since we are using regex you might be able to shrink the pattern set by writing real regex (instead of one big alternation) for cases like my example above. You could reduce that one to XX00002[7-9], for instance.
I don't know whether the regex itself will provide a performance boost with 1500 alternatives. The StreamReader is supposed to be the focus here. However, I did muddy the waters by using Add-Content for the output, which does not win any awards for speed either (a StreamWriter could be used in its place).
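One detail worth sketching: the question's array holds objects with an IgnoreId property, so escaping `$_` directly would escape the object's string form rather than the ID. Assuming that property name from the question, and using made-up sample data, building the alternation could look like this:

```powershell
# Hypothetical sample data shaped like the question's input
$IgnorePatterns = @(
    [pscustomobject]@{ IgnoreId = 'XX000029' }
    [pscustomobject]@{ IgnoreId = 'XX000027' }
)
$lines = @(
    'A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;'
    'A;AAA-BBB;XXX;XX000030;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;'
)

# Escape each ID and join into one alternation, so each line is tested once
$regex = ($IgnorePatterns | ForEach-Object { [regex]::Escape($_.IgnoreId) }) -join '|'

# Keep only the lines that contain none of the IDs
$kept = @($lines | Where-Object { $_ -notmatch $regex })
$kept
```

The point of the single alternation is that the file is scanned once, instead of once per pattern as in the original function.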
Reader and Writer
I still have to test this to be sure it works, but this version uses only StreamReader and StreamWriter. If it does work better, I am going to replace the code above with it.
function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )
    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"
    If(Test-Path $BigFile){
        # Prepare the StreamReader
        $reader = New-Object System.IO.StreamReader($BigFile)
        # Prepare the StreamWriter
        $writer = New-Object System.IO.StreamWriter("CleansedBigFile.txt")
        $line=$reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$writer.WriteLine($line)}
            # Attempt to read the next line.
            $line=$reader.ReadLine()
        }
        # Don't cross the streams!
        $reader.Close()
        $writer.Close()
    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}
You might need some error handling around the streams, but it does appear to work as is.

Regex to match URL in Powershell

I am new to programming and PowerShell, and I've put together the following script. It parses all the emails in a specified folder and extracts the URLs from them. The script uses a regex pattern to identify the URLs and writes them to a text file. The extracted text is then run through another command where I try to remove the http:// or https:// portion (I need help figuring this out). The results are placed into another text file, which I go through again to remove duplicates.
The main issue I am having is that the regex doesn't appear to extract the URLs correctly. What I am getting is something like the example I have created below:
URL is http://www.dropbox.com/3jksffpwe/asdj.exe
But I end up getting
dropbox.com/3jksffpwe/asdj.exe
dropbox.com
drop
dropbox
The script is
#Adjust paths to location of saved Emails
$in_files = 'C:\temp\*.eml, *.msg'
$out_file = 'C:\temp\Output.txt'
$Working_file = 'C:\temp\working.txt'
$Parsed_file = 'C:\temp\cleaned.txt'
# Removes the old output file from earlier runs.
if (Test-Path $Parsed_file) {
Remove-Item $Parsed_file
}
# regex to parse thru each email and extract the URLs to a text file
$regex = '([a-zA-Z]{3,})://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)*?'
select-string -Path $in_files -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $out_file
#Parses thru the output of urls to strip out the http or https portion
Get-Content $Out_file | ForEach-Object {$_.SubString(7)} | Out-File $Working_file
#Parses thru again to remove exact duplicates
$set = @{}
Get-Content $Working_file | %{
    if (!$set.Contains($_)) {
        $set.Add($_, $null)
        $_
    }
} | Set-Content $Parsed_file
#Removes the files no longer required
Del $out_file, $Working_file
#Confirms if the email messages should be removed
$Response = Read-Host "Do you want to remove the old messages? (Y|N)"
If ($Response -eq "Y") {del *.eml, *msg}
#Opens the output file in notepad
Notepad $Parsed_file
Exit
Thanks for any help
Try this RegEx:
(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)
But remember that the PowerShell -match operator only captures the first match. To capture all matches you could do something like this:
$txt = "https://test.com, http://tes2.net, http:/test.com, http://test3.ro, text, http//:wrong.value"
$hash = @{}
$txt | Select-String -AllMatches '(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)' | % { $hash."Valid URLs" = $_.Matches.Value }
$hash
Best of luck! Enjoy!
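Building on the regex above, here is a hedged sketch of the remaining steps from the question: dropping the scheme and removing duplicates. It assumes that the third capture group (everything after "://") is what should be kept, and it uses a sample string, with one duplicate added, instead of the email files:

```powershell
# Sample input with a repeated URL at the end (hypothetical test data)
$txt = 'https://test.com, http://tes2.net, http:/test.com, http://test3.ro, text, http//:wrong.value, http://test3.ro'
$pattern = '(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)'

$urls = @(
    $txt |
        Select-String -Pattern $pattern -AllMatches |
        ForEach-Object { $_.Matches } |
        ForEach-Object { $_.Groups[3].Value } |   # group 3 = text after "://"
        Select-Object -Unique                     # drop exact duplicates
)
$urls
```

Reading the capture group avoids the fixed SubString(7) offset, which only fits "http://" and cuts one character too few from "https://".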
A regex for matching URLs could look like this:
(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
Check for more info here.