Case Insensitive Regex Matching Using "Matches" - regex

I can't seem to find an example of what I'm trying to do here.
I have a list of regular expressions that I match against each line of a CSV file, and they work great if everything is in upper case. However, the search is case sensitive and I can't figure out how to make it case insensitive without modifying each of the regular expressions with something like (?i). Is it possible to modify what I'm doing here in a simple way?
Bonus points! I'm searching with thousands of regular expressions and that part seems to take a long time. If you happen to know of a faster way to search each line for all of the regexes, please share.
$file = New-Object System.IO.StreamReader ($CSVFile) # Input Stream
while (($text = $file.ReadLine()) -ne $null) {
    foreach ($RX in $SearchList) {
        foreach ($match in ([regex]$RX).Matches($text)) {
            Write-Host "Match found: " $match.Value -ForegroundColor Red
        }
    }
}
$file.Close()
Thanks for any help with this!

Add this line just inside your foreach ($RX in $SearchList){ block:
$RX = [regex]::new($RX,([regex]$RX).Options -bor [System.Text.RegularExpressions.RegexOptions]::IgnoreCase)
This ensures that $RX is a [regex] object, as well as adds the IgnoreCase option to whatever options were present.
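Put together with the loop from the question, a minimal sketch of the result (all names are the question's own):
$file = New-Object System.IO.StreamReader ($CSVFile)
while (($text = $file.ReadLine()) -ne $null) {
    foreach ($RX in $SearchList) {
        # Rebuild as a [regex] with IgnoreCase added to its existing options
        $RX = [regex]::new($RX, ([regex]$RX).Options -bor [System.Text.RegularExpressions.RegexOptions]::IgnoreCase)
        foreach ($match in $RX.Matches($text)) {
            Write-Host "Match found: " $match.Value -ForegroundColor Red
        }
    }
}
$file.Close()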

To speed it up, you can do two things before searching: read the entire file into memory and create all your regex objects up front...
$reList = $SearchList | ForEach-Object { [regex]$_ } # adapt the regex here
$lines = [System.IO.File]::ReadAllLines($CSVFile)
Do you really need thousands of regexes?
The new syntax becomes:
foreach ($line in $lines) {
    foreach ($re in $reList) {
        foreach ($match in $re.Matches($line)) {
            Write-Host "Match found: " $match.Value -ForegroundColor Red
        }
    }
}
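If construction cost is also a concern, the Compiled flag (my addition, not part of the original answer) trades a slower start for faster matching across thousands of lines:
$options = [System.Text.RegularExpressions.RegexOptions]'IgnoreCase, Compiled'
$reList = $SearchList | ForEach-Object { [regex]::new($_, $options) }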

Related

Speeding Up Regular Expression Matching

I'm searching with thousands of regular expressions and it seems to take a long time on that part. If you happen to know of a faster way to search each line for all of the regexes, I'm all ears.
I need to capture the value matched, the full line that matched, and the line number.
$file = New-Object System.IO.StreamReader ($CSVFile) # Input Stream
while (($text = $file.ReadLine()) -ne $null) {
    foreach ($RX in $SearchList) {
        foreach ($match in ([regex]$RX).Matches($text)) {
            Write-Host "Match found: " $match.Value -ForegroundColor Red
        }
    }
}
$file.Close()
Matching something against thousands of regular expressions in a loop will always perform poorly. Merge your individual regular expressions into one:
$re = $SearchList -join '|'
and use it like this:
(Get-Content $CSVFile) -match $re
or like this (if the input file is too large to fit into memory):
Get-Content $CSVFile | Where-Object { $_ -match $re }
If you have too many individual regular expressions to fit into one large one you need to find a way to reduce their number or merge several of them into smaller, more general ones. For help with doing that you need to provide us with a representative sample of the expressions you want to match against.
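Since the question also needs the matched value, the full line, and the line number, note that Select-String exposes all three on the MatchInfo objects it emits. A sketch using the merged pattern (the variable names come from the question; the property names are standard):
$re = $SearchList -join '|'
Select-String -Path $CSVFile -Pattern $re -AllMatches | ForEach-Object {
    [pscustomobject]@{
        LineNumber = $_.LineNumber   # line number in the file
        Line       = $_.Line         # the full matching line
        Values     = $_.Matches.Value # every matched value on that line
    }
}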

How do I get around the case sensitivity of the Replace string method?

I'm moving the content source of pretty much everything in SCCM to a DFS share, and so I've got to change the source path for everything in the environment, and for the most part, I've got it coded out. There are some improvements I'd like to make to clean up the code before I hit the big red button though.
For example, PowerShell's .Replace method is case sensitive, and there are occasions where someone used uppercase in the server name in only PART of the name.
\\typNUMloc\ can be \\typNUMLOC\ or \\TYPNUMloc\ or \\TYPNUMLOC\. This makes for extra large If statements.
One of my functions is for the Drivers (not the Driver Packages, that I've tested with similar code, and I have only one mistyped path). Big Red Button commented out for safety.
$DriverArray = Get-CMDriver | Select CI_ID, ContentSourcePath | Where-Object {$_.ContentSourcePath -Like "\\oldNUMsrv\*"}
Foreach ($Driver in $DriverArray) {
    $DriverID = $Driver.CI_ID
    $CurrPath = $Driver.ContentSourcePath
    # Checks and replaces the root path
    If ($CurrPath -split '\\' -ccontains 'path_dir') {
        $NewPath = $CurrPath.Replace("oldNUMsrv\path_dir","dfs\Path-Dir")
        #Set-CMDriver -Id $DriverID -DriverSource $NewPath
    } ElseIf ($CurrPath -split '\\' -ccontains 'Path_dir') {
        $NewPath = $CurrPath.Replace("oldNUMsrv\Path_dir","dfs\Path-Dir")
        #Set-CMDriver -Id $DriverID -DriverSource $NewPath
    } ElseIf ($CurrPath -split '\\' -ccontains 'Path_Dir') {
        $NewPath = $CurrPath.Replace("oldNUMsrv\Path_Dir","dfs\Path-Dir")
        #Set-CMDriver -Id $DriverID -DriverSource $NewPath
    } Else {
        Write-Host "Bad Path at $DriverID -- $CurrPath" -ForegroundColor Red
    }
    # Checks again for ones that didn't change properly (case issues)
    If ($NewPath -like "\\oldNUMsrv\*") {
        Write-Host "Bad Path at $DriverID -- $CurrPath" -ForegroundColor Red
    }
}
But as you can tell, that's a lot of code that I shouldn't need. I know I could use the -replace or -ireplace operators, but I end up with additional backslashes (\\dfs\\Path-Dir) in my path, even when using [regex]::Escape.
How can I use an array of the different paths to match against $CurrPath and perform the replace? I know it doesn't work as written, but something like this:
If ($Array -in $CurrPath) {
    $NewPath = $CurrPath.Replace($Array, "dfs\Path-Dir")
}
I think your issue might have been assuming you had to escape the replacement string as well as the pattern string. That is not the case. Since your strings contain a regex metacharacter (the backslash), you need to escape the pattern string only. In its basic form, you just need to do something like this:
PS C:\Users\Matt> "C:\temp\test\file.php" -replace [regex]::Escape("temp\test"), "another\path"
C:\another\path\file.php
However, I would like to take this one step further. Your if statements are all doing essentially the same thing: finding a series of strings and replacing them all with the same value. -contains isn't really necessary either. Also note that all of those comparison operators are case insensitive by default. See about_comparison_operators.
You can simplify all that with a little more regex by building a pattern string. So assuming that your strings are all unique (case does not matter) you could do this:
$stringsToReplace = "oldNUMsrv1\path_dir", "oldNUMsrv2\Path_dir", "oldNUMsrv1\Path_Dir"
$regexPattern = ($stringsToReplace | ForEach-Object { [regex]::Escape($_) }) -join "|"
if ($CurrPath -match $regexPattern) {
    $NewPath = $CurrPath -replace $regexPattern, "new\path"
}
You don't even need the if; you could just run -replace on every string regardless. I only left it in because you had a check to see whether something changed. Again, if you wrote all those statements just to account for case, then my suggestion is moot.
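Worth spelling out: because -replace is already case-insensitive, the question's specific problem (mixed-case variants of a single path) collapses to one escaped pattern. A sketch with the question's variable names, Set-CMDriver left commented out as in the original:
foreach ($Driver in $DriverArray) {
    # -replace is case-insensitive, so one pattern covers all case variants
    $NewPath = $Driver.ContentSourcePath -replace [regex]::Escape("oldNUMsrv\path_dir"), "dfs\Path-Dir"
    #Set-CMDriver -Id $Driver.CI_ID -DriverSource $NewPath
}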

Improving performance on PowerShell filtering statement

I have a script that goes through an HTTP access log, filters out some lines based on a regex pattern, and copies them into another file:
param($workingdate=(get-date).ToString("yyMMdd"))
Get-Content "access-$workingdate.log" |
    Select-String -Pattern $pattern |
    Add-Content "D:\webStatistics\log\filtered-$workingdate.log"
My logs can be quite large (up to 2GB), which takes up to 15 minutes to run. Is there anything I can do to improve the performance of the statement above?
Thank you for your thoughts!
See if this isn't faster than your current solution:
param($workingdate=(get-date).ToString("yyMMdd"))
Get-Content "access-$workingdate.log" -ReadCount 2000 | foreach {
    $_ -match $pattern | Add-Content "D:\webStatistics\log\filtered-$workingdate.log"
}
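(With -ReadCount 2000, Get-Content sends arrays of 2,000 lines down the pipeline instead of one line at a time, and -match applied to an array returns its matching elements, so each iteration filters a whole block at once.)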
You don't show your patterns, but I suspect they are a large part of the problem.
You will want to look for a new question here (I am sure it has been asked) or elsewhere for detailed advice on building fast regular expression patterns.
But I find the best advice is to anchor your patterns and avoid unbounded runs that can match any character.
So instead of a pattern like path/.*/.*\.js, use one with a $ on the end to anchor it to the end of the string. That way the regex engine can tell immediately that index.html is not a match; otherwise it has to do some rather complicated scans, with path/ and .js possibly showing up anywhere in the string. This example of course assumes the file name is at the end of the log line.
Anchors work well with start-of-line patterns as well. A pattern might look like ^[^"]*"GET /myfile". That still has a run of unknown length, but at least the engine knows it doesn't have to restart the search for more quotes after finding the first one: the [^"] character class lets it stop, because the pattern can't match past the first quote.
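A quick illustration of the point (patterns illustrative only; both expressions return False, the difference is how much scanning the engine does before giving up):
'index.html' -match 'path/.*/.*\.js'     # unanchored: path/ and .js are hunted for everywhere
'index.html' -match 'path/.*/.*\.js$'    # anchored: a non-.js ending is ruled out quickly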
You could also try seeing if using streams would speed it up. Something like this might help, although I couldn't test it because, as mentioned above, I'm not sure what pattern you are using.
param($workingdate=(get-date).ToString("yyMMdd"))
$file = New-Object System.IO.StreamReader -Arg "access-$workingdate.log"
$stream = New-Object System.IO.StreamWriter -Arg "D:\webStatistics\log\filtered-$workingdate.log"
# Compare against $null so a blank line doesn't end the loop early
while (($line = $file.ReadLine()) -ne $null) {
    if ($line -match $pattern) {
        $stream.WriteLine($line)
    }
}
$file.Close()
$stream.Close()

Is there a way to optimise my Powershell function for removing pattern matches from a large file?

I've got a large text file (~20K lines, ~80 characters per line).
I've also got a largish array (~1500 items) of objects containing patterns I wish to remove from the large text file. Note, if the pattern from the array appears on a line in the input file, I wish to remove the entire line, not just the pattern.
The input file is CSVish with lines similar to:
A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;
The patterns in the array, which I search for on each line of the input file, resemble the
XX000029
part of the line above.
My somewhat naïve function to achieve this goal looks like this currently:
function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )
    try {
        $FileContent = Get-Content $BigFile
    } catch {
        Write-Error $_
    }
    $IgnorePatterns | ForEach-Object {
        $IgnoreId = $_.IgnoreId
        $FileContent = $FileContent | Where-Object { $_ -notmatch $IgnoreId }
        Write-Host $FileContent.Count
    }
    $FileContent | Set-Content "CleansedBigFile.txt"
}
This works, but is slow.
How can I make it quicker?
function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )
    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object { [regex]::Escape($_) }) -join "|"
    If (Test-Path $BigFile) {
        $reader = New-Object System.IO.StreamReader($BigFile)
        $line = $reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If ($line -notmatch $regex) { $line | Add-Content "CleansedBigFile.txt" }
            # Attempt to read the next line.
            $line = $reader.ReadLine()
        }
        $reader.Close()
    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}
StreamReader is one of the preferred methods for reading large text files. We also use regex to build one pattern string to match against, running each pattern through [regex]::Escape() as a precaution in case regex metacharacters are present (we have to guess, since we only see one sample pattern).
If $IgnorePatterns can easily be cast to strings, this should work in place just fine. A small sample of what $regex looks like would be:
XX000029|XX000028|XX000027
If $IgnorePatterns is populated from a database you might have less control over this, but since we are using regex you might be able to reduce the pattern set by actually using regex features (instead of just one big alternation), as in my sample above. You could reduce that set to XX00002[7-9], for instance.
I don't know if the regex itself will provide a performance boost with 1500 possibilities; the StreamReader is supposed to be the focus here. However, I did muddy the waters by using Add-Content for the output, which doesn't win any awards for speed either (a StreamWriter could be used in its place).
Reader and Writer
I still have to test this to be sure it works, but it just uses a StreamReader and a StreamWriter. If it does work better, I will replace the code above with it.
function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )
    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object { [regex]::Escape($_) }) -join "|"
    If (Test-Path $BigFile) {
        # Prepare the StreamReader
        $reader = New-Object System.IO.StreamReader($BigFile)
        # Prepare the StreamWriter
        $writer = New-Object System.IO.StreamWriter("CleansedBigFile.txt")
        $line = $reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If ($line -notmatch $regex) { $writer.WriteLine($line) }
            # Attempt to read the next line.
            $line = $reader.ReadLine()
        }
        # Don't cross the streams!
        $reader.Close()
        $writer.Close()
    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}
You might need some error prevention in there for the streams but it does appear to work in place.
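For that error prevention, one option (my sketch, not the author's) is try/finally, so the streams always close even if matching throws:
$reader = New-Object System.IO.StreamReader($BigFile)
$writer = New-Object System.IO.StreamWriter("CleansedBigFile.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null) {
        if ($line -notmatch $regex) { $writer.WriteLine($line) }
    }
} finally {
    # Runs on success or failure, so the file handles are released
    $reader.Close()
    $writer.Close()
}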

PowerShell multiple string replacement efficiency

I'm trying to replace 600 different strings in a very large (30MB+) text file. I'm currently building a script that does this, following this question:
Script:
$string = gc $filePath
$string | % {
    $_ -replace 'something0','somethingelse0' `
       -replace 'something1','somethingelse1' `
       -replace 'something2','somethingelse2' `
       -replace 'something3','somethingelse3' `
       -replace 'something4','somethingelse4' `
       -replace 'something5','somethingelse5' `
       ...
       (600 More Lines...)
       ...
} | ac "C:\log.txt"
But as this will check each line 600 times, and there are well over 150,000 lines in the text file, that means a lot of processing time.
Is there a better alternative that is more efficient?
Combining the hash technique from Adi Inbar's answer, and the match evaluator from Keith Hill's answer to another recent question, here is how you can perform the replace in PowerShell:
# Build hashtable of search and replace values.
$replacements = @{
    'something0' = 'somethingelse0'
    'something1' = 'somethingelse1'
    'something2' = 'somethingelse2'
    'something3' = 'somethingelse3'
    'something4' = 'somethingelse4'
    'something5' = 'somethingelse5'
    'X:\Group_14\DACU' = '\\DACU$'
    '.*[^xyz]' = 'oO{xyz}'
    'moresomethings' = 'moresomethingelses'
}
# Join all (escaped) keys from the hashtable into one regular expression.
[regex]$r = @($replacements.Keys | foreach { [regex]::Escape( $_ ) }) -join '|'
[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
    # Return replacement value for each matched value.
    $matchedValue = $matchInfo.Groups[0].Value
    $replacements[$matchedValue]
}
# Perform replace over every line in the file and append to log.
Get-Content $filePath |
    foreach { $r.Replace( $_, $matchEval ) } |
    Add-Content 'C:\log.txt'
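(The scriptblock works as the replacement argument because PowerShell converts it to a System.Text.RegularExpressions.MatchEvaluator delegate when binding the Replace call.)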
So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?
Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.
The Method:
Construct a hash where the keys are the somethings and the values are the somethingelses.
Join the keys of the hash with the | symbol, and use it as a match group in the regex.
In the replacement, interpolate an expression that retrieves a value from the hash using the match variable for the capture group
The Problem:
Frustratingly, PowerShell doesn't expose the match variables outside the regex replace call. It doesn't work with the -replace operator and it doesn't work with [regex]::replace.
In Perl, you can do this, for example:
$string =~ s/(1|2|3)/@{[$1 + 5]}/g;
This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
However, in PowerShell, both of these fail:
$string -replace '(1|2|3)',"$($1 + 5)"
[regex]::replace($string,'(1|2|3)',"$($1 + 5)")
In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can escape the $ before the number in a double-quoted string so that it evaluates to the corresponding match group, but that defeats the purpose: it can't participate in an expression.
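That said, the match-evaluator answer above shows the practical workaround: [regex]::Replace also accepts a scriptblock as the replacement, and inside it the match is a real object you can compute with. A minimal sketch reproducing the Perl example:
[regex]::Replace('1224526123 [2] [6]', '(1|2|3)', { param($m) [int]$m.Value + 5 })
# Returns: 6774576678 [7] [6]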
The Solution:
[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]
If using another language is acceptable to you, the following Perl script works like a charm:
$filePath = $ARGV[0]; # Or hard-code it or whatever
open INPUT, "< $filePath";
open OUTPUT, '> C:\log.txt';
%replacements = (
    'something0' => 'somethingelse0',
    'something1' => 'somethingelse1',
    'something2' => 'somethingelse2',
    'something3' => 'somethingelse3',
    'something4' => 'somethingelse4',
    'something5' => 'somethingelse5',
    'X:\Group_14\DACU' => '\\DACU$',
    '.*[^xyz]' => 'oO{xyz}',
    'moresomethings' => 'moresomethingelses'
);
foreach (keys %replacements) {
    push @strings, qr/\Q$_\E/;
    $replacements{$_} =~ s/\\/\\\\/g;
}
$pattern = join '|', @strings;
while (<INPUT>) {
    s/($pattern)/$replacements{$1}/g;
    print OUTPUT;
}
close INPUT;
close OUTPUT;
It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:
The foreach loop goes through all the elements of the hash and creates an array called @strings that contains the keys of the %replacements hash, with metacharacters quoted using \Q and \E, and the result of that quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.
while (<INPUT>) reads the file one line at a time. A lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
I simplified @{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so @{[ ]} is used as a workaround - it creates a literal array of one element containing the expression. But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.
I also have no idea how to solve this in PowerShell, but I do know how to solve it in Bash, using a tool called sed. Luckily, there is also sed for Windows. If all you want to do is replace "something#" with "somethingelse#" everywhere, then this command will do the trick for you
sed -i "s/something([0-9]+)/somethingelse\1/g" c:\log.txt
In Bash you'd actually need to escape a couple of those characters with backslashes, but I'm not sure you need to in Windows. If the first command complains, you can try
sed -i "s/something\([0-9]\+\)/somethingelse\1/g" c:\log.txt
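(If the Windows build is GNU sed, passing -r, or -E in newer versions, enables extended regex, which is why the unescaped group syntax in the first command can work without the backslashes.)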
I would use the PowerShell switch statement:
$string = gc $filePath
$string | % {
    switch -regex ($_) {
        'something0' { 'somethingelse0' }
        'something1' { 'somethingelse1' }
        'something2' { 'somethingelse2' }
        'something3' { 'somethingelse3' }
        'something4' { 'somethingelse4' }
        'something5' { 'somethingelse5' }
        'pattern(?<a>\d+)' { $matches['a'] } # sample of more complex logic
        ...
        (600 More Lines...)
        ...
        default { $_ }
    }
} | ac "C:\log.txt"
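A variation worth knowing (my addition, not part of the original answer): switch can read the file itself via -File, skipping the Get-Content pipeline entirely:
$result = switch -Regex -File $filePath {
    'something0' { 'somethingelse0' }
    'something1' { 'somethingelse1' }
    default      { $_ }   # pass unmatched lines through unchanged
}
$result | ac "C:\log.txt"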