I'm searching with thousands of regular expressions and it seems to take a long time on that part. If you happen to know of a faster way to search each line for all of the regexes, I'm all ears.
I need to capture the value matched, the full line that matched, and the line number.
$file = New-Object System.IO.StreamReader ($CSVFile) # Input Stream
while (($text = $file.ReadLine()) -ne $null) {
    foreach ($RX in $SearchList) {
        foreach ($match in ([regex]$RX).Matches($text)) {
            Write-Host "Match found: " $match.Value -ForegroundColor Red
        }
    }
}
$file.Close()
Matching something against thousands of regular expressions in a loop will always perform poorly. Merge your individual regular expressions into one
$re = $SearchList -join '|'
and use it like this:
(Get-Content $CSVFile) -match $re
or like this (if the input file is too large to fit into memory):
Get-Content $CSVFile | Where-Object { $_ -match $re }
If you have too many individual regular expressions to fit into one large one you need to find a way to reduce their number or merge several of them into smaller, more general ones. For help with doing that you need to provide us with a representative sample of the expressions you want to match against.
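As a sketch of that merge step (assuming each entry in $SearchList is a valid pattern on its own), wrapping every alternative in a non-capturing group keeps the combined expression safe even when an individual pattern contains its own alternation:

```powershell
# Wrap each pattern in (?:...) so a '|' inside one entry can't bleed into its neighbors.
$re = ($SearchList | ForEach-Object { "(?:$_)" }) -join '|'
(Get-Content $CSVFile) -match $re
```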
I can't seem to find an example of what I'm trying to do here.
I have a list of regular expressions that I'm searching through for each line of a csv file, and they work great if everything is in upper case. However, my search is case-sensitive, and I can't figure out how to make it case-insensitive without modifying each of the regular expressions with something like (?i). Is it possible to modify what I'm doing here in a simple way?
Bonus Points! I'm searching with thousands of regular expressions and it seems to take a long time on that part. If you happen to know of a faster way to search each line for all of the regexes, please share.
$file = New-Object System.IO.StreamReader ($CSVFile) # Input Stream
while (($text = $file.ReadLine()) -ne $null) {
    foreach ($RX in $SearchList) {
        foreach ($match in ([regex]$RX).Matches($text)) {
            Write-Host "Match found: " $match.Value -ForegroundColor Red
        }
    }
}
$file.Close()
Thanks for any help with this!
Add this line just inside of your foreach ($RX in $SearchList){:
$RX = [regex]::new($RX,([regex]$RX).Options -bor [System.Text.RegularExpressions.RegexOptions]::IgnoreCase)
This ensures that $RX is a [regex] object, as well as adds the IgnoreCase option to whatever options were present.
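An alternative sketch, if you'd rather not construct new [regex] objects: fold the option into the pattern text itself with the inline (?i) flag, which the .NET regex engine supports:

```powershell
# .NET regexes are case-sensitive by default; an inline (?i) flips the whole pattern.
([regex]'hello').IsMatch('HELLO')      # False
([regex]'(?i)hello').IsMatch('HELLO')  # True
```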
To speed it up, you can do two things before searching: read the entire file into memory and create all your regex objects up front...
$reList = $SearchList | ForEach-Object { [regex]$_ } # adapt the regex here
$lines = [System.IO.File]::ReadAllLines($CSVFile)
Do you really need thousands of regexes?
The new syntax becomes:
foreach ($line in $lines) {
    foreach ($re in $reList) {
        foreach ($match in $re.Matches($line)) {
            Write-Host "Match found: " $match.Value -ForegroundColor Red
        }
    }
}
Below I have a ps1 for finding and appending text defined by regex pattern; i.e. from [pattern] to [pattern]Foo. Is there a simpler way to do this for multiple regex patterns, other than defining each regex as pattern2, pattern3, etc. and creating a separate "ForEach" to correspond to every regex? Because that's how I did it, and it works but it looks very rudimentary.
$pattern1 = [regex]'([___)'
$pattern2 = [regex]'([___)'
Get-ChildItem 'C:\File\Location\*.txt' -Recurse | ForEach {
    (Get-Content $_ |
        ForEach { $_ -replace $pattern1, ('$1' + 'FOO') } |
        ForEach { $_ -replace $pattern2, ('$1' + 'FOO') }) |
        Set-Content $_
}
If you are replacing with the same replacement pattern, just use alternation:
$pattern = [regex]'(pattern1|pattern2)'
NOTE: in unanchored alternations, watch out for the order of the alternatives: if a shorter branch can match at a given location in the string, a longer one listed after it won't get tested. E.g. (on|one|ones) will only match on in ones. See more about that in Remember That The Regex Engine Is Eager.
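A quick demonstration of that eagerness (the -match operator populates $Matches with the first match found):

```powershell
$null = 'ones' -match 'on|one|ones';  $Matches[0]   # 'on'   - first branch wins
$null = 'ones' -match 'ones|one|on';  $Matches[0]   # 'ones' - longest-first order wins
```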
I'm running into problems trying to pull the thousands separators out of some currency values in a set of files. The "bad" values are delimited with commas and double quotes. There are other values in there that are < $1000 that present no issue.
Example of existing file:
"12,345.67",12.34,"123,456.78",1.00,"123,456,789.12"
Example of desired file (thousands separators removed):
"12345.67",12.34,"123456.78",1.00,"123456789.12"
I found a regex expression for matching the numbers with separators that works great, but I'm having trouble with the -replace operator. The replacement value is confusing me. I read about $& and I'm wondering if I should use that here. I tried $_, but that pulls out ALL my commas. Do I have to use $matches somehow?
Here's my code:
$Files = Get-ChildItem *input.csv
foreach ($file in $Files)
{
$file |
Get-Content | #assume that I can't use -raw
% {$_ -replace '"[\d]{1,3}(,[\d]{3})*(\.[\d]+)?"', ("$&" -replace ',','')} | #this is my problem
out-file output.csv -append -encoding ascii
}
Tony Hinkle's comment is the answer: don't use regex for this (at least not directly on the CSV file).
Your CSV is valid, so you should parse it as such, work on the objects (change the text if you want), then write a new CSV.
Import-Csv -Path .\my.csv | ForEach-Object {
$_ | ForEach-Object {
$_ -replace ',',''
}
} | Export-Csv -Path .\my_new.csv
(this code needs work, specifically the middle as the row will have each column as a property, not an array, but a more complete version of your CSV would make that easier to demonstrate)
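A more complete sketch of that middle step, iterating each row's columns via PSObject.Properties (the property names come from whatever the CSV header defines; file names are placeholders from the answer above):

```powershell
Import-Csv -Path .\my.csv | ForEach-Object {
    foreach ($prop in $_.PSObject.Properties) {
        # Strip thousands separators from every column value.
        $prop.Value = $prop.Value -replace ','
    }
    $_    # emit the modified row
} | Export-Csv -Path .\my_new.csv -NoTypeInformation
```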
You can try with this regex:
,(?=(\d{3},?)+(?:\.\d{1,3})?")
See the live demo, or in PowerShell:
% {$_ -replace ',(?=(\d{3},?)+(?:\.\d{1,3})?")','' }
But it's more about the challenge that regex can bring. For proper work, use @briantist's answer, which is the clean way to do this.
I would use a simpler regex, and use capture groups instead of the entire capture.
I have tested the follow regular expression with your input and found no issues.
% {$_ -replace '([\d]),([\d])','$1$2' }
E.g. find all commas with a number before and after (so that the weird mixed splits don't matter) and replace the comma entirely.
This would have problems if your input has a scenario without that odd mixing of quotes and no quotes.
I have a report file that is generated and contains various file references.
I am using Select-String and regular expressions to match on certain types of files and perform subsequent processing on them.
The dilemma I have is trying to consistently identify the number of matches when there are zero (0), one (1), or more than one (2+) matches. Here is what I've tried:
(select-string -path $outputfilePath -pattern $regex -allmatches).matches.count
This returns "null" if there are 0 matches, "1" if there is one match, and "null" if there is more than one match.
(select-string -path $outputfilePath -pattern $regex -allmatches).count
This returns "null" if there are 0 or 1 matches, and the number of matches if there is more than one.
I'm fairly new to PowerShell, but am trying to find a consistent way to test the number of matches regardless of whether there are 0, 1, or more than 1.
Try this:
$content = Get-Content $outputfilePath
($content -match $regex).Count
PowerShell has a number of Comparison Operators that will probably make your life easier. Here's a quick listing:
-eq
-ne
-gt
-ge
-lt
-le
-Like
-NotLike
-Match
-NotMatch
-Contains
-NotContains
-In
-NotIn
-Replace
In this instance, -Match will match the $content string array against your regular expression $regex, and the output is grouped by parentheses. This grouping is a collection of strings. We can then Count the objects and print out an accurate count of matches.
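The key detail here is that -match applied to a collection acts as a filter, returning the matching elements rather than a Boolean:

```powershell
$content = 'alpha', 'beta', 'alphabet'
($content -match 'alpha').Count   # 2 - returns 'alpha' and 'alphabet'
'alpha' -match 'alpha'            # True - against a scalar, -match returns a Boolean
```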
So why doesn't your code work as expected? When you have a single match, .Matches actually returns a System.Text.RegularExpressions.Match object that looks something like this for a string "test123":
Groups : {test123}
Success : True
Captures : {test123}
Index : 15
Length : 7
Value : test123
Why does this happen? Because a Microsoft.PowerShell.Commands.MatchInfo object is what Select-String returns. You can verify this by attempting some other properties like .Filename on your single-match output.
Okay, but why can't we get all of our matches in one go? Because multiple matches produce multiple objects, so you're now operating on a collection. The collection has a different type and doesn't understand .Matches. With exactly one match, no collection is returned; instead you get a single object that does understand .Matches!
Long story short: these aren't the outputs you're looking for!
You can use the array sub-expression operator @(...) to always place results in a collection with a Count:
(Select-String ...)        # may return $null, one match, or a collection of matches
(Select-String ...).Count  # only succeeds for two or more matches
@(Select-String ...)       # always returns a collection of matches
@(Select-String ...).Count # always succeeds
I'm trying to replace 600 different strings in a very large (30 MB+) text file. I'm currently building a script that does this, following this question:
Script:
$string = gc $filePath
$string | % {
$_ -replace 'something0','somethingelse0' `
-replace 'something1','somethingelse1' `
-replace 'something2','somethingelse2' `
-replace 'something3','somethingelse3' `
-replace 'something4','somethingelse4' `
-replace 'something5','somethingelse5' `
...
(600 More Lines...)
...
}
$string | ac "C:\log.txt"
But as this checks each line 600 times, and there are well over 150,000 lines in the text file, that means a lot of processing time.
Is there a better alternative to doing this that is more efficient?
Combining the hash technique from Adi Inbar's answer, and the match evaluator from Keith Hill's answer to another recent question, here is how you can perform the replace in PowerShell:
# Build hashtable of search and replace values.
$replacements = @{
'something0' = 'somethingelse0'
'something1' = 'somethingelse1'
'something2' = 'somethingelse2'
'something3' = 'somethingelse3'
'something4' = 'somethingelse4'
'something5' = 'somethingelse5'
'X:\Group_14\DACU' = '\\DACU$'
'.*[^xyz]' = 'oO{xyz}'
'moresomethings' = 'moresomethingelses'
}
# Join all (escaped) keys from the hashtable into one regular expression.
[regex]$r = @($replacements.Keys | foreach { [regex]::Escape( $_ ) }) -join '|'
[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
# Return replacement value for each matched value.
$matchedValue = $matchInfo.Groups[0].Value
$replacements[$matchedValue]
}
# Perform replace over every line in the file and append to log.
Get-Content $filePath |
foreach { $r.Replace( $_, $matchEval ) } |
Add-Content 'C:\log.txt'
So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?
Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.
The Method:
Construct a hash where the keys are the somethings and the values are the somethingelses.
Join the keys of the hash with the | symbol, and use it as a match group in the regex.
In the replacement, interpolate an expression that retrieves a value from the hash using the match variable for the capture group
The Problem:
Frustratingly, PowerShell doesn't expose the match variables outside the regex replace call. It doesn't work with the -replace operator and it doesn't work with [regex]::replace.
In Perl, you can do this, for example:
$string =~ s/(1|2|3)/@{[$1 + 5]}/g;
This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
However, in PowerShell, both of these fail:
$string -replace '(1|2|3)',"$($1 + 5)"
[regex]::replace($string,'(1|2|3)',"$($1 + 5)")
In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can quote the $ before the number in a double-quoted string, so it will evaluate to the corresponding match group, but that defeats the purpose - it can't participate in an expression.
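One caveat, shown as a sketch: [regex]::Replace does accept a scriptblock, which .NET treats as a MatchEvaluator delegate, and inside it the Match object is available for use in an expression - the same match-evaluator technique used in the hashtable answer above:

```powershell
# The scriptblock receives the Match object, so the replacement can be computed.
[regex]::Replace('1224526123', '[123]', { param($m) [int]$m.Value + 5 })
# yields '6774576678'
```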
The Solution:
[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]
If using another language is acceptable to you, the following Perl script works like a charm:
$filePath = $ARGV[0]; # Or hard-code it or whatever
open INPUT, "< $filePath";
open OUTPUT, '> C:\log.txt';
%replacements = (
'something0' => 'somethingelse0',
'something1' => 'somethingelse1',
'something2' => 'somethingelse2',
'something3' => 'somethingelse3',
'something4' => 'somethingelse4',
'something5' => 'somethingelse5',
'X:\Group_14\DACU' => '\\DACU$',
'.*[^xyz]' => 'oO{xyz}',
'moresomethings' => 'moresomethingelses'
);
foreach (keys %replacements) {
push @strings, qr/\Q$_\E/;
$replacements{$_} =~ s/\\/\\\\/g;
}
$pattern = join '|', @strings;
while (<INPUT>) {
s/($pattern)/$replacements{$1}/g;
print OUTPUT;
}
close INPUT;
close OUTPUT;
It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:
The foreach loop goes through all the elements of the hash and creates an array called @strings that contains the keys of the %replacements hash, with metacharacters quoted using \Q and \E, and the result quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.
while (<INPUT>) reads the file one line at a time. A lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
I simplified @{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so @{[ ]} is used as a workaround - it creates a literal array of one element containing the expression. But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.
I also have no idea how to solve this in PowerShell, but I do know how to solve it in Bash, using a tool called sed. Luckily, there is also sed for Windows. If all you want to do is replace "something#" with "somethingelse#" everywhere, then this command will do the trick for you
sed -i "s/something([0-9]+)/somethingelse\1/g" c:\log.txt
In Bash you'd actually need to escape a couple of those characters with backslashes, but I'm not sure you need to in windows. If the first command complains you can try
sed -i "s/something\([0-9]\+\)/somethingelse\1/g" c:\log.txt
I would use the PowerShell switch statement:
$string = gc $filePath
$string | % {
switch -regex ($_) {
'something0' { 'somethingelse0' }
'something1' { 'somethingelse1' }
'something2' { 'somethingelse2' }
'something3' { 'somethingelse3' }
'something4' { 'somethingelse4' }
'something5' { 'somethingelse5' }
'pattern(?<a>\d+)' { $matches['a'] } # sample of more complex logic
...
(600 More Lines...)
...
default { $_ }
}
} | ac "C:\log.txt"