We have a library of 3500 documents for which we're building metadata. About 1/3 of the documents have BC in the title and I want to flag these. Right now, here's what I have and it works fine:
$htmPath = "c:\ht"
$srcfiles = Get-ChildItem $htmPath -filter "*.htm*"
ForEach ($doc in $srcfiles)
{
$s = $doc.Fullname
if($s.contains("BC")) { $bcflag = 1 } else {$bcflag = 0}
Write-Host "File: " $doc.Fullname " BC Flag: " $bcflag
}
It's just come to my attention that there may be some documents with bc in the title, so I need to add an OR to my condition test. I've been unsuccessful with the word OR (it errors out with bad method calls) and with | as a symbol for or (it thinks I'm trying to pipe something), and for some reason I can't include a regex. I can get it to work if I add a duplicate if statement with bc as a condition, but there has to be a way to provide a list of options rather than a series of statements each looking for a single value.
What is the proper syntax to
a.) Provide a list of conditions to test and/or
b.) use a regex in the above statement
The "or" operator in PowerShell is -or, so you probably want:
if($s.contains("BC") -or $s.contains("bc")) { $bcflag = 1 } else {$bcflag = 0}
using regex:
if($s -match 'bc')
or just
if($doc.fullname -match 'bc')
PowerShell's -match operator is case-insensitive by default, so it will match "BC" or "bc".
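For instance, a quick check at the prompt (sample string made up):
'Some BC Title' -match 'bc'    # True  (-match is case-insensitive)
'Some BC Title' -cmatch 'bc'   # False (-cmatch is the case-sensitive variant)
That also lets you collapse the flag assignment to $bcflag = [int]($s -match 'bc').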
I need to write a regular expression to convert vol0eaec8f32c9a98654_00000001. to vol-0eaec8f32c9a98654. I am doing this in PowerShell script.
I tried using the below code to strip everything after _
$s = "vol0eaec8f32c9a98654_00000001."
$s.Substring(0, $s.IndexOf('_'))
Thank you
As a rule of thumb, it's a better idea to ask how to achieve X rather than how to achieve X using Y. This is known as the XY problem.
A regular expression is suitable for your problem, but not in the way you'd think: here it's used for error checking. The actual string manipulation is done with .Replace(), .IndexOf(), and .Substring(), like so:
$s = "vol0eaec8f32c9a98654_00000001."
# If the string starts with vol and contains an underscore,
# pick substring from 0 to underscore and replace vol with vol-
if($s -match '^vol.+_.*') {
$t = $s.Substring(0, $s.IndexOf('_')).Replace('vol', 'vol-')
$t
} else {
Write-Host "Invalid string: $s"
}
# output
vol-0eaec8f32c9a98654
To see why error check is there, consider a string without underscore:
$s = "vol0eaec8f32c9a9865400000001."
$s.Substring(0, $s.IndexOf('_')).Replace('vol', 'vol-')

Exception calling "Substring" with "2" argument(s): "Length cannot be less than zero.
Parameter name: length"
At line:1 char:1
+ $s.Substring(0, $s.IndexOf('_')).Replace('vol', 'vol-')
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [], MethodInvocationException
    + FullyQualifiedErrorId : ArgumentOutOfRangeException
Boom! Since IndexOf didn't find an underscore, it returned -1, and the Substring method throws an exception when asked for a negative length.
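To confirm, IndexOf returns -1 when the character is absent, and that -1 then becomes the invalid length argument:
$s.IndexOf('_')   # -1, so the call amounts to $s.Substring(0, -1)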
How about if there isn't vol? Let's see:
# NB vool instead of vol
$s = "vool0eaec8f32c9a98654_00000001."
$s.Substring(0, $s.IndexOf('_')).Replace('vol', 'vol-')
vool0eaec8f32c9a98654
Uh oh! Now the underscore part is processed nicely, but the start of the string wasn't changed. That's because Replace looks for the given substring and replaces it if it finds one; if it doesn't, it simply returns the string unchanged.
Moral of the story: always check for error conditions. Regular expressions are often very good for input validation, and in some cases excellent tools for replacing too. Don't get too hung up on the idea that you need to use a regex - unless you are taking a regex class.
As a side note, even input validation doesn't need a regex. Using IndexOf works just fine. I'd usually prefer a regex, but in some scenarios a few index lookups are actually easier to read than a complex regex pattern.
# Look for vol and _. The underscore must be after vol.
# If both patterns exist, the substring doesn't ever get invalid argument
if( ($s.IndexOf('vol') -ge 0) -and ($s.IndexOf('_') -gt $s.IndexOf('vol')) ) {
$s.Substring(0, $s.IndexOf('_')).Replace('vol', 'vol-')
} else { Write-Host "Invalid string: $s" }
You may try this:
$s = $s -replace "vol","vol-"
My requirement changed after I posted this question, the strings were in two formats vol0eaec8f32c9a98654_00000001. and vol0eaec8f32c9a98654, the below regex worked for me, to convert them into valid volume Ids:
if ($volumeId.IndexOf('vol') -ge 0) {
$validId = $volumeID -replace "_[^ ]*$" -replace "vol", "vol-"
}
else { Write-Host "Invalid Volume ID: $volumeId" }
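For example, applied to both input formats (same sample values as above):
'vol0eaec8f32c9a98654_00000001.' -replace '_[^ ]*$' -replace 'vol', 'vol-'   # vol-0eaec8f32c9a98654
'vol0eaec8f32c9a98654' -replace '_[^ ]*$' -replace 'vol', 'vol-'             # vol-0eaec8f32c9a98654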
Hope you can help me with something. Thanks to @mklement0 I've gotten a great script matching the most basic, initial pattern for words in alphabetical order. However, what's missing is a full-text search and select.
An example of current script with a small sample of a few words within a Words.txt file:
App
Apple
Apply
Sword
Swords
Word
Words
Becomes:
App
Sword
Word
This is great, as it really narrows down to a basic pattern per line! However, because it goes line by line, there is still a pattern that can be narrowed down further, namely "Word" (capitalization not important), so ideally the output should be:
App
Word
And "Sword" is removed as it falls in more basic pattern prefixed as "Word".
Would you have any suggestion on how to achieve this? Keep in mind this will be a dictionary list of about 250k words, so I would not know what I am looking for ahead of time
CODE (from a related post, handles prefix matching only):
$outFile = [IO.File]::CreateText("C:\Temp\Results.txt") # Output File Location
$prefix = '' # initialize the prefix pattern
foreach ($line in [IO.File]::ReadLines('C:\Temp\Words.txt')) # Input File name.
{
if ($line -like $prefix)
{
continue # same prefix, skip
}
$line # Visual output of new unique prefix
$prefix = "$line*" # Saves new prefix pattern
$outFile.WriteLine($line) # Output file write to configured location
}
$outFile.Close() # Close the writer so the output is flushed to disk
You can try a two-step approach:
Step 1: Find the list of unique prefixes in the alphabetically sorted word list. This is done by reading the lines sequentially, and therefore only requires you to hold the unique prefixes as a whole in memory.
Step 2: Sort the resulting prefixes in order of length and iterate over them, checking in each iteration whether the word at hand is already represented by a substring of it in the result list.
The result list starts out empty, and whenever the word at hand has no substring in the result list, it is appended to the list.
The result list is implemented as a regular expression with alternation (|), to enable matching against all already-found unique words in a single operation.
You'll have to see if the performance is good enough; for best performance, .NET types are used directly as much as possible.
# Read the input file and build the list of unique prefixes, assuming
# alphabetical sorting.
$inFilePath = 'C:\Temp\Words.txt' # Be sure to use a full path.
$uniquePrefixWords =
foreach ($word in [IO.File]::ReadLines($inFilePath)) {
if ($word -like $prefix) { continue }
$word
$prefix = "$word*"
}
# Sort the prefixes by length in ascending order (shorter ones first).
# Note: This is a more time- and space-efficient alternative to:
# $uniquePrefixWords = $uniquePrefixWords | Sort-Object -Property Length
[Array]::Sort($uniquePrefixWords.ForEach('Length'), $uniquePrefixWords)
# Build the result lists of unique shortest words with the help of a regex.
# Skip later - and therefore longer - words if they are already represented
# in the result list by a substring.
$regexUniqueWords = ''; $first = $true
foreach ($word in $uniquePrefixWords) {
if ($first) { # first word
$regexUniqueWords = $word
$first = $false
} elseif ($word -notmatch $regexUniqueWords) {
# New unique word found: add it to the regex as an alternation (|)
$regexUniqueWords += '|' + $word
}
}
# The regex now contains all unique words, separated by "|".
# Split it into an array of individual words, sort the array again...
$resultWords = $regexUniqueWords.Split('|')
[Array]::Sort($resultWords)
# ... and write it to the output file.
$outFilePath = 'C:\Temp\Results.txt' # Be sure to use a full path.
[IO.File]::WriteAllLines($outFilePath, $resultWords)
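With the sample Words.txt above, Results.txt ends up containing just App and Word: Sword survives step 1 as a unique prefix, but is eliminated in step 2 because it contains the shorter word Word as a substring.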
Reducing arbitrary substrings is a bit more complicated than prefix matching, as we can no longer rely on alphabetical sorting.
Instead, you could sort by length, and then keep track of patterns that can't be satisfied by a shorter one, by using a hash set:
function Reduce-Wildcard
{
param(
[string[]]$Strings,
[switch]$SkipSort
)
# Create set containing all patterns, removes all duplicates
$Patterns = [System.Collections.Generic.HashSet[string]]::new($Strings, [StringComparer]::CurrentCultureIgnoreCase)
# Now that we only have unique terms, sort them by length
$Strings = $Patterns |Sort-Object -Property Length
# Start from the shortest possible pattern
for ($i = 0; $i -lt ($Strings.Count - 1); $i++) {
$current = $Strings[$i]
if(-not $Patterns.Contains($current)){
# Check that we haven't eliminated current string already
continue
}
# There's no reason to search for this substring
# in any of the shorter strings
$j = $i + 1
do {
$next = $Strings[$j]
if($Patterns.Contains($next)){
# Do we have a substring match?
if($next -like "*$current*"){
# Eliminate the superstring
[void]$Patterns.Remove($next)
}
}
$j++
} while ($j -lt $Strings.Count)
}
# Return the substrings we have left
return $Patterns
}
Then use it like:
$strings = [IO.File]::ReadLines('C:\Temp\Words.txt')
$reducedSet = Reduce-Wildcard -Strings $strings
Now, this is definitely not the most space-efficient way of reducing your patterns, but the good news is that you can easily divide-and-conquer a large set of inputs by merging and reducing the intermediate results:
Reduce-Wildcard @(
Reduce-Wildcard -Strings @('App','Apple')
Reduce-Wildcard -Strings @('Sword', 'Words')
Reduce-Wildcard -Strings @('Swords', 'Word')
)
Or, in case of multiple files, you can chain successive reductions like this:
$patterns = @()
Get-ChildItem dictionaries\*.txt |ForEach-Object {
$patterns = Reduce-Wildcard -Strings @(
$_ |Get-Content
$patterns
)
}
My two cents:
Using -Like or RegEx might get expensive in the long run: because they are used in the inner loop of the selection, the number of invocations grows steeply with the size of the word list. Besides, the patterns for the -Like and RegEx operations might need to be escaped (especially for RegEx, where e.g. a dot . has a special meaning; I suspect this question has something to do with checking for password complexity).
Presuming that it doesn't matter whether the output list is in lower case, I would use the String.Contains() method. Otherwise, if the case of the output does matter, you might prepare a hash table like $List[$Word.ToLower()] = $Word and use that to restore the actual case at the end.
# Remove empty words, sort by word length and change everything to lowercase
# knowing that .Contains is case sensitive (and therefore presumably a little faster)
$Words = $Words | Where-Object {$_} | Sort-Object Length | ForEach-Object {$_.ToLower()}
# Start with a list of the smallest words (I guess this is a list of all the words with 3 characters)
$Result = [System.Collections.ArrayList]@($Words | Where-Object Length -Eq $Words[0].Length)
# Add the word to the list if it doesn't contain any of the already-listed words
ForEach($Word in $Words) {
If (!$Result.Where({$Word.Contains($_)},'First')) { $Null = $Result.Add($Word) }
}
2020-04-23: updated the script with the suggestion from @Mathias:
You may want to use Where({$Word.Contains($_)},'First') to avoid comparing against all of $Result every time
which is about twice as fast.
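If the original casing does matter, here is a minimal sketch of the hash-table idea mentioned above (the $Case name is just for illustration); build the lookup before the words are lowercased, then map the results back at the end:
# Build a lookup from lowercased word to its original casing (first occurrence wins)
$Case = @{}
foreach ($w in $Words) { if (!$Case.ContainsKey($w.ToLower())) { $Case[$w.ToLower()] = $w } }
# ... run the reduction on the lowercased, length-sorted words as above ...
# Map the lowercase results back to their original casing
$Result = $Result | ForEach-Object { $Case[$_] }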
I can't seem to find an example of what I'm trying to do here.
I have a list of regular expressions that I'm searching through for each line of a csv file and they work great if everything is in upper case. However, my search is case sensitive and I can't figure out how to make it case insensitive, without modifying each of the regular expressions with something like ?i. Is it possible to modify what I'm doing here in a simple way?
Bonus Points! I'm searching with thousands of regular expressions and it seems to take a long time on that part. If you happen to know of a faster way to search each line for all of the regex's, please share.
$file = New-Object System.IO.StreamReader ($CSVFile) # Input Stream
while (($text = $file.ReadLine()) -ne $null ){
foreach ($RX in $SearchList){
foreach ($match in ([regex]$RX).Matches($text)) {
write-host "Match found: " $match.value -ForegroundColor Red
}
}
}
$file.close();
Thanks for any help with this!
Add this line just inside of your foreach ($RX in $SearchList){:
$RX = [regex]::new($RX,([regex]$RX).Options -bor [System.Text.RegularExpressions.RegexOptions]::IgnoreCase)
This ensures that $RX is a [regex] object, as well as adds the IgnoreCase option to whatever options were present.
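A quick way to see the effect at the prompt (sample pattern assumed):
$rx = [regex]::new('bc', [Text.RegularExpressions.RegexOptions]::IgnoreCase)
$rx.Matches('BC bc Bc').Count   # 3 - all three variants match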
To speed it up, you can do two things before searching: read the entire file to memory and create all your regex-objects...
$reList = $SearchList | ForEach-Object { [regex]$_ } # adapt the regex here
$lines = [System.IO.File]::ReadAllLines($CSVFile)
Do you really need thousands of regexes?
The new syntax becomes:
foreach($line in $lines) {
foreach($re in $reList) {
foreach($match in $re.Matches($line)) {
Write-Host "Match found: " $match.Value -ForegroundColor Red
}
}
}
I'm moving the content source of pretty much everything in SCCM to a DFS share, and so I've got to change the source path for everything in the environment, and for the most part, I've got it coded out. There's some improvements I'd like to make, to clean up the code before I hit the big red button though.
For example, PowerShell's .Replace method is case sensitive, and there are occasions where someone used uppercase in the server name in only PART of the name.
\\typNUMloc\ can be \\typNUMLOC\ or \\TYPNUMloc\ or \\TYPNUMLOC\. This makes for extra large If statements.
One of my functions is for the Drivers (not the Driver Packages, that I've tested with similar code, and I have only one mistyped path). Big Red Button commented out for safety.
$DriverArray = Get-CMDriver | Select CI_ID, ContentSourcePath | Where-Object {$_.ContentSourcePath -Like "\\oldNUMsrv\*"}
Foreach ($Driver in $DriverArray) {
$DriverID = $Driver.CI_ID
$CurrPath = $Driver.ContentSourcePath
# Checks and replaces the root path
If ($CurrPath -split '\\' -ccontains 'path_dir') {
$NewPath = $CurrPath.Replace("oldNUMsrv\path_dir","dfs\Path-Dir")
#Set-CMDriver -Id $DriverID -DriverSource $NewPath
} ElseIf ($CurrPath -split '\\' -ccontains 'Path_dir') {
$NewPath = $CurrPath.Replace("oldNUMsrv\Path_dir","dfs\Path-Dir")
#Set-CMDriver -Id $DriverID -DriverSource $NewPath
} ElseIf ($CurrPath -split '\\' -ccontains 'Path_Dir') {
$NewPath = $CurrPath.Replace("oldNUMsrv\Path_Dir","dfs\Path-Dir")
#Set-CMDriver -Id $DriverID -DriverSource $NewPath
} Else {
Write-Host "Bad Path at $DriverID -- $CurrPath" -ForegroundColor Red
}
# Checks again for ones that didn't change properly (case issues)
If ($NewPath -like "\\oldNUMsrv\*") {
Write-Host "Bad Path at $DriverID -- $CurrPath" -ForegroundColor Red
}
}
But as you can tell, that's a lot of code that I shouldn't need. I know I could use the -replace or -ireplace operators, but I end up with additional backslashes (\\dfs\\Path-Dir) in my path, even when using [regex]::Escape.
How can I use an array of the different paths to match against the $CurrPath and perform the replace? I know it doesn't work, but like this:
If ($Array -in $CurrPath) {
$NewPath = $CurrPath.Replace($Array, "dfs\Path-Dir")
}
I think your issue might have been assuming you had to escape the replacement string as well as the pattern string. That is not the case. Since you have control characters (the backslash) you will need to escape the pattern string. In its basic form you just need to do something like this:
PS C:\Users\Matt> "C:\temp\test\file.php" -replace [regex]::Escape("temp\test"), "another\path"
C:\another\path\file.php
However I would like to take this one step further. Your if statements are all doing essentially the same thing: finding a series of strings and replacing them all with the same thing. -contains isn't really necessary either. Also note that all of those comparison operators are case-insensitive by default. See about_Comparison_Operators.
You can simplify all that with a little more regex by building a pattern string. So assuming that your strings are all unique (case does not matter) you could do this:
$stringstoReplace = "oldNUMsrv1\path_dir", "oldNUMsrv2\Path_dir", "oldNUMsrv1\Path_Dir"
$regexPattern = ($stringstoReplace | ForEach-Object{[regex]::Escape($_)}) -join "|"
if($CurrPath -match $regexPattern){
$NewPath = $CurrPath -replace $regexPattern,"new\path"
}
You don't even need the if. You could just use -replace on all strings regardless. I only left the if since you had a check to see if something changed. Again, if you were just creating all those statements just to account for case then my suggestion is moot.
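To illustrate with a made-up path (the server and folder names below are placeholders):
$CurrPath = '\\oldNUMsrv1\Path_Dir\Drivers\x64'
$CurrPath -replace $regexPattern, 'dfs\Path-Dir'   # \\dfs\Path-Dir\Drivers\x64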
I'm trying to replace 600 different strings in a very large text file (30 MB+). I'm currently building a script that does this, following this question:
Script:
$string = gc $filePath
$string | % {
$_ -replace 'something0','somethingelse0' `
-replace 'something1','somethingelse1' `
-replace 'something2','somethingelse2' `
-replace 'something3','somethingelse3' `
-replace 'something4','somethingelse4' `
-replace 'something5','somethingelse5' `
...
(600 More Lines...)
...
}
$string | ac "C:\log.txt"
But as this will check each line 600 times, and there are well over 150,000 lines in the text file, this means there's a lot of processing time.
Is there a better alternative to doing this that is more efficient?
Combining the hash technique from Adi Inbar's answer, and the match evaluator from Keith Hill's answer to another recent question, here is how you can perform the replace in PowerShell:
# Build hashtable of search and replace values.
$replacements = @{
'something0' = 'somethingelse0'
'something1' = 'somethingelse1'
'something2' = 'somethingelse2'
'something3' = 'somethingelse3'
'something4' = 'somethingelse4'
'something5' = 'somethingelse5'
'X:\Group_14\DACU' = '\\DACU$'
'.*[^xyz]' = 'oO{xyz}'
'moresomethings' = 'moresomethingelses'
}
# Join all (escaped) keys from the hashtable into one regular expression.
[regex]$r = @($replacements.Keys | foreach { [regex]::Escape( $_ ) }) -join '|'
[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
# Return replacement value for each matched value.
$matchedValue = $matchInfo.Groups[0].Value
$replacements[$matchedValue]
}
# Perform replace over every line in the file and append to log.
Get-Content $filePath |
foreach { $r.Replace( $_, $matchEval ) } |
Add-Content 'C:\log.txt'
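As a quick sanity check, the same machinery applied to an in-memory string (using two of the keys from the hashtable above):
$r.Replace('something0 and something5', $matchEval)   # somethingelse0 and somethingelse5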
So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?
Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.
The Method:
Construct a hash where the keys are the somethings and the values are the somethingelses.
Join the keys of the hash with the | symbol, and use it as a match group in the regex.
In the replacement, interpolate an expression that retrieves a value from the hash using the match variable for the capture group
The Problem:
Frustratingly, PowerShell doesn't expose the match variables outside the regex replace call. It doesn't work with the -replace operator and it doesn't work with [regex]::replace.
In Perl, you can do this, for example:
$string =~ s/(1|2|3)/@{[$1 + 5]}/g;
This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
However, in PowerShell, both of these fail:
$string -replace '(1|2|3)',"$($1 + 5)"
[regex]::replace($string,'(1|2|3)',"$($1 + 5)")
In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can quote the $ before the number in a double-quoted string, so it will evaluate to the corresponding match group, but that defeats the purpose - it can't participate in an expression.
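(That said, as the hashtable-plus-evaluator answer above demonstrates, you can get at the match from PowerShell by passing a scriptblock to [regex]::Replace, which PowerShell converts to a MatchEvaluator delegate. A minimal sketch mirroring the Perl example:)
[regex]::Replace('1224526123 [2] [6]', '(1|2|3)', { param($m) [int]$m.Value + 5 })
# -> 6774576678 [7] [6]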
The Solution:
[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]
If using another language is acceptable to you, the following Perl script works like a charm:
$filePath = $ARGV[0]; # Or hard-code it or whatever
open INPUT, "< $filePath";
open OUTPUT, '> C:\log.txt';
%replacements = (
'something0' => 'somethingelse0',
'something1' => 'somethingelse1',
'something2' => 'somethingelse2',
'something3' => 'somethingelse3',
'something4' => 'somethingelse4',
'something5' => 'somethingelse5',
'X:\Group_14\DACU' => '\\DACU$',
'.*[^xyz]' => 'oO{xyz}',
'moresomethings' => 'moresomethingelses'
);
foreach (keys %replacements) {
push @strings, qr/\Q$_\E/;
$replacements{$_} =~ s/\\/\\\\/g;
}
$pattern = join '|', @strings;
while (<INPUT>) {
s/($pattern)/$replacements{$1}/g;
print OUTPUT;
}
close INPUT;
close OUTPUT;
It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:
The foreach loop goes through all the elements of the hash and creates an array called @strings that contains the keys of the %replacements hash, with metacharacters quoted using \Q and \E, and the result of that quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.
while (<INPUT>) reads the file one line at a time. A lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
I simplified @{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so @{[ ]} is used as a workaround - it creates a literal array of one element containing the expression. But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.
I also have no idea how to solve this in PowerShell, but I do know how to solve it in Bash, and that is by using a tool called sed. Luckily, there is also sed for Windows. If all you want to do is replace "something#" with "somethingelse#" everywhere, then this command will do the trick for you:
sed -i "s/something([0-9]+)/somethingelse\1/g" c:\log.txt
In Bash you'd actually need to escape a couple of those characters with backslashes, but I'm not sure you need to in Windows. If the first command complains, you can try:
sed -i "s/something\([0-9]\+\)/somethingelse\1/g" c:\log.txt
I would use the PowerShell switch statement:
$string = gc $filePath
$string | % {
switch -regex ($_) {
'something0' { 'somethingelse0' }
'something1' { 'somethingelse1' }
'something2' { 'somethingelse2' }
'something3' { 'somethingelse3' }
'something4' { 'somethingelse4' }
'something5' { 'somethingelse5' }
'pattern(?<a>\d+)' { $matches['a'] } # sample of more complex logic
...
(600 More Lines...)
...
default { $_ }
}
} | ac "C:\log.txt"
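One caveat, assuming patterns can overlap: in a PowerShell switch, every matching case runs, so a line matching two patterns would be emitted twice; adding break at the end of each case block keeps only the first match per line.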