Keep first regex match and discard others - regex

Yep another regex question... I am using PowerShell to extract a simple number from a filename when looping through a folder like so:
# sample string "ABCD - (123) Sample Text Here"
Get-ChildItem $processingFolder -filter *.xls | Where-Object {
$name = $_.Name
$pattern = '(\d{2,3})'
$metric = ([regex]$pattern).Matches($name) | { $_.Groups[1].Value }
}
All I am looking for is the number surrounded by brackets. This is successful, but it appears the $_.Name actually grabs more than just the name of the file, and the regex ends up picking up some other bits I don't want.
I understand why, as it's going through each regex match as an object and taking the value out of each and putting in $metric. I need some help editing the code so it only bothers with the first object.
I would just use -match etc if I wasn't bothered with the actual contents of the match, but it needs to be kept.

Your pipeline is missing a cmdlet before { $_.Groups[1].Value }; it should be ForEach-Object, but that is a minor thing. The regex pattern also needs a small improvement so that it requires the brackets without including them in the result.
$processingFolder = "C:\temp"
$pattern = '\((\d+)\)'
Get-ChildItem $processingFolder -filter "*.xls" | ForEach-Object{
$details = ""
if($_.Name -match $pattern){$details = $matches[1]}
$_ | Add-Member -MemberType NoteProperty -Name Details -Value $details -PassThru
} | select name, details
This will loop over all the files and try to match numbers in brackets. The -match operator only ever returns the first match, so if a name contains more than one, only the first is kept. We use a capture group in order to leave the brackets out of the result. Next we use Add-Member to create a new property called Details, which will contain the matched value.
Currently this will return all files in the $processingFolder, but a simple Where-Object{$_.Details} would return just the ones that have the property populated. If you have other properties that you need to make, you can chain the Add-Member calls together. Just don't forget the -PassThru.
You could also just build your own new object if you need several custom properties. It would certainly be more terse. The last question I answered has an example of that.
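The "make your own object" route could look like the following minimal sketch. It assumes PowerShell 3.0 or later for [pscustomobject], and it uses two hypothetical file names in place of Get-ChildItem output so that it runs anywhere.

```powershell
$pattern = '\((\d+)\)'

# Hypothetical names standing in for (Get-ChildItem $processingFolder).Name
$names = 'ABCD - (123) Sample Text Here.xls', 'No number here.xls'

$results = foreach ($name in $names) {
    [pscustomobject]@{
        Name    = $name
        # $Matches is only read when -match succeeded, so a stale value is never used.
        Details = if ($name -match $pattern) { $Matches[1] } else { '' }
    }
}

$results
```

Filtering with Where-Object { $_.Details } then works the same way as with the Add-Member version.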

After doing some research in to the data being returned itself (System.Text.RegularExpressions.MatchCollection) I found the Item method, so called that on $metric like so:
$name = '(111) 123 456 789 Name of Report Here 123'
$pattern = '(\d{2,3})'
$metric = ([regex]$pattern).Matches($name)
Write-Host $metric.Item(1)
Whilst probably not the best approach, it returns what I'm expecting for now.
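One thing worth checking with this approach: a MatchCollection is zero-indexed, so Item(1) is the second match and Item(0) (or [0]) is the first. A minimal sketch with the sample string above; the bracket-anchored pattern in the last line is borrowed from the other answer and is only one possible alternative.

```powershell
$name  = '(111) 123 456 789 Name of Report Here 123'
$found = [regex]::Matches($name, '\d{2,3}')

$first  = $found[0].Value        # first match in the string
$second = $found.Item(1).Value   # second match, not the first

# [regex]::Match (singular) stops at the first match, so no indexing is needed.
$bracketed = [regex]::Match($name, '\((\d+)\)').Groups[1].Value

"$first / $second / $bracketed"
```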

Related

Using "notin" with matching groups

Using powershell, I am trying to determine which perl scripts in a directory are not called from any other script. In my Select-String I am grouping the matches because there is some other logic I use to filter out results where the line is commented, and a bunch of other scenarios I want to exclude(for simplicity I excluded that from the code posted below). My main problem is in the "-notin" part.
I can get this to work if I remove the grouping from Select-string and only match the filename itself. So this works.
$searchlocation = "C:\Temp\"
$allresults = Select-String -Path "$searchlocation*.pl" -Pattern '\w+\.pl'
$allperlfiles = Get-Childitem -Path "$searchlocation*.pl"
$allperlfiles | foreach-object -process{
$_ | where {$_.name -notin $allresults.matches.value} | Select -expandproperty name | Write-Host
}
However I cannot get the following to work. The only difference between this and above is the value for the "-Pattern" and the value after "-notin". I'm not sure how to use "notin" along with matching groups.
$searchlocation = "C:\Temp\"
$allresults = Select-String -Path "$searchlocation*.pl" -Pattern '(.*?)(\w+\.pl)'
$allperlfiles = Get-Childitem -Path "$searchlocation*.pl"
$allperlfiles | foreach-object -process{
$_ | where {$_.name -notin $allresults.matches.groups[2].value} | Select -expandproperty name | Write-Host}
At a high level the code should search all perl scripts in a directory for any lines that execute any other perl script. With that I now have $allresults which basically gives me a list of all perl scripts called from other files. To get the inverse of that(files that are NOT called from any other file) I get a list of all perl scripts in the directory, cycle through those and list out the ones that DONT show up in $allresults.
When you select a grouping you need to do so using a Select statement, or iteratively in a loop; otherwise you only get the value at the Nth position of the flattened list of groups.
For example, if your $allresults object contains matches for
File1.pl, File2.pl, File3.pl
then $allresults.Matches.Groups[2].Value returns only File1.pl.
Instead, you need to select those values!
$allresults | select @{N="Match";E={ $($_.Matches.Groups[2].value) } }
Which will return:
Match
-----
File1.pl
File2.pl
File3.pl
In your specific example, each match contributes three Group objects (group 0, which is the whole match, plus your two capture groups), and member enumeration flattens them into one sequential list. So match 1 occupies groups[0] through groups[2], while match 2 starts at groups[3].
This means the values you care about (group 2 of each match) sit at indexes {2,5,8,11,...}, which can be described as (N*3)-1, where N is the number of the match. So for match 1 it is (1*3)-1 = [2], while for match 13 it is (13*3)-1 = [38].
You can iterate through them using a loop to check:
for($i=0; $i -le ($allresults.Matches.groups.count-1); $i++){
"Group[$i] = ""$($allresults.Matches.Groups[$i].value)"""
}
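The index arithmetic can be checked without any files by piping strings into Select-String. This is a minimal sketch: the three input lines are hypothetical stand-ins for lines found in the perl scripts, and the last statement shows a per-match alternative that avoids the arithmetic altogether.

```powershell
# Hypothetical lines, each calling one script
$lines = 'system("perl File1.pl");', 'system("perl File2.pl");', 'system("perl File3.pl");'
$allresults = $lines | Select-String -Pattern '(.*?)(\w+\.pl)'

# Flattened list: 3 matches x 3 groups each
$flat = $allresults.Matches.Groups

# Group 2 of match N sits at index (N*3)-1
$byIndex = 2, 5, 8 | ForEach-Object { $flat[$_].Value }

# Per-match access, no index arithmetic needed
$perMatch = $allresults | ForEach-Object { $_.Matches[0].Groups[2].Value }

$byIndex
$perMatch
```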
I noticed that you took the time to avoid loops when collecting your data, but then fell back on one when matching it.
-notin and the other comparison operators, when used in Select and Where clauses, don't need a loop structure and are faster without one, so you can drop the ForEach-Object loop and get a better process just by using a simple Where (?).
$SearchLocation = "C:\Temp\"
$FileGlob = "*.pl"
# Group 2 captures the name of the script being called
$allresults = Select-String -Path "$SearchLocation$FileGlob" -Pattern '(.*?)([\w\.]+\.pl)'
$allperlfiles = Get-Childitem -Path "$SearchLocation$FileGlob"
$allperlfiles | ? {
    $_.name -notin $(
        # Expand the calculated property so -notin compares against plain strings
        $allresults | select @{N="Match";E={ $($_.Matches.Groups[2].value) } } | select -ExpandProperty Match
    )
} | Select -expandproperty name | Write-Host
Now, that should be faster and simpler code to maintain but, as you may have noticed, it still has some redundancies now that you are not looping.
You are piping everything into a Select that could do the work of the Where, and you are only matching on the Name property. So you can either forgo the last Select by piping only the file names in the first place, or forgo the Where and select exactly what you want.
I think the former is far simpler; the latter is useful if you are going to do something with the other properties later on.
Finally, Write-Host is likely redundant, as any object output will echo to the console.
Here is the version that removes the unnecessary loop and the redundant output handling, all together.
$SearchLocation = "C:\Temp\"
$FileGlob = "*.pl"
# Group 2 captures the name of the script being called
$allresults = Select-String -Path "$SearchLocation$FileGlob" -Pattern '(.*?)([\w\.]+\.pl)'
$allperlfiles = Get-Childitem -Path "$SearchLocation$FileGlob"
$allperlfiles.name | ? {
    $_ -notin $(
        $allresults | select @{
            N="Match";E={
                $($_.Matches.Groups[2].value)
            }
        } | select -ExpandProperty Match   # plain strings for -notin
    )
}

Replace or substring first set of numbers with regex

I am struggling to find a way to get only the first set of numbers in a file name in PowerShell. The file names can be similar to the ones below but I only want to get the first string of numbers and nothing else.
Example file names:
123456 (12).csv
123456abc.csv
123456(Copy 1).csv
123456 (Copy 1).csv
What I am currently attempting:
$test = "123456 (12).csv"
$POPieces = $test -match "^[0-9\s]+$"
Write-Host $POPieces
What I'd expect from above:
123456
The -match operator returns a Boolean and stores the match itself in the automatic variable $matches. However, your regular expression allows whitespace (\s) as well as digits, and the ^ and $ anchors require the entire string to consist of those characters, so it does not match your file names at all. Change the expression to ^\d+ to match only a number at the beginning of the string. Use Get-ChildItem to enumerate the files, as Martin Brandl suggested.
$POPieces = Get-ChildItem 'C:\root\folder' -Filter '*.csv' |
Where-Object { $_.Name -match '^\d+' } |
ForEach-Object { $matches[0] }
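The same pattern can be tried against the example file names from the question without touching the file system. A minimal sketch; the $names array is simply the question's list typed in by hand.

```powershell
$names = '123456 (12).csv', '123456abc.csv', '123456(Copy 1).csv', '123456 (Copy 1).csv'

$numbers = foreach ($n in $names) {
    # -match returns $true/$false; the matched text lands in $Matches[0]
    if ($n -match '^\d+') { $Matches[0] }
}

$numbers
```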

replace thousands separators in csv with regex

I'm running into problems trying to pull the thousands separators out of some currency values in a set of files. The "bad" values are delimited with commas and double quotes. There are other values in there that are < $1000 that present no issue.
Example of existing file:
"12,345.67",12.34,"123,456.78",1.00,"123,456,789.12"
Example of desired file (thousands separators removed):
"12345.67",12.34,"123456.78",1.00,"123456789.12"
I found a regex expression for matching the numbers with separators that works great, but I'm having trouble with the -replace operator. The replacement value is confusing me. I read about $& and I'm wondering if I should use that here. I tried $_, but that pulls out ALL my commas. Do I have to use $matches somehow?
Here's my code:
$Files = Get-ChildItem *input.csv
foreach ($file in $Files)
{
$file |
Get-Content | #assume that I can't use -raw
% {$_ -replace '"[\d]{1,3}(,[\d]{3})*(\.[\d]+)?"', ("$&" -replace ',','')} | #this is my problem
out-file output.csv -append -encoding ascii
}
Tony Hinkle's comment is the answer: don't use regex for this (at least not directly on the CSV file).
Your CSV is valid, so you should parse it as such, work on the objects (change the text if you want), then write a new CSV.
Import-Csv -Path .\my.csv | ForEach-Object {
$_ | ForEach-Object {
$_ -replace ',',''
}
} | Export-Csv -Path .\my_new.csv
(this code needs work, specifically the middle as the row will have each column as a property, not an array, but a more complete version of your CSV would make that easier to demonstrate)
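One way to fill in that middle part is to walk each row's properties through PSObject.Properties. This is only a sketch under an assumption: the file has a header row (the names A, B, C and the temp file paths here are made up). For a file without headers, Import-Csv -Header could supply the names instead.

```powershell
# Hypothetical sample file with a header row
$inPath  = Join-Path ([IO.Path]::GetTempPath()) 'amounts_in.csv'
$outPath = Join-Path ([IO.Path]::GetTempPath()) 'amounts_out.csv'
Set-Content -Path $inPath -Value 'A,B,C', '"12,345.67",12.34,"123,456.78"'

Import-Csv -Path $inPath | ForEach-Object {
    # Each column is a property on the row object, so edit the values one by one
    foreach ($prop in $_.PSObject.Properties) {
        $prop.Value = $prop.Value -replace ',', ''
    }
    $_   # pass the modified row on to Export-Csv
} | Export-Csv -Path $outPath -NoTypeInformation
```

Export-Csv writes its own quoting, so the output layout may differ from the input even though the values are the ones wanted.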
You can try with this regex:
,(?=(\d{3},?)+(?:\.\d{1,3})?")
See Live Demo or in powershell:
% {$_ -replace ',(?=(\d{3},?)+(?:\.\d{1,3})?")','' }
But it's more about the challenge that regex can bring. For proper work, use @briantist's answer, which is the clean way to do this.
I would use a simpler regex, and use capture groups instead of the entire match.
I have tested the following regular expression with your input and found no issues.
% {$_ -replace '([\d]),([\d])','$1$2' }
That is, find every comma with a digit directly before and after it (so the mix of quoted and unquoted values doesn't matter) and drop the comma.
This would have problems if your input contains unquoted values next to each other, as the comma separating them would be removed too.
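On the original question about the replacement operand: ("$&" -replace ',','') is evaluated before the outer -replace runs, so it simply yields the literal string $& and every match is replaced with itself. One way to post-process each match is to pass a script block to [regex]::Replace as the match evaluator. A minimal sketch using the sample line and pattern from the question (the [\d] classes are shortened to \d):

```powershell
$line    = '"12,345.67",12.34,"123,456.78",1.00,"123,456,789.12"'
$pattern = '"\d{1,3}(,\d{3})*(\.\d+)?"'

# The script block runs once per match and returns the replacement text
$evaluator = [System.Text.RegularExpressions.MatchEvaluator]{
    param($m)
    $m.Value -replace ',', ''
}

$clean = [regex]::Replace($line, $pattern, $evaluator)
$clean
```

Only the quoted values are touched, so the commas that separate the fields stay in place.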

grep string between two other strings as delimiters

I have to do a report on how many times a certain CSS class appears in the content of our pages (over 10k pages). The trouble is, the header and footer contains that class, so a grep returns every single page.
So, how do I grep for content?
EDIT: I am looking for if a page has list-unstyled between <main> and </main>
So do I use a regular expression for that grep? or do I need to use PowerShell to have more functionality?
I have grep at my disposal and PowerShell, but I could use a portable software if that is my only option.
Ideally, I would get a report (.txt or .csv) with pages and line numbers where the class shows up, but just a list of the pages themselves would suffice.
EDIT: Progress
I now have this in PowerShell
$files = get-childitem -recurse -path w:\test\york\ -Filter *.html
foreach ($file in $files)
{
$htmlfile=[System.IO.File]::ReadAllText($file.fullName)
$regex="(?m)<main([\w\W]*)</main>"
if ($htmlfile -match $regex) {
$middle=$matches[1]
[regex]::Matches($middle,"list-unstyled")
Write-Host $file.fullName has matches in the middle:
}
}
Which I run with this command .\FindStr.ps1 | Export-csv C:\Tools\text.csv
It outputs the filename and path with the string in the console, but does not add anything to the CSV. How can I get that added in?
What Ansgar Wiechers' answer says is good advice. Don't string search html files. I don't have a problem with it but it is worth noting that not all html files are the same and regex searches can produce flawed results. If tools exists that are aware of the file content structure you should use them.
I would like to take a simple approach that reports all files that have enough occurrences of the text list-unstyled in all html files in a given directory. You expect there to be 2? So if more than that show up then there is enough. I would have done a more complicated regex solution but since you want the line number as well I came up with this compromise.
$pattern = "list-unstyled"
Get-ChildItem C:\temp -Recurse -Filter *.html |
Select-String $pattern |
Group-Object Path |
Where-Object{$_.Count -gt 2} |
ForEach-Object{
$props = @{
File = $_.Group | Select-Object -First 1 -ExpandProperty Path
PatternFound = ($_.Group | Select-Object -ExpandProperty LineNumber) -join ";"
}
New-Object -TypeName PSCustomObject -Property $props
}
Select-String is a grep-like tool that can search files for strings. It reports the line number of each hit, which is why we are using it here.
You should get output that looks like this on your PowerShell console.
File PatternFound
---- ------------
C:\temp\content.html 4;11;54
Here 4, 11 and 54 are the lines where the text was found. The code filters out files with fewer than 3 matching lines, so if you expect the class once in the header and once in the footer, those files are excluded.
You can create a regexp that matches across lines. The regexp "(?m)<!-- main content -->([\w\W]*)<!-- end content -->" matches the content delimited by your comments. It is the [\w\W] class that lets the match span lines, since it matches any character including newlines; (?m) only changes how ^ and $ behave, so it is harmless here but not required. The group ([\w\W]*) captures everything between your comments, and lets you query $matches[1], which will contain your "main text" without headers and footers.
$htmlfile=[System.IO.File]::ReadAllText($fileToGrep)
$regex="(?m)<!-- main content -->([\w\W]*)<!-- end content -->"
if ($htmlfile -match $regex) {
$middle=$matches[1]
[regex]::Matches($middle,"list-unstyled")
}
This is only an example of how you could parse the file. Populate $fileToGrep with the name of the file you want to parse, then run this snippet to get a match collection containing every occurrence of list-unstyled in the middle of that file.
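To get the results into a CSV, as the question asks, the script has to emit objects on the pipeline; Write-Host only writes to the console, so Export-Csv receives nothing useful from it. A minimal sketch, assuming the content sits between <main and </main> as in the question's edit; the function name Find-ClassInMain and its parameters are made up for illustration.

```powershell
function Find-ClassInMain {
    param(
        [string]$Path,                        # full path to one HTML file
        [string]$ClassName = 'list-unstyled'
    )
    $html = [System.IO.File]::ReadAllText($Path)
    # Lazy match so only the first <main ...> ... </main> block is taken
    if ($html -match '<main[\w\W]*?</main>') {
        $count = [regex]::Matches($Matches[0], [regex]::Escape($ClassName)).Count
        if ($count -gt 0) {
            # An object, not Write-Host, so Export-Csv has something to write
            [pscustomobject]@{ File = $Path; Count = $count }
        }
    }
}
```

It could then be used along the lines of Get-ChildItem -Recurse -Filter *.html | ForEach-Object { Find-ClassInMain $_.FullName } | Export-Csv C:\Tools\text.csv -NoTypeInformation.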
Don't use string matches for something like this. Analyze the DOM instead. That should allow you to exclude headers and footers by selecting the appropriate root element.
$ie = New-Object -COM 'InternetExplorer.Application'
$url = '...'
$classname = 'list-unstyled'
$ie.Navigate($url)
do { Start-Sleep -Milliseconds 100 } until ($ie.ReadyState -eq 4)
$root = $ie.Document.getElementById('content-element-id')
$hits = $root.getElementsByTagName('*') | ? { $_.ClassName -eq $classname }
$hits.Count # number of occurrences of $classname below content element

PowerShell Select Query With Regex Pattern as sub-expression

I'm trying to do a semi-one-liner to replace the contents of a partial AD Distinguished Name column with the user's actual name.
IE:
Pattern in $_.identity.DistinguishedName
CN=Touchdown§3939303030313134393535383932,CN=ExchangeActiveSyncDevices,CN=Guy\, Some,OU=Employees,OU=Departments and Categories,DC=something,DC=com
One Liner That doesn't work
$devices | Select @{N="Name";E={ (Get-AdUser -Identity ($_.Identity.DistinguishedName -match ".*\,CN=ExchangeActiveSyncDevices\,(.*)" | Out-Null; $Matches[1])).Name }}
This alone works....
$devices[0].Identity.DistinguishedName -match ".*\,CN=ExchangeActiveSyncDevices\,(.*)" | Out-Null; $Matches[1]
And Displays...
CN=Guy\, Some,OU=Employees,OU=Departments and Categories,DC=something,DC=com
This also works, which is similar to what i'm trying to achieve, but doesn't allow me to take the DistinguishedName and go lookup the actual name.
$devices | Select @{N="Name";E={ $_.Identity.DistinguishedName -match ".*\,CN=ExchangeActiveSyncDevices\,(.*)" | Out-Null; $Matches[1] }}
As soon as you try to do this it breaks down; I'm assuming you're not allowed to use a ; to move on to the next command when feeding the -Identity parameter of Get-ADUser.
Get-ADUser -Identity ($devices[0].Identity.DistinguishedName -match ".*\,CN=ExchangeActiveSyncDevices\,(.*)" | Out-Null; $Matches[1])
How would one accomplish this by using a select expression without having to populate a whole new separate variable and replace the original contents with the modified contents?
I don't understand why you're piping to Out-Null. If I understand your question, you're looking for a way to extract substrings from a regular expression? You can use Select-String; for example:
"Hello, world" | Select-String '^\S+, (\S+)' | ForEach-Object {
$_.Matches[0].Groups[1].Value
}
# outputs "world"
Bill
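On why the ; breaks inside the parentheses: ( ) takes a single expression or pipeline, whereas the subexpression operator $( ) accepts several statements. A minimal sketch using a shortened, hypothetical DN (the device CN is made up); Get-ADUser is left out so that it runs without the ActiveDirectory module.

```powershell
# Hypothetical DN, shortened from the one in the question
$dn = 'CN=Device1,CN=ExchangeActiveSyncDevices,CN=Guy\, Some,OU=Employees,DC=something,DC=com'
$pattern = '.*,CN=ExchangeActiveSyncDevices,(.*)'

# $( ) allows several statements; only $Matches[1] produces output here
$userDn = $( $null = $dn -match $pattern; $Matches[1] )

# Alternative: one -replace expression, which also fits inside plain ( )
$userDn2 = ($dn -replace '^.*,CN=ExchangeActiveSyncDevices,', '')

$userDn
$userDn2
```

Either form could then be handed to -Identity inside the calculated property, for example (Get-ADUser -Identity $( ... )).Name.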