Replacing any content in between second and third underscore - regex

I have a PowerShell script line that replaces (deletes) the characters between the second and third underscore with an "_":
get-childitem *.pdf | rename-item -newname { $_.name -replace '_\p{L}+, \p{L}+_', "_"}
Examples:
12345_00001_LastName, FirstName_09_2018_Text_MoreText.pdf
12345_00002_LastName, FirstName-SecondName_09_2018_Text_MoreText.pdf
12345_00003_LastName, FirstName SecondName_09_2018_Text_MoreText.pdf
This _\p{L}+, \p{L}+_ regex only works for the first example. To replace everything in between I have used _(?:[^_]*)_([^_]*)_ (according to regex101 this should almost work), but the output is:
12345_09_MoreText.pdf
The desired output would be:
12345_00001_09_2018_Text_MoreText.pdf
12345_00002_09_2018_Text_MoreText.pdf
12345_00003_09_2018_Text_MoreText.pdf
How do I correctly replace the second and third underscore and everything in between with an "_"?

If you don't want to use regex -
$files = Get-ChildItem *.pdf # get all PDF files
$ModifiedFiles, $New = @() # initialize both variables
foreach($file in $files)
{
    $ModifiedFiles = $file.Name.Split("_") # split the file name (not the FileInfo object) on underscores
    $ModifiedFiles = $ModifiedFiles | Where-Object { $_ -ne $ModifiedFiles[2] } # omit the token between the second and third underscore
    $New = $ModifiedFiles -join "_" # rejoin the remaining tokens with underscores
    Rename-Item -Path $file.FullName -NewName $New
}
Sample Data -
$files = "12345_00001_LastName, FirstName_09_2018_Text_MoreText.pdf", "12345_00002_LastName, FirstName-SecondName_09_2018_Text_MoreText.pdf", "12345_00003_LastName, FirstName SecondName_09_2018_Text_MoreText.pdf"
$ModifiedFiles, $New = @() # initialize both variables
foreach($file in $files)
{
    $ModifiedFiles = $file.Split("_") # split the name on underscores
    $ModifiedFiles = $ModifiedFiles | Where-Object { $_ -ne $ModifiedFiles[2] } # omit the token between the second and third underscore
    $New = $ModifiedFiles -join "_" # rejoin the remaining tokens with underscores
}

You may use
-replace '^((?:[^_]*_){2})[^_]+_', '$1'
See the regex demo
Details
^ - start of the line
((?:[^_]*_){2}) - Group 1 (the value will be referenced to with $1 from the replacement pattern): two repetitions of
[^_]* - 0+ chars other than an underscore
_ - an underscore
[^_]+ - 1 or more chars other than _
_ - an underscore
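To see the pattern at work without touching any files, here is a minimal sketch that applies it to the three sample names from the question as plain strings. The variable names $names and $renamed are illustrative only.

```powershell
# Sample file names from the question, used as plain strings (no files are renamed).
$names = @(
    '12345_00001_LastName, FirstName_09_2018_Text_MoreText.pdf'
    '12345_00002_LastName, FirstName-SecondName_09_2018_Text_MoreText.pdf'
    '12345_00003_LastName, FirstName SecondName_09_2018_Text_MoreText.pdf'
)

# Group 1 keeps the first two "token_" pairs; the third token and its
# trailing underscore are matched but not put back.
$renamed = $names | ForEach-Object { $_ -replace '^((?:[^_]*_){2})[^_]+_', '$1' }

$renamed
```

Each of the three names should come out in the 12345_0000N_09_2018_Text_MoreText.pdf form asked for in the question.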

To offer an alternative solution that avoids a complex regex: The following is based on the -split and -join operators and shows PowerShell's flexibility with respect to array slicing:
Get-ChildItem *.pdf | Rename-Item -NewName { ($_.Name -split '_')[0..1 + 3..6] -join '_' } -WhatIf
$_.Name -split '_' splits the filename by _ into an array of tokens (substrings).
Array slice [0..1 + 3..6] combines two range expressions (..) to essentially remove the token with index 2 from the array.
-join '_' reassembles the modified array into a _-separated string, yielding the desired result.
Note: 6, the upper array bound, is hard-coded above, which is suboptimal, but sufficient with input as predictable as in this case.
As of Windows PowerShell v5.1 / PowerShell Core 6.1.0, in order to determine the upper bound dynamically, you require the help of an auxiliary variable, which is clumsy:
Get-ChildItem *.pdf |
    Rename-Item -NewName { ($arr = $_.Name -split '_')[0..1 + 3..($arr.Count-1)] -join '_' } -WhatIf
Wouldn't it be nice if we could write [0..1 + 3..] instead?
This and other improvements to PowerShell's slicing syntax are the subject of this feature suggestion on GitHub.
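If the number of tokens is not known in advance, one way to avoid both the hard-coded bound and the inline auxiliary variable is a small helper that filters by position. This is only a sketch: Remove-TokenAt and its parameters are hypothetical names, not part of PowerShell.

```powershell
# Hypothetical helper: drops the token at a given index, whatever the token count.
function Remove-TokenAt {
    param(
        [string] $Text,
        [int]    $Index,
        [string] $Separator = '_'
    )
    # -split takes a regex, so escape the separator to treat it literally.
    $tokens = $Text -split ([regex]::Escape($Separator))
    # Keep every token whose position differs from $Index.
    $kept = for ($i = 0; $i -lt $tokens.Count; $i++) {
        if ($i -ne $Index) { $tokens[$i] }
    }
    $kept -join $Separator
}

$result = Remove-TokenAt -Text '12345_00002_LastName, FirstName-SecondName_09_2018_Text_MoreText.pdf' -Index 2
$result
```

With the function defined in the session, it could presumably be called from the -NewName script block as well, e.g. { Remove-TokenAt -Text $_.Name -Index 2 }, keeping -WhatIf until the preview looks right.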

here's one other way ... using string methods.
'12345_00003_LastName, FirstName SecondName_09_2018_Text_MoreText.pdf'.
Split('_').
Where({
$_ -notmatch ','
}) -join '_'
result = 12345_00003_09_2018_Text_MoreText.pdf
that does the following ...
split on the underscores
toss out any item that has a comma in it
join the remaining items back into a string with underscores
i suspect that the pure regex solution will be faster, but you may want to use this simply to have something that is easier to understand when you next need to modify it. [grin]

Related

How to filter list depending on occurrences within string

I have an array of items like this:
get-childitem *\bin\Release\*Tests.dll -recurse
I have paths like these:
C:\r\x\ABCTests\bin\Release\net461\ABCTests.dll
C:\r\x\ABCTests\bin\Release\net461\OtherTests.dll
C:\r\x\OtherTests\bin\Release\net461\OtherTests.dll
I only want the paths that the name of the file matches the name of the folder:
C:\r\x\ABCTests\bin\Release\net461\ABCTests.dll - Yes
C:\r\x\ABCTests\bin\Release\net461\OtherTests.dll - No
C:\r\x\OtherTests\bin\Release\net461\OtherTests.dll - Yes
What would be the best way to filter this in PowerShell? I have tried Select-String, but it opens the file. I have the regex ready, but I'm having trouble executing it in PowerShell. Should I use regex?
Here is the powershell code:
get-childitem *\bin\Release\*Tests.dll -recurse | Where-Object { $_.FullName -match {"(" + $_.Name.Substring(0, $_.Name.LastIndexOf(".")) + ").*\1\.dll"} } | %{ write-host $_ }
I suggest using
$rx = '\\([^\\]*)Tests\\bin\\Release\\(?:.*\\)?\1Tests\.dll$'
See the regex demo.
Regex details
\\ - a \ char
([^\\]*) - Group 1: any zero or more chars other than a backslash
Tests\\bin\\Release\\ - a Tests\bin\Release\ text (we may hardcode it since this value was used in the glob)
(?:.*\\)? - an optional sequence of any 0 or more chars other than a newline as many as possible, and then a backslash
\1 - the same value as captured in Group 1
Tests\.dll - Tests.dll string (we may hardcode it since this value was used in the glob)
$ - end of string.
Then use
Get-Childitem *\bin\Release\*Tests.dll -recurse |
Where { $_.FullName -match $rx } |
% { $_.FullName }
See the regex demo.
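The pattern can be checked against the three sample paths from the question without any files on disk. This is a minimal sketch that treats the paths as plain strings; $paths and $matching are illustrative names.

```powershell
$rx = '\\([^\\]*)Tests\\bin\\Release\\(?:.*\\)?\1Tests\.dll$'

# Sample paths from the question, as plain strings.
$paths = @(
    'C:\r\x\ABCTests\bin\Release\net461\ABCTests.dll'
    'C:\r\x\ABCTests\bin\Release\net461\OtherTests.dll'
    'C:\r\x\OtherTests\bin\Release\net461\OtherTests.dll'
)

# Keep only the paths where the folder prefix (Group 1) reappears in the file name.
$matching = $paths | Where-Object { $_ -match $rx }

$matching
```

Only the first and third path should pass the filter, matching the Yes/No list in the question.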

Select all backslashes between two chars

I am working on a powershell script and I've got several text files where I need to replace backslashes in lines which matches this pattern: .. >\\%name% .. < .. (.. could be anything)
Example string from one of the files where the backslashes should match:
<Tag>\\%name%\TST$\Program\1.0\000\Program.msi</Tag>
Example string from one of the files where the backslashes should not match:
<Tag>/i /L*V "%TST%\filename.log" /quiet /norestart</Tag>
So far I've managed to select every char between >\\%name% and < with this expression (Regex101):
(?<=>\\\\%name%)(.*)(?=<)
but I failed to select only the backslashes.
Is there a solution which I could not yet find?
I'd recommend selecting the relevant tags with an XPath expression and then doing the replacement on the text body of the selected nodes.
$xml.SelectNodes('//Tag[substring(., 1, 8) = "\\%name%"]') | ForEach-Object {
$_.'#text' = $_.'#text' -replace '\\', '\\'
}
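A self-contained sketch of the XPath approach follows. It makes several assumptions that are not in the original: the <Root> wrapper element, the use of the InnerText property instead of '#text', and '/' as the replacement character (taken from the asker's own solution).

```powershell
# Illustrative document; the <Root> wrapper is an assumption.
[xml] $xml = '<Root><Tag>\\%name%\TST$\Program\1.0\000\Program.msi</Tag><Tag>/i /L*V "%TST%\filename.log" /quiet /norestart</Tag></Root>'

# Select only the tags whose text starts with the 8 characters \\%name%.
$xml.SelectNodes('//Tag[substring(., 1, 8) = "\\%name%"]') | ForEach-Object {
    # '\\' is a regex for one literal backslash; '/' is the assumed replacement.
    $_.InnerText = $_.InnerText -replace '\\', '/'
}

# Collect the text of every tag to inspect the result.
$tags = $xml.SelectNodes('//Tag') | ForEach-Object { $_.InnerText }
$tags
```

Note that this variant also converts the two leading backslashes of the selected tag, since the replacement runs over the whole text body.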
So here's my solution:
$original_file = $Filepath
$destination_file = $Filepath + ".new"
Get-Content -Path $original_file | ForEach-Object {
$line = $_
if ($line -match '(?<=>\\\\%name%)(.*)(?=<)'){
$line = $line -replace '\\','/'
}
$line
} | Set-Content -Path $destination_file
Remove-Item $original_file
Rename-Item $destination_file.ToString() $original_file.ToString()
So this will replace every \ with a / in lines matching the given pattern, but not in the way my question asked about.
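For selecting only the backslashes, one option that none of the posts above use is a variable-length lookbehind, which the .NET regex engine behind -replace supports. The sketch below is an assumption-laden illustration, not the accepted answer: $onlyBackslashes is a made-up name, and the pattern deliberately leaves the two leading backslashes of \\%name% alone.

```powershell
# Sample lines from the question.
$lines = @(
    '<Tag>\\%name%\TST$\Program\1.0\000\Program.msi</Tag>'
    '<Tag>/i /L*V "%TST%\filename.log" /quiet /norestart</Tag>'
)

# Match a single backslash only if ">\\%name%" occurs earlier in the string
# with no "<" between it and the backslash.
$onlyBackslashes = '(?<=>\\\\%name%[^<]*)\\'

$converted = $lines | ForEach-Object { $_ -replace $onlyBackslashes, '/' }
$converted
```

Because the condition lives in the lookbehind, each match consists of exactly one backslash, so the replacement touches nothing else on the line.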

regex in powershell - not change three characters before text

Is there any easy way to do this?
input: 123215-85_01_test
expected output: 01_test
Another example
input: 12154_02_test
expected output: 02_test
There will be always string "test", but different numbering before
for example this code..
$path = "c:\tmp\*.sql"
get-childitem $path | forEach-object {
$name = $_.Name
$result = $name -replace "","" # I don't know how write this regex..
$extension = $_.Extension
$newName = $prefix+"_"+ $result -f, $extension
Rename-Item -Path $_.FullName -NewName $newName
}
There are two ways you can go at this: a simple split and join, or one of many regexes.
Split on underscore and rejoin last 2 elements
$split = "123215-85_01_test" -split "_"
$split[-2..-1] -join "_" # $split[-2,-1] would also work.
Regex to locate the data between the last underscores
"123215-85_01_test" -replace "^.*_(\d+)_(.*)$", '$1_$2'
Note this fails if there are more than 2 underscores.
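As a quick check, both approaches can be run side by side on the two inputs from the question. This is just a sketch; $samples, $bySplit and $byRegex are illustrative names.

```powershell
# Inputs from the question.
$samples = '123215-85_01_test', '12154_02_test'

# Approach 1: split on "_" and rejoin the last two tokens.
$bySplit = $samples | ForEach-Object { ($_ -split '_')[-2..-1] -join '_' }

# Approach 2: capture the digits and the trailing text around the last underscores.
$byRegex = $samples | ForEach-Object { $_ -replace '^.*_(\d+)_(.*)$', '$1_$2' }

$bySplit
$byRegex
```

For these inputs both lists should read 01_test and 02_test.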

Exclude array items based on dynamic criteria

I have an array of strings like:
File1
File2
File1_s1
File2_s1
Print$
PSDrive
PSParentPath
I have a need to select all strings that do not conform to a dynamic set of rules. I don't really need fancy regex, I just want to match a dynamic amount of very simple regex rules. Basically:
$Arr | Where {($_.Name -notlike '_s1') -and ($_.Name -notlike 'Print$')}
But I need a dynamic amount of -ands specified by an input to the function. Is there any easy way to do this?
Ok so you can do this
$omits = "_s1","Print$"
$regex = '({0})' -f (($omits | ForEach-Object{[regex]::Escape($_)}) -join "|")
$arr | Where-Object{$_ -notmatch $regex}
$omits would contain the list of strings you want to -match/-notmatch. We take each member and run a regex escape on it ($ is a special regex character: the end-of-line anchor). Then we take each escaped string and build a matching group. So in the above example $regex would be
(_s1|Print\$)
Add more entries to $omits as you see fit. The example above would give the filtered results as
File1
File2
PSDrive
PSParentPath
If you can be trusted to escape your own regex your options open up more.
$omits = "_s1","Print\$","^PS"
$regex = '({0})' -f ($omits -join "|")
That way the PS has to be at the beginning of the string.
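Since the question asks for a function that takes the rules as input, the escaped variant could be wrapped as follows. This is a sketch only: Select-NotMatching and its parameter names are hypothetical, and it assumes at least one entry in -Omit (an empty list would produce a pattern that matches everything).

```powershell
# Hypothetical wrapper around the escaped-alternation approach.
function Select-NotMatching {
    param(
        [string[]] $InputStrings,
        [string[]] $Omit
    )
    # Escape each literal, then join them into one alternation group.
    $regex = '({0})' -f (($Omit | ForEach-Object { [regex]::Escape($_) }) -join '|')
    # Return only the strings that match none of the alternatives.
    $InputStrings | Where-Object { $_ -notmatch $regex }
}

$arr = 'File1', 'File2', 'File1_s1', 'File2_s1', 'Print$', 'PSDrive', 'PSParentPath'
$filtered = Select-NotMatching -InputStrings $arr -Omit '_s1', 'Print$'
$filtered
```

The literals are passed in single quotes here so that PowerShell never tries to expand the $ in Print$.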
As @Matt suggests, without wildcards in your strings -NotLike is basically equivalent to -ne (and then you could just use -NotIn). I'm assuming your examples are missing wildcards but your actual patterns are not.
$patternArray = 'File1','File2','File1_s1','File2_s1','Print$','PSDrive','PSParentPath';
foreach ($pattern in $patternArray) {
$Arr = $Arr | Where-Object {$_.Name -notlike $pattern};
}
Or you could do some basic matching with:
$patternArray = 'File1','File2','File1_s1','File2_s1','Print$','PSDrive','PSParentPath';
foreach ($pattern in $patternArray) {
$Arr = $Arr | Where-Object {$_.Name -notlike "*$pattern*"};
}

Expansion of a variable in a regex pattern doesn't work

As a novice in powershell coding, I have some difficulties with expansion of a variable in PowerShell regex patterns.
What I wanted to do is:
Scan for logfiles that have been changed between two timeframes
For each of the logfiles, I get the part of the name which indicates the date it refers to.
That date is stored in the variable $filedate.
Then go through each line of the logfiles
Whenever I find a line that looks like:
14:00:15 blablabla
In a file named blabla20130620.log
I want that the data line becomes
2013-06-20 14:00:15 blablabla
It should write the output in append mode to a text file (to concatenate different log files)
Here is what I got until now (I'm testing in a sandbox now, so no comments etc...)
$Logpath = "o:\Log"
$prevcheck="2013-06-24 19:27:14"
$currenttd="{0:yyyy-MM-dd HH:mm:ss}" -f (get-date)
$batch = 1000
[regex]$match_regex = '^([01]\d|2[0-3]):([0-5]\d):([0-5]\d)'
If (Test-Path "$Logpath\test.txt"){
Remove-Item "$Logpath\test.txt"
}
$files=Get-ChildItem $LogPath\*.log | Where-Object { $_.LastWriteTime -ge "$prevcheck" -and $_.LastWriteTime -le "$currenttd" -and !$_.PSIsContainer }
foreach ($file in $files)
{
$filedate=$file.Name.Substring(6,4) + "-" + $file.Name.Substring(10,2) + "-" + $file.Name.Substring(12,2)
## This doesn't seem to work fine
## results look like:
## "$filedate" 14:00:15 blablabla
$replace_regex = '"$filedate" $_'
## I tried this too, but without success
## The time seems to disappear now
## results look like:
## 2013-06-20 blablabla
#$replace_regex = iex('$filedate' + $_)
(Get-Content $file.PSPath -ReadCount $batch) |
foreach-object {if ($_ -match $match_regex) { $_ -replace $match_regex, $replace_regex} else { $_ }}|
out-file -Append "o:\log\test.txt"
}
You're over-complicating things.
You're comparing dates in your Where-Object filter, so you don't need to transform your reference dates to strings. Just use dates:
$prevcheck = Get-Date "2013-06-24 19:27:14"
$currenttd = Get-Date
You can use a regular expression to extract the date from the file name and transform it into the desired format:
$filedate = $file.BaseName -replace '^.*(\d{4})(\d{2})(\d{2})$', '$1-$2-$3'
Your regular expression for matching the time is overly correct. Use ^(\d{2}:\d{2}:\d{2}) instead. It's a little sloppier, but it will most likely suffice and is a lot easier on the eye.
To prepend the time-match with the date, use "$filedate `$1". The double quotes will cause $filedate to be expanded to the date from the file name, and the escaped $ (`$1) will keep the grouped match (see Richard's explanation).
While you can assign the results from each step to variables, it'd be simpler to just use a single pipeline.
Try this:
$Logpath = "o:\Log"
$Logfile = "$Logpath\test.txt"
$prevcheck = Get-Date "2013-06-24 19:27:14"
$currenttd = Get-Date
If (Test-Path -LiteralPath $Logfile) { Remove-Item $Logfile }
Get-ChildItem "$LogPath\*.log" | ? {
-not $_.PSIsContainer -and
$_.LastWriteTime -ge $prevcheck -and
$_.LastWriteTime -le $currenttd
} | % {
$filedate = $_.BaseName -replace '^.*(\d{4})(\d{2})(\d{2})$', '$1-$2-$3'
Get-Content $_ | % {
$_ -replace '^(\d{2}:\d{2}:\d{2})', "$filedate `$1"
} | Out-File -Append $Logfile
}
In PowerShell strings have to be in double quotes (") for variable substitution. Single quoted (') strings do not perform variable substitution.
In your script (in which I suggest you indent the content of code blocks to make the structure easier to follow):
$replace_regex = '"$filedate" $_'
where the string is single quoted, so no variable substitution. This can be fixed by remembering the back-quote (`) character can be used to escape double quotes embedded in a double quoted string:
$replace_regex = "`"$filedate`" $_"
But remember:
$ is a regex meta-character, so if you want to include a $ in a regex in double quotes it will need to be escaped to avoid PSH treating it as the start of the variable name.
Any regex meta-characters in the variable will have their regex meaning. Consider escaping the content of the variable before substitution ([regex]::Escape(string)).
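The quoting rules from both answers can be tried on the example from the question without any log files. This is a minimal sketch using the file name blabla20130620 and the line 14:00:15 blablabla as plain strings; $line and $stamped are illustrative names.

```powershell
# Extract the date from the file's base name and reformat it.
$filedate = 'blabla20130620' -replace '^.*(\d{4})(\d{2})(\d{2})$', '$1-$2-$3'

$line = '14:00:15 blablabla'

# Double quotes let PowerShell expand $filedate; the backtick keeps $1
# intact so the regex engine can substitute the captured time.
$stamped = $line -replace '^(\d{2}:\d{2}:\d{2})', "$filedate `$1"

$filedate
$stamped
```

The patterns themselves stay in single quotes so that PowerShell leaves their $ characters alone.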