Multiline Regex in PowerShell - regex

I have a PowerShell script whose main purpose is to search through HTML files within a folder, find specific HTML markup, and replace it with what I tell it to.
I have been able to do 3/4 of my find and replaces perfectly. The one I am having trouble with involves a Regular Expression.
This is the markup that I am trying to make my regex find and replace:
<a href="programsactivities_skating.html"><br />
</a>
Here is the regex I have so far, along with the function I am using it in:
automate -school "C:\Users\$env:username\Desktop\schools\$question" -query '(?mis)(?!exclude1|exclude2|exclude3)(<a[^>]*?>(\s| |<br\s?/?>)*</a>)' -replace ''
And here is the automate function:
function automate($school, $query, $replace) {
$processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
foreach ($file in $processFiles) {
$text = Get-Content $file
$text = $text -replace $query, $replace
$text | Out-File $file -Force -Encoding utf8
}
}
I have been trying to figure out the solution to this for about 2 days now, and just can't seem to get it to work. I have determined that the problem is that I need to tell my regex to account for multiple lines, and that's what I'm having trouble with.
Any help anyone can provide is greatly appreciated.
Thanks in Advance.

Get-Content produces an array of strings, where each string contains a single line from your input file, so you won't be able to match text passages spanning more than one line. You need to merge the array into a single string if you want to be able to match more than one line:
$text = Get-Content $file | Out-String
or
[String]$text = Get-Content $file
or
$text = [IO.File]::ReadAllText($file)
Note that the 1st and 2nd method don't preserve line breaks from the input file. Method 2 simply mangles all line breaks, as Keith pointed out in the comments, and method 1 puts <CR><LF> at the end of each line when joining the array. The latter may be an issue when dealing with Linux/Unix or Mac files.
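To make the difference concrete, here is a minimal sketch that runs the same pattern once per line and once over the whole file. The temp file name and sample markup are made up for illustration, the pattern is a simplified version of the one in the question, and -Raw is a fourth way to get a single string that assumes PowerShell 3.0 or later.

```powershell
# Hypothetical sample file; the path and contents exist only for this demo.
$file = Join-Path ([IO.Path]::GetTempPath()) 'multiline-demo.html'
$sampleHtml = "<p>keep</p>`r`n<a href=`"programsactivities_skating.html`"><br />`r`n</a>`r`n<p>end</p>"
[IO.File]::WriteAllText($file, $sampleHtml)

# Simplified version of the pattern from the question (exclusions left out).
$query = '(?is)<a[^>]*>(\s|<br\s?/?>)*</a>'

# Line by line: no single line holds both <a ...> and </a>, so nothing is replaced.
$perLine = (Get-Content $file) -replace $query, ''

# Whole file as one string: the pattern can now span the line break.
$raw = Get-Content $file -Raw
$cleaned = $raw -replace $query, ''

Remove-Item $file
```

With this input, $perLine should still contain the empty anchor, while $cleaned should not.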

I don't get what it is you're trying to do with those Exclude elements, but I find multi-line regex is usually easier to construct in a here-string:
$text = @'
<a href="programsactivities_skating.html"><br />
</a>
'@
$regex = @'
(?mis)<a href="programsactivities_skating.html"><br />
\s+?</a>
'@
$text -match $regex
True

Get-Content returns an array of strings, so you need to concatenate those strings into a single one:
function automate($school, $query, $replace) {
$processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
foreach ($file in $processFiles) {
$text = ""
Get-Content $file | % { $text += $_ + "`r`n" }
$text = $text -replace $query, $replace
$text | Out-File $file -Force -Encoding utf8
}
}

Related

PowerShell - Editing a text file, changing each instance of a date "YYYY.MM.DD" to "YYYYMMDD" including the quotes

I need to be able to edit a text file, find strings such as "2022.09.20" (including the "") and replace them with "20220920". I think I am missing something. This is what I tried:
$sf = '.\TestCSV.csv'
$regex = '("\d{4}).(\d{1,2}).(\d{1,2})"'
$regex2 = '("\d{4})(\d{1,2})(\d{1,2})"'
(Get-Content $sf) |
Foreach-Object {$_ -replace $regex , $regex2 } | Set-Content '.\TestCSVNew.csv'
Any help much appreciated
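No answer is shown for this one, so the following is only a sketch of one possible approach, not something from the thread. It assumes the issue is that the second operand of -replace is a substitution string rather than a second regex, so the captured groups have to be referenced as $1, $2, $3. The sample line is made up.

```powershell
# Hypothetical sample line standing in for a row of TestCSV.csv.
$line = '"2022.09.20","widget","2021.1.5"'

# Escape the dots so they match a literal '.', and keep the quotes outside the groups.
$pattern = '"(\d{4})\.(\d{1,2})\.(\d{1,2})"'

# Single quotes stop PowerShell from expanding $1, $2, $3 as variables.
$fixed = $line -replace $pattern, '"$1$2$3"'
$fixed
```

The same -replace call could be dropped into the ForEach-Object block from the question in place of $regex2.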

Select all backslashes between two chars

I am working on a powershell script and I've got several text files where I need to replace backslashes in lines which matches this pattern: .. >\\%name% .. < .. (.. could be anything)
Example string from one of the files where the backslashes should match:
<Tag>\\%name%\TST$\Program\1.0\000\Program.msi</Tag>
Example string from one of the files where the backslashes should not match:
<Tag>/i /L*V "%TST%\filename.log" /quiet /norestart</Tag>
So far I've managed to select every char between >\\%name% and < with this expression (Regex101):
(?<=>\\\\%name%)(.*)(?=<)
but I failed to select only the backslashes.
Is there a solution which I could not yet find?
I'd recommend selecting the relevant tags with an XPath expression and then doing the replacement on the text body of the selected nodes.
$xml.SelectNodes('//Tag[substring(., 1, 8) = "\\%name%"]') | ForEach-Object {
$_.'#text' = $_.'#text' -replace '\\', '\\'
}
So here's my solution:
$original_file = $Filepath
$destination_file = $Filepath + ".new"
Get-Content -Path $original_file | ForEach-Object {
$line = $_
if ($line -match '(?<=>\\\\%name%)(.*)(?=<)'){
$line = $line -replace '\\','/'
}
$line
} | Set-Content -Path $destination_file
Remove-Item $original_file
Rename-Item $destination_file.ToString() $original_file.ToString()
So this replaces every \ with a / on any line that matches the pattern, but not in the way my question asked about.
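If the goal is to touch only the backslashes that follow >\\%name% and leave the rest of the line alone, one option is a MatchEvaluator: match the text after the prefix, then rewrite just that match. This is a sketch under that assumption, using the two sample strings from the question; the variable names are made up.

```powershell
$lines = @(
    '<Tag>\\%name%\TST$\Program\1.0\000\Program.msi</Tag>',
    '<Tag>/i /L*V "%TST%\filename.log" /quiet /norestart</Tag>'
)

# Everything after ">\\%name%" up to (not including) the next "<".
$pattern = '(?<=>\\\\%name%)[^<]*'

# Called once per match; only the matched text is rewritten.
$swap = [System.Text.RegularExpressions.MatchEvaluator]{
    param($m)
    $m.Value.Replace('\', '/')
}

$result = foreach ($line in $lines) { [regex]::Replace($line, $pattern, $swap) }
$result
```

The leading \\%name% stays untouched because the lookbehind is not part of the match, and the second line has no match at all, so it passes through unchanged.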

regex in powershell - not change three characters before text

Is there any easy way to do this?
input: 123215-85_01_test
expected output: 01_test
Another example
input: 12154_02_test
expected output: 02_test
There will be always string "test", but different numbering before
for example this code..
$path = "c:\tmp\*.sql"
get-childitem $path | forEach-object {
$name = $_.Name
$result = $name -replace "","" # I don't know how write this regex..
$extension = $_.Extension
$newName = $prefix+"_"+ $result -f, $extension
Rename-Item -Path $_.FullName -NewName $newName
}
There are two ways you can go at this: a simple split and join, or one of many possible regexes.
Split on underscore and rejoin last 2 elements
$split = "123215-85_01_test" -split "_"
$split[-2..-1] -join "_" # $split[-2,-1] would also work.
Regex to locate the data between the last underscores
"123215-85_01_test" -replace "^.*_(\d+)_(.*)$", '$1_$2'
Note this fails if there are more than 2 underscores.

Search multiple words using regular expression in powershell

I am new to PowerShell and would highly appreciate any help you can provide with the problem below. I have a PowerShell script, but I have not been able to get it to extract all the data fields from the text file.
I have a file 1.txt as below.
I am trying to extract the values of "pid" and "ctl00_lblOurPrice" from the file in the table format below so that I can open it in Excel. Column headings are not important:
pid ctl00_lblOurPrice
0070362408 $6.70
008854787666 $50.70
Currently I am only able to get pid, as below. I would also like to get the price for each pid.
0070362408
008854787666
c:\scan\1.txt:
This is sentence 1.. This is sentence 1.1... This is sentence A1...
fghfdkgjdfhgfkjghfdkghfdgh gifdgjkfdghdfjghfdg
gkjfdhgfdhgfdgh
ghfghfjgh
...
href='http://example.com/viewdetails.aspx?pid=0070362408'>
This is sentence B1.. This is sentence B2... This is sentence B3...
GFGFGHHGH
HHGHGFHG
<p class="price" style="display:inline;">
ctl00_lblOurPrice=$6.70
This is sentence 1.. This is sentence 1.1... This is sentence A1...
fghfdkgjdfhgfkjghfdkghfdgh gifdgjkfdghdfjghfdg
gkjfdhgfdhgfdgh
ghfghfjgh
...
href='http://example.com/viewdetails.aspx?pid=008854787666'>
This is sentence B1.. This is sentence B2... This is sentence B3...
6GBNGH;L
887656HGFHG
<p class="price" style="display:inline;">
ctl00_lblOurPrice=$50.70
...
...
Current powershell script:
$files=Get-ChildItem c:\scan -recurse
$output_file = 'c:\output\outdata.txt'
foreach ($file in $files) {
$input_path = $file
$regex = 'num=\d{1,13}'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % {
($_.Value) -replace "num=","" } | Out-File $output_file -Append }
Thanks in advance for your help
I'm going to assume that you either mean pid=\d{1,13} in your code, or that your sample text should have read num= instead of pid=. We will go with the assumption that it is in fact supposed to be pid.
In that case we will turn the entire file into one long string with -Join "", and then split it on "href" to create records for each site to parse against. Then we match for pid= and ending when it comes across a non-numeric character, and then we look for a dollar amount (a $ followed by numbers, followed by a period, and then two more numbers).
When we have a pair of PID/Price matches we can create an object with two properties, PID and Price, and output that. For this I will assign it to an array, to be used later. If you do not have PSv3 or higher you will have to change [PSCustomObject][ordered] into New-Object PSObject -Property but that loses the order of properties, so I like the former better and use it in my example here.
$files=Get-ChildItem C:\scan -recurse
$output_file = 'c:\output\outdata.csv'
$Results = @()
foreach ($file in $files) {
$Results += ((gc $File) -join "") -split "href" |?{$_ -match "pid=(\d+?)[^\d].*?(\$\d*?\.\d{2})"}|%{[PSCustomObject][ordered]@{"PID"=$Matches[1];"Price"=$Matches[2]}}
}
$Results | Select PID,Price | Export-Csv $output_file -NoTypeInformation
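As a variation on the same idea, here is a sketch that pairs each pid with the next ctl00_lblOurPrice value in a single pass using [regex]::Matches, without splitting on "href". It assumes the price always follows its pid and is labelled ctl00_lblOurPrice= as in the sample; the inline text is a trimmed copy of the sample file, and [PSCustomObject] needs PSv3 or higher.

```powershell
# Trimmed copy of the sample from the question, held in memory for the demo.
$text = @'
href='http://example.com/viewdetails.aspx?pid=0070362408'>
<p class="price" style="display:inline;">
ctl00_lblOurPrice=$6.70
href='http://example.com/viewdetails.aspx?pid=008854787666'>
<p class="price" style="display:inline;">
ctl00_lblOurPrice=$50.70
'@

# (?s) lets '.' cross line breaks; the lazy '.*?' stops at the nearest price.
$pattern = '(?s)pid=(\d+).*?ctl00_lblOurPrice=(\$\d+\.\d{2})'

$rows = foreach ($m in [regex]::Matches($text, $pattern)) {
    [PSCustomObject]@{ PID = $m.Groups[1].Value; Price = $m.Groups[2].Value }
}
$rows
```

For real files, $text could come from [IO.File]::ReadAllText, and $rows can be piped to Export-Csv just like $Results above.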

RegEx Match whole line with first occurrence from the bottom of the file, upwards

I'm trying to parse a file with error codes.
I would only like the first occurrence from the bottom of the file to be returned.
So far, I've got this regex searching for the error code numbers, and it returns the whole line with the Multiline option, but it returns all lines in the file, not just the last one.
^.*?\b(639|640|460|458|664|148)\b.*$
I'm using powershell, so if you have an example using powershell - that would be great.
Thank you.
Assuming your regex is correct for matching on a line then you should be able to do something like this:
$pattern = '^.*?\b(639|640|460|458|664|148)\b.*$'
$content = Get-Content c:\somefile.txt
for ($i = $content.Length - 1; $i -ge 0; $i--) {
if ($content[$i] -match $pattern) {
$matches[1]
break
}
}
I'd use Select-String for this:
$filename = 'C:\path\to\input.txt'
$pattern = '\b(639|640|460|458|664|148)\b'
Get-Content $filename | Select-String $pattern | select -Last 1
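Select-String returns MatchInfo objects rather than plain strings, so the whole line and the matched code are both available from the last result. A small sketch with made-up log lines held in memory instead of a file:

```powershell
# Hypothetical log lines; in practice these would come from Get-Content.
$lines = @('ok', 'error 639 first', 'info', 'error 458 second', 'done')
$pattern = '\b(639|640|460|458|664|148)\b'

$last = $lines | Select-String -Pattern $pattern | Select-Object -Last 1

$wholeLine = $last.Line              # the full text of the last matching line
$code = $last.Matches[0].Value       # just the error code that matched
```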