Multiline Regex Lookbehind Failing in Powershell


I'm trying to parse a particular text file. One portion of the file is:
Installed HotFix
n/a Internet Explorer - 0
Applications:
In powershell, this is currently in a file C:\temp\software.txt. I'm trying to get it to return all lines in between "HotFix" and "Applications:" (As there may be more in the future.)
My current command looks like this:
Get-Content -Raw -Path 'C:\temp\software.txt' | Where-Object { $_ -match '(?<=HotFix\n)((.*?\n)+)(?=Applications)' }
Other regex I've tried:
'(?<=HotFix`n)((.*?`n)+)(?=Applications)'
'(?<=HotFix`n)((.*?\n)+)(?=Applications)'
'(?<=HotFix\n)((.*?`n)+)(?=Applications)'
'(?<=HotFix$)((.*?\n)+)(?=Applications)'
'(?<=HotFix)((.*?\n)+)(?=Applications)'
'(?<=HotFix)((.*?`n)+)(?=Applications)'

I think Select-String will provide better results here:
((Get-Content -Path 'C:\temp\software.txt' -Raw |
Select-String -Pattern '(?sm)(?<=HotFix\s*$).*?(?=^Applications:)' -AllMatches).Matches.Value).Trim()
Regex modifier s is used because you are expecting the . character to potentially match newline characters. Regex modifier m is used so that $ and ^ match at the end and start of each line rather than only at the end and start of the whole string. Together that syntax is (?sm) in PowerShell.
Where {$_ -match ...} will return anything that makes the condition true. Since you are passing a Get-Content -Raw output, the entire contents of the file will be one string and therefore the entire string will output on a true condition.
Since you used -match here against a single string, any successful matches will be stored in the $matches automatic variable. Your matched string would be available in $matches[0]. If you were expecting multiple matches, -match will not work as constructed here.
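As a minimal sketch of the point about $Matches, assuming a hypothetical sample string with Windows (CRLF) line endings in place of the real file, -match only answers True or False and the matched text has to be read from $Matches[0]:

```powershell
# Hypothetical sample text standing in for the contents of C:\temp\software.txt
$text = "Installed HotFix`r`nn/a Internet Explorer - 0`r`nApplications:`r`n"
$pattern = '(?sm)(?<=HotFix\s*$).*?(?=^Applications:)'

# -match returns a Boolean for a single string; the matched text is in $Matches[0]
if ($text -match $pattern) {
    $Matches[0].Trim()   # n/a Internet Explorer - 0
}
```

Piping the same string through Where-Object with this condition would instead emit the whole input string, which is the behavior described above.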
Alternatively, the .NET Matches() method of the Regex class can also do the job:
[regex]::Matches((Get-Content 'c:\temp\software.txt' -Raw),'(?sm)(?<=HotFix\s*$).*?(?=^Applications:)').Value.Trim()
Without Trim(), you need to account for the file's newline convention (CRLF or LF) in the pattern itself:
[regex]::Matches((Get-Content software.txt -Raw),'(?m)(?<=HotFix\r?\n?)[^\r\n]+(?=\r?\n?^Applications:)').Value
An alternative that avoids a multiline regex is a switch statement that processes the file line by line:
switch -File Software.txt -Regex {
    'HotFix\s*$' { $Hotfix, $Applications = $true, $false }
    '^Applications:' { $Applications = $true }
    default {
        if ($Hotfix -and !$Applications) {
            $_
        }
    }
}

If you read the file into a string the following regular expression will read the lines of interest:
/(?<=HotFix\n).*?(?=\nApplications:)/s
The regex reads:
Match zero or more characters, lazily (?), preceded by the string "HotFix\n" and followed by the string "\nApplications:".
(?<=HotFix\n) is a positive lookbehind; (?=\nApplications:) is a positive lookahead.
The flag s (/s) causes .*? to continue past the ends of lines. (Some languages have a different flag that has the same effect.)
.*? (lazy match) is used in place of .* (greedy match) in the event that more than one line following the "HotFix" line begins with "Applications:". The lazy version will stop at the first; the greedy version, at the last.
I would not be inclined to use a regex for this task. For one, the entire file must be read into a string, which could be problematic (memory-wise) if the file is sufficiently large. Instead, I would simply read the file line by line, keeping only the current line in memory. Once the "HotFix" line has been read, save the following lines until the "Applications:" line is read. Then, after closing the file, you're done.
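The line-by-line approach described above could look roughly like this. It is only a sketch: the function name Get-LinesBetween and the in-memory sample array are hypothetical, and the two patterns are assumptions based on the sample shown in the question.

```powershell
function Get-LinesBetween {
    param([string[]]$Lines)

    $inside = $false
    foreach ($line in $Lines) {
        # Stop at the closing marker, but only once the opening marker was seen
        if ($inside -and $line -match '^Applications:') { break }
        # Emit every line that follows the HotFix line
        if ($inside) { $line }
        # The opening marker itself is not emitted
        if ($line -match 'HotFix\s*$') { $inside = $true }
    }
}

# Hypothetical sample lines standing in for the file
$sample = 'Installed HotFix', 'n/a Internet Explorer - 0', 'Applications:'
Get-LinesBetween -Lines $sample   # n/a Internet Explorer - 0
```

For a real file, the lines could be supplied by Get-Content without -Raw, which yields one string per line.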

Instead of using lookarounds, you could make use of a capturing group.
First match the line that ends with HotFix. Then capture in group 1 all the following lines that do not start with Applications:, and finally match Applications: itself.
^.*\bHotFix\r?\n((?:(?!Applications:).*\r?\n)+)Applications:
Explanation
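^.*\bHotFix\r?\n Match the line that ends with HotFix (in .NET, ^ matches at the start of a line other than the first only when the multiline option (?m) is enabled)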
( Capture group 1
(?: Non capture group
(?!Applications:).*\r?\n Match the whole line if it does not start with Applications:
)+ Close non capturing group and repeat 1+ times to match all lines
) Close group 1
Applications: Match literally
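A short sketch of how this pattern could be used from PowerShell. The sample string is hypothetical, and the (?m) prefix is an addition that is not part of the pattern above; it is assumed here so that ^ can match a HotFix line that is not the first line of the input.

```powershell
# Hypothetical sample text with CRLF line endings
$text = "Installed HotFix`r`nn/a Internet Explorer - 0`r`nApplications:`r`n"

# (?m) is an assumption added so that ^ matches at the start of any line
$pattern = '(?m)^.*\bHotFix\r?\n((?:(?!Applications:).*\r?\n)+)Applications:'

$m = [regex]::Match($text, $pattern)
if ($m.Success) {
    # Group 1 holds the captured lines, including their trailing line breaks
    $m.Groups[1].Value.Trim()   # n/a Internet Explorer - 0
}
```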

Related

Match only a file's base name (filename without extension) with a regular expression

In PowerShell, I want to compare a file's name. The file has the name as someFileName.docx. I am using the following cmdlet:
$tmpTarget["Name"] -match $name
sometimes the $tmpTarget["Name"] has a file extension and sometimes it doesn't.
so I want a regular expression that will only -match the file name, but ignore the extension.
Thank you

If you want to match the base name only (the file name without extension), it's simpler to use [IO.Path]::GetFileNameWithoutExtension() first and match the result:
[IO.Path]::GetFileNameWithoutExtension($tmpTarget["Name"]) -match '^someFileName$'
Note that the sample file name someFileName is anchored with ^ (start of string) and $ (end of string), because -match performs substring matching by default.
Of course, to just match someFileName in full, literally, -eq 'someFileName' will do.
In PowerShell Core (but not in Windows PowerShell), you can alternatively use Split-Path -LeafBase:
# PowerShell *Core* only.
PS> Split-Path -LeafBase 'someFileName.docx'
someFileName
If you do want to use a single regular expression, you can use the following, but it's significantly more complex:
$tmpTarget["Name"] -match '^someFileName(?:\.|$)'
(?:...) is a non-capturing subexpression
\.|$ either matches a literal . (\.) - the start of an extension - or (|) the end of the string ($) - in case there is no extension.
The above works fine if your file names only ever have at most 1 extension, such as someFileName or someFileName.docx.
If your file names may have multiple extensions, such as someFileName.foo.docx, and you only want to ignore the last extension, a little more work is needed:
PS> 'someFileName.foo.docx' -match '^someFileName\.foo(?:\.[^.]*$|$)'
True # 'someFileName.foo' matches in full, because only .docx is ignored.
Subexpression \.[^.]*$ only matches the last extension: a literal . (\.) followed by something other than . ([^.]) zero or more times (*) through the end of the string ($).

Because file names can contain multiple dots, it's difficult to decide whether the part after the last dot is meant to be an extension.
But if the base name is known, simply make the extension optional:
$name = [RegEx] 'someFileName(\.docx)?'
$tmpTarget["Name"] -match $name

You can use the -replace operator for this. It will output the file base name without the extension.
$tmpTarget["Name"] -replace "\.[^.]*$"
If you already know all of the possible extensions, you can do something like the following, which will only remove the known extension if it is present.
$tmpTarget["Name"] -replace "(\.docx|\.xlsx|\.txt)$"
Explanation:
-replace uses a regex to select part of a string and replaces it with the replacement string (or removes it when no replacement is given). \.[^.]*$ matches the final . character and then greedily matches all characters other than dots until the end of the string ($).
The first solution goes wrong when a file name contains . characters but has no extension, because the text after the last dot is removed as if it were one. Knowing the extensions up front makes the matching more reliable.
The way to avoid all of the text manipulation is to use System.IO.FileInfo objects. Those objects have all of the file parts you are looking for separated into properties.
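A brief sketch of the FileInfo approach, assuming a hypothetical file name; the file does not need to exist for these properties to be read. BaseName is a property that PowerShell adds on top of the .NET type.

```powershell
# Hypothetical file name; casting a string creates the FileInfo object
$file = [System.IO.FileInfo]'someFileName.foo.docx'

$file.Name        # someFileName.foo.docx
$file.BaseName    # someFileName.foo
$file.Extension   # .docx
```

Objects returned by Get-ChildItem for files are already of this type, so no cast is needed there.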

Remove everything up to and including triple newline

I am very new to Powershell, so I am no doubt doing something really stupid that causes my attempts to get this to work to not actually work... but after an hour of struggling, I'd love a hand.
I have a file for which a triple newline (two empty lines) marks a boundary. I want only everything that comes after the boundary.
My latest fruitless attempt looks like this:
$content = Get-Content -Raw $Path
$content = $content -Replace '^.+`r`n`r`n`r`n', ''
All my attempts to even match a single new line have failed. The -Raw parameter is because I came to understand this would change the way newlines were processed, but it didn't change anything.
I am also aware the regex isn't ideal; I'd want to make it non-greedy but I want to get a super-basic test case working first given my unfamiliarity with whatever flavor of regular expressions Powershell supports. (I assume I can just stick a ? after the + to fix that, but first things first.)
The goal is to go from
useless metadata I don't care about
more useless metadata


actual content
to this:
actual content
What am I doing wrong?

'`r`n' (single-quoted) is a literal 4-character string, while "`r`n" (double-quoted) is a 2-character CRLF line break, so your pattern cannot match any line breaks. It is safer to use \r to match CR and \n to match LF in PowerShell regex patterns.
Also note that there are several lines between the start of the string and your delimiter, but . does not match a newline by default, you need a (?s) inline modifier to make . match newlines, too.
Use
$content -replace '(?s)^.*?(?:\r?\n){3}'
Details
(?s) - a Singleline option that makes . match newlines, too
^ - start of the string
.*? - any 0+ chars, as few as possible
(?:\r?\n){3} - triple CRLF/LF line break.
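Both points can be checked with a small sketch. The sample string is hypothetical and uses CRLF line endings, with two empty lines marking the boundary.

```powershell
# Single quotes keep the backticks literal; double quotes expand them
'`r`n'.Length   # 4
"`r`n".Length   # 2

# Hypothetical input: two metadata lines, two empty lines, then the content
$content = "useless metadata`r`nmore useless metadata`r`n`r`n`r`nactual content"

# Remove everything up to and including the first run of three line breaks
$content -replace '(?s)^.*?(?:\r?\n){3}'   # actual content
```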

Regex for matching between 2nd quotation and file name?

I have a powershell script that opens up CSV files and replaces 2nd column full file path with just file names. I am able to use -replace function in powershell, but I don't have a way to explicitly match certain string because the file path vary in lengths and how many sub directories there are.
I need help in using regex to match the string like this:
String: "1003,"\\ST12345\share$\SYSTEM\V1\1\2\1234.htm"
I want to match: \\ST12345\share$\SYSTEM\V1\1\2\
so I could replace the above with an empty string (thus deleting it). Another issue is that the shares can have a varying number of directories, so there could be 2 backslashes or 4, but there will always be a file name and the path will always start with \.
Thank you for your help!

You may use the following pattern:
(?<=,").*?(?=\d+\.htm)
Powershell demo:
$result = '"1003,"\\ST12345\share$\SYSTEM\V1\1\2\1234.htm"' | Select-String -Pattern '(?<=,").*?(?=\d+\.htm)'
$result.Matches.Value
Prints:
\\ST12345\share$\SYSTEM\V1\1\2\

To answer your question exactly as asked (even though your input string has an imbalanced "):
PS> '"1003,"\\ST12345\share$\SYSTEM\V1\1\2\1234.htm"' -replace '(?<=")\\.+\\'
"1003,"1234.htm"
(?<=") is a look-behind assertion that matches the " immediately before the file path without including it in the match.
\\.+\\ matches an (escaped) \ followed by any nonempty sequence of characters (.+) followed by a \. .NET regex matching is greedy by default, so everything through the last \ is matched, effectively removing the file's directory path.

Regex to remove string after file extension

I'm using PowerShell to query for a service path from which results should resemble C:\directory\sub-directory\service.exe
Some results however also include characters after the .exe file extension, for example output may resemble one of the following:
C:\directory\sub-directory\service.exe ThisTextNeedsRemoving
C:\directory\sub-directory\service.exe -ThisTextNeedsRemoving
C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving
i.e. ThisTextNeedsRemoving may be preceded by a space, hyphen or forward slash.
I can use the regex -replace '($*.exe).*' to remove everything after, but including the .exe file extension, but how do I keep the .exe in the results?

You can use a look-around:
$txt = 'C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving'
$txt -replace '(?<=\.exe).+', ''
This uses a look-behind which is a zero-width match so it doesn't get replaced.

Using a lookbehind is possible, but lookbehinds are only necessary when you need to express a fairly complex condition or to obtain overlapping matches. When you can do without one, consider a non-lookbehind solution, because a lookbehind is a rather costly operation: it is cheaper to check once whether the current character is not whitespace than to also check whether each candidate position is preceded by some character, a whole substring, or a more complex pattern.
Thus, I'd suggest using a solution based on capturing mechanism, with a backreference in the replacement part to restore the captured substring in the result:
$s -replace '^(\S+\.exe) .*','$1'
or - for paths containing spaces and not inside double quotes:
$s -replace '^(.*?\.exe) .*','$1'
Explanation:
^ - start of string
(\S+\.exe) - one or more characters other than whitespace (\S+), or, in the second pattern, any characters other than a newline, as few as possible (.*?), followed by a literal . and exe
" .*" - a space and then any number of characters other than a newline.
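As a quick check, the second pattern can be run against the three sample strings from the question. The variable names here are only illustrative.

```powershell
# The three sample inputs from the question
$samples = @(
    'C:\directory\sub-directory\service.exe ThisTextNeedsRemoving'
    'C:\directory\sub-directory\service.exe -ThisTextNeedsRemoving'
    'C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving'
)

# Keep the captured path ($1) and drop the space and everything after it
$cleaned = $samples | ForEach-Object { $_ -replace '^(.*?\.exe) .*', '$1' }
$cleaned   # each element is C:\directory\sub-directory\service.exe
```

The replacement string is single-quoted so that PowerShell does not try to expand $1 as a variable before the regex engine sees it.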

Powershell regex - match until character

So what I need to do is match text until I hit a certain character, then stop. Right now I'm having a heck of a time getting that to work right and at this point I think I'm just confusing myself even more. The text I'm searching will look like this:
ServerA_logfile.log
ServerB_logfile.log
ServerC_logfile.log
What I need to do is just return the server name, and exclude everything after the underscore character.
Here's my code:
Get-ChildItem \\fileshare\logs\ -Name -Filter *.log | foreach { [regex]::match($_ -replace "^_", "")}
What it returns is.... well, not helpful, but that's as good as I can get.
What am I missing?

What you need is a positive lookahead, which is tailored to the "match everything before something" case:
[Regex]::Match($_, "^.+(?=_)").Value
Match() does not return a string, but a Match object. Hence the Value property should be accessed to extract the string from the object.
In case it wasn't clear, expression used specifies to find:
at the beginning of line (^)
string of any length (longer or equal to one character) (.+)
followed by underscore ((?=_)), that's positive lookahead

There is another very simple solution:
[Regex]::Match($_, "^[^_]*").Value
[^_] matches any character except underscores. Therefore ^[^_]* starts the match at the start of the string and stops before the first underscore.
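The two patterns give the same result for the file names in the question, but they differ when a name contains more than one underscore. A small sketch, using a hypothetical file name, shows the difference:

```powershell
# Hypothetical name with two underscores
$name = 'Server_A_logfile.log'

# Greedy .+ backtracks only as far as the LAST underscore
$greedy = [regex]::Match($name, '^.+(?=_)').Value    # Server_A

# The negated class stops before the FIRST underscore
$negated = [regex]::Match($name, '^[^_]*').Value     # Server

$greedy
$negated
```

Which of the two is wanted depends on whether server names themselves may contain underscores.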

I know regex was requested, but it would be just as easy (maybe easier) to use the built-in Split() string method.
Here is the code:
Get-ChildItem \\fileshare\logs\ -Name -Filter *.log | foreach { $_.Split("_")[0] }
