Regex for matching between 2nd quotation and file name?


I have a PowerShell script that opens CSV files and replaces the full file path in the 2nd column with just the file name. I can use the -replace operator in PowerShell, but I can't match a fixed string, because the file paths vary in length and in the number of subdirectories.
I need help using a regex to match a string like this:
String: "1003,"\\ST12345\share$\SYSTEM\V1\1\2\1234.htm"
I want to match: \\ST12345\share$\SYSTEM\V1\1\2\
so that I can replace it with an empty string (thus deleting it). Another issue is that the shares can have a varying number of directories, so there could be 2 backslashes or there could be 4, but there will always be a file name and the path will always start with \.
Thank you for your help!

You may use the following pattern:
(?<=,").*?(?=\d+\.htm)
Powershell demo:
$result = '"1003,"\\ST12345\share$\SYSTEM\V1\1\2\1234.htm"' | Select-String -Pattern '(?<=,").*?(?=\d+\.htm)'
$result.Matches.Value
Prints:
\\ST12345\share$\SYSTEM\V1\1\2\

To answer your question exactly as asked (even though your input string has an imbalanced "):
PS> '"1003,"\\ST12345\share$\SYSTEM\V1\1\2\1234.htm"' -replace '(?<=")\\.+\\'
"1003,"1234.htm"
(?<=") is a look-behind assertion that matches the " immediately before the file path without including it in the match.
\\.+\\ matches an (escaped) \ followed by any nonempty sequence of characters (.+) followed by a \. .NET regex matching is greedy by default, so everything through the last \ is matched, effectively removing the file's directory path.
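As a rough sketch of how the same -replace could be applied to several lines at once: the second sample line below is invented for illustration, and the snippet assumes each line contains exactly one quoted path that starts with a backslash.

```powershell
# Sample lines; the first is from the question, the second is a hypothetical
# variant with fewer subdirectories.
$lines = @(
    '"1003,"\\ST12345\share$\SYSTEM\V1\1\2\1234.htm"'
    '"1004,"\\ST12345\share$\SYSTEM\5678.htm"'
)

# -replace works element-wise on an array. With no replacement operand,
# the matched directory path is simply removed.
$cleaned = $lines -replace '(?<=")\\.+\\'
$cleaned
```

Because .+ is greedy, the number of subdirectories does not matter: the match always runs through the last backslash on the line.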

Related

Multiline Regex Lookbehind Failing in Powershell

I'm trying to parse a particular text file. One portion of the file is:
Installed HotFix
n/a Internet Explorer - 0
Applications:
In powershell, this is currently in a file C:\temp\software.txt. I'm trying to get it to return all lines in between "HotFix" and "Applications:" (As there may be more in the future.)
My current command looks like this:
Get-Content -Raw -Path 'C:\temp\software.txt' | Where-Object { $_ -match '(?<=HotFix\n)((.*?\n)+)(?=Applications)' }
Other regex I've tried:
'(?<=HotFix`n)((.*?`n)+)(?=Applications)'
'(?<=HotFix`n)((.*?\n)+)(?=Applications)'
'(?<=HotFix\n)((.*?`n)+)(?=Applications)'
'(?<=HotFix$)((.*?\n)+)(?=Applications)'
'(?<=HotFix)((.*?\n)+)(?=Applications)'
'(?<=HotFix)((.*?`n)+)(?=Applications)'

I think Select-String will provide better results here:
((Get-Content -Path 'C:\temp\software.txt' -Raw |
    Select-String -Pattern '(?sm)(?<=HotFix\s*$).*?(?=^Applications:)' -AllMatches).Matches.Value).Trim()
Regex modifier s is used because you are expecting the . character to potentially match newline characters. Regex modifier m is used so that end of string $ and start of string ^ characters can be matched on each line. Together that syntax is (?sm) in PowerShell.
Where {$_ -match ...} will return anything that makes the condition true. Since you are passing a Get-Content -Raw output, the entire contents of the file will be one string and therefore the entire string will output on a true condition.
Since you used -match here against a single string, any successful matches will be stored in the $matches automatic variable. Your matched string would be available in $matches[0]. If you were expecting multiple matches, -match will not work as constructed here.
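The difference between the Boolean result of -match and the text it captured can be seen in a small sketch. The sample text is built in memory with explicit LF line endings, which is an assumption; a real file may use CRLF.

```powershell
# In-memory stand-in for the file contents (LF line endings assumed).
$text = "Installed HotFix`nn/a Internet Explorer - 0`nApplications:`n"

# -match only returns $true or $false; the matched text is stored in $Matches.
# (?s) lets . match newline characters as well.
$found = $text -match '(?s)(?<=HotFix\n).*?(?=\nApplications:)'
$between = if ($found) { $Matches[0] } else { $null }
$between
```

Here $found only tells you that the pattern matched somewhere; $between holds the text between the two marker lines.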
Alternatively, the .NET Matches() method of the Regex class, can also do the job:
[regex]::Matches((Get-Content 'c:\temp\software.txt' -Raw),'(?sm)(?<=HotFix\s*$).*?(?=^Applications:)').Value.Trim()
Without Trim(), you'd need to understand your newline character situation:
[regex]::Matches((Get-Content software.txt -Raw),'(?m)(?<=HotFix\r?\n?)[^\r\n]+(?=\r?\n?^Applications:)').Value
An alternative to a single multiline regex is a switch statement, which reads the file line by line and tests each line against simple patterns:
switch -File Software.txt -Regex {
    'HotFix\s*$' { $Hotfix, $Applications = $true, $false }
    '^Applications:' { $Applications = $true }
    default {
        if ($Hotfix -and !$Applications) {
            $_
        }
    }
}

If you read the file into a string the following regular expression will read the lines of interest:
/(?<=HotFix\n).*?(?=\nApplications:)/s
The regex reads:
Match zero or more characters, lazily (?), preceded by the string "HotFix\n" and followed by the string "\nApplications:".
(?<=HotFix\n) is a positive lookbehind; (?=\nApplications:) is a positive lookahead.
The flag s (/s) causes .*? to continue past the ends of lines. (Some languages have a different flag that has the same effect.)
.*? (lazy match) is used in place of .* (greedy match) in case more than one line after the "HotFix" line begins with "Applications:". The lazy version stops at the first such line; the greedy version, at the last.
I would not be inclined to use a regex for this task. For one, the entire file must be read into a string, which could be problematic (memory-wise) if the file is sufficiently large. Instead, I would simply read the file line by line, keeping only the current line in memory. Once the "HotFix" line has been read, save the following lines until the "Applications:" line is read. Then, after closing the file, you're done.
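The lazy-versus-greedy difference described in this answer can be checked with a short sketch. The sample text, with two "Applications:" lines, is made up for illustration, and the .NET inline option (?s) stands in for the /s flag.

```powershell
# Hypothetical input with two "Applications:" lines after the HotFix line.
$text = "Installed HotFix`nline one`nApplications:`nline two`nApplications:`n"

# Lazy quantifier: stops before the first "Applications:" line.
$lazy = [regex]::Match($text, '(?s)(?<=HotFix\n).*?(?=\nApplications:)').Value

# Greedy quantifier: runs on to just before the last "Applications:" line.
$greedy = [regex]::Match($text, '(?s)(?<=HotFix\n).*(?=\nApplications:)').Value

$lazy
$greedy
```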

Instead of using lookarounds, you could make use of a capturing group.
First match the line that ends with HotFix. Then capture in group 1 all the following lines that do not start with Applications:, and finally match Applications:.
^.*\bHotFix\r?\n((?:(?!Applications:).*\r?\n)+)Applications:
Explanation
^.*\bHotFix\r?\n Match the line that ends with HotFix
( Capture group 1
(?: Non capture group
(?!Applications:).*\r?\n Match the whole line if it does not start with Applications:
)+ Close non capturing group and repeat 1+ times to match all lines
) Close group 1
Applications: Match literally
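A minimal sketch of reading group 1 from PowerShell follows. The in-memory sample text and the added (?m) option (so that ^ can match at the start of any line) are assumptions, not part of the pattern above.

```powershell
# In-memory stand-in for the file contents (LF line endings assumed).
$text = "Installed HotFix`nn/a Internet Explorer - 0`nApplications:`n"

# Same pattern as above, prefixed with (?m) for multiline anchors.
$pattern = '(?m)^.*\bHotFix\r?\n((?:(?!Applications:).*\r?\n)+)Applications:'

$m = [regex]::Match($text, $pattern)

# Group 1 holds the captured lines, including their trailing newline.
$captured = $m.Groups[1].Value.Trim()
$captured
```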

Match only a file's base name (filename without extension) with a regular expression

In PowerShell, I want to compare a file's name. The file has the name as someFileName.docx. I am using the following cmdlet:
$tmpTarget["Name"] -match $name
Sometimes $tmpTarget["Name"] has a file extension and sometimes it doesn't,
so I want a regular expression that will -match only the file name and ignore the extension.
Thank you

If you want to match the base name only (the file name without extension), it's simpler to use [IO.Path]::GetFileNameWithoutExtension() first and match the result:
[IO.Path]::GetFileNameWithoutExtension($tmpTarget["Name"]) -match '^someFileName$'
Note that sample file name someFileName is anchored with ^ (start of string) and $ (end of string), because -match by default performs substring matching.
Of course, to just match someFileName in full, literally, -eq 'someFileName' will do.
In PowerShell Core (but not in Windows PowerShell), you can alternatively use Split-Path -LeafBase:
# PowerShell *Core* only.
PS> Split-Path -LeafBase 'someFileName.docx'
someFileName
If you do want to use a single regular expression, you can use the following, but it's significantly more complex:
$tmpTarget["Name"] -match '^someFileName(?:\.|$)'
(?:...) is a non-capturing subexpression
\.|$ either matches a literal . (\.) - the start of an extension - or (|) the end of the string ($) - in case there is no extension.
The above works fine if your file names only ever have at most 1 extension, such as someFileName or someFileName.docx.
If your file names may have multiple extensions, such as someFileName.foo.docx, and you only want to ignore the last extension, a little more work is needed:
PS> 'someFileName.foo.docx' -match '^someFileName\.foo(?:\.[^.]*$|$)'
True # 'someFileName.foo' matches in full, because only .docx is ignored.
Subexpression \.[^.]*$ only matches the last extension: a literal . (\.) followed by something other than . ([^.]) zero or more times (*) through the end of the string ($).
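To see how the single-extension pattern behaves, here is a small sketch that runs it against a few names. The third name, someFileNameX.docx, is invented to show a near miss.

```powershell
$pattern = '^someFileName(?:\.|$)'

# Test the pattern against names with and without an extension.
$results = foreach ($name in 'someFileName', 'someFileName.docx', 'someFileNameX.docx') {
    [pscustomobject]@{
        Name    = $name
        IsMatch = $name -match $pattern
    }
}
$results
```

The first two names match; the third does not, because the base name must be followed directly by a dot or by the end of the string.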

As file names could have multiple dots contained it's difficult to decide if the last dot separated part is meant to be an extension.
But if the BaseName is known simply make the extension optional
$name = [RegEx] 'someFileName(\.docx)?'
$tmpTarget["Name"] -match $name

You can use the -replace operator for this. It will output the file base name without the extension.
$tmpTarget["Name"] -replace "\.[^.]*$"
If you already know all of the possible extensions, you can do something like the following, which will only remove the known extension if it is present.
$tmpTarget["Name"] -replace "(\.docx|\.xlsx|\.txt)$"
Explanation:
-replace uses a regex match to select a string and replaces it with a string. \.[^.]*$ matches the final . character and then greedily matches all characters other than dots until the end of the string ($).
The first solution has a problem when a file name contains . characters but has no extension: the text after the last dot is then removed as if it were an extension. Knowing the extensions up front makes the matching more reliable.
The way to avoid all of the text manipulation is to use System.IO.FileInfo objects. Those objects have all of the file parts you are looking for separated into properties.
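A brief sketch of the FileInfo approach is shown below. It assumes the name is available as a plain string; casting a string to [System.IO.FileInfo] does not require the file to exist, and the sample name is only illustrative.

```powershell
# Cast a name to FileInfo; no file is created or needs to be present.
$file = [System.IO.FileInfo]'someFileName.foo.docx'

$baseName  = $file.BaseName    # name without the last extension
$extension = $file.Extension   # last extension, including the dot

$baseName
$extension
```

When the names come from Get-ChildItem, the objects are already FileInfo instances, so BaseName and Extension can be read directly.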

Regex to exclude certain file extensions

Similar questions have been asked, but they miss one thing I need to do and I can't figure it out.
I need to find all files that do NOT have either a tif, or tiff extension, but I DO need to find all others including those that have no extension. I got the first part working with the regex below, but this doesn't match files with no extension.
^(.+)\.(?!tif$|tiff$).+$
That works great, but I need the following to work.
filename.ext MATCH
filename.abc MATCH
filename.tif FAIL
filename MATCH
Thanks :)

If you're not working with JS/ECMAscript regex, you can use:
^.*(?<!\.tif)(?<!\.tiff)$
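As a quick check of this pattern in PowerShell (which uses the .NET regex engine, so lookbehinds are available), the sketch below filters the sample names from the question. filename.tiff is added as an extra assumed test case.

```powershell
$pattern = '^.*(?<!\.tif)(?<!\.tiff)$'
$names = 'filename.ext', 'filename.abc', 'filename.tif', 'filename.tiff', 'filename'

# Keep only the names the pattern matches, i.e. those not ending in .tif or .tiff.
$kept = $names | Where-Object { $_ -match $pattern }
$kept
```

Note that -match is case-insensitive by default, so names ending in .TIF are excluded as well; -cmatch would make the test case-sensitive.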

Rather than writing a negative regex, consider using the simpler, positive regex, but taking action when something does not match. This is often a superior approach.
It can't be used in every situation (e.g. if you are using a command line tool that requires you to specify what does match), but I would do this where possible.
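In PowerShell, for example, this idea could be sketched with -notmatch and a simple positive pattern. The list of names is taken from the question, plus filename.tiff as an assumed extra case.

```powershell
$names = 'filename.ext', 'filename.abc', 'filename.tif', 'filename.tiff', 'filename'

# Positive pattern: "ends in .tif or .tiff". With an array on the left-hand
# side, -notmatch returns the elements that do NOT match.
$kept = $names -notmatch '\.tiff?$'
$kept
```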

This works for me:
^(?:(.+\.)((?!tif$|tiff$)[^.]*)|[^.]+)$
That regex is split in two different parts:
Part 1: (.+\.)((?!tif$|tiff$)[^.]*)
(.+\.) (1st capturing group) Match a filename (potentially containing dots) up to and including the last dot of the string (the one preceding the extension).
((?!tif$|tiff$)[^.]*) (2nd capturing group) Then check if the dot is not followed by exactly "tif" or "tiff" and if so match the extension.
Part 2: [^.]+ If part 1 didn't match, check if you have just a filename containing no dot.

If you have some strings in a text file (one per line):
perl -lne '/\.tiff?$/ || print' file
If you have some files in a directory:
ls | perl -lne '/\.tiff?$/ || print'

Here's what I came up with:
^[^\.\s]+(\.|\s)(?!tiff?)
Explanation:
Beginning of line to dot or whitespace, put your matching group around this, ie:
^(?<result>[^\.\s]+)
It will then look for a dot or a whitespace, with a negative lookahead on the tiff (tiff? will match to both tif and tiff).
This makes the assumption that there will always be a dot or a whitespace after the filename. You can change this to be an end of line if that is what you need:
^[^\.\s]+(\.(?!tiff?)|\n) linux
^[^\.\s]+(\.(?!tiff?)|\r\n) windows

Regex to remove string after file extension

I'm using PowerShell to query for a service path from which results should resemble C:\directory\sub-directory\service.exe
Some results however also include characters after the .exe file extension, for example output may resemble one of the following:
C:\directory\sub-directory\service.exe ThisTextNeedsRemoving
C:\directory\sub-directory\service.exe -ThisTextNeedsRemoving
C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving
i.e. ThisTextNeedsRemoving may be preceded by a space, hyphen or forward slash.
I can use the regex -replace '($*.exe).*' to remove everything from the .exe file extension onward, including the extension itself, but how do I keep the .exe in the results?

You can use a look-around:
$txt = 'C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving'
$txt -replace '(?<=\.exe).+', ''
This uses a look-behind which is a zero-width match so it doesn't get replaced.

Using a lookbehind is possible, but note that lookbehinds are only necessary when you need to specify a rather complex condition or to obtain overlapping matches. In most cases, when you can do without a lookbehind, you should consider a non-lookbehind solution, because a lookbehind is a rather costly operation. It is easier to check once whether the current character is not whitespace than to also check, for each character, whether it is preceded by some other character, a whole substring, or a more complex pattern.
Thus, I'd suggest using a solution based on capturing mechanism, with a backreference in the replacement part to restore the captured substring in the result:
$s -replace '^(\S+\.exe) .*','$1'
or - for paths containing spaces and not inside double quotes:
$s -replace '^(.*?\.exe) .*','$1'
Explanation:
^ - start of string
(\S+\.exe) - one or more characters other than whitespace (\S+) (or, with .*?, any characters other than a newline, as few as possible) followed by a literal . and exe
' .*' - a space and then any number of characters other than a newline.
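The capture-and-restore replacement can be tried on the sample outputs from the question. In this sketch the fourth entry, a path with nothing after .exe, is an added assumption to show that such strings pass through unchanged.

```powershell
$samples = @(
    'C:\directory\sub-directory\service.exe ThisTextNeedsRemoving'
    'C:\directory\sub-directory\service.exe -ThisTextNeedsRemoving'
    'C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving'
    'C:\directory\sub-directory\service.exe'
)

# Group 1 captures the path through .exe; '$1' (single-quoted so PowerShell
# does not expand it) puts that captured text back in place of the whole match.
$trimmed = $samples -replace '^(\S+\.exe) .*', '$1'
$trimmed
```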

Powershell regex - match until character

So what I need to do is match text until I hit a certain character, then stop. Right now I'm having a heck of a time getting that to work right and at this point I think I'm just confusing myself even more. The text I'm searching will look like this:
ServerA_logfile.log
ServerB_logfile.log
ServerC_logfile.log
What I need to do is just return the server name, and exclude everything after the underscore character.
Here's my code:
Get-ChildItem \\fileshare\logs\ -Name -Filter *.log | foreach { [regex]::match($_ -replace "^_", "")}
What it returns is.... well, not helpful, but that's as good as I can get.
What am I missing?

What you need is a positive lookahead (it's tailored to the "match what comes before something" case):
[Regex]::Match($_, "^.+(?=_)").Value
Match() does not return a string, but a Match object. Hence the Value property should be accessed to extract the string from the object.
In case it wasn't clear, the expression looks for:
at the beginning of line (^)
string of any length (longer or equal to one character) (.+)
followed by underscore ((?=_)), that's positive lookahead

There is another very simple solution:
[Regex]::Match($_, "^[^_]*").Value
[^_] matches any character except underscores. Therefore ^[^_]* starts the match at the start of the string and stops before the first underscore.
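The two patterns agree on the sample names from the question but differ once a name contains more than one underscore. The sketch below compares them; Server_B_logfile.log is a made-up name for that case.

```powershell
$names = 'ServerA_logfile.log', 'Server_B_logfile.log'

$results = foreach ($n in $names) {
    [pscustomobject]@{
        Name    = $n
        # Greedy .+ with a lookahead: runs up to the LAST underscore.
        Greedy  = [regex]::Match($n, '^.+(?=_)').Value
        # Negated character class: stops before the FIRST underscore.
        Negated = [regex]::Match($n, '^[^_]*').Value
    }
}
$results
```

Which behaviour is wanted depends on whether server names can themselves contain underscores.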

I know regex was requested, but it would be just as easy (maybe easier) to use the built in split command.
Here is the code:
Get-ChildItem \\fileshare\logs\ -Name -Filter *.log | foreach { $_.Split("_")[0] }
