PowerShell regex - match until character

So what I need to do is match text until I hit a certain character, then stop. Right now I'm having a heck of a time getting that to work right and at this point I think I'm just confusing myself even more. The text I'm searching will look like this:
ServerA_logfile.log
ServerB_logfile.log
ServerC_logfile.log
What I need to do is just return the server name, and exclude everything after the underscore character.
Here's my code:
Get-ChildItem \\fileshare\logs\ -Name -Filter *.log | foreach { [regex]::match($_ -replace "^_", "")}
What it returns is.... well, not helpful, but that's as good as I can get.
What am I missing?

What you need is a positive lookahead, which is tailored to the "match everything before something" case:
[Regex]::Match($_, "^.+(?=_)").Value
Match() does not return a string, but a Match object. Hence the Value property should be accessed to extract the string from the object.
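In case it wasn't clear, the expression specifies the following:
start at the beginning of the line (^)
match a string of one or more characters (.+); because .+ is greedy, it runs up to the last underscore if there are several
require that an underscore follows ((?=_)); this is the positive lookahead, and the underscore is not part of the match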

There is another very simple solution:
[Regex]::Match($_, "^[^_]*").Value
[^_] matches any character except underscores. Therefore ^[^_]* starts the match at the start of the string and stops before the first underscore.
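The two patterns above agree on the sample names but differ once a name contains more than one underscore. A minimal sketch of that difference, written in Python only as a neutral illustration (Python's re module is not the .NET engine, although both patterns are spelled the same way in each); the name ServerA_app_logfile.log is made up for the example:

```python
import re

def greedy_lookahead(name: str) -> str:
    # ^.+(?=_) : .+ is greedy, so the match extends to the LAST underscore.
    m = re.match(r"^.+(?=_)", name)
    return m.group(0) if m else ""

def negated_class(name: str) -> str:
    # ^[^_]* : stops before the FIRST underscore, or matches the whole
    # string when it contains no underscore at all.
    return re.match(r"^[^_]*", name).group(0)

for name in ("ServerA_logfile.log", "ServerA_app_logfile.log"):  # 2nd name is hypothetical
    print(name, "->", greedy_lookahead(name), "|", negated_class(name))
```

If only the text before the first underscore is wanted, the negated character class is the safer of the two.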

I know regex was requested, but it would be just as easy (maybe easier) to use the built-in Split() method.
Here is the code:
Get-ChildItem \\fileshare\logs\ -Name -Filter *.log | foreach { $_.Split("_")[0] }
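The same split-based idea can be sketched outside PowerShell. The Python snippet below is only an illustration of the technique, not a replacement for the pipeline above; the helper name server_name is invented for the example:

```python
def server_name(file_name: str) -> str:
    # Keep everything before the first underscore; if there is no
    # underscore, split() returns the whole name as the only element.
    return file_name.split("_", 1)[0]

names = ["ServerA_logfile.log", "ServerB_logfile.log", "ServerC_logfile.log"]
print([server_name(n) for n in names])
```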

Related

Multiline Regex Lookbehind Failing in Powershell

I'm trying to parse a particular text file. One portion of the file is:
Installed HotFix
n/a Internet Explorer - 0
Applications:
In powershell, this is currently in a file C:\temp\software.txt. I'm trying to get it to return all lines in between "HotFix" and "Applications:" (As there may be more in the future.)
My current command looks like this:
Get-Content -Raw -Path 'C:\temp\software.txt' | Where-Object { $_ -match '(?<=HotFix\n)((.*?\n)+)(?=Applications)' }
Other regex I've tried:
'(?<=HotFix`n)((.*?`n)+)(?=Applications)'
'(?<=HotFix`n)((.*?\n)+)(?=Applications)'
'(?<=HotFix\n)((.*?`n)+)(?=Applications)'
'(?<=HotFix$)((.*?\n)+)(?=Applications)'
'(?<=HotFix)((.*?\n)+)(?=Applications)'
'(?<=HotFix)((.*?`n)+)(?=Applications)'
I think Select-String will provide better results here:
((Get-Content -Path 'C:\temp\software.txt' -Raw |
Select-String -Pattern '(?sm)(?<=HotFix\s*$).*?(?=^Applications:)' -AllMatches).Matches.Value).Trim()
Regex modifier s is used because you are expecting the . character to potentially match newline characters. Regex modifier m is used so that end of string $ and start of string ^ characters can be matched on each line. Together that syntax is (?sm) in PowerShell.
Where {$_ -match ...} will return anything that makes the condition true. Since you are passing a Get-Content -Raw output, the entire contents of the file will be one string and therefore the entire string will output on a true condition.
Since you used -match here against a single string, any successful matches will be stored in the $matches automatic variable. Your matched string would be available in $matches[0]. If you were expecting multiple matches, -match will not work as constructed here.
Alternatively, the .NET Matches() method of the Regex class can also do the job:
[regex]::Matches((Get-Content 'c:\temp\software.txt' -Raw),'(?sm)(?<=HotFix\s*$).*?(?=^Applications:)').Value.Trim()
Without Trim(), you'd need to understand your newline character situation:
[regex]::Matches((Get-Content software.txt -Raw),'(?m)(?<=HotFix\r?\n?)[^\r\n]+(?=\r?\n?^Applications:)').Value
An alternative that avoids lookarounds is a switch statement that processes the file line by line.
switch -File Software.txt -Regex {
    'HotFix\s*$' { $Hotfix,$Applications = $true,$false }
    '^Applications:' { $Applications = $true }
    default {
        if ($Hotfix -and !$Applications) {
            $_
        }
    }
}
If you read the file into a string the following regular expression will read the lines of interest:
/(?<=HotFix\n).*?(?=\nApplications:)/s
The regex reads:
Match zero or more characters, lazily (?), preceded by the string "HotFix\n" and followed by the string "\nApplications:".
(?<=HotFix\n) is a positive lookbehind; (?=\nApplications:) is a positive lookahead.
The flag s (/s) causes .*? to continue past the ends of lines. (Some languages have a different flag that has the same effect.)
.*? (lazy match) is used in place of .* (greedy match) in case more than one line following the "HotFix" line begins with "Applications:". The lazy version will stop at the first; the greedy version, at the last.
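The lazy-versus-greedy difference can be seen in a small sketch. It is written in Python (re.S plays the role of the /s flag) purely for illustration, and it assumes the text uses bare "\n" line endings; the second "Applications:" line is invented to make the difference visible:

```python
import re

text = (
    "Installed HotFix\n"
    "n/a Internet Explorer - 0\n"
    "Applications:\n"
    "something else\n"      # invented extra content
    "Applications:\n"       # invented second marker
)

# Lazy: stops at the first "\nApplications:" after the HotFix line.
lazy = re.search(r"(?<=HotFix\n).*?(?=\nApplications:)", text, re.S)
# Greedy: runs on to the last "\nApplications:" in the text.
greedy = re.search(r"(?<=HotFix\n).*(?=\nApplications:)", text, re.S)

print(repr(lazy.group(0)))
print(repr(greedy.group(0)))
```

With "\r\n" line endings the lookbehind on "HotFix\n" would no longer match, so the pattern would need adjusting for such files.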
I would not be inclined to use a regex for this task. For one, the entire file must be read into a string, which could be problematic (memory-wise) if the file is sufficiently large. Instead, I would simply read the file line by line, keeping only the current line in memory. Once the "HotFix" line has been read, save the following lines until the "Applications:" line is read. Then, after closing the file, you're done.
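A minimal sketch of that line-by-line approach, in Python for illustration only. The function name lines_between is made up, and the rule "a line ending in HotFix starts the block, a line starting with Applications: ends it" is an assumption taken from the sample in the question:

```python
from typing import Iterable, Iterator

def lines_between(lines: Iterable[str]) -> Iterator[str]:
    """Yield the lines after the 'HotFix' line, up to 'Applications:'."""
    inside = False
    for line in lines:
        stripped = line.rstrip("\r\n")
        if not inside:
            # Assumed start marker: a line that ends with "HotFix".
            if stripped.rstrip().endswith("HotFix"):
                inside = True
        elif stripped.startswith("Applications:"):
            return  # end marker reached; stop reading
        else:
            yield stripped

sample = ["Installed HotFix\n", "n/a Internet Explorer - 0\n", "Applications:\n"]
print(list(lines_between(sample)))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, so only one line is held in memory at a time.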
Instead of using lookarounds, you could make use of a capturing group.
First match the line that ends with HotFix. Then capture in group 1 all the following lines that do not start with Applications:, and finally match Applications: itself.
^.*\bHotFix\r?\n((?:(?!Applications:).*\r?\n)+)Applications:
Explanation
^.*\bHotFix\r?\n Match the line that ends with HotFix
( Capture group 1
(?: Non capture group
(?!Applications:).*\r?\n Match the whole line if it does not start with Applications:
)+ Close non capturing group and repeat 1+ times to match all lines
) Close group 1
Applications: Match literally
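The capturing-group pattern can be tried out as follows. This is a sketch in Python rather than PowerShell; the re.M flag is an assumption added here so that ^ can match at the start of any line, and hotfix_block is a hypothetical helper name:

```python
import re

PATTERN = re.compile(
    r"^.*\bHotFix\r?\n((?:(?!Applications:).*\r?\n)+)Applications:",
    re.M,  # assumed: lets ^ anchor at the start of each line
)

def hotfix_block(text: str) -> list:
    # Group 1 holds every line between the HotFix line and Applications:
    m = PATTERN.search(text)
    return m.group(1).splitlines() if m else []

unix_text = "Installed HotFix\nn/a Internet Explorer - 0\nApplications:\n"
windows_text = unix_text.replace("\n", "\r\n")
print(hotfix_block(unix_text))
print(hotfix_block(windows_text))
```

The optional \r in the pattern is what lets the same expression handle both line-ending styles.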

Regex for matching between 2nd quotation and file name?

I have a PowerShell script that opens CSV files and replaces the full file path in the 2nd column with just the file name. I am able to use the -replace operator in PowerShell, but I don't have a way to explicitly match a certain string because the file paths vary in length and in the number of subdirectories.
I need help in using regex to match the string like this:
String: "1003,"\\ST12345\share$\SYSTEM\V1\1\2\1234.htm"
I want to match: \\ST12345\share$\SYSTEM\V1\1\2\
so I could replace the above with an empty string (thus deleting it). Another issue is that the shares could have a varying number of directories, so there could be 2 backslashes or there could be 4 backslashes, but there will always be a file name and the string will always start with \.
Thank you for your help!
You may use the following pattern:
(?<=,").*?(?=\d+\.htm)
Powershell demo:
# Use a regular variable rather than assigning to the automatic variable $matches
$result = '"1003,"\\ST12345\share$\SYSTEM\V1\1\2\1234.htm"' | Select-String -Pattern '(?<=,").*?(?=\d+\.htm)'
$result.Matches.Value
Prints:
\\ST12345\share$\SYSTEM\V1\1\2\
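For comparison, the same lookbehind/lookahead pattern can be exercised in Python. This is only an illustrative sketch (the helper name directory_part is invented), and it inherits the pattern's assumption that the file name is a run of digits followed by .htm:

```python
import re

def directory_part(line: str):
    # (?<=,")  : the match must start right after ,"
    # .*?      : lazily take as little as possible ...
    # (?=\d+\.htm) : ... until what follows is digits plus ".htm"
    m = re.search(r'(?<=,").*?(?=\d+\.htm)', line)
    return m.group(0) if m else None

line = r'"1003,"\\ST12345\share$\SYSTEM\V1\1\2\1234.htm"'
print(directory_part(line))
```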
To answer your question exactly as asked (even though your input string has an imbalanced "):
PS> '"1003,"\\ST12345\share$\SYSTEM\V1\1\2\1234.htm"' -replace '(?<=")\\.+\\'
"1003,"1234.htm"
(?<=") is a look-behind assertion that matches the " immediately before the file path without including it in the match.
\\.+\\ matches an (escaped) \ followed by any nonempty sequence of characters (.+) followed by a \. .NET regex matching is greedy by default, so everything through the last \ is matched, effectively removing the file's directory path.
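The greedy behaviour described above can be checked with a short sketch. It is written in Python for illustration (re.sub stands in for -replace, and strip_directory is a made-up name); the pattern itself is unchanged:

```python
import re

def strip_directory(line: str) -> str:
    # (?<=")  : start right after a double quote
    # \\.+\\  : a backslash, then greedily everything up to the LAST backslash
    return re.sub(r'(?<=")\\.+\\', "", line)

line = r'"1003,"\\ST12345\share$\SYSTEM\V1\1\2\1234.htm"'
print(strip_directory(line))
```

Because .+ is greedy, the number of subdirectories does not matter: the match always ends at the final backslash before the file name.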

Unable to figure out regex bash or sed or awk

I wanted to split the following jdk-1.6.0_30-fcs.x86_64 to just jdk-1.6.0_30. I tried the following sed 's/\([a-z][^fcs]*\).*/\1/' but I end up with jdk-1.6.0_30-. I think I am approaching it the wrong way. Is there a way to start from the end of the word and traverse backwards until I encounter -?
Not exactly, but you can anchor the pattern to the end of the string with $. Then you just need to make sure that the characters you repeat may not include hyphens:
echo jdk-1.6.0_30-fcs.x86_64 | sed 's/-[^-]*$//'
This will match from a - to the end of the string, but all characters in between must be different from - (so that it does not match for the first hyphen already).
A slightly more detailed explanation: the engine tries to match the literal - first. That first succeeds at the first - in the string. Then [^-]* matches as many non-hyphen characters as possible, so it consumes 1.6.0_30 (because the next character is a hyphen). Now the engine tries to match $, but that fails because we are not at the end of the string. Some backtracking occurs, but we can ignore that here. In the end the engine abandons the attempt starting at the first - and continues through the string. It then matches the literal - against the second -. Now [^-]* consumes fcs.x86_64. We are now at the end of the string, so $ matches, and the full match (which will be removed) is -fcs.x86_64.
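The end-anchored pattern can be tried in isolation. The sketch below uses Python's re.sub instead of sed, purely as an illustration of the same expression; strip_last_segment is a hypothetical helper name:

```python
import re

def strip_last_segment(s: str) -> str:
    # -[^-]*$ : a hyphen, then only non-hyphen characters up to the end,
    # so the match can only start at the LAST hyphen in the string.
    return re.sub(r"-[^-]*$", "", s)

print(strip_last_segment("jdk-1.6.0_30-fcs.x86_64"))
```

A string without any hyphen is returned unchanged, because the pattern then has nothing to match.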
Use cut:
echo 'jdk-1.6.0_30-fcs.x86_64' | cut -d- -f-2
Try this:
echo 'jdk-1.6.0_30-fcs.x86_64' | sed 's/-fcs.*//'
If using bash, sh or ash, you can do:
var=jdk-1.6.0_30-fcs.x86_64
echo "${var%%-fcs*}"
jdk-1.6.0_30
The latter solution uses parameter expansion; it was tested on Linux and Minix3.

Extract strings between two separators using regex in perl

I have a file which looks like:
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
and I wish to extract strings between : and | separators, the output should be:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
tab delimited between the two columns.
I wrote in unix a perl command:
perl -l -ne '/:([^|]*)?[^:]*:([^|]*)/ and print($1,"\t",$2)' <file>
the output that I got is:
Q9VNB0 EBI-102551 uniprotkb:A1ZBG6
P91682 EBI-142245 uniprotkb:Q24117
P92177-3 EBI-204491 uniprotkb:Q9VDK2
I wish to know what I am doing wrong and how I can fix the problem.
I don't wish to use split function.
Thanks,
Tom.
The expression you give is too greedy and thus consumes more characters than you wanted. The following expression works on your sample data set:
perl -l -ne '/:([^|]*)\|.*:([^|]*)\|/ and print($1,"\t",$2)'
It anchors the search with explicit matches for something between a ":" and "|" pair. If your data doesn't match exactly, it should ignore the input line, but I have not tested this. I.e., this regex assumes exactly two entries between ":" and "|" will exist per line.
Try m/: ( [^:|]+ ) \| .+ : ( [^:|]+ ) \| /x instead.
A fix could be to use a greedy expression between the first string and the second one. With .* it goes to the end of the line and then backtracks, searching for the last colon whose field is followed by a pipe.
perl -l -ne '/:([^|]*).*:([^|]*)\|/ and print($1,"\t",$2)' <file>
Output:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
See it in action:
:([\w\-]*?)\|
Another method:
:(\S*?)\|
The way you've specified it, it has to match that way. You want a single colon
followed by any number of non-pipe, followed by any number of non-colon.
single colon -> :
non-pipe -> Q9VNB0
non-colon -> |intact
colon -> :
non-pipe -> EBI-102551 uniprotkb:A1ZBG6
Instead I make a space the end-of-contract, and require all my patterns to begin
with a colon, end with a pipe and consist of non-space/non-pipe characters.
perl -M5.010 -lne 'say join( "\t", m/[:]([^\s|]+)[|]/g )';
perl -nle'print "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Or with 5.10+:
perl -nE'say "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Explanation:
: Matches the start of the first "word".
([^|]*) Matches the desired part of the first "word".
\S* Matches the end of the first "word".
\s Matches the "word" separator.
[^:]*: Matches the start of the second "word".
([^|]*) Matches the desired part of the second "word".
This isn't the shortest answer (although it's close) because each part is quite independent of the others. This makes it more robust, less error-prone, and easier to maintain.
Why do you not want to use the split function? On the face of it this would be easily solved by writing
my @fields = map /:([^|]+)/, split;
I am not sure how your regex is supposed to work. Using the /x modifier to allow non-significant whitespace it looks like this
/ : ([^|]*)? [^:]* : ([^|]*) /x
which finds a colon and optionally captures as many non-pipe characters as possible. It then skips over as many non-colon characters as possible to the next colon, and then captures as many non-pipe characters as possible. Because all of your matches are greedy, any one of them is allowed to consume all of the rest of the string as long as the characters match the character class. Note that a ? that indicates an optional sequence will first of all match all that it can, and the option to skip the sequence will be taken only if the rest of the pattern cannot then be made to match.
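The behaviour described above can be reproduced with the asker's own pattern. The sketch below runs it through Python's re module instead of Perl, which is an assumption of convenience; for this pattern the two engines treat the greedy quantifiers the same way:

```python
import re

line = "uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768"

# The pattern from the question, unchanged.
m = re.search(r":([^|]*)?[^:]*:([^|]*)", line)

# Group 1 stops at the first pipe; [^:]* then swallows "|intact";
# group 2 starts after the NEXT colon and runs to the following pipe,
# which is why it picks up "EBI-102551 uniprotkb:A1ZBG6".
print(m.groups())
```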
It is hard to judge from your examples the precise criteria for a field, but this code should do the trick. It finds sequences of characters that are neither a colon nor a pipe, that are preceded by a colon and terminated by a pipe.
use strict;
use warnings;
while (<DATA>) {
    my @fields = /:([^:|]+)\|/g;
    print join("\t", @fields), "\n";
}
__DATA__
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
output
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
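The same "between a colon and a pipe" rule can be sketched with a global match in another language. The Python version below is only an illustration of the technique used in the Perl program above; the helper name fields is invented:

```python
import re

def fields(line: str) -> list:
    # A colon, then one or more characters that are neither ':' nor '|',
    # then a pipe. findall returns only the captured group for each hit.
    return re.findall(r":([^:|]+)\|", line)

data = [
    "uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768",
    "uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442",
    "uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444",
]
for line in data:
    print("\t".join(fields(line)))
```

The EBI identifiers are skipped because they are followed by a space or the end of the line rather than by a pipe.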

Perl regex substitute last occurrence

I have this input:
AB2.HYNN.KABCDSEG.L000.G0001V00
AB2.HYNN.GABCDSEG.L000.G0005V00
I would like to remove everything at the end of the string that has the form GXXXXVXX.
When I use this code:
$result =~ s/\.G.*V.*$//g;
print "$result \n";
The result is :
AB2.HYNN.KABCDSEG.L000
AB2.HYNN
It seems that each time the regex finds ".G" it replaces everything after it with nothing.
I don't understand.
I would like to have this:
AB2.HYNN.KABCDSEG.L000
AB2.HYNN.GABCDSEG.L000
How can I do this with a regex?
Update:
After talking in the comments, the final solution was:
s/\.G\w+V\w+$//;
In your regex:
s/\.G.*V.*$//g;
those .* are greedy and will match as much as possible. The only requirement you have is that there must be a V after the .G somewhere, so it will truncate the string from the first .G it finds, as long as it is followed by a V. There is no need for the /g modifier here, because any match that occurs will delete the rest of the string, unless you have newlines, because . does not match newlines without the /s modifier.
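The difference between the original pattern and the final solution can be checked on the two sample strings. The sketch below uses Python's re.sub in place of Perl's s///, as an illustration only; the function names are made up:

```python
import re

def original(s: str) -> str:
    # .* may cross the dots, so the match starts at the FIRST ".G"
    # that has a V anywhere after it.
    return re.sub(r"\.G.*V.*$", "", s)

def fixed(s: str) -> str:
    # \w+ cannot cross a dot, so only a final ".G<word>V<word>" matches.
    return re.sub(r"\.G\w+V\w+$", "", s)

for s in ("AB2.HYNN.KABCDSEG.L000.G0001V00", "AB2.HYNN.GABCDSEG.L000.G0005V00"):
    print(original(s), "|", fixed(s))
```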
$result =~ s/\.G\d+V\d+//g;
Works on given input.