Regex to remove string after file extension - regex

I'm using PowerShell to query for a service path from which results should resemble C:\directory\sub-directory\service.exe
Some results however also include characters after the .exe file extension, for example output may resemble one of the following:
C:\directory\sub-directory\service.exe ThisTextNeedsRemoving
C:\directory\sub-directory\service.exe -ThisTextNeedsRemoving
C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving
i.e. ThisTextNeedsRemoving may be preceded by a space, hyphen or forward slash.
I can use the regex -replace '($*.exe).*' to remove everything from the .exe file extension onward, extension included, but how do I keep the .exe in the results?

You can use a look-around:
$txt = 'C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving'
$txt -replace '(?<=\.exe).+', ''
This uses a look-behind which is a zero-width match so it doesn't get replaced.
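For readers outside PowerShell, the same look-behind idea can be sketched in Python. This is only an illustration: the helper name strip_after_exe is hypothetical, and Python's re module accepts only fixed-width look-behinds, which is enough here because \.exe has a fixed length.

```python
import re

def strip_after_exe(path: str) -> str:
    """Drop everything that follows the first '.exe' (hypothetical helper)."""
    # (?<=\.exe) is zero-width, so '.exe' itself is never part of the match
    # and therefore survives the replacement.
    return re.sub(r"(?<=\.exe).+", "", path)

print(strip_after_exe(r"C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving"))
# prints: C:\directory\sub-directory\service.exe
```

A path with nothing after .exe is returned unchanged, because .+ needs at least one character to match.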

Using a lookbehind is possible, but note that lookbehinds are only necessary when you need to specify a rather complex condition or to obtain overlapping matches. In most cases, when you can do without a lookbehind, you should consider a non-lookbehind solution, because a lookbehind is a rather costly operation. It is cheaper to check once whether the current character is not whitespace than to also check, for each such character, whether it is preceded by some other character, a whole substring, or a more complex pattern.
Thus, I'd suggest a solution based on a capturing group, with a backreference in the replacement part to restore the captured substring in the result:
$s -replace '^(\S+\.exe) .*','$1'
or - for paths containing spaces and not inside double quotes:
$s -replace '^(.*?\.exe) .*','$1'
Explanation:
^ - start of string
(\S+\.exe) - one or more characters other than whitespace (\S+), or, in the second pattern, any characters other than a newline, as few as possible (.*?), followed by a literal . and exe
.* preceded by a space - a literal space and then any number of characters other than a newline.
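The capture-and-restore approach can be sketched the same way in Python, where the backreference is written \1 rather than $1. The function name keep_exe_path and the sample paths are assumptions made for illustration.

```python
import re

def keep_exe_path(command_line: str) -> str:
    """Keep the path up to the first '.exe' that is followed by a space."""
    # The lazy .*? stops at the first '.exe' followed by a space;
    # \1 puts the captured path back and the rest of the line is dropped.
    return re.sub(r"^(.*?\.exe) .*", r"\1", command_line)

print(keep_exe_path(r"C:\Program Files\app\service.exe -flag"))
# prints: C:\Program Files\app\service.exe
```

If nothing follows the .exe, the pattern does not match and the input comes back untouched.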

Related

Eliminate whitespace around single letters

I frequently receive PDFs that contain (when converted with pdftotext) whitespaces between the letters of some arbitrary words:
This i s a n example t e x t that c o n t a i n s strange spaces.
For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:
This isan example text that contains strange spaces.
I tried to achieve this with a simple perl regex:
s/ (\w) (\w) / $1$2 /g
Which of course does not work, as after the first and second standalone letters have been moved together, the second one no longer is a standalone, so the space to the third will not match:
This is a n example te x t that co n ta i ns strange spaces.
So I tried lookahead assertions, but failed to achieve anything (also because I did not find any example that uses them in a substitution).
As usual with PRE, my feeling is that there must be a very simple and elegant solution for this...
Just match a continuous series of single letters separated by spaces, then delete all spaces from that using a nested substitution (the /e eval modifier).
s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
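As a rough Python counterpart of this nested substitution, re.sub accepts a function as the replacement, which plays the role of Perl's /e modifier. The function name join_spaced_letters is hypothetical.

```python
import re

def join_spaced_letters(text: str) -> str:
    """Collapse runs of single word characters separated by whitespace."""
    # Outer pattern: a run such as "t e x t" bounded by word boundaries.
    # Inner step: remove the spaces inside that run only.
    return re.sub(
        r"\b(?:\w\s)+\w\b",
        lambda m: m.group().replace(" ", ""),
        text,
    )

sample = "This i s a n example t e x t that c o n t a i n s strange spaces."
print(join_spaced_letters(sample))
# prints: This isan example text that contains strange spaces.
```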
Excess whitespace can be removed with a regex, but Perl by itself cannot know what is correct English. With that caveat, this seems to work:
$ perl -pe's/(?<!\S)(\S) (?=\S )/$1/g' spaces.txt
This isan example text that contains strange spaces.
Note that the regex cannot tell whether i s a n was one four-letter word or several shorter words; sorting that out requires human correction or some language module.
Explanation:
(?<!\S) - a negative look-behind assertion checks that the preceding character is not a non-whitespace character, i.e. it is whitespace or the start of the string.
(\S) - next must come a non-whitespace character, which we capture with parentheses, followed by a space, which we will remove (or not put back, as it were).
(?=\S ) - next we check with a look-ahead assertion that what follows is a non-whitespace character followed by a space. We do not change the string there.
Then we put back the character we captured, with $1.
It might be more correct to use [^ ] instead of \S. Since you only seem to have a problem with spaces being inserted, there is no need to match tabs, newlines or other whitespace. Feel free to do that change if you feel it is appropriate.
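The look-around version carries over to Python almost unchanged, since both assertions are fixed-width. The sketch below uses a hypothetical function name. It also shows one limitation of the pattern: the look-ahead demands a space after the next character, so a spaced-out word at the very end of the input keeps its last gap.

```python
import re

def squeeze_single_letters(text: str) -> str:
    """Remove the space after a standalone character when another one follows."""
    # (?<!\S)  nothing but whitespace (or the string start) before the character
    # (\S)     the standalone character, captured and restored via \1
    # (?=\S )  the next token is again one character followed by a space
    return re.sub(r"(?<!\S)(\S) (?=\S )", r"\1", text)

sample = "This i s a n example t e x t that c o n t a i n s strange spaces."
print(squeeze_single_letters(sample))
# prints: This isan example text that contains strange spaces.
```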

Remove everything up to and including triple newline

I am very new to Powershell, so I am no doubt doing something really stupid that causes my attempts to get this to work to not actually work... but after an hour of struggling, I'd love a hand.
I have a file for which a triple newline (two empty lines) marks a boundary. I want only everything that comes after the boundary.
My latest fruitless attempt looks like this:
$content = Get-Content -Raw $Path
$content = $content -Replace '^.+`r`n`r`n`r`n', ''
All my attempts to even match a single new line have failed. The -Raw parameter is because I came to understand this would change the way newlines were processed, but it didn't change anything.
I am also aware the regex isn't ideal; I'd want to make it non-greedy but I want to get a super-basic test case working first given my unfamiliarity with whatever flavor of regular expressions Powershell supports. (I assume I can just stick a ? after the + to fix that, but first things first.)
The goal is to go from
useless metadata I don't care about
more useless metadata
actual content
to this:
actual content
What am I doing wrong?
In PowerShell, the single-quoted '`r`n' is a literal 4-character string (backtick, r, backtick, n), while the double-quoted "`r`n" is a 2-character CR LF line break. Because your pattern is single-quoted, it cannot match any line breaks. It is safer to use \r to match CR and \n to match LF in PowerShell regex patterns.
Also note that there are several lines between the start of the string and your delimiter, but . does not match a newline by default; you need the (?s) inline modifier to make . match newlines, too.
Use
$content -replace '(?s)^.*?(?:\r?\n){3}'
Details
(?s) - a Singleline option that makes . match newlines, too
^ - start of the string
.*? - any 0+ chars, as few as possible
(?:\r?\n){3} - triple CRLF/LF line break.
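The same pattern can be tried outside PowerShell. The Python sketch below is an illustration only: the function name and sample text are made up, and re.DOTALL stands in for the (?s) inline modifier.

```python
import re

def after_triple_newline(content: str) -> str:
    """Return what follows the first run of three line breaks."""
    # re.DOTALL lets . cross line breaks; the lazy .*? stops at the
    # first place where three CRLF or LF breaks occur in a row.
    return re.sub(r"^.*?(?:\r?\n){3}", "", content, flags=re.DOTALL)

raw = "useless metadata\r\nmore useless metadata\r\n\r\n\r\nactual content"
print(after_triple_newline(raw))
# prints: actual content
```

Without re.MULTILINE, ^ matches only at the start of the string, so at most one replacement takes place.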

Match first executable path in list in PowerShell

I was looking for a way to put quotes in Windows Services paths that don't have them because they have spaces. See https://regex101.com/r/N6cbk8/2 for the regex I found which highlights the strings I want to put in quotes, so for example:
D:\SerPatHL7Server\RunAsService.exe runasservice d:\SerPatHL7Server\SerPatHL7server.exe
needs to become
"D:\SerPatHL7Server\RunAsService.exe" runasservice d:\SerPatHL7Server\SerPatHL7server.exe
The other services also need quotes, like so:
E:\Program Files (x86)\Endobase\ebserver.exe
to become
"E:\Program Files (x86)\Endobase\ebserver.exe"
But PowerShell won't accept the \K in | Select-String -Pattern "(^.*?)\K\.exe" and throws the error: "The string (^.*?)\K\.exe is not a valid regular expression: parsing "(^.*?)\K\.exe" - Unrecognized escape sequence \K."
I couldn't find an alternative with my limited knowledge of Regex expressions.
See the above link for a full list of examples. Is there a way to achieve my goal?
The \K is a match reset operator used in PCRE, Boost, the Python PyPI regex module and Onigmo. You do not need this operator in .NET because it supports infinite-width lookbehinds (and \K is actually a kind of workaround for the lack of them).
You just need to match any 0+ chars as few as possible up to the first .exe that is followed with whitespace or end of string and replace with " + match + ".
Use
-replace "^.*?\.exe(?!\S)", '"$&"'
Details
^ - start of the string
.*? - any 0+ chars other than newline as few as possible
\. - a literal .
exe - a literal exe substring
(?!\S) - there should be whitespace or end of string immediately to the right of the current location.
"$&" - the replacement pattern where $& stands for the whole match.

Extract certain part of a string in Perl

I have the following Perl strings. The lengths and the patterns are different. The file is always named *log.999
my $file1 = '/user/mike/desktop/sys/syslog.1';
my $file2 = '/user/mike/desktop/movie/dnslog.2';
my $file3 = '/haselog.3';
my $file4 = '/user/mike/desktop/movie/dns-sys.log';
I need to extract the words before log. In this case, sys, dns, hase and dns-sys.
How can I write a regular expression to extract them?
\w+(?=log\b)
matches one or more alphanumeric characters that are followed by log (but not logging etc.)
If the filename format is fixed, you can make the regex more reliable by using
\w+(?=log\.\d+$)
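To see how the look-ahead behaves on the sample paths, here is a small Python check. The function name is hypothetical, and the stricter variant is written here with the look-ahead anchored at the end of the string, so it needs a numeric extension after log.

```python
import re

def name_before_log(path):
    """Return the word characters directly before 'log.<digits>' at the end."""
    # The look-ahead only inspects 'log.N'; it is not part of the match.
    m = re.search(r"\w+(?=log\.\d+$)", path)
    return m.group() if m else None

for p in ("/user/mike/desktop/sys/syslog.1",
          "/user/mike/desktop/movie/dnslog.2",
          "/haselog.3"):
    print(name_before_log(p))
# prints sys, dns and hase on separate lines
```

The fourth sample, dns-sys.log, has no numeric extension after log, so this variant returns None for it.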
The main property of shown strings is that the *log* phrase is last.
Then anchor the pattern, so we wouldn't match a log somewhere in the middle
my ($name) = $string =~ /(\w+)log\.[0-9]+$/;
or, if the .N extension is optional:
my ($name) = $string =~ /(\w+)log(?:\.[0-9]+)?$/;
The above uses the \w+ pattern to capture the text preceding log. But that text may also contain non-word characters (-, ., etc.), in which case we would use [^/]+ to capture everything after the last /, as pointed out in Abigail's answer. With .N optional, per a question in the comments:
my ($name) = $string =~ m{ ([^/]+) log (?: \.[0-9]+ )? $}x;
where I added the x modifier (after the closing }), which makes spaces inside the pattern insignificant and can aid readability.
I use a set of delimiters other than / to be able to use / inside without escaping it, and then the m is compulsory. The [^...] is a negated character class, matching any character not listed inside. So [^/]+log matches all successive characters which are not /, coming before log.
The non capturing group (?: ... ) groups patterns inside, so that ? applies to the whole group, but doesn't needlessly capture them.
The (?:\.[0-9]+)? pattern was written specifically so to disallow things like log. (nothing after dot) and log5. But if these are acceptable, change it to the simpler \.?[0-9]*
Update Corrected a typo in code: for optional .N there is +, not *
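A quick way to check the anchored pattern against all four sample paths is a Python sketch like the one below. LOG_NAME and log_prefix are illustrative names, not part of the answer.

```python
import re

# [^/]+ takes everything after the last slash, up to the final 'log';
# the optional group allows a '.N' extension; $ pins it to the end.
LOG_NAME = re.compile(r"([^/]+)log(?:\.[0-9]+)?$")

def log_prefix(path):
    m = LOG_NAME.search(path)
    return m.group(1) if m else None

print(log_prefix("/user/mike/desktop/sys/syslog.1"))       # prints: sys
print(log_prefix("/haselog.3"))                            # prints: hase
print(log_prefix("/user/mike/desktop/movie/dns-sys.log"))  # prints: dns-sys.
```

Note the trailing dot in the last result: in dns-sys.log the dot sits before log, so [^/]+ captures it and it has to be stripped separately if only dns-sys is wanted.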

What REGEX pattern should I use to look for a specific string pattern and remove anything else that doesnt match?

I'm parsing through code using a Perl-REGEX parsing engine in my IDE and I want to grab any variables that look like
$hash->{ hash_key04}
and nuke the rest of the code..
So far my very basic regex doesn't do what I expected:
(.*)(\$hash\-\>\{[\w\s]+\})(.*)
(
\$
hash
\-\>
\{
[\w\s]+
\}
)
I know to use the replacement groups for this ($1, $2, etc.), but matching (.*) before and after the target string doesn't seem to capture all the rest of the code!
UPDATED:
I tried matching everything except a null character, but of course that's too greedy.
([^\0]*)
What expression in regex should i use to look only for the string pattern and remove the rest?
The problem is I want to be left with the list of $hash->{} strings after the replace runs in the IDE.
This is better approached from the other direction. Instead of trying to delete everything you don't want, what about extracting everything you do want?
my @vars = $src_text =~ /(\$hash->\{[\w\s]+\})/g;
Breaking down the regex:
/( # start of capture group
\$hash-> # prefix string with $ escaped
\{ # opening escaped delimiter
[\w\s]+ # any word characters or space
\} # closing escaped delimiter
)/gx; # match repeatedly, returning a list of captures (the x flag permits these comments)
Here is another way that might fit within your IDE better:
s/(\$hash->\{[\w\s]+\})|./$1/gs;
This regex tries to match one of your hash variables at each location, and if it fails, it deletes the next character and then tries again, which after running over the whole file will have deleted everything you don't want.
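Both ideas, extracting the matches and deleting everything else, can be sketched in Python as well. The sample source text and variable names below are invented for the illustration.

```python
import re

HASH_VAR = r"\$hash->\{[\w\s]+\}"

src = "my $x = $hash->{ hash_key04};\nprint $hash->{foo};\n"

# Approach 1: pull out every occurrence directly.
found = re.findall(HASH_VAR, src)
print(found)
# prints: ['$hash->{ hash_key04}', '$hash->{foo}']

# Approach 2: keep group 1 when it matched, delete any other character.
# re.DOTALL makes the fallback . consume newlines too.
kept = re.sub(rf"({HASH_VAR})|.", lambda m: m.group(1) or "", src, flags=re.DOTALL)
print(kept)
# prints: $hash->{ hash_key04}$hash->{foo}
```

The second approach leaves the matches glued together; to get one per line, the replacement function could append a newline to each kept match.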
It depends on your language. What you want is group 2 (the second set of characters in parentheses). In Perl that would be $2, in Vim it would be \2, etc.
It depends on the platform, but generally, replace the pattern with an empty string.
In javascript,
// prints "the la in ing"
console.log('the latest in testing'.replace(/test/g, ''));
In bash
$ echo 'the latest in testing' | sed 's/test//g'
the la in ing
In C#
Console.WriteLine(Regex.Replace("the latest in testing", "test", ""));
etc
By default the wildcard . won't match newlines. You can enable that with a flag, depending on which regex flavor you're using and under what language/API. Or you can match them explicitly with a character class; note that . is a literal dot inside brackets, so use a class such as:
[\s\S]* <- matches any character, including newline and carriage return.
Combine this with capture groups to grab desired variables from your code and skip over lines which contain no capture group.
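The newline behaviour of . is easy to verify. The Python sketch below, with an invented sample string, compares the default, the flag-based and the character-class approaches. It also shows that a dot inside brackets is just a literal dot.

```python
import re

text = "start\nmiddle\nend"

# Default: . stops at the newline, so nothing spans the three lines.
default = re.search(r"start.*end", text)

# Flag-based: re.DOTALL lets . match newlines as well.
dotall = re.search(r"start.*end", text, flags=re.DOTALL)

# Class-based: [\s\S] matches any character without needing a flag.
any_char = re.search(r"start[\s\S]*end", text)

# Inside brackets . is literal: this class allows only dots, LF and CR.
literal_dot = re.fullmatch(r"[.\n\r]*", text)

print(default is None, dotall is not None, any_char is not None, literal_dot is None)
# prints: True True True True
```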
If you want help constructing the proper regex for your context you'll need to paste some input text and specify what the output should be.
I think you want to add a ^ to the beginning of the regex, s/^.*(PATTERN)(.*)$/$1/, so that it starts at the beginning of the line and goes to the end, removing anything except that pattern.