Extract certain part of a string in Perl - regex

I have the following Perl strings. The lengths and the patterns are different. The file is always named *log.999
my $file1 = '/user/mike/desktop/sys/syslog.1';
my $file2 = '/user/mike/desktop/movie/dnslog.2';
my $file3 = '/haselog.3';
my $file4 = '/user/mike/desktop/movie/dns-sys.log';
I need to extract the words before log. In this case, sys, dns, hase and dns-sys.
How can I write a regular expression to extract them?

\w+(?=log\b)
matches one or more word characters (letters, digits or underscore) that are followed by log (but not logging etc.).
If the filename format is fixed, you can make the regex more reliable by anchoring it to the end of the string:
\w+(?=log\.\d+$)
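A minimal Perl sketch of the lookahead variant, run against paths from the question. The helper name stem_before_log is hypothetical, and the capture group is added here only so that the matched text can be returned.

```perl
use strict;
use warnings;

# Hypothetical helper: return the word characters that directly precede
# "log.N" at the end of the path, or undef when the path has another shape.
sub stem_before_log {
    my ($path) = @_;
    return $path =~ /(\w+)(?=log\.\d+$)/ ? $1 : undef;
}

for my $path ('/user/mike/desktop/sys/syslog.1', '/haselog.3') {
    print "$path => ", stem_before_log($path), "\n";
}
```

With this pattern '/user/mike/desktop/movie/dns-sys.log' yields undef, because it has no numeric suffix and the character before log is not a word character.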

The main property of shown strings is that the *log* phrase is last.
Then anchor the pattern, so we wouldn't match a log somewhere in the middle
my ($name) = $string =~ /(\w+)log\.[0-9]+$/;
or, if the .N extension is optional,
my ($name) = $string =~ /(\w+)log(?:\.[0-9]+)?$/;
The above uses the \w+ pattern to capture the text preceding log. But that text may also contain non-word characters (-, ., etc.), in which case we would use [^/]+ to capture everything after the last /, as pointed out in Abigail's answer. With .N optional, per the question in the comments:
my ($name) = $string =~ m{ ([^/]+) log (?: \.[0-9]+ )? $}x;
where I added the /x modifier, with which whitespace inside the pattern is ignored, which can aid readability.
I use a set of delimiters other than / to be able to use / inside without escaping it, and then the m is compulsory. The [^...] is a negated character class, matching any character not listed inside. So [^/]+log matches all successive characters which are not /, coming before log.
The non-capturing group (?: ... ) groups the patterns inside it, so that ? applies to the whole group, but doesn't needlessly capture them.
The (?:\.[0-9]+)? pattern was written specifically to disallow things like log. (nothing after the dot) and log5. But if these are acceptable, change it to the simpler \.?[0-9]*
Update Corrected a typo in code: for optional .N there is +, not *
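As a rough check, the anchored [^/]+ pattern with the optional .N suffix can be run over the four paths from the question. The helper name log_prefix is hypothetical; the pattern is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: capture the part of the last path component that
# precedes "log", where "log" may be followed by an optional ".N" suffix.
sub log_prefix {
    my ($path) = @_;
    my ($name) = $path =~ m{ ([^/]+) log (?: \.[0-9]+ )? $ }x;
    return $name;    # undef when the path does not match
}

for my $path (
    '/user/mike/desktop/sys/syslog.1',
    '/user/mike/desktop/movie/dnslog.2',
    '/haselog.3',
    '/user/mike/desktop/movie/dns-sys.log',
) {
    my $name = log_prefix($path);
    print "$path => ", (defined $name ? $name : '(no match)'), "\n";
}
```

For the fourth path this captures dns-sys. including the trailing dot, since [^/]+ is greedy and the dot sits directly before log; strip it afterwards if it is not wanted.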

Related

How to correctly build RegEx for multiline values in reg file

I would like to get values from a .reg file (REG EXPORT file) so I can compare them to another .reg file. I'm having trouble creating the RegEx for this.
Facts that make it harder for me:
I don't know what kind of registry key types are being used in the file (that's why I want to build a regex for all the different types like string, dword, qword, multistring,...)
I don't know if the last character in the file is a newline or not
I would like to only return the actual value, e.g. fa,ad,df,fa,ad,df,fa,ad if the regkey is "qword"=hex(b):fa,ad,df,fa,ad,df,fa,ad
$Text = @'
[HKEY_LOCAL_MACHINE\SOFTWARE\Test]
"String"="asfasdfasasfasdfasasfasdfasasfas"
"Binary"=hex:d3,45,34,53,45,34,53,45,34,53,45,34,53,45,34,53,45,34,5b,09,89,08,\
  34,09,8a,ef,02,30,40,9a,ad,fa,d0
"DWORD"=dword:fefefefe
"multistring"=hex(7):61,00,62,00,6c,00,61,00,73,00,66,00,62,00,00,00,62,00,61,\
  00,6c,00,73,00,66,00,62,00,61,00,73,00,64,00,66,00,00,00,62,00,61,00,6c,00,\
  73,00,64,00,66,00,61,00,64,00,6c,00,66,00,00,00,61,00,73,00,64,00,66,00,61,\
  00,73,00,64,00,66,00,00,00,61,00,73,00,64,00,66,00,00,00,61,00,73,00,64,00,\
  00,00,66,00,61,00,73,00,64,00,00,00,66,00,61,00,73,00,64,00,66,00,61,00,73,\
  00,66,00,61,00,73,00,64,00,66,00,00,00,61,00,73,00,64,00,66,00,61,00,73,00,\
  64,00,66,00,61,00,73,00,64,00,00,00,61,00,73,00,64,00,66,00,61,00,73,00,64,\
  00,66,00,00,00,00,00
"qword"=hex(b):fa,ad,df,fa,ad,df,fa,ad
'@
# this one works
$key = "multistring"
$regex = ('(?ms)\"{0}\"=hex\(7\):(.+)\n' -f [RegEx]::Escape($key))
[regex]::Matches($Text, $regex) | foreach { $_.Groups[1].Value }
# this one does not work because there is no newline after the last line...
$key2 = "qword"
$regex2 = ('(?ms)\"{0}\"=hex\(b\):(.+)\n' -f [RegEx]::Escape($key2))
[regex]::Matches($Text, $regex2) | foreach { $_.Groups[1].Value }
In your regex you use (?s) which is a modifier that will make the dot match any character including new lines. So .+ will match until the end of all lines.
You could use a capturing group to capture the part after the colon.
First match the part up to the colon using \"{0}\"=hex\(7\):
Then match what follows until the end of the line and use a negative lookahead to check if what follows is not a line that starts with a word between double quotes followed by an equals sign like "qword"=. As long as that is the case, match the whole string.
Your code could look like:
$regex = '\"{0}\"=hex\(7\):(.*(?:(?!\n"[^\n"]+"=)\n.*)*)' -f [regex]::Escape($key)
Explanation of the second part:
(                  Capturing group, which will hold your value
  .*               Match any character except a newline, 0+ times
  (?:              Non-capturing group
    (?!            Negative lookahead, asserting that what follows is not
      \n"[^\n"]+"=   a newline, a ", one or more characters other than a newline or ", then "=
    )              Close the negative lookahead
    \n.*           Match a newline followed by any character except a newline, 0+ times
  )*               Close the non-capturing group and repeat it 0+ times
)                  Close the capturing group
Example Pattern
\"multistring\"=hex\(7\):(.*(?:(?!\n"[^\n"]+"=)\n.*)*)
.+ is a greedy expression, and the modifier (?s) makes the . match all characters (including newlines), so (.+)\n will match everything up to the last newline.
Try something like this:
$regex = '"{0}"=hex\(b\):(.+(?:\n  .+)*)' -f [regex]::Escape($key2)
You need neither (?m) nor (?s) here, because you don't want . to include newlines, and you don't want to match beginnings or ends of lines inside the multiline string. .+(?:\n .+)* matches the rest of the line after the prefix hex(b): and all subsequent lines beginning with two consecutive spaces. The (?:...) is just a non-capturing group, since there's no need to capture each line in a separate group.
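The continuation-line idea is not specific to PowerShell. Below is a minimal sketch of it in Perl rather than PowerShell, assuming continuation lines are indented with two spaces as described above. The helper name reg_hex_value and the shortened sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the raw value of a "key"=hex(type): entry,
# including any continuation lines that start with two spaces.
sub reg_hex_value {
    my ($text, $key, $type) = @_;
    my ($value) = $text =~ /"\Q$key\E"=hex\(\Q$type\E\):(.+(?:\n  .+)*)/;
    return $value;    # undef when the key is not present
}

# Shortened, made-up sample; note that there is no trailing newline.
my $text = join "\n",
    '[HKEY_LOCAL_MACHINE\SOFTWARE\Test]',
    '"multistring"=hex(7):61,00,62,00,\\',
    '  6c,00,00,00',
    '"qword"=hex(b):fa,ad,df,fa,ad,df,fa,ad';

print reg_hex_value($text, 'qword', 'b'), "\n";
print reg_hex_value($text, 'multistring', '7'), "\n";
```

Because the pattern never requires a newline after the value, the last entry is found whether or not the file ends with one.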

Regex to remove string after file extension

I'm using PowerShell to query for a service path from which results should resemble C:\directory\sub-directory\service.exe
Some results however also include characters after the .exe file extension, for example output may resemble one of the following:
C:\directory\sub-directory\service.exe ThisTextNeedsRemoving
C:\directory\sub-directory\service.exe -ThisTextNeedsRemoving
C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving
i.e. ThisTextNeedsRemoving may be preceded by a space, hyphen or forward slash.
I can use the regex -replace '($*.exe).*' to remove everything after, but including the .exe file extension, but how do I keep the .exe in the results?
You can use a look-around:
$txt = 'C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving'
$txt -replace '(?<=\.exe).+', ''
This uses a look-behind which is a zero-width match so it doesn't get replaced.
Using a lookbehind is possible, but note that lookbehinds are only necessary when you need to specify a rather complex condition or to obtain overlapping matches. In most cases, when you can do without a lookbehind, you should consider a non-lookbehind solution, because a lookbehind is a rather costly operation. It is cheaper to check once whether the current character is not whitespace than to also check whether each of these characters is preceded by something else, be it a single character, a whole substring, or a more complex pattern.
Thus, I'd suggest using a solution based on capturing mechanism, with a backreference in the replacement part to restore the captured substring in the result:
$s -replace '^(\S+\.exe) .*','$1'
or - for paths containing spaces and not inside double quotes:
$s -replace '^(.*?\.exe) .*','$1'
Explanation:
^ - start of string
(\S+\.exe) - one or more characters other than whitespace (\S+) (or, with .*?, any characters other than a newline, as few as possible), followed by a literal . and exe
 .* - a literal space and then any number of characters other than a newline.
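The capture-and-restore substitution works the same way outside PowerShell. A minimal sketch in Perl rather than PowerShell, using the lazy variant so that directories containing spaces survive; the helper name strip_after_exe and the 'Program Files' path are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: keep everything up to and including the first ".exe"
# that is followed by a space, and drop the rest of the line.
sub strip_after_exe {
    my ($path) = @_;
    (my $clean = $path) =~ s/^(.*?\.exe) .*/$1/;
    return $clean;    # unchanged when nothing follows ".exe"
}

print strip_after_exe('C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving'), "\n";
print strip_after_exe('C:\Program Files\svc\service.exe -ThisTextNeedsRemoving'), "\n";
```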

Replace specific capture group instead of entire regex in Perl

I've got a regular expression with capture groups that matches what I want in a broader context. I then take capture group $1 and use it for my needs. That's easy.
But how to use capture groups with s/// when I just want to replace the content of $1, not the entire regex, with my replacement?
For instance, if I do:
$str =~ s/prefix (something) suffix/42/
prefix and suffix are removed. Instead, I would like something to be replaced by 42, while keeping prefix and suffix intact.
As I understand, you can use look-ahead or look-behind that don't consume characters. Or save data in groups and only remove what you are looking for. Examples:
With look-ahead:
s/your_text(?=ahead_text)//;
Grouping data:
s/(your_text)(ahead_text)/$2/;
If you only need to replace one capture then using @LAST_MATCH_START and @LAST_MATCH_END (with use English; see perldoc perlvar) together with substr might be a viable choice:
use English qw(-no_match_vars);
$your_string =~ m/aaa (bbb) ccc/;
substr $your_string, $LAST_MATCH_START[1], $LAST_MATCH_END[1] - $LAST_MATCH_START[1], "new content";
# replaces "bbb" with "new content"
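The same idea can be sketched without use English, since @LAST_MATCH_START and @LAST_MATCH_END are the long names of the built-in arrays @- and @+. The helper name replace_group is hypothetical, and the sketch assumes that capture group 1 takes part in the match.

```perl
use strict;
use warnings;

# Hypothetical helper: replace only what capture group 1 matched,
# leaving the rest of the string untouched.
sub replace_group {
    my ($string, $regex, $new) = @_;
    if ($string =~ $regex) {
        # $-[1] and $+[1] are the start and end offsets of group 1
        substr($string, $-[1], $+[1] - $-[1], $new);
    }
    return $string;
}

print replace_group('aaa bbb ccc', qr/aaa (bbb) ccc/, 'new content'), "\n";
```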
This is an old question but I found the below easier for replacing lines that start with >something to >something_else. Good for changing the headers for fasta sequences
while ($filelines =~ />(.*)\s/g) {
    unless ($1 =~ /else/i) {
        # \Q...\E quotes regex metacharacters (such as | or .) in the header
        $filelines =~ s/(\Q$1\E)/$1\_else/;
    }
}
I use something like this:
s/(?<=prefix)(group)(?=suffix)/$1 =~ s|text|rep|gr/e;
Example:
In the following text I want to normalize the whitespace, but only after ::=
some   text ::= a   b  c    d   e ;
Which can be achieved with:
s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e
Results with:
some   text ::= a b c d e ;
Explanation:
(?<=::=): Look-behind assertion to match ::=
(.*): Everything after ::=
$1 =~ s|\s+| |gr: With the captured group normalize whitespace. Note the r modifier which makes sure not to attempt to modify $1 which is read-only. Use a different sub delimiter (|) to not terminate the replacement expression.
/e: Treat the replacement text as a perl expression.
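A small runnable sketch of this nested substitution; the helper name normalize_after_marker and the sample line are made up for illustration. The /r modifier needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of whitespace, but only in the part
# of the line that follows "::=".
sub normalize_after_marker {
    my ($line) = @_;
    $line =~ s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e;
    return $line;
}

print normalize_after_marker('some   text ::=  a   b    c ;'), "\n";
```

The spacing before ::= is left alone, because the look-behind keeps that part of the line out of the match.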
Use lookaround assertions. Quoting the documentation:
Lookaround assertions are zero-width patterns which match a specific pattern without including it in $&. Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Lookbehind matches text up to the current match position, lookahead matches text following the current match position.
If the beginning of the string has a fixed length, you can thus do:
s/(?<=prefix)(your capture)(?=suffix)/$1/
However, ?<= does not work for variable length patterns (starting from Perl 5.30, it accepts variable length patterns whose length is smaller than 255 characters, which enables the use of |, but still prevents the use of *). The work-around is to use \K instead of (?<=):
s/.*prefix\K(your capture)(?=suffix)/$1/
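A minimal sketch contrasting the two forms on the example from the question. The helper names are hypothetical, and the pre\w* prefix is an assumption chosen only to show a prefix of varying length. \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Fixed-length prefix: a look-behind is enough.
sub replace_fixed_prefix {
    my ($s) = @_;
    $s =~ s/(?<=prefix )something(?= suffix)/42/;
    return $s;
}

# Variable-length prefix: \K drops everything matched so far from the
# text that gets replaced.
sub replace_variable_prefix {
    my ($s) = @_;
    $s =~ s/pre\w* \Ksomething(?= suffix)/42/;
    return $s;
}

print replace_fixed_prefix('prefix something suffix'), "\n";
print replace_variable_prefix('preamble something suffix'), "\n";
```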

Finding absence of words in a regular expression

I've seen examples of finding the absence of characters in a regular expression, I'm trying to find the absence of words in a regular expression (likely using a negative lookbehind).
I have lines of code like this:
Example One:
protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";
And here's another one:
mainWindow.Id = "MainWindow";
Final one:
mainStoLabel.Text = "#stb_entry_clah";
I want to capture only the middle one by finding all strings like these that a.) aren't preceded by a "#" in the actual string between the quotes, and b.) aren't preceded at all by the word "readonly".
My current Regular Expression is this:
.*\W\=\W"[^#].*"
It captures the top two examples. Now I just want to exclude the top example. How do I test for the absence of whole words (not characters)?
Thanks.
The bug in your negation lookahead assertion is that you didn’t put it together right to suit the general case. You need to make its assertion apply to every character position as you crawl ahead. It only applies to one possible dot the way you’ve written it, whereas you need it to apply to all of them. See below for how you must do this to do it correctly.
Here is a working demo that shows two different approaches:
The first uses a negative lookahead to ensure that the left-hand portion not contain readonly and the right-hand portion not start with a number sign.
The second does a simpler parser, then separately inspects the left- and right-hand sides for the individual constraints that apply to each.
The demo language is Perl, but the same patterns and logic should work virtually everywhere.
#!/usr/bin/perl
while (<DATA>) {
    chomp;
    #
    # First demo: use a complicated regex to get desired part only
    #
    my($label) = m{
        ^                           # start at the beginning
        (?:                         # noncapture group:
            (?! \b readonly \b )    #   no "readonly" here
            .                       #   now advance one character
        ) +                         #   repeated 1 or more times
        \s* = \s*                   # skip an equals sign w/optional spaces
        " ( [^#"] [^"]* ) "         # capture #1: quote-delimited text
                                    #   BUT whose first char isn't a "#"
    }x;
    if (defined $label) {
        print "Demo One: found label <$label> at line $.\n";
    }
    #
    # Second demo: This time use simpler patterns, several
    #
    my($lhs, $rhs) = m{
        ^                           # from the start of line
        ( [^=]+ )                   # capture #1: 1 or more non-equals chars
        \s* = \s*                   # skip an equals sign w/optional spaces
        " ( [^"]+ ) "               # capture #2: all quote-delimited text
    }x;
    unless ($lhs =~ /\b readonly \b/x || $rhs =~ /^#/) {
        print "Demo Two: found label <$rhs> at line $.\n";
    }
}
__END__
protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";
mainWindow.Id = "MainWindow";
mainStoLabel.Text = "#stb_entry_clah";
I have two bits of advice. The first is to make very sure you ALWAYS use /x mode so you can produce documented and maintainable regexes. The second is that it is much cleaner doing things a bit at a time as in the second solution rather than all at once as in the first.
I don't completely understand your question, but a negative lookahead would look like this:
^(?!.*readonly)(?:.*\s\=\s"[^#].*")
The first part matches only if the word "readonly" does not occur anywhere in the string. The ^ anchor matters: without it, the engine can retry the lookahead at a position after "readonly" and still find a match.
Which language are you using?
What do you want to match? Only the second example, did I understand that correctly?
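A rough Perl sketch of the single-lookahead approach, tried on the three lines from the question. The ^ anchor, the word boundaries around readonly and the capture group are assumptions added here so that the lookahead is tested once at the start of the line and the quoted text can be returned. The helper name quoted_value is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the quoted value of an assignment unless the
# line contains the word "readonly" or the value starts with "#".
sub quoted_value {
    my ($line) = @_;
    return $line =~ /^(?!.*\breadonly\b).*\s=\s"([^#"][^"]*)"/ ? $1 : undef;
}

my @lines = (
    'protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";',
    'mainWindow.Id = "MainWindow";',
    'mainStoLabel.Text = "#stb_entry_clah";',
);

for my $line (@lines) {
    my $value = quoted_value($line);
    print "matched: $value\n" if defined $value;
}
```

Only the second line produces a value.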
^[^"=]*(?<!(^|\s)readonly\s.*)\s*=\s*"[^#].*" seems to fit your needs:
everything before the first equal sign should not contain readonly or quotes
readonly is recognized not with word boundaries but with whitespace (except at beginning of line)
the equal sign can be surrounded by arbitrary whitespace
the equal sign must be followed by a quoted string
the quoted string should not start with #
You can work with lookarounds or capture groups if you only want the strings or quoted strings.
Note: as per your own regex, this discards anything after the last quote (not matching the semi-colon in your examples)
You absolutely need to specify the language. The negative lookahead/lookbehind is the thing you need.
Look at this site for an inventory of how to do that in Delphi, GNU (Linux), Groovy, Java, JavaScript, .NET, PCRE (C/C++), Perl, PHP, POSIX, PowerShell, Python, R, REALbasic, Ruby, Tcl, VBScript, Visual Basic 6, wxWidgets, XML Schema, XQuery & XPath

extract word with regular expression

I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I would like to extract the words temperatoA and CelcieusB.
I have this regular expression (\d+/(\w+),?)*! but I only get the match 1/temperatoA,2/CelcieusB!
Why?
Your whole match evaluates to '1/temperatoA,2/CelcieusB' because that matches the following expression:
qr{ (          # begin group
      \d+      # at least one digit
      /        # followed by a slash
      (\w+)    # followed by at least one word character
      ,?       # maybe a comma
    )*         # ANY number of repetitions of this pattern.
}x;
'1/temperatoA,' fulfills capture #1 first, but since you are asking the engine to capture as many of those as it can it goes back and finds that the pattern is repeated in '2/CelcieusB' (the comma not being necessary). So the whole match is what you said it is, but what you probably weren't expecting is that '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB'.
Anytime you want to capture anything that fits a certain pattern in a certain string it is always best to use the global flag and assign the captures into an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.
When I do this:
use Data::Dumper;

my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
if ( my @matches = $str =~ /$regex/g ) {
    print Dumper( \@matches );
}
I get this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB',
'23/33',
'33',
'55/66',
'66'
];
Now, I figure that's probably not what you expected. But '3' and '6' are word characters, and so--coming after a slash--they comply with the expression.
So, if this is an issue, you can change your regex to the equivalent: qr{(\d+/(\p{Alpha}\w*))}, specifying that the first character must be an alpha followed by any number of word characters. Then the dump looks like this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB'
];
And if you only want 'temperatoA' or 'CelcieusB', then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.
However, the secret to capturing more than one chunk in a capture expression is to assign the match to an array, you can then sort through the array to see if it contains the data you want.
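Put together, a minimal sketch of the list-context match with the narrower pattern; the helper name words_after_slash is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: with /g in list context, the match returns every
# capture from every match, so all words end up in one array.
sub words_after_slash {
    my ($str) = @_;
    my @words = $str =~ m{\d+/(\p{Alpha}\w*)}g;
    return @words;
}

my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
print join(', ', words_after_slash($str)), "\n";    # temperatoA, CelcieusB
```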
The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?
The expression you want is simply as follows:
(\w+)
With a Perl-compatible regex engine you can search for
(?<=\d/)\w+(?=.*!)
(?<=\d/) asserts that there is a digit and a slash before the start of the match
\w+ matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+ instead.
(?=.*!) asserts that there is a ! ahead in the string, i.e. the regex will fail once we have passed the !.
Depending on the language you're using, you might need to escape some of the characters in the regex.
E.g., for use in C (with the PCRE library), you need to escape the backslashes:
myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);
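In Perl the pattern needs no extra escaping as long as a delimiter other than / is used. A minimal sketch, with the hypothetical helper name words_before_bang:

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern has no capture groups, so a /g match in
# list context returns each whole match.
sub words_before_bang {
    my ($str) = @_;
    return $str =~ m{(?<=\d/)\w+(?=.*!)}g;
}

my $input = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
print join(', ', words_before_bang($input)), "\n";    # temperatoA, CelcieusB
```

The numbers after the ! are skipped because the lookahead can no longer find a ! ahead of them.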
Will this work?
/([[:alpha:]]\w+)\b(?=.*!)
I made the following assumptions...
A word begins with an alphabetic character.
A word always immediately follows a slash. No intervening spaces, no words in the middle.
Words after the exclamation point are ignored.
You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.
[[:alpha:]] matches any alphabetic character.
The \b matches a word boundary.
And the (?=.*!) came from Tim Pietzcker's post.