Perl : Decoding Regex - regex

I would highly appreciate it if somebody could help me understand the following.
=~/(?<![\w.])($val)(?![\w.])/gi)
This is what I picked up, but I don't understand it.
Lookaround: (?=a) for a lookahead, ?! for negative lookahead, or ?<= and ?<! for lookbehinds (positive and negative, respectively).

The regex seems to search for $val (i.e. string that matches the contents of the variable $val) not surrounded by word characters or dots.
Putting $val into parentheses remembers the corresponding matched part in $1.
See perlre for details.
Note that =~ is not part of the regex, it's the "binding operator".
Similarly, gi) is part of something bigger. g means the matching happens globally, which has different effects based on the context the matching occurs in, and i makes the match case insensitive (which could only influence $val here). The whole expression was in parentheses, probably, but we can't see the opening one.

Read (?<!PAT) as "not immediately preceded by text matching PAT".
Read (?!PAT) as "not immediately followed by text matching PAT".
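To see that reading in action, here is a minimal sketch in Python, whose re module uses the same lookaround syntax as Perl. The sample text and the value of val are made up for illustration. Note that re.escape is an addition: the Perl code interpolates $val as a pattern without quoting it.

```python
import re

# Hypothetical stand-in for Perl's $val. re.escape treats it as literal text,
# whereas the Perl original interpolates $val as a regex.
val = "dog"

pattern = re.compile(
    r"(?<![\w.])(" + re.escape(val) + r")(?![\w.])",
    re.IGNORECASE,  # same role as the /i modifier
)

text = "Dog, hotdog, dog.txt, dogs and a dog"

# finditer plays the role of /g; group(1) corresponds to Perl's $1.
hits = [(m.start(1), m.group(1)) for m in pattern.finditer(text)]
print(hits)  # [(0, 'Dog'), (33, 'dog')]
```

Only the first and last occurrences qualify: "hotdog" has a word character before it, "dog.txt" has a dot after it, and "dogs" has a word character after it.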

I use these sites to help with testing and learning and decoding regex:
https://regex101.com/: This one dissects and explains the expression the best IMO.
http://www.regexr.com/

Define $val, then watch the regex engine work with rxrx, the command-line REPL and wrapper for Regexp::Debugger.
It shows output like this, but in color:
Matched
|
VVV
/(?<![\w.])(dog)(?![\w.])/
|
V
'The quick brown fox jumps over the lazy dog'
^^^
[Visual of regex at 'rxrx' line 0] [step: 189]
It also gives descriptions like this
(?<! # Match negative lookbehind
[\w.] # Match any of the listed characters
) # The end of negative lookbehind
( # The start of a capturing block ($1)
dog # Match a literal sequence ("dog")
) # The end of $1
(?! # Match negative lookahead
[\w.] # Match any of the listed characters
) # The end of negative lookahead

Related

regex negative lookbehind matching when expected not to

Can someone help me understand why the following regex is matching when I would expect it not to match?
String to check against
/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
regex
(?<!Transfer\/)\w*PINPUK.*(?:csv|txt)$
I was expecting this to not match as the string Transfer/ appears before 0 or more word chars followed by the string PINPUK. If I change the pattern from \w* to \w{6} to explicitly match 6 word chars this correctly returns no match.
Can someone help me understand why using the 0 or more quantifier on my "word" character makes the regex match?
Your regex pattern (?<!Transfer/)\w*PINPUK.*(?:csv|txt)$ is looking for \w*PINPUK not immediately preceded by Transfer/
Given the string
/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
the regex engine will start by matching \w*PINPUK with CBD99_PINPUK
But that is preceded by Transfer/, so the look-behind fails there; the engine then retries from the next starting position and finds BD99_PINPUK
That is preceded by C, which isn't Transfer/, so the match is successful
As for a fix, just put the slash outside the look-behind
(?<!Transfer)/\w*PINPUK.*(?:csv|txt)$
That forces the \w* to begin right after the slash, and the pattern now correctly fails
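Here is a small sketch of both patterns run against the path from the question. It uses Python's re module, which treats these fixed-width look-behinds the same way. The second path (without Transfer) is only there as a sanity check.

```python
import re

path = ("/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/"
        "CBD99_PINPUK_14934_09_02_2017_12_07_36.txt")

# Original pattern: the look-behind only guards the position where \w* starts,
# so the engine can start one character later and still succeed.
original = re.search(r"(?<!Transfer/)\w*PINPUK.*(?:csv|txt)$", path)
print(original.group(0))  # BD99_PINPUK_14934_09_02_2017_12_07_36.txt

# Fixed pattern: the literal slash pins down where \w* has to begin.
fixed_pattern = re.compile(r"(?<!Transfer)/\w*PINPUK.*(?:csv|txt)$")
print(fixed_pattern.search(path))  # None

# Sanity check: a directory that does not end in "Transfer" is still accepted.
other = path.replace("ROI_Med_Transfer", "ROI_Med")
print(fixed_pattern.search(other).group(0))
# /CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
```

The first result shows the slip directly: the match starts at B, not at C.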
Borodin has given an excellent explanation of why this doesn't work, and a solution for this case (move a /). Sometimes something that simple isn't possible, though, so here I'll explain an alternative workaround that might be useful.
Things would match as you expect if you could move the \w* inside the negative look-behind, like so:
(?<!Transfer\/\w*)PINPUK.*(?:csv|txt)$
Unfortunately Perl doesn't allow this: a look-behind can't contain an unbounded quantifier such as \w* (before Perl 5.30 it had to be fixed width). There is still a way to do it in a single match, though: match in reverse
^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)
This uses a variable length negative look-ahead, something Perl does allow. Putting all this together in a script we get
use strict;
use warnings;
use feature 'say';
# This path has "Transfer/" right before the file name, so it should be rejected.
my $string_matches = '/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt';
say "Trying $string_matches";
# Match against the reversed string, so the look-behind becomes a look-ahead.
if ( reverse($string_matches) =~ /^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)/ ) {
    say 'It matched';
} else {
    say 'No match';
}
say '';
# No "Transfer/" in this path, so it should be accepted.
my $string_doesnt_match = '/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt';
say "Trying $string_doesnt_match";
if ( reverse($string_doesnt_match) =~ /^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)/ ) {
    say 'It matched';
} else {
    say 'No match';
}
Which outputs
Trying /opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
No match
Trying /opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
It matched

Replace specific capture group instead of entire regex in Perl

I've got a regular expression with capture groups that matches what I want in a broader context. I then take capture group $1 and use it for my needs. That's easy.
But how to use capture groups with s/// when I just want to replace the content of $1, not the entire regex, with my replacement?
For instance, if I do:
$str =~ s/prefix (something) suffix/42/
prefix and suffix are removed. Instead, I would like something to be replaced by 42, while keeping prefix and suffix intact.
As I understand, you can use look-ahead or look-behind that don't consume characters. Or save data in groups and only remove what you are looking for. Examples:
With look-ahead:
s/your_text(?=ahead_text)//;
Grouping data:
s/(your_text)(ahead_text)/$2/;
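Applied to the prefix/suffix example from the question, the two ideas look roughly like this. This is a sketch in Python's re module, which shares the lookaround syntax; the variable names are made up.

```python
import re

s = "prefix something suffix"

# Lookarounds: only "something" is part of the match, so only it is replaced.
with_lookaround = re.sub(r"(?<=prefix )something(?= suffix)", "42", s)

# Groups: match everything, then put the surrounding pieces back.
# \g<1> is used because \142 would be read as a different escape.
with_groups = re.sub(r"(prefix )something( suffix)", r"\g<1>42\g<2>", s)

print(with_lookaround)  # prefix 42 suffix
print(with_groups)      # prefix 42 suffix
```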
If you only need to replace one capture, then using @LAST_MATCH_START and @LAST_MATCH_END (with use English; see perldoc perlvar) together with substr might be a viable choice:
use English qw(-no_match_vars);
$your_string =~ m/aaa (bbb) ccc/;
substr $your_string, $LAST_MATCH_START[1], $LAST_MATCH_END[1] - $LAST_MATCH_START[1], "new content";
# replaces "bbb" with "new content"
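The same offset-based idea can be sketched in Python, where the match object exposes the start and end of each group. Python strings are immutable, so the sketch rebuilds the string by slicing instead of using a four-argument substr.

```python
import re

s = "aaa bbb ccc"
m = re.search(r"aaa (bbb) ccc", s)
if m:
    # m.start(1) and m.end(1) are the offsets of capture group 1,
    # like $LAST_MATCH_START[1] and $LAST_MATCH_END[1] in the Perl version.
    s = s[:m.start(1)] + "new content" + s[m.end(1):]

print(s)  # aaa new content ccc
```

Checking that the match succeeded before touching the offsets is worth doing in the Perl version too.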
This is an old question, but I found the code below easier for renaming lines that start with >something to >something_else. It is good for changing the headers of FASTA sequences.
while ($filelines =~ />(.*)\s/g) {
    unless ($1 =~ /else/i) {
        # \Q...\E quotes any regex metacharacters in the captured header
        $filelines =~ s/(\Q$1\E)/${1}_else/;
    }
}
I use something like this:
s/(?<=prefix)(group)(?=suffix)/$1 =~ s|text|rep|gr/e;
Example:
In the following text I want to normalize the whitespace, but only after ::=:
some text ::=  a   b  c    d  e ;
Which can be achieved with:
s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e
Results with:
some text ::= a b c d e ;
Explanation:
(?<=::=): Look-behind assertion to match ::=
(.*): Everything after ::=
$1 =~ s|\s+| |gr: With the captured group normalize whitespace. Note the r modifier which makes sure not to attempt to modify $1 which is read-only. Use a different sub delimiter (|) to not terminate the replacement expression.
/e: Treat the replacement text as a perl expression.
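For comparison, the closest Python equivalent of /e is passing a function as the replacement to re.sub. This is only a sketch: the sample line and its spacing are invented for illustration, and squeeze is a made-up helper name.

```python
import re

# Sample input; the irregular spacing is invented for illustration.
line = "some text ::=   a  b    c d   e ;"

def squeeze(m):
    # Runs once per match, like the expression in a Perl /e replacement.
    return re.sub(r"\s+", " ", m.group(1))

result = re.sub(r"(?<=::=)(.*)", squeeze, line)
print(result)  # some text ::= a b c d e ;
```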
Use lookaround assertions. Quoting the documentation:
Lookaround assertions are zero-width patterns which match a specific pattern without including it in $&. Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Lookbehind matches text up to the current match position, lookahead matches text following the current match position.
If the beginning of the string has a fixed length, you can thus do:
s/(?<=prefix)(your capture)(?=suffix)/$1/
However, ?<= does not work for variable length patterns (starting from Perl 5.30, it accepts variable length patterns whose length is smaller than 255 characters, which enables the use of |, but still prevents the use of *). The work-around is to use \K instead of (?<=):
s/.*prefix\K(your capture)(?=suffix)/$1/

Finding absence of words in a regular expression

I've seen examples of finding the absence of characters in a regular expression; I'm trying to find the absence of words (likely using a negative lookbehind).
I have lines of code like this:
Example One:
protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";
And here's another one:
mainWindow.Id = "MainWindow";
Final one:
mainStoLabel.Text = "#stb_entry_clah";
I want to capture only the middle one by finding all strings like these that a.) aren't preceded by a "#" in the actual string between the quotes, and b.) aren't preceded at all by the word "readonly".
My current Regular Expression is this:
.*\W\=\W"[^#].*"
It captures the top two examples. Now I just want to exclude the top example. How do I test for the absence of whole words (not just characters)?
Thanks.
The bug in your negation lookahead assertion is that you didn’t put it together right to suit the general case. You need to make its assertion apply to every character position as you crawl ahead. It only applies to one possible dot the way you’ve written it, whereas you need it to apply to all of them. See below for how you must do this to do it correctly.
Here is a working demo that shows two different approaches:
The first uses a negative lookahead to ensure that the left-hand portion not contain readonly and the right-hand portion not start with a number sign.
The second does a simpler parser, then separately inspects the left- and right-hand sides for the individual constraints that apply to each.
The demo language is Perl, but the same patterns and logic should work virtually everywhere.
#!/usr/bin/perl
while (<DATA>) {
    chomp;
    #
    # First demo: use a complicated regex to get desired part only
    #
    my($label) = m{
        ^                           # start at the beginning
        (?:                         # noncapture group:
            (?! \b readonly \b )    # no "readonly" here
            .                       # now advance one character
        ) +                         # repeated 1 or more times
        \s* = \s*                   # skip an equals sign w/optional spaces
        " ( [^#"] [^"]* ) "         # capture #1: quote-delimited text
                                    # BUT whose first char isn't a "#"
    }x;
    if (defined $label) {
        print "Demo One: found label <$label> at line $.\n";
    }
    #
    # Second demo: This time use simpler patterns, several
    #
    my($lhs, $rhs) = m{
        ^                           # from the start of line
        ( [^=]+ )                   # capture #1: 1 or more non-equals chars
        \s* = \s*                   # skip an equals sign w/optional spaces
        " ( [^"]+ ) "               # capture #2: all quote-delimited text
    }x;
    unless ($lhs =~ /\b readonly \b/x || $rhs =~ /^#/) {
        print "Demo Two: found label <$rhs> at line $.\n";
    }
}
__END__
protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";
mainWindow.Id = "MainWindow";
mainStoLabel.Text = "#stb_entry_clah";
I have two bits of advice. The first is to make very sure you ALWAYS use /x mode so you can produce documented and maintainable regexes. The second is that it is much cleaner doing things a bit at a time as in the second solution rather than all at once as in the first.
I don't understand your question completely. A negative lookahead would look like this:
(?!.*readonly)(?:.*\s\=\s"[^#].*")
The first part succeeds only where the word "readonly" does not occur anywhere after the current position, so anchor the pattern with ^ if it should test the whole line.
Which language are you using?
What do you want to match? Only the second example, did I understand that correctly?
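Whether that lookahead is anchored matters. Here is a sketch in Python's re module, using the three lines from the question, that compares the pattern as written with a version that has ^ in front. The variable names are made up.

```python
import re

lines = [
    'protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";',
    'mainWindow.Id = "MainWindow";',
    'mainStoLabel.Text = "#stb_entry_clah";',
]

unanchored = re.compile(r'(?!.*readonly)(?:.*\s\=\s"[^#].*")')
anchored = re.compile(r'^(?!.*readonly)(?:.*\s\=\s"[^#].*")')

# Without ^, the engine can start the match just after the "r" of "readonly",
# where no "readonly" lies ahead any more, so the first line still matches.
print([bool(unanchored.search(l)) for l in lines])  # [True, True, False]

# With ^, the lookahead is checked once, from the start of the line.
print([bool(anchored.search(l)) for l in lines])    # [False, True, False]
```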
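^[^"=]*(?<!(^|\s)readonly\s.*)\s*=\s*"[^#].*" seems to fit your needs, provided your regex engine supports variable-length lookbehind (.NET does; Perl and PCRE reject the unbounded .* inside it):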
everything before the first equal sign should not contain readonly or quotes
readonly is recognized not with word boundaries but with whitespace (except at beginning of line)
the equal sign can be surrounded by arbitrary whitespace
the equal sign must be followed by a quoted string
the quoted string should not start with #
You can work with lookarounds or capture groups if you only want the strings or quoted strings.
Note: as per your own regex, this discards anything after the last quote (not matching the semi-colon in your examples)
You absolutely need to specify the language. The negative lookahead/lookbehind is the thing you need.
Look at this site for an inventory of how to do that in Delphi, GNU (Linux), Groovy, Java, JavaScript, .NET, PCRE (C/C++), Perl, PHP, POSIX, PowerShell, Python, R, REALbasic, Ruby, Tcl, VBScript, Visual Basic 6, wxWidgets, XML Schema, XQuery & XPath

Regex to match text between specified delimiters? (I just can't get it myself)

I've been googling & trying to get this myself but can't quite get it...
QUESTION: What regular expression could be used to select text BETWEEN (but not including) the delimiter text? As an example:
Start Marker=ABC
Stop Marker=XYZ
---input---
This is the first line
And ABCfirst matched hereXYZ
and then
again ABCsecond matchXYZ
asdf
------------
---expected matches-----
[1] first matched here
[2] second match
------------------------
Thanks
Standard or extended regex syntax can't do that, but what it can do is create match groups which you can then select. For instance:
ABC(.*)XYZ
will store anything between ABC and XYZ as \1 (otherwise known as group 1).
If you're using PCREs (Perl-Compatible Regular Expressions), lookahead and lookbehind assertions are also available -- but groups are the more portable and better-performing solution. Also, if you're using PCREs, you should use *? to ensure that the match is non-greedy and will terminate at the first opportunity.
You can test this yourself in a Python interpreter (the Python regex syntax is PCRE-derived):
>>> import re
>>> input_str = '''
... This is the first line
... And ABC first matched hereXYZ
... and then
... again ABCsecond matchXYZ
... asdf
... '''
>>> re.findall('ABC(.*?)XYZ', input_str)
[' first matched here', 'second match']
/ABC(.*?)XYZ/
By default, regular expression quantifiers are greedy. The '?' after the * quantifier makes it match as little as possible, so that the first match is this:
first matched here
...instead of this (which is what a greedy .* captures when . is also allowed to match newlines, as with /s):
first matched hereXYZ
and then
again ABCsecond match
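The difference is easy to reproduce. Here is a sketch in Python's re module using the input from the question; re.DOTALL is added so that . can cross newlines, which is what lets the greedy version run on across lines.

```python
import re

text = """This is the first line
And ABCfirst matched hereXYZ
and then
again ABCsecond matchXYZ
asdf
"""

# re.DOTALL lets "." match newlines, like Perl's /s modifier.
greedy = re.findall(r"ABC(.*)XYZ", text, re.DOTALL)
lazy = re.findall(r"ABC(.*?)XYZ", text, re.DOTALL)

print(greedy)  # ['first matched hereXYZ\nand then\nagain ABCsecond match']
print(lazy)    # ['first matched here', 'second match']
```

Without re.DOTALL both patterns give the same two matches on this input, because each line holds only one ABC...XYZ pair.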
You want the non-greedy match, .*?
my @matches;
while ( $string =~ /ABC(.*?)XYZ/gm ) {
    push @matches, $1;    # keep every capture instead of overwriting one variable
}

What is the right regular expression that match when a defined string isn't present

I'm desperately searching for a regular expression that matches any string made of a to z, A to Z, 0 to 9 and "-". The same regular expression must not match if the string being checked is equal to "admin".
I tried this :
/^(?!admin)|([-a-z0-9]{1,})$/i
but it doesn't work: it matches the string admin even though I added the negative lookahead.
I'm using this regex with PHP (PCRE).
Doing "A and not B" in a single regex is tricky. It's usually better (clearer, easier) to do it in two parts instead. If that isn't an option, it can be done. You're on the right track with the negative look-ahead but the alternation is killing you.
/^(?!admin$)[-a-zA-Z0-9]+$/
In Perl's extended syntax this is:
/
^ # beginning of line anchor
(?! # start negative lookahead assertion
admin # the literal string 'admin'
$ # end of line anchor
) # end negative lookahead assertion
[-a-zA-Z0-9]+ # one or more of a-z, A-Z, 0-9, or '-'
$ # end of string anchor
/x
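As a quick check, here is a sketch in Python's re module that runs the pattern from the question and the corrected one over a few made-up inputs. Note that the corrected pattern has no /i, so it is case sensitive and would accept "Admin".

```python
import re

# Pattern from the question: the alternation splits it into two independent branches.
broken = re.compile(r"^(?!admin)|([-a-z0-9]{1,})$", re.IGNORECASE)

# Corrected pattern: one branch, with the lookahead pinned to the whole string.
fixed = re.compile(r"^(?!admin$)[-a-zA-Z0-9]+$")

# Made-up test inputs.
names = ["admin", "administrator", "my-admin", "bad name", ""]

print([bool(broken.search(n)) for n in names])
# [True, True, True, True, True]
print([bool(fixed.search(n)) for n in names])
# [False, True, True, False, False]
```

The broken pattern accepts everything: either the string does not start with admin (left branch, an empty match at position 0) or it ends in allowed characters (right branch).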
Is the regular expression the only tool available to you? It might be easier to do something like (Perl syntax):
if ($string ne "admin" && $string =~ /^[-a-z0-9]+$/i) { ...
I think it would work if you removed the "or" (|) operator. Right now you are basically saying, match if it is letters, etc. OR not equal to admin, so it always returns true because when it is equal to admin, it still matches the other expression.
^(?![\.]*[aA][dD][mM][iI][nN][\.]*$)([A-Za-z0-9]{1,}$)
Try experimenting with a regex testing tool like the regex-freetool at code.google.com