cryptic perl expression - regex

I find the following statement in a perl (actually PDL) program:
/\/([\w]+)$/i;
Can someone decode this for me, an apprentice in perl programming?

Sure, I'll explain it from the inside out:
\w - matches a single character that can be used in a word (alphanumeric, plus '_')
[...] - matches a single character from within the brackets
[\w] - matches a single character that can be used in a word (kinda redundant here)
+ - matches the previous element one or more times, as many times as possible.
[\w]+ - matches a run of one or more word characters. This will find a word.
(...) - grouping. remember this set of characters for later.
([\w]+) - match a word, and remember it for later
$ - end-of-line. match something at the end of a line
([\w]+)$ - match the last word on a line, and remember it for later
\/ - a single slash character '/'. it must be escaped by backslash, because slash is special.
\/([\w]+)$ - match the last word on a line, after a slash '/', and remember the word for later. This is probably grabbing the directory/file name from a path.
/.../ - match syntax
/.../i - i means case-insensitive.
All together now:
/\/([\w]+)$/i; - match the last word on a line and remember it for later; the word must come after a slash. Basically, grab the filename from an absolute path. The case insensitive part is irrelevant, \w will already match both cases.
More details about Perl regex here: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
And as JRFerguson pointed out, YAPE::Regex::Explain is useful for tokenizing regex, and explaining the pieces.

You will find the YAPE::Regex::Explain module worth installing.
#!/usr/bin/env perl
use YAPE::Regex::Explain;
#...may need to single quote $ARGV[0] for the shell...
print YAPE::Regex::Explain->new( $ARGV[0] )->explain;
Assuming this script is named 'rexplain' do:
$ ./rexplain '/\/([\w]+)$/i'
...to obtain:
The regular expression:
(?-imsx:/\/([\w]+)$/i)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
/ '/'
----------------------------------------------------------------------
\/ '/'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\w]+ any character of: word characters (a-z,
A-Z, 0-9, _) (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
/i '/i'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
UPDATE:
See also: https://stackoverflow.com/a/12359682/1015385 . As noted there and in the module's documentation:
There is no support for regular expression syntax added after Perl version 5.6, particularly any
constructs added in 5.10.

/\/([\w]+)$/i;
It is a regex, and if it is a complete statement, it is applied to the $_ variable, like so:
$_ =~ /\/([\w]+)$/i;
It looks for a slash \/, followed by an alphanumeric string \w+, followed by end of line $. It also captures () the alphanumeric string, which ends up in the variable $1. The /i on the end makes it case-insensitive, which has no effect in this case.
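The behaviour described above can be tried out in a short Python sketch. Python's `re` module handles `\w`, `+`, `(...)` and `$` the same way for this pattern; the helper name `last_word_after_slash` is made up for illustration, and the match object's `group(1)` plays the role of Perl's `$1`.

```python
import re

# Same pattern as the Perl original; the /i flag is dropped because
# \w already matches both upper- and lower-case letters.
PATTERN = re.compile(r"/(\w+)$")

def last_word_after_slash(text):
    """Return the word captured after the final slash, or None if no match."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

for sample in ["/abcdes", "foo/bar2", "foobar", "/tmp/file.txt"]:
    print(repr(sample), "->", last_word_after_slash(sample))
```

The last sample yields None because `\w` does not match the dot, so the pattern only picks up a final path segment made purely of word characters.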

While it doesn't help "explain" a regex, once you have a test case, Damian's new Regexp::Debugger is a cool utility to watch what actually occurs during the matching. Install it and then do rxrx at the command line to start the debugger, then type in /\/([\w]+)$/ and '/r' (for example), and finally m to start the matching. You can then step through the debugger by hitting enter repeatedly. Really cool!

This matches $_ against a slash followed by one or more word characters at the end of the string (case-insensitively) and stores those characters in $1.
$_ value then $1 value
------------------------------
"/abcdes" | "abcdes"
"foo/bar2" | "bar2"
"foobar" | undef # no slash so doesn't match

The Online Regex Analyzer deserves a mention. Here's a link to explain what your regex means, and pasted here for the record.
Sequence: match all of the followings in order
/ (slash)
--+
Repeat | (in GroupNumber:1)
AnyCharIn[ WordCharacter] one or more times |
--+
EndOfLine

Related

Extract the last path-segments of a URI or path using RegEx

I am trying to extract the last section of the following string :
"/subscriptions/5522233222-d762-666e-555a-e6666666666/resourcegroups/rg-sql-Belguim-01/providers/Microsoft.Compute/snapshots/vm-sql-image-v3.3-pre-sysprep-Oct-2021-BG"
I want to capture:
"snapshots/vm-sql-image-v3.3-pre-sysprep-Oct-2021-BG"
I tried below with no luck:
(\w*?\/\w*?)$
How to pull this off using regex?
Use
[^\/]+\/[^\/]+$
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
[^\/]+ any character except: '\/' (1 or more
times (matching the most amount possible))
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
[^\/]+ any character except: '\/' (1 or more
times (matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
Your issues
(\w*?/\w*?)$ is for simple or empty last 2 segments (tested), e.g.
matched hello/world/subscriptions123/snap_shots capturing subscriptions123/snap_shots
matched /1/2// capturing the last 2 empty segments
OK was:
capture-group
/ to match the last path-separator before end ($)
\w*? intended to match the path-segment of any length
What to improve:
*? is a bit too unrestricted, choose quantifier as + for at least one (instead * for any or ? for zero or one)
\w is for word-meta-character, does not match hyphens or dots (OK for snapshot, not for given last segment)
Quick-fixed
(\w+/[\w\.-]+)$ (tested)
added dot \. and hyphen - to character-set containing \w
Simple but solid
(snapshots/[^\/]+)$ (tested)
the second-to-last path segment is assumed to be the fixed constant snapshots
[^\/] any character except (^) slash in last segment
Note: the slash doesn't need to be escaped \/ like Ryszard answered
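As a rough check of the patterns discussed above, here is a small Python sketch that runs the negated-character-class pattern and the original attempt against the path from the question. The function names are hypothetical and exist only for this example.

```python
import re

path = ("/subscriptions/5522233222-d762-666e-555a-e6666666666/resourcegroups/"
        "rg-sql-Belguim-01/providers/Microsoft.Compute/snapshots/"
        "vm-sql-image-v3.3-pre-sysprep-Oct-2021-BG")

def last_two_segments(text):
    """Match the last two slash-separated segments using [^/]+/[^/]+$."""
    match = re.search(r"[^/]+/[^/]+$", text)
    return match.group(0) if match else None

def original_attempt(text):
    """The asker's pattern: \\w cannot match the hyphens and dots."""
    match = re.search(r"(\w*?/\w*?)$", text)
    return match.group(1) if match else None

print(last_two_segments(path))
print(original_attempt(path))
print(original_attempt("hello/world/subscriptions123/snap_shots"))
```

The original attempt only succeeds when both trailing segments consist of word characters, which is why it fails on the snapshot name.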

How to Grep Search two occurrences of a character in a lookbetween

I seem to have to perpetually relearn Regex & Grep syntax every time I need something advanced. This time, even with BBEDIT's pattern playground, I can't work this one out.
I need to do a multi-line search for the occurrence of two literal asterisks anywhere in the text between a pair of tags in a plist/XML file.
I can successfully construct a lookbetween so:
(?s)(?<=<array>).*?(?=</array>)
I try to limit that to only match occurrences in which two asterisks appear between tags:
(?s)(?<=<array>).*?[*]{2}.*?(?=</array>)
(?s)(?<=<array>).+[*]{2}.+(?=</array>)
(?s)(?<=<array>).+?[*]{2}.+?(?=</array>)
But they find nought. And when I remove the {2} I realize I'm not even constructing it right to find occurrences of one asterisk. I tried escaping the character /* and [/*] but to no avail.
How can I match any occurrence of blah blah * blah blah * blah blah ?
[*]{2} means the two asterisks must be consecutive.
(.*[*]){2} is what you're looking for - it contains two asterisks, with anything in between them.
But we also need to make sure the regex is only testing one tag closure at the same time, so instead of .*, we need to use ((?!<\/array>).)* to make sure it won't consume the end tag </array> while matching .*
The regex can be written as:
(?s)(?<=<array>)(?:((?!<\/array>).)*?[*]){2}(?1)*
See the test result here
Use
(?s)(?<=<array>)(?:(?:(?!<\/?array>)[^*])*[*]){2}.*?(?=</array>)
See proof.
Explanation
NODE
EXPLANATION
(?s)
set flags for this block (with . matching \n) (case-sensitive) (with ^ and $ matching normally) (matching whitespace and # normally)
(?<=
look behind to see if there is:
  <array>
'<array>'
)
end of look-behind
(?:
group, but do not capture (2 times):
(?:
group, but do not capture (0 or more times (matching the most amount possible)):
(?!
look ahead to see if there is not:
</?array>
</array> or <array>
)
end of look-ahead
[^*]
any character except: '*'
)*
end of grouping
[*]
any character of: '*'
){2}
end of grouping
.*?
any character (0 or more times (matching the least amount possible))
(?=
look ahead to see if there is:
</array>
'</array>'
)
end of look-ahead
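The pattern explained above also works in Python's `re` module, since the look-behind is fixed-width. The following is only a sketch with made-up sample data; `arrays_with_two_asterisks` is a hypothetical helper name.

```python
import re

# Each of the two repetitions consumes non-asterisk characters without
# crossing an <array> or </array> tag, then exactly one literal asterisk.
TWO_STARS = re.compile(
    r"(?s)(?<=<array>)(?:(?:(?!</?array>)[^*])*[*]){2}.*?(?=</array>)"
)

def arrays_with_two_asterisks(text):
    """Return the contents of every <array> block holding at least two asterisks."""
    return TWO_STARS.findall(text)

sample = (
    "<array>one * only</array>\n"
    "<array>first *\nsecond * third</array>\n"
    "<array>none</array>"
)
print(arrays_with_two_asterisks(sample))
```

Only the second block is returned: the first has a single asterisk, and the negative look-ahead stops the match from borrowing an asterisk from the next block.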

How to express this in regular expression?

I need to capture all Upper and Lower case character strings other than "FOO" and "BAR". How to do this?
I tried [^(^FOO$)(^BAR$)] but it doesn't work.
Update:
Actually I'm using this in a context, I am concatenating it with another regex
["(\w)+": _this_regex_ ]
For example ["abc":FOO] shouldn't be matched
All other types say ["abc":BAZ] should match
You want a negative look ahead:
\["(\w+)"\s*:\s*(?!FOO\b|BAR\b)(\w+)]
The (\w+) are capturing group, they store the key/value pairs inside variables (I guess that's what you want to do?)
(?!...) is a negative lookahead: it will cause the regex to fail if what's inside matches.
\b is a word boundary: here it makes the lookahead match (and so fail the regex) only if FOO is followed by a non-word character (so ["foo": FOOLISH] will be accepted by the regex)
\s is a short for all type of whitespaces (spaces, tabs, newlines etc)
Demo: http://regex101.com/r/fM3uZ7
What you tried, [^...], is a negated character class: it matches any single character that is not listed inside the brackets. Keep in mind that inside a character class only ], ^, - and \ are special characters (so $ means a literal $, and so on).
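To see the negative lookahead in action, here is a minimal Python sketch using the same pattern (with the closing bracket escaped for clarity). The name `parse_pair` is an assumption made for this example.

```python
import re

# Key in quotes, a colon, then a value that must not be exactly FOO or BAR.
PAIR = re.compile(r'\["(\w+)"\s*:\s*(?!FOO\b|BAR\b)(\w+)\]')

def parse_pair(text):
    """Return (key, value) when the value is allowed, otherwise None."""
    match = PAIR.search(text)
    return match.groups() if match else None

for sample in ['["abc":FOO]', '["abc":BAR]', '["abc":BAZ]', '["foo": FOOLISH]']:
    print(sample, "->", parse_pair(sample))
```

FOOLISH is accepted because the \b after FOO requires the word to end there, and it does not.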
Have a try with:
(?i)^(?!.*foo)(?!.*bar)
In action within perl script:
use feature 'say';
my $re = qr~(?i)^(?!.*foo)(?!.*bar)~;
while(<DATA>) {
chomp;
say(/$re/ ? "OK : $_" : "KO : $_");
}
__DATA__
["abc":FOO]
["abc":BAZ]
output:
KO : ["abc":FOO]
OK : ["abc":BAZ]
Explanation:
The regular expression:
(?i)^(?!.*foo)(?!.*bar)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?i) set flags for this block (case-
insensitive) (with ^ and $ matching
normally) (with . not matching \n)
(matching whitespace and # normally)
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
foo 'foo'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
bar 'bar'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------

What is this regex doing?

/^([a-z]:)?\//i
I don't quite understand what the ? does in this regex. If I had to explain it from what I understood:
Match begin "Group1 is a to z and :" outside ? (which I don't get what its doing) \/ which makes it match / and option /i "case insensitive".
I realize that this will return 0 or 1, but I'm not quite sure why, because of the ?
Is this to match directory path or something ?
If I test it:
$var = 'test' would get 0 while $var ='/test'; would get 1 but $var = 'test/' gets 0
So anything that begins with / will get 1 and anything else 0.
Can someone explain this regex in elementary terms to me?
See YAPE::Regex::Explain:
#!/usr/bin/perl
use strict; use warnings;
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/^([a-z]:)?\//i)->explain;
The regular expression:
(?i-msx:^([a-z]:)?/)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?i-msx: group, but do not capture (case-insensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
( group and capture to \1 (optional
(matching the most amount possible)):
----------------------------------------------------------------------
[a-z] any character of: 'a' to 'z'
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
)? end of \1 (NOTE: because you're using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
----------------------------------------------------------------------
/ '/'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
It matches a lower- or upper case letter ([a-z] with the i modifier) positioned at the start of the input string (^) followed by a colon (:) all optionally (?), followed by a forward slash \/.
In short:
^ # match the beginning of the input
( # start capture group 1
[a-z] # match any character from the set {'A'..'Z', 'a'..'z'} (with the i-modifier!)
: # match the character ':'
)? # end capture group 1 and match it once or none at all
\/ # match the character '/'
? will match one or none of the preceding pattern.
? Match 1 or 0 times
See also: perldoc perlre
Explanation:
/.../i # case insensitive
^(...) # match at the beginning of the string
[a-z]: # one character between 'a' and 'z' followed by a colon
(...)? # zero or one time of the group, enclosed in ()
So in English: Match anything which begins with a / (slash) or some letter followed by a colon followed by a /. This looks like it matches pathnames across unix and windows, e.g.
it would match:
/home/user
and
C:/Applications
etc.
It looks like it is looking for a "rooted" path. It will successfully match any string that either starts with a forward slash (/test), or a drive letter followed by a colon, followed by a forward slash (c:/test).
Specifically, the question mark makes something optional. It applies to the part in parentheses, which is a letter followed by a colon.
Things that will match:
C:/
a:/
/
(That last item above is why the ? is there)
Things that will not match:
C:
a:
ab:/
a/
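The match and non-match lists above can be checked with a few lines of Python. This is a sketch only; `is_rooted` is a hypothetical helper name, and `re.IGNORECASE` stands in for the `/i` modifier.

```python
import re

# Optional drive letter plus colon, then a mandatory leading slash.
ROOTED = re.compile(r"^([a-z]:)?/", re.IGNORECASE)

def is_rooted(path):
    """True if the path starts with '/' or with a drive letter, colon and '/'."""
    return ROOTED.search(path) is not None

for candidate in ["C:/", "a:/", "/", "/test", "C:", "a:", "ab:/", "a/", "test/"]:
    print(repr(candidate), "->", is_rooted(candidate))

# The optional group captures the drive prefix when one is present.
print(ROOTED.search("C:/Applications").group(1))
print(ROOTED.search("/home/user").group(1))
```

When the optional group does not take part in the match, `group(1)` is None, which mirrors an undefined `$1` in Perl.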

Trying to match what is before /../ but after / with regular expressions

I am trying to match what is before /../ but after / with a regular expressions, but I want it to look back and stop at the first /
I feel like I am close but it just looks at the first slash and then takes everything after it like... input is this:
this/is/a/./path/that/../includes/face/./stuff/../hat
and my regular expression is:
#\/(.*)\.\.\/#
matching /is/a/./path/that/../includes/face/./stuff/../ instead of just that/../ and stuff/../
How should I change my regex to make it work?
.* means "match any number of any character at all[1]". This is not what you want. You want to match any number of non-/ characters, which is written [^/]*.
Any time you are tempted to use .* or .+ in a regex, be very suspicious. Stop and ask yourself whether you really mean "any character at all[1]" or not - most of the time you don't. (And, yes, non-greedy quantifiers can help with this, but character classes are both more efficient for the regex engine to match against and more clear in their communication of your intent to human readers.)
[1] OK, OK... . isn't exactly "any character at all" - it doesn't match newline (\n) by default in most regex flavors - but close enough.
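The difference between `.*` and `[^/]*` is easy to see side by side. The sketch below uses Python's `re` module on the path from the question; the variable names are chosen just for this example.

```python
import re

path = "this/is/a/./path/that/../includes/face/./stuff/../hat"

# Greedy .* runs to the end of the string and backtracks only as far as
# the LAST "../", so it swallows everything in between.
greedy = re.search(r"/(.*)\.\./", path).group(0)

# [^/]* cannot cross a slash, so each match is limited to one segment.
segments = re.findall(r"([^/]*)/\.\./", path)

print(greedy)
print(segments)
```

The greedy version reproduces the over-long match from the question, while the character class yields one capture per `/../` occurrence.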
Change your pattern so that only characters other than / ([^/]) get matched:
#([^/]*)/\.\./#
Alternatively, you can use a lookahead.
#(\w+)(?=/\.\./)#
Explanation
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
) end of look-ahead
I think you're essentially right, you just need to make the match non-greedy, or change the (.*) to not allow slashes: #/([^/]*)/\.\./#
In your favourite language, do a few splits and string manipulation eg Python
>>> s="this/is/a/./path/that/../includes/face/./stuff/../hat"
>>> a=s.split("/../")[:-1] # the last item is not required.
>>> for item in a:
... print(item.split("/")[-1])
...
that
stuff
In python:
>>> import re
>>> test = 'this/is/a/./path/that/../includes/face/./stuff/../hat'
>>> regex = re.compile(r'/\w+?/\.\./')
>>> regex.findall(test)
['/that/../', '/stuff/../']
Or if you just want the text without the slashes:
>>> regex = re.compile(r'/(\w+?)/\.\./')
>>> regex.findall(test)
['that', 'stuff']
([^/]+) will capture all the text between slashes.
([^/]+)*/\.\. matches that/.. and stuff/.. in your string this/is/a/./path/that/../includes/face/./stuff/../hat. It captures that or stuff, and you can change that, obviously, by changing the placement of the capturing parens and your program logic.
You didn't state whether you want to capture or just match. The regex here will only capture the last occurrence of the match (stuff), but is easily changed to return that and then stuff if used in a global match.
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1 (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
[^/]+ any character except: '/' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)* end of \1 (NOTE: because you're using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\. '.'