~m, s and () in perl regexp - regex

I am trying to get hold of regular expressions in Perl. Can anyone please provide any examples of what matches and what doesn't for the below regular expression?
$sentence =~m/.+\/(.+)/s

=~ is the binding operator; it makes the regex match be performed on $sentence instead of the default $_. m is the match operator; it is optional (e.g. $foo =~ /bar/) when the regex is delimited by / characters but required if you want to use a different delimiter.
s is a regex flag that makes . in the regex match any character, including a newline; by default . does not match newlines.
The actual regex is .+\/(.+); this will match one or more characters, then a literal / character, then one or more other characters. Because the initial .+ consumes as much as possible while still allowing the regex to succeed, it will match up to the last / in the string that has at least one character after it; then the (.+) will capture the characters that follow that / and make them available as $1.
So it is essentially capturing the final component of a filepath. Of foo/bar it will capture the bar, of foo/bar/ it will capture the bar/. Strings with only one component, like /foo or bar/ or baz will not match.
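As a quick check of the cases above, here is a minimal sketch that runs the pattern over a few sample strings. The sample list and the %captured hash are only for illustration and are not part of the question.

```perl
use strict;
use warnings;

# Sample inputs (illustrative); the last three have only one component.
my @paths = ('foo/bar', 'foo/bar/', 'a/b/c', '/foo', 'bar/', 'baz');

# Record what $1 holds for each input, or undef when the match fails.
my %captured;
for my $path (@paths) {
    $captured{$path} = $path =~ m/.+\/(.+)/s ? $1 : undef;
}

for my $path (@paths) {
    my $result = defined $captured{$path}
        ? "captured '$captured{$path}'"
        : 'no match';
    print "$path => $result\n";
}
```

The first three inputs capture bar, bar/ and c respectively; the remaining three do not match.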

It matches any string, including multi-line strings, that contains a slash character somewhere in the middle, i.e. with at least one character on each side of it.
Matches:
foo/bar
asdf\nwrqwer/wrqwerqw # /s modifier allows '.' to match newlines
Doesn't match:
asdfasfdasf # no slash character
/asdfasdf # no characters before the slash
asdfasf/ # no characters after the slash
In addition, the entire substring that follows the last slash in the string will be captured and assigned to the variable $1.

Breakdown:
$sentence =~ — match $sentence with
m/ — the pattern consisting of
. — any character
+ — one or more times
\/ — then a forward-slash
( — and, saving in the $1 capture group,
.+ — any character one or more times
)
/s — allowing . to match newlines
See perldoc perlop for information about operators such as =~ and quote-like operators such as m//, and perldoc perlre about regular expressions and their options such as /s.
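To see what /s changes in practice, here is a small sketch with a string that has a newline after the slash. The sample string and variable names are my own.

```perl
use strict;
use warnings;

my $text = "foo/bar\nbaz";

# Without /s, '.' stops at the newline, so the capture ends there.
my ($without_s) = $text =~ m/.+\/(.+)/;

# With /s, '.' also matches the newline, so the capture runs to the end.
my ($with_s) = $text =~ m/.+\/(.+)/s;

print "without /s: [$without_s]\n";
print "with /s:    [$with_s]\n";
```

Both versions match, but only the /s version captures the text on the second line as well.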

Related

Removing multiple consecutive words separated with whitespace

In the code below the pattern / man / occurs twice consecutively. When I substitute that pattern, only the first occurrence is matched; the second occurrence is not.
As I understand the problem, the first match extends right up to the start of the second one (i.e. the space after man is the end of the first match and would also have to be the start of the second). So the second occurrence is not matched. How can I match this pattern globally when it occurs consecutively?
use strict;
use warnings;
#my $name =" man sky man "; #this works
my $name =" man man sky"; #this doesn't
$name =~s/ man / nam /g; #expected= 'nam nam sky'
print $name,"\n";
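The regex consumes the characters it matches, so the space after the first man is no longer available to start the second match. To avoid this, use lookahead and lookbehind, which check the surrounding characters without consuming them. Check perlre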
$name =~ s/(?<=\s)man(?=\s)/nam/g;
Quoting from perlre
Look Ahead:
(?=pattern)
A zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches
a word followed by a tab, without including the tab in $&.
Look Behind:
(?<=pattern) \K
A zero-width positive lookbehind assertion. For
example, /(?<=\t)\w+/ matches a word that follows a tab, without
including the tab in $&. Works only for fixed-width lookbehind.
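A minimal sketch comparing the two substitutions on the string from the question (the variable names are mine):

```perl
use strict;
use warnings;

my $consuming  = " man man sky";
my $lookaround = " man man sky";

# The pattern consumes both spaces, so the second "man" has no leading
# space left to match and is skipped.
$consuming =~ s/ man / nam /g;

# The lookarounds only test for whitespace without consuming it, so the
# shared space can serve both matches.
$lookaround =~ s/(?<=\s)man(?=\s)/nam/g;

print "[$consuming]\n";
print "[$lookaround]\n";
```

The first line printed still contains one man; the second has both replaced.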
I understand you want to replace man in between whitespace characters or start/end of string.
In this case, you may use two approaches, with positive lookarounds containing alternation operator checking for string boundaries and/or whitespaces, or negative lookarounds checking for the non-whitespace chars on both ends of the search word.
Use either of the two:
$name =~ s/(?<=^|\s)man(?=\z|\s)/nam/g;
$name =~ s/(?<!\S)man(?!\S)/nam/g;
From the point of view of efficiency, the second option is better since alternation is a bit "expensive". It is also the more portable one: the alternatives in (?<=^|\s) differ in width, and Perl versions before 5.30 reject variable-length lookbehind.
The (?<=^|\s) positive lookbehind matches a location in the string that is preceded by the start of string (^) or (|) a whitespace (\s), and the (?=\z|\s) positive lookahead makes sure there is a whitespace or the end of string (\z) immediately after man.
The (?<!\S) negative lookbehind matches a location in the string that is not immediately preceded by a non-whitespace char (i.e. if there is a non-whitespace char there, there will be no match), and the (?!\S) negative lookahead asserts there is no non-whitespace char right after man.
See more details about Lookaround Assertions at perlre.
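To illustrate why the string boundaries matter, here is a rough sketch that contrasts the plain whitespace lookarounds with the negative ones. The sample strings and the replace_strict / replace_loose helpers are made up for the example.

```perl
use strict;
use warnings;

# Requires real whitespace on both sides of "man".
sub replace_strict {
    my ($text) = @_;
    $text =~ s/(?<=\s)man(?=\s)/nam/g;
    return $text;
}

# Only requires that no non-whitespace char touches "man", so the
# start and end of the string are acceptable neighbours too.
sub replace_loose {
    my ($text) = @_;
    $text =~ s/(?<!\S)man(?!\S)/nam/g;
    return $text;
}

for my $sample ('man man sky', 'sky man', 'a manual for a human') {
    printf "%-22s strict: %-22s loose: %s\n",
        $sample, replace_strict($sample), replace_loose($sample);
}
```

The strict version misses man at the very start or end of the string, while both versions leave manual and human alone.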

Regular expressions in Sublime Text 3

I am trying to make a regular expression that replaces the content of the texts in parentheses.
I have used the following regular expression:
"([A-Za-z ]*)"
But it does not work.
Thank you and greetings.
Remove the double quotes from your expression and escape the parentheses:
\([A-Za-z ]*\)
Details:
\( - a literal (
[A-Za-z ]* - zero or more ASCII letters or spaces
\) - a literal ).
The unescaped (...) form a capturing group that stores a submatch in the memory buffer that can be used later during matching or replacement via backreferences.
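The same distinction can be tried outside Sublime Text. Below is a small Perl sketch (Perl treats \( and ( the same way for this pattern); the sample text and the replacement "(new text)" are invented for illustration.

```perl
use strict;
use warnings;

my $text = 'keep (replace me) keep';

# Unescaped parentheses only group: the pattern matches a run of letters
# and spaces and never requires a literal "(" or ")".
my ($unescaped) = $text =~ /([A-Za-z ]*)/;

# Escaped parentheses match the literal characters around the text.
(my $replaced = $text) =~ s/\([A-Za-z ]*\)/(new text)/;

print "unescaped group captured: [$unescaped]\n";
print "after replacement: $replaced\n";
```

The unescaped version captures the leading "keep " rather than the parenthesized part.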

Optional regular expression operator in PowerShell

In $string, I'm trying to strip out the first "-1" so the output of the string will be "test test test-Long.xml".
$string = 'test test test-1-Long.xml'
$string -replace '^(.*)-?\d?(-?.*)\.xml$', '$1$2'
My issue is that I need to make that same first "-1" pattern optional, as both the hyphen and the number might be absent.
Why is the "?" operator not working? I've also tried {0,1} after each as well with no luck.
Quantifiers are greedy by default, so the leading (.*) grabs as much as it can, and the optional parts that follow it are then satisfied by matching nothing.
I am not sure it's the best solution, but I could make it work this way:
$string -replace '^([^\-]*)-?\d?(-?.*)\.xml$', '$1$2'
Sole change: the first group must not contain a dash. That kind of "balances" the regex, avoiding the greediness, and it yields:
test test test-Long
Note: the output is not test test test-Long.xml as required in your question. To get that, simply drop the \.xml$ suffix from the pattern:
$string -replace '^([^\-]*)-?\d?(-?.*)', '$1$2'
The $string -replace '^(.*?)(?:-\d+)?(-.*?)\.xml$', '$1$2' should work if the hyphen is obligatory in the input. Or $string -replace '^((?:(?!-\d+).)*)(?:-\d+)?(.*)\.xml$', '$1$2' in case the input may have no hyphen.
See the regex demo 1 and regex demo 2.
Pattern details:
^ - start of string
(.*?) - Group 1 capturing any 0+ characters other than a newline, as few as possible (as the *? quantifier is lazy), up to the first position where the rest of the pattern can match (NOTE: to increase regex performance, you may use a tempered greedy token based pattern instead of (.*?) - ((?:(?!-\d+).)*) - that matches any text except a hyphen followed by 1 or more digits, thus acting similarly to a negated character class, but for a sequence of symbols)
(?:-\d+)? - non-capturing group with a greedy ? quantifier (so this group has more priority for the regex engine, and the previous capture will end before this pattern) matching a hyphen followed by one or more digits
(-.*?) - Group 2 capturing an obligatory - and any 0+ chars other than LF, as few as possible, up to
\.xml - literal text .xml
$ - end of string.
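The two patterns can also be tried in Perl, which handles these particular constructs the same way as the .NET engine behind -replace. This is only a sketch; the extra input names and the %out hash are my own, not from the question. Note that both patterns consume \.xml, so the '$1$2' replacement drops the extension.

```perl
use strict;
use warnings;

# Pattern for inputs where the hyphen before "Long" is always present.
my $with_hyphen = qr/^(.*?)(?:-\d+)?(-.*?)\.xml$/;
# Tempered greedy token variant, for inputs that may have no hyphen.
my $tempered    = qr/^((?:(?!-\d+).)*)(?:-\d+)?(.*)\.xml$/;

my %out;
for my $name ('test test test-1-Long.xml', 'test test test-Long.xml') {
    (my $copy = $name) =~ s/$with_hyphen/$1$2/;
    $out{$name} = $copy;
}

# No hyphen at all: only the tempered pattern matches.
(my $plain_first  = 'test.xml') =~ s/$with_hyphen/$1$2/;
(my $plain_second = 'test.xml') =~ s/$tempered/$1$2/;

print "$_ => $out{$_}\n" for sort keys %out;
print "test.xml => $plain_first (first pattern), $plain_second (second pattern)\n";
```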
Why is the "?" operator not working?
It is not true. The quantifier ? works well as it matches one or zero occurrences of the quantified subpattern. However, the issue arises in combination with the first .* greedy dot matching subpattern. See your regex in action: the first capture group grabs the whole substring up to the last .xml, and the second group is empty. Why?
Because of backtracking and how greedy quantifier works. The .* matches any characters, but a newline, as many as possible. Thus, it grabs the whole string up to the end. Then, backtracking starts: one character at a time is given back and tested against the subsequent subpatterns.
What are they? -?\d?(-?.*) - all of them can match an empty string. The -? matches an empty string before .xml, ok, \d? matches there as well, and -? and .* also match there.
The second .* then grabs the rest of the string, but it has to give it back because there is still the \.xml pattern to accommodate. So, the second capture group ends up empty. In fact, there are more steps the regex engine performs (see the regex debugger page), but the main idea is like that.
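The effect described above can be reproduced with a short Perl sketch, since Perl backtracks greedy quantifiers the same way for this pattern. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $string = 'test test test-1-Long.xml';

# Original pattern: the greedy (.*) keeps everything before ".xml",
# and the optional pieces after it all match the empty string.
my ($g1, $g2) = $string =~ /^(.*)-?\d?(-?.*)\.xml$/;

# Excluding hyphens from the first group forces it to stop early,
# which leaves "-1" and "-Long" for the rest of the pattern.
my ($h1, $h2) = $string =~ /^([^\-]*)-?\d?(-?.*)\.xml$/;

print "greedy:     [$g1] [$g2]\n";
print "restricted: [$h1] [$h2]\n";
```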

Eclipse regex find and replace

I want to replace the below statement
ImageIcon("images/calender.gif");
with
ImageIcon(res.getResource("images/calender.gif"));
Can anyone suggest a regex to do this in Eclipse? Instead of "calender.gif", any filename can appear.
You can find this pattern (in regex mode):
ImageIcon\(("[^"]+")\)
and replace with:
ImageIcon(res.getResource($1))
The \( and \) in the pattern escape the parentheses since they are to be matched literally. The unescaped parentheses (…) set up capturing group 1, which matches the double-quoted string literal, which should not contain escaped double quotes (which I believe are illegal in filenames anyway).
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
The + is one-or-more repetition, so [^"]+ matches a non-empty sequence of anything except double quotes. We simply surround this pattern with " to match the double-quoted string literal.
So the pattern breaks down like this:
   literal(   literal)
          |          |
ImageIcon\(("[^"]+")\)
           \_______/
            group 1
In replacement strings, $1 substitutes what group 1 matched.
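If you want to try the pattern outside Eclipse first, the same find and replace can be sketched in Perl, where $1 plays the same role in the replacement. The sample Java line is just an illustration.

```perl
use strict;
use warnings;

my $java = 'icon = new ImageIcon("images/calender.gif");';

# Group 1 holds the quoted file name, quotes included, so the
# replacement can wrap it in res.getResource(...) unchanged.
(my $fixed = $java) =~ s/ImageIcon\(("[^"]+")\)/ImageIcon(res.getResource($1))/g;

print "$fixed\n";
```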
References
regular-expressions.info
Character Class, Repetition, Brackets
Examples/Programming Constructs - Strings - has patterns for strings that may contain escaped doublequotes
Press Ctrl-F and enter:
Find: ImageIcon\("([^\"]*)"\);
Replace with: ImageIcon(res.getResource("$1"));
Check the Regular expressions checkbox.

What do these Perl regexes mean?

What does the following syntax mean in Perl?
$line =~ /([^:]+):/;
and
$line =~ s/([^:]+):/$replace/;
See perldoc perlreref
[^:]
is a character class that matches any character other than ':'.
[^:]+
means match one or more of such characters.
I am not sure the capturing parentheses are needed. In any case,
([^:]+):
captures a sequence of one or more non-colon characters followed by a colon.
$line =~ /([^:]+):/;
The =~ operator is called the binding operator, it runs a regex or substitution against a scalar value (in this case $line). As for the regex itself, () specify a capture. Captures place the text that matches them in special global variables. These variables are numbered starting from one and correspond to the order the parentheses show up in, so given
"abc" =~ /(.)(.)(.)/;
the $1 variable will contain "a", the $2 variable will contain "b", and the $3 variable will contain "c" (if you haven't guessed yet . matches one character*). [] specifies a character class. Character classes will match one character in them, so /[abc]/ will match one character if it is "a", "b", or "c". Character classes can be negated by starting them with ^. A negated character class matches one character that is not listed in it, so [^abc] will match one character that is not "a", "b", or "c" (for instance, "d" will match). The + is called a quantifier. Quantifiers tell you how many times the preceding pattern must match. + requires the pattern to match one or more times. (the * quantifier requires the pattern to match zero or more times). The : has no special meaning to the regex engine, so it just means a literal :.
So, putting that information together we can see that the regex will match one or more non-colon characters (saving this part to $1) followed by a colon.
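A short sketch of the capture behaviour just described; the sample strings "abc:foo" and "abcfoo" and the variable names are my own.

```perl
use strict;
use warnings;

# In list context a match returns its captures in order.
my ($first, $second, $third) = "abc" =~ /(.)(.)(.)/;

# One or more non-colon characters, followed by a colon.
my ($key) = "abc:foo" =~ /([^:]+):/;

# With no colon in the string the pattern cannot match at all.
my $matched = "abcfoo" =~ /([^:]+):/ ? 1 : 0;

print "captures: $first, $second, $third\n";
print "key: $key\n";
print "matched without colon: $matched\n";
```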
$line =~ s/([^:]+):/$replace/;
This is a substitution. Substitutions have two parts, the regex, and the replacement string. The regex part follows all of the same rules as normal regexes. The replacement part is treated like a double quoted string. The substitution replaces whatever matches the regex with the replacement, so given the following code
my $line = "key: value";
my $replace = "option";
$line =~ s/([^:]+):/$replace/;
The $line variable will hold the string "option value".
You may find it useful to read perldoc perlretut.
* except newline, unless the /s option is used, in which case it matches any character
The first one captures the part in front of a colon from a line, such as "abc" in the string "abc:foo". More precisely it matches at least one non-colon character (though as many as possible) directly before a colon and puts them into a capture group.
The second one replaces that same part, this time including the colon, with the contents of the variable $replace.
I may be misunderstanding some of the previous answers, but I think that there's a confusion about the second example. It will not replace only the captured item (i.e., one or more non-colons up until a colon) with $replace. It will replace all of ([^:]+): with $replace - the colon as well. (The substitution operates on the match, not just the capture.)
This means if you don't include a colon in $replace (and you want one), you will get bit:
my $line = 'http://www.example.com/';
my $replace = 'ftp';
$line =~ s/([^:]+):/$replace/;
print "Here's \$line now: $line\n";
Output:
Here's $line now: ftp//www.example.com/ # Damn, no colon!
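Two possible ways to keep the colon, sketched under the assumption that only the part before the colon should change: put the colon back in the replacement, or use a lookahead so it is never part of the match.

```perl
use strict;
use warnings;

my $replace = 'ftp';

# Option 1: put the colon back in the replacement.
(my $with_colon = 'http://www.example.com/') =~ s/([^:]+):/${replace}:/;

# Option 2: use a lookahead so the colon is checked but not consumed.
(my $lookahead = 'http://www.example.com/') =~ s/[^:]+(?=:)/$replace/;

print "$with_colon\n";
print "$lookahead\n";
```

Both lines print ftp://www.example.com/.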
I'm not sure if you are just looking at example code, but unless you plan to use the capture, I'm not sure you really want it in these examples.
If you are very unfamiliar with regular expressions (or Perl), you should look at perldoc perlrequick before trying perldoc perlre or perldoc perlretut.
The first one matches one or more characters that are anything but :, followed by a :. The second one matches the same thing but replaces it with $replace.
perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new('([^:]+):')->explain"
The regular expression:
(?-imsx:([^:]+):)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [^:]+                    any character except: ':' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
$line =~ /([^:]+):/;
Matches one or more characters that are not a :, immediately followed by a :
If $line = "http://www.google.com", it will match http (the variable $1 will contain http)
$line =~ s/([^:]+):/$replace/;
This time, the matched text (including the colon) is replaced with the contents of the variable $replace