Which characters can be used as regular expression delimiters? - regex

Which characters can be used as delimiters for a Perl regular expression? m/re/, m(re) and måreå all seem to work, but I'd like to know all possibilities.

From perlop:
With the m you can use any pair of non-whitespace characters as delimiters.
So anything goes, except whitespace. The full paragraph for this is:
If "/" is the delimiter then the initial m is optional. With the m you can use any pair of non-whitespace characters as delimiters. This is particularly useful for matching path names that contain "/", to avoid LTS (leaning toothpick syndrome). If "?" is the delimiter, then the match-only-once rule of ?PATTERN? applies. If "'" is the delimiter, no interpolation is performed on the PATTERN. When using a character valid in an identifier, whitespace is required after the m.
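As a minimal sketch of the rules quoted above, the following tries several delimiters on the same string. The sample path and the variable names are made up for illustration; they do not come from perlop.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $path = '/usr/local/bin/perl';
my $dir  = 'local';

my $braces = $path =~ m{^/usr/local/} ? 1 : 0;  # bracketing pair: { opens, } closes
my $bang   = $path =~ m!/bin/!        ? 1 : 0;  # the same character opens and closes
my $interp = $path =~ m,/$dir/,       ? 1 : 0;  # $dir is interpolated as usual
my $quoted = $path =~ m'/$dir/'       ? 1 : 0;  # single quotes: no interpolation, $ is an anchor
my $ident  = $path =~ m xlocalx       ? 1 : 0;  # identifier character: space after m is required

print "braces=$braces bang=$bang interp=$interp quoted=$quoted ident=$ident\n";
```

Only the single-quoted form should fail to match here, because the pattern is then the literal text /$dir/ with $ acting as an end-of-string anchor rather than the start of a variable name.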

As is often the case, I wonder "can I write a Perl program to answer that question?".
Here is a pretty good first approximation of trying all of the printable ASCII chars:
#!/usr/bin/perl
use warnings;
use strict;

$_ = 'foo bar'; # something to match against
foreach my $ascii (32 .. 126) {
    my $delim = chr $ascii;
    next if $delim eq '?'; # avoid fatal error
    foreach my $m ('m', 'm ') { # with and without space after "m"
        my $code = $m . $delim . '(\w+)' . $delim . ';';
        # print "$code\n";
        my $match;
        {
            no warnings 'syntax';
            ($match) = eval $code;
        }
        # $@ holds the error message if the eval failed to compile or run
        print "[$delim] didn't compile with $m$delim$delim\n" if $@;
        if (defined $match and $match ne 'foo') {
            print "[$delim] didn't match correctly ($match)\n";
        }
    }
}

Just about any non-whitespace character can be used, though identifier characters have to be separated from the initial m by whitespace. Note that using a single quote as the delimiter disables interpolation and most backslash escaping.

There is currently a bug in the lexer that sometimes prevents UTF-8 characters from being used as a delimiter, even though you can sneak Latin1 by it if you aren't in full Unicode mode.

Related

perl - strings comparison and regex

What is the difference between the two lines?
if ($data =~ m/$str/) {
#### ^--- HERE
print "OK";
}
and
if ($data =~ /$str/) {
print "OK";
}
The whole difference is just an 'm'.
m indicates that you are about to use the match operator, as opposed to substitution (s), transliteration (tr or y), or other operators that can be written with /. If you use / as the delimiter, the m is optional: a standalone /.../ implies m. The m is mandatory if you want to use other characters as delimiters around the regexp, as in $str =~ m|$regexp|. This makes for more readable code if your regexp contains lots of / characters, because you don't have to escape them.
Additionally, some of the delimiters that can be specified with m change how the pattern is processed:
http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators
With the m you can use any pair of non-whitespace (ASCII) characters
as delimiters. This is particularly useful for matching path names
that contain "/", to avoid LTS (leaning toothpick syndrome). If "?" is
the delimiter, then a match-only-once rule applies, described in
m?PATTERN? below. If "'" (single quote) is the delimiter, no
interpolation is performed on the PATTERN. When using a character
valid in an identifier, whitespace is required after the m.
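The match-only-once rule for the ? delimiter is easy to miss, so here is a small sketch of it. The helper name first_a_word and the word list are hypothetical; reset is the Perl builtin that re-arms such patterns in the current package.

```perl
use strict;
use warnings;

# Hypothetical word list for illustration.
my @words = qw(apple avocado banana apricot);

# Collects the words matched by m?^a?; that pattern succeeds only once
# until reset() is called, no matter how often the sub runs.
sub first_a_word {
    my @hits;
    for my $word (@_) {
        push @hits, $word if $word =~ m?^a?;
    }
    return @hits;
}

my @first  = first_a_word(@words);   # the pattern fires on the first a-word
my @second = first_a_word(@words);   # the pattern is already spent
reset;                               # re-arm every m?...? in this package
my @third  = first_a_word(@words);   # it can fire once more

print "first=@first second=@second third=@third\n";
```

With any other delimiter the same pattern would match 'apple', 'avocado' and 'apricot' on every call.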

regular expressions in perl for extracting information

How would I match any number of any characters between two specific words... I have a document with a block of text enclosed between 'begin parameters' and 'end parameters'. These two phrases are separated by a number of lines of text. So my text looks like this:
begin parameters
<lines of text here \n.
end parameters
My current regular expression looks like this:
my $regex = "begin parameters[.*\n*]end parameters";
However this is not matching. Does anybody have any suggestions?
Use the /s switch so that the any character . will match new lines.
I also suggest that you use non greedy matching by adding ? to your quantifier.
use strict;
use warnings;
my $data = do {local $/; <DATA>};
if ($data =~ /begin parameters(.*?)end parameters/s) {
print "'$1'";
}
__DATA__
begin parameters
<lines of text here.
end parameters
Outputs:
'
<lines of text here.
'
Your current regular expression does not do what you may think, by placing those characters inside of a character class; it matches any character of: ( ., *, \n, * ) instead of actually matching what you want.
You can use the s modifier forcing the dot . to match newline sequences. By placing a capturing group around what you want to extract, you can access that by using $1
my $regex = qr/begin parameters(.*?)end parameters/s;
my $string = do {local $/; <DATA>};
print $1 if $string =~ /$regex/;
Please try this:
begin parameters([\S\s]+?)end parameters
Explanation: [\S\s] matches any character that is not whitespace or any character that is whitespace (so, in effect, any character at all, including newlines), and the lazy +? keeps going only until it finds "end parameters".
I hope this is what you expect.
The metacharacter . loses its special meaning inside a character class.
So [.*\n*] actually matches exactly one character: a literal period, an asterisk, or a newline.
What you actually want is to match zero or more of any character, newlines included. You can express that with a non-capturing group:
begin parameters(?:.|\n)*?end parameters
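To compare the approaches from the answers above side by side, here is a rough sketch. The sample text is invented for illustration; the three patterns are the ones discussed in this thread (with a capture group added to the last one).

```perl
use strict;
use warnings;

# Hypothetical document text standing in for the asker's file.
my $text = "begin parameters\nalpha = 1\nbeta = 2\nend parameters\n";

# The original attempt: [.*\n*] is a character class matching ONE character.
my $class_matches = $text =~ /begin parameters[.*\n*]end parameters/ ? 1 : 0;

# Lazy dot with /s, so that . also matches newlines.
my ($with_s) = $text =~ /begin parameters(.*?)end parameters/s;

# Alternation of . and \n; no /s needed.
my ($with_alt) = $text =~ /begin parameters((?:.|\n)*?)end parameters/;

print "character class matched: $class_matches\n";
print "captures agree: ", ($with_s eq $with_alt ? "yes" : "no"), "\n";
```

Both working patterns capture the same text, including the newline right after "begin parameters" and the one right before "end parameters".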

Understanding Perl regular expression modifers /m and /s [duplicate]

I have been reading about Perl regular expressions with the modifiers s, m and g. I understand that //g is a global match where it will be a greedy search.
But I am confused by the modifiers s and m. Can anyone explain the difference between s and m, with a code example to show how they differ? I have tried to search online and it only gives the explanation in the link http://perldoc.perl.org/perlre.html#Modifiers. On Stack Overflow I have even seen people using s and m together. Isn't s the opposite of m?
//s
//m
//g
I am not able to match multiple lines using m.
use warnings;
use strict;
use 5.012;
my $file;
{
local $/ = undef;
$file = <DATA>;
};
my @strings = $file =~ /".*"/mg; #returns all except the last string across multiple lines
#/"String"/mg; tried with this as well and returns nothing except String
say for @strings;
__DATA__
"This is string"
"1!=2"
"This is \"string\""
"string1"."string2"
"String"
"S
t
r
i
n
g"
The documentation that you link to yourself seems very clear to me. It would help if you would explain what problem you had with understanding it, and how you came to think that /s and /m were opposites.
Very briefly, /s changes the behaviour of the dot metacharacter . so that it matches any character at all, including a newline. Normally . matches anything except a newline "\n". With /s the string is therefore treated as a single line even if it contains newlines.
/m modifies the caret ^ and dollar $ metacharacters so that they also match at newlines within the string, treating it as a multi-line string. Normally they match only at the beginning and end of the string (for $, that includes just before a trailing newline at the very end).
You shouldn't confuse the /g modifier with being "greedy". It is for global matches, which find all occurrences of the pattern within the string. The term greedy is usually used for the behaviour of quantifiers within the pattern. For instance .* is said to be greedy because it will match as many characters as possible, as opposed to .*? which will match as few characters as possible.
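The distinction between a global match and a greedy quantifier can be sketched as follows. The sample string and variable names are hypothetical, chosen only to keep the example short.

```perl
use strict;
use warnings;

my $text = '"a" and "b"';   # hypothetical sample string

# /g is about HOW MANY matches are returned; * versus *? is about HOW LONG each one is.
my @greedy_global = $text =~ /".*"/g;      # greedy quantifier, global match
my @lazy_global   = $text =~ /".*?"/g;     # lazy quantifier, global match
my ($lazy_single) = $text =~ /(".*?")/;    # lazy quantifier, first match only

print scalar(@greedy_global), " greedy match(es): @greedy_global\n";
print scalar(@lazy_global),   " lazy match(es): @lazy_global\n";
print "first lazy match: $lazy_single\n";
```

The greedy pattern returns a single match spanning the whole string even with /g, while the lazy pattern returns each quoted piece separately.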
Update
In your modified question you are using /".*"/mg, in which the /m is irrelevant because, as noted above, that modifier alters only the behaviour of the $ and ^ metacharacters, and there are none in your pattern.
Changing it to /".*"/sg improves things a little in that the . can now match the newline at the end of each line and so the pattern can match multi-line strings. (Note that it is the object string that is considered to be "single line" here - i.e. the match behaves just as if there were no newlines in it as far as . is concerned.) However, this is where the conventional meaning of greedy comes in: the pattern now matches everything from the first double-quote in the first line to the last double-quote at the end of the last line. I assume that isn't what you want.
There are a few ways to fix this. I recommend changing your pattern so that the string you want is a double-quote, followed by any sequence of characters except double-quotes, followed by another double-quote. This is written /"[^"]*"/g (note that the /s modifier is no longer necessary as there are now no dots in the pattern) and very nearly does what you want, except that the escaped double-quotes are seen as ending the string.
Take a look at this program and its output, noting that I have put a chevron >> at the start of each match so that they can be distinguished
use strict;
use warnings;
my $file = do {
local $/;
<DATA>;
};
my @strings = $file =~ /"[^"]*"/g;
print ">> $_\n\n" for @strings;
__DATA__
"This is string"
"1!=2"
"This is \"string\""
"string1"."string2"
"String"
"S
t
r
i
n
g"
output
>> "This is string"
>> "1!=2"
>> "This is \"
>> ""
>> "string1"
>> "string2"
>> "String"
>> "S
t
r
i
n
g"
As you can see everything is now in order except that in "This is \"string\"" it has found two matches, "This is \", and "". Fixing that may be more complicated than you want to go but it's perfectly possible. Please say so if you need that fixed too.
Update
I may as well finish this off. To ignore escaped double-quotes and treat them as just part of the string, we need to accept either \" or any character except double-quote. That is done using the regex alternation operator | and must be grouped inside non-capturing parentheses (?: ... ). The end result is /"(?:\\"|[^"])*"/g (the backslash itself must be escaped so it is doubled up) which, when put into the above program, produces this output, which I assume is what you wanted.
>> "This is string"
>> "1!=2"
>> "This is \"string\""
>> "string1"
>> "string2"
>> "String"
>> "S
t
r
i
n
g"
/m and /s both affect how the match operator treats multi-line strings.
With the /m modifier, ^ and $ match the beginning and end of any line within the string. Without the /m modifier, ^ and $ just match the beginning and end of the string.
Example:
$_ = "foo\nbar\n";
/foo$/, /^bar/ do not match
/foo$/m, /^bar/m match
With the /s modifier, the special character . matches all characters including newlines. Without the /s modifier, . matches all characters except newlines.
$_ = "cat\ndog\ngoldfish";
/cat.*fish/ does not match
/cat.*fish/s matches
It is possible to use /sm modifiers together.
$_ = "100\n101\n102\n103\n104\n105\n";
/^102.*104$/ does not match
/^102.*104$/s does not match
/^102.*104$/m does not match
/^102.*104$/sm matches
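The four cases listed above can be checked with a short script. This is only a sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "100\n101\n102\n103\n104\n105\n";

# Each variable is 1 if the pattern matched and 0 otherwise.
my $no_modifier = $text =~ /^102.*104$/   ? 1 : 0;  # ^ only at string start, . stops at \n
my $with_s      = $text =~ /^102.*104$/s  ? 1 : 0;  # . crosses \n, but ^ still only at string start
my $with_m      = $text =~ /^102.*104$/m  ? 1 : 0;  # ^ and $ per line, but . stops at \n
my $with_sm     = $text =~ /^102.*104$/sm ? 1 : 0;  # both behaviours combined

print "none=$no_modifier s=$with_s m=$with_m sm=$with_sm\n";
```

Only the /sm version matches, because the pattern needs ^ to anchor at the start of an inner line and . to run across the newlines between 102 and 104.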
With /".*"/mg your match:
starts at a "
then .*" matches as many characters as possible (except \n) up to the last " it can reach on that line
since you use /g, once the match has stopped at that closing ", the regex engine repeats the first two steps from there
/m makes no difference here, as you're not using the ^ or $ anchors
Since you have escaped quotes in your example, regex is not the best tool to do what you want.
If that wasn't the case and you wanted everything between two quotes, /".*?"/gs would do the job.
Borodin's regex will work for the examples from this lab assignment.
However, it's also possible for a backslash to escape itself. This comes up when one includes Windows paths in a string, so the following regex would catch that case:
use warnings;
use strict;
use 5.012;
my $file = do { local $/; <DATA>};
my @strings = $file =~ /"(?:(?>[^"\\]+)|\\.)*"/g;
say "<$_>" for @strings;
__DATA__
"This is string"
"1!=2"
"This is \"string\""
"string1"."string2"
"String"
"S
t
r
i
n
g"
"C:\\windows\\style\\path\\"
"another string"
Outputs:
<"This is string">
<"1!=2">
<"This is \"string\"">
<"string1">
<"string2">
<"String">
<"S
t
r
i
n
g">
<"C:\\windows\\style\\path\\">
<"another string">
For a quick explanation of the pattern:
my @strings = $file =~ m{
"
(?:
(?> # Independent subexpression (reduces backtracking)
[^"\\]+ # Gobble all non double quotes and backslashes
)
|
\\. # Backslash followed by any character
)*
"
}xg; # /x modifier allows whitespace and comments.
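To see why the escaped backslash matters, the two patterns from this thread can be run against the same input. This is a rough sketch: the sample text and the array names @simple and @robust are invented for illustration.

```perl
use strict;
use warnings;

# Single-quoted heredoc: backslashes are kept exactly as typed.
my $text = <<'END';
"C:\\dir\\" and "next"
END
chomp $text;

# Simpler pattern: treats the \" at the end of the path as an escaped quote.
my @simple = $text =~ /"(?:\\"|[^"])*"/g;

# Pattern that consumes a backslash together with whatever follows it.
my @robust = $text =~ /"(?:(?>[^"\\]+)|\\.)*"/g;

print "simple: <$_>\n" for @simple;
print "robust: <$_>\n" for @robust;
```

The simpler pattern runs past the end of the first string and returns one over-long match, while the second pattern returns the two quoted strings separately.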

perl: how to check alphanumeric values and limit the string size to 30 by using regex

I am trying to write a regex for Perl that checks for alphanumeric values (spaces allowed, but not the underscore "_") and limits the number of characters to 30. I am trying the code below, but it is not working. Could anyone please tell me what I am doing wrong? This code even accepts special characters as alphanumeric values: $currLine = 'Kapil# 123' should not be a valid value.
** apologies by $currLine = "regex" i meant $currLine =~ "regex"
if ($currLine = /^[a-zA-Z0-9]{1,30}$/){
    say "Line3 Good: ", $currLine;
} else {
    say "Error in Line 3: Name not alphanumeric ";
}
$currLine = /^[a-zA-Z0-9]{1,30}$/
means
$currLine = $_ =~ /^[a-zA-Z0-9]{1,30}$/
You want to use
$currLine =~ /^[a-zA-Z0-9]{1,30}$/
Now on to the other problems.
You didn't allow spaces. (What follows allows whitespace. If you mean SPACE specifically, use that instead of \s).
You allow a trailing newline.
You allow 31 characters if the 31st is a newline.
You forbid many alphanumeric characters.
You forbid zero characters.
$currLine =~ /^[\p{Alnum}\s]{0,30}\z/
You are using = (assignment) where you should have =~ (bind).
Enabling warnings may have alerted you to this. The code you have is matching $_ and then assigning the results of the match to $currLine.
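The effect of the two operators can be sketched as follows. The value placed in $_ is a hypothetical stand-in for whatever the default variable happened to contain in the asker's program.

```perl
use strict;
use warnings;

$_ = 'abc';                       # hypothetical content of the default variable
my $currLine = 'Kapil# 123';

# Assignment: the match runs against $_ and its result overwrites $currLine.
$currLine = /^[a-zA-Z0-9]{1,30}$/;
my $after_assignment = $currLine;

# Binding: the match runs against $currLine itself and leaves it untouched.
$currLine = 'Kapil# 123';
my $bound_result = $currLine =~ /^[a-zA-Z0-9]{1,30}$/ ? 1 : 0;

print "after assignment: $after_assignment\n";
print "bound match result: $bound_result, value still '$currLine'\n";
```

With = the original string is lost and replaced by the truth value of a match against $_, which is why the if branch could succeed even for 'Kapil# 123'.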
For your regular expression to match all alphanumeric values including spaces, you need to include whitespace (\s) inside your character class. You should also be using the bind operator =~ instead of = here.
if ( $currLine =~ /^[a-z0-9\s]{1,30}$/i ) { ...
Note: I included the i modifier for case-insensitive matching.
You are using the assignment operator (=) instead of the binding operator (=~). You should change the if statement to:
if ($currLine =~ /^[a-zA-Z0-9]{1,30}$/)
This can also be shortened to:
if ($currLine =~ /^[^\W_]{1,30}$/)
[^\W] is a double negative: it matches exactly what \w matches. To exclude _, we add it to the negated character class, giving [^\W_]. Note, however, that this matches much more than just [a-zA-Z0-9]: it includes other Unicode characters that count as word characters. To restrict the match to ASCII, add the /a character set modifier:
/^[^\W_]{1,30}$/a
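Pulling the answers together, a validation helper might look like the sketch below. The sub name is_valid_name is hypothetical, and the choices made here (ASCII letters and digits only, a literal space rather than \s, at least one character, \z instead of $) are assumptions about what the asker wants.

```perl
use strict;
use warnings;

# Hypothetical helper: ASCII letters, digits and spaces only, 1 to 30 characters.
# \z anchors at the true end of the string, so a trailing newline is rejected.
sub is_valid_name {
    my ($name) = @_;
    return $name =~ /^[a-zA-Z0-9 ]{1,30}\z/ ? 1 : 0;
}

# For comparison: $ also matches just before a trailing newline.
my $dollar_accepts_newline = "Kapil\n" =~ /^[a-zA-Z0-9 ]{1,30}$/ ? 1 : 0;

for my $candidate ('Kapil 123', 'Kapil# 123', 'Kapil_123', "Kapil\n", 'x' x 31) {
    (my $shown = $candidate) =~ s/\n/\\n/g;   # make the newline visible when printing
    printf "%-35s %s\n", "'$shown'", is_valid_name($candidate) ? 'valid' : 'invalid';
}
```

Of the sample inputs, only 'Kapil 123' passes; the others fail because of the #, the underscore, the trailing newline, and the length respectively.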

Difference between Perl regular expression delimiters /.../ and #...#

Today I came across two different syntaxes for a Perl regular expression match.
# I have a date string
my $time = '2012-10-29';
# Already familiar "m//":
$time =~ m/^(\d{4}-\d\d-\d\d)$/;
# Completely new to me: m##.
$time =~ m#^(\d{4}-\d\d-\d\d)#;
Now what is the difference between /expression/ and #expression#?
As everyone else said, you can use any delimiter after the m.
/ has one special feature: you can use it by itself, e.g.
$string =~ /regexp/;
is equivalent to:
$string =~ m/regexp/;
Perl allows you to use pretty much any characters to delimit strings, including regexes. This is especially useful if you need to match a pattern that contains a lot of slash characters:
$slashy =~ m/\/\//; #Bad
$slashy =~ m|//|; #Good
According to the documentation, the first of those is an example of "leaning toothpick syndrome".
Most but not all characters behave in the same way when escaping. There is an important exception: m?...? is a special case that only matches a single time between calls to reset().
Another exception: if single quotes are used for the delimiter, no variable interpolation is done. You still have to escape $, though, as it is a special character matching the end of the line.
Nothing except what you have to escape in the regex. You can use any pair of matched characters you like.
$string = "http://example.com/";
$string =~ m!http://!;
$string =~ m#http://#;
$string =~ m{http://};
$string =~ m/http:\/\//;
After the match or search/replace operator (the m and s, respectively) you can use any character as the delimiter, e.g. the # in your case. This also works with bracketing pairs such as braces: s{ abc (.*) def }{ DEF $1 ABC }x.
The advantage is that you don't have to escape the / (though you do have to escape the actual delimiter character, of course). It's often used for clarity, especially when dealing with things like paths or protocols.
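A short sketch of these points, using an invented URL and input string: the same match written with three delimiters, followed by a substitution that uses brace pairs for both halves.

```perl
use strict;
use warnings;

my $url = 'http://example.com/';   # hypothetical sample

my $slashes = $url =~ m/^http:\/\// ? 1 : 0;   # / must be escaped inside m/.../
my $hashes  = $url =~ m#^http://#   ? 1 : 0;   # no escaping needed with m#...#
my $braces  = $url =~ m{^http://}   ? 1 : 0;   # nor with m{...}

# Paired delimiters on both halves of a substitution; /x ignores whitespace
# in the pattern half only, so the replacement keeps its spaces.
(my $swapped = 'abc 123 def') =~ s{ abc \s (.*) \s def }{DEF $1 ABC}x;

print "slashes=$slashes hashes=$hashes braces=$braces\n";
print "swapped: $swapped\n";
```

All three matches succeed, and the substitution turns 'abc 123 def' into 'DEF 123 ABC'.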
There is no difference; the "/" and "#" characters are used as delimiters for the expression. They simply mark the "boundary" of the expression, but are not part of the expression. In theory you can use most non-alphanumeric characters as a delimiter. See the PHP manual's section on Perl-compatible regular expression syntax, in particular the part about delimiters (it doesn't matter that it is the PHP manual, the regex syntax is the same; I just like it because it explains things well).