Perl regex with a negative lookahead behaves unexpectedly

I'm attempting to match /ezmlm-(any word except 'weed' or 'return')\s+/ with a regex. The following demonstrates a foreach loop which does the right thing, and an attempted regex which almost does:
#!/usr/bin/perl
use strict;
use warnings;
my @tests = (
{ msg => "want 'yes', string has ezmlm, but not weed or return",
str => q[|/usr/local/bin/ezmlm-reject '<snip>'],
},
{ msg => "want 'yes', array has ezmlm, but not weed or return",
str => [ <DATA> ],
},
{ msg => "want 'no' , has ezmlm-weed",
str => q[|/usr/local/bin/ezmlm-weed '<snip>'],
},
{ msg => "want 'no' , doesn't have ezmlm-anything",
str => q[|/usr/local/bin/else '<snip>'],
},
{ msg => "want 'no' , ezmlm email pattern",
str => q[crazy/but/legal/ezmlm-wacky@example.org],
},
);
print "foreach regex\n";
foreach ( @tests ) {
print doit_fe( ref $_->{str} ? @{$_->{str}} : $_->{str} ) ? "yes" : "no";
print "\t";
print doit_re( ref $_->{str} ? @{$_->{str}} : $_->{str} ) ? "yes" : "no";
print "\t<--- $_->{msg}\n";
};
# for both of the following subs:
# @_ will contain one or more lines of data
# match the pattern /ezmlm-(any word except 'weed' or 'return')\s+/
sub doit_fe {
my $has_ezmlm = 0;
foreach ( @_ ) {
next if $_ !~ m/ezmlm-(.*?)\s/;
return 0 if $1 eq 'weed' or $1 eq 'return';
$has_ezmlm++;
};
return $has_ezmlm;
};
sub doit_re { return grep /ezmlm-(?!weed|return)/, @_; };
__DATA__
|/usr/local/bin/ezmlm-reject '<snip>'
|/usr/local/bin/ezmlm-issubn '<snip>'
|/usr/local/bin/ezmlm-send '<snip>'
|/usr/local/bin/ezmlm-archive '<snip>'
|/usr/local/bin/ezmlm-warn '<snip>'
The output of the sample program is as follows:
foreach regex
yes yes <--- want 'yes', string has ezmlm, but not weed or return
yes yes <--- want 'yes', array has ezmlm, but not weed or return
no no <--- want 'no' , has ezmlm-weed
no no <--- want 'no' , doesn't have ezmlm-anything
no yes <--- want 'no' , ezmlm email pattern
In the last instance, the regex fails, matching a goofy but legal email address. If I amend the regex placing a \s after the negative lookahead pattern like so:
grep /ezmlm-(?!weed|return)\s+/
The regex fails to match at all. I suppose it has to do with how the negative pattern works. I've tried making the negation non-greedy, but it seems there's some lesson buried in 'perldoc perlre' that is escaping me. Is it possible to do this with a single regex?

The negative look-ahead is zero-width which means that the regex
/ezmlm-(?!weed|return)\s+/
will only match if one or more space characters immediately follow "ezmlm-".
The pattern
/ezmlm-(?!weed|return)/
will match
"crazy/but/legal/ezmlm-wacky@example.org"
because it contains "ezmlm-" not followed by "weed" or "return".
Try
/ezmlm-(?!weed|return)\S+\s+/
where \S+ is one or more non-space characters (or instead use [^@\s]+ if you want to deny email addresses even if followed by a space).
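To make the zero-width point concrete, here is a minimal sketch of the suggested pattern in its [^@\s]+ form. The helper name wanted_ezmlm is made up for this illustration and is not part of the original script; it tests one line at a time rather than reproducing doit_fe exactly.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns 1 when the line
# names an ezmlm program other than weed/return, followed by whitespace.
sub wanted_ezmlm {
    my ($line) = @_;
    # (?!weed|return) consumes nothing; [^@\s]+ then consumes the program name.
    return $line =~ /ezmlm-(?!weed|return)[^@\s]+\s+/ ? 1 : 0;
}

for my $line (
    q[|/usr/local/bin/ezmlm-reject '<snip>'],
    q[|/usr/local/bin/ezmlm-weed '<snip>'],
    q[crazy/but/legal/ezmlm-wacky@example.org],
) {
    my $verdict = wanted_ezmlm($line) ? 'yes' : 'no';
    print "$verdict\t$line\n";
}
```

Because the lookahead only inspects the text and does not consume it, something after it has to consume the program name before \s+ can match.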


How to verify if a variable value contains a character and ends with a number using Perl

I am trying to check whether a variable contains the character "C" and ends with a number in the minor version. I have:
my $str1 = "1.0.99.10C9";
my $str2 = "1.0.99.10C10";
my $str3 = "1.0.999.101C9";
my $str4 = "1.0.995.511";
my $str5 = "1.0.995.AC";
I would like a regex that prints a message if the variable has a C in the fourth field and ends with a number, so for str1, str2 and str3 it should print "matches". I have tried the regexes below, but none of them work. Can you help me correct them?
my $str1 = "1.0.99.10C9";
if ( $str1 =~ /\D+\d+$/ ) {
print "Candy match1\n";
}
if ( $str1 =~ /\D+C\d+$/ ) {
print "Candy match2\n";
}
if ($str1 =~ /\D+"C"+\d+$/) {
print "candy match3";
}
if ($str1 =~ /\D+[Cc]+\d+$/) {
print "candy match4";
}
if ($str1 =~ /\D+\\C\d+$/) {
print "candy match5";
}
if ($str1 =~ /C[^.]*\d$/)
C matches the letter C.
[^.]* matches any number of characters that aren't a dot (.). This ensures that the match can't span multiple fields of the version number; it only matches within the last field.
\d matches a digit.
$ matches the end of the string. So the digit has to be at the end.
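As a quick check of the breakdown above, this sketch runs the pattern against the five strings from the question. The helper name is_candidate is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 when the last field has a C and the string ends in a digit.
sub is_candidate {
    my ($version) = @_;
    return $version =~ /C[^.]*\d$/ ? 1 : 0;
}

for my $version (qw/1.0.99.10C9 1.0.99.10C10 1.0.999.101C9 1.0.995.511 1.0.995.AC/) {
    my $verdict = is_candidate($version) ? 'matches' : 'no match';
    print "$version: $verdict\n";
}
```

The first three strings match; 1.0.995.511 has no C, and 1.0.995.AC has nothing after the C to satisfy \d.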
I found it really helpful to use https://www.regextester.com/109925 to test and analyse my regex strings.
Let me know if this regex works for you:
((.*\.){3}(.*C\d{1}))
Following your format, this regex assumes 3 . with characters between, and then after the third . it checks if the rest of the string contains a C.
EDIT:
If you want to make sure the string ends in a digit, and don't want to use it to check longer strings containing the formula, use:
^((.*\.){3}(.*C\d{1}))$
Let's look at what the regex should look like:
start{digit}.{digit}.{2-3 digits}.{2-3 digits}C{1-2 digits}end
very very strict qr/^1\.0\.9{2,3}\.101?C\d+\z/ - must start with 1.0.99[9]?.
very strict qr/^1\.0\.\d{2,3}\.\d{2,3}C\d{1,2}\z/ - must start with 1.0.
strict qr/^\d\.\d\.\d{2,3}\.\d{2,3}C\d{1,2}\z/
relaxed qr/^\d\.\d\.\d+\.\d+C\d+\z/
very relaxed qr/\.\d+C\d+\z/
use strict;
use warnings;
use feature 'say';
my @data = qw/1.0.99.10C9 1.0.99.10C10 1.0.999.101C9 1.0.995.511 1.0.995.AC/;
#my $re = qr/^\d\.\d\.\d+\.\d+C\d+\z/;
my $re = qr/^\d\.\d\.\d{2,3}\.\d{2,3}C\d+\z/;
say '--- Input Data ---';
say for @data;
say '--- Matching -----';
for( @data ) {
say 'match ' . $_ if /$re/;
}
Output
--- Input Data ---
1.0.99.10C9
1.0.99.10C10
1.0.999.101C9
1.0.995.511
1.0.995.AC
--- Matching -----
match 1.0.99.10C9
match 1.0.99.10C10
match 1.0.999.101C9

Select characters that appear only once in a string

Is it possible to select characters that appear only once?
I am familiar with negative look-behind, and tried the following
/(.)(?<!\1.*)/
but could not get it to work.
examples:
given AXXDBD it should output ADBD
^^ - this is unacceptable
given 123558 it should output 1238
^^ - this is unacceptable
thanks in advance for the help
There are probably a lot of approaches to this, but I think you're looking for something like
(.)\1{1,}
That is, any character followed by the same character at least once.
Your question is tagged with both PHP and JS, so:
PHP:
$str = preg_replace('/(.)\1{1,}/', '', $str);
JS:
str = str.replace(/(.)\1{1,}/g, '');
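For comparison with the other threads on this page, the same substitution can be sketched in Perl. The sub name drop_runs is made up for the example; the pattern is the one given above.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every run of two or more identical characters.
sub drop_runs {
    my ($str) = @_;
    (my $out = $str) =~ s/(.)\1{1,}//g;   # copy first, then edit the copy
    return $out;
}

print drop_runs('AXXDBD'), "\n";   # ADBD
print drop_runs('123558'), "\n";   # 1238
```

Note that this removes the whole run, which is what the examples in the question ask for. It only looks at adjacent repeats, so the two Ds in AXXDBD both survive.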
Without using a regular expression:
function not_twice ($str) {
$str = (string)$str;
$new_str = '';
$prev = false;
for ($i=0; $i < strlen($str); $i++) {
if ($str[$i] !== $prev) {
$new_str .= $str[$i];
}
$prev = $str[$i];
}
return $new_str;
}
This collapses each run of consecutive repeated characters to a single character, and casts numbers to strings in case you need that too.
Testing:
$string = [
'AXXDBD',
'123558',
12333
];
$string = array_map('not_twice', $string);
echo '<pre>' . print_r($string, true) . '</pre>';
Outputs:
Array
(
[0] => AXDBD
[1] => 12358
[2] => 123
)

Regex on a string

I'm trying to formulate a regular expression to use on text. Using in-memory variables is not giving the same result.
The below regular expression provides $1 and $2 that return what I expect. rw results vary. These positions can vary: I am looking to extract the data irrespective of the position in the string.
\/vol\/(\w+)\?(\w+|\s+).*rw=(.*\w+)
My data:
_DATA_
/vol/vol1 -sec=sys,rw=h1:h2,anon=0
/vol/vol1/q1 -sec=sys,rw=h3:h4,anon=0,ro=h1:h2
/vol/vol2/q1 -sec=sys,root=host5,ro=h3:h5,rw=h1:h2,anon=0
I'm trying to capture the second and third groups (if it is a space it should return a space), and a list of entries in rw, ro and root.
The expression (.*\w+) will match up to the last word character in the line. What you are looking for is most likely this ([0-9a-z:]+)
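A small sketch of the difference, using the second data line from the question. The sub names rw_greedy and rw_narrow are invented here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers: capture the rw= value with each pattern.
sub rw_greedy { my ($l) = @_; return $l =~ /rw=(.*\w+)/       ? $1 : undef }
sub rw_narrow { my ($l) = @_; return $l =~ /rw=([0-9a-z:]+)/  ? $1 : undef }

my $line = '/vol/vol1/q1 -sec=sys,rw=h3:h4,anon=0,ro=h1:h2';

print "greedy: ", rw_greedy($line), "\n";   # h3:h4,anon=0,ro=h1:h2
print "narrow: ", rw_narrow($line), "\n";   # h3:h4
```

The character class stops at the first comma because a comma is not in [0-9a-z:], whereas .* runs to the end of the line and only backs off far enough to let \w+ match.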
Guessing from your comment in reply to ikegami, maybe the following will give results you want.
#!/usr/bin/perl
use strict;
use warnings;
my @keys = qw/ rw ro root /;
my $wanted = join "|", @keys;
my %data;
while (<DATA>) {
my ($path, $param) = split;
my ($vol, $q) = (split '/', $path)[2,3];
my %tmp = map {split /=/} grep /^(?:$wanted)/, split /,/, $param;
$data{$vol}{$q // ' '} = \%tmp;
}
use Data::Dumper; print Dumper \%data;
__DATA__
/vol/vol1 -sec=sys,rw=h1:h2,anon=0
/vol/vol1/q1 -sec=sys,rw=h3:h4,anon=0,ro=h1:h2
/vol/vol2/q1 -sec=sys,root=host5,ro=h3:h5,rw=h1:h2,anon=0
The output from Data::Dumper is:
$VAR1 = {
'vol2' => {
'q1' => {
'ro' => 'h3:h5',
'root' => 'host5',
'rw' => 'h1:h2'
}
},
'vol1' => {
' ' => {
'rw' => 'h1:h2'
},
'q1' => {
'ro' => 'h1:h2',
'rw' => 'h3:h4'
}
}
};
Update: can you tell me what does (?:) mean in the grep?
(?: . . .) is a non-capturing group. It is used in this case because the beginning of the regex has ^. Without grouping, the regex would attempt to match ro positioned at the beginning of the string or rw or root anywhere in the string (not just the beginning).
/^ro|rw|root/ rather than /^(?:ro|rw|root)/
The second expression also helps the search along, because the engine knows to attempt all three alternatives only at the beginning of the string rather than anywhere in it. With only three alternatives the speed-up is small here, but it is still good practice.
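Here is a rough illustration of how the two patterns differ. The option forward=1 is a made-up value, chosen only because it contains "rw" in the middle.

```perl
use strict;
use warnings;

# Hypothetical helpers: filter option strings with each pattern.
sub ungrouped { return grep { /^ro|rw|root/ }     @_ }   # ^ applies to "ro" only
sub grouped   { return grep { /^(?:ro|rw|root)/ } @_ }   # ^ applies to all three

my @items = ('rw=h1:h2', 'root=host5', 'forward=1', 'anon=0');

print "ungrouped: @{[ ungrouped(@items) ]}\n";   # rw=h1:h2 root=host5 forward=1
print "grouped:   @{[ grouped(@items) ]}\n";     # rw=h1:h2 root=host5
```

Without the group, forward=1 slips through because "rw" is allowed to match anywhere in the string.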
what does (// ' ') stand for?
That is the defined-or operator. The expression $q // ' ' says to use $q for the key in the hash if it is defined, or a space otherwise.
You said in your original post I'm trying to capture the second and third groups (if it is a space it should return a space).
$q can be undefined when the split, my ($vol, $q) = (split '/', $path)[2,3]; has only a vol and not a q such as in this data line (/vol/vol1 -sec=sys,rw=h1:h2,anon=0).
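The following sketch isolates that step so you can see when $q is undefined. The sub name vol_and_qtree is invented for the example, and the // operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: return the volume and the qtree, with a space
# standing in for a missing qtree.
sub vol_and_qtree {
    my ($path) = @_;
    # '/vol/vol1' splits into ('', 'vol', 'vol1'), so index 3 is undef.
    my ($vol, $q) = (split '/', $path)[2,3];
    return ($vol, $q // ' ');
}

for my $path ('/vol/vol1', '/vol/vol1/q1') {
    my ($vol, $q) = vol_and_qtree($path);
    print "$path -> vol=[$vol] q=[$q]\n";
}
```

Unlike ||, the // operator only falls back when the left side is undefined, so a qtree named 0 would be kept.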
No idea what you want, but a regex would not make a good parser here.
while (<DATA>) {
my ($path, $opts) = split;
my %opts =
map { my ($k,$v) = split(/=/, $_, 2); $k=>$v }
split(/,/, $opts);
...
}
(my %opts = split(/[,=]/, $opts); might suffice.)
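As a minimal sketch of that one-line variant, assuming every option has exactly one = (the sub name parse_opts is made up here):

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas and equals signs, then pair up the
# pieces as key/value. An option without a value would shift every pair
# after it, which is why this only "might" suffice.
sub parse_opts {
    my ($opts) = @_;
    my %opts = split /[,=]/, $opts;
    return \%opts;
}

my $opts = parse_opts('-sec=sys,rw=h1:h2,anon=0');
print "$_ => $opts->{$_}\n" for sort keys %$opts;
```

The longer map version with a split limit of 2 is more robust if a value can itself contain an = sign.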

Regular expression to match strings with embedded spaces

I am trying to write a regular expression, but I can't get it to match field names that contain spaces.
I have a data file like this (generated by another utility)
* field : 100
blahbla : <Set>
scree : <what>
.Cont.asasd :
Othreaol : Value, Other value
Point->IP : 0.0.0.0 Port 5060
The pattern has to match and capture data like this
"field" "100"
"blahbla" "<Set>"
"scree" "<what>"
".Cont.asasd" ""
"Othreaol" "Value, Other value"
My early solution is
/^([\s\*]+)([\w]+[\s\.\-\>]{0,2}[\w]+)(\s*\:\s)(.*)/
but I have problem with some strings like
Z.15 example : No
the space stops the pattern from matching
H.25 miss here : No
same thing here
There are some complicated answers here. I think I'd use a simple split:
while( <DATA> ) {
chomp;
my( $field, $value ) = split /\s*:\s*/, $_, 2;
print "Field [$field] value [$value]\n";
}
__DATA__
* field : 100
blahbla : <Set>
scree : <what>
.Cont.asasd :
Othreaol : Value, Other value
Point->IP : 0.0.0.0 Port 5060
This gives:
Field [* field] value [100]
Field [blahbla] value [<Set>]
Field [scree] value [<what>]
Field [.Cont.asasd] value []
Field [Othreaol] value [Value, Other value]
Field [Point->IP] value [0.0.0.0 Port 5060]
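The third argument to split is doing real work there. Here is a small sketch of why, using a made-up line whose value contains a colon (the line "Start time : 12:30" is not from the question's data, and the sub name is invented).

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into field and value at the first colon only.
sub field_and_value {
    my ($line) = @_;
    chomp $line;
    return split /\s*:\s*/, $line, 2;   # limit 2 keeps later colons in the value
}

my ($field, $value) = field_and_value("Start time : 12:30\n");
print "Field [$field] value [$value]\n";   # Field [Start time] value [12:30]

my @pieces = split /\s*:\s*/, "Start time : 12:30";   # no limit
print scalar(@pieces), " pieces without the limit\n";  # 3 pieces without the limit
```

The limit also preserves the empty value on a line like ".Cont.asasd :", since trailing empty fields are only dropped when no limit is given.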
From there, I'd filter the names and values as needed instead of trying to do it all in a single regex:
my @pairs =
grep { $_->[0] !~ /->/ } # filter keys
map { $_->[0] =~ s/\A\*\s+//; $_ } # transform keys
map { chomp; [ split /\s*:\s*/, $_, 2 ] } # parse line
<DATA>;
use Data::Printer;
p @pairs;
__DATA__
* field : 100
blahbla : <Set>
scree : <what>
.Cont.asasd :
Othreaol : Value, Other value
Point->IP : 0.0.0.0 Port 5060
Since you want to separate the values by colon, use the complement of that character in your regex for all those characters before the split.
my $regex
= qr{
( # v- no worry, this matches the first non-space, non-colon
[^\s:]
(?> [^:\n]* # this matches all non-colon chars on the line
[^\s:] # match the last non-space, non-colon, if there
)? # but possibly not there
) # end group
\s* # match any number of whitespace
: # match the colon
\s* # followed by any number of whitespace
( \S # Start second capture with any non space
(?> .* # anything on the same line
\S # ending in a non-space
)? # But, possibly not there at all
| # OR
) # nothing - this gives the second capture as an
# empty string instead of an undef
}x;
while ( <$in> ) {
$hash{ $1 } = $2 if m/$regex/;
}
%hash then looks like this:
{ '* field' => '100'
, '.Cont.asasd' => ''
, 'H.25 miss here' => 'No'
, Othreaol => 'Value, Other value'
, 'Point->IP' => '0.0.0.0 Port 5060'
, 'Z.15 example' => 'No'
, blahbla => '<Set>'
, scree => '<what>'
}
Of course, as I begin to think on it, if you could be assured of a /\s+:\s+/ pattern or at least a /\s{2,}:\s{2,}/ pattern, it might be simpler to just split the line like so:
while ( <$in> ) {
if ( my ( $k, @v )
= grep {; length } split /\A\s+|\s+\z|(\s+:\s+)/
) {
shift @v; # the first one will be the separator
$hash{ $k } = join( '', @v );
}
}
It does the same thing, and does not have to do nearly as much backtracking to trim results. And it ignores escaped colons without a whole lot more syntax, because it has to be a bare colon surrounded by spaces. You could just simply add the following to the if block:
$k =~ s/(?<!\\)(\\\\)*\\:/$1:/g;
I don't understand why the Point->IP line is omitted from your example output, but something like the code below should suit you.
use strict;
use warnings;
while (<DATA>) {
next unless /([^\s*].+?)\s*:\s*(.*?)\s*$/;
printf qq("%s" "%s"\n), $1, $2;
}
__DATA__
* field : 100
blahbla : <Set>
scree : <what>
.Cont.asasd :
Othreaol : Value, Other value
Point->IP : 0.0.0.0 Port 5060
Z.15 example : No
H.25 miss here : No
output
"field" "100"
"blahbla" "<Set>"
"scree" "<what>"
".Cont.asasd" ""
"Othreaol" "Value, Other value"
"Point->IP" "0.0.0.0 Port 5060"
"Z.15 example" "No"
"H.25 miss here" "No"

How can I escape a literal string I want to interpolate into a regular expression?

Is there a built-in way to escape a string that will be used within/as a regular expression? E.g.
www.abc.com
The escaped version would be:
www\.abc\.com
I was going to use:
$string =~ s/[.*+?|()\[\]{}\\]/\\$&/g; # Escapes special regex chars
But I just wanted to make sure that there's not a cleaner built-in operation that I'm missing?
Use quotemeta or \Q...\E.
Consider the following test program that matches against $str as-is, with quotemeta, and with \Q...\E:
#! /usr/bin/perl
use warnings;
use strict;
my $str = "www.abc.com";
my @test = (
"www.abc.com",
"www/abc!com",
);
sub ismatch($) { $_[0] ? "MATCH" : "NO MATCH" }
my @match = (
[ as_is => sub { ismatch /$str/ } ],
[ qmeta => sub { my $qm = quotemeta $str; ismatch /$qm/ } ],
[ qe => sub { ismatch /\Q$str\E/ } ],
);
for (@test) {
print "\$_ = '$_':\n";
foreach my $method (@match) {
my($name,$match) = @$method;
print " - $name: ", $match->(), "\n";
}
}
Notice in the output that using the string as-is could produce spurious matches:
$ ./try
$_ = 'www.abc.com':
- as_is: MATCH
- qmeta: MATCH
- qe: MATCH
$_ = 'www/abc!com':
- as_is: MATCH
- qmeta: NO MATCH
- qe: NO MATCH
For programs that accept untrustworthy inputs, be extremely careful about using such potentially nasty bits as regular expressions: doing so could create unexpected runtime errors, denial-of-service vulnerabilities, and security holes.
The best way to do this is to use \Q to begin a quoted string and \E to end it.
my $foo = 'www.abc.com';
$bar =~ /blah\Q$foo\Eblah/;
You can also use quotemeta on the variable first. E.g.
my $quoted_foo = quotemeta($foo);
The \Q trick is documented in perlre under "Escape Sequences."
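To tie this back to the question, a short sketch showing that quotemeta produces exactly the escaped string the asker wanted, and that \Q...\E gives the same literal match. The sub name matches_literally is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 when $text contains $literal as plain text.
sub matches_literally {
    my ($text, $literal) = @_;
    return $text =~ /\Q$literal\E/ ? 1 : 0;
}

my $literal = 'www.abc.com';
print quotemeta($literal), "\n";                             # www\.abc\.com
print matches_literally('http://www.abc.com/', $literal), "\n";   # 1
print matches_literally('www/abc!com', $literal), "\n";           # 0
```

quotemeta escapes every ASCII non-word character, so it may add more backslashes than a hand-written character class would, but the resulting pattern matches the same literal text.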