Regex on a string - regex

I'm trying to formulate a regular expression to use on text. Using in-memory variables is not giving the same result.
The below regular expression provides $1 and $2 that return what I expect. rw results vary. These positions can vary: I am looking to extract the data irrespective of the position in the string.
\/vol\/(\w+)\?(\w+|\s+).*rw=(.*\w+)
My data:
_DATA_
/vol/vol1 -sec=sys,rw=h1:h2,anon=0
/vol/vol1/q1 -sec=sys,rw=h3:h4,anon=0,ro=h1:h2
/vol/vol2/q1 -sec=sys,root=host5,ro=h3:h5,rw=h1:h2,anon=0
I'm trying to capture the second and third groups (if it is a space it should return a space), and a list of entries in rw, ro and root.

The expression (.*\w+) will match up to the last word character in the line. What you are looking for is most likely this ([0-9a-z:]+)

Guessing from your comment in reply to ikegami, maybe the following will give results you want.
#!/usr/bin/perl
use strict;
use warnings;
my #keys = qw/ rw ro root /;
my $wanted = join "|", #keys;
my %data;
while (<DATA>) {
my ($path, $param) = split;
my ($vol, $q) = (split '/', $path)[2,3];
my %tmp = map {split /=/} grep /^(?:$wanted)/, split /,/, $param;
$data{$vol}{$q // ' '} = \%tmp;
}
use Data::Dumper; print Dumper \%data;
__DATA__
/vol/vol1 -sec=sys,rw=h1:h2,anon=0
/vol/vol1/q1 -sec=sys,rw=h3:h4,anon=0,ro=h1:h2
/vol/vol2/q1 -sec=sys,root=host5,ro=h3:h5,rw=h1:h2,anon=0
The output from Data::Dumper is:
$VAR1 = {
'vol2' => {
'q1' => {
'ro' => 'h3:h5',
'root' => 'host5',
'rw' => 'h1:h2'
}
},
'vol1' => {
' ' => {
'rw' => 'h1:h2'
},
'q1' => {
'ro' => 'h1:h2',
'rw' => 'h3:h4'
}
}
};
Update: can you tell me what does (?:) mean in the grep?
(?: . . .) is a non-capturing group. It is used in this case because the beginning of the regex has ^. Without grouping, the regex would attempt to match ro positioned at the beginning of the string or rw or root anywhere in the string (not just the beginning).
/^ro|rw|root/ rather than /^(?:ro|rw|root)/
The second expression helps the search along because it knows to only attempt a match at the beginning of the string for all 3 patterns and not to try to match anywhere in the string (speeds things up although in your case, there are only 3 alternating matches to attempt - so, wouldn't make a huge difference here). But, still a good practice to follow.
what does (// ' ') stand for?
That is the defined or operator. The expression $q // ' ' says to use $q for the key in the hash if it is defined or a space instead.
You said in your original post I'm trying to capture the second and third groups (if it is a space it should return a space).
$q can be undefined when the split, my ($vol, $q) = (split '/', $path)[2,3]; has only a vol and not a q such as in this data line (/vol/vol1 -sec=sys,rw=h1:h2,anon=0).

No idea what you want, but a regex would not make a good parser here.
while (<DATA>) {
my ($path, $opts) = split;
my %opts =
map { my ($k,$v) = split(/=/, $_, 2); $k=>$v }
split(/,/, $opts);
...
}
(my %opts = split(/[,=]/, $opts); might suffice.)

Related

How can I match these function calls, and extract the nmae of the function and the first argument?

I am trying to parse an array of elements. Those who match a pattern like the following:
Jim("jjanson", Customer.SALES);
I want to create a hash table like Jim => "jjanson"
How can I do this?
I can not match the lines using:
if($line =~ /\s*[A-Za-z]*"(.*),Customer.*\s*/)
You're not matching either the '(' after the name, nor the ' ' after the comma, before "Customer.".
I can get 'jjanson"' using this expression:
/\s*[A-Za-z]\(*"(.*), Customer.*\s*/
But I assume you don't want jjanson", so we need to modify it like so. (I tend to include the negative character class when I'm looking for simply-delimited stuff. So, in this case I'll make the expression "[^"]*"
/\s*[A-Za-z]\(*"([^"]+)", Customer.*\s*/
Also, I try not to depend upon whitespace, presence or number, I'm going to replace the space with \s*. That you didn't notice that you skipped the whitespace is a good illustration of the need to say "ignore a bunch of whitespace".
/\s*[A-Za-z]\(*"([^"]+)",\s*Customer.*\s*/
Now it's only looking for the sequence ',' + 'Customer' in the significant characters. Functionally, the same, if more flexible.
But since you only do one capture, I can't see what you'd map to what. So I'll do my own mapping:
my %records;
while ( my $line = $source->()) { # simply feed for a source of lines.
my ( $first, $user, $tag )
= $line = m/\s*(\p{Alpha}+)\s*\(\s*"([^"]+)",\s*Customer\.(\S+?)\)\/
;
$records{ $user }
= { first => $first
, username => $user
, tag => $tag
};
}
This is much more than you would tend to need in a one-off, quick solution. But I like to store as much of my input as seems significant.
Note that Jim("jjanson", Customer.SALES); matches the syntax of a function call with two arguments. You can thus abuse string eval:
#!/usr/bin/env perl
use strict;
use warnings;
use YAML::XS;
my $info = extract_first_arg(q{ Jim("jjanson", Customer.SALES);} );
print Dump $info;
sub extract_first_arg {
my $call = shift;
my ($name) = ($call =~ m{ \A \s* (\w+) }x);
unless ($name) {
warn "Failed to find function name in '$call'";
return;
}
my $username = eval sprintf(q{
package My::DangerZone;
no strict;
local *{ %s } = sub { $_[0] };
%s
}, $name, $call);
return { $name => $username };
}
Output:
---
Jim: jjanson
Or, you can abuse autoloading:
our $AUTOLOAD;
print Dump eval 'no strict;' . q{ Jim("jjanson", Customer.SALES); };
sub AUTOLOAD {
my ($fn) = ($AUTOLOAD =~ /::(\w+)\z/);
return { $fn => $_[0] };
}
I would not necessarily recommend using these methods, especially on input that is not in your control, and in a situation where this script has access to sensitive facilities.
On the other hand, I have, in the right circumstances, utilized this kind of thing in transforming one given set of information into something that can be used elsewhere.
Try this:
$line = 'Jim("jjanson", Customer.SALES)';
my %hashStore = (); #Jim("jjanson"
if($line=~m/^\s*([^\(\)]*)\(\"([^\"]*)\"/g) { $hashStore{$1} = $2; }
use Data::Dumper;
print Dumper \%hashStore;
Output:
$VAR1 = {
'Jim' => 'jjanson'
};

Perl all matches of a regexp in a given string

Perl's regexp matching is left-greedy, so that the regexp
/\A (a+) (.+) \z/x
matching the string 'aaab', will set $1='aaa' and $2='b'.
(The \A and \z are just to force start and end of the string.)
You can also give non-greedy qualifiers, as
/\A (a+?) (.+?) \z/x
This will still match, but give $1='a' and $2='aab'.
But I would like to check all possible ways to generate the string, which are
$1='aaa' $2='b'
$1='aa' $2='ab'
$1='a' $2='aab'
The first way corresponds to the default left-greedy behaviour, and the third way corresponds to making the first match non-greedy, but there may be ways in between those extremes. Is there a regexp engine (whether Perl's, or some other such as PCRE or RE2) which can be made to try all possible ways that the regexp specified generates the given string?
Among other things, this would let you implement 'POSIX-compatible' regexp matching where the longest total match is picked. In my case I really would like to see every possibility.
(One way would be to munge the regexp itself, replacing the + modifier with {1,1} on the first attempt, then {1,2}, {1,3} and so on - for each combination of + and * modifiers in the regexp. That is very laborious and slow, and it's not obvious when to stop. I hope for something smarter.)
Background
To answer Jim G.'s question on what problem this might solve, consider a rule-based translation system between two languages, given by the rules
translate(any string of one or more 'a' . y) = 'M' . translate(y)
translate('ab') = 'U'
Then there is a possible result of translate('aaab'), namely 'MU'.
You might try to put these rules into Perl code based on regexps, as
our #m;
my #rules = (
[ qr/\A (a+) (.*) \z/x => sub { 'M' . translate($m[1]) } ],
[ qr/\A ab \z/x => sub { 'U' } ],
);
where translate runs over each of #rules and tries to apply them in turn:
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
$in =~ $lhs or next;
local #m = ($1, $2);
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
return 'fail';
}
However, calling translate('aaab') returns 'fail'. This is because
it tries to apply the first rule matching (a+)(.*) and the regexp
engine finds the match with the longest possible string of 'a'.
Using the answer suggested by ikegami, we can try all ways in which
the regular expression generates the string:
use re 'eval';
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
local our #matches;
$in =~ /$lhs (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;
foreach (#matches) {
local #m = #$_;
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
}
return 'fail';
}
Now translate('aaab') returns 'MU'.
local our #matches;
'aaab' =~ /^ (a+) (.+) \z (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;

Specify number of matching regex groups using Perl

Let's say I have the following string
my $val = "3.4 -22.352 4.0"
The goal is to extract each decimal number by itself. There can be any number of spaces on each side or in between. It is also important to make sure that there is exactly 3 numbers present, and no other junk. I have something like this, but it doesn't work:
my #parts = ($val =~ /((\s*[-+]?\d{1,3}\.\d{1,3}\s*)){3}/)
if (scalar(#parts) == 3) {
print "Validated!\n";
for my $i (#parts) {
print "$i\n";
}
}
For some reason I get the last one twice.
Each capturing group gets you only one value, even if you apply a quantifier on it. If you want 3 values you have to repeat the capturing group 3 times. For example:
my $num = qr/[-+]?\d{1,3}\.\d{1,3}/;
my #nums = $val =~ /^\s*($num)\s+($num)\s+($num)\s*$/;
if(#nums){
print "Valid, and no need to check the number of elements.\n";
}
Instead of fighting regular expressions, use split and looks_like_number:
use warnings;
use strict;
use Scalar::Util qw(looks_like_number);
my $val = "3.4 -22.352 4.0";
my #parts = split /\s+/, $val;
if (scalar(#parts) == 3) {
my $ok = 0;
for (#parts) {
$ok++ if looks_like_number($_);
}
if ($ok == 3) {
print "Validated!\n";
for my $i (#parts) {
print "$i\n";
}
}
}
There are several issues here:
1) If you want three and only three numbers, you should anchor the start (^) and end ($) of the line in the regex.
2) Why are there two sets of parentheses? As written the second pair are redundant.
3) When you have a regex, the number of values returned are usually counted by the left parentheses (unless you use ?: or some other modifier). In this example, you have two, so it only returns two values. Because of the redundant parentheses, you get the same values twice each.
You have two sets of parens, so two values are returned. Both sets surround the same part of the regex, so both values will be the same.
Validating and extracting at not necessarily possible to do at the same time.
Doing it in two steps, extracting first, is quite simple:
my #nums = split ' ', $val;
die "Invalid\n" if #parts != 3;
for (#nums) {
die "Invalid\n" if !/^[-+]?[0-9]{1,3}\.[0-9]{1,3}\z/;
}
You can do it in one step, but there's some redundancy involved:
my $num_pat = qr/[-+]?[0-9]{1,3}\.[0-9]{1,3}/;
my #nums = $val =~ /^($num_pat)\s+($num_pat)\s+($num_pat)\z/
or die "Invalid\n";
my $val = "3.4 -22.352 4.0";
my $length = $val =~ s/((^|\s)\S)/$1/g;
#determines the number of tokens
if ($length == 3)
{
while($val=~/([-+]?[0-9]{1,3}\.[0-9]{1,3})/g)
{
print "$1\n";
}
}
The /g allows you to loop through the string and extract values conforming to your restrictions (one at a time). It will do this until all of the "tokens" matching your pattern are iterated through. I like this solution because it's concise and doesn't require you to create an auxiliary array. It's also a more general answer than using three extractions in one's regex.
With Regex Only
This will require 3 chunks of numbers delimited by space each number will be popluated into it's respective group.
(?:(?:^)([-]?[0-9]*?[.]?[0-9]*?)(?=\s))(?:(?:\s)([-]?[0-9]*?[.]?[0-9]*?)(?=\s))(?:(?:\s)([-]?[0-9]*?[.]?[0-9]*?)(?=$))
Example
PHP Code Example:
<?php
$sourcestring="3.4 -22.352 4.0";
preg_match_all('/(?:(?:^)([-]?[0-9]*?[.]?[0-9]*?)(?=\s))(?:(?:\s)([-]?[0-9]*?[.]?[0-9]*?)(?=\s))(?:(?:\s)([-]?[0-9]*?[.]?[0-9]*?)(?=$))/i',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => 3.4 -22.352 4.0
)
[1] => Array
(
[0] => 3.4
)
[2] => Array
(
[0] => -22.352
)
[3] => Array
(
[0] => 4.0
)
)

When there is a similar pattern in an expression, how to extract the occurence of the last instance in perl?

The value of $s is dynamic. I need to extract the values that occur after the last | in between each [].
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
my #parts = split(/\]/, $s);
foreach my $part (#parts)
{
# Need to extract the values that occur after the last '|'
# (for example: !, .1iit, 10:48AM, Calculator, Coffee)
# and store each of the values separately in a hash
}
Could someone help me out in this?
Thanks,
Best to transform the string into a more useful data structure, then take the needed elements. Why is this best? Because right now you need the last element, but perhaps next time you will need some other part. Since its not harder to do it right, why not?
#!/usr/bin/perl
use strict;
use warnings;
# Only needed for Dumper
use Data::Dumper;
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
# Extract each group between []
# Then transform each group into an array reference by splitting on |
my #groups = map { [ split /\|/ ] } ($s =~ /\[([^\]]*)\]/g);
# Inspect the data structure
print Dumper \#groups;
# Print only the last element of each sub-array
print "$_\n" for map {$_->[-1]} #groups;
If needed the third elements of the sub-arrays could be transformed into hashrefs quite easily too. ,however since that wasn't needed, I leave that as an exercise for the reader (I always love saying that when I get the chance!).
Edit: since I found it interesting I ended up creating these hashrefs, here is the code that would replace the my #groups line:
my #groups = map { [ map { /\{([^\}]*)\}/ ? { split /(?:=|,)/, $1 } : $_ } (split /\|/) ] } ($s =~ /\[([^\]]*)\]/g);
or more properly commented (map commands are read from the back, so the comments start at the bottom and follow by number, comments like #/N pair with those like #N)
my #groups = map { #/1
[ #/2
map { #/3
/\{([^\}]*)\}/ #4 ... and if any element (separated by pipes in #3)
# is surrounded by curly braces
? { #5 ... then return a hash ref
split /(?:=|,)/, $1 #6 ... whose elements are given
# pairwise between '=' or ',' signs
} #/5
: $_ #7 ... otherwise (from 'if' in #4 ) return the element as is
} (split /\|/) #3 ... where each element is separated by pipes (i.e. |)
] #2 ... return an array ref
} ($s =~ /\[([^\]]*)\]/g); #1 For each element between sqr braces (i.e. [])
The generic way:
#subparts = split /\|/, $part;
$tail = $subparts[$#subparts];
If you only ever need the last part separately:
$part =~ /([^\|]*)$/ and $tail = $1;
my ($value) = $part =~ m/[^|]\|(.+)$/;
print "$part => $value\n";
and another way:
my $s =
"[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
my #parts = $s =~ m/\|([^|]+)]/g;
print join( "\n", #parts );
Since you insist on a regex:
#matches = $s =~ /\|([^|]+?)]/g
Using /g will dump all matches into the array #matches
You really don't need a regex... just use split(). The results are stored in %results
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
foreach my $part (split(/\]/, $s))
{
#pieces = split(/\|/, $part);
$results{$pieces[-1]} = $pieces[-1];
}
With regexes, when you think “I want the last of,” you should immediately think of the pattern .* because regex greed does just what you want.
For example, matching /^(.*)a(.*)$/ chops up "abababab" into
ababab in $1
a matched by the literal in the pattern
b in $2
Let's think through the process of the match. Imagine .* as Augustus Gloop.
Augustus: Ausgezeichnet! The ^ anchor means I get to start at the beginning. From there, I shall eat all the candies!
Willie Wonka: But, my dear Augustus, you must share with the other children.
Augustus: Fine, I get "abababa" and they get "b". Happy?
Willie Wonka: But the next child in line doesn't like b candies.
Augustus: Then I shall keep "ababab" for myself and leave "ab" for the others.
At this point, Augustus has his big pile, humble little Charlie Bucket gets his single a, and Veruca Salt—although scowling about the meager quantity—gets at least something now.
In other words, $2 contains everything after the last a. To be persnickety, the ^ and $ anchors are redundant, but I like keeping them for added emphasis.
Putting this into action, you could write
#! /usr/bin/env perl
use strict;
use warnings;
sub last_fields {
local($_) = #_;
my #last;
push #last, $1 =~ /^.*\|(.+)$/ ? $1 : undef
while /\[(.*?)\]/g;
wantarray ? #last : \#last;
}
The outer while breaks up the string into [...] chunks and assumes that right square-bracket cannot occur inside a chunk. Within each chunk, we use /^.*\|(.+)$/ to capture in $1 everything after the last pipe.
Testing it with your example looks like
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!]" .
"[0|0|{A=167,B=2,C=67,D=17}|.1iit]" .
"[196|0|{A=244,B=6,C=67,D=12}|10:48AM]" .
"[204|0|{A=9,B=201,C=61,D=11}|Calculator]" .
"[66|0|{A=145,B=450,C=49,D=14}|Coffee]";
use Test::More tests => 6;
my #lasts = last_fields $s;
# yes, is_deeply could do this in a single call,
# but it's laid out explicitly here for expository benefit
is $lasts[0], "!";
is $lasts[1], ".1iit";
is $lasts[2], "10:48AM";
is $lasts[3], "Calculator";
is $lasts[4], "Coffee";
is scalar #lasts, 5;
All the tests pass:
$ ./match-last-of
1..6
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
The output of prove is nicer. Run it yourself to see the color coding.
$ prove ./match-last-of
./match-last-of .. ok
All tests successful.
Files=1, Tests=6, 0 wallclock secs ( 0.02 usr 0.01 sys + 0.02 cusr 0.00 csys = 0.05 CPU)
Result: PASS

How can I escape a literal string I want to interpolate into a regular expression?

Is there a built-in way to escape a string that will be used within/as a regular expression? E.g.
www.abc.com
The escaped version would be:
www\.abc\.com
I was going to use:
$string =~ s/[.*+?|()\[\]{}\\]/\\$&/g; # Escapes special regex chars
But I just wanted to make sure that there's not a cleaner built-in operation that I'm missing?
Use quotemeta or \Q...\E.
Consider the following test program that matches against $str as-is, with quotemeta, and with \Q...\E:
#! /usr/bin/perl
use warnings;
use strict;
my $str = "www.abc.com";
my #test = (
"www.abc.com",
"www/abc!com",
);
sub ismatch($) { $_[0] ? "MATCH" : "NO MATCH" }
my #match = (
[ as_is => sub { ismatch /$str/ } ],
[ qmeta => sub { my $qm = quotemeta $str; ismatch /$qm/ } ],
[ qe => sub { ismatch /\Q$str\E/ } ],
);
for (#test) {
print "\$_ = '$_':\n";
foreach my $method (#match) {
my($name,$match) = #$method;
print " - $name: ", $match->(), "\n";
}
}
Notice in the output that using the string as-is could produce spurious matches:
$ ./try
$_ = 'www.abc.com':
- as_is: MATCH
- qmeta: MATCH
- qe: MATCH
$_ = 'www/abc!com':
- as_is: MATCH
- qmeta: NO MATCH
- qe: NO MATCH
For programs that accept untrustworthy inputs, be extremely careful about using such potentially nasty bits as regular expressions: doing so could create unexpected runtime errors, denial-of-service vulnerabilities, and security holes.
The best way to do this is to use \Q to begin a quoted string and \E to end it.
my $foo = 'www.abc.com';
$bar =~ /blah\Q$foo\Eblah/;
You can also use quotemeta on the variable first. E.g.
my $quoted_foo = quotemeta($foo);
The \Q trick is documented in perlre under "Escape Sequences."