I've got some data that I'm parsing in Perl, and will be adding more and more differently formatted data in the near future. What I would like to do is write an easy-to-use function, that I could pass a string and a regex to, and it would return anything in parentheses. It would work something like this (pseudocode):
sub parse {
$data = shift;
$regex = shift;
$data =~ eval ("m/$regex/")
foreach $x ($1...$n)
{
push (#ra, $x);
}
return \#ra;
}
Then, I could call it like this:
#subs = parse ($data, '^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)');
As you can see, there's a couple of issues with this code. I don't know if the eval would work, the 'foreach' definitely wouldn't work, and without knowing how many parentheses there are, I don't know how many times to loop.
This is too complicated for split, so if there's another function or possibility that I'm overlooking, let me know.
Thanks for your help!
In list context, a regular expression will return a list of all the parenthesized matches.
So all you have to do is:
my #matches = $string =~ /regex (with) (parens)/;
And assuming that it matched, #matches will be an array of the two capturing groups.
So using your regex:
my #subs = $data =~ /^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)/;
Also, when you have long regexes, Perl has the x modifier, which goes after the closing regex delimiter. The x modifier allows you to put white-space and newlines inside the regex for increased readability.
If you are worried about the capturing groups that might be zero length, you can pass the matches through #subs = grep {length} #subs to filter them out.
Then, I could call it like this:
#subs = parse($data,
'^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)');
Instead, call it like:
parse($data,
qr/^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)/);
Further, your task would be made simpler if you can use named captures (i.e. Perl 5.10 and later). Here is an example:
#!/usr/bin/perl
use strict; use warnings;
my %re = (
id => '(?<id> [0-9]+ )',
name => '(?<name> \w+ )',
value => '(?<value> [0-9]+ )',
);
my #this = (
'123,one:12',
'456,two:21',
);
my #that = (
'one:[12],123',
'two:[21],456',
);
my $this_re = qr/$re{id} , $re{name} : $re{value}/x;
my $that_re = qr/$re{name} : \[$re{value}\] , $re{id} /x;
use YAML;
for my $d ( #this ) {
print Dump [ parse($d, $this_re) ];
}
for my $d ( #that ) {
print Dump [ parse($d, $that_re) ];
}
sub parse {
my ($d, $re) = #_;
return unless $d =~ $re;
return my #result = #+{qw(id name value)};
}
Output:
---
- 123
- one
- 12
---
- 456
- two
- 21
---
- 123
- one
- 12
---
- 456
- two
- 21
You are trying to parse a complex expression with a regex - which is an insufficient tool for the job. Recall that regular expressions cannot parse higher grammars. For intuition, any expression which might be nested cannot be parsed with regex.
When you want to find text inside of pairs of parenthesis, you want to use Text::Balanced.
But, that is not what you want to do, so it will not help you.
Related
How do you create a $scalar from the result of a regex match?
Is there any way that once the script has matched the regex that it can be assigned to a variable so it can be used later on, outside of the block.
IE. If $regex_result = blah blah then do something.
I understand that I should make the regex as non-greedy as possible.
#!/usr/bin/perl
use strict;
use warnings;
# use diagnostics;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Outlook';
my #Qmail;
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\\s\*owner \#/";
my $outlook = Win32::OLE->new('Outlook.Application')
or warn "Failed Opening Outlook.";
my $namespace = $outlook->GetNamespace("MAPI");
my $folder = $namespace->Folders("test")->Folders("Inbox");
my $items = $folder->Items;
foreach my $msg ( $items->in ) {
if ( $msg->{Subject} =~ m/^(.*test alert) / ) {
my $name = $1;
print " processing Email for $name \n";
push #Qmail, $msg->{Body};
}
}
for(#Qmail) {
next unless /$regex|^\s*description/i;
print; # prints what i want ie lines that start with owner and description
}
print $sentence; # prints ^\\s\*offense \ # not lines that start with owner.
One way is to verify a match occurred.
use strict;
use warnings;
my $str = "hello what world";
my $match = 'no match found';
my $what = 'no what found';
if ( $str =~ /hello (what) world/ )
{
$match = $&;
$what = $1;
}
print '$match = ', $match, "\n";
print '$what = ', $what, "\n";
Use Below Perl variables to meet your requirements -
$` = The string preceding whatever was matched by the last pattern match, not counting patterns matched in nested blocks that have been exited already.
$& = Contains the string matched by the last pattern match
$' = The string following whatever was matched by the last pattern match, not counting patterns matched in nested blockes that have been exited already. For example:
$_ = 'abcdefghi';
/def/;
print "$`:$&:$'\n"; # prints abc:def:ghi
The match of a regex is stored in special variables (as well as some more readable variables if you specify the regex to do so and use the /p flag).
For the whole last match you're looking at the $MATCH (or $& for short) variable. This is covered in the manual page perlvar.
So say you wanted to store your last for loop's matches in an array called #matches, you could write the loop (and for some reason I think you meant it to be a foreach loop) as:
my #matches = ();
foreach (#Qmail) {
next unless /$regex|^\s*description/i;
push #matches_in_qmail $MATCH
print;
}
I think you have a problem in your code. I'm not sure of the original intention but looking at these lines:
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\s*owner #/";
I'll step through that as:
Assign $regexto the string ^\s*owner #.
Assign $sentence to value of running a match within $regex with the regular expression /^s*owner $/ (which won't match, if it did $sentence will be 1 but since it didn't it's false).
I think. I'm actually not exactly certain what that line will do or was meant to do.
I'm not quite sure what part of the match you want: the captures, or something else. I've written Regexp::Result which you can use to grab all the captures etc. on a successful match, and Regexp::Flow to grab multiple results (including success statuses). If you just want numbered captures, you can also use Data::Munge
You can do the following:
my $str ="hello world";
my ($hello, $world) = $str =~ /(hello)|(what)/;
say "[$_]" for($hello,$world);
As you see $hello contains "hello".
If you have older perl on your system like me, perl 5.18 or earlier, and you use $ $& $' like codequestor's answer above, it will slow down your program.
Instead, you can use your regex pattern with the modifier /p, and then check these 3 variables: ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} for your matching results.
I'm having to replace fqdn's inside a SQL dump for website migration purposes. I've written a perl filter that's supposed to take STDIN, replace the serialized strings containing the domain name that's supposed to be replaced, replace it with whatever argument is passed into the script, and output to STDOUT.
This is what I have so far:
my $search = $ARGV[0];
my $replace = $ARGV[1];
my $offset_s = length($search);
my $offset_r = length($replace);
my $regex = eval { "s\:([0-9]+)\:\\\"(https?\://.*)($search.*)\\\"" };
while (<STDIN>) {
my #fs = split(';', $_);
foreach (#fs) {
chomp;
if (m#$regex#g) {
my ( $len, $extra, $str ) = ( $1, $2, $3 );
my $new_len = $len - $offset_s + $offset_r;
$str =~ eval { s/$search/$replace/ };
print 's:' . $new_len . ':' . $extra . $str . '\"'."\n";
}
}
}
The filter gets passed data that may look like this (this is taken from a wordpress dump, but we're also supposed to accommodate drupal dumps:
INSERT INTO `wp_2_options` VALUES (1,'siteurl','http://to.be.replaced.com/wordpress/','yes'),(125,'dashboard_widget_options','
a:2:{
s:25:\"dashboard_recent_comments\";a:1:{
s:5:\"items\";i:5;
}
s:24:\"dashboard_incoming_links\";a:2:{
s:4:\"home\";s:31:\"http://to.be.replaced.com/wordpress\";
s:4:\"link\";s:107:\"http://blogsearch.google.com/blogsearch?scoring=d&partner=wordpress&q=link:http://to.be.replaced.com/wordpress/\";
}
}
','yes'),(148,'theme_175','
a:1:{
s:13:\"courses_image\";s:37:\"http://to.be.replaced.com/files/image.png\";
}
','yes')
The regex works if I don't have any periods in my $search. I've tried escaping the periods, i.e. domain\.to\.be\.replaced, but that didn't work. I'm probably doing this either in a very roundabout way or missing something obvious. Any help would be greatly appreciated.
There is no need to evaluate (eval) your regular expression because of including variables in them. Also, to avoid the special meaning of metacharacters of those variables like $search, escape them using quotemeta() function or including the variable between \Q and \E inside the regexp. So instead of:
my $regex = eval { "s\:([0-9]+)\:\\\"(https?\://.*)($search.*)\\\"" };
Use:
my $regex = qr{s\:([0-9]+)\:\\\"(https?\://.*)(\Q$search\E.*)\\\"};
or
my $quoted_search = quotemeta $search;
my $regex = qr{s\:([0-9]+)\:\\\"(https?\://.*)($quoted_search.*)\\\"};
And the same advice for this line:
$str =~ eval { s/$search/$replace/ };
you have to double the escape char \ in your $search variable for the interpolated string to contain the escaped periods.
i.e. domain\.to\.be\.replaced -> domain.to.be.replaced (not wanted)
while domain\\.to\\.be\\.replaced -> domain\.to\.be\.replaced (correct).
I'm not sure your perl regex would replace the DNS in string matching several times the old DNS (in the same serialized string).
I made a gist with a script using bash, sed and one big perl regex for this same problem. You may give it a try.
The regex I use is something like that (exploded for lisibility, and having -7 as the known difference between domain names lengths):
perl -n -p -i -e '1 while s#
([;|{]s:)
([0-9]+)
:\\"
(((?!\\";).)*?)
(domain\.to\.be\.replaced)
(.*?)
\\";#"$1".($2-7).":\\\"$3new.domain.tld$6\\\";"#ge;' file
Which is maybe not the best one but at least it seems to de the job. The g option manages lines containing several serialized strings to cleanup and the while loop redo the whole job until no replacement occurs in serilized strings (for strings containing several occurences of the DNS). I'm not fan enough of regex to try a recursive one.
I have a list of regular expressions (about 10 - 15) that I needed to match against some text. Matching them one by one in a loop is too slow. But instead of writing up my own state machine to match all the regexes at once, I am trying to | the individual regexes and let perl do the work. The problem is that how do I know which of the alternatives matched?
This question addresses the case where there are no capturing groups inside each individual regex. (which portion is matched by regex?) What if there are capturing groups inside each regexes?
So with the following,
/^(A(\d+))|(B(\d+))|(C(\d+))$/
and the string "A123", how can I both know that A123 matched and extract "123"?
You don't need to code up your own state machine to combine regexes. Look into Regexp:Assemble. It has methods that'll track which of your initial patterns matched.
Edit:
use strict;
use warnings;
use 5.012;
use Regexp::Assemble;
my $string = 'A123';
my $re = Regexp::Assemble->new(track => 1);
for my $pattern (qw/ A(\d+) B(\d+) C(\d+) /) {
$re->add($pattern);
}
say $re->re; ### (?-xism:(?:A(\d+)(?{0})|B(\d+)(?{2})|C(\d+)(?{1})))
say for $re->match($string); ### A(\d+)
say for $re->capture; ### 123
Why not use /^ (?<prefix> A|B|C) (?<digits> \d+) $/x. Note, named capture groups used for clarity, and not essential.
A123 will be in capture group $1 and 123 will be in group $2
So you could say:
if ( /^(A(\d+))|(B(\d+))|(C(\d+))$/ && $1 eq 'A123' && $2 eq '123' ) {
...
}
This is redundant, but you get the idea...
EDIT: No, you don't have to enumerate each sub match, you asked how to know whether A123 matched and how to extract 123:
You won't enter the if block unless A123 matched
and you can extract 123 using the $2 backreference.
So maybe this example would have been more clear:
if ( /^(A(\d+))|(B(\d+))|(C(\d+))$/ ) {
# do something with $2, which will be '123' assuming $_ matches /^A123/
}
EDIT 2:
To capture matches in an AoA (which is a different question, but this should do it):
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my #matches = map { [$1,$2] if /^(?:(A|B|C)(\d+))$/ } <DATA>;
print Dumper \#matches;
__DATA__
A123
B456
C769
Result:
$VAR1 = [
[
'A',
'123'
],
[
'B',
'456'
],
[
'C',
'769'
]
];
Note that I modified your regex, but it looks like that's what you're going for judging by your comment...
Another thing you can do in Perl is to embed Perl code directly in your regex using "(?{...})". So, you can set a variable that tells you which part of the regex matched. WARNING: your regex should not contain any variables (outside of the embedded Perl code), that will be interpolated into the regex or you will get errors. Here is a sample parser that uses this feature:
my $kind;
my $REGEX = qr/
[A-Za-z][\w]* (?{$kind = 'IDENT';})
| (?: ==? | != | <=? | >=? ) (?{$kind = 'OP';})
| -?\d+ (?{$kind = 'INT';})
| \x27 ( (?:[^\x27] | \x27{2})* ) \x27 (?{$kind = 'STRING';})
| \S (?{$kind = 'OTHER';})
/xs;
my $line = "if (x == 'that') then x = -23 and y = 'say ''hi'' for me';";
my #tokens;
while ($line =~ /( $REGEX )/xsg) {
my($match, $str) = ($1,$2);
if ($kind eq 'STRING') {
$str =~ s/\x27\x27/\x27/g;
push(#tokens, ['STRING', $str]);
}
else {
push(#tokens, [$kind, $match]);
}
}
foreach my $lItems (#tokens) {
print("$lItems->[0]: $lItems->[1]\n");
}
which prints out the following:
IDENT: if
OTHER: (
IDENT: x
OP: ==
STRING: that
OTHER: )
IDENT: then
IDENT: x
OP: =
INT: -23
IDENT: and
IDENT: y
OP: =
STRING: say 'hi' for me
OTHER: ;
It's kind of contrived, but you'll notice that the quotes (actually, apostrophes) around strings are stripped off (also, consecutive quotes are collapsed to single quotes), so in general, only the $kind variable will tell you whether the parser saw an identifier or a quoted string.
With your example data, it is easy to write
'A123' =~ /^([ABC])(\d+)$/;
after which $1 will contain the prefix and $2 the suffix.
I cannot tell whether this is relevant to your real data, but to use an additional module seems like overkill.
My program read other programs source code and colect information about used SQL queries. I have problem with getting substring.
...
$line = <FILE_IN>;
until( ($line =~m/$values_string/i && $line !~m/$rem_string/i) || eof )
{
if($line =~m/ \S{2}DT\S{3}/i)
{
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
$line =~/\S{2}DT\S{3}/i;
print $line."\n";
...
In result print prints whole line and not a substring I expect. I tried different approach, but I use Perl seldom and probably make basic concept error. ( position of tablename in line is not fixed. Another problem is multiple occurrence i.e.[... SELECT * FROM AADTTAB, BBDTTAB, ...] ). How can I obtain that substring?
Use grouping with parenthesis and store the first group.
if( $line =~ /(\S{2}DT\S{3})/i )
{
my $substring = $1;
}
The code above fixes the immediate problem of pulling out the first table name. However, the question also asked how to pull out all the table names. So:
# FROM\s+ match FROM followed by one or more spaces
# (.+?) match (non-greedy) and capture any character until...
# (?:x|y) match x OR y - next 2 matches
# [^,]\s+[^,] match non-comma, 1 or more spaces, and non-comma
# \s*; match 0 or more spaces followed by a semi colon
if( $line =~ /FROM\s+(.+?)(?:[^,]\s+[^,]|\s*;)/i )
{
# $1 will be table1, table2, table3
my #tables = split(/\s*,\s*/, $1);
# delim is a space/comma
foreach(#tables)
{
# $_ = table name
print $_ . "\n";
}
}
Result:
If $line = "SELECT * FROM AADTTAB, BBDTTAB;"
Output:
AADTTAB
BBDTTAB
If $line = "SELECT * FROM AADTTAB;"
Output:
AADTTAB
Perl Version: v5.10.0 built for MSWin32-x86-multi-thread
I prefer this:
my ( $table_name ) = $line =~ m/(\S{2}DT\S{3})/i;
This
scans $line and captures the text corresponding to the pattern
returns "all" the captures (1) to the "list" on the other side.
This psuedo-list context is how we catch the first item in a list. It's done the same way as parameters passed to a subroutine.
my ( $first, $second, #rest ) = #_;
my ( $first_capture, $second_capture, #others ) = $feldman =~ /$some_pattern/;
NOTE:: That said, your regex assumes too much about the text to be useful in more than a handful of situations. Not capturing any table name that doesn't have dt as in positions 3 and 4 out of 7? It's good enough for 1) quick-and-dirty, 2) if you're okay with limited applicability.
It would be better to match the pattern if it follows FROM. I assume table names consist solely of ASCII letters. In that case, it is best to say what you want. With those two remarks out of the way, note that a successful capturing regex match in list context returns the matched substring(s).
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'select * from aadttab, bbdttab';
if ( my ($table) = $s =~ /FROM ([A-Z]{2}DT[A-Z]{3})/i ) {
print $table, "\n";
}
__END__
Output:
C:\Temp> s
aadttab
Depending on the version of perl on your system, you may be able to use a named capturing group which might make the whole thing easier to read:
if ( $s =~ /FROM (?<table>[A-Z]{2}DT[A-Z]{3})/i ) {
print $+{table}, "\n";
}
See perldoc perlre.
Parens will let you grab part of the regex into special variables: $1, $2, $3...
So:
$line = ' abc andtabl 1234';
if($line =~m/ (\S{2}DT\S{3})/i) {
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
print $1."\n";
}
Use a capturing group:
$line =~ /(\S{2}DT\S{3})/i;
my $substr = $1;
$& contains the string matched by the last pattern match.
Example:
$str = "abcdefghijkl";
$str =~ m/cdefg/;
print $&;
# Output: "cdefg"
So you could do something like
if($line =~m/ \S{2}DT\S{3}/i) {
print $&."\n";
}
WARNING:
If you use $& in your code it will slow down all pattern matches.
I thought this would have done it...
$rowfetch = $DBS->{Row}->GetCharValue("meetdays");
$rowfetch = /[-]/gi;
printline($rowfetch);
But it seems that I'm missing a small yet critical piece of the regex syntax.
$rowfetch is always something along the lines of:
------S
-M-W---
--T-TF-
etc... to represent the days of the week a meeting happens
$rowfetch =~ s/-//gi
That's what you need for your second line there. You're just finding stuff, not actually changing it without the "s" prefix.
You also need to use the regex operator "=~" for this.
Here is what your code presently does:
# Assign 'rowfetch' to the value fetched from:
# The function 'GetCharValue' which is a method of:
# An Value in A Hash Identified by the key "Row" in:
# Either a Hash-Ref or a Blessed Hash-Ref
# Where 'GetCharValue' is given the parameter "meetdays"
$rowfetch = $DBS->{Row}->GetCharValue("meetdays");
# Assign $rowfetch to the number of times
# the default variable ( $_ ) matched the expression /[-]/
$rowfetch = /[-]/gi;
# Print the number of times.
printline($rowfetch);
Which is equivalent to having written the following code:
$rowfetch = ( $_ =~ /[-]/ )
printline( $rowfetch );
The magic you are looking for is the
=~
Token instead of
=
The former is a Regex operator, and the latter is an assignment operator.
There are many different regex operators too:
if( $subject =~ m/expression/ ){
}
Will make the given codeblock execute only if $subject matches the given expression, and
$subject =~ s/foo/bar/gi
Replaces ( s/) all instances of "foo" with "bar", case-insentitively (/i), and repeating the replacement more than once(/g), on the variable $subject.
Using the tr operator is faster than using a s/// regex substitution.
$rowfetch =~ tr/-//d;
Benchmark:
use Benchmark qw(cmpthese);
my $s = 'foo-bar-baz-blee-goo-glab-blech';
cmpthese(-5, {
trd => sub { (my $a = $s) =~ tr/-//d },
sub => sub { (my $a = $s) =~ s/-//g },
});
Results on my system:
Rate sub trd
sub 300754/s -- -79%
trd 1429005/s 375% --
Off-topic, but without the hyphens, how will you know whether a "T" is Tuesday or Thursday?