Parsing data with perl- capturing a range of text - regex

I'm writing code to parse all the interfaces on my network, looking for certain configurations.. etc.
the data looks like this:
Interface fa1
mode access
port-security
mac-address sticky
!
interface fa2
mode trunk
!
Basically starting with "^interface " and ending "!".
my current algorithm is to "record" the data I need
foreach $line (#input) {
if ( $line =~ m/^interface.+\d/ && $line !~ m/interface Embedded-Service-Engine|BRI|TenGigabitEthernet|vlan|Port-channel|ATM|loopback/i) {
$record = 1;
}
#$int ne '' is to handle the rest of the file not in this format
if( $line =~ m/!/ && $int ne '') {
#save data in format 'interface fa2,mode trunk'
#if the interface doesn't have port-security
push(#intlist, join(','split("\r\n",$int))."\n") unless $int =~ m/port-security/;
$record=0;
$int='';
}
if ($record) {
$int.=$line;
}
}
while this works in my case, I'd like a simply way to do it. I've searched and found that you can use the range operator '..' on regex
which turns my code into :
#input # contains the file
#interfaces = grep (/^interface.+\d/ .. /!/, #input);
which gives me all the interface data, the problem is now every line is a single element in the #interfaces array. how can I then split this data up so everything from /^interface.+\d/ .. /!/ is one element in this array without creating more for loops?
The goal is to get it down to one element so I can then scan it for interfaces I don't want to look at interface Embedded-Service-Engine|BRI|TenGigabit as well as interfaces that have the correct configurations.

Have a look at $/ because I think that'll help. It's the record separator - which defaults to \n.
Then you can apply regular expressions to the current 'chunk' to pull out the data you require - by default a regular expression/capture group applies to $_ the implicit variable.
E.g.
#!/usr/bin/perl
use strict;
use warnings;
local $/ = '!';
while ( <DATA> ) {
my ( $interface ) = m/Interface (\w+)/i;
next if $interface =~ m/Embedded-Service-Engine/;
my ( $mode ) = m/mode (\w+)/;
print "$interface $mode\n";
print "---next record---\n";
}
__DATA__
Interface fa1
mode access
port-security
mac-address sticky
!
interface fa2
mode trunk
!
If you need to keep the data for other uses (e.g. 'process as you go' isn't suitable) then the tool for the job is a hash.
You can either use something like the above - and populate the hash with particular keys you're interested in - or use the magic of map to do it for you.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
local $/ = '!';
my %interfaces;
while ( <DATA> ) {
my ( $interface ) = m/Interface (\w+)/i;
next if $interface =~ m/Embedded-Service-Engine/;
my %interface_values = map { my ( $key, $value ) = split; $key, $value || 1 } grep { /\w/ } split ( "\n" );
$interfaces{$interface} = \%interface_values;
}
print Dumper \%interfaces
__DATA__
Interface fa1
mode access
port-security
mac-address sticky
!
interface fa2
mode trunk
!
That map line basically:
splits the current record on \n to get each line.
filters 'not word' values (so blank lines and !)
splits each line on whitepace, to get a key and value pair.
If no value is defined, sets it to 1. (so in the example, port-security )
Populates a hash with these key-value pairs.
and then updates %interfaces with the hash for each interface ID.
Giving something like:
$VAR1 = {
'fa1' => {
'port-security' => 1,
'mode' => 'access',
'Interface' => 'fa1',
'mac-address' => 'sticky'
},
'fa2' => {
'mode' => 'trunk',
'interface' => 'fa2'
}
};

A hash or hashref would be a result where you can work with. Furthermore, reading records based on a fixed structure can be read using a matching regex. Like so:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
our %MATCH;
*MATCH = \%+;
# read file into variable
my ( $file, $data ) = ( 'interfaces.txt', undef );
open( my $fh, '<', $file ) or die "cannot open file $file";
{
local $/;
$data = <$fh>;
}
close($fh);
print Dumper $data;
my $regex = qr{
(?sm)
interface [^\w\n]+
(?<interface> (\w[^\n]+))
[^\w]+
mode [^\w]+
(?<mode> (\w[^\n]+))
[^\w]+
((?<portsecurity> port-security)
[^\w]+)? # port-security is optional
(mac-address [^\w]+
(?<macaddress> (\w[^\n]+))
)? # mac-address is optional
[^!]*
!
}x;
my $results = {};
while ( $data =~ m/$regex/g ) {
my $interface = $MATCH{interface};
$results->{$interface} = { mode => $MATCH{mode} ? $MATCH{mode} : '' };
$results->{$interface}->{'port-security'} = 1
if defined $MATCH{portsecurity};
$results->{$interface}->{macaddress} = $MATCH{macaddress}
if defined $MATCH{macaddress};
}
print Dumper $results;
The result from your input is:
$VAR1 = {
'fa1' => {
'macaddress' => 'sticky',
'mode' => 'access',
'port-security' => 1
},
'fa2' => {
'mode' => 'trunk'
}
};
Having a hash with the interface names as key values, gives you the opportunity to use a 'grep' for the interfaces you want.
If your structure is not fixed - there is no ordering in your fields mode, port-security, mac-address - then you would need to read a interface record in one go, and split up the fields using separate regexes for each field.

This is my Final solution. In this particular case I'm searching for all switchports that have a maximum port-security not equal to 1. This is just an example and can be switched for any configuration. I'm also omitting certain interfaces from being caught if that configuration is actually applied to them.
#!/usr/bin/perl
$MDIR='/currentConfig';
#list of interfaces you don't want to see to filter output
#omit =(
'MANAGEMENT.PORT',
'sup.mgmt',
'Internal.EtherSwitch',
'Router',
'ip address \d',
'STRA'
);
#join with '|' to form the regex
$dontwant = join('|',#omit);
#search criteria
$search='switchport port-security maximum [^1]';
opendir(DIR,$MDIR) or die $!;
#dirContents=readdir DIR;close DIR;
foreach $file (#dirContents) {
open(IN,$MDIR.'/'.$file) or die $!;
#record seperator to !
$/='!';
my #inFile=<IN>; close IN;
#since the record seperator has been changed, '^' won't match beginning of line
my #ints = grep (/\ninterface/i,#inFile);
#set record seperator back to normal
$/="\n";
foreach $int (#ints) {
if ( $int =~ m/$search/i && $int !~ m/$dontwant/) {
push(#finalint,$int);
}
}
}
#just list the interfaces found, i'll use this to make it comma seperated
foreach $elem (#finalint) {
print $elem;
}

Related

Searching Perl array with regex and return single capturing group only

I have a Perl script in which I perform web service calls in a loop. The server returns a multivalued HTTP header that I need to parse after each call with information that I will need to make the next call (if it doesn't return the header, I want to exit the loop).
I only care about one of the values in the header, and I need to get the information out of it with a regular expression. Let's say the header is like this, and I only care about the "foo" value:
X-Header: test-abc12345; blah=foo
X-Header: test-fgasjhgakg; blah=bar
I can get the header values like this: #values = $response->header( 'X-Header' );. But how do I quickly check if
There is a foo value, and
Parse and save the foo value for the next iteration?
Ideally, I'd like to do something like this:
my $value = 'default';
do {
# (do HTTP request; use $value)
#values = $response->header( 'X-Header' );
} while( $value = first { /(?:test-)([^;]+)(?:; blah=foo)/ } #values );
But grep, first (from List::Util), etc. all return the entire match and not just the single capturing group I want. I want to avoid cluttering up my code by looping over the array and matching/parsing inside the loop body.
Is what I want possible? What would be the most compact way to write it? So far, all I can come up with is using lookarounds and \K to discard the stuff I don't care about, but this isn't super readable and makes the regex engine perform a lot of unnecessary steps.
So it seems that you want to catch the first element with a certain pattern, but acquire only the pattern. And you want it done nicely. Indeed, first and grep only pass the element itself.
However, List::MoreUtils::first_result does support processing of its match
use List::MoreUtils 0.406 qw(first_result);
my #w = qw(a bit c dIT); # get first "it" case-insensitive
my $res = first_result { ( /(it)/i )[0] } #w;
say $res // 'undef'; #--> it
That ( ... )[0] is needed to put the regex in the list context so that it returns the actual capture. Another way would be firstres { my ($r) = /(it)/i; $r }. Pick your choice
For the data in the question
use warnings;
use strict;
use feature 'say';
use List::MoreUtils 0.406 qw(firstres);
my #data = (
'X-Header: test-abc12345; blah=foo',
'X-Header: test-fgasjhgakg; blah=bar'
);
if (my $r = firstres { ( /test-([^;]+);\s+blah=foo/ )[0] } #data) {
say $r
}
Prints abc12345, clarified in a comment to be the sought result.
Module versions prior to 0.406 (of 2015-03-03) didn't have firstres (alias first_result)
first { ... } #values returns one the values (or undef).
You could use either of these:
my ($value) = map { /...(...).../ } #values;
my $value = ( map { /...(...).../ } #values ) ? $1 : undef;
my $value = ( map { /...(...).../ } #values )[0];
Using first, it would look like the following, which is rather silly:
my $value = first { 1 } map { /...(...).../ } #values;
However, assuming the capture can't be an empty string or the string 0, List::MoreUtils's first_result could be used to avoid the unnecessary matches:
my $value = first_result { /...(...).../ ? $1 : undef } #values;
my $value = first_result { ( /...(...).../ )[0] } #values;
If the returned value can be false (e.g. an empty string or a 0) you can use something like
my $value = first_result { /...(...).../ ? \$1 : undef } #values;
$value = $$value if $value;
The first_result approach isn't necessarily faster in practice.
Following code snippet is looking for foo stored in a variable $find, the found values is stored in variable $found.
my $find = 'foo';
my $found;
while( $response->header( 'X-Header' ) ) {
if( /X-Header: .*?blah=($find)/ ) {
$found = $1;
last;
}
}
say $found if $found;
Sample demo code
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $find = 'foo';
my $found;
my #header = <DATA>;
chomp(#header);
for ( #header ) {
$found = $1 if /X-Header: .*?blah=($find)/;
last if $found;
}
say Dumper(\#header);
say "Found: $found" if $found;
__DATA__
X-Header: test-abc12345; blah=foo
X-Header: test-fgasjhgakg; blah=bar
Output
$VAR1 = [
'X-Header: test-abc12345; blah=foo',
'X-Header: test-fgasjhgakg; blah=bar'
];
Found: foo

In Perl, how can I store arrays as values in a hash?

I don't know if you're allowed to ask questions about a school assignment. I just want to understand what I am supposed to do, not do this for me. Maybe I am missing something so simple it is right in front of me but anyways it was based off of an older assignment, but I had missed this class and I am running into a wall now, and well the problem is that I'm trying to push an array into a hash table. using something like this;
push #{$hash_table{$hash_key}}, $port
and to count the ports while storing them and then to print the contents of the hash
while ( ($key , $value) = each ( %hash ) ) {
print “$key scanned #{$value}”
}
and if I wanted to sort the results I would use
< foreach $key ( keys ( %hash ) ) {
}
The current code I have is this, searching for the string of /iNext-DROP/
using a log file provided. I cant for the life of me find the correct place to add the code above
use warnings;
my $LogRecord;
my $LogRecordCount;
open LOGFILE, "sample.log.txt" or die "couldn't open sample.log.txt";
while ($LogRecord = <LOGFILE>) {
if ($LogRecord =~ /INext-DROP/) {
$LogRecordCount ++;
$LogRecord =~ /(SRC=[0-9\.]* ).*(SPT=[0-9\.]* )/;
$source=$1;
$sport=$2;
print "$source$sport";
print substr( $LogRecord , 0 , $ARGV[1] ) , "\n" if $ARGV[1];
}
}
print "The file contained $LogRecordCount records" if $ARGV[1];
close LOGFILE;
Here's a picture of the old code with comments ;
Old Code -- not much has changed since I keep going back after it doesn't work
The thing you're seemingly having problems with, is that you're not sure where the port is being captured and the hash update.
What's happening is your while loop is iterating the file one line at a time, and capturing values - that $LogRecord =~ line is capturing a pattern - into $1 and $2.
And then that $2 is the thing you can add to your hash, with push.
However, there's a few things I have changed style wise like using lexical file handles because it's better style.
#!/usr/bin/env perl
use warnings;
use strict;
#because it makes debugging easier.
use Data::Dumper;
my $LogRecordCount;
#declare some hashes;
my %ports_from;
my %ips_that_used;
open my $logfile, "sample.log.txt" or die "couldn't open sample.log.txt";
while (my $line = <$logfile>) {
#matches 'current line' - skips stuff that doesn't match.
next unless $line =~ /INext-DROP/;
#increment count.
$LogRecordCount++;
my ( $source, $src_port ) = $line =~ m/SRC=([0-9\.]+).*SPT=([0-9]+)/;
print "$source$sport";
#not sure what this is doing, so I have left it in.
print substr( $line , 0 , $ARGV[1] ) , "\n" if $ARGV[1];
push #{$ports_from{$source}}, $src_port;
push #{$ips_that_used{$src_port}}, $source;
}
print "The file contained $LogRecordCount records" if $ARGV[1];
close $logfile;
print Dumper \%ports_from;
print Dumper \%ips_that_used;
That's built up your hashes.
But when it comes to outputing:
foreach my $ip ( keys %ports_from ) {
print "$ip: ", join ( " ", #{$ports_from{$ip}}) ,"\n"
}
If you wanted to sort them, you would have to do this using sort.
Now sort is a pretty clever function, but by default is sorts alphanumerically. That's ... actually not all that useful when it comes to IP addresses or port numbers, because you probably want to sort those more numerically. The easy answer is Sort::Naturally and use nsort.
However - sort takes a function (it defaults to cmp) that returns -1, 0, 1 depending on relative position.
So sorting 'by IP' might look like this:
sub by_ip {
my #a = split /\./, $a;
my #b = split /\./, $b;
foreach my $octet ( #a ) {
my $comparison = $octet <=> shift ( #b );
return $comparison if $comparison;
}
return 0;
}
And then you could:
foreach my $ip ( sort by_ip keys %ports_from ) {
print "$ip: ", join ( " ", sort { $a <=> $b } #{$ports_from{$ip}}),"\n";
}
Giving you:
24.64.208.134 : 24128 24128 24128
71.228.199.109 : 37091
72.197.8.56 : 9258
75.117.31.43 : 3122
99.248.20.48 : 48725
207.68.178.56 : 80
Given an IP-to-port mapping where duplicates exist though, it might be better still to merely count port-frequency using a hash-of-hashes instead of a hash-of-arrays.
$count_ports_from{$source}{$src_port}++;
And then:
foreach my $ip ( sort by_ip keys %count_ports_from ) {
print "$ip: ";
foreach my $port_num ( sort { $count_ports_from{$a} <=> $count_ports_from{$b} }
keys %{ $count_ports_from{$ip} } )
{
print "\t $port_num : $count_ports_from{$ip}{$port_num}\n";
}
}
Giving you something like:
24.64.208.134 : 24128 : 3
71.228.199.109 : 37091 : 1
72.197.8.56 : 9258 : 1
75.117.31.43 : 3122 : 1
99.248.20.48 : 48725 : 1
207.68.178.56 : 80 : 1

Dynamically capture regular expression match in Perl

I'm trying to dynamically catch regex matching in Perl. I've known that eval will help me do this but I may be doing something wrong.
Code:
use strict;
use warnings;
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
my $str = '1/12/2016';
foreach my $pattern (keys (%testHash)) {
my $value = $testHash{$pattern};
my $result;
eval {
local $_ = $str;
/$pattern/;
print "\$1 - $1\n";
print "\$2 - $2\n";
print "\$3 - $3\n";
eval { print "$value\n"; }
}
}
Is it also possible to store captured regex patterns in an array?
I believe what you really want is a dynamic version of the following:
say $str =~ s/(\d+)\/(\d+)\/(\d+)/$1$2$3/gr;
String::Substitution provides what we need to achieve that.
use String::Substitution qw( gsub_copy );
for my $pattern (keys(%testHash)) {
my $replacement = $testHash{$pattern};
say gsub_copy($str, $pattern, $replacement);
}
Note that $replacement can also be a callback. This permits far more complicated substitutions. For example, if you wanted to convert 1/12/2016 into 2016-01-12, you could use the following:
'(\d+)/(\d+)/(\d+)' => sub { sprintf "%d-%02d-%02d", #_[3,1,2] },
To answer your actual question:
use String::Substitution qw( interpolate_match_vars last_match_vars );
for my $pattern (keys(%testHash)) {
my $template = $testHash{$pattern};
$str =~ $pattern # Or /$pattern/ if you prefer
or die("No match!\n");
say interpolate_match_vars($template, last_match_vars());
}
I am not completely sure what you want to do here, but I don't think your program does what you think it does.
You are useing eval with a BLOCK of code. That's like a try block. If it dies inside of that eval block, it will catch that error. It will not run your string like it was code. You need a string eval for that.
Instead of explaining that, here's an alternative.
This program uses sprintf and numbers the parameters. The %1$s syntax in the pattern says _take the first argument (1$) and format it as a string (%s). You don't need to localize or assign to $_ to do a match. The =~ operator does that on other variables for you. I also use qr{} to create a quoted regular expression (essentially a variable containing a precompiled pattern) that I can use directly. Because of the {} as delimiter, I don't need to escape the slashes.
use strict;
use warnings;
use feature 'say'; # like print ..., "\n"
my %testHash = (
qr{(\d+)/(\d+)/(\d+)} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d+) nomatch} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d\d\d\d)} => '%3$4d-%2$02d-%1$02d',
qr{\d} => '%s', # no capture group
);
my $str = '1/12/2016';
foreach my $pattern ( keys %testHash ) {
my #captures = ( $str =~ $pattern );
say "pattern: $pattern";
if ($#+ == 0) {
say " no capture groups";
next;
}
unless (#captures) {
say " no match";
next;
}
# debug-output
for my $i ( 1 .. $#- ) {
say sprintf " \$%d - %s", $i, $captures[ $i - 1 ];
}
say sprintf $testHash{$pattern}, #captures;
}
I included four examples:
The first pattern is the one you had. It uses %1$s and so on as explained above.
The second one does not match. We check the number of elements in #captured by looking at it in scalar context.
The third one shows that you can also reorder the result, or even use the sprintf formatting.
The last one has no capture group. We check by looking at the index of the last element ($# as the sigil for arrays that usually have an # sigil) in #+, which holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. The first element is the end of the overall match, so if this only has one element, we don't have capture groups.
The output for me is this:
pattern: (?^:(\d+)/(\d+)/(\d\d\d\d))
$1 - 1
$2 - 12
$3 - 2016
2016-12-01
pattern: (?^:(\d+)/(\d+)/(\d+) nomatch)
no match
pattern: (?^:\d)
no capture groups
pattern: (?^:(\d+)/(\d+)/(\d+))
$1 - 1
$2 - 12
$3 - 2016
1.12.2016
Note that the order in the output is mixed up. That's because hashes are not ordered in Perl, and if you iterate over the keys in a hash without sort the order is random.
Apologies! I realized both my question and sample code were both vague. But after reading your suggestions I came of with the following code.
I haven't optimized this code yet and there is a limit to the replacement.
foreach my $key (keys %testHash) {
if ( $str =~ $key ) {
my #matchArr = ($str =~ $key); # Capture all matches
# Search and replace (limited from $1 to $9)
for ( my $i = 0; $i < #matchArr; $i++ ) {
my $num = $i+1;
$testHash{$key} =~ s/\$$num/$matchArr[$i]/;
}
$result = $testHash{$key};
last;
}
}
print "$result\n";
Evaluing the regexp in list context returns the matches. so in your example:
use Data::Dumper; # so we can see the result
foreach my $pattern (keys (%testHash)) {
my #a = ($str =~/$pattern/);
print Dumper(\#a);
}
would do the job.
HTH
Georg
Is it also possible to store captured regex patterns in an array?
Of course it is possible to store captured substrings in an array:
#!/usr/bin/env perl
use strict;
use warnings;
my #patterns = map qr{$_}, qw{
(\d+)/(\d+)/(\d+)
};
my $str = '1/12/2016';
foreach my $pattern ( #patterns ) {
my #captured = ($str =~ $pattern)
or next;
print "'$_'\n" for #captured;
}
Output:
'1'
'12'
'2016'
I do not quite understand what you are trying to do with combinations of local, eval EXPR and eval BLOCK in your code and the purpose of the following hash:
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
If you are trying to codify that this pattern should result in three captures, you can do that like this:
my #tests = (
{
pattern => qr{(\d+)/(\d+)/(\d+)},
ncaptures => 3,
}
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern})
or next;
unless (#captured == $test->{ncaptures}) {
# handle failure
}
}
See this answer to find out how you can automate counting the number of capture groups in a pattern. Using the technique in that answer:
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
my #tests = map +{ pattern => qr{$_}, ncaptures => number_of_capturing_groups($_) }, qw(
(\d+)/(\d+)/(\d+)
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern});
ok #captured == $test->{ncaptures};
}
done_testing;
sub number_of_capturing_groups {
"" =~ /|$_[0]/;
return $#+;
}
Output:
ok 1
1..1

How can I match these function calls, and extract the nmae of the function and the first argument?

I am trying to parse an array of elements. Those who match a pattern like the following:
Jim("jjanson", Customer.SALES);
I want to create a hash table like Jim => "jjanson"
How can I do this?
I can not match the lines using:
if($line =~ /\s*[A-Za-z]*"(.*),Customer.*\s*/)
You're not matching either the '(' after the name, nor the ' ' after the comma, before "Customer.".
I can get 'jjanson"' using this expression:
/\s*[A-Za-z]\(*"(.*), Customer.*\s*/
But I assume you don't want jjanson", so we need to modify it like so. (I tend to include the negative character class when I'm looking for simply-delimited stuff. So, in this case I'll make the expression "[^"]*"
/\s*[A-Za-z]\(*"([^"]+)", Customer.*\s*/
Also, I try not to depend upon whitespace, presence or number, I'm going to replace the space with \s*. That you didn't notice that you skipped the whitespace is a good illustration of the need to say "ignore a bunch of whitespace".
/\s*[A-Za-z]\(*"([^"]+)",\s*Customer.*\s*/
Now it's only looking for the sequence ',' + 'Customer' in the significant characters. Functionally, the same, if more flexible.
But since you only do one capture, I can't see what you'd map to what. So I'll do my own mapping:
my %records;
while ( my $line = $source->()) { # simply feed for a source of lines.
my ( $first, $user, $tag )
= $line = m/\s*(\p{Alpha}+)\s*\(\s*"([^"]+)",\s*Customer\.(\S+?)\)\/
;
$records{ $user }
= { first => $first
, username => $user
, tag => $tag
};
}
This is much more than you would tend to need in a one-off, quick solution. But I like to store as much of my input as seems significant.
Note that Jim("jjanson", Customer.SALES); matches the syntax of a function call with two arguments. You can thus abuse string eval:
#!/usr/bin/env perl
use strict;
use warnings;
use YAML::XS;
my $info = extract_first_arg(q{ Jim("jjanson", Customer.SALES);} );
print Dump $info;
sub extract_first_arg {
my $call = shift;
my ($name) = ($call =~ m{ \A \s* (\w+) }x);
unless ($name) {
warn "Failed to find function name in '$call'";
return;
}
my $username = eval sprintf(q{
package My::DangerZone;
no strict;
local *{ %s } = sub { $_[0] };
%s
}, $name, $call);
return { $name => $username };
}
Output:
---
Jim: jjanson
Or, you can abuse autoloading:
our $AUTOLOAD;
print Dump eval 'no strict;' . q{ Jim("jjanson", Customer.SALES); };
sub AUTOLOAD {
my ($fn) = ($AUTOLOAD =~ /::(\w+)\z/);
return { $fn => $_[0] };
}
I would not necessarily recommend using these methods, especially on input that is not in your control, and in a situation where this script has access to sensitive facilities.
On the other hand, I have, in the right circumstances, utilized this kind of thing in transforming one given set of information into something that can be used elsewhere.
Try this:
$line = 'Jim("jjanson", Customer.SALES)';
my %hashStore = (); #Jim("jjanson"
if($line=~m/^\s*([^\(\)]*)\(\"([^\"]*)\"/g) { $hashStore{$1} = $2; }
use Data::Dumper;
print Dumper \%hashStore;
Output:
$VAR1 = {
'Jim' => 'jjanson'
};

A Better Regex Solution in Perl?

Here's my problem:
I have text files with five columns. The last always has a single digit. Backslashes are illegal in the first three. Spaces may show up in the first column. I remove everything after the last # in the first column. The columns are separated by spaces. I can set the column width to pretty much any value I want, giving me some control as to the spacing between columns.
So, I might have something like this:
D Smith Application Database Read 2
I have code that transforms it into this:
grant read on database 'Application'.'Database' to 'D Smith';
Here is the Regex code I have created to delimit each field and avoid confusing any spaces in the first field from the delimiting spacing.
while (<>) {
s/^ //m;
if (/^([^\\]+?)( {80,})/) {
my $atindex = rindex($1,"#",);
my $username = substr($1,0,$atindex);
if ($atindex != -1) {
s/^([^\\]+?)( {80,})/$username $2/m;
s/ {2,}/ \\ \\ /g;
s/\\ \d$//gm;
s/ \\ $//gm;
}
}
What this does is make \\ \\ the delimiter between fields. Then I use this code for the transformation:
if (/([^\\]+) \\ \\ ([^\\]+) \\ \\ ([^\\]+) \\ \\ ([^\\]+)\n/) {
if ($4 eq "any") {
my $execany = "execute any";
print "grant $execany on database '$2'.'$3' to user '$1';\n";
} else {
print "grant $4 on database '$2'.'$3' to user '$1';\n";
}
I'm doing this because I couldn't figure out a way to discern the spaces between the fields from the spaces that might occur in the first field. Is there a better way? This works sufficiently quickly, but it's not elegant.
Are the columns constant width? If so, skip the regular expression and simply use substr:
Data Format
D Smith Application Database Read 2
012345678901234567890123456789012345678901234567890
Program
use strict;
use warnings;
use feature qw(say);
while ( my $line = <> ) {
chomp $line;
( my $user = substr( $line, 0, 10 )) =~ s/\s*$//;
( my $file = substr( $line, 12, 15 )) =~ s/\s*$//;
( my $db = substr( $line, 28, 12 )) =~ s/\s*$//;
( my $op = substr( $line, 41, 9 )) =~ s/\s*$//;
( my $num = substr ( $line, 50 )) =~ s/\s*$//;
say qq(User = "$user", File = "$file", DB = "$db", OP = "$op", NUM = "$num");
}
The s/\s*$//; trims the right side of the string removing white space.
If you don't want to use all of those substrings, and only your first field might have a space in it, then you can use substr to split out that first field, and split on the rest of the fields:
while ( my $line = <> ) {
chomp $line;
( my $user = substr( $line, 0, 10 ) ) =~ s/\s*$//;
my ( $file, $db, $op, $num ) = split /\s+/, substr( $line, 12 );
....
}
Another Solution
Are the columns constant width? ... Nice solution. unpack could also be used with constant widths. – Kenosis
Let's use unpack!
while ( my $line = <> ) {
chomp $line;
my ( $user, $file, $db, $op, $num ) = unpack ("A12A16A13A9A*", $line);
say qq(User = "$user", File = "$file", DB = "$db", OP = "$op", NUM = "$num");
}
Yes, that's easy to understand. At least I don't have to right trim my strings like I did with substr. See the pack/unpack tutorial.
As I describe in the comments to your question, as long as you can ensure that two simple assumptions are valid, you have no need for a lot of complicated hairy regexing. Those assumptions are:
that, for every pair of columns, at least two spaces separate the end of the value in the first column, and the beginning of the value in the second;
that no column's value contains a string of two or more spaces.
(If you can't guarantee those assumptions for a separator consisting of two or more spaces, perhaps you can for three or more, or four or more, &c. You're better off delimiting your columns with something that you can be certain will never appear in any value, but absent that, rules like these are the best you can hope to do.)
Given those assumptions, you can just split() the string on substrings of two or more spaces, something like this:
while (<>) {
$_ =~ s#^\s+##;
my #fields = split(/\s{2,}/, $_);
# print your commands, interpolating values from #fields
}
Or, more simply and readably still, you can do something like this:
while (my $line = <STDIN>) {
# the same leading-space cleanup and split...
$line =~ s#^\s+##;
my #fields = split(/\s{2,}/, $line);
# ...and then we assign values to a hash with meaningful keys...
my %values = ('user' => $fields[0],
'application' => $fields[1],
'database' => $fields[2],
'permission' => (lc($fields[3]) eq 'any'
? 'execany'
: $fields[3]));
# ...so that our interpolation and printing becomes much more
# readable.
print "grant $values{'permission'}"
. " on database '$values{'application'}'.'$values{'database'}"
. " to user '$values{'user'}';"
. "\n";
};
You'd do well also to add some validity checking, i.e. make sure all the values you expect in a given row are present and correctly formatted and emit some useful notice, or just die() outright, if they're not.
To match lines like this:
D Smith Application Database Read 2
F J Perl Foobar Database2 Write 4
Something Whatever Database3 Any 1
into the relevant columns 1 to 5, where column 1 can contain spaces, anchor on end-of-line ($):
while (<>) {
next unless /^\s*(.+?)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)$/;
my $grant_type = $4;
$grant_type = 'execute any' if lc $grant_type eq 'any';
print "grant $grant_type on '$2'.'$3' to '$1'\n";
}
result:
grant Read on 'Application'.'Database' to 'D Smith'
grant Write on 'Foobar'.'Database2' to 'F J Perl'
grant execute any on 'Whatever'.'Database3' to 'Something'
Given you have two+ spaces between fields, perhaps the following will be helpful:
use strict;
use warnings;
while (<>) {
my ( $user, $app, $db, $perm ) = grep $_, split /\s{2,}/;
$perm = 'execute any' if lc $perm eq 'any';
print "grant $perm on database '$app'.'$db' to user '$user';\n";
}
You can omit the initial-space substitution by grepping the result of split. $perm is changed only if it's any after the split.
As you say only the first column contains spaces we can use split to break up the columns,
and splice to remove the last four... Then just use string interpolation to re-constitute
the first column - no complex repular expressions required, no assumptions about fixed
column spacing and no assumptions about double spacing.. Probably want to add some more
validity checks (make sure values are valid)
use strict;
use Const::Fast qw(const);
const my $N => 4;
while(<>){
## Split the string on spaces...
chomp;
my #Q = split;
next if #Q <= $N;
## And remove the last four columns...
my ($app,$db,$perm,$flag) = splice #Q,-$N,$N;
## Sort out name and perm...
( my $user = "#Q" ) =~ s{#[^#]+}{}mxs;
$perm = 'execute any' if 'any' eq lc $perm;
## Print out statement... using named variables makes life easier!
print "grant $perm on database '$app'.'$db' to user '$user';\n";
}