Perl regex to extract multiple matches from string - regex

I have a string for example
id:123,createdby:'testuser1',"lastmodifiedby":'testuser2'.....
I want to extract the two user names (testuser1, testuser2) and save them to an array.

You don't need to do everything in one pattern. Do something simple in multiple matches:
my $string = qq(id:123,createdby:'testuser1',"lastmodifiedby":'testuser2');
my( $created_by ) = $string =~ /,createdby:'(.*?)'/;
my( $last_modified_by ) = $string =~ /,"lastmodifiedby":'(.*?)'/;
print <<"HERE";
Created: $created_by
Last modified by: $last_modified_by
HERE
But, this looks like comma-separated data, and the data that you show are inconsistently quoted. I don't know if that's from you typing it out or it's your actual data.
But it also looks like it might have come from JSON. If that's true, there are much better ways to extract the data.
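If the data really is JSON, a minimal sketch could look like the following. It assumes the input is a complete, valid JSON object (the fragment in the question is not), and the helper name user_names is made up for illustration. JSON::PP has shipped with Perl since 5.14.

```perl
use strict;
use warnings;
use JSON::PP;    # core module since Perl 5.14; exports decode_json

# Assumption: the real input is a valid JSON object, unlike the
# inconsistently quoted fragment shown in the question.
my $json = '{"id":123,"createdby":"testuser1","lastmodifiedby":"testuser2"}';

# Hypothetical helper: return the two user names from a JSON object string.
sub user_names {
    my ($text) = @_;
    my $record = decode_json($text);
    # Hash slice on the decoded hash reference
    return @{$record}{qw(createdby lastmodifiedby)};
}

my @users = user_names($json);
print "@users\n";
```

Decoding first and then picking keys avoids having to guess at quoting rules in a regex.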

Try this
use strict;
use warnings;
my $string = q[id:123,createdby:'testuser1',"lastmodifiedby":'testuser2'....];
my @matches = ($string =~ /,createdby:'(.+?)',"lastmodifiedby":'(.+?)'/) ;
print "@matches\n";
Outputs
testuser1 testuser2
The requirements then changed to allow for missing fields. To deal with that, try this:
use strict;
use warnings;
my $string1 = q[id:123,createdby:'testuser1',"lastmodifiedby":'testuser2'....];
my $string2 = q[id:123,createdby:'testuser1'....] ;
for my $s ($string1, $string2)
{
    my @matches = ( $s =~ /(?:createdby|"lastmodifiedby"):'(.+?)'/g ) ;
    print "@matches\n";
}
Outputs
testuser1 testuser2
testuser1

The problem description does not give enough detail, and the quoting inside the string is inconsistent.
As already stated, the string may be part of a JSON block, in which case it should be handled by other means. Perhaps this assumption is correct, but it is not clearly stated in the question.
Please read How do I ask a good question? and How to create a Minimal, Reproducible Example.
Otherwise, I assume that the quoting is just a typing error. A bigger data sample and a better problem description would significantly improve the question.
The following code sample demonstrates one possible approach to getting the desired result. It assumes that the data fields do not include , or : (otherwise a different approach to processing the data is needed).
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my($str,%data,@arr);
$str = "id:123,createdby:'testuser1','lastmodifiedby':'testuser2'";
$str =~ s/'//g;
%data = split(/[:,]/,$str);
say Dumper(\%data);
@arr = ($data{createdby},$data{lastmodifiedby});
say Dumper(\@arr);
Output
$VAR1 = {
'id' => '123',
'createdby' => 'testuser1',
'lastmodifiedby' => 'testuser2'
};
$VAR1 = [
'testuser1',
'testuser2'
];
Another approach could be as follows:
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my($str,$re,@data,@arr);
$str = "id:123,createdby:'testuser1',\"lastmodifiedby\":'testuser2'";
@data = split(',',$str);
$re = qr/(createdby|lastmodifiedby)/;
for ( @data ) {
    next unless /$re/;
    s/['"]//g;
    my($k,$v) = split(':',$_);
    push @arr, $v;
}
say Dumper(\@arr);
Output
$VAR1 = [
'testuser1',
'testuser2'
];

Related

Split a comma separated list where commas in text aren't escaped

I'm working with legacy data which is usually in the format:
QID RESPONSE
However on some occasions the response contains multiple values of different types:
01320 2,35,6,"warm"
I have tried using
my @dataRowAsList = split('\t', $_);
my $questionID = $dataRowAsList[0];
my $response = substr($dataRowAsList[1],0,-2);
my @thisResponse = split(',', $response);
in the relevant cases to split the row into question and response, and then each response into its component parts.
However, I've just discovered this type of case:
01320 2,35,6,"warm,windy"
The comma in quotes is not escaped.
Is there a neat way to parse this into its components?
2
35
6
"warm,windy"
Quick example of Text::CSV usage with reading from a string:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use Text::CSV;
my $str = q/01320 2,35,6,"warm,windy"/;
my $csv = Text::CSV->new({auto_diag => 2});
my @fields = split " ", $str, 2;
say '$fields[0] is ', $fields[0];
say '$fields[1] is ', $fields[1];
say 'Parsed out $fields[1] is:';
$csv->parse($fields[1]);
say for $csv->fields;
Running this will produce:
$fields[0] is 01320
$fields[1] is 2,35,6,"warm,windy"
Parsed out $fields[1] is:
2
35
6
warm,windy
This is a non-core module, so you'll have to install it with your favorite CPAN client or your OS's package manager. If doing so doesn't automatically also install Text::CSV_XS, you'll probably want to install that as well to get an optimized implementation that Text::CSV will automatically use if present.
In your case I would use a regex and pick out the capture groups I need. Here is an example; I hope it helps:
use warnings;
use strict;
my $string = '01320 2,35,6,"warm,windy"';
if ($string =~ /^(\d+)\s+(\d+),(\d+),(\d+),(\S+)$/) {
print "$1\n$2\n$3\n$4\n$5\n\n";
}

Pattern match in perl

my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
my $name = "";
@name = ( $line =~ m/Name:([\w\s\_\,/g );
foreach (@name) {
print $name."\n";
}
I want to capture the text between Name: and ,Region wherever it occurs in the line. The main complication is that the name can be in any format:
Amanda_Marry_Rose
Amanda.Marry.Rose
Amanda Marry Rose
Amanda/Marry/Rose
I need a help in capturing such a pattern every time it occurs in the line. So for the line I provided, the output should be
Amanda_Marry_Rose
Raghav.S.Thomas
Does anyone have any idea how to do this? I tried the line below, but it gives me the wrong output:
@name=($line=~m/Name:([\w\s\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\´]+)\,/g);
Output
Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE
To capture between Name: and the first comma, use a negated character class:
/Name:([^,]+)/g
This says to match one or more characters following Name: which isn't a comma:
while (/Name:([^,]+)/g) {
print $1, "\n";
}
This is more efficient than a non-greedy quantifier, e.g:
/Name:(.+?),/g
As it doesn't require backtracking.
Reg-ex corrected:
my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
my @name = ($line =~ /Name\:([\w\s_.\/]+)\,/g);
foreach my $name (@name) {
print $name."\n";
}
What you have there is comma-separated data. How you should parse it depends a lot on your data. If it is full-fledged CSV data, the safest approach is to use a proper CSV parser, such as Text::CSV. If it is less strict data, you can get away with using the lightweight parser Text::ParseWords, which also has the benefit of being a core module in Perl 5. If what you have here is rather basic, user-entered fields, then I would recommend split, simply because when you know the delimiter, it is easier and safer to define it than to define everything else inside it.
use strict;
use warnings;
use Data::Dumper;
my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
# Simple split
my @fields = split /,/, $line;
print Dumper map /^Name:(.*)/, @fields;
use Text::ParseWords;
print Dumper map /^Name:(.*)/, quotewords(',', 0, $line);
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
});
$csv->parse($line);
print Dumper map /^Name:(.*)/, $csv->fields;
Each of these options gives the same output, except for the one that uses Text::CSV, which also issues an undefined-value warning, quite correctly, because your data has a trailing comma (meaning an empty field at the end).
Each of these has different strengths and weaknesses. Text::CSV can choke on data that does not conform with the CSV format, and split cannot handle embedded commas, such as Name:"Doe, John",....
The regex we use to extract the names simply captures the rest of each field that begins with Name:. This also allows you to perform sanity checks on the field names, for example to issue a warning if you suddenly find a field called Doe;Name:.
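As a rough sketch of such a sanity check, the following uses quotewords so that quoted embedded commas survive, and warns about any field name that is not on a whitelist. The whitelist contents and the helper name names_with_checks are assumptions made up for this example, not part of the original data description.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);    # core module

# Assumption: these are the only field names the data should contain.
my %known = map { $_ => 1 } qw(Name Region host cardType product);

# Hypothetical helper: return all names, warning about unexpected field names.
sub names_with_checks {
    my ($line) = @_;
    my @names;
    for my $field ( quotewords( ',', 0, $line ) ) {
        # The trailing comma produces an undefined last field; skip it
        next unless defined $field && length $field;
        my ( $key, $value ) = split /:/, $field, 2;
        warn "Unexpected field name '$key'\n" unless $known{$key};
        push @names, $value if $key eq 'Name';
    }
    return @names;
}

my $line = 'Name:Amanda_Marry_Rose,Region:US,host:USE,Name:"Doe, John",Region:UAE,';
print "$_\n" for names_with_checks($line);
```

Note that quotewords also treats single quotes as quoting characters, so a name containing an unbalanced apostrophe would need different handling.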
The simple way is to look for all sequences of non-comma characters after every instance of Name: in the string.
use strict;
use warnings;
my $line = 'Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,';
my @names = $line =~ /Name:([^,]+)/g;
print "$_\n" for @names;
output
Amanda_Marry_Rose
Raghav.S.Thomas
However, it may well be useful to parse the data into an array of hashes so that related fields are gathered together.
use strict;
use warnings;
my $line = 'Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,';
my %info;
my @persons;
while ( $line =~ / ([a-z]+) : ([^:,]+) /gix ) {
    my ($key, $val) = (lc $1, $2);
    # A repeated key marks the start of the next person's record
    if ($info{$key}) {
        push @persons, { %info };
        %info = ();
    }
    $info{$key} = $val;
}
push @persons, { %info };
use Data::Dump;
dd \@persons;
print "\nNames:\n";
print "$_\n" for map $_->{name}, @persons;
output
[
{
cardtype => "DebitCard",
host => "USE",
name => "Amanda_Marry_Rose",
product => "Satin",
region => "US",
},
{
name => "Raghav.S.Thomas",
region => "UAE",
},
]
Names:
Amanda_Marry_Rose
Raghav.S.Thomas

Perl special variables for regex matches

I'd like to use one of Perl's special variables to make this snippet a bit less large and ugly:
my $mysqlpass = "mysqlpass=verysecret";
$mysqlpass = first { /mysqlpass=/ } @vars;
$mysqlpass =~ s/mysqlpass=//;
I have looked this up and tried several special variables ($', $1, $`, etc.) to no avail.
A s/// will return true if it replaces something.
Therefore, it is possible to simply combine those two statements instead of having a redundant m//:
use strict;
use warnings;
use List::Util qw(first);
chomp(my @vars = <DATA>);
my $mysqlpass = first { s/mysqlpass=// } @vars;
print "$mysqlpass\n";
__DATA__
mysqluser=notsosecret
mysqlpass=verysecret
mysqldb=notsecret
Outputs:
verysecret
One Caveat
Because $_ is an alias to the original data structure, the substitution will affect the values in @vars as well.
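If the original list must stay untouched, one option is to capture the value with a match instead of substituting. This is only a sketch; the helper name value_for is hypothetical and the data is inlined rather than read from DATA.

```perl
use strict;
use warnings;

my @vars = ( 'mysqluser=notsosecret', 'mysqlpass=verysecret', 'mysqldb=notsecret' );

# Hypothetical helper: return the value stored under $key, or nothing if
# the key is absent. A plain match only reads each element, so the
# caller's list is left unchanged.
sub value_for {
    my ( $key, @pairs ) = @_;
    for my $pair (@pairs) {
        return $1 if $pair =~ /^\Q$key\E=(.*)/s;
    }
    return;
}

my $mysqlpass = value_for( 'mysqlpass', @vars );
print "$mysqlpass\n";
print "$vars[1]\n";    # the original element is still 'mysqlpass=verysecret'
```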
Alternative using split
To avoid that, I would ask whether @vars contains nothing but key-value pairs separated by equals signs. If that's the case, then I would suggest simply translating the array into a hash instead.
This would enable much easier pulling of all keys:
use strict;
use warnings;
chomp(my @vars = <DATA>);
my %vars = map {split '=', $_, 2} @vars;
print "$vars{mysqlpass}\n";
__DATA__
mysqluser=notsosecret
mysqlpass=verysecret
mysqldb=notsecret
Outputs:
verysecret
Yes, you can use a regular expression for this, if you really want to go down the path of obfuscation.
See the following code:
my $string = "mysqlpass=verysecret";
if ($string =~ /^(\w+)\=(\w+)$/) {
print $1; # This stores 'mysqlpass'
print $2; # This stores 'verysecret'
}
My recommendation against this though, is that you want your code to be readable.
The one you're looking for is $_.

How do I split a string into an array by comma but ignore commas inside double quotes?

I have a line:
$string = 'Paul,12,"soccer,baseball,hockey",white';
I am trying to split this into an @array with 4 values, so that
print $array[2];
Gives
soccer,baseball,hockey
How do I do this? Help!
Just use Text::CSV. As you can see from the source, getting CSV parsing right is quite complicated:
sub _make_regexp_split_column {
my ($esc, $quot, $sep) = @_;
if ( $quot eq '' ) {
return qr/([^\Q$sep\E]*)\Q$sep\E/s;
}
qr/(
\Q$quot\E
[^\Q$quot$esc\E]*(?:\Q$esc\E[\Q$quot$esc\E0][^\Q$quot$esc\E]*)*
\Q$quot\E
| # or
[^\Q$sep\E]*
)
\Q$sep\E
/xs;
}
The standard module Text::ParseWords will do this as well.
use Text::ParseWords;
my @array = parse_line(q{,}, 0, $string);
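For completeness, here is the same idea as a small self-contained script, using the string from the question. The second argument to parse_line is the "keep" flag; passing 0 strips the quotes from the returned fields.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);    # core module

my $string = 'Paul,12,"soccer,baseball,hockey",white';

# Delimiter pattern, keep flag (0 = drop the quotes), line to parse
my @array = parse_line( ',', 0, $string );

print scalar(@array), " fields\n";
print "$array[2]\n";
```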
In response to how to do it with Text::CSV(_PP). Here is a quick one.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_PP;
my $parser = Text::CSV_PP->new();
my $string = "Paul,12,\"soccer,baseball,hockey\",white";
$parser->parse($string);
my @fields = $parser->fields();
print "$_\n" for @fields;
Normally one would install Text::CSV or Text::CSV_PP through the cpan utility.
To work around your not being able to install modules, I suggest you use the 'pure Perl' implementation so that you can 'install' it. The above example would work assuming you copied the text of Text::CSV_PP source into a file named CSV_PP.pm in a folder called Text created in the same directory as your script. You could also put it in some other location and use the use lib 'directory' method as discussed previously. See here and here to see other ways to get around install restriction using CPAN modules.
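A rough sketch of the use lib variant follows. The directory layout is an assumption made up for this example: a lib folder next to the script, into which the module file would be copied as lib/Text/CSV_PP.pm. FindBin is a core module.

```perl
use strict;
use warnings;
use FindBin qw($Bin);    # $Bin is the directory containing the running script
use lib "$Bin/lib";      # assumption: lib/Text/CSV_PP.pm sits next to the script

# Confirm that the private directory is now part of the module search path
print "Module search path now includes $Bin/lib\n"
    if grep { $_ eq "$Bin/lib" } @INC;

# With the file in place, the module would then load as usual:
# use Text::CSV_PP;
```

Using FindBin keeps the path relative to the script rather than to the current working directory.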
Use this regex: m/("[^"]+"|[^,]+)(?:,\s*)?/g;
The regular expression above globally matches each field, which is either a double-quoted string (possibly containing commas) or a run of non-comma characters, and then skips the following comma and any whitespace after it.
Here is a sample code and the corresponding output.
my $string = "Word1, Word2, \"Commas, inbetween\", Word3, \"Word4Quoted\", \"Again, commas, inbetween\"";
my @arglist = $string =~ m/("[^"]+"|[^,]+)(?:,\s*)?/g;
print "$_\n" for @arglist;
Here is the output:
Word1
Word2
"Commas, inbetween"
Word3
"Word4Quoted"
"Again, commas, inbetween"
Try this:
@array = ($string =~ /^([^,]*)[,]([^,]*)[,]["]([^"]*)["][,]([^']*)$/);
The array will contain the output you expect.
use strict;
use warnings;
#use Data::Dumper;
my $string = qq/Paul,12,"soccer,baseball,hockey",white/;
#split string into three parts
my ($st1, $st2, $st3) = split(/,"|",/, $string);
#output: st1:Paul,12 st2:soccer,baseball,hockey st3:white
#split $st1 into two parts
my ($st4, $st5) = split(/,/,$st1);
#push records into array
push (my @test,$st4, $st5,$st2, $st3 ) ;
#print Dumper \@test;
print "$test[2]\n";
output:
soccer,baseball,hockey
#$VAR1 = [
# 'Paul',
# '12',
# 'soccer,baseball,hockey',
# 'white'
# ];
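$string = "Paul,12,\"soccer,baseball,hockey\",white";
# Temporarily replace commas inside quotes with the placeholder aaa
1 while ($string =~ s#"(.*?),(.*?)"#"$1aaa$2"#g);
# Split on the remaining commas, then restore the commas and drop the quotes
@array = map { s/aaa/,/g; s/"//g; $_ } split(/,/, $string);
$" = "\n";
print "$array[2]";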

How can I find all matches to a regular expression in Perl?

I have text in the form:
Name=Value1
Name=Value2
Name=Value3
Using Perl, I would like to match /Name=(.+?)/ every time it appears and extract the (.+?) and push it onto an array. I know I can use $1 to get the text I need and I can use =~ to perform the regex matching, but I don't know how to get all matches.
A m//g in list context returns all the captured matches.
#!/usr/bin/perl
use strict; use warnings;
my $str = <<EO_STR;
Name=Value1
Name=Value2
Name=Value3
EO_STR
my @matches = $str =~ /=(\w+)/g;
# or my @matches = $str =~ /=([^\n]+)/g;
# or my @matches = $str =~ /=(.+)$/mg;
# depending on what you want to capture
print "@matches\n";
However, it looks like you are parsing an INI-style configuration file. In that case, I would recommend Config::Std.
my @values;
while(<DATA>){
    chomp;
    push @values, /Name=(.+?)$/;
}
print join " " => @values,"\n";
__DATA__
Name=Value1
Name=Value2
Name=Value3
The following will give all the matches to the regex in an array.
push(@matches, $1) while $string =~ /=(.+)$/mg;
Use a Config:: module to read configuration data. For something simple like that, I might reach for ConfigReader::Simple. It's nice to stay out of the weeds whenever you can.
Instead of using a regular expression you might prefer trying a grammar engine like:
Parse::RecDescent
Regexp::Grammars
I've given a snippet of a Parse::RecDescent answer before on SO. However, Regexp::Grammars looks very interesting and is influenced by Perl 6 rules and grammars.
So I thought I'd have a crack at Regexp::Grammars ;-)
use strict;
use warnings;
use 5.010;
my $text = q{
Name=Value1
Name = Value2
Name=Value3
};
my $grammar = do {
use Regexp::Grammars;
qr{
<[VariableDeclare]>*
<rule: VariableDeclare>
<Var> \= <Value>
<token: Var> Name
<rule: Value> <MATCH= ([\w]+) >
}xms;
};
if ( $text =~ $grammar ) {
    my @Name_values = map { $_->{Value} } @{ $/{VariableDeclare} };
    say "@Name_values";
}
The above code outputs Value1 Value2 Value3.
Very nice! The only caveat is that it requires Perl 5.10 and that it may be overkill for the example you provided ;-)
/I3az/