Pattern match in perl - regex

my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
my $name = "";
@name = ( $line =~ m/Name:([\w\s\_\,/g );
foreach (@name) {
print $name."\n";
}
I want to capture the text between Name: and ,Region wherever it occurs in the whole line. The catch is that the name can be in any format:
Amanda_Marry_Rose
Amanda.Marry.Rose
Amanda Marry Rose
Amanda/Marry/Rose
I need help capturing this pattern every time it occurs in the line. For the line I provided, the output should be
Amanda_Marry_Rose
Raghav.S.Thomas
Does anyone have any idea how to do this? I tried the line below, but it gives me the wrong output.
@name=($line=~m/Name:([\w\s\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\´]+)\,/g);
Output
Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE

To capture between Name: and the first comma, use a negated character class:
/Name:([^,]+)/g
This matches one or more characters after Name: that are not commas:
while (/Name:([^,]+)/g) {
print $1, "\n";
}
This is more efficient than a non-greedy quantifier such as
/Name:(.+?),/g
because it doesn't require backtracking.
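To see the two patterns side by side, here is a minimal sketch. The sub names names_negated and names_lazy are made up for this comparison; the sample line is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this comparison.

# Negated character class: capture everything up to the first comma.
sub names_negated {
    my ($line) = @_;
    return $line =~ /Name:([^,]+)/g;
}

# Non-greedy quantifier: grow the capture until a comma follows.
sub names_lazy {
    my ($line) = @_;
    return $line =~ /Name:(.+?),/g;
}

my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
print "negated: $_\n" for names_negated($line);
print "lazy:    $_\n" for names_lazy($line);
```

Both list the same two names for this line. One difference to keep in mind: the non-greedy form needs a comma after the name, so it would miss a name at the very end of a line that has no trailing comma.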

Reg-ex corrected:
my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
my @name = ($line =~ /Name\:([\w\s_.\/]+)\,/g);
foreach my $name (@name) {
print $name."\n";
}

What you have there is comma-separated data. How you should parse it depends a lot on your data. If it is full-fledged CSV data, the safest approach is to use a proper CSV parser, such as Text::CSV. If it is less strict data, you can get away with the lightweight parser Text::ParseWords, which also has the benefit of being a core module in Perl 5. If what you have here is rather basic, user-entered fields, then I would recommend split, simply because when you know the delimiter, it is easier and safer to define the delimiter than everything else around it.
use strict;
use warnings;
use Data::Dumper;
my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
# Simple split
my @fields = split /,/, $line;
print Dumper map /^Name:(.*)/, @fields;
use Text::ParseWords;
print Dumper map /^Name:(.*)/, quotewords(',', 0, $line);
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
});
$csv->parse($line);
print Dumper map /^Name:(.*)/, $csv->fields;
Each of these options gives the same output, except that the Text::CSV version also issues an undefined-value warning, quite correctly, because your data has a trailing comma (meaning an empty field at the end).
Each of these has different strengths and weaknesses. Text::CSV can choke on data that does not conform to the CSV format, and split cannot handle embedded commas, such as Name:"Doe, John",....
The regex we use to extract the names simply captures the rest of every field that begins with Name:. This also allows you to perform sanity checks on the field names, for example issuing a warning if you suddenly find a field called Doe;Name:
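A sanity check of that kind could look roughly like this. It is only a sketch: the sub name names_with_check, the warning text, and the rule that a field name consists of word characters are my own assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas, warn about any field that is not
# of the form key:value, and return the values of the Name fields.
sub names_with_check {
    my ($line) = @_;
    my @names;
    for my $field (split /,/, $line) {
        if ($field =~ /^(\w+):(.*)$/) {
            push @names, $2 if $1 eq 'Name';
        }
        else {
            warn "Unexpected field: '$field'\n";
        }
    }
    return @names;
}

# The field Doe;Name:John triggers the warning and is skipped.
print "$_\n" for names_with_check("Name:Amanda_Marry_Rose,Region:US,Doe;Name:John,Name:Raghav.S.Thomas,Region:UAE,");
```

Since split drops trailing empty fields, the trailing comma causes no warning here.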

The simple way is to look for all sequences of non-comma characters after every instance of Name: in the string.
use strict;
use warnings;
my $line = 'Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,';
my @names = $line =~ /Name:([^,]+)/g;
print "$_\n" for @names;
output
Amanda_Marry_Rose
Raghav.S.Thomas
However, it may well be useful to parse the data into an array of hashes so that related fields are gathered together.
use strict;
use warnings;
my $line = 'Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,';
my %info;
my @persons;
while ( $line =~ / ([a-z]+) : ([^:,]+) /gix ) {
my ($key, $val) = (lc $1, $2);
if ($info{$key}) {
push @persons, { %info };
%info = ();
}
$info{$key} = $val;
}
push @persons, { %info };
use Data::Dump;
dd \@persons;
print "\nNames:\n";
print "$_\n" for map $_->{name}, @persons;
output
[
{
cardtype => "DebitCard",
host => "USE",
name => "Amanda_Marry_Rose",
product => "Satin",
region => "US",
},
{
name => "Raghav.S.Thomas",
region => "UAE",
},
]
Names:
Amanda_Marry_Rose
Raghav.S.Thomas


Perl regex to extract multiple matches from string

I have a string for example
id:123,createdby:'testuser1',"lastmodifiedby":'testuser2'.....
I want to extract the two user names (testuser1, testuser2) and save them to an array.
You don't need to do everything in one pattern. Do something simple in multiple matches:
my $string = qq(id:123,createdby:'testuser1',"lastmodifiedby":'testuser2');
my( $created_by ) = $string =~ /,createdby:'(.*?)'/;
my( $last_modified_by ) = $string =~ /,"lastmodifiedby":'(.*?)'/;
print <<"HERE";
Created: $created_by
Last modified by: $last_modified_by
HERE
However, this looks like comma-separated data, and the data you show is inconsistently quoted. I don't know whether that comes from you typing it out or whether it's your actual data.
It also looks like it might have come from JSON. If that's true, there are much better ways to extract the data.
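If the record really were JSON, a sketch along these lines would do, using the core JSON::PP module. The JSON text below is my guess at what a well-formed version of the record might look like, not the asker's actual data, and the sub name users_from_json is made up.

```perl
use strict;
use warnings;
use JSON::PP;   # ships with Perl since 5.14

# Hypothetical helper: decode one JSON object and return the two user names.
sub users_from_json {
    my ($json) = @_;
    my $record = decode_json($json);
    return @{$record}{qw(createdby lastmodifiedby)};   # hash slice on the reference
}

# Assumed input: a well-formed JSON version of the record.
my @users = users_from_json('{"id":123,"createdby":"testuser1","lastmodifiedby":"testuser2"}');
print "@users\n";
```

With a real parser, quoting and escaping inside the values stop being your problem.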
Try this
use strict;
use warnings;
my $string = q[id:123,createdby:'testuser1',"lastmodifiedby":'testuser2'....];
my @matches = ($string =~ /,createdby:'(.+?)',"lastmodifiedby":'(.+?)'/) ;
print "@matches\n";
Outputs
testuser1 testuser2
The requirements changed to allow for missing fields. To deal with that, try this
use strict;
use warnings;
my $string1 = q[id:123,createdby:'testuser1',"lastmodifiedby":'testuser2'....];
my $string2 = q[id:123,createdby:'testuser1'....] ;
for my $s ($string1, $string2)
{
my @matches = ( $s =~ /(?:createdby|"lastmodifiedby"):'(.+?)'/g ) ;
print "@matches\n";
}
Outputs
testuser1 testuser2
testuser1
The problem description does not give enough detail, and the quoting inside the string is not consistent.
As already stated, the string may be part of a JSON block, in which case it should be handled by other means. That assumption may be correct, but it is not clearly stated in the question.
Please read How do I ask a good question? and How to create a Minimal, Reproducible Example.
Otherwise, I assume the quoting is just a typing error. A bigger data sample and a better problem description would improve the question significantly.
The following code sample demonstrates one possible approach to get the desired result. It assumes that the data fields do not include , or : (otherwise a different approach is needed).
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my($str,%data,@arr);
$str = "id:123,createdby:'testuser1','lastmodifiedby':'testuser2'";
$str =~ s/'//g;
%data = split(/[:,]/,$str);
say Dumper(\%data);
@arr = ($data{createdby},$data{lastmodifiedby});
say Dumper(\@arr);
Output
$VAR1 = {
'id' => '123',
'createdby' => 'testuser1',
'lastmodifiedby' => 'testuser2'
};
$VAR1 = [
'testuser1',
'testuser2'
];
Another approach could be the following
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my($str,$re,@data,@arr);
$str = "id:123,createdby:'testuser1',\"lastmodifiedby\":'testuser2'";
@data = split(',',$str);
$re = qr/(createdby|lastmodifiedby)/;
for ( @data ) {
next unless /$re/;
s/['"]//g;
my($k,$v) = split(':',$_);
push @arr, $v;
}
say Dumper(\@arr);
Output
$VAR1 = [
'testuser1',
'testuser2'
];

Perl Regex: How to parse string from " to" without \"?

I have to parse current line "abc\",","\"," by regex in Perl,
and get this result "abc\"," and "\","
I do this
while (/(\s*)/gc) {
if (m{\G(["])([^\1]+)\1,}gc){
say $2;
}
}
but it is wrong, because this regexp goes all the way to the last ",
My question is: how can I skip over \" and stop at the first ", ?
The following program performs matches according to your specification:
while (<>) {
@arr = ();
while (/("(?:\\"|[^"])*")/) {
push @arr, $1;
$_ = $';
}
print join(' ', @arr), "\n";
}
Input file input.txt:
"abc", "def"
"abc\",","\","
Output:
$ ./test.pl < input.txt
"abc" "def"
"abc\"," "\","
It could be made stricter, because in this form it accepts a lot of input that may not be desirable, but it serves as a first pointer. Also, it is better to parse a CSV file with the corresponding module than with regular expressions, but you have not stated whether your input really is a CSV file.
Don't reinvent the wheel. If you have CSV, use a CSV parser.
use Text::CSV_XS qw( );
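my $string = '"abc\",","\","';
# escape_char is set to a backslash because this data escapes quotes as \" (the module's default escape character is ")
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, escape_char => '\\' });
$csv->parse($string);
my @fields = $csv->fields();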
Regexes aren't the best tool for this task. The standard Text::ParseWords module does this easily.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Text::ParseWords;
my $line = '"abc\",","\","';
my @fields = parse_line(',', 1, $line);
for (0 .. $#fields) {
say "$_: $fields[$_]"
}
The output is:
0: "abc\","
1: "\","
split /(?<!\\)",(?<!\\)"/, $_
(First clean the ends of $_ with s/^"// && s/"$//; because the enclosing outer quotes are not part of the fields, yet your input string has them.)
This returns the array you want directly, with no explicit loop, since the looping happens inside Perl's built-in split. You might add \s* around the comma, depending on how the string is provided.
But there could be a deeper case (just a note, as you didn't mention it):
If \" means a literal ", you probably also have \\ meaning a literal \, so sequences such as \\\" and \\" can occur. The latter (more generally, an even number of \ before ") is hard to handle with a one-line regexp, because look-behind has to be fixed-width. The variable-length form (?<!\\(?:\\\\)*)", which is meant to also recognize a closing quote that follows an escaped backslash as in \\", is not supported, so less efficient code than mine would be needed. Again, this only matters if \\ has to be interpreted as well.
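For that deeper case, one option is to stop using look-behind and instead consume each backslash together with the character after it. The following is only a sketch of that idea, under the assumption that a backslash always escapes exactly one following character; the sub name quoted_fields is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return each double-quoted field, quotes included.
# Inside the quotes, a backslash is consumed together with the next
# character, so \\ and \" are each read as one unit.
sub quoted_fields {
    my ($line) = @_;
    return $line =~ /("(?:[^"\\]|\\.)*")/g;
}

print "$_\n" for quoted_fields('"abc\",","\","');

# Here the first field ends in an escaped backslash, so the quote after it closes the field.
print "$_\n" for quoted_fields('"a\\\\","b"');
```

Because the pattern walks through each field from its opening quote, an even number of backslashes before a quote is handled without any look-behind.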

Multiline match with irregular new line

I have text file with many entries like this:
[...]
Wind: 83,476,224
Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
EnterValues: 656,136,1
Speed: 48,32
State: 2,102,83,476,224
[...]
From above part I would like to extract:
Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
It would be simple if EnterValues: came after every Solution:, but unfortunately it doesn't. Sometimes it is Speed, sometimes something different. I don't know how to construct the end of the regex (I assume it should be something like this: Solution:.*?(?<!~)\n).
My file has \n as a delimiter of new line.
What you need is to apply a "record separator" that has the functionality of a regex. Unfortunately, you cannot use $/, because it cannot be a regex. You can, however, read the entire file into one string and split that string using a regex:
use strict;
use warnings;
use Data::Dumper;
my $str = do {
local $/; # disable input record separator
<DATA>; # slurp the file
};
my @lines = split /^(?=\pL+:)/m, $str; # lines begin with letters + colon
print Dumper \@lines;
__DATA__
Wind: 83,476,224
Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
EnterValues: 656,136,1
Speed: 48,32
State: 2,102,83,476,224
Output:
$VAR1 = [
'Wind: 83,476,224
',
'Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
',
'EnterValues: 656,136,1
',
'Speed: 48,32
',
'State: 2,102,83,476,224
'
];
You will do some sort of post processing on these variables, I assume, but I will leave that to you. One way to go from here is to split the values on newline.
As I see, you first read the whole file into memory, which is not good practice. Try using the flip-flop operator:
while ( <$fh> ) {
if ( /Solution:/ ... !/~$/ ) {
print $_;
}
}
I can't test it right now, but I think this should work fine.
You can match from Solution: up to the next word followed by a colon:
my ($solution) = $text =~ /(Solution:.*?) \w+: /xs;
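To try the flip-flop idea without a file, here is a small sketch that runs it over a list of lines. The sub name solution_block is made up, and I anchor the start condition with ^, which the version above does not do.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the Solution: line plus its continuation lines.
sub solution_block {
    my (@lines) = @_;
    my @block;
    for (@lines) {
        # Three-dot flip-flop: true from a line starting with Solution:
        # through the first later line that does not end in ~.
        push @block, $_ if /^Solution:/ ... !/~$/;
    }
    return @block;
}

my @lines = split /\n/, <<'END';
Wind: 83,476,224
Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
EnterValues: 656,136,1
Speed: 48,32
END

print "$_\n" for solution_block(@lines);
```

The three-dot form does not test the end condition on the line that starts the range, so a Solution: line that does not itself end in ~ would still pull in the following line. Whether that matters depends on your data.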

How do I tokenise a word given tokens that are subsumed incompletely in the word?

I understand how to use regex in Perl in the following way:
$str =~ s/expression/replacement/g;
I understand that if any part of the expression is enclosed in parentheses, it can be used and captured in the replacement part, like this:
$str =~ s/(a)/($1)dosomething/;
But is there a way to capture the ($1) above outside of the regex expression?
I have a full word which is a string of consonants, e.g. bEdmA, its vowelized version baEodamaA (where a and o are vowels), as well as its split-up form of two tokens separated by a space, bEd mA. I want to pick out the vowelized form of each token from the full word, like so: baEoda, maA. I'm trying to capture the token within the full word expression, so I have:
$unvowelizedword = "bEdmA";
$tokens[0] = "bEd", $tokens[1] = "mA";
$vowelizedword = "baEodamA";
foreach $t (@tokens) {
#find the token within the full word, and capture its vowels
}
I'm trying to do something like this:
$vowelizedword = m/($t)/;
This is completely wrong for two reasons: the token $t is not present in exactly its own form, such as bEd, but something like m/b.E.d/ would be more relevant. Also, how do I capture it in a variable outside the regular expression?
The real question is: how can I capture the vowelized sequences baEoda and maA, given the tokens bEd and mA, from the full word baEodamaA?
Edit
I realized from all the answers that I missed out two important details.
Vowels are optional. So if the tokens are : "Al" and "ywm", and the fully vowelized word is "Alyawmi", then the output tokens would be "Al" and "yawmi".
I only mentioned two vowels, but there are more, including symbols made up of two characters, like '~a'. The full list (although I don't think I need to mention it here) is:
@vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');
The following seems to do what you want:
#!/usr/bin/env perl
use warnings;
use strict;
my #tokens = ('bEd', 'mA');
my $vowelizedword = "beEodamaA";
my @regex = map { join('.?', split //) . '.?' } @tokens;
my $regex = join('|', @regex);
$regex = qr/($regex)/;
while (my ($matched) = $vowelizedword =~ $regex) {
$vowelizedword =~ s{$regex}{};
print "matched $matched\n";
}
Update as per your updated question (vowels are optional). It works from the end of the string so you'll have to gather the tokens into an array and print them in reverse:
#!/usr/bin/env perl
use warnings;
use strict;
my #tokens = ('bEd', 'mA', 'Al', 'ywm');
my $vowelizedword = "beEodamaA Alyawmi"; # Caveat: Without the space it won't work.
my @regex = map { join('.?', split //) . '.?$' } @tokens;
my $regex = join('|', @regex);
$regex = qr/($regex)/;
while (my ($matched) = $vowelizedword =~ $regex) {
$vowelizedword =~ s{$regex}{};
print "matched $matched\n";
}
Use the m// operator in so-called "list context", like this:
my @tokens = ($input =~ m/capturing_regex_here/modifiershere);
ETA: From what I understand now, what you were trying to say is that you want to match an optional vowel after each character of the tokens.
With this, you can tweak the $vowels variable to only contain the letters you seek. Optionally, you may also just use . to capture any character.
use strict;
use warnings;
use Data::Dumper;
my #tokens = ("bEd", "mA");
my $full = "baEodamA";
my $vowels = "[aeiouy]";
my @matches;
for my $rx (@tokens) {
$rx =~ s/.\K/$vowels?/g;
if ($full =~ /$rx/) {
push @matches, $full =~ /$rx/g;
}
}
print Dumper \@matches;
Output:
$VAR1 = [
'baEoda',
'mA'
];
Note that
... $full =~ /$rx/g;
does not require capturing groups in the regex.
I suspect that there is an easier way to do whatever you're trying to accomplish. The trick is not to make the regex generation code so tricky that you forget what it's actually doing.
I can only begin to guess at your task, but from your single example, it looks like you want to check that the two subtokens are in the larger token, ignoring certain characters. I'm going to guess that those sub tokens have to be in order and can't have anything else between them besides those vowel characters.
To match the tokens, I can use the \G anchor with the /g global flag in scalar context. This anchors the match to the position where the last match on the same scalar ended. That lets me have a separate pattern for each subtoken, which is much easier to manage since I only need to change the list of values in @subtokens.
Once I have gone through the candidates and found which ones match all the patterns, I can extract the original string from each pair.
use v5.14;
my $vowels = '[ao]*';
my @subtokens = qw(bEd mA);
# prepare the subtoken regular expressions
my @patterns = map {
my $s = join "$vowels", map quotemeta, (split( // ), '');
qr/$s/;
} @subtokens;
my @tokens = qw( baEodamA mAabaEod baEoda mAbaEoda );
my @grand_matches;
TOKEN: foreach my $token ( @tokens ) {
say "-------\nMatching $token..........";
my @matches;
PATTERN: foreach my $pattern ( @patterns ) {
say "Position is ", pos($token) // 0;
# scalar context /g and \G
next TOKEN unless $token =~ /\G($pattern)/g;
push @matches, $1;
say "Matched with $pattern";
}
push @grand_matches, [ $token, \@matches ];
}
# Now report the original
foreach my $tuple ( @grand_matches ) {
say "$tuple->[0] has both fragments: @{$tuple->[1]}";
}
Now, here's the nice thing about this structure. I've probably guessed wrong about your task. If I have, it's easy to fix without changing the setup. Let's say that the subtokens don't have to be in order. That's an easy change to the pattern I created. I just get rid of the \G anchor and the /g flag:
next TOKEN unless $token =~ /($pattern)/;
Or, suppose that the tokens have to be in order, but other things may be between them. I can insert a .*? to match that stuff, effectively skipping over it:
next TOKEN unless $token =~ /\G.*?($pattern)/g;
It would be much nicer if I could manage all of this from the map where I create the patterns, but the /g flag isn't a pattern flag. It has to go with the operator.
I find it much easier to manage changing requirements when I don't wrap everything in a single regular expression.
Assuming the tokens need to appear in order and without anything (other than a vowel) between them:
my @tokens = ( "bEd", "mA" );
my $vowelizedword = "baEodamaA";
my $vowels = '[ao]';
my (@vowelized_sequences) = $vowelizedword =~ ( '^' . join( '', map "(" . join( $vowels, split( //, $_ ) ) . "(?:$vowels)?)", @tokens ) . '\\z' );
print for @vowelized_sequences;
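For the edited requirement (vowels are optional, and some vowel symbols are two characters long), a variation could look like the sketch below. It uses the vowel list from the question; the sub name vowelized_tokens and the decision to allow any number of vowel symbols after each token character are my assumptions.

```perl
use strict;
use warnings;

my @vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');

# Longest symbols first, so that '~a' is tried before '~'.
my $vowel_alt = join '|',
    map { quotemeta($_) }
    sort { length($b) <=> length($a) } @vowels;

# Hypothetical helper: returns the vowelized form of each token,
# or an empty list if the word is not made up of the tokens in order.
sub vowelized_tokens {
    my ($word, @tokens) = @_;
    my $pattern = '';
    for my $token (@tokens) {
        my $piece = '';
        # Each token character may be followed by zero or more vowel symbols.
        $piece .= quotemeta($_) . "(?:$vowel_alt)*" for split //, $token;
        $pattern .= "($piece)";
    }
    return $word =~ /\A$pattern\z/;
}

print join(', ', vowelized_tokens('baEodamaA', 'bEd', 'mA')), "\n";
print join(', ', vowelized_tokens('Alyawmi', 'Al', 'ywm')), "\n";
```

Note that N, F and K appear in the vowel list, so a token containing those letters could be split in more than one way; the regex engine simply returns the first split it finds.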

Perl substitution using a hash

open (FH,"report");
read(FH,$text,-s "report");
$fill{"place"} = "Dhahran";
$fill{"wdesc:desc"} = "hot";
$fill{"dayno.days"} = 4;
$text =~ s/%(\w+)%/$fill{$1}/g;
print $text;
This is the content of the "report" template file
"I am giving a course this week in %place%. The weather is %wdesc:desc%
and we're now onto day no %dayno.days%. It's great group of blokes on the
course but the room is like the weather - %wdesc:desc% and it gets hard to
follow late in the day."
For reasons that I won't go into, some of the keys in the hash I'll be using will have dots (.) or colons (:) in them, but the regex stops working for these, so for instance in the example above only %place% gets correctly replaced. By the way, my code is based on this example.
Any help with the regex greatly appreciated, or maybe there's a better approach...
You could loosen it right up and use "any sequence of anything that isn't a %" for the replaceable tokens:
$text =~ s/%([^%]+)%/$fill{$1}/g;
Good answers so far, but you should also decide what you want to do with %foo% if foo isn't a key in the %fill hash. Plausible options are:
Replace it with an empty string (that's what the current solutions do, since undef is treated as an empty string in this context)
Leave it alone, so "%foo%" stays as it is.
Do some kind of error handling, perhaps printing a warning on STDERR, terminating the translation, or inserting an error indicator into the text.
Some other observations, not directly relevant to your question:
You should use the three-argument version of open.
That's not the cleanest way to read an entire file into a string. For that matter, for what you're doing you might as well process the input one line at a time.
Here's how I might do it (this version leaves unrecognized "%foo%" strings alone):
#!/usr/bin/perl
use strict;
use warnings;
my %fill = ( place => 'Dhahran',
'wdesc:desc' => 'hot',
'dayno.days' => 4 );
my $filename = 'report';
open my $FH, '<', $filename or die "$filename: $!\n";
while (my $line = <$FH>) {
foreach my $key (keys %fill) {
$line =~ s/\Q%$key%/$fill{$key}/g;
}
print $line;
}
And here's a version that dies with an error message if there's an unrecognized key:
#!/usr/bin/perl
use strict;
use warnings;
my %fill = ( place => 'Dhahran',
'wdesc:desc' => 'hot',
'dayno.days' => 4 );
my $filename = 'report';
open my $FH, '<', $filename or die "$filename: $!\n";
while (my $line = <$FH>) {
$line =~ s/%([^%]*)%/Replacement($1)/eg;
print $line;
}
sub Replacement {
my($key) = @_;
if (exists $fill{$key}) {
return $fill{$key};
}
else {
die "Unrecognized key \"$key\" on line $.\n";
}
}
http://codepad.org/G0WEDNyH
$text =~ s/%([a-zA-Z0-9_\.\:]+)%/$fill{$1}/g;
By default \w equates to [a-zA-Z0-9_], so you'll need to add \. and \: to the character class.
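Putting the widened character class together with the "leave unknown keys alone" option gives a single-pass version. This is only a sketch; the sub name fill_template is made up, and inside a character class the dot and colon need no backslash.

```perl
use strict;
use warnings;

# Hypothetical helper: replace %key% with its value from the hash,
# leaving %key% untouched when the key is unknown.
sub fill_template {
    my ($text, %fill) = @_;
    $text =~ s/%([\w.:]+)%/exists $fill{$1} ? $fill{$1} : "%$1%"/ge;
    return $text;
}

my %fill = (
    place        => 'Dhahran',
    'wdesc:desc' => 'hot',
    'dayno.days' => 4,
);

print fill_template("In %place% it is %wdesc:desc%, day %dayno.days%, %unknown%.\n", %fill);
```

The /e flag makes the replacement side an expression, so the exists test runs once for every %...% token found.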