How to capture specific string inside a double quotes - regex

I have a set of strings in the below format, i want to capture the value inside the double quotes.
Input:
"icici","1001","50.0"
"hdfc","2001","10.0","20.0"
Expected output from substitution parameter:
\0 match icici and hdfc
\1 match 1001 and 2001
\2 match 50.0 and 10.0
\3 match 20.0
I tried the below regex but its not working properly, could you please help?
((?:")([0-9A-Za-z.]+)(?:",?))+

try this:
/\"([a-z0-9.]+)\"/mi allow character puts in [allow char]
live demo
php
$re = "/\"([a-z0-9.]+)\"/mi";
$str = "\"icici\",\"1001\",\"50.0\"\n\"hdfc\",\"2001\",\"10.0\",\"20.0\"";
preg_match_all($re, $str, $matches);
var_dump( $matches[1]);
or
$str="";
if(count($matches[1])>0) foreach($matches[1] as $k=>$v){
$str .="\\$k"."->".$v." ";
}
echo $str ;
output:
array (size=7)
0 => string 'icici' (length=5)
1 => string '1001' (length=4)
2 => string '50.0' (length=4)
3 => string 'hdfc' (length=4)
4 => string '2001' (length=4)
5 => string '10.0' (length=4)
6 => string '20.0' (length=4)
or
\0->icici \1->1001 \2->50.0 \3->hdfc \4->2001 \5->10.0 \6->20.0

Related

Some capture groups seem lost when matching group repeatedly

Trying to parse the output of monitoring plugins I ran into a problem where the match result was unexpected by me:
First consider this debugger session with Perl 5.18.2:
DB<6> x $_
0 'last=0.508798;;;0'
DB<7> x $RE
0 (?^u:^((?^u:\'[^\'=]+\'|[^\'= ]+))=((?^u:\\d+(?:\\.\\d*)?|\\.\\d+))(s|%|[KMT]?B)?(;(?^u:\\d+(?:\\.\\d*)?|\\.\\d+)?){0,4}$)
-> qr/(?^u:^((?^u:'[^'=]+'|[^'= ]+))=((?^u:\d+(?:\.\d*)?|\.\d+))(s|%|[KMT]?B)?(;(?^u:\d+(?:\.\d*)?|\.\d+)?){0,4}$)/
DB<8> #m = /$RE/
DB<9> x #m
0 'last'
1 0.508798
2 undef
3 ';0'
DB<10>
OK, the regex $RE (intended to match "'label'=value[UOM];[warn];[crit];[min];[max]") looks terrifying at a first glance, so let me show the construction of it:
my $RE_label = qr/'[^'=]+'|[^'= ]+/;
my $RE_simple_float = qr/\d+(?:\.\d*)?|\.\d+/;
my $RE_numeric = qr/[-+]?$RE_simple_float(?:[eE][-+]?\d+)?/;
my $RE = qr/^($RE_label)=($RE_simple_float)(s|%|[KMT]?B)?(;$RE_simple_float?){0,4}$/;
The relevant part is (;$RE_simple_float?){0,4}$ intended to match ";[warn];[crit];[min];[max]" (still not perfect), so for ";;;0" I'd expect #m to end with ';', ';', ';0'.
However it seems the matches are lost, except for the last one.
Did I misunderstand something, or is it a Perl bug?
When you use {<number>} (or + or * for that matter) after a capture group, only the last value that is matched by the capture group is stored. This explain why you only end up with ;0 instead of ;;;0 in your fourth capture group: (;$RE_simple_float?){0,4} sets the fourth capture group to the last element it matches.
Top fix that, I would recommend to match the whole end of the string, and split it afterwards:
my $RE = qr/...((?:;$RE_simple_float?){0,4})$/;
my #m = /$RE/;
my #end = split /;/, $m[3]; # use /(?<=;)/ to keep the semicolons
Another solution is to repeat the capture group: replace (;$RE_simple_float?){0,4} with
(;$RE_simple_float?)?(;$RE_simple_float?)?(;$RE_simple_float?)?(;$RE_simple_float?)?
The capture groups that do not match will be set to undef. This issue with this approach is that it's a bit verbose, and only works for {}, but not for + or *.
Following demo code utilizes split to obtain data of interest. Investigate if it will fit as a solution for your problem.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
while( <DATA> ) {
chomp;
say;
my $record;
$record->#{qw/label value warn crit min max/} = split(/[=;]/,$_);
say Dumper($record);
}
exit 0;
#'label'=value[UOM];[warn];[crit];[min];[max]
__DATA__
'label 1'=0.3345s;0.8s;1.2s;0.2s;3.2s
'label 2'=10%;7%;18%;2%;28%
'label 3'=0.5us;2.3us
Output
'label 1'=0.3345s;0.8s;1.2s;0.2s;3.2s
$VAR1 = {
'crit' => '1.2s',
'warn' => '0.8s',
'value' => '0.3345s',
'label' => '\'label 1\'',
'max' => '3.2s',
'min' => '0.2s'
};
'label 2'=10%;7%;18%;2%;28%
$VAR1 = {
'min' => '2%',
'max' => '28%',
'label' => '\'label 2\'',
'value' => '10%',
'warn' => '7%',
'crit' => '18%'
};
'label 3'=0.5us;2.3us
$VAR1 = {
'min' => undef,
'max' => undef,
'label' => '\'label 3\'',
'warn' => '2.3us',
'value' => '0.5us',
'crit' => undef
};

Regular expressions making scraper

I want to make a scraper that grab all the <a id href="">...</a> elements from some website and the format of the elements is:
<a id href="/model.aspx?modelid=886874">Samsung Galaxy Note 4 SM-N910F</a>
And the thing that is changed all the time is the ?modelid=integer. How do I make a regular expression for this?
try this:
$re = "/<a[^\"]*href=\"([^\"]*)\"[^>]*>([^<]+)<\\/a>/mi";
$str = "<a id href=\"sjdkg\">...</a>\n<a id href=\"sjdkg\">.dg..</a>";
preg_match_all($re, $str, $matches);
$matches[1]; // for href
$matches[2]; // for innertext
var_dump($matches);
output:
array
0 =>
array
0 => string '<a id href="sjdkg">...</a>' (length=26)
1 => string '<a id href="sjdkg">.dg..</a>' (length=28)
1 =>
array
0 => string 'sjdkg' (length=5)
1 => string 'sjdkg' (length=5)
2 =>
array
0 => string '...' (length=3)
1 => string '.dg..' (length=5)
live demo
This is the regex you need:
modelid\=\d\d\d\d\d\d\">(.*)</a>
Regex Explanation

Counting repeated characters around nth character of string

For a part of my University project, i am trying to count base repeats around the 11th character of 21 bp sequences of DNA. I want to look at the 11th character, then if there are repeated identical characters around it, to print them.
For example:
GCTAAAGTAAAAGAAGATGCA
Would give results of:
11th base is A, YES repeated 4 times
I really don't know how to go about this, to get the 11th character i'm sure i can use a regex but after that i'm not sure.
To start with I have playing around using a hash and looking for the number of occurrences of different nucleotide groups in each sequence, as follows:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $outputfile = "/Users/edwardtickle/Documents/hash.txt";
open FILE1, "/Users/edwardtickle/Documents/randomoutput.txt";
open( OUTPUTFILE, ">$outputfile" );
while (<FILE1>) {
if (/^(\S+)/) {
my $dna = $1;
my #repeats = ( $dna =~ /[A]{3}/g );
my #changes = ( $dna =~ /[T]{2}/g );
my %hash = ();
my %hash1 = ();
for my $repeats (#repeats) {
$hash{$repeats}++;
}
for my $changes (#changes) {
$hash1{$changes}++;
}
for my $key ( keys %hash ) {
print OUTPUTFILE $key . " => " . $hash{$key} . "\n";
}
for my $key1 ( keys %hash1 ) {
print OUTPUTFILE $key1 . " => " . $hash1{$key1} . "\n";
}
}
}
FILE1 data:
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
Gives results of:
TT => 2
AAA => 1
TT => 4
AAA => 1
TT => 2
TT => 4
AAA => 1
TT => 2
TT => 5
AAA => 1
TT => 1
AAA => 2
TT => 1
TT => 2
When for this sample data set i would like a cumulative tally of every sequence, rather than number of individual occurrences in each matching string, like this:
AAA => 6
TT => 23
How do i go about changing the output? And how do i prevent a string of TTTTT bases showing up as TT => 2? Then if anyone has any recommendations of how to go about the original problem/if it is even possible, that would be greatly appreciated.
Thanks in advance!
Using a regular expression:
use strict;
use warnings;
my $char = 11; # Looking for the 11th character, or position 10.
while (<DATA>) {
chomp;
if (m{
( (.) \2*+ ) # Look for a repeated character sequence
(?<= .{$char} ) # Must include pos $char - 1
}x) {
printf "%s => %d\n", $2, length($1);
}
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
Output:
G => 1
A => 3
T => 1
T => 2
A => 4
T => 3
A => 4
A => 5
T => 2
G => 3
This code should do what you need. There really isn't a regular expression that will find the longest sequence of a given character at and around a given character position. This code works by splitting the string $seq into an array of characters #seq and then searching forwards and backwards from the centre.
It's practical to do things this way because the sequence is relatively short, and as long as there's an odd numbers of characters in the string it will calculate the centre point for you.
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($base, $length) = find_mid_band($_);
printf "%s => %d\n", $base, $length;
}
sub find_mid_band {
my ($seq) = #_;
my #seq = unpack '(A1)*', $seq;
my $len_seq = #seq;
my $c_offset = ($len_seq - 1) / 2;
my $c_char = $seq[$c_offset];
my ($start, $end) = ($c_offset, $c_offset + 1);
--$start while $start > 0 and $seq[$start-1] eq $c_char;
++$end while $end < $len_seq and $seq[$end] eq $c_char;
return $c_char, $end-$start;
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
output
G => 1
A => 3
T => 1
T => 2
A => 4
T => 3
A => 4
A => 5
T => 2
G => 3
Update
Here's a better way. It's shorter and faster, and works by all the subsequences of the same character until it finds a sequence that spans the middle character.
The output is identical to that of the above.
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($base, $length) = find_mid_band($_);
printf "%s => %d\n", $base, $length;
}
sub find_mid_band {
my ($seq) = #_;
my $mid_seq = length($seq) / 2;
while ( $seq =~ /(.)\1*/g ) {
if ($-[0] < $mid_seq and $+[0] > $mid_seq) {
return $1, $+[0]-$-[0];
}
}
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC

Perl regex with a negative lookahead behaves unexpectedly

I'm attempting to match /ezmlm-(any word except 'weed' or 'return')\s+/ with a regex. The following demonstrates a foreach loop which does the right thing, and an attempted regex which almost does:
#!/usr/bin/perl
use strict;
use warnings;
my #tests = (
{ msg => "want 'yes', string has ezmlm, but not weed or return",
str => q[|/usr/local/bin/ezmlm-reject '<snip>'],
},
{ msg => "want 'yes', array has ezmlm, but not weed or return",
str => [ <DATA> ],
},
{ msg => "want 'no' , has ezmlm-weed",
str => q[|/usr/local/bin/ezmlm-weed '<snip>'],
},
{ msg => "want 'no' , doesn't have ezmlm-anything",
str => q[|/usr/local/bin/else '<snip>'],
},
{ msg => "want 'no' , ezmlm email pattern",
str => q[crazy/but/legal/ezmlm-wacky#example.org],
},
);
print "foreach regex\n";
foreach ( #tests ) {
print doit_fe( ref $_->{str} ? #{$_->{str}} : $_->{str} ) ? "yes" : "no";
print "\t";
print doit_re( ref $_->{str} ? #{$_->{str}} : $_->{str} ) ? "yes" : "no";
print "\t<--- $_->{msg}\n";
};
# for both of the following subs:
# #_ will contain one or more lines of data
# match the pattern /ezmlm-(any word except 'weed' or 'return')\s+/
sub doit_fe {
my $has_ezmlm = 0;
foreach ( #_ ) {
next if $_ !~ m/ezmlm-(.*?)\s/;
return 0 if $1 eq 'weed' or $1 eq 'return';
$has_ezmlm++;
};
return $has_ezmlm;
};
sub doit_re { return grep /ezmlm-(?!weed|return)/, #_; };
__DATA__
|/usr/local/bin/ezmlm-reject '<snip>'
|/usr/local/bin/ezmlm-issubn '<snip>'
|/usr/local/bin/ezmlm-send '<snip>'
|/usr/local/bin/ezmlm-archive '<snip>'
|/usr/local/bin/ezmlm-warn '<snip>'
The output of the sample program is as follows:
foreach regex
yes yes <--- want 'yes', string has ezmlm, but not weed or return
yes yes <--- want 'yes', array has ezmlm, but not weed or return
no no <--- want 'no' , has ezmlm-weed
no no <--- want 'no' , doesn't have ezmlm-anything
no yes <--- want 'no' , ezmlm email pattern
In the last instance, the regex fails, matching a goofy but legal email address. If I amend the regex placing a \s after the negative lookahead pattern like so:
grep /ezmlm-(?!weed|return)\s+/
The regex fails to match at all. I'm supposing it has to do with the how the negative pattern works. I've tried making the negation non-greedy, but it seems there's some lesson buried in 'perldoc perlre' that is escaping me. Is it possible to do this with a single regex?
The negative look-ahead is zero-width which means that the regex
/ezmlm-(?!weed|return)\s+/
will only match if one or more space characters immediately follow "ezmlm-".
The pattern
/ezmlm-(?!weed|return)/
will match
"crazy/but/legal/ezmlm-wacky#example.org"
because it contains "ezmlm-" not followed by "weedy" or "return".
Try
/ezmlm-(?!weed|return)\S+\s+/
where \S+ is one or more non-space characters (or instead use [^#\s]+ if you want to deny email addresses even if followed by a space).

Perl 5 - longest token matching in regexp (using alternation)

Is possible to force a Perl 5 regexp match longest possible string, if the regexp is, for example:
a|aa|aaa
I found is probably default in perl 6, but in perl 5, how i can get this behavior?
EXAMPLE pattern:
[0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9]
If I have string 2.10.2014, then first match will be 2, which is ok; but the next match will be 1, and this is not ok because it should be 10. Then 2014 will be 4 subsequently matches 2,0,1,4, but it should be 2014 using [0-9][0-9][0-9][0-9]. I know I could use [0-9]+, but I can't.
General solution: Put the longest one first.
my ($longest) = /(aaa|aa|a)/
Specific solution: Use
my ($longest) = /([0-9]{4}|[0-9]{1,2})/
If you can't edit the pattern, you'll have to find every possibility and find the longest of them.
my $longest;
while (/([0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9])/g) {
$longest = $1 if length($1) > length($longest);
}
The sanest solution I can see for unknown patterns is to match every possible pattern, look at the length of the matched substrings and select the longest substring:
my #patterns = (qr/a/, qr/a(a)/, qr/b/, qr/aaa/);
my $string = "aaa";
my #substrings = map {$string =~ /($_)/; $1 // ()} #patterns;
say "Matched these substrings:";
say for #substrings;
my $longest_token = (sort { length $b <=> length $a } #substrings)[0];
say "Longest token was: $longest_token";
Output:
Matched these substrings:
a
aa
aaa
Longest token was: aaa
For known patterns, one would sort them manually so that first-match is the same as longest-match:
"aaa" =~ /(aaa|aa|b|a)/;
say "I know that this was the longest substring: $1";
The alternation will use the first alternative that matches, so just write /aaa|aa|a/ instead.
For the example you have shown in your question, just put the longest alternative first like I said:
[0-9][0-9][0-9][0-9]|[0-9][0-9]|[0-9]
perl -Mstrict -Mre=/xp -MData::Dumper -wE'
{package Data::Dumper;our($Indent,$Sortkeys,$Terse,$Useqq)=(1)x4}
sub _dump { Dumper(shift) =~ s{(\[.*?\])}{$1=~s/\s+/ /gr}srge }
my ($count, %RS);
my $s= "aaaabbaaaaabbab";
$s =~ m{ \G a+b? (?{ $RS{ $+[0] - $-[0] } //= [ ${^MATCH}, $-[0] ]; $count++ }) (*FAIL) };
say sprintf "RS: %s", _dump(\%RS);
say sprintf "count: %s", $count;
'
RS: {
"1" => [ "a", 0 ],
"2" => [ "aa", 0 ],
"3" => [ "aaa", 0 ],
"4" => [ "aaaa", 0 ],
"5" => [ "aaaab", 0 ]
}
count: 5