Regular expressions making scraper - regex

I want to make a scraper that grab all the <a id href="">...</a> elements from some website and the format of the elements is:
<a id href="/model.aspx?modelid=886874">Samsung Galaxy Note 4 SM-N910F</a>
And the thing that is changed all the time is the ?modelid=integer. How do I make a regular expression for this?

try this:
$re = "/<a[^\"]*href=\"([^\"]*)\"[^>]*>([^<]+)<\\/a>/mi";
$str = "<a id href=\"sjdkg\">...</a>\n<a id href=\"sjdkg\">.dg..</a>";
preg_match_all($re, $str, $matches);
$matches[1]; // for href
$matches[2]; // for innertext
var_dump($matches);
output:
array
0 =>
array
0 => string '<a id href="sjdkg">...</a>' (length=26)
1 => string '<a id href="sjdkg">.dg..</a>' (length=28)
1 =>
array
0 => string 'sjdkg' (length=5)
1 => string 'sjdkg' (length=5)
2 =>
array
0 => string '...' (length=3)
1 => string '.dg..' (length=5)
live demo

This is the regex you need:
modelid\=\d\d\d\d\d\d\">(.*)</a>
Regex Explanation

Related

Some capture groups seem lost when matching group repeatedly

Trying to parse the output of monitoring plugins I ran into a problem where the match result was unexpected by me:
First consider this debugger session with Perl 5.18.2:
DB<6> x $_
0 'last=0.508798;;;0'
DB<7> x $RE
0 (?^u:^((?^u:\'[^\'=]+\'|[^\'= ]+))=((?^u:\\d+(?:\\.\\d*)?|\\.\\d+))(s|%|[KMT]?B)?(;(?^u:\\d+(?:\\.\\d*)?|\\.\\d+)?){0,4}$)
-> qr/(?^u:^((?^u:'[^'=]+'|[^'= ]+))=((?^u:\d+(?:\.\d*)?|\.\d+))(s|%|[KMT]?B)?(;(?^u:\d+(?:\.\d*)?|\.\d+)?){0,4}$)/
DB<8> #m = /$RE/
DB<9> x #m
0 'last'
1 0.508798
2 undef
3 ';0'
DB<10>
OK, the regex $RE (intended to match "'label'=value[UOM];[warn];[crit];[min];[max]") looks terrifying at a first glance, so let me show the construction of it:
my $RE_label = qr/'[^'=]+'|[^'= ]+/;
my $RE_simple_float = qr/\d+(?:\.\d*)?|\.\d+/;
my $RE_numeric = qr/[-+]?$RE_simple_float(?:[eE][-+]?\d+)?/;
my $RE = qr/^($RE_label)=($RE_simple_float)(s|%|[KMT]?B)?(;$RE_simple_float?){0,4}$/;
The relevant part is (;$RE_simple_float?){0,4}$ intended to match ";[warn];[crit];[min];[max]" (still not perfect), so for ";;;0" I'd expect #m to end with ';', ';', ';0'.
However it seems the matches are lost, except for the last one.
Did I misunderstand something, or is it a Perl bug?
When you use {<number>} (or + or * for that matter) after a capture group, only the last value that is matched by the capture group is stored. This explain why you only end up with ;0 instead of ;;;0 in your fourth capture group: (;$RE_simple_float?){0,4} sets the fourth capture group to the last element it matches.
Top fix that, I would recommend to match the whole end of the string, and split it afterwards:
my $RE = qr/...((?:;$RE_simple_float?){0,4})$/;
my #m = /$RE/;
my #end = split /;/, $m[3]; # use /(?<=;)/ to keep the semicolons
Another solution is to repeat the capture group: replace (;$RE_simple_float?){0,4} with
(;$RE_simple_float?)?(;$RE_simple_float?)?(;$RE_simple_float?)?(;$RE_simple_float?)?
The capture groups that do not match will be set to undef. This issue with this approach is that it's a bit verbose, and only works for {}, but not for + or *.
Following demo code utilizes split to obtain data of interest. Investigate if it will fit as a solution for your problem.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
while( <DATA> ) {
chomp;
say;
my $record;
$record->#{qw/label value warn crit min max/} = split(/[=;]/,$_);
say Dumper($record);
}
exit 0;
#'label'=value[UOM];[warn];[crit];[min];[max]
__DATA__
'label 1'=0.3345s;0.8s;1.2s;0.2s;3.2s
'label 2'=10%;7%;18%;2%;28%
'label 3'=0.5us;2.3us
Output
'label 1'=0.3345s;0.8s;1.2s;0.2s;3.2s
$VAR1 = {
'crit' => '1.2s',
'warn' => '0.8s',
'value' => '0.3345s',
'label' => '\'label 1\'',
'max' => '3.2s',
'min' => '0.2s'
};
'label 2'=10%;7%;18%;2%;28%
$VAR1 = {
'min' => '2%',
'max' => '28%',
'label' => '\'label 2\'',
'value' => '10%',
'warn' => '7%',
'crit' => '18%'
};
'label 3'=0.5us;2.3us
$VAR1 = {
'min' => undef,
'max' => undef,
'label' => '\'label 3\'',
'warn' => '2.3us',
'value' => '0.5us',
'crit' => undef
};

How to capture specific string inside a double quotes

I have a set of strings in the below format, i want to capture the value inside the double quotes.
Input:
"icici","1001","50.0"
"hdfc","2001","10.0","20.0"
Expected output from substitution parameter:
\0 match icici and hdfc
\1 match 1001 and 2001
\2 match 50.0 and 10.0
\3 match 20.0
I tried the below regex but its not working properly, could you please help?
((?:")([0-9A-Za-z.]+)(?:",?))+
try this:
/\"([a-z0-9.]+)\"/mi allow character puts in [allow char]
live demo
php
$re = "/\"([a-z0-9.]+)\"/mi";
$str = "\"icici\",\"1001\",\"50.0\"\n\"hdfc\",\"2001\",\"10.0\",\"20.0\"";
preg_match_all($re, $str, $matches);
var_dump( $matches[1]);
or
$str="";
if(count($matches[1])>0) foreach($matches[1] as $k=>$v){
$str .="\\$k"."->".$v." ";
}
echo $str ;
output:
array (size=7)
0 => string 'icici' (length=5)
1 => string '1001' (length=4)
2 => string '50.0' (length=4)
3 => string 'hdfc' (length=4)
4 => string '2001' (length=4)
5 => string '10.0' (length=4)
6 => string '20.0' (length=4)
or
\0->icici \1->1001 \2->50.0 \3->hdfc \4->2001 \5->10.0 \6->20.0

Counting repeated characters around nth character of string

For a part of my University project, i am trying to count base repeats around the 11th character of 21 bp sequences of DNA. I want to look at the 11th character, then if there are repeated identical characters around it, to print them.
For example:
GCTAAAGTAAAAGAAGATGCA
Would give results of:
11th base is A, YES repeated 4 times
I really don't know how to go about this, to get the 11th character i'm sure i can use a regex but after that i'm not sure.
To start with I have playing around using a hash and looking for the number of occurrences of different nucleotide groups in each sequence, as follows:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $outputfile = "/Users/edwardtickle/Documents/hash.txt";
open FILE1, "/Users/edwardtickle/Documents/randomoutput.txt";
open( OUTPUTFILE, ">$outputfile" );
while (<FILE1>) {
if (/^(\S+)/) {
my $dna = $1;
my #repeats = ( $dna =~ /[A]{3}/g );
my #changes = ( $dna =~ /[T]{2}/g );
my %hash = ();
my %hash1 = ();
for my $repeats (#repeats) {
$hash{$repeats}++;
}
for my $changes (#changes) {
$hash1{$changes}++;
}
for my $key ( keys %hash ) {
print OUTPUTFILE $key . " => " . $hash{$key} . "\n";
}
for my $key1 ( keys %hash1 ) {
print OUTPUTFILE $key1 . " => " . $hash1{$key1} . "\n";
}
}
}
FILE1 data:
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
Gives results of:
TT => 2
AAA => 1
TT => 4
AAA => 1
TT => 2
TT => 4
AAA => 1
TT => 2
TT => 5
AAA => 1
TT => 1
AAA => 2
TT => 1
TT => 2
When for this sample data set i would like a cumulative tally of every sequence, rather than number of individual occurrences in each matching string, like this:
AAA => 6
TT => 23
How do i go about changing the output? And how do i prevent a string of TTTTT bases showing up as TT => 2? Then if anyone has any recommendations of how to go about the original problem/if it is even possible, that would be greatly appreciated.
Thanks in advance!
Using a regular expression:
use strict;
use warnings;
my $char = 11; # Looking for the 11th character, or position 10.
while (<DATA>) {
chomp;
if (m{
( (.) \2*+ ) # Look for a repeated character sequence
(?<= .{$char} ) # Must include pos $char - 1
}x) {
printf "%s => %d\n", $2, length($1);
}
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
Output:
G => 1
A => 3
T => 1
T => 2
A => 4
T => 3
A => 4
A => 5
T => 2
G => 3
This code should do what you need. There really isn't a regular expression that will find the longest sequence of a given character at and around a given character position. This code works by splitting the string $seq into an array of characters #seq and then searching forwards and backwards from the centre.
It's practical to do things this way because the sequence is relatively short, and as long as there's an odd numbers of characters in the string it will calculate the centre point for you.
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($base, $length) = find_mid_band($_);
printf "%s => %d\n", $base, $length;
}
sub find_mid_band {
my ($seq) = #_;
my #seq = unpack '(A1)*', $seq;
my $len_seq = #seq;
my $c_offset = ($len_seq - 1) / 2;
my $c_char = $seq[$c_offset];
my ($start, $end) = ($c_offset, $c_offset + 1);
--$start while $start > 0 and $seq[$start-1] eq $c_char;
++$end while $end < $len_seq and $seq[$end] eq $c_char;
return $c_char, $end-$start;
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
output
G => 1
A => 3
T => 1
T => 2
A => 4
T => 3
A => 4
A => 5
T => 2
G => 3
Update
Here's a better way. It's shorter and faster, and works by all the subsequences of the same character until it finds a sequence that spans the middle character.
The output is identical to that of the above.
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($base, $length) = find_mid_band($_);
printf "%s => %d\n", $base, $length;
}
sub find_mid_band {
my ($seq) = #_;
my $mid_seq = length($seq) / 2;
while ( $seq =~ /(.)\1*/g ) {
if ($-[0] < $mid_seq and $+[0] > $mid_seq) {
return $1, $+[0]-$-[0];
}
}
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC

Find text between string

How to extract text using perl from string like this:
Get text 1 ...
---------------------------------------------------------------------------------------------
Get text 2 ...
---------------------------------------------------------------------------------------------
Get text 3
---------------------------------------------------------------------------------------------
The result should be like this:
%texts = ( 'text1' => 'Get text 1 ...', 'text2' => 'Get text 2 ...',
'text3' => 'Get text 3 ...' )
Something like PHP preg_match_all.
Many thanks
If your preg_macth_all looks like preg_match_all('/(foo)/', $text, $matches), the perl equivalent is something like #matches = $text=~/(foo)/g.
I'm not sure if this is what you want, but the following code extracts the lines from the string you gave and puts them in a hash:
use Data::Dumper;
my $str = 'Get text 1 ...
---------------------------------------------------------------------------------------------
Get text 2 ...
---------------------------------------------------------------------------------------------
Get text 3
---------------------------------------------------------------------------------------------';
my %hash = (
'text1' => $str =~ /.*text 1.*/g,
'text2' => $str =~ /.*text 2.*/g,
'text3' => $str =~ /.*text 3.*/g);
print Dumper(\%hash);
The output of this snippet is:
$VAR1 = {
'text2' => 'Get text 2 ...',
'text1' => 'Get text 1 ...',
'text3' => 'Get text 3'
};
my $i = 1;
my %text = ();
open my $fh, "<", \$the_string;
while (<$fh>) {
if (/--------------/) { # text separator
$i++;
} else {
$text{"text$i"} .= $_;
}
}
In Perl >= 5.10, this should be:
while( $string =~ /^\w+\s+(\w+)\s+(\w+)[^\n\r]*/pgms ) {
$texts{$1.$2} = ${^MATCH}
}
Regards
rbo

Prepare string with regex for HTML tags management

P/S: I am a PHP programmer.
Given:
div{3|5|6|9}[id = abc| class=image], a[id=link|class=out]
I want to use regex to generate a result as an array, e.g:
array(
[div] => array(
"3|5|6|9",
"id = abc| class=image"
)
[a] => array(
"",
"id=link|class=out")
)
Would anyone please help?
Thank you a lot!
Have a try with this:
$str='div{3|5|6|9}[id = abc| class=image], a[id=link|class=out]';
preg_match_all('/(\w+)(\{(.*?)\})?\[(.*?)\](?:, |$)?/', $str, $m);
$out = array($m[1][0] => array($m[3][0], $m[4][0]), $m[1][1] => array($m[3][1], $m[4][1]));
print_r($out);
Output:
Array
(
[div] => Array
(
[0] => 3|5|6|9
[1] => id = abc| class=image
)
[a] => Array
(
[0] =>
[1] => id=link|class=out
)
)
If you can guarantee that a comma will not exist between { and } and between [ and ], you can first split the string by , and then use this regular expression:
/([a-z]+)(\{(.*?)\})?\[(.*?)\]/
The captured groups that you want are $1, $3, and $4 (those back-reference numbers should match up if you use preg_match_all)
Note: I tested this out in Javascript.
preg_match_all('/(\w+)(\{(.*?)\})?\[(.*?)\](?:, |$)?/', $str, $m);
I believe the above is fine to work unless another string like:
$str='div{3|5|6|9}[id = abc| class=image], a[id=link|class=out], br, ul';
the regex does not capture the br and ul.