Regex building Help required - regex

I need help building a regular expression that handles HTML tags, repeated patterns, etc.; see http://regex101.com/r/iD3xT7/1
I have part of it working already, but when I try to repeat the pattern <a\s[^<>]*>([^<>]*)<\/a>\s it fails on the repetition, as if it were recursive. I need the complete pattern for this.

Warning: you shouldn't use regex for HTML parsing,
as has already been said many times on SO.
That said, you won't be able to repeat only the hyperlink pattern.
For clarity, you should extract each kind of data with its own regex. Example in PHP:
$html = /* choose your way to retrieve the HTML */;
$movies = array();
preg_match('/Released:.*?<td>(.+?)<\/td>/s', $html, $matches);
$movies['lucy']['released'] = $matches[1];
preg_match('/Runtime:.*?<td>(.+?)<\/td>/s', $html, $matches);
$movies['lucy']['runtime'] = $matches[1];
preg_match_all('/<a[^>]*?genre[^>]*?>(.+?)<\/a>/', $html, $matches);
$movies['lucy']['genres'] = $matches[1];
var_dump($movies);
/*
array(1) {
["lucy"]=>
array(3) {
["released"]=>
string(13) "July 25, 2014"
["runtime"]=>
string(8) "90 mins "
["genres"]=>
array(2) {
[0]=>
string(6) "Action"
[1]=>
string(6) "Sci-Fi"
}
}
}
*/
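The same "one small regex per field" idea can be sketched in JavaScript. The markup below is made up for illustration (the real page from the question is not reproduced here), and the variable names are my own assumptions:

```javascript
// Hypothetical markup, shaped only roughly like the page in the question.
const html = `
<table>
  <tr><th>Released:</th><td>July 25, 2014</td></tr>
  <tr><th>Runtime:</th><td>90 mins</td></tr>
</table>
<a href="/genre/action">Action</a> <a href="/genre/sci-fi">Sci-Fi</a>
<a href="/about">About</a>`;

// One small pattern per field; the s flag lets . cross line breaks.
const released = html.match(/Released:.*?<td>(.+?)<\/td>/s)[1];
const runtime = html.match(/Runtime:.*?<td>(.+?)<\/td>/s)[1];

// Repeated items are collected with a global pattern
// rather than by repeating a capture group inside one big regex.
const genres = [...html.matchAll(/<a[^>]*?genre[^>]*?>(.+?)<\/a>/g)].map(m => m[1]);

const movie = { released, runtime, genres };
console.log(movie);
```

The global match plays the role of preg_match_all: each hyperlink is matched separately, so every captured text is kept.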

Related

Select characters that appear only once in a string

Is it possible to select characters that appear only once?
I am familiar with negative look-behind, and tried the following
/(.)(?<!\1.*)/
but could not get it to work.
Examples:
given AXXDBD it should output ADBD (the repeated XX is unacceptable)
given 123558 it should output 1238 (the repeated 55 is unacceptable)
Thanks in advance for the help.
There are probably a lot of approaches to this, but I think you're looking for something like
(.)\1{1,}
That is, any character followed by the same character at least once.
Your question is tagged with both PHP and JS, so:
PHP:
$str = preg_replace('/(.)\1{1,}/', '', $str);
JS:
str = str.replace(/(.)\1{1,}/g, '');
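A quick check of that pattern against the two examples from the question, as a small sketch (the function name is my own; \1+ is equivalent to \1{1,}):

```javascript
// Delete every run of two or more identical adjacent characters.
function dropRepeatedRuns(str) {
  return String(str).replace(/(.)\1+/g, '');
}

console.log(dropRepeatedRuns('AXXDBD')); // ADBD
console.log(dropRepeatedRuns('123558')); // 1238
```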
Without using a regular expression:
function not_twice ($str) {
$str = (string)$str;
$new_str = '';
$prev = false;
for ($i=0; $i < strlen($str); $i++) {
if ($str[$i] !== $prev) {
$new_str .= $str[$i];
}
$prev = $str[$i];
}
return $new_str;
}
This collapses each run of consecutive repeated characters into a single character, and casts numbers to strings in case you need that too.
Testing:
$string = [
'AXXDBD',
'123558',
12333
];
$string = array_map('not_twice', $string);
echo '<pre>' . print_r($string, true) . '</pre>';
Outputs:
Array
(
[0] => AXDBD
[1] => 12358
[2] => 123
)
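For comparison, the same collapsing behaviour can be written as a one-line regex replacement. This is only a sketch in JavaScript with a made-up function name; note that it keeps one character per run, whereas the output asked for in the question drops the run entirely:

```javascript
// Replace each run of repeated characters with a single copy of that character.
function collapseRuns(value) {
  return String(value).replace(/(.)\1+/g, '$1');
}

const collapsed = ['AXXDBD', '123558', 12333].map(v => collapseRuns(v));
console.log(collapsed); // [ 'AXDBD', '12358', '123' ]
```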

Join, split and map using perl for creating new attribs

my $str = "<SampleElement oldattribs=\"sa1 sa2 sa3\">";
$str =~ s#<SampleElement[^>]*oldattribs="([^"]*)"#
my $fulcnt=$&;
my $afids=$1;
my @affs = ();
if($afids =~ m/\s+/) {
@affs = split /\s/, $afids;
my $jnafs = join ",", map { $_=~s/[a-z]*//i, } @affs;
($fulcnt." newattribs=\"$jnafs\"");
}
else {
($fulcnt);
}
#eg;
My Output:
<SampleElement oldattribs="sa1 sa2 sa3" newattribs="1,1,1">
Expected Output:
<SampleElement oldattribs="sa1 sa2 sa3" newattribs="1,2,3">
Could someone point out where I am going wrong? Thanks in advance.
Where you're going wrong is earlier than you think - you're parsing XML using regular expressions. XML is contextual, and regex isn't, so it's NEVER going to be better than a dirty hack.
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig -> parse ( \*DATA );
my $sample_elt = $twig -> get_xpath('//SampleElement',0);
my @old_att = split ( ' ', $sample_elt -> att('oldattribs') );
$sample_elt -> set_att('newattribs', join ",", map { /(\d+)/ } @old_att);
$twig -> set_pretty_print ( 'indented_a' );
$twig -> print;
__DATA__
<XML>
<SampleElement oldattribs="sa1 sa2 sa3">
</SampleElement>
</XML>
But to answer the core of your problem: you're misusing map as an iterator here.
map { $_=~s/[a-z]*//i, } @affs;
That iterates over all the elements of @affs and modifies them in place, but map returns the value of the expression. s/// returns the number of substitutions made, which is 1 here because the substitution succeeded.
If you want to change @affs, you'd write:
s/[a-z]*//i for @affs;
If you don't want to modify it, the easy answer is the /r modifier (Perl 5.14 and later), which returns the modified copy instead:
map { s/[a-z]*//ir } @affs;
Or, as I've done in my example:
map { /(\d+)/ } @affs;
In list context the match returns the captured text, so map yields the numeric part of each string.
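The same distinction (mapping to a "did it work" result versus mapping to the extracted value) can be illustrated outside Perl. The following is a rough JavaScript analogy, not a translation of the code above; the variable names are assumptions:

```javascript
const tokens = 'sa1 sa2 sa3'.split(/\s+/);

// Mapping to a success indicator loses the data, much like s/// returning 1 in Perl.
const flags = tokens.map(t => /[a-z]*/i.test(t)).join(',');

// Mapping to the captured digits keeps the part we actually want.
const digits = tokens.map(t => (t.match(/\d+/) || [''])[0]).join(',');

console.log(flags);  // true,true,true
console.log(digits); // 1,2,3
```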
Here is a simple way to build shown output from the input $str.
Note: The input is in single quotes, not double. Then the \" isn't a problem in the regex.
my $str = '<SampleElement oldattribs=\"sa1 sa2 sa3\">';
# Pull 'sa1 sa2 sa3' string out of it
my ($attrs) = $str =~ /=\\"([^\\]+)/; # " # (turn off bad syntax highlight)
# Build '1,2,3' string from it
my $indices = join ',', map { /(\d+)/ } split ' ', $attrs;
# Extract content between < > so to add to it, put it back together
my ($content) = $str =~ /<(.*)>/;
my $output = '<' . $content . " newattribs=\"$indices\"" . '>';
This gives the required output.
Some of these can be combined into single statements, if you are into that. For example
my $indices =
join ',', map { /(\d+)/ } split ' ', ($str =~ /"([^\\]+)/)[0]; # "
$str =~ s/<(.*)>/<$1 newattribs=\"$indices\">/;
All of this can be rolled into one regex, but it becomes just unwieldy and hard to maintain.
Above all – this appears to be XML or such ... please don't do it by hand, unless there is literally just a snippet or two. There are excellent parsers.
I found a solution by reading up on the map function:
my $str = "<SampleElement oldattribs=\"sa1 sa2 sa3\">";
$str=~s#<SampleElement[^>]*oldattribs="([^"]*)"#my $fulcnt=$&; my $afids=$1;
my @affs = ();
if($afids=~m/\s+/)
{
@affs = split /\s/, $afids;
my @newas = join ",", map { (my $foo = $_) =~ s/[a-z]*//i; $foo; } @affs ;
($fulcnt." newattribs=\"@newas\"");
}
else
{
($fulcnt);
}
#eg;
I updated this line in my code:
my @newas = join ",", map { (my $foo = $_) =~ s/[a-z]*//i; $foo; } @affs ;
instead of
my $jnafs = join ",", map { $_=~s/[a-z]*//i, } @affs;
It's working now. Thanks to all.

Regex on a string

I'm trying to formulate a regular expression to use on text. Using in-memory variables is not giving the same result.
The below regular expression provides $1 and $2 that return what I expect. rw results vary. These positions can vary: I am looking to extract the data irrespective of the position in the string.
\/vol\/(\w+)\?(\w+|\s+).*rw=(.*\w+)
My data:
_DATA_
/vol/vol1 -sec=sys,rw=h1:h2,anon=0
/vol/vol1/q1 -sec=sys,rw=h3:h4,anon=0,ro=h1:h2
/vol/vol2/q1 -sec=sys,root=host5,ro=h3:h5,rw=h1:h2,anon=0
I'm trying to capture the second and third groups (if it is a space it should return a space), and a list of entries in rw, ro and root.
The expression (.*\w+) is greedy: it will match up to the last word character in the line. What you are looking for is most likely ([0-9a-z:]+).
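To see the difference on the sample data, here is a small JavaScript sketch (the variable names are mine) that applies both capture groups after rw=:

```javascript
const lines = [
  '/vol/vol1 -sec=sys,rw=h1:h2,anon=0',
  '/vol/vol1/q1 -sec=sys,rw=h3:h4,anon=0,ro=h1:h2',
  '/vol/vol2/q1 -sec=sys,root=host5,ro=h3:h5,rw=h1:h2,anon=0',
];

// Greedy: .* runs to the end of the line, then backs off to the last word character.
const greedy = lines.map(l => l.match(/rw=(.*\w+)/)[1]);

// Bounded: the character class stops at the first comma.
const bounded = lines.map(l => l.match(/rw=([0-9a-z:]+)/)[1]);

console.log(greedy);  // [ 'h1:h2,anon=0', 'h3:h4,anon=0,ro=h1:h2', 'h1:h2,anon=0' ]
console.log(bounded); // [ 'h1:h2', 'h3:h4', 'h1:h2' ]
```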
Guessing from your comment in reply to ikegami, maybe the following will give results you want.
#!/usr/bin/perl
use strict;
use warnings;
my @keys = qw/ rw ro root /;
my $wanted = join "|", @keys;
my %data;
while (<DATA>) {
my ($path, $param) = split;
my ($vol, $q) = (split '/', $path)[2,3];
my %tmp = map {split /=/} grep /^(?:$wanted)/, split /,/, $param;
$data{$vol}{$q // ' '} = \%tmp;
}
use Data::Dumper; print Dumper \%data;
__DATA__
/vol/vol1 -sec=sys,rw=h1:h2,anon=0
/vol/vol1/q1 -sec=sys,rw=h3:h4,anon=0,ro=h1:h2
/vol/vol2/q1 -sec=sys,root=host5,ro=h3:h5,rw=h1:h2,anon=0
The output from Data::Dumper is:
$VAR1 = {
'vol2' => {
'q1' => {
'ro' => 'h3:h5',
'root' => 'host5',
'rw' => 'h1:h2'
}
},
'vol1' => {
' ' => {
'rw' => 'h1:h2'
},
'q1' => {
'ro' => 'h1:h2',
'rw' => 'h3:h4'
}
}
};
Update: can you tell me what (?:) means in the grep?
(?: . . .) is a non-capturing group. It is used in this case because the regex begins with ^. Without grouping, the regex would attempt to match ro positioned at the beginning of the string, or rw or root anywhere in the string (not just the beginning).
/^ro|rw|root/ rather than /^(?:ro|rw|root)/
With the second expression, all three alternatives are only attempted at the beginning of the string rather than anywhere in it. That speeds things up (although with only three alternatives it won't make a huge difference here), and it is good practice to follow.
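The precedence difference is easy to check. In this JavaScript sketch the first three samples come from the data above, while 'sec=sys' and 'xrw=1' are made-up test strings:

```javascript
const loose = /^ro|rw|root/;         // ^ applies to the first alternative only
const anchored = /^(?:ro|rw|root)/;  // ^ applies to all three alternatives

const samples = ['rw=h1:h2', 'ro=h3:h5', 'root=host5', 'sec=sys', 'xrw=1'];

const looseHits = samples.filter(s => loose.test(s));
const anchoredHits = samples.filter(s => anchored.test(s));

console.log(looseHits);    // includes 'xrw=1', because rw may match anywhere
console.log(anchoredHits); // only the strings that start with ro, rw or root
```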
What does (// ' ') stand for?
That is the defined-or operator. The expression $q // ' ' says to use $q as the hash key if it is defined, and a space otherwise.
You said in your original post: "I'm trying to capture the second and third groups (if it is a space it should return a space)."
$q can be undefined when the split, my ($vol, $q) = (split '/', $path)[2,3]; finds only a vol and no q, as in this data line (/vol/vol1 -sec=sys,rw=h1:h2,anon=0).
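For readers more at home in JavaScript, the nullish coalescing operator ?? behaves similarly to Perl's //: it falls back only when the left side is null or undefined. A minimal sketch, with a hypothetical helper name:

```javascript
// Split a path like '/vol/vol1/q1' and default the missing fourth part to a space.
function volAndQ(path) {
  const parts = path.split('/'); // '/vol/vol1' -> ['', 'vol', 'vol1']
  const vol = parts[2];
  const q = parts[3] ?? ' ';     // undefined when the path has no q component
  return [vol, q];
}

console.log(volAndQ('/vol/vol1'));    // [ 'vol1', ' ' ]
console.log(volAndQ('/vol/vol2/q1')); // [ 'vol2', 'q1' ]
```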
No idea what you want, but a regex would not make a good parser here.
while (<DATA>) {
my ($path, $opts) = split;
my %opts =
map { my ($k,$v) = split(/=/, $_, 2); $k=>$v }
split(/,/, $opts);
...
}
(my %opts = split(/[,=]/, $opts); might suffice.)
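The split-based approach translates directly to other languages. Here is a sketch in JavaScript, assuming every option has the form key=value (the function name is mine); it splits each pair on the first = only, so a value containing = would survive:

```javascript
// Turn '-sec=sys,rw=h1:h2,anon=0' into an object of key/value pairs.
function parseOptions(optString) {
  const opts = {};
  for (const pair of optString.split(',')) {
    const eq = pair.indexOf('=');
    if (eq === -1) continue;                     // skip anything that is not key=value
    opts[pair.slice(0, eq)] = pair.slice(eq + 1); // split on the first '=' only
  }
  return opts;
}

const parsed = parseOptions('-sec=sys,root=host5,ro=h3:h5,rw=h1:h2,anon=0');
console.log(parsed.rw, parsed.ro, parsed.root); // h1:h2 h3:h5 host5
```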

How to modify the content in the xml file except inside a particular tag

I am trying to modify the content of an XML file. If the pattern occurs inside a particular tag, it should not be converted. All other occurrences of the pattern in the rest of the file should be converted.
Here, I am planning to convert \d{4}\.\d{2} to <prv>\d{4}\.\d{2}</prv>. But the pattern within the <link> tag is also getting modified.
Input:
<abc>A change to a 1343.44 good of <link>subheading 1222.34</link> from
within that subheading or any 4545.56 other chapter.</abc>
Expected Output:
<abc>A change to a <prv>1343.44</prv> good of <link>subheading 1222.34</link> from
within that subheading or any <prv>4545.56</prv> other chapter.</abc>
Use a proper XML parser. Here's how I would proceed with XML::XSH2, a wrapper around XML::LibXML:
open file.xml ;
for my $text in //text() {
if $text/parent::link next ;
perl { $parts = [ split /(\d{4}\.\d{2})/, $text ] } ;
$text := insert text { shift @$parts } replace $text ;
while { @$parts } {
my $n = { shift @$parts } ;
my $t = { shift @$parts } ;
$t := insert text $t after $text ;
insert chunk concat('<prv>', $n, '</prv>') after $text ;
$text = $t ;
}
}
save :b ;
(\d{4}\.\d{2})(?!((?!<link>).)*<\/link>)
This will work if the contents of the link tags are uniform in nature.
See demo
http://regex101.com/r/pP3pN1/19
Regex Solution
The following regex will handle most situations. However, it won't handle a link element nested inside another link element:
$xml =~ s{
\b(\d{4}\.\d{2})\b
(?!
(?: (?!<link>). )*
</link>
)
}{<prv>$1</prv>}sgx;
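The lookahead can be tried against the input from the question. This is a JavaScript sketch of the same pattern (written without the x flag, which JavaScript lacks); it is subject to the same nesting caveat:

```javascript
const xml =
  '<abc>A change to a 1343.44 good of <link>subheading 1222.34</link> from\n' +
  'within that subheading or any 4545.56 other chapter.</abc>';

// Match NNNN.NN unless a </link> follows with no <link> opening in between,
// i.e. unless the number sits inside a link element.
const pattern = /\b(\d{4}\.\d{2})\b(?!(?:(?!<link>).)*<\/link>)/gs;

const result = xml.replace(pattern, '<prv>$1</prv>');
console.log(result);
```

The number inside the link element is left alone because the text right after it reaches </link> without passing a new <link>.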
XML::LibXML Solution
The much better solution is to use an actual XML Parser. The following uses XML::LibXML to parse the data and insert the prv tags according to your spec.
use strict;
use warnings;
use XML::LibXML;
my $xml = XML::LibXML->load_xml( IO => \*DATA );
for my $node ( $xml->findnodes('//*/text()') ) {
next if $node->nodePath() =~ m{/link/};
my $parent = $node->parentNode();
# Split on marked values
my @values = split /\b(\d{4}\.\d{2})\b/, $node->data;
$node->setData( shift @values );
while ( my ( $num, $text ) = splice @values, 0, 2 ) {
my $prv = XML::LibXML::Element->new('prv');
$prv->appendText($num);
$parent->insertAfter( $prv, $node );
$node = XML::LibXML::Text->new($text);
$parent->insertAfter( $node, $prv );
}
}
print $xml->toString(), "\n";
__DATA__
<root>
<abc>A change to a 1343.44 good of <link>subheading 1222.34</link> from
within that 1717.17 subheading or any 4545.56 other chapter.</abc>
</root>
Outputs:
<?xml version="1.0"?>
<root>
<abc>A change to a <prv>1343.44</prv> good of <link>subheading 1222.34</link> from
within that <prv>1717.17</prv> subheading or any <prv>4545.56</prv> other chapter.</abc>
</root>
The regex below matches all the numbers in the format \d{4}\.\d{2} except the ones inside the <link> tag.
Regex:
(\d{4}\.\d{2})(?!(?:(?!<\/link>|<link>).)*<\/link>)
Replacement string:
<prv>$1</prv>

Regular expression to match links containing "Google"

I want to use PHP regular expressions to match all the links which contain the word google. I've tried this:
$url = "http://www.google.com";
$html = file_get_contents($url);
preg_match_all('/<a.*(.*?)".*>(.*google.*?)<\/a>/i',$html,$links);
echo '<pre />';
print_r($links); // it should return 2 links 'About Google' & 'Go to Google English'
However it returns nothing. Why?
It is better to use XPath here:
$url="http://www.google.com";
$html=file_get_contents($url);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = "//a[contains(translate(text(), 'GOOGLE', 'google'), 'google')]";
// or just:
// $query = "//a[contains(text(),'Google')]";
$links = $xpath->query($query);
$links will be a DOMNodeList you can iterate.
You should use a DOM parser, because using regex on HTML documents can be painfully error-prone.
Try something like this:
//Disable displaying errors
libxml_use_internal_errors(TRUE);
$url="http://www.google.com";
$html=file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$n=0;
foreach ($doc->getElementsByTagName('a') as $a) {
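//check whether the anchor's href contains 'google' and print it out
//(strpos returns 0 for a match at the start, so compare against false explicitly)
if ($a->hasAttribute('href') && strpos($a->getAttribute('href'),'google') !== false ) {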
echo "Anchor" . ++$n . ': '. $a->getAttribute('href') . '<br>';
}
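The filtering step of that loop, taken on its own, can be sketched in JavaScript. The href values below are invented for illustration, and the check is made case-insensitive here as an assumption about what is wanted; note that a match at offset 0 counts like any other:

```javascript
// Hypothetical href values, as they might come out of a DOM parser.
const hrefs = [
  'google.com/about',
  'https://www.google.com/intl/en/',
  'https://example.org/',
  'https://GOOGLE.example/',
];

// indexOf returns -1 when there is no match and 0 for a match at the start,
// so compare against -1 rather than relying on truthiness.
const googleLinks = hrefs.filter(h => h.toLowerCase().indexOf('google') !== -1);

googleLinks.forEach((h, i) => console.log(`Anchor${i + 1}: ${h}`));
```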
}