Regular expression to match links containing "Google" - regex

I want to use PHP regular expressions to match all the links that contain the word "google". I've tried this:
$url = "http://www.google.com";
$html = file_get_contents($url);
preg_match_all('/<a.*(.*?)".*>(.*google.*?)<\/a>/i',$html,$links);
echo '<pre />';
print_r($links); // it should return 2 links 'About Google' & 'Go to Google English'
However it returns nothing. Why?

It is better to use XPath here:
$url="http://www.google.com";
$html=file_get_contents($url);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = "//a[contains(translate(text(), 'GOOGLE', 'google'), 'google')]";
// or just:
// $query = "//a[contains(text(),'Google')]";
$links = $xpath->query($query);
$links will be a DOMNodeList you can iterate.

You should use a DOM parser, because using regex on HTML documents can be painfully error-prone.
Try something like this:
//Disable displaying errors
libxml_use_internal_errors(TRUE);
$url="http://www.google.com";
$html=file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$n=0;
foreach ($doc->getElementsByTagName('a') as $a) {
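//check whether the anchor's href contains 'google' and print it out
//strpos() returns 0 for a match at the start of the string, so compare against false explicitly
if ($a->hasAttribute('href') && strpos($a->getAttribute('href'),'google') !== false) {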
echo "Anchor" . ++$n . ': '. $a->getAttribute('href') . '<br>';
}
}
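For comparison, here is a rough Python sketch of the same parser-based idea, using only the standard library's html.parser. The class name GoogleLinkCollector and the sample markup are made up for illustration; unlike the PHP loop above, it filters on the link text rather than the href, which is what the question asked for.

```python
from html.parser import HTMLParser


class GoogleLinkCollector(HTMLParser):
    """Collect (href, text) for every <a> whose text mentions 'google'."""

    def __init__(self):
        super().__init__()
        self.links = []          # matching (href, text) pairs
        self._in_anchor = False  # True while inside an open <a>
        self._href = None        # href of the <a> currently open
        self._text = []          # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            text = "".join(self._text).strip()
            if "google" in text.lower():  # case-insensitive check
                self.links.append((self._href, text))
            self._in_anchor = False


# Hypothetical sample markup standing in for the fetched page.
sample_html = """
<p><a href="/about">About <b>Google</b></a>
<a href="/privacy">Privacy</a>
<a href="/en">Go to Google English</a></p>
"""

collector = GoogleLinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Because the text is gathered from every data event inside the anchor, a link such as "About <b>Google</b>" is still found, which is where a single regex tends to break down.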

Related

Replacing sub strings with regex in powershell

I have the following regex patterns in my PowerShell script to identify URLs that I need to update:
'href[\s]?=[\s]?\"[^"]*(https:\/\/oursite.org\/[^"]*News and Articles[^"]*)+\"'
'href[\s]?=[\s]?\"[^"]*(https:\/\/oursite.org\/[^"]*en\/News-and-Articles[^"]*)+\"'
These get me the results I need to update. Now I need to know how to replace the values "News and Articles" with "news-and-articles" and "en" with "news-and-articles".
I have some code that has a replacement url like so:
$newUrl = 'href="https://oursite.org/"' #replaced value
So the beginning result would be:
https://www.oursite.org/en/News-and-Articles/2017/11/article-name
to be replaced with
https://www.oursite.org/news-and-articles/2017/11/article-name
Here is the function that is going through all the articles and doing a replacement:
function SearchItemForMatch
{
param(
[Data.Items.Item]$item
)
Write-Host "------------------------------------item: " $item.Name
foreach($field in $item.Fields) {
#Write-Host $field.Name
if($field.Type -eq "Rich Text") {
#Write-Host $field.Name
if($field.Value -match $pattern) {
ReplaceFieldValue -field $field -needle $pattern -replacement $newUrl
}
#if($field.Value -match $registrationPattern) {
# ReplaceFieldValue -field $field -needle $registrationPattern -replacement $newRegistrationUrl
#}
if($field.Value -match $noenpattern){
ReplaceFieldValue -field $field -needle $noenpattern -replacement $newnoenpattern
}
}
}
}
Here is the replacement method:
Function ReplaceFieldValue
{
param (
[Data.Fields.Field]$field,
[string]$needle,
[string]$replacement
)
Write-Host $field.ID
$replaceValue = $field.Value -replace $needle, $replacement
$item = $field.Item
$item.Editing.BeginEdit()
$field.Value = $replaceValue
$item.Editing.EndEdit()
Publish-Item -item $item -PublishMode Smart
$info = [PSCustomObject]@{
"ID"=$item.ID
"PageName"=$item.Name
"TemplateName"=$item.TemplateName
"FieldName"=$field.Name
"Replacement"=$replacement
}
[void]$list.Add($info)
}
Forgive me if I'm missing something, but it seems to me that all you really want to achieve is to get rid of the /en part and then convert the whole URL to lowercase.
Given your example url, this could be as easy as:
$url = 'https://www.oursite.org/en/News-and-Articles/2017/11/article-name'
$replaceValue = ($url -replace '/en/', '/').ToLower()
Result:
https://www.oursite.org/news-and-articles/2017/11/article-name
If it involves more elaborate replacements, then please edit your question and give us more examples and desired output.
Try Regex: (?<=oursite\.org\/)(?:en\/)?News-and-Articles(?=\/)
Replace with news-and-articles
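To see what that pattern does, here is a small Python sketch that applies it to the example URL from the question. The function name rewrite_url is made up for illustration; the pattern itself is the one suggested above, with the forward slashes left unescaped because Python does not need them escaped.

```python
import re

# Pattern from the answer above. The lookbehind is fixed-width,
# which Python's re module requires.
PATTERN = re.compile(r"(?<=oursite\.org/)(?:en/)?News-and-Articles(?=/)")


def rewrite_url(url):
    """Drop the optional 'en/' segment and lowercase the section name."""
    return PATTERN.sub("news-and-articles", url)


old = "https://www.oursite.org/en/News-and-Articles/2017/11/article-name"
print(rewrite_url(old))
# https://www.oursite.org/news-and-articles/2017/11/article-name
```

Note that this pattern only covers the hyphenated "News-and-Articles" form; URLs containing "News and Articles" with spaces would need a separate alternative.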

Join, split and map using perl for creating new attribs

my $str = "<SampleElement oldattribs=\"sa1 sa2 sa3\">";
$str =~ s#<SampleElement[^>]*oldattribs="([^"]*)"#
my $fulcnt=$&;
my $afids=$1;
my @affs = ();
if($afids =~ m/\s+/) {
@affs = split /\s/, $afids;
my $jnafs = join ",", map { $_=~s/[a-z]*//i, } @affs;
($fulcnt." newattribs=\"$jnafs\"");
}
else {
($fulcnt);
}
#eg;
My Output:
<SampleElement oldattribs="sa1 sa2 sa3" newattribs="1,1,1">
Expected Output:
<SampleElement oldattribs="sa1 sa2 sa3" newattribs="1,2,3">
Could someone point out where I am going wrong? Thanks in advance.
Where you're going wrong is earlier than you think - you're parsing XML using regular expressions. XML is contextual, and regex isn't, so it's NEVER going to be better than a dirty hack.
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig -> parse ( \*DATA );
my $sample_elt = $twig -> get_xpath('//SampleElement',0);
my @old_att = split ( ' ', $sample_elt -> att('oldattribs') );
$sample_elt -> set_att('newattribs', join " ", map { /(\d+)/ } @old_att);
$twig -> set_pretty_print ( 'indented_a' );
$twig -> print;
__DATA__
<XML>
<SampleElement oldattribs="sa1 sa2 sa3">
</SampleElement>
</XML>
But to answer the core of your problem - you're misusing map as an iterator here.
map { $_=~s/[a-z]*//i, } @affs;
That iterates over all the elements in @affs and modifies them in place, but map just returns the result of the expression, which here is the return value of s/// (1, because the substitution succeeded) rather than the modified string.
If you want to change @affs, you'd write:
s/[a-z]*//i for @affs;
But if you don't want to modify it, then the easy answer is to use the r regex flag (Perl 5.14 or later), which returns the modified copy:
map { s/[a-z]*//ir } @affs;
Or as I've done in my example:
map { /(\d+)/ } @affs;
This matches and captures the numeric part of each string; in list context a match returns its captured text, so that is what map returns.
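The capture-and-join approach translates directly to other languages. As a rough illustration, here is the same logic in Python; the function name new_attribs is hypothetical, and the comma separator follows the expected output in the question.

```python
import re


def new_attribs(old_attribs):
    """Build the newattribs value from an oldattribs value like 'sa1 sa2 sa3'."""
    numbers = []
    for token in old_attribs.split():      # split on whitespace
        match = re.search(r"\d+", token)   # first run of digits in the token
        if match:                          # tokens without digits are skipped
            numbers.append(match.group())
    return ",".join(numbers)


print(new_attribs("sa1 sa2 sa3"))  # 1,2,3
```

The point is the same as in the Perl version: collect the value you want from each element, instead of relying on the return value of an in-place substitution.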
Here is a simple way to build shown output from the input $str.
Note: The input is in single quotes, not double. Then the \" isn't a problem in the regex.
my $str = '<SampleElement oldattribs=\"sa1 sa2 sa3\">';
# Pull 'sa1 sa2 sa3' string out of it
my ($attrs) = $str =~ /=\\"([^\\]+)/; # " # (turn off bad syntax highlight)
# Build '1,2,3' string from it
my $indices = join ',', map { /(\d+)/ } split ' ', $attrs;
# Extract content between < > so to add to it, put it back together
my ($content) = $str =~ /<(.*)>/;
my $output = '<' . $content . " newattribs=\"$indices\"" . '>';
This gives the required output.
Some of these can be combined into single statements, if you are into that. For example
my $indices =
join ',', map { /(\d+)/ } split ' ', ($str =~ /"([^\\]+)/)[0]; # "
$str =~ s/<(.*)>/<$1 newattribs=\"$indices\">/;
All of this can be rolled into one regex, but it becomes just unwieldy and hard to maintain.
Above all – this appears to be XML or such ... please don't do it by hand, unless there is literally just a snippet or two. There are excellent parsers.
I found a solution by reading up on the map function:
my $str = "<SampleElement oldattribs=\"sa1 sa2 sa3\">";
$str=~s#<SampleElement[^>]*oldattribs="([^"]*)"#my $fulcnt=$&; my $afids=$1;
my @affs = ();
if($afids=~m/\s+/)
{
@affs = split /\s/, $afids;
my @newas = join ",", map { (my $foo = $_) =~ s/[a-z]*//i; $foo; } @affs ;
($fulcnt." newattribs=\"@newas\"");
}
else
{
($fulcnt);
}
#eg;
I have updated this line in my code:
my @newas = join ",", map { (my $foo = $_) =~ s/[a-z]*//i; $foo; } @affs ;
It replaces:
my $jnafs = join ",", map { $_=~s/[a-z]*//i, } @affs;
It's working now. Thanks, all.

Regex building Help required

I need help building a regular expression that handles HTML tags, repeated patterns, etc. See http://regex101.com/r/iD3xT7/1
I have done part of it already, but when I try to repeat the pattern <a\s[^<>]*>([^<>]*)<\/a>\s the repetition fails, as if it needed to be recursive. I need the complete pattern for this.
Warning: you shouldn't use regex for HTML parsing, as has already been said many times on SO.
That said, you won't be able to repeat the hyperlink pattern only.
For better clarity, you should extract each kind of data with its own regex. Example in PHP:
$html = /* choose your way to retrieve the HTML */;
$movies = array();
preg_match('/Released:.*?<td>(.+?)<\/td>/s', $html, $matches);
$movies['lucy']['released'] = $matches[1];
preg_match('/Runtime:.*?<td>(.+?)<\/td>/s', $html, $matches);
$movies['lucy']['runtime'] = $matches[1];
preg_match_all('/<a[^>]*?genre[^>]*?>(.+?)<\/a>/', $html, $matches);
$movies['lucy']['genres'] = $matches[1];
var_dump($movies);
/*
array(1) {
["lucy"]=>
array(3) {
["released"]=>
string(13) "July 25, 2014"
["runtime"]=>
string(8) "90 mins "
["genres"]=>
array(2) {
[0]=>
string(6) "Action"
[1]=>
string(6) "Sci-Fi"
}
}
}
*/
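The "one regex per field" idea is not PHP-specific. Below is a rough Python sketch of the same approach; the sample markup and the helper first_group are invented for illustration, since the real page from the question is not shown here.

```python
import re

# Hypothetical markup, shaped like what the patterns above expect.
html = """
<tr><th>Released:</th><td>July 25, 2014</td></tr>
<tr><th>Runtime:</th><td>90 mins</td></tr>
<tr><th>Genres:</th><td><a href="/genre/action">Action</a>
<a href="/genre/sci-fi">Sci-Fi</a></td></tr>
"""


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text, re.DOTALL)  # DOTALL mirrors PHP's /s flag
    return match.group(1) if match else None


movie = {
    "released": first_group(r"Released:.*?<td>(.+?)</td>", html),
    "runtime": first_group(r"Runtime:.*?<td>(.+?)</td>", html),
    # findall plays the role of preg_match_all for the repeated links
    "genres": re.findall(r"<a[^>]*?genre[^>]*?>(.+?)</a>", html),
}
print(movie)
```

Keeping each field in its own small pattern means a layout change in one table row only breaks one extraction, not the whole match.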

Escaping regex pattern in string while using regex on the same

My script needs to insert the pattern (?:<\/?[a-z\-\=\"\ ]+>)? after each letter of a word, so that the result can be used in another regular expression.
The problem is that some words may themselves contain regex patterns like .*? or (?:<[a-z\-]+>). My attempt throws an "unmatched" regex error when my pattern is added after a ( or after a space inside such a regex. Any help is appreciated.
Here is the code I tried:
sub process_info{
my $process_mod = shift;
#print "$process_mod\n";
@b = split('',$process_mod);
my $flag;
for my $i(@b){
#print "######## flag: $flag test: $i\n";
$i = "$i".'(?:<\/?[a-z\-\=\"\ ]>)?' if $flag == 0 and $i !~ /\\|\(|\)|\:|\?|\[|\]/;
#print "$i";
if ($i =~ /\\|\(|\)|\:|\?|\[|\]/){
$flag = 1;
}
else{
$flag = 0;
}
#print "After: $i\n";
}
$process_mod = join('',@b);
#print "$process_mod\n";
return $process_mod;
}
You want to search for a certain plaintext in an XML file. You try to do this by inserting a regex for an XML tag between each character. This is wasteful, but it can be easily done by escaping all metacharacters in the input with the quotemeta function:
sub make_XML_matchable {
my ($string) = @_; # list assignment; a scalar assignment would store the argument count
my $xml_tag = qr{ ... }; # I won't write that regex for you
my $combined = join $xml_tag, map quotemeta, split //, $string;
return qr/$combined/; # return a compiled regex
}
This assumes that you'd want to write a regex that can match XML tags – not impossible, but tedious and difficult to do correctly. Use an XML parser instead to strip all tags from a section:
use XML::LibXML;
my $dom = XML::LibXML->load_xml(string => $xml);
my $text_content = $dom->textContent; # all tags are gone
Or if you're actually trying to match HTML, then you might want to use Mojolicious:
use Mojo::DOM;
my $dom = Mojo::DOM->new($html);
my $text_content = $dom->all_text; # all tags are replaced by a space
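For readers more at home in Python, the escape-then-join idea from the first snippet above could be sketched as follows. re.escape plays the role of quotemeta; the TAG pattern is a deliberately simplified stand-in (an assumption, not a real XML tag matcher), and make_tag_tolerant is a hypothetical name.

```python
import re

# Simplified optional-tag pattern for illustration only;
# real XML/HTML tags are much harder to match correctly.
TAG = r'(?:</?[a-z\-=" ]+>)?'


def make_tag_tolerant(text):
    """Compile a regex matching `text` with an optional tag between characters."""
    # Escape every character so metacharacters in the input match literally,
    # then join the escaped characters with the optional tag pattern.
    return re.compile(TAG.join(re.escape(ch) for ch in text))


pattern = make_tag_tolerant("a.b")
print(bool(pattern.search("a<b>.</b>b")))  # True
print(bool(pattern.search("axb")))         # False: the dot is literal
```

Escaping before joining is what prevents input such as ( or .*? from being interpreted as regex syntax, which was the source of the "unmatched" error in the question.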
At the beginning of the foreach loop, use this:
for my $i(@b){
$i = quotemeta $i;
$i .= '(?:<\/?[a-z\-\=\"\ ]>)?' if $flag == 0 and $i !~ /[\\|():?[\]]/;
# don't escape __^

adding regex capabilities on simple cgi search

I have this simple CGI script working just fine, but I want to add regex capabilities. Is that possible? If so, what do I need to add? Thanks.
#!/usr/local/bin/perl
read(STDIN, $buffer,$ENV{'CONTENT_LENGTH'});
@pairs = split(/&/, $buffer);
foreach $pair (@pairs) {
($key, $value) = split(/=/, $pair);
foreach $pair (@pairs) {
($key, $value) = split(/=/, $pair);
$value =~ tr/+/ /;
$value =~ s/%([a-zA-Z0-9][a-zA-Z0-9])/pack("C", hex($1))/eg;
$formdata{$key}.= "$value";
}
}
$search = $formdata{'search'};
open(INFO, "/test/myfile");
@array=<INFO>;
close (INFO);
...code truncate
To find lines that end with ".cgi":
my @array = grep /\.cgi$/, <INFO>;
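If the pattern comes from the search form rather than being hard-coded, it is worth guarding against malformed input. Here is a minimal Python sketch of that idea; search_lines and the sample lines are hypothetical, and returning an empty list for a bad pattern is just one possible policy.

```python
import re


def search_lines(lines, user_pattern):
    """Return the lines matching user_pattern, or [] if the pattern is invalid."""
    try:
        regex = re.compile(user_pattern)
    except re.error:
        # A malformed pattern from the form should not crash the script.
        return []
    return [line for line in lines if regex.search(line)]


# Hypothetical file contents standing in for /test/myfile.
lines = ["index.cgi\n", "readme.txt\n", "search.cgi\n"]
print(search_lines(lines, r"\.cgi$"))  # ['index.cgi\n', 'search.cgi\n']
print(search_lines(lines, "("))        # []
```

The same caution applies in Perl: a user-supplied pattern can be compiled inside an eval block so that a syntax error in the pattern is caught instead of aborting the request.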