How do you find the following div using regex? The URL and image location will consistently change based on the post URL, so I need to use a wild card.
I must use a regular expression because I am limited in what I can use due to the software I am using: http://community.autoblogged.com/entries/344640-common-search-and-replace-patterns
<div class="tweetmeme_button" style="float: right; margin-left: 10px;"> <br /> <img src="http://api.tweetmeme.com/imagebutton.gif?url=http%3A%2F%2Fjumpinblack.com%2F2011%2F11%2F25%2Fdrake-and-rick-ross-you-only-live-once-ep-mixtape-2011-download%2F&source=jumpinblack1&style=compact&b=2" height="61" width="50" /><br /> </div>
I tried using
<div class="tweetmeme_button" style="float: right; margin-left: 10px;">.*<\/div>
Using regular expressions to process HTML is a bad idea. I would use HTML::TreeBuilder::XPath for this:
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->get("http://www.someURL.com");
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content() );
my $div = $tree->findnodes( '//div[@class="tweetmeme_button"]')->[0];
Use an HTML parser to parse HTML.
HTML::TokeParser::Simple or HTML::TreeBuilder::XPath among many others.
E.g.:
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new( ... );
while (my $div = $parser->get_tag) {
next unless $div->is_start_tag('div');
{
no warnings 'uninitialized';
next unless $div->get_attr('class') eq 'tweetmeme_button';
next unless $div->get_attr('style') eq 'float: right; margin-left: 10px;';
# now do what you want until the next </div>
}
}
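If the tool really only accepts a regular expression, one likely reason the attempt in the question misbehaves is that .* is greedy: depending on the engine it either stops at the first line break or runs on to the last </div> of the page. Below is a minimal sketch using Python's re module. The sample HTML and the example.com URL are made up for illustration, and whether AutoBlogged supports a lazy quantifier and a dot-matches-newline flag is an assumption to check.

```python
import re

# Made-up sample page; only the tweetmeme div mirrors the question.
html = (
    '<p>before</p>'
    '<div class="tweetmeme_button" style="float: right; margin-left: 10px;"> <br /> '
    '<img src="http://api.tweetmeme.com/imagebutton.gif?url=http%3A%2F%2Fexample.com%2Fpost%2F'
    '&source=demo&style=compact&b=2" height="61" width="50" /><br /> </div>'
    '<div class="content">kept</div>'
)

# [^>]* skips the rest of the opening tag, .*? stops at the FIRST </div>,
# and re.DOTALL lets the dot cross line breaks.
pattern = re.compile(r'<div class="tweetmeme_button"[^>]*>.*?</div>', re.DOTALL)

cleaned = pattern.sub("", html)
print(cleaned)
```

This only works because the button div contains no nested div; with nested divs the lazy match would stop too early, which is one reason the answers above recommend a parser.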
use WWW::Mechanize;
mkdir "images";
$url = "https://www.somedomain.com/";
$mech = new WWW::Mechanize;
$mech->get($url);
$num = 1;
$year = 2019;
$number = 23;
$content = q{<P><div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092a.gif"><img src="/image/SG0092a.gif" alt="graphic image" class="img-responsive graphic"/></a></div><div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092b.gif"><img src="/image/SG0092b.gif" alt="graphic image" class="img-responsive graphic"/></a></div><div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092c.gif"><img src="/image/SG0092c.gif" alt="graphic image" class="img-responsive graphic"/></a></div><div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092d.gif"><img src="/image/SG0092d.gif" alt="graphic image" class="img-responsive graphic"/></a></div><div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092e.gif"><img src="/image/SG0092e.gif" alt="graphic image" class="img-responsive graphic"/></a></div>};
while ($content =~ s/(<img.+?src=)"([^>]+?)\.([A-Za-z]+)"/$1"images\/${year}_${number}_$num.$3"/g)
{
$imageuri = "$2.$3";
print $imageuri, "\n";
$mech->get($imageuri);
$mech->save_content("images/${year}_${number}_$num.$3");
$num++;
}
print $content, "\n";
Is it possible to do the above in perl? I would like the src attributes of the img elements replaced with a new path and filename and for the image files to be downloaded and saved with that path and filename.
You could do the following (but you should really consider using a real HTML parser):
$content =~ s{(<img.+?src=)"([^>]+?)\.([A-Za-z]+)"}{
my $imageuri = "$2.$3";
print $imageuri, "\n";
$mech->get($imageuri);
my $file = "images/${year}_${number}_$num.$3";
$num++;
$mech->save_content($file);
qq($1"$file")
}eg;
The e modifier on the substitution operator makes perl parse the replacement part as a block of code, not a string.
Other notes:
Always start your Perl files with use strict; use warnings; or equivalent (e.g. use strict can be replaced by use v5.12.0 or higher).
Avoid indirect object syntax (new WWW::Mechanize). Use normal method calls instead (WWW::Mechanize->new).
Use local variables (e.g. my $num = 1;) unless you really need package variables.
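For comparison, the same "replacement computed by code" idea can be sketched in Python, where re.sub accepts a function instead of a replacement string. This is only an illustration under assumptions: the names rename and downloads are hypothetical, the sample content is shortened from the question, and the actual download is left out so the snippet runs offline.

```python
import re

# Shortened sample based on the markup in the question.
content = (
    '<a href="/image/SG0092a.gif"><img src="/image/SG0092a.gif" alt="graphic image"/></a>'
    '<a href="/image/SG0092b.gif"><img src="/image/SG0092b.gif" alt="graphic image"/></a>'
)
year, number = 2019, 23
downloads = []  # (original src, new path) pairs; fetching them is not shown


def rename(match):
    """Called once per match; its return value becomes the replacement text."""
    ext = match.group(3)
    new_path = f"images/{year}_{number}_{len(downloads) + 1}.{ext}"
    downloads.append((f"{match.group(2)}.{ext}", new_path))
    return f'{match.group(1)}"{new_path}"'


IMG_SRC = re.compile(r'(<img[^>]+?src=)"([^">]+?)\.([A-Za-z]+)"')
rewritten = IMG_SRC.sub(rename, content)

print(rewritten)
print(downloads)
```

As in the Perl version, the href attributes are left untouched and the counter advances once per image.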
Here is one way to do it with an HTML parser, HTML::TreeBuilder.
This changes the src attribute to the new value in the processed node and replaces that node in the tree with the changed copy, for all img tags.
use warnings;
use strict;
use feature 'say';
use HTML::TreeBuilder;
my $content = join '', <DATA>; # join in general (not needed with one line)
my ($num, $year, $number) = (1, 2019, 23);
my $new_src_base = "images/${year}_${number}_$num";
my $tree = HTML::TreeBuilder->new_from_content($content);
my @nodes = $tree->look_down(_tag => 'img');
for my $node (@nodes) {
my ($ext) = $node->attr('src') =~ m{.*/.*\.(.*)\z}; #/
my $orig_src = $node->attr('src', $new_src_base . ".$ext"); # change 'src'
$node->replace_with($node);
# my $imageurl = $orig_src; # fetch the image etc...
# $mech->get($imageurl);
}
say $tree->as_HTML; # to inspect; otherwise print to file
__DATA__
<P><div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092a.gif"> <img src="/image/SG0092a.gif" alt="graphic image" class="img-responsive graphic"/></a></div> <div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092b.gif"> <img src="/image/SG0092b.gif" alt="graphic image" class="img-responsive graphic"/></a></div> <div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092c.gif"> <img src="/image/SG0092c.gif" alt="graphic image" class="img-responsive graphic"/></a></div> <div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092d.gif"> <img src="/image/SG0092d.gif" alt="graphic image" class="img-responsive graphic"/></a></div> <div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092e.gif"> <img src="/image/SG0092e.gif" alt="graphic image" class="img-responsive graphic"/></a></div>
For the new name of src attribute I copy what I can infer from the OP. The code in the question leaves href attribute of the link unchanged (path to the same gif) so this code leaves that, too.
There are other tools to do this with, see this post for more, for example.
The above could perhaps run into problems related to weak references in older versions; see the documentation. In that case this should be safer:
for my $node (@nodes) {
my ($ext) = ( $node->attr('src') ) =~ m{.*/.*\.(.*)\z}; #/
my $copy = $node->clone;
my $orig_src = $copy->attr('src', $new_src_base . ".$ext");
$node->replace_with($copy)->delete;
...
}
Using Mojo::DOM:
use strict;
use warnings;
use Mojo::DOM;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
my $dom = Mojo::DOM->new($content);
my $num = 1;
foreach my $img ($dom->find('img[src]')->each) {
next unless $img->{src} =~ m/\.([a-zA-Z]+)\z/;
my $ext = $1;
my $path = "images/${year}_${number}_$num.$ext";
$ua->get($img->{src})->result->save_to($path);
$img->attr(src => $path);
$num++;
}
print $dom->to_string;
I am trying to do a find-and-replace with a regex on the following HTML:
<div class="gallery-image-container">
<div jstcache="1116"
class="gallery-image-high-res loaded"
style="width: 396px;
height: 264px;
background-image: url(&quot;https://lh5.googleusercontent.com/p/AF1QipMcTfMPZj_d5iip9WKtN2SQB9Je5U4rRB0nT_t8=s396-k-no&quot;);
background-size: 396px 264px;"
jsan="7.gallery-image-high-res,7.loaded,5.width,5.height,5.background-image,5.background-size">
</div>
</div>
On the code above I used this regex:
(https:\/\/[^&]*)
to extract this URL:
https://lh5.googleusercontent.com/p/AF1QipMcTfMPZj_d5iip9WKtN2SQB9Je5U4rRB0nT_t8=s396-k-no
I then used the regex s\d{3} to get s396.
Now I want to replace s396 with s1000 in the URL.
I am stuck and don't know how to go about it.
Is there any way all of this can be done in a single regex rather than several?
I would suggest using an HTML parser, but I understand that is sometimes not possible. Here is a little example in Python.
import re
data = '''
<div class="gallery-image-container">
<div jstcache="1116"
class="gallery-image-high-res loaded"
style="width: 396px;
height: 264px;
background-image: url(&quot;https://lh5.googleusercontent.com/p/AF1QipMcTfMPZj_d5iip9WKtN2SQB9Je5U4rRB0nT_t8=s396-k-no&quot;);
background-size: 396px 264px;"
jsan="7.gallery-image-high-res,7.loaded,5.width,5.height,5.background-image,5.background-size">
</div>
</div>
'''
match = re.search("(https?://[^&]+)", data)
url = match.group(1)
url = re.sub(r"s\d{3}", "s1000", url)  # raw string so \d reaches the regex engine unchanged
print(url)
The key part is the regex
(https?://[^&]+)
It uses a negated character class: it looks for http with an optional s, followed by ://, and then every character that is not an &. You can use this site to play around with regexes:
https://regex101.com/r/b0APFA/1
I'm sure you could do a clever 1 liner nested regex to find and replace all at once, but it's going to be easier to troubleshoot if you have a few lines.
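For what it is worth, a single substitution is possible if the URL prefix is captured and put back. The sketch below assumes the size marker always appears as =s followed by digits and that the URL is terminated by an & (as in &quot;); both are assumptions drawn from the one sample in the question, and the style string is shortened for illustration.

```python
import re

# Shortened style attribute from the question, with the URL wrapped in &quot;.
style = (
    'background-image: url(&quot;https://lh5.googleusercontent.com/p/'
    'AF1QipMcTfMPZj_d5iip9WKtN2SQB9Je5U4rRB0nT_t8=s396-k-no&quot;);'
)

# Group 1 keeps everything from the scheme up to and including "=s";
# \d+ consumes the old size so it can be replaced in the same pass.
pattern = re.compile(r'(https?://[^&]+?=s)\d+')

resized = pattern.sub(r'\g<1>1000', style)
print(resized)
```

The \g<1> form is used instead of \1 because \11000 would be read as a reference to a different group number.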
I've got a bunch of strings already separated from an HTML file, examples:
<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
I am trying to write a regular expression that will grab the src="URL" part of the img tag so that I can replace it later based on a few other conditions. The many quotation marks are giving me the biggest problem. I'm still relatively new to regex, so a lot of the tricks are beyond my knowledge.
Thanks in advance
Use DOM or another parser for this, don't try to parse HTML with regular expressions.
Example:
$html = <<<DATA
<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
DATA;
$doc = new DOMDocument();
$doc->loadHTML($html); // load the html
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//img');
foreach ($imgs as $img) {
echo $img->getAttribute('src') . "\n";
}
Output
//i.imgur.com/tApg8ebb.jpg
//i.imgur.com/SwmwL4Gb.jpg
//s.imgur.com/images/blog_rss.png
If you would rather store the results in an array, you could do:
foreach ($imgs as $img) {
$sources[] = $img->getAttribute('src');
}
print_r($sources);
Output
Array
(
[0] => //i.imgur.com/tApg8ebb.jpg
[1] => //i.imgur.com/SwmwL4Gb.jpg
[2] => //s.imgur.com/images/blog_rss.png
)
$pattern = '~<img.+src="([\w/\._\-]+)"~'; // ~ as delimiter, because the pattern itself contains an unescaped /
I'm not sure which language you're using, so quote syntax will vary.
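Since the language was left open, here is a hedged sketch of the parser approach using only Python's standard library html.parser. The class name ImgSrcCollector is hypothetical, and the first sample tag is slightly shortened from the question. Because the parser understands quoting, the nested tags and quotation marks inside the title attribute do not get in the way.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


html = """<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> points</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">"""

collector = ImgSrcCollector()
collector.feed(html)
collector.close()
print(collector.sources)
```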
I tried really hard to find a solution but couldn't. Regex is way too complex for me. Anyway, here is the problem.
Objective:
I want to replace image links with CDN image links in PHP. To do that, I thought it best to use preg_replace.
If a link is /var/b.png or http://www.example.com/png, it should be replaced with the CDN link, but if the src or class contains 'captcha' it shouldn't be, as those images are dynamic.
To start, I am trying:
$_SERVER["HTTP_HOST"] = 'www.bring.com';
$preg_host = preg_quote($_SERVER["HTTP_HOST"], '/');
$content = preg_replace('/((\<image\s+.*?src\=)(["\']http\:\/\/'.$preg_host.')(\/.*?["\'](^(?=.*(captcha)))(.*)?\>))/i', '$2$3.nyud.net:8080$4', $content);
$content = preg_replace('/(\<image\s+.*?src\=["\'])(\/.*?["\'].*?\>)/i', '$1http://'.$_SERVER['HTTP_HOST'].'.nyud.net:8080$2', $content);
The condition is:
When not to replace: the src can contain the word "captcha", and in some cases the class contains "captcha"; the class can come before or after the src, which makes it more complicated. In these cases I don't want to replace the links. For example:
$content = <<<END
<image
type="image" src="/skins/bph/customer/images/icons/go.gif" alt="Search" title="Search" class="go-button" />
<image
id="verification_image_login_login_popup_form" src="http://www.bring.com/index.php?dispatch=image.captcha&verification_id=%3Alogin_login_popup_form&login_login_popup_form4ef33269bf30b=" alt="" onclick="this.src += 'reload' ;" width="100" height="25" class="image-captcha valign" /></p><div
class="clear">
<image
id="verification_image_login_login_popup_form" class="valign" src="http://www.bring.com/skins/bph/customer/images/icons/go.gif" alt="" onclick="this.src += 'reload' ;" width="100" height="25" /></p><div
class="clear">
END;
So as a result:
The captcha image shouldn't be replaced, but the opposite is happening :(
The following should be replaced, as it doesn't have a class or a link with the word captcha in it:
<image
id="verification_image_login_login_popup_form" class="valign" src="http://www.bring.com/skins/bph/customer/images/icons/xxx" alt="" onclick="this.src += 'reload' ;" width="100" height="25" /></p>
Rather than trying to solve the whole problem with regex magic (which can bite you at unexpected times), it is highly recommended to use PHP's DOM parser.
With the DOM parser, iterate through all the images, examine their src and class attributes, and modify the links as needed.
You can find plenty of examples of using DOM if you search here on SO or on Google.
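The "examine src and class, then decide" logic can be sketched as follows. This is an illustration in Python, not the DOM approach itself: it still finds tags with a regex, but moves the captcha decision into ordinary code so that attribute order no longer matters. The host and the .nyud.net:8080 suffix come from the question; the function names to_cdn, rewrite_tag and rewrite are hypothetical, and the tag matching assumes no > inside attribute values.

```python
import re

HOST = "www.bring.com"         # host from the question
CDN_SUFFIX = ".nyud.net:8080"  # CDN suffix from the question

TAG_RE = re.compile(r"<image\b[^>]*>", re.IGNORECASE)
ATTR_RE = re.compile(r"""\b(src|class)\s*=\s*(["'])(.*?)\2""", re.IGNORECASE | re.DOTALL)
SRC_RE = re.compile(r"""(\bsrc\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE | re.DOTALL)


def to_cdn(url):
    """Map a same-host absolute URL or a root-relative path to the CDN host."""
    prefix = "http://" + HOST
    if url.startswith(prefix + "/"):
        return prefix + CDN_SUFFIX + url[len(prefix):]
    if url.startswith("/") and not url.startswith("//"):
        return prefix + CDN_SUFFIX + url
    return url  # anything else is left alone


def rewrite_tag(match):
    tag = match.group(0)
    attrs = {m.group(1).lower(): m.group(3) for m in ATTR_RE.finditer(tag)}
    # Skip the tag if src or class mentions "captcha", wherever they appear.
    if any("captcha" in value.lower() for value in attrs.values()):
        return tag
    return SRC_RE.sub(
        lambda m: m.group(1) + m.group(2) + to_cdn(m.group(3)) + m.group(2),
        tag,
        count=1,
    )


def rewrite(content):
    return TAG_RE.sub(rewrite_tag, content)


print(rewrite('<image class="valign" src="http://www.bring.com/skins/go.gif" />'))
```

A real DOM parser remains the more robust choice; this only shows that the decision is easier to express in code than inside one pattern.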
I have code like this:
<div class="rgz">
<div class="xyz">
</div>
<div class="ckh">
</div>
</div>
The class ckh won't appear every time. Can someone suggest a regex to get the data of the div rgz? The data inside ckh is not needed, and that div won't always be present.
Thanks in advance
@diEcho and @Dve are correct, you should learn to use something like the native DOMDocument class rather than using regex. Your code will be easier to read and maintain, and will handle malformed HTML much better.
Here is some sample code which may or may not do what you want:
$contents = '';
$doc = new DOMDocument();
$doc->loadHTMLFile($page_url); // load() expects XML; loadHTMLFile() parses HTML
$nodes = $doc->getElementsByTagName('div');
foreach ($nodes as $node)
{
if($node->hasAttributes()){
$attributes = $node->attributes;
if(!is_null($attributes)){
foreach ($attributes as $index=>$attr){
if($attr->name == 'class' && $attr->value == 'rgz'){
$contents .= $node->nodeValue;
}
}
}
}
}
Regex is probably not your best option here.
A JavaScript library such as jQuery lets you use CSS selectors to get to the element you require, with something like
$('.rgz').children().last().html()
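As a language-neutral illustration of the parser route, the sketch below uses Python's standard html.parser to collect the text inside div.rgz while skipping anything inside div.ckh, whether or not that div is present. The class name RgzText and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class RgzText(HTMLParser):
    """Collect text inside <div class="rgz">, ignoring <div class="ckh">."""

    def __init__(self):
        super().__init__()
        self.stack = []   # class lists of the currently open <div> elements
        self.chunks = []  # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append((dict(attrs).get("class") or "").split())

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        inside_rgz = any("rgz" in classes for classes in self.stack)
        inside_ckh = any("ckh" in classes for classes in self.stack)
        if inside_rgz and not inside_ckh and data.strip():
            self.chunks.append(data.strip())


def rgz_text(html):
    parser = RgzText()
    parser.feed(html)
    parser.close()
    return parser.chunks


with_ckh = '<div class="rgz"><div class="xyz">keep</div><div class="ckh">skip</div></div><div class="other">outside</div>'
without_ckh = '<div class="rgz"><div class="xyz">keep</div></div>'
print(rgz_text(with_ckh), rgz_text(without_ckh))
```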