use WWW::Mechanize;
mkdir "images";
$url = "https://www.somedomain.com/";
$mech = new WWW::Mechanize;
$mech->get($url);
$num = 1;
$year = 2019;
$number = 23;
$content = q{<P><div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092a.gif"><img src="/image/SG0092a.gif" alt="graphic image" class="img-responsive graphic"/></a></div><div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092b.gif"><img src="/image/SG0092b.gif" alt="graphic image" class="img-responsive graphic"/></a></div><div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092c.gif"><img src="/image/SG0092c.gif" alt="graphic image" class="img-responsive graphic"/></a></div><div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092d.gif"><img src="/image/SG0092d.gif" alt="graphic image" class="img-responsive graphic"/></a></div><div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092e.gif"><img src="/image/SG0092e.gif" alt="graphic image" class="img-responsive graphic"/></a></div>};
while ($content =~ s/(<img.+?src=)"([^>]+?)\.([A-Za-z]+)"/$1"images\/${year}_${number}_$num.$3"/g)
{
$imageuri = "$2.$3";
print $imageuri, "\n";
$mech->get($imageuri);
$mech->save_content("images/${year}_${number}_$num.$3");
$num++;
}
print $content, "\n";
Is it possible to do the above in perl? I would like the src attributes of the img elements replaced with a new path and filename and for the image files to be downloaded and saved with that path and filename.
You could do the following (but you should really consider using a real HTML parser):
$content =~ s{(<img.+?src=)"([^>]+?)\.([A-Za-z]+)"}{
my $imageuri = "$2.$3";
print $imageuri, "\n";
$mech->get($imageuri);
my $file = "images/${year}_${number}_$num.$3";
$num++;
$mech->save_content($file);
qq($1"$file")
}eg;
The e modifier on the substitution operator makes perl parse the replacement part as a block of code, not a string.
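The same "replacement computed by code" idea exists outside Perl as well. As a rough illustration only, here is a minimal Python sketch where the replacement is a callback that numbers each image and records what would need downloading. The function name renumber_images and the shortened sample HTML are made up for this example, and the actual download step is left out.

```python
import re

# Captures: (1) everything up to src=, (2) path without extension, (3) extension
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=)"([^">]+?)\.([A-Za-z]+)"')


def renumber_images(html, year, number, start=1):
    """Rewrite each img src and return (new_html, [(old_src, new_path), ...])."""
    downloads = []
    counter = start

    def replace(match):
        # Called once per match, like the code block of Perl's s{...}{...}e
        nonlocal counter
        original = f"{match.group(2)}.{match.group(3)}"
        new_path = f"images/{year}_{number}_{counter}.{match.group(3)}"
        counter += 1
        downloads.append((original, new_path))
        return f'{match.group(1)}"{new_path}"'

    return IMG_SRC.sub(replace, html), downloads


# Hypothetical, shortened sample input
html = ('<a href="/image/SG0092a.gif"><img src="/image/SG0092a.gif" alt="x"/></a>'
        '<img src="/image/SG0092b.gif"/>')
new_html, downloads = renumber_images(html, 2019, 23)
print(new_html)
print(downloads)
```

The callback runs once per match, so the counter advances per image, and the href attributes are left alone because the pattern only starts at an img tag.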
Other notes:
Always start your Perl files with use strict; use warnings; or equivalent (e.g. use strict can be replaced by use v5.12.0 or higher).
Avoid indirect object syntax (new WWW::Mechanize). Use normal method calls instead (WWW::Mechanize->new).
Use local variables (e.g. my $num = 1;) unless you really need package variables.
Here is one way to do it with an HTML parser, HTML::TreeBuilder.
This changes the src attribute to the new value in the processed node and replaces that node in the tree with the changed copy, for all img tags.
use warnings;
use strict;
use feature 'say';
use HTML::TreeBuilder;
my $content = join '', <DATA>; # join in general (not needed with one line)
my ($num, $year, $number) = (1, 2019, 23);
my $new_src_base = "images/${year}_${number}_$num";
my $tree = HTML::TreeBuilder->new_from_content($content);
my @nodes = $tree->look_down(_tag => 'img');
for my $node (@nodes) {
my ($ext) = $node->attr('src') =~ m{.*/.*\.(.*)\z}; #/
my $orig_src = $node->attr('src', $new_src_base . ".$ext"); # change 'src'
$node->replace_with($node);
# my $imageurl = $orig_src; # fetch the image etc...
# $mech->get($imageurl);
}
say $tree->as_HTML; # to inspect; otherwise print to file
__DATA__
<P><div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092a.gif"> <img src="/image/SG0092a.gif" alt="graphic image" class="img-responsive graphic"/></a></div> <div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092b.gif"> <img src="/image/SG0092b.gif" alt="graphic image" class="img-responsive graphic"/></a></div> <div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092c.gif"> <img src="/image/SG0092c.gif" alt="graphic image" class="img-responsive graphic"/></a></div> <div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092d.gif"> <img src="/image/SG0092d.gif" alt="graphic image" class="img-responsive graphic"/></a></div> <div class="row" style="text-align:center"><a target="_blank" href="/image/SG0092e.gif"> <img src="/image/SG0092e.gif" alt="graphic image" class="img-responsive graphic"/></a></div>
For the new value of the src attribute I used what I could infer from the question. The code in the question leaves the href attribute of the link unchanged (it still points to the same gif), so this code leaves it unchanged too.
There are other tools to do this with, see this post for more, for example.
The above could perhaps run into problems related to weak references in older versions (see the documentation). In that case this should be safer:
for my $node (@nodes) {
my ($ext) = ( $node->attr('src') ) =~ m{.*/.*\.(.*)\z}; #/
my $copy = $node->clone;
my $orig_src = $copy->attr('src', $new_src_base . ".$ext");
$node->replace_with($copy)->delete;
...
}
Using Mojo::DOM:
use strict;
use warnings;
use Mojo::DOM;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
my $dom = Mojo::DOM->new($content);
my $num = 1;
foreach my $img ($dom->find('img[src]')->each) {
next unless $img->{src} =~ m/\.([a-zA-Z]+)\z/;
my $ext = $1;
my $path = "images/${year}_${number}_$num.$ext";
$ua->get($img->{src})->result->save_to($path);
$img->attr(src => $path);
$num++;
}
print $dom->to_string;
Related
Having a little trouble constructing a Powershell Replace regex that's not too greedy.
Looking to convert occurrences of this pattern: /sites/*/*/SitePages/*/*.aspx to: /sites/*/*/SitePages/*/*.html
But having an issue where there's multiple values on the one line to be replaced. replace's greediness is capturing the whole line, replacing only the last.
sample input:
<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>
failing regex segment:
% { $_ -Replace '(sites.*SitePages.*)\.aspx' , '${1}.html' }
Suggestions?
(motivation: I am trying to convert the aspx page references to html as we've moved from hosting on SharePoint. Pages are all static, so no issues, other than converting the page extensions)
Just as you stated yourself, using a regular expression to peek and poke in a structured string might give unexpected and greedy results. As suggested before, it is generally a bad idea to attempt to parse HTML with regular expressions.
Instead, use a dedicated HTML parser such as the HtmlDocument class (and the Uri class for URIs).
Example
$html = '<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>'
function ParseHtml($String) {
$Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
$Html = New-Object -Com 'HTMLFile'
if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
$Html.IHTMLDocument2_Write($Unicode)
}
else {
$Html.write($Unicode)
}
$Html.Close()
$Html
}
$Document = ParseHtml $Html
# You might also select your div from a presumably larger document:
# $div = $Document.getElementsByClassName('ms-wikicontent')
$Document.getElementsByTagName('a') |ForEach-Object {
if (([Uri]$_.href).LocalPath -like '/sites/*/*/SitePages/*.aspx') {
$_.href = [System.IO.Path]::ChangeExtension($_.href, 'html')
}
}
$Document.body.innerHtml
result:
<DIV class="ms-wikicontent ms-rtestate-field" style="PADDING-RIGHT: 10px">
<DIV class=ExternalClass8E56354CC4314DBA861E187B689F3A2B>
<TABLE id=layoutsTable style="WIDTH: 100%">
<TBODY>
<TR style="VERTICAL-ALIGN: top">
<TD style="WIDTH: 100%">
<DIV class=ms-rte-layoutszone-outer style="WIDTH: 100%">
<DIV aria-haspopup=true role=textbox aria-multiline=true class=ms-rte-layoutszone-inner aria-autocomplete=both><A id=0::Home|Home class=ms-wikilink href="/sites/Team/Project/SitePages/Home.html">Home</A> - <A id=1::Jenkins|Jenkins class=ms-wikilink href="/sites/Team/Project/SitePages/Jenkins.html">Jenkins</A>
<H1 class=ms-rteElement-H1>Jenkins Integration with Deployment Tools</H1></DIV></DIV></TR></TBODY></DIV></DIV>
Without lookarounds, you can use a capture group like in your question. But the match should not cross a double quote, because the path is the string between the double quotes.
(/sites\b[^\"]*/SitePages/[^\"]+)\.aspx\b
Explanation
( Capture group 1
/sites\b Match /sites and a word boundary
[^\"]*/SitePages/ Optionally match any char except " and then match /SitePages/
[^\"]+ Match 1+ chars other than "
) Close group 1
\.aspx\b Match .aspx and a word boundary
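To see why excluding the double quote matters, here is a small Python sketch (Python only for illustration; the regex syntax used here behaves the same way in .NET). It runs the greedy pattern from the question and the bounded pattern above on a shortened, hypothetical version of the sample input and counts the replacements.

```python
import re

# Shortened sample: two links on one line, as in the question
html = ('<a class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - '
        '<a class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a>')

# Greedy: .* runs from the first "sites" to the last ".aspx",
# so the whole line is one match and only the last extension changes
greedy = re.sub(r'(sites.*SitePages.*)\.aspx', r'\1.html', html)

# Bounded: [^"] cannot cross the closing quote of the href,
# so each link is matched and rewritten on its own
bounded = re.sub(r'(/sites\b[^"]*/SitePages/[^"]+)\.aspx\b', r'\1.html', html)

print(greedy.count('.html'), bounded.count('.html'))
```

With the greedy pattern the Home link keeps its .aspx extension, while the bounded pattern rewrites both links.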
$input = @"
<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>
"@
$input -replace '(/sites\b[^\"]*/SitePages/[^\"]+)\.aspx\b' ,'$1.html'
Output
<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.html">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.html">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>
Another variation: if there are always exactly 2 path segments between /sites and /SitePages, you can match them with the quantifier {2} and, for example, assert the double quote after .aspx with a lookahead:
(/sites(?:/[^/\"]+){2}/SitePages/[^/\"]+)\.aspx(?=\")
Daniel already showed an excellent solution using character exclusion [^/]:
$_ -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html'
Demo and detailed explanation at regex101
Alternatively you could use the lazy modifier ?:
$_ -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html'
Demo and detailed explanation at regex101
While the latter looks cleaner, it is less performant, because it requires more backtracking.
I did a little benchmark:
$text = '<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>'
$runs = 100000
$excludeMillis = (Measure-Command { foreach( $i in 1..$runs ) { $text -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html' }}).TotalMilliseconds
$lazyMillis = (Measure-Command { foreach( $i in 1..$runs ) { $text -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html' }}).TotalMilliseconds
[PSCustomObject]@{
RegExExclude = '{0} ms' -f [int]$excludeMillis
RegExLazy = '{0} ms ({1}%)' -f [int]$lazyMillis, [int]($lazyMillis / $excludeMillis * 100)
}
Output from PS 7.2:
RegExExclude RegExLazy
------------ ---------
281 ms 350 ms (125%)
The difference is noticeable, but not that big, so you may go for readability if performance doesn't matter.
The performance difference between the two becomes even smaller when using a compiled RegEx:
$text = '<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>'
$runs = 100000
$rxExclude = [regex]::new( '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', [Text.RegularExpressions.RegexOptions]::Compiled )
$rxLazy = [regex]::new( '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', [Text.RegularExpressions.RegexOptions]::Compiled )
$excludeMillis = (Measure-Command { foreach( $i in 1..$runs ) { $rxExclude.Replace( $text, 'html' ) }}).TotalMilliseconds
$lazyMillis = (Measure-Command { foreach( $i in 1..$runs ) { $rxLazy.Replace( $text, 'html' ) }}).TotalMilliseconds
[PSCustomObject]@{
RegExExclude = '{0} ms' -f [int]$excludeMillis
RegExLazy = '{0} ms ({1}%)' -f [int]$lazyMillis, [int]($lazyMillis / $excludeMillis * 100)
}
Output from PS 7.2:
RegExExclude RegExLazy
------------ ---------
160 ms 178 ms (111%)
Try:
[string]$string = "<div class='ms-wikicontent ms-rtestate-field' style='padding-right: 10px'><div class='ExternalClass8E56354CC4314DBA861E187B689F3A2B'><table id='layoutsTable' style='width:100%'><tbody><tr style='vertical-align:top'><td style='width:100%'><div class='ms-rte-layoutszone-outer' style='width:100%'><div class='ms-rte-layoutszone-inner' role='textbox' aria-haspopup='true' aria-autocomplete='both' aria-multiline='true'><a id='0::Home|Home' class='ms-wikilink' href='/sites/Team/Project/SitePages/Home.aspx'>Home</a> - <a id='1::Jenkins|Jenkins' class='ms-wikilink' href='/sites/Team/Project/SitePages/Jenkins.aspx'>Jenkins</a><h1 class='ms-rteElement-H1'>Jenkins Integration with Deployment Tools</h1>"
$string.Replace('.aspx','.html')
Or, if you want to build a regex, check out https://rubular.com/
It helps to build regular expressions.
Let's say we have this html and the following DOMXPath code:
<div>
<div>
<p>1</p>
</div>
<div>
<p>2</p>
</div>
<div>
<p>3</p>
</div>
<div>
<p>4</p>
</div>
<div>
<p>5</p>
</div>
<div>
<p>6</p>
</div>
</div>
$doc = new DOMDocument();
$doc->loadHtml($strhtml);
$doc->preserveWhiteSpace = false;
$xpath = new DOMXPath( $doc );
$nodelist = $xpath->query('//div/div[2]/p');
foreach( $nodelist as $node ) {
$result = $node->nodeValue."\n";
}
echo $result;
Obviously, $result = '2', since we asked for the value of 'p' from the second 'div' node.
Now, how can I get the values for, say, from 'div[2]' to 'div[4]' and sum them?
To be precise, I would like to know how to get "from # to #" and also how to get "this #, that #, also # and #". So two questions, for two different problems.
Thanks in advance.
You can select a range of elements with DOMXPath.
For the first problem, getting "from # to #", use the following approach:
// select nodes within specified range of positions
$nodelist = $xpath->query('//div/div[position()>1 and position()<5]');
For the second problem, getting "this #, that #, also # and #", try the following (with the | union operator):
// extracts the 2nd, 4th and 6th elements respectively
$nodelist = $xpath->query('//div/div[2] | //div/div[4] | //div/div[6]');
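Since the question also asks about summing the selected values, here is a minimal Python sketch of both selections on the same markup. It uses the standard library's xml.etree.ElementTree, which only supports a subset of XPath (integer position predicates like div[2], but not position()>1), so the range is taken by slicing the child list rather than by an XPath expression. The variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Same structure as the question: an outer div holding six <div><p>n</p></div> children
markup = "<div>" + "".join(f"<div><p>{n}</p></div>" for n in range(1, 7)) + "</div>"
outer = ET.fromstring(markup)

# "from # to #": children 2 through 4 (slice indexes are 0-based, end exclusive)
in_range = outer.findall("div")[1:4]
range_sum = sum(int(div.findtext("p")) for div in in_range)

# "this #, that #, also #": one positional lookup per wanted child
picked = [outer.find(f"div[{i}]/p") for i in (2, 4, 6)]
picked_sum = sum(int(p.text) for p in picked)

print(range_sum, picked_sum)
```

In PHP the equivalent would be to loop over the returned node list and add up each nodeValue.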
I've got a bunch of strings already separated from an HTML file, examples:
<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
I am trying to make a regular expression that will grab the src="URL" part of the img tag so that I can replace it later based on a few other conditions. The many quotation marks are giving me the biggest problem. I'm still relatively new to regex, so a lot of the tricks are beyond my knowledge.
Thanks in advance
Use DOM or another parser for this, don't try to parse HTML with regular expressions.
Example:
$html = <<<DATA
<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
DATA;
$doc = new DOMDocument();
$doc->loadHTML($html); // load the html
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//img');
foreach ($imgs as $img) {
echo $img->getAttribute('src') . "\n";
}
Output
//i.imgur.com/tApg8ebb.jpg
//i.imgur.com/SwmwL4Gb.jpg
//s.imgur.com/images/blog_rss.png
If you would rather store the results in an array, you could do..
foreach ($imgs as $img) {
$sources[] = $img->getAttribute('src');
}
print_r($sources);
Output
Array
(
[0] => //i.imgur.com/tApg8ebb.jpg
[1] => //i.imgur.com/SwmwL4Gb.jpg
[2] => //s.imgur.com/images/blog_rss.png
)
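For comparison, the same parser-based extraction can be sketched in Python with the standard library's html.parser. The class name ImgSrcCollector is made up for this example. The point it illustrates is the same as above: a parser treats the markup inside the title attribute as plain attribute text, so the nested quotes and tags do not confuse it.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


html = """<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
"""

collector = ImgSrcCollector()
collector.feed(html)
print(collector.sources)
```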
$pattern = '/<img.+src="([\w\/._\-]+)"/';
I'm not sure which language you're using, so quote syntax will vary.
How do you find the following div using regex? The URL and image location will consistently change based on the post URL, so I need to use a wild card.
I must use a regular expression because I am limited in what I can use due to the software I am using: http://community.autoblogged.com/entries/344640-common-search-and-replace-patterns
<div class="tweetmeme_button" style="float: right; margin-left: 10px;"> <br /> <img src="http://api.tweetmeme.com/imagebutton.gif?url=http%3A%2F%2Fjumpinblack.com%2F2011%2F11%2F25%2Fdrake-and-rick-ross-you-only-live-once-ep-mixtape-2011-download%2F&source=jumpinblack1&style=compact&b=2" height="61" width="50" /><br /> </div>
I tried using
<div class="tweetmeme_button" style="float: right; margin-left: 10px;">.*<\/div>
Using regular expressions to process HTML is a bad idea. I'm using HTML::TreeBuilder::XPath for this.
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->get("http://www.someURL.com");
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content() );
my $div = $tree->findnodes( '//div[@class="tweetmeme_button"]')->[0];
Use an HTML parser to parse HTML.
HTML::TokeParser::Simple or HTML::TreeBuilder::XPath among many others.
E.g.:
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new( ... );
while (my $div = $parser->get_tag) {
next unless $div->is_start_tag('div');
{
no warnings 'uninitialized';
next unless $div->get_attr('class') eq 'tweetmeme_button';
next unless $div->get_attr('style') eq 'float: right; margin-left: 10px;';
# now do what you want until the next </div>
}
}
I have a newsletter that contains a few images inside a div called nieuwsbrief-tekst. I want to find those images and add inline CSS to them. I can find the div with preg_match, and I can also find the image tag itself, but adding the style="" to the image tag hasn't worked so far.
There is also more than one nieuwsbrief-tekst div; these divs are the different content blocks, so there are 3 or 4 of them. I tried preg_replace, but that has no effect.
Any tips or suggestions how to handle this?
So the html would look like this, and I only want to add the style attribute to the images inside the div.
HTML Code:
<div class="nieuwsbrief-tekst">lorum ipmsum</div>
<div class="nieuwsbrief-tekst"><img src="#"></div>
<div class="nieuwsbrief-tekst">lorum ipmsum</div>
<div class="nieuwsbrief-tekst"><img src="#"></div>
PHP Code:
if(preg_match_all('/<div class="nieuwsbrief-tekst">(.*?)<\/div>/is', $var, $matches)) {
foreach($matches[0] as $match) {
if(preg_match('/<img[^>]+>/is', $match, $match_img)) {
echo 'image found';
$pattern = '/<img[^>]+>/is';
$replacement = '<img style="float:left; margin:0 10px 10px 0;';
$test = preg_replace($pattern, $replacement, $match_img);
}
}
echo '<pre>';
print($test);
echo '</pre>';
}
Thanks :)
This should accomplish what you want:
$pattern = "/<div class=\"tmp\">(\\w+)?<img ([^>]+)>(\\w+)?<\/div>/is";
$replacement = "<div class=\"tmp\">\\1 <img style=\"float:left; margin:0 10px 10px 0;\" \\2 /> \\3</div>";
$str = preg_replace($pattern, $replacement, $str);
echo $str;
Just change the tmp to fit your needs :)
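If the div can contain more than a single word around the image, a two-pass substitution is one possible alternative: first match each nieuwsbrief-tekst div, then add the style only to img tags inside that match. Below is a rough Python sketch of the idea (in PHP, preg_replace_callback would play the role of the callback). The function name style_images and the sample markup are made up for this example, and it assumes the divs are not nested, since the lazy match stops at the first closing div.

```python
import re

STYLE = "float:left; margin:0 10px 10px 0;"

# Pass 1: one match per content block; (.*?) stops at the first </div>
DIV_RE = re.compile(r'(<div class="nieuwsbrief-tekst">)(.*?)(</div>)', re.I | re.S)
# Pass 2: the start of every img tag inside a matched block
IMG_RE = re.compile(r'<img\b', re.I)


def style_images(html):
    """Add an inline style to img tags, but only inside nieuwsbrief-tekst divs."""

    def fix_div(match):
        inner = IMG_RE.sub(f'<img style="{STYLE}"', match.group(2))
        return match.group(1) + inner + match.group(3)

    return DIV_RE.sub(fix_div, html)


# Hypothetical sample: one text block, one image block, one image outside any block
sample = ('<div class="nieuwsbrief-tekst">lorum ipmsum</div>\n'
          '<div class="nieuwsbrief-tekst"><img src="#"></div>\n'
          '<img src="outside.png">')
styled = style_images(sample)
print(styled)
```

The image outside the divs is left untouched because the second pattern only ever sees the text captured by the first one.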