I'm working on a project to scrape every interview found here into an HTML-ready document, to be later dumped into a DB which will automatically update our website with the latest content. You can see an example of my current scraping script in a question I asked the other day: WWW::Mechanize Extraction Help - PERL
The problem I can't seem to wrap my head around is whether what I'm trying to accomplish is even possible. Because I don't want to have to guess when a new interview is published, my hope is to scrape the page that has a directory listing of all of the interviews and have my program automatically fetch the content at any new URL (new interview).
Again, the site in question is here (scroll down to see the listing of interviews): http://millercenter.org/president/clinton/oralhistory
My initial thought was to put a regex of .\ at the end of the link above, in the hope that it would automatically search any links found under that page. I can't seem to get this to work using WWW::Mechanize, however. I'll post what I have below; if anyone has any guidance or experience with this, your feedback would be greatly appreciated. I'll also summarize my tasks below the code so that you have a concise understanding of what I hope to accomplish.
Thanks to any and all that can help!
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::Link;
use WWW::Mechanize::TreeBuilder;
my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/\.");
# find all <dl> tags
my @list = $mech->find('dl');
foreach ( @list ) {
    print $_->as_HTML();
}
# # find all links
# my @links = $mech->links();
# foreach my $link (@links) {
#     print $link->url, "\n";
# }
To summarize what I'm hoping is possible:
Extract the content of every interview found here into an HTML-ready document, as I did here: WWW::Mechanize Extraction Help - PERL. This would require the 'get' action to traverse the pages listed under the /oralhistory/ directory, which can perhaps be solved using a regex?
Possibly extract the respondent name and position fields on the directory page to populate a title field (this isn't that big of a deal if it can't be done)
No, you can't use wildcards in URLs... :-(
You'll have to parse the listing page yourself, and then get and process the pages in a loop.
Extracting specific fields from a page's contents is a straightforward task with WWW::Mechanize...
UPDATE: answering OP comment:
Try this logic:
use strict;
use warnings;
use WWW::Mechanize;
use LWP::Simple;
use File::Basename;
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get("http://millercenter.org/president/clinton/oralhistoryml");
# find all links on the listing page (find('dl') would return HTML
# elements, not link objects, so use links() here instead)
my @links = $mech->links();
foreach my $link (@links) {
    my $url = $link->url_abs->as_string; # absolute URL as a plain string
    my $localfile = basename($url);
    my $localpath = "./$localfile";
    print "$localfile\n";
    getstore($url, $localpath);
}
My answer focuses on the approach; I'm not providing a complete program.
There are no IDs in the links, but the names of the interview pages seem fine to use. You need to parse them out and build a lookup table.
Basically you start by building a parser that fetches all the links that look like an interview. That is fairly simple with WWW::Mechanize. The page URL is:
http://millercenter.org/president/clinton/oralhistory
All the interviews follow this schema:
http://millercenter.org/president/clinton/oralhistory/george-mitchell
So you can find all links on that page that start with http://millercenter.org/president/clinton/oralhistory/. Then you make them unique, because there is a teaser box slider that showcases some of them and has a 'read more' link to each page. Use a hash to do that, like this:
my %seen;
foreach my $url (@urls) {
    $mech->get($url) unless $seen{$url};
    $seen{$url}++;
}
Then you fetch each page, do your stuff, and write it to your database. Use the URL or the interview name part of the URL (e.g. george-mitchell) as the primary key. If there are other presidents and you want those as well, adapt the key in case the same name shows up for several presidents.
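For example, a minimal sketch of pulling that name part out of the URL (assuming the /oralhistory/ URL schema shown above):
my ($id) = $url =~ m{/oralhistory/([\w-]+)$};
# $id is now e.g. "george-mitchell"; prefix the president
# (e.g. "clinton/george-mitchell") if you scrape several of them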
Then you go back and add a cache lookup to your code. Grab all the IDs from the DB before you start fetching pages, and put those in a hash.
# prepare query and stuff...
my %cache;
while (my $res = $sth->fetchrow_hashref) {
    $cache{$res->{id}}++;
}
# later...
foreach my $url (@urls) {
    next if $cache{$url}; # or grab the ID out of the URL
    next if $seen{$url};
    $mech->get($url);
    $seen{$url}++;
}
You also need to filter out the links that are not interviews. One of those is http://millercenter.org/president/clinton/oralhistory/clinton-description, which is the 'read more' of the first paragraph on the page.
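A minimal sketch of that filter, assuming you exclude known non-interview slugs by hand:
# keep only links that match the interview URL schema, minus
# known non-interview pages such as clinton-description
my @interview_urls = grep {
    m{^http://millercenter\.org/president/clinton/oralhistory/[\w-]+$}
        && !m{/clinton-description$}
} map { $_->url_abs->as_string } $mech->links();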
Related
We have some regex code that converts URLs to clickable links. It is working, but we are running into issues where, if a user submits an entry and forgets to put a space after a period, it treats that as a link as well.
example: End of a sentence.This is a new sentence
It would create a hyperlink for sentence.This
Is there any way to validate the following code against, say, a proper domain suffix like .com, .ca, etc.?
Here is the code:
$url = '#(http)?(s)?(://)?(([a-zA-Z])([-\w]+\.)+([^\s\.]+[^\s]*)+[^,.\s])#';
$output = preg_replace($url, '<a href="$0">$0</a>', trim($val[0]));
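For example, something along the lines of this sketch is what I mean by validating against a TLD list (the list here is just made up; it would need extending):
$tlds = 'com|ca|net|org|edu|gov'; // assumption: only these suffixes are accepted
$pattern = '#\b(?:https?://)?(?:[-\w]+\.)+(?:' . $tlds . ')\b(?:/\S*)?#i';
$output = preg_replace($pattern, '<a href="$0">$0</a>', trim($val[0]));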
Thanks,
Aaron
I'm trying to write my own Jekyll plugin to construct an API query from a custom tag. I've gotten as far as creating the basic plugin and tag, but I've run into the limits of my programming skills, so I'm looking to you for help.
Here's my custom tag for reference:
{% card "Arbor Elf | M13" %}
Here's the progress on my plugin:
module Jekyll
  class Scryfall < Liquid::Tag
    def initialize(tag_name, text, tokens)
      super
      @text = text
    end

    def render(context)
      # Store the name of the card, ie "Arbor Elf"
      @card_name =
      # Store the name of the set, ie "M13"
      @card_set =
      # Build the query
      @query = "https://api.scryfall.com/cards/named?exact=#{@card_name}&set=#{@card_set}"
      # Store a specific JSON property
      @card_art =
      # Finally we render out the result
      "<img src='#{@card_art}' title='#{@card_name}' />"
    end
  end
end

Liquid::Template.register_tag('card', Jekyll::Scryfall)
For reference, here's an example query using the above details (paste it into your browser to see the response you get back)
https://api.scryfall.com/cards/named?exact=arbor+elf&set=m13
My initial attempt after Googling around was to use a regex to split the @text at the |, like so:
@card_name = "#{@text}".split(/| */)
This didn't quite work; instead it output this:
["A", "r", "b", "o", "r", " ", "E", "l", "f", " ", "|", " ", "M", "1", "3", " "]
I'm also then not sure how to access and store specific properties within the JSON response. Ideally, I can do something like this:
@card_art = JSONRESPONSE.image_uri.large
I'm well aware I'm asking a lot here, but I'd love to try and get this working and learn from it.
Thanks for reading.
Actually, your split should work – you just need to give it the correct regex (and you can call it on @text directly). You also need to escape the pipe character in the regex, because pipes can have a special meaning there. You can use rubular.com to experiment with regexes.
parts = @text.split(/\|/)
# => ["Arbor Elf ", " M13"]
Note that the parts also contain some extra whitespace, which you can remove with strip.
@card_name = parts.first.strip
@card_set = parts.last.strip
This might also be a good time to answer questions like: what happens if the user inserts multiple pipes? What if they insert none? Will your code give them a helpful error message for this?
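For example, a minimal guard (just a sketch; the exact error message is up to you):
parts = @text.split(/\|/)
unless parts.size == 2
  raise Liquid::SyntaxError, "card tag expects 'Name | Set', got #{@text.inspect}"
end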
You'll also need to escape these values in your URL. What if one of your users adds a card containing a & character? Your URL will break:
https://api.scryfall.com/cards/named?exact=Sword of Dungeons & Dragons&set=und
That looks like a URL with three parameters: exact, set, and Dragons. You need to encode the user input before including it in a URL:
require 'cgi'
query = "https://api.scryfall.com/cards/named?exact=#{CGI.escape(#card_name)}&set=#{CGI.escape(#card_set)}"
# => "https://api.scryfall.com/cards/named?exact=Sword+of+Dungeons+%26+Dragons&set=und"
What comes after that is a little less clear, because you haven't written the code yet. Try making the call with the Net::HTTP module and then parsing the response with the JSON module. If you have trouble, come back here and ask a new question.
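To get you started, here is a minimal sketch of that call. It assumes Scryfall nests the art under an image_uris key with a large size; check that against the actual response from the example query above:
require 'net/http'
require 'json'
require 'cgi'

uri = URI("https://api.scryfall.com/cards/named?exact=#{CGI.escape(@card_name)}&set=#{CGI.escape(@card_set)}")
data = JSON.parse(Net::HTTP.get(uri))

# dig returns nil instead of raising when a key is missing
@card_art = data.dig("image_uris", "large")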
I have an article ID. How can I get the valid full URL of that article? The article is already associated with a menu item, but I might not know which one. Is there any easy way in PHP to get the URL? I am using Joomla 3.2.
I tried the following already.
$article = JControllerLegacy::getInstance('Content')->getModel('Article')->getItem($articleId);
JRoute::_(ContentHelperRoute::getArticleRoute($articleId, $article->catid))
You can use it like this:
$article = JControllerLegacy::getInstance('Content')
    ->getModel('Article')->getItem($articleId);
$url = JRoute::_(ContentHelperRoute::getArticleRoute($articleId,
    $article->catid,
    $article->language));
I am writing this because I think this information is useful to anyone who wants the full current URL anywhere in Joomla, not only in articles.
In Joomla, use the JURI class to get URLs.
Functions like root(), current() & base() can be used according to need.
echo 'Joomla root URI is ' . JURI::root();
output: Joomla root URI is http://localhost/joomla/
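base() works the same way, for example:
echo 'Joomla base URI is ' . JURI::base();
output: Joomla base URI is http://localhost/joomla/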
echo 'Joomla current URI is ' . JURI::current();
output: Joomla current URI is http://localhost/joomla3/index.php/somealias
Note: current() will give the whole URI except the query string part. For example,
if your full URL is http://localhost/joomla3/index.php/somealias?id=1 then current() will only return http://localhost/joomla3/index.php/somealias
Meanwhile, if you use JURI::getInstance()->toString(), it will return
http://localhost/joomla3/index.php/somealias?id=1
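For example:
echo 'Full URI is ' . JURI::getInstance()->toString();
output: Full URI is http://localhost/joomla3/index.php/somealias?id=1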
For more information see these links:
https://docs.joomla.org/JURI/root
https://docs.joomla.org/JURI/current
https://docs.joomla.org/JURI/getInstance
Maybe JURI (from the Joomla! API) will help you.
Example:
echo 'Joomla current URI is ' . JURI::current() . "\n";
might output
Joomla current URI is http://localhost/joomla/
JURI class
I was looking for an answer to my question, but so far all I've got is this:
https://graph.facebook.com/the_user_id?fields=name,picture
I need to be able to display/print the first name, last name, and picture of a set list of users whose IDs I know. What code is required to get this data and then publish it on a PHP/HTML page? Of course, this means that if I want to show 10 users, I will input 10 different IDs (I read something about an array list?). Note that I do NOT require this to work for the current user.
Thanks in advance for your replies.
You need to use file_get_contents (http://uk3.php.net/file_get_contents) or cURL in PHP and issue a request to a URL such as the following:
https://graph.facebook.com/?ids=id1,id2,id3&fields=name,picture
(replacing id1,id2,id3 with your IDs)
This will then return a JSON object. You then need to decode it (http://uk3.php.net/json_decode) and loop through it to access the information.
This should get you started:
// The people array uses the user's ID as the key and their favourite dessert
// as the value. The ID is then used to select the corresponding value from
// this array for each person returned by the query to Facebook.
$people = array("id1"=>"favourite dessert", "id2"=>"favourite dessert", "id3"=>"apple pie");
$json = file_get_contents('https://graph.facebook.com/?ids=id1,id2,id3&fields=id,name,picture');
$json = json_decode($json);
foreach ($json as $key => $person) {
    echo '<p><img src="'.$person->picture.'" alt="'.$person->name.'" />';
    echo $person->name.'\'s favourite dessert is '.$people[$person->id];
    echo '</p>';
}
I've batched the requests here; alternatively you could perform 10 separate queries, one per user, but that would be a bit pointless and inefficient.
The easiest way is with an FQL query:
SELECT first_name, last_name, pic, uid FROM user WHERE uid IN
(Known_ID_1, Known_ID_2, ... Known_ID_n)
The easiest way, if you're using PHP, is to install the PHP SDK, though you can also make a call directly to https://graph.facebook.com/fql?q=URL_ENCODED_QUERY
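For instance, a minimal sketch of the direct call without the SDK. It assumes the rows come back under a data key; depending on the fields, FQL may also require an access token appended to the URL:
$fql = 'SELECT first_name, last_name, pic, uid FROM user WHERE uid IN (Known_ID_1, Known_ID_2)';
$result = json_decode(file_get_contents('https://graph.facebook.com/fql?q=' . urlencode($fql)));
foreach ($result->data as $user) {
    echo '<p><img src="' . $user->pic . '" alt="' . $user->first_name . '" /> ';
    echo $user->first_name . ' ' . $user->last_name . '</p>';
}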
I need to pull all the links from a page that resides on an intranet, but I'm unsure how best to do it. The structure of the site is as follows:
List of topics
Topic 1
Topic 2
Topic 3
etc
Now, the links reside in each of the topic pages. I want to avoid going through the 500-plus topic pages manually to extract the URIs.
Each of the topic pages has the following structure
http://alias/filename.php?cat=6&number=1
The cat parameter refers to the category and the number parameter refers to the topic.
Once in a topic page, the URI I need to extract again follows a particular format:
http://alias/value?id=somevalue
Caveats
I don't have access to the database, so trawling through it is not an option
There is only ever a single URI in each topic page
I need to extract the list to a file that simply lists each URI on a new line
I would like some sort of script I can run from the terminal via Bash that will trawl through the topic pages and then the target URI in each of them.
In a nutshell
How can I write a script, runnable from Bash, that recursively goes through the whole list of topics, extracts the URI from each topic page, and spits out a text file with each extracted URI on a new line?
I would implement this with Perl, using the HTML::TokeParser and WWW::Mechanize modules:
use strict;
use warnings;
use HTML::TokeParser;
use WWW::Mechanize;

my $site = WWW::Mechanize->new(autocheck => 1);
my $topicmax = 500; # Note: adjust this to the number of topic pages you have

# loop through each topic page
foreach (1..$topicmax) {
    my $topicurl = "http://alias/filename.php?cat=6&number=$_";
    # get the page
    $site->get($topicurl);
    my $p = HTML::TokeParser->new(\$site->content);
    # parse the page and extract the links
    while (my $token = $p->get_tag("a")) {
        my $url = $token->[1]{href};
        next unless defined $url; # skip anchors without an href
        # use a regex to test for the link format we want
        if ($url =~ m{^http://alias/value\?id=}) {
            print "$url\n";
        }
    }
}
The script prints to stdout, so you just need to redirect it to a file.
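For example, assuming you save the script above as extract_links.pl:
perl extract_links.pl > uris.txt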