Extracting URI from web pages via BASH - regex

I need to pull all the links for a page that resides on an intranet, but I am unsure how best to do it. The structure of the site is as follows:
List of topics
Topic 1
Topic 2
Topic 3
etc
Now the links reside in each of the topic pages. I want to avoid going through in excess of 500 topic pages manually to extract the URI.
Each of the topic pages has the following structure
http://alias/filename.php?cat=6&number=1
The cat parameter refers to the category and the number parameter refers to the topic.
Once in a topic page, the URI I need to extract again follows a particular format:
http://alias/value?id=somevalue
Caveats
I don't have access to the database, so trawling through it directly is not an option
There is only ever a single URI in each topic page
I need to extract the list to a file that simply lists each URI in a new line
I would like some sort of script I can run from the terminal via BASH that will trawl through the list of topic pages and then extract the URI from each one.
In a nutshell
How can I write a script, runnable from BASH, that goes through the whole list of topics, extracts the URI from each topic page, and spits out a text file with each extracted URI on a new line?

I would implement this in Perl, using the HTML::TokeParser and WWW::Mechanize modules:
use strict;
use warnings;
use HTML::TokeParser;
use WWW::Mechanize;

my $site = WWW::Mechanize->new( autocheck => 1 );
my $topicmax = 500; # Note: adjust this to the number of topic pages you have

# loop through each topic page
foreach (1..$topicmax) {
    my $topicurl = "http://alias/filename.php?cat=6&number=$_";

    # get the page
    $site->get($topicurl);
    my $p = HTML::TokeParser->new( \$site->content );

    # parse the page and extract the links
    while (my $token = $p->get_tag("a")) {
        my $url = $token->[1]{href};
        next unless defined $url; # skip anchors without an href
        # use a regex to test for the link format we want
        if ($url =~ /^http:\/\/alias\/value\?id=/) {
            print "$url\n";
        }
    }
}
The script prints to stdout, so you just need to redirect it to a file.
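For example, if the script above were saved as extract_uris.pl (the filename here is just an example):
perl extract_uris.pl > uri-list.txt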

Related

Extract all images from zillow html - to google sheets with Regex

Hello, I want to extract all images from Zillow HTML:
https://regex101.com/r/ifKDEa/1
I'm trying to catch the first 7 images:
https://photos.zillowstatic.com/fp/4e900eea9506449f780cf1ffe718ff0e-cc_ft_960.jpg
https://photos.zillowstatic.com/fp/209c61592b999d83ca05b9b1e76edb5c-cc_ft_960.jpg
https://photos.zillowstatic.com/fp/1c1fe680263f8ac0ce477a5e82c589c7-cc_ft_960.jpg
https://photos.zillowstatic.com/fp/f6cbaed9371bacf8e259dbb83336f3a2-cc_ft_960.jpg
https://photos.zillowstatic.com/fp/19369ad2675cbb63a580dc204697ab07-cc_ft_960.jpg
https://photos.zillowstatic.com/fp/a59b1aba32afef92744c3a5b17123275-cc_ft_960.jpg
https://photos.zillowstatic.com/fp/abd1fea3dcdef3deddaa4a0ebbe84696-cc_ft_960.jpg
I want to paste this HTML into Google Sheets and extract the image links there.
Any idea how to do this?
When I check the HTML from your provided URL, it seems that your expected values are inserted using JavaScript. Unfortunately, this means they cannot be retrieved directly by Google Apps Script or a Spreadsheet formula. As a workaround for indirectly achieving your goal, if you manually retrieve the HTML with a browser and save it as a file in Google Drive, I think your expected values can then be retrieved using Google Apps Script. In this answer, I would like to propose this workaround.
Sample script:
Please copy and paste the following script into the Spreadsheet's script editor.
Before you use this script, please save the HTML data as a file in your Google Drive, set its name in the filename variable, and run the script. The URLs will then be retrieved and written to the active sheet.
function myFunction() {
  const filename = "samplefilename.html";
  const html = DriveApp.getFilesByName(filename).next().getBlob().getDataAsString();
  const urls = [...html.matchAll(/https:\/\/photos\.zillowstatic\.com\/fp\/[\w\d\-_]+?\.jpg/g)];
  if (!urls || urls.length == 0) return;
  // console.log(urls) // When you use this, the URLs can be seen in the log.
  // If you want to put the URLs on the active sheet:
  const sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange(1, 1, urls.length).setValues(urls);
}
Reference:
matchAll()

Regex - Converting URLs to clickable links

We have some regex code that converts URLs to clickable links. It is working, but we are running into issues: if a user submits an entry and forgets to add a space after a period, it treats that as a link as well.
example: End of a sentence.This is a new sentence
It would create a hyperlink for sentence.This
Is there any way to validate the following code against proper domains like .com, .ca, etc.?
Here is the code:
$url = '#(http)?(s)?(://)?(([a-zA-Z])([-\w]+\.)+([^\s\.]+[^\s]*)+[^,.\s])#';
$output = preg_replace($url, '<a href="$0">$0</a>', trim($val[0]));
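For example, something along these lines is the kind of restriction I have in mind, where a bare domain only counts as a link when it ends in a known TLD (a rough, untested sketch; the TLD list would need extending):
// only match scheme-prefixed URLs, or bare domains ending in a whitelisted TLD
$url = '#\bhttps?://\S+|\b(?:[a-z0-9-]+\.)+(?:com|ca|net|org)\b(?:/\S*)?#i';
$output = preg_replace($url, '<a href="$0">$0</a>', trim($val[0]));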
Thanks,
Aaron

How can I use regex to construct an API call in my Jekyll plugin?

I'm trying to write my own Jekyll plugin to construct an API query from a custom tag. I've gotten as far as creating the basic plugin and tag, but I've run into the limits of my programming skills, so I'm looking to you for help.
Here's my custom tag for reference:
{% card "Arbor Elf | M13" %}
Here's the progress on my plugin:
module Jekyll
  class Scryfall < Liquid::Tag
    def initialize(tag_name, text, tokens)
      super
      @text = text
    end

    def render(context)
      # Store the name of the card, ie "Arbor Elf"
      @card_name =
      # Store the name of the set, ie "M13"
      @card_set =
      # Build the query
      @query = "https://api.scryfall.com/cards/named?exact=#{@card_name}&set=#{@card_set}"
      # Store a specific JSON property
      @card_art =
      # Finally we render out the result
      "<img src='#{@card_art}' title='#{@card_name}' />"
    end
  end
end
Liquid::Template.register_tag('cards', Jekyll::Scryfall)
For reference, here's an example query using the above details (paste it into your browser to see the response you get back)
https://api.scryfall.com/cards/named?exact=arbor+elf&set=m13
My initial attempt after Googling around was to use a regex to split @text at the |, like so:
@card_name = "#{@text}".split(/| */)
This didn't quite work, instead it output this:
["A", "r", "b", "o", "r", " ", "E", "l", "f", " ", "|", " ", "M", "1", "3", " "]
I'm also then not sure how to access and store specific properties within the JSON response. Ideally, I can do something like this:
@card_art = JSONRESPONSE.image_uri.large
I'm well aware I'm asking a lot here, but I'd love to try and get this working and learn from it.
Thanks for reading.
Actually, your split should work; you just need to give it the correct regex (and you can call it on @text directly). You also need to escape the pipe character in the regex, because the pipe has a special meaning there. You can use rubular.com to experiment with regexes.
parts = @text.split(/\|/)
# => ["Arbor Elf ", " M13"]
Note that they also contain some extra whitespace, which you can remove with strip.
@card_name = parts.first.strip
@card_set = parts.last.strip
This might also be a good time to answer questions like: what happens if the user inserts multiple pipes? What if they insert none? Will your code give them a helpful error message for this?
You'll also need to escape these values in your URL. What if one of your users adds a card containing a & character? Your URL will break:
https://api.scryfall.com/cards/named?exact=Sword of Dungeons & Dragons&set=und
That looks like a URL with three parameters, exact, set and Dragons. You need to encode the user input to be included in a URL:
require 'cgi'
query = "https://api.scryfall.com/cards/named?exact=#{CGI.escape(@card_name)}&set=#{CGI.escape(@card_set)}"
# => "https://api.scryfall.com/cards/named?exact=Sword+of+Dungeons+%26+Dragons&set=und"
What comes after that is a little less clear, because you haven't written the code yet. Try making the call with the Net::HTTP module and then parsing the response with the JSON module. If you have trouble, come back here and ask a new question.
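Roughly, that last step might look something like this (an untested sketch; the JSON field names are taken from Scryfall's response format for a single-faced card):
require 'net/http'
require 'json'

# fetch the card data and pull out the fields we care about
response = Net::HTTP.get(URI(query))
card = JSON.parse(response)
@card_art = card["image_uris"]["large"]
@card_name = card["name"]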

Tweepy API search doesn't have keyword

I am working with Tweepy (a Python client for the Twitter REST API) and I'm trying to find tweets that match several keywords and do not include a URL in the tweet text.
But the search results are not to our satisfaction. It looks like the query has errors and was stopped. Additionally, we observed that results were returned one by one, not (as previously) in bulk packs of 100.
Could you please tell me why this search does not work properly?
We wanted to get all tweets mentioning 'Amazon' without any URL links in the text.
We used the search shown below. The results still contained tweets with URLs, or without the 'Amazon' keyword.
Could you please let us know what we are doing wrong?
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

searchQuery = 'Amazon OR AMAZON OR amazon filter:-links'  # Keyword

new_tweets = api.search(q=searchQuery, count=100,
                        result_type="recent",
                        max_id=sinceId,
                        lang="en")
The minus sign should be put before "filter", not before "links", like this:
searchQuery = 'Amazon OR AMAZON OR amazon -filter:links'
Also, I doubt that the count = 100 option is a valid one, since it is not listed on the API documentation (which may not be very up-to-date, though). Try to replace that with rpp = 100 to get tweets in bulk packs.
I am not sure why some of the tweets you find do not contain the "Amazon" keyword, but a possibility is that "Amazon" is contained within the username of the poster. I do not know if you can filter that directly in the query, or even if you would want to filter it, since it would mean you would reject tweets from the official Amazon accounts. I would suggest that, for each tweet the query returns, you check it to make sure it does contain "Amazon".
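For example, a simple post-filter along those lines might look like this (sketch):
# keep only tweets whose text actually mentions "Amazon"
amazon_tweets = [tweet for tweet in new_tweets if 'amazon' in tweet.text.lower()]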

Using WWW::Mechanize to scrape multiple pages under a directory - Perl

I'm working on a project to scrape every interview found here into an HTML-ready document, to be later dumped into a DB which will automatically update our website with the latest content. You can see an example of my current scraping script, which I asked a question about the other day: WWW::Mechanize Extraction Help - PERL
The problem I can't seem to wrap my head around is whether what I'm trying to accomplish is even possible. Because I don't want to have to guess when a new interview is published, my hope is to be able to scrape the website's directory listing of all of the interviews and automatically have my program fetch the content at each new URL (new interview).
Again, the site in question is here (scroll down to see the listing of interviews): http://millercenter.org/president/clinton/oralhistory
My initial thought was to put a regex of .\ at the end of the link above, in the hope that it would automatically search any links found under that page. I can't seem to get this to work using WWW::Mechanize, however. I will post what I have below, and if anyone has any guidance or experience with this, your feedback would be greatly appreciated. I'll also summarize my tasks below the code so that you have a concise understanding of what we hope to accomplish.
Thanks to any and all that can help!
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::Link;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);

$mech->get("http://millercenter.org/president/clinton/oralhistory/\.");

# find all <dl> tags
my @list = $mech->find('dl');
foreach ( @list ) {
    print $_->as_HTML();
}

# # find all links
# my @links = $mech->links();
# foreach my $link (@links) {
#     print "$link->url \n";
# }
To summarize what I'm hoping is possible:
Extract the content of every interview found here into an HTML-ready document, like I did here: WWW::Mechanize Extraction Help - PERL. This would require the 'get' action to be able to traverse the pages listed under the /oralhistory/ directory, which can perhaps be solved using a regex?
Possibly extract the respondent name and position fields on the directory page to be populated in a title field (this isn't that big of a deal if it can't be done)
No, you can't use wildcards in URLs. :-(
You'll have to parse the listing page yourself, and then get and process the pages in a loop.
Extracting specific fields from a page's contents will be a straightforward task with WWW::Mechanize...
UPDATE: answering OP comment:
Try this logic:
use strict;
use warnings;
use WWW::Mechanize;
use LWP::Simple;
use File::Basename;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get("http://millercenter.org/president/clinton/oralhistory");

# find all links that point at interview pages
my @list = $mech->find_all_links( url_regex => qr{/president/clinton/oralhistory/} );

foreach my $link (@list) {
    my $url       = $link->url_abs->as_string;
    my $localfile = basename($url);
    my $localpath = "./$localfile";
    print "$localfile\n";
    getstore($url, $localpath);
}
My answer is focused on the approach of how to do this. I'm not providing code.
There are no IDs in the links, but the names of the interview pages seem to be fine to use. You need to parse them out and build a lookup table.
Basically you start by building a parser that fetches all the links that look like an interview. That is fairly simple with WWW::Mechanize. The page URL is:
http://millercenter.org/president/clinton/oralhistory
All the interviews follow this schema:
http://millercenter.org/president/clinton/oralhistory/george-mitchell
So you can find all links in that page that start with http://millercenter.org/president/clinton/oralhistory/. Then you make them unique, because there is this teaser box slider thing that showcases some of them, and it has a read more link to the page. Use a hash to do that like this:
my %seen;
foreach my $url (@urls) {
    $mech->get($url) unless $seen{$url};
    $seen{$url}++;
}
Then you fetch the page and do your stuff and write it to your database. Use the URL or the interview name part of the URL (e.g. george-mitchell) as the primary key. If there are other presidents and you want those as well, adapt in case the same name shows up for several presidents.
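For instance, the name part could be pulled out of the URL with a small regex (sketch):
my ($interview_id) = $url =~ m{/oralhistory/([^/]+)/?$}; # e.g. "george-mitchell"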
Then you go back and add a cache lookup into your code. You grab all the IDs from the DB before you start fetching the page, and put those in a hash.
# prepare query and stuff...
my %cache;
while (my $res = $sth->fetchrow_hashref) {
    $cache{$res->{id}}++;
}

# later...
foreach my $url (@urls) {
    next if $cache{$url}; # or grab the ID out of the url
    next if $seen{$url};
    $mech->get($url);
    $seen{$url}++;
}
You also need to filter out the links that are not interviews. One of those would be http://millercenter.org/president/clinton/oralhistory/clinton-description, which is the 'read more' link for the first paragraph on the page.
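A simple way to skip that one inside the fetch loop (sketch):
next if $url =~ m{/clinton-description$}; # not an interview page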