Regex - Converting URLs to clickable links

We have some regex code that converts URLs to clickable links. It works, but we are running into an issue: if a user submits an entry and forgets to add a space after a period, the regex treats that text as a link as well.
Example: End of a sentence.This is a new sentence
It would create a hyperlink for sentence.This
Is there any way to validate the following code against a proper domain suffix like .com, .ca, etc.?
Here is the code:
$url = '#(http)?(s)?(://)?(([a-zA-Z])([-\w]+\.)+([^\s\.]+[^\s]*)+[^,.\s])#';
// wrap each match in an anchor tag
$output = preg_replace($url, '<a href="$0">$0</a>', trim($val[0]));
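One direction that might work (a sketch only; the TLD whitelist below is illustrative and would need extending for real data) is to require the matched domain to end in a known suffix, so sentence.This no longer matches:

// Sketch: only treat text as a URL when the domain ends in a whitelisted TLD.
// The whitelist (com|ca|net|org) is an example, not a complete list.
$pattern = '#\b(?:https?://)?(?:[-\w]+\.)+(?:com|ca|net|org)\b(?:/[^\s]*)?#i';
$output  = preg_replace($pattern, '<a href="$0">$0</a>', trim($val[0]));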
Thanks,
Aaron

Related

How can I use regex to construct an API call in my Jekyll plugin?

I'm trying to write my own Jekyll plugin to construct an API query from a custom tag. I've gotten as far as creating the basic plugin and tag, but I've run into the limits of my programming skills, so I'm looking to you for help.
Here's my custom tag for reference:
{% card "Arbor Elf | M13" %}
Here's the progress on my plugin:
module Jekyll
  class Scryfall < Liquid::Tag
    def initialize(tag_name, text, tokens)
      super
      @text = text
    end

    def render(context)
      # Store the name of the card, i.e. "Arbor Elf"
      # @card_name =

      # Store the name of the set, i.e. "M13"
      # @card_set =

      # Build the query
      # @query = "https://api.scryfall.com/cards/named?exact=#{@card_name}&set=#{@card_set}"

      # Store a specific JSON property
      # @card_art =

      # Finally we render out the result
      "<img src='#{@card_art}' title='#{@card_name}' />"
    end
  end
end

Liquid::Template.register_tag('card', Jekyll::Scryfall)
For reference, here's an example query using the above details (paste it into your browser to see the response you get back)
https://api.scryfall.com/cards/named?exact=arbor+elf&set=m13
My initial attempt after Googling around was to use a regex to split @text at the |, like so:
@card_name = "#{@text}".split(/| */)
This didn't quite work, instead it output this:
["A", "r", "b", "o", "r", " ", "E", "l", "f", " ", "|", " ", "M", "1", "3", " "]
I'm also then not sure how to access and store specific properties within the JSON response. Ideally, I can do something like this:
@card_art = JSONRESPONSE.image_uri.large
I'm well aware I'm asking a lot here, but I'd love to try and get this working and learn from it.
Thanks for reading.
Actually, your split should work – you just need to give it the correct regex (and you can call it on @text directly). You need to escape the pipe character in the regex, because | has a special meaning there: as written, /| */ means "empty string, or any run of spaces", and splitting on the empty string splits between every character, which is exactly the output you saw. You can use rubular.com to experiment with regexes.
parts = @text.split(/\|/)
# => ["Arbor Elf ", " M13"]
Note that they also contain some extra whitespace, which you can remove with strip.
@card_name = parts.first.strip
@card_set = parts.last.strip
This might also be a good time to answer questions like: what happens if the user inserts multiple pipes? What if they insert none? Will your code give them a helpful error message for this?
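One hedged way to handle that (a sketch; the choice to raise an ArgumentError and the wording of the message are assumptions, not part of the answer above):

parts = @text.split(/\|/)
unless parts.size == 2
  # Fail loudly at build time so the user sees which tag is malformed.
  raise ArgumentError, "card tag expects exactly one '|' between name and set, got: #{@text.inspect}"
end
@card_name, @card_set = parts.map(&:strip)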
You'll also need to escape these values in your URL. What if one of your users adds a card containing a & character? Your URL will break:
https://api.scryfall.com/cards/named?exact=Sword of Dungeons & Dragons&set=und
That looks like a URL with three parameters: exact, set, and Dragons. You need to encode user input before including it in a URL:
require 'cgi'
query = "https://api.scryfall.com/cards/named?exact=#{CGI.escape(#card_name)}&set=#{CGI.escape(#card_set)}"
# => "https://api.scryfall.com/cards/named?exact=Sword+of+Dungeons+%26+Dragons&set=und"
What comes after that is a little less clear, because you haven't written the code yet. Try making the call with the Net::HTTP module and then parsing the response with the JSON module. If you have trouble, come back here and ask a new question.
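To give a rough idea of where that could go (a sketch, untested; the image_uris/large keys are read off the example response linked in the question, so verify them against what the API actually returns):

require 'net/http'
require 'json'
require 'cgi'

uri = URI("https://api.scryfall.com/cards/named?exact=#{CGI.escape(@card_name)}&set=#{CGI.escape(@card_set)}")
response = Net::HTTP.get_response(uri)       # perform the GET request
card = JSON.parse(response.body)             # parse the JSON body into a Hash
@card_art = card.dig("image_uris", "large")  # nil-safe lookup of nested keys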

Regex capture group in Varnish VCL

I have a URL in the form of:
http://some-site.com/api/v2/portal-name/some/webservice/call
The data I want to fetch lives at
http://portal-name.com/webservices/v2/some/webservice/call
(Yes, I can rewrite the application so it uses other URLs, but we are testing Varnish at the moment, so for now it cannot be intrusive.)
But I'm having trouble extracting the right part of the URL in Varnish VCL. Replacing the api part with an empty string is no problem, but now I need the portal-name.
Things I've tried:
if (req.url ~ ".*/(.*)/") {
    set req.http.portalhostname = re.group.0;
    set req.http.portalhostname = $1;
}
From https://docs.fastly.com/guides/vcl/vcl-regular-expression-cheat-sheet and Extracting capturing group contents in Varnish regex
And yes, std is imported.
But this gives me either a
Syntax error at
('/etc/varnish/default.vcl' Line 36 Pos 35)
set req.http.portalhostname = $1;
or a
Symbol not found: 're.group.0' (expected type STRING_LIST):
So: how can I do this? When I have extracted the portalhostname I should be able to simply do a regsub to replace that value with an empty string and then prepend "webservices" and my URL is complete.
The Varnish version I'm using: varnish-4.1.8 revision d266ac5c6
Sadly, re.group seems to have been removed in some version. Similar functionality appears to be accessible via one of several vmods; see https://varnish-cache.org/vmods/
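That said, core VCL's regsub() can often stand in for capture-group extraction, since "\1" in the substitution string refers to the first group. A sketch against the URL scheme in the question (the patterns are assumptions; adjust them to your real paths):

sub vcl_recv {
    # Pull "portal-name" out of /api/v2/portal-name/some/webservice/call
    if (req.url ~ "^/api/v2/[^/]+/") {
        set req.http.portalhostname = regsub(req.url, "^/api/v2/([^/]+)/.*$", "\1");
        # Rewrite the rest of the path to the backend's layout
        set req.url = regsub(req.url, "^/api/v2/[^/]+/", "/webservices/v2/");
    }
}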

Using WWW::Mechanize to scrape multiple pages under a directory - Perl

I'm working on a project to scrape every interview found here into an HTML-ready document to be later dumped into a DB, which will automatically update our website with the latest content. You can see an example of my current scraping script, which I asked a question about the other day: WWW::Mechanize Extraction Help - PERL
The problem I can't seem to wrap my head around is knowing if what I'm trying to accomplish now is even possible. Because I don't want to have to guess when a new interview is published, my hope is to be able to scrape the website which has a directory listing of all of the interviews and automatically have my program fetch the content on the new URL (new interview).
Again, the site in question is here (scroll down to see the listing of interviews): http://millercenter.org/president/clinton/oralhistory
My initial thought was to add a regex of .\ at the end of the link above, in hopes that it would automatically search any links found under that page. I can't seem to get this to work using WWW::Mechanize, however. I will post what I have below; if anyone has any guidance or experience with this, your feedback would be greatly appreciated. I'll also summarize my tasks below the code so that you have a concise understanding of what we hope to accomplish.
Thanks to any and all that can help!
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::Link;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);

$mech->get("http://millercenter.org/president/clinton/oralhistory/\.");

# find all <dl> tags
my @list = $mech->find('dl');
foreach ( @list ) {
    print $_->as_HTML();
}

# # find all links
# my @links = $mech->links();
# foreach my $link (@links) {
#     print $link->url, "\n";
# }
To summarize what I'm hoping is possible:
Extract the content of every interview found here in an HTML ready document like I did here: WWW::Mechanize Extraction Help - PERL. This would require the 'get' action to be able to traverse the pages listed under the /oralhistory/ directory, which can perhaps be solved using a regex?
Possibly extract the respondent name and position fields on the directory page to be populated in a title field (this isn't that big of a deal if it can't be done)
No, you can't use wildcards in URLs... :-(
You'll have to parse the listing page yourself, and then get and process the pages in a loop.
Extracting specific fields from a page's contents will be a straightforward task with WWW::Mechanize...
UPDATE: answering OP comment:
Try this logic:
use strict;
use warnings;
use WWW::Mechanize;
use LWP::Simple;
use File::Basename;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get("http://millercenter.org/president/clinton/oralhistory");

# find all links that point into the oralhistory directory
my @list = $mech->find_all_links( url_regex => qr{/president/clinton/oralhistory/} );

foreach my $link (@list) {
    my $url       = $link->url_abs();   # absolute URL of the interview page
    my $localfile = basename($url);
    my $localpath = "./$localfile";
    print "$localfile\n";
    getstore($url, $localpath);         # save the page to disk
}
My answer is focused on the approach of how to do this. I'm not providing code.
There are no IDs in the links, but the names of the interview pages seem to be fine to use. You need to parse them out and build a lookup table.
Basically you start by building a parser that fetches all the links that look like an interview. That is fairly simple with WWW::Mechanize. The page URL is:
http://millercenter.org/president/clinton/oralhistory
All the interviews follow this schema:
http://millercenter.org/president/clinton/oralhistory/george-mitchell
So you can find all links in that page that start with http://millercenter.org/president/clinton/oralhistory/. Then you make them unique, because there is this teaser box slider thing that showcases some of them, and it has a read more link to the page. Use a hash to do that like this:
my %seen;
foreach my $url (@urls) {
    $mech->get($url) unless $seen{$url};
    $seen{$url}++;
}
Then you fetch the page and do your stuff and write it to your database. Use the URL or the interview name part of the URL (e.g. george-mitchell) as the primary key. If there are other presidents and you want those as well, adapt in case the same name shows up for several presidents.
Then you go back and add a cache lookup into your code. You grab all the IDs from the DB before you start fetching the page, and put those in a hash.
# prepare query and stuff...
my %cache;
while (my $res = $sth->fetchrow_hashref) {
    $cache{$res->{id}}++;
}

# later...
foreach my $url (@urls) {
    next if $cache{$url};  # or grab the ID out of the url
    next if $seen{$url};
    $mech->get($url);
    $seen{$url}++;
}
You also need to filter out the links that are not interviews. One of those would be http://millercenter.org/president/clinton/oralhistory/clinton-description, which is the "read more" target of the first paragraph on the page.
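A hedged sketch of that filter (the clinton-description exclusion comes from the paragraph above; any other non-interview pages would need to be added the same way):

# keep only URLs that are not on the known non-interview list
my @interviews = grep { $_ !~ m{/oralhistory/clinton-description$} } @urls;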

regex quandry - for word matching

I am out of my depth here, currently reading the tutorials and using Python to learn regex.
I have a website where a php file http://www.example.com/showme.php?user=JOHN will load the visitor page of JOHN. However, I want to let John have his own vanity URL like john.example.com and rewrite it to http://www.example.com/showme.php?user=JOHN .
I know it can be done, and after fiddling with it, it seems lighttpd's mod_rewrite is the way to go. Now I am stumped as I try to come up with a regex to match!
rewrite ("^![www]\.example\.com" => "www\.example\.com\?user=###");
I am playing with Python's re module to test out several ways of getting the john from john.example.com, recognizing when the first segment of the URL is not www, and then redirecting. The above was my trial. Am I even on the right continent?
Any help will be appreciated in:
recognizing when the first part of the URL before the first . is not www and is something else, so that example.com won't stump it.
getting the first part of the URL before the first . and tagging it to user=###
Thanks a bunch
Use lighttpd's mod-rewrite module. Add this to your lighttpd.conf file:
$HTTP["host"] != "www.example.com" {
$HTTP["host"] =~ "^([^.]+)\.example\.com$" {
url.rewrite-once = (
"^/?$" => "/showme.php?user=%1"
)
}
}
For an href value like /dir/page.php, the domain part of the link gets automatically added from the current request as shown in the browser's address bar. So, if you had used www.example.com, the link would point to http://www.example.com/dir/page.php, and likewise for john.example.com.
For all your links to point at www.example.com, you need to be accessing the page using www. This is possible only if you do an external redirect from the vanity URL to the actual one, i.e. users can still use the shortened URL, but they get redirected to the actual one.
$HTTP["host"] != "www.example.com" {
$HTTP["host"] =~ "^([^.]+)\.example\.com$" {
url.redirect = (
"^/?$" => "http://www.example.com/showme.php?user=%1"
)
}
}
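A quick way to verify the redirect from a shell (hypothetical hostname; assumes DNS or an /etc/hosts entry for the subdomain):

# ask for just the headers and check where we are sent
curl -sI http://john.example.com/ | grep -i '^Location:'
# Location: http://www.example.com/showme.php?user=john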

Extracting URI from web pages via BASH

I need to pull all the links for a page that resides on an intranet; however, I am unsure how best to do it. The structure of the site is as follows:
List of topics
Topic 1
Topic 2
Topic 3
etc
Now the links reside in each of the topic pages. I want to avoid going through in excess of 500 topic pages manually to extract the URI.
Each of the topic pages has the following structure
http://alias/filename.php?cat=6&number=1
The cat parameter refers to the category and the number parameter refers to the topic.
Once in the topic page the URI I need to extract exists in a particular format again
http://alias/value?id=somevalue
Caveats
I don't have access to the database so the option to trawl through it is not an option
There is only ever a single URI in each topic page
I need to extract the list to a file that simply lists each URI in a new line
I would like to execute some sort of script from the terminal via Bash that will trawl through the topic-listing URI and then the URI in each of the topic pages.
In a nutshell
How can I, with a script run from Bash, recursively go through the list of topics, extract the URI from each topic page, and write each extracted URI to a text file, one per line?
I would implement this with Perl, using the HTML::TokeParser and WWW::Mechanize modules:
use strict;
use warnings;
use HTML::TokeParser;
use WWW::Mechanize;

my $site = WWW::Mechanize->new(autocheck => 1);
my $topicmax = 500;  # Note: adjust this to the number of topic pages you have

# loop through each topic page
foreach (1..$topicmax) {
    my $topicurl = "http://alias/filename.php?cat=6&number=$_";

    # get the page
    $site->get($topicurl);
    my $p = HTML::TokeParser->new(\$site->content);

    # parse the page and extract the links
    while (my $token = $p->get_tag("a")) {
        my $url = $token->[1]{href};
        next unless defined $url;
        # use a regex to test for the link format we want
        if ($url =~ m{^http://alias/value\?id=}) {
            print "$url\n";
        }
    }
}
The script prints to stdout, so you just need to redirect it to a file.
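For example (assuming the script above is saved as extract_uris.pl; the filename is illustrative):

perl extract_uris.pl > uris.txt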