Perl 5: How to improve regex for URL parsing - regex

I'm trying to parse a text file of tweets and remove URLs and put them into a urls.txt file. At the moment, I have this regex:
($line =~ /((?:https?|ftp|telnet|gopher|file|imap):\/\/[\w\-\.\~\!\*\'\(\)\;\:\#\&\=\+\$\,\/\\\?\%\#\[\]]*)/)
But as I want to build on it further, and it's quite unwieldy even now, I'm wondering if there's any way I can check for valid URL characters (the [\w\-\.\~\!\*\'\(\)\;\:\#\&\=\+\$\,\/\\\?\%\#\[\]]* part) using something like an array or a hash. Or anything that doesn't make it so unnecessarily verbose.
The rest of my code can be provided if needed for whatever reason.

If you want to validate a URL, why not use a module from CPAN to do the hard work for you?
use URI;
my $uri = URI->new("http://www.perl.com");
See the documentation of the URI module on CPAN for details.
As recommended by Sobrique, you could also use:
use Data::Validate::URI qw(is_uri);
if (is_uri("http://www.perl.com")) {
...
}
See the documentation of the Data::Validate::URI module on CPAN for details.
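If you would rather keep a regex, one way to make it less unwieldy is to build it from named pieces with qr// and the /x flag. The following is only a minimal sketch: the scheme list and character set are copied from the question (most punctuation needs no backslash inside a character class), while the variable names, the extract_urls helper and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Schemes taken from the question; extend the list as needed.
my @schemes   = qw(https? ftp telnet gopher file imap);
my $scheme_re = join '|', @schemes;

# Characters allowed after "://" (same set as the question's class,
# written without the redundant backslashes).
my $url_chars = qr{[\w\-.~!*'();:\#&=+\$,/\\?%\[\]]};

# /x lets the final pattern be spaced out and read piece by piece.
my $url_re = qr{ ( (?: $scheme_re ) :// $url_chars* ) }x;

# Hypothetical helper: return every URL found in one line of text.
sub extract_urls {
    my ($line) = @_;
    return $line =~ /$url_re/g;
}

# Illustrative input, not from the question.
my @urls = extract_urls('see http://example.com/a?b=1 and ftp://host/file.txt now');
print "$_\n" for @urls;
```

Adding a scheme then only means adding a word to @schemes, and the character set lives in one place.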

Related

media monks crawler, blacklist

I'm using MediaMonks crawler to crawl some websites.
Packagist link
There is a function called blacklist, and I'd like to use it to avoid crawling all URLs that have hashtags in them.
Something like this:
// TODO: Write the correct regular expression.
$crawler->addBlacklistUrlMatcher(new Matcher\PathRegexUrlMatcher('/#/'));
I'm really bad with regular expressions, can anyone help me with this?
It probably depends on how the blacklist matcher works, but in the general case, if you want to catch a whole line containing a # symbol, this is the pattern you need:
.*\#.*
This will catch whole line(s) containing # symbol, for instance:
#somehashtag
#some hash tag
This site will help you to create regex for your needs: https://regex101.com

Parse HTML using perl regex

I created a Perl script that would use an online website to crack MD5 hashes after the user inputs the hashes. I am partially successful as I am able to get the response from the website, though I need to parse the HTML and display the hash, and corresponding password in clear text to the user. The following is the output snippet I get now:
<strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>
Using RegexBuddy, I was able to use the expression [a-z0-9]{32} to match the hash part alone. I need the final output in the following format:
21232f297a57a5a743894a0e4a801fc3: admin
Any help would be appreciated. Thank you!
I think you'd be much better off using HTML::Parser to simply/reliably parse that HTML. Otherwise you're into the nightmare of parsing HTML with regexps, and you'll find that doesn't work reliably.
There are a few tools that can handle both fetching and parsing the page for you available on CPAN. One of them is Web::Scraper. Tell it what page to fetch and which nodes (in xpath or CSS syntax) you want, and it will get them for you. I'll not give an example as I don't know your URL.
There is a good blogpost about this on blogs.perl.org by stas that uses a different module that might also be helpful.
Here it is:
my $str = q{<strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>};
# Capture the hash and the text that follows it, dropping the tags.
my @arr = $str =~ m{<strong>(.+)</strong>(.+)</p>};
print(join("", @arr), "\n");
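If the response contains several results, the same idea can be applied in a loop with the /g modifier. This is a minimal sketch, assuming every result has the same <strong>hash</strong>: password</p> shape as the snippet in the question; the second sample entry and the %cracked hash are made up for illustration, and a real HTML parser remains the more robust choice.

```perl
use strict;
use warnings;

# Sample response text; the second entry is invented for illustration.
my $html = <<'HTML';
<p><strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>
<p><strong>5f4dcc3b5aa765d61d8327deb882cf99</strong>: password</p>
HTML

my %cracked;
# Each iteration picks up the next hash/password pair.
while ($html =~ m!<strong>([a-f0-9]{32})</strong>:\s*([^<]*)</p>!g) {
    my ($hash, $plain) = ($1, $2);
    $plain =~ s/\s+$//;          # trim trailing whitespace
    $cracked{$hash} = $plain;
}

print "$_: $cracked{$_}\n" for sort keys %cracked;
```

Restricting the hash to [a-f0-9]{32} instead of .+ keeps the match from running across more than one tag.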

Regex don't match if URI contains extension

I'm getting stuck on a bit of regex, needed in an htaccess file on an old project I've taken on. I want to match the following uris
/page?id=12
/admin/users-view?id=3242
/subscribe
There may or may not be a query string, and there may or may not be multiple segments.
I need to insert a .php extension before the query string, so the first example becomes
/page.php?id=12
I also cannot match any uri with a file extension, so that images, js or css files do not get matched.
I came up with this:
^([/\w-]+)?/?
which does what I need apart from the last point. My regex skills are poor, so any help is appreciated.
Don't parse URIs with regexps; PHP has built-in functions for that:
http://php.net/manual/en/function.parse-url.php
Note that there is also a reverse function that builds a URL (it is provided by the pecl_http extension rather than core PHP):
http://php.net/manual/en/function.http-build-url.php
You should use them instead of a regexp because they will (or at least should) handle URL encoding correctly.
You might want to think about disassembling the URL with parse_url and putting it back together after manipulation.
However, for a pure regex solution, I think I would try to find a run of characters containing no periods that starts at a slash (or the beginning of the string) and ends at a question mark (or the end of the string):
$url = preg_replace('~(^|/)[^.?]*(?=[?]|$)~', '$0.php', $url);
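The parse_url solution would look more like this (path and query string only, as in the question's URIs):
$urlParts = parse_url($url);
// pathinfo() returns an empty string, not null, when there is no extension
if (pathinfo($urlParts['path'], PATHINFO_EXTENSION) === '') {
    $urlParts['path'] .= '.php';
}
// implode() would drop the "?" separator, so rebuild the URL explicitly
$url = $urlParts['path'] . (isset($urlParts['query']) ? '?' . $urlParts['query'] : '');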
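The regex in the preg_replace call above is not PHP-specific, so it can be checked quickly against the question's URIs in other tools. The following is only a sketch in Perl, assuming the same pattern with the match captured in a group; the /css/site.css entry is an invented example of a URI that must stay untouched, and the substitution is applied once per URI.

```perl
use strict;
use warnings;

# The three URIs from the question plus one made-up static file.
my @uris = ('/page?id=12', '/admin/users-view?id=3242', '/subscribe', '/css/site.css');

my @rewritten;
for my $uri (@uris) {
    my $copy = $uri;
    # Same pattern as the PHP version: a period-free run that starts at a
    # slash (or the start) and ends at "?" (or the end of the string).
    $copy =~ s{((?:^|/)[^.?]*)(?=[?]|$)}{$1.php};
    push @rewritten, $copy;
}

print "$uris[$_] -> $rewritten[$_]\n" for 0 .. $#uris;
```

The static file is left alone because its last segment contains a period, so the lookahead never succeeds.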

How can I manipulate just part of a Perl string?

I'm trying to write some Perl to convert some HTML-based text over to MediaWiki format and hit the following problem: I want to search and replace within a delimited subsection of some text and wondered if anyone knew of a neat way to do it. My input stream is something like:
Please <a href="mailto:help@myco.com&Subject=Please help&Body=Please can some one help me out here">mail support</a> if you want some help.
and I want to change Please help and Please can some one help me out here to Please%20help and Please%20can%20some%20one%20help%20me%20out%20here respectively, without changing any of the other spaces on the line.
Naturally, I also need to be able to cope with more than one such link on a line so splicing isn't such a good option.
I've taken a good look round Perl tutorial sites (it's not my first language) but didn't come across anything like this as an example. Can anyone advise an elegant way of doing this?
Your task has two parts. Find and replace the mailto URIs - use a HTML parsing module for that. This topic is covered thoroughly on Stack Overflow.
The other part is to canonicalise the URI. The module URI is suitable for this purpose.
use URI::mailto;
my @hrefs = ('mailto:help@myco.com&Subject=Please help&Body=Please can some one help me out here');
print URI::mailto->new($_)->as_string for @hrefs;
__END__
mailto:help@myco.com&Subject=Please%20help&Body=Please%20can%20some%20one%20help%20me%20out%20here
Why don't you just search for the "Body=" tag up to the closing quote and replace every space with %20?
I would not even use regular expressions for that, since I don't find them useful for anything except mass changes where everything on the line is changed.
A simple loop might be the best solution.
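For the general question of changing only a delimited part of a string, the /e modifier lets the replacement be computed from the matched part, and /g handles several links on one line. This is a minimal sketch using only core Perl, assuming the links appear as double-quoted href="mailto:..." attributes; the sample line is modelled on the question, and for arbitrary HTML a parsing module is still the safer route.

```perl
use strict;
use warnings;

# Sample input modelled on the question (single quotes: no interpolation).
my $line = 'Please <a href="mailto:help@myco.com&Subject=Please help&Body=Please can some one help me out here">mail support</a> if you want some help.';

# Match each double-quoted mailto href and rewrite only its contents.
$line =~ s{(href="mailto:)([^"]*)(")}{
    my ($open, $uri, $close) = ($1, $2, $3);
    $uri =~ s/ /%20/g;           # encode spaces inside the URI only
    $open . $uri . $close;
}ge;

print "$line\n";
```

Spaces outside the quoted attribute are never part of a match, so the rest of the line is left as it was.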

replace url paths using Regex

How can I change the url of my images from this:
http://www.myOLDwebsite.com/**********.*** (i have gifs, jpgs, pngs)
to this:
http://www.myNEWwebiste.com/somedirectory/**********.***
using a text editor with regex support?
Thanks a lot for your time.
[]'s
Mateus
Why use regex?
Using conventional means, replace:
src="http://www.myOLDwebsite.com/
with:
src="http://www.myNEWwebiste.com/somedirectory/
Granted, this assumes your image tags always follow the 'src="<url>"' pattern, with double quotes and everything.
Using regex is of course also possible. Replace this:
(src\s*=\s*["'])http://www\.myOLDwebsite\.com/
with:
\1http://www.myNEWwebiste.com/somedirectory/
alternatively, if your text editor uses $ to mark back references:
$1http://www.myNEWwebiste.com/somedirectory/
On second thought - why do your images have absolute URLs in the first place? Isn't that unnecessary?
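The back-reference replacement described above can be tried outside a text editor as well. The following is a small sketch in Perl, assuming the same pattern and the URLs as spelled in the question; the file names photo.jpg and anim.gif are made up for illustration.

```perl
use strict;
use warnings;

# Two sample image tags, one with double and one with single quotes.
my $html = q{<img src="http://www.myOLDwebsite.com/photo.jpg"> <img src = 'http://www.myOLDwebsite.com/anim.gif'>};

# ${1} puts back the captured src= prefix and opening quote,
# so only the host part of each URL is replaced.
$html =~ s{(src\s*=\s*["'])http://www\.myOLDwebsite\.com/}{${1}http://www.myNEWwebiste.com/somedirectory/}g;

print "$html\n";
```

Capturing the prefix keeps whatever quoting and spacing each tag used.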
Well, the easiest way is probably to use sed in in-place mode (GNU sed syntax):
sed -i \
's#http://www[.]myOLDwebsite[.]com/#http://www.myNEWwebsite.com/subdirectory/#g' \
file1 file2 ...
If for some reason you need to actually interpret the HTML (rather than just do a simple string replacement), a quick script built around BeautifulSoup is going to be safer -- lots of people try to do HTML or XML parsing via regular expressions, but it's very hard if not impossible to cover all corner cases.
All that said, it'd be better if you were using relative links to not have your HTML depend on the server it's hosted on. See also the <BASE HREF="..."> element you can put in your <HEAD> to specify a location all URLs are relative to; if you were using that, you'd only need to do a single replacement.