Process a Perl qr RegEx string in reverse direction

I have a series of qr// RegEx string patterns to match URLs fed to my site. For example, qr#^/safari/article/([0-9]+)\.html(\?(.*))?$#. This string would match a path from a URL such as /safari/article/299.html?parameter=1. I have a separate subroutine where I can create URLs to link to different parts of the program. It occurred to me that it would be nice if that latter part could somehow use those aforementioned patterns I have already written -- it would reduce the likelihood of error if both the way URLs were generated and the way they are later processed came from the same set of patterns.
When a user comes to the site, my program takes the URL given to the server, runs it against patterns like the one above, and outputs $1 and $2 with the pieces it finds (e.g. "299" and "parameter=1", the two parameters for loading a page). In essence, now I'd like to do that in reverse: somehow provide $1 and $2, feed them against that qr// string, and create a new path as output (say, I'd set $1 to "300" and $2 to "parameter=2", somehow merge that against the qr// string, and get the output /safari/article/300.html?parameter=2).
Is there a simple way to do that sort of "reverse regex"? It seems like one way to do it would simply be to do a regex pattern match against those two parenthetical patterns, but that somehow feels sloppy to me. Is there a cleaner way?
EDIT: Part of the reason for storing the patterns in RegEx is that they all get thrown into a multidimensional array for later processing that can help figure out what module should be called. Here's a couple of sample items:
[
{ function => 'article', pattern => qr#^/safari/article/([0-9]+)\.html(\?(.*))?$#, weight => 106 },
{ function => 'topCommentedPosts', pattern => qr#^/safari/top\.html$#, weight => 100 }
]

I'm not sure I understand exactly what you want to achieve. The following works, but going this way seems rather fragile and dangerous. Why do you need to generate the paths, anyway?
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $TEMPLATE = '/safari/article/$1.html?$2';
sub generate {
    my (@replacements) = @_;
    return $TEMPLATE =~ s/\$([12])/$replacements[$1-1]/gr;
}

sub match {
    my ($string) = @_;
    my $regex = "$TEMPLATE";
    $regex =~ s/([?.])/\\$1/g;
    $regex =~ s/\$[0-9]+/(.*)/g;
    return $string =~ /$regex/;
}
use Test::More;
is generate(300, 'parameter=2'), '/safari/article/300.html?parameter=2';
is_deeply [match('/safari/article/299.html?parameter=1')], [299, 'parameter=1'];
done_testing();
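If you do end up generating paths from the same table, one alternative to reversing the regex is to store a small builder sub alongside each pattern, so both directions come from one place. This is only a sketch of that idea; the build key is an invented field, not something your dispatch table already has:
my @routes = (
    {
        function => 'article',
        pattern  => qr#^/safari/article/([0-9]+)\.html(\?(.*))?$#,
        build    => sub {
            my ($id, $query) = @_;
            return '/safari/article/' . $id . '.html'
                 . (defined $query ? "?$query" : '');
        },
        weight   => 106,
    },
    {
        function => 'topCommentedPosts',
        pattern  => qr#^/safari/top\.html$#,
        build    => sub { '/safari/top.html' },
        weight   => 100,
    },
);

# Matching direction, as before:
my ($hit) = grep { '/safari/article/299.html?parameter=1' =~ $_->{pattern} } @routes;

# Generating direction, using the same table entry:
my $path = $routes[0]{build}->(300, 'parameter=2');
print "$path\n";   # /safari/article/300.html?parameter=2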

Related

Regex gets more results than are available in the text

I have a really weird problem: I'm searching for URLs on an HTML page and want only a specific part of the URL. In my test HTML page the link occurs only once, but instead of one result I get about 20...
This is the regex I'm using:
perl -ne 'm/http\:\/\/myurl\.com\/somefile\.php.+\/afolder\/(.*)\.(rar|zip|tar|gz)/; print "$1.$2\n";'
sample input would be something like this:
<html><body><a href="http://myurl.com/somefile.php?x=foo&y=bla&z=sdf&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore">Somelinknme</a></body></html>
This is a very simple example; in reality the link would appear on a normal website with content around it...
my result should be something like this:
testfile.zip
But instead I see this line printed very often... Is this a problem with the regex or with something else?
Yes, the regex is greedy.
Use an appropriate tool for HTML instead: HTML::LinkExtor or one of the link methods in WWW::Mechanize, then URI to extract a specific part.
use 5.010;
use WWW::Mechanize qw();
use URI qw();
use URI::QueryParam qw();
my $w = WWW::Mechanize->new;
$w->get('file:///tmp/so10549258.html');
for my $link ($w->links) {
    my $u = URI->new($link->url);
    # 'http://myurl.com/somefile.php?x=foo&y=bla&z=sdf&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore'
    say $u->query_param('path');
    # '/foo/bar/afolder/testfile.zip'
    $u = URI->new($u->query_param('path'));
    say(($u->path_segments)[-1]);
    # 'testfile.zip'
}
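HTML::LinkExtor, mentioned above, also works on its own if you do not need the rest of WWW::Mechanize. A rough sketch, reusing the same local test file path:
use HTML::LinkExtor qw();
use URI qw();
use URI::QueryParam qw();

# With no callback, HTML::LinkExtor accumulates the links internally.
my $extor = HTML::LinkExtor->new;
$extor->parse_file('/tmp/so10549258.html');

for my $link ($extor->links) {
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' && defined $attr{href};
    my $uri  = URI->new($attr{href});
    my $path = $uri->query_param('path') or next;        # /foo/bar/afolder/testfile.zip
    print +(URI->new($path)->path_segments)[-1], "\n";   # testfile.zip
}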
Are there 20 lines following in the file after your link?
Your problem is that the match variables are not reset. You match your link the first time, and $1 and $2 get their values. On the following lines the regex does not match, but $1 and $2 still hold the old values, so you should print only if the regex matches, not every time.
From perlre, see section Capture Groups
NOTE: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.
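A minimal way to apply that advice to the one-liner from the question is to move the print inside the condition, so nothing is printed on non-matching lines (the pattern is abbreviated here, and page.html is just a stand-in for whatever file you are scanning):
# Print only when the pattern matched on the current line;
# otherwise $1/$2 would still hold values from an earlier successful match.
perl -ne 'print "$1.$2\n" if m{/afolder/(.*)\.(rar|zip|tar|gz)}' page.html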
This should do the trick for your sample input & output.
$Str = '<html><body><a href="http://myurl.com/somefile.php?x=foo&y=bla&z=sdf&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore">Somelinknme</a></body></html>';
@Matches = ($Str =~ m#path=.+/(\w+\.\w+)#g);
print @Matches;

How to remove a part of an URL with regexes?

How can I turn this:
http://site.com/index.php?id=15
Into this?:
http://site.com/index.php?id=
Which RegEx(s) do I use?
I've been trying to do this for a good 2 hours now and I've had no luck.
I can't seem to take out the number(s) at the end, and sometimes there are letters at the end as well, which give me problems.
I am using Bing! instead of Google.
My RegEx so far is this when I search something:
$start = '<h3><a href="';
$end = '" onmousedown=';
while ($result =~ m/$start(.*?)$end/g)
What can I add in there to take out the letters and digits at the end and just leave the equals sign?
Thank you.
Since you cannot parse [X]HTML properly with regular expressions, you should look for the minimum possible context that will get you the href you want.
To the best of my knowledge, the one character that cannot appear in an href is the double quote ("), therefore
/href="([^"]+)"/
Should yield a URL in $1. I would sanity check it for URL-ishness before extracting the id string you want, and then:
s/\?id=\w+/?id=/
But this has hack written all over it, because you can't parse HTML with regular expressions. So it will probably break the first time you demonstrate it to a customer.
You should really check out proper Perl parsing: http://www.google.com/webhp?q=perl+html+parser
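Put together, the two steps might look like the sketch below; the $html line is an invented stand-in for one search-result row shaped like the $start/$end markers in the question, and the caveat about parsing HTML with regexes still applies:
use strict;
use warnings;

# Invented sample row for illustration only.
my $html = '<h3><a href="http://site.com/index.php?id=15" onmousedown="return c(this)">result</a></h3>';

while ( $html =~ /href="([^"]+)"/g ) {
    my $url = $1;
    next unless $url =~ m{^https?://};    # crude sanity check for URL-ishness
    (my $stripped = $url) =~ s/\?id=\w+/?id=/;
    print "$stripped\n";                  # http://site.com/index.php?id=
}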
You asked for a regular expression solution but your problem is a bit ill-defined and regexes for HTML are only for stop-gap/one-off stuff or else you’re probably just hurting yourself.
Since I am really not positive what your actual need and HTML source look like this is a generic solution to taking a URL and spitting out all the links found on the page without query strings. Having id= is for all reasonable purposes/code equivalent to no id.
There are many ways, at least three or four of them good solutions, to do this in Perl. This is one that is often overlooked: libxml. Docs: XML::LibXML, URI, and URI::QueryParam (if you want better query manipulation).
use warnings;
use strict;
use URI;
use XML::LibXML;
my $source = shift || die "Give a URL!\n";
my $parser = XML::LibXML->new;
$parser->recover(1);
my $doc = $parser->load_html( location => $source );
for my $anchor ( $doc->findnodes('//a[@href]') )
{
    my $uri = URI->new_abs( $anchor->getAttribute("href"), $source );
    # commented out ideas.
    # next unless $uri->host eq "TARGET HOST NAME";
    # next unless $uri->path eq "TARGET PATH";
    # Clear the query completely; id= might as well be nothing.
    $uri->query(undef);
    print $uri, $/;
}
It sounds like maybe you’re using Bing! for scraping. This kind of thing is against pretty much every search engine’s ToS. Don’t do it. They have APIs (well, Google does at least) if you register and get a dev token.
I'm not 100% sure what you are doing, but this is the problem:
while ($result =~ m/$start(.*?)$end/g)
What's the purpose of this loop? You're taking a scalar called $result and checking for a pattern match. How is $result changing?
Your original question was how to make this:
http://site.com/index.php?id=15
into this:
http://site.com/index.php?id=
That is, how do you remove the 15 (or another number) from the expression. The answer is pretty simple:
$url =~ s/=\d+$/=/;
That'll anchor your regular expression at the end of the URL replacing the ending digits with nothing.
If you're removing any string, it's a bit more complex:
$url =~ s/=[^=]+/=/;
You can't simply use \S+ because regular expressions are normally greedy. Therefore, you want to specify any series of non-equal sign characters preceded by an equal sign.
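For example, a quick check of both substitutions:
use strict;
use warnings;

my $url = 'http://site.com/index.php?id=15';

(my $digits_only = $url) =~ s/=\d+$/=/;   # strips trailing digits, anchored at the end
(my $any_value   = $url) =~ s/=[^=]+/=/;  # strips the run of non-'=' characters after the first '='
print "$digits_only\n";                   # http://site.com/index.php?id=
print "$any_value\n";                     # http://site.com/index.php?id=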
Now, as for the while loop, maybe you want an if statement instead...
if ($result =~ /$start(.*?)$end/g) {
print "Doing something if this matched\n";
}
else {
print "Doing something if there's no match\n";
}
And, I'm not sure what this means:
I am using Bing! instead of Google.
Are you trying to parse the input from Bing!? If so, please explain exactly what you're really trying to do. Maybe we know a better way of doing this. For example, if you're parsing the output of a search result, there might be an API that you can use.
How can I turn this:
http://site.com/index.php?id=15
Into this?:
http://site.com/index.php?id=
I think this is the solution you are looking for
#!/usr/bin/perl
use strict;
use warnings;
my $url = "http://site.com/index.php?id=15";
$url =~ s/(?<=id=).*//g;
print $url;
Output:
http://site.com/index.php?id=
As per your need, anything after the = sign will be omitted from the URL.

regex replacement that adds text

I'm trying to do link rewriting in my mobile application (written in ruby). I would like it to be able to accomplish both of these rewrites with a single regular expression:
m.example.com -> www.example.com
m.subd.example.com -> subd.example.com
The closest I've gotten is replacing this:
m\.([a-z\.]*)example\.com
with this:
$1example.com
This works for the m.subd.example.com but it fails for m.example.com because of my "www." exception.
I do this A LOT, so I'd like it to be very fast, which is why I am trying to avoid using any code, just a single regex. Is it possible? Is there a fancy feature of regex that I don't know about?
I am trying to avoid using any code, just a single regex
Regex is code. A more complex regex takes longer to run. You'll need to write some code or run two regexes.
result = subject.gsub(/m\.([a-z.]*)example\.com/, '\1example.com').gsub(/^example\.com/, 'www.example.com')
I don't know Ruby, but here is a Perl script that does the job for the examples you've given. May be it could be translated.
#!/usr/local/bin/perl
use strict;
use warnings;
my @list = qw/m.example.com m.subd.example.com/;
my $re = qr#^m\.(.*)(example\.com)$#;
foreach (@list) {
    print $_;
    s/$re/($1 || "www.") . $2/e;
    print " -> $_ \n";
}
output:
m.example.com -> www.example.com
m.subd.example.com -> subd.example.com

How can I change my regular expression to read UTF-8?

I got very far in a script I am working on only to find out it has a problem reading UTF-8 characters.
I have a contact in Sweden that made a VM on his machine with some UTF-8 in it and when my script hit that VM it lost its mind, but it was able to read all of the other VMs that are in the "normal" charset.
Anyhow, maybe my code will make more sense.
#!/usr/bin/perl
use strict;
use warnings;
#use utf8;
use Net::OpenSSH;
# Create a hash for storing the options needed by Net::OpenSSH
my %ssh_options = (
    port     => '22',
    user     => 'root',
    password => 'password',
);

# Create a new Net::OpenSSH object
my $ssh = Net::OpenSSH->new('192.168.2.101', %ssh_options);

# Create an array and capture the ESX\ESXi output from the current server
my @getallvms = $ssh->capture('vim-cmd vmsvc/getallvms');
shift @getallvms;

# Process data gathered from server
foreach my $vm (@getallvms) {
    # Match ID, NAME
    $vm =~ m/^(?<id> \d+)\s+(?<name> .+?)\s+/xm;
    my $id   = "$+{id}";
    my $name = "$+{name}";
    print "$id\n";
    print "$name\n";
    print "\n";
}
I have narrowed it down to my regular expression as the problem; here is the raw output from the server before the regular expression is applied.
416
TEST Box åäö!"''*#
And this is what I get after I apply my regular expression
416
TEST
For some reason the regular expression is not matching, I just don't know why. And the current regular expression in the example is the third attempt at getting it to work.
The FULL line that I am matching looks like this. My regular expression was written that way because I only need the first two blocks of information; the expression you suggested wants to copy the entire line.
The code:
432 TEST Box åäö!"''*# [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ _''_+Iw.vmx slesGuest vmx-04
The subpattern
(?<name> .+?)\s+
in your regular expression means “match and remember one or more non-newline characters, but stop as soon as you find whitespace,” so $name contains TEST because the pattern stopped matching when it saw the space just before Box.
The VI Toolkit wiki gives an example of the getallvms subcommand's output:
# vmware-vim-cmd -H 10.10.10.10 -U root -P password /vmsvc/getallvms
Vmid Name File Guest OS Version Annotation
64 bartPE [store] BartPE/BartPE.vmx winXPProGuest vmx-04
96 trustix [store] Trustix/Trustix.vmx otherLinuxGuest vmx-04
The case is slightly different from the example in your question, but it appears that we can look for [store] as a bumper for the match:
/^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix
The non-greedy quantifier +? means match one or more of something, but the match wants to hand control to the rest of the pattern as quickly as possible. Remember that [ has a special meaning in regular expressions, but the pattern \[ matches a literal rather than introducing a character class.
I think of this technique as bookending or tacking-and-stretching. If you want to extract a chunk of text that's difficult to characterize, look for surrounding features that are easy to match—often as simple as ^ or $. Then use a stretchy pattern to grab everything in between, usually (.+) or (.+?). Read the “Quantifiers” section of the perlre documentation for an explanation of your many options.
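Applied to the sample line from the question, the bookended pattern looks like this (the line is abridged, and the script assumes a UTF-8-capable terminal):
use strict;
use warnings;
use utf8;                                   # the literal below contains UTF-8 characters
binmode STDOUT, ':encoding(UTF-8)';

my $line = q{432   TEST Box åäö!"''*#   [Store] TEST Box ... .vmx slesGuest vmx-04};

if ( $line =~ /^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix ) {
    print "$+{id}\n";     # 432
    print "$+{name}\n";   # TEST Box åäö!"''*#
}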
This fixes the immediate problem, and you can also add polish in a few areas.
Do not use $1, $2, and friends unconditionally! Always test that the pattern matches before using capture variables. For example
if (/(foo|bar|baz)/) {
print "got $1\n";
}
else {
print "no match\n";
}
An unprotected print $1 can produce surprising results that are tough to debug.
Judicious use of Perl's defaults can help emphasize the computation and lets the mechanism fade into the background. Dropping $vm in favor of $_ as the implicit loop variable and implicit match target makes for a nicer result.
Your comments merely translate from Perl to English. The most helpful comments explain the why, not the what. Also keep in mind Rob Pike's advice on commenting:
If your code needs a comment to be understood, it would be better to rewrite it so it's easier to understand.
In the assignments from %+, the quotes don't do anything useful. The values are already strings, so remove the quotes.
my $id = $+{id};
my $name = $+{name};
Below is a modified version of your code that captures everything after the number but before [store] into $name. The utf8 pragma declares that your source code—not, as with a common mistake, your input—contains UTF-8. The test below simulates with a canned echo the output from vim-cmd on the Swedish VM.
As Tom suggested, I use the Encode module to decode the output that arrives through the SSH connection and encode it for benefit of the local host before printing it out.
The perlunifaq documentation advises decoding external data into Perl's internal format and then encoding any output just before it's written. I assume that the value returned from $ssh->capture(...) uses UTF-8 encoding, that is, that the remote host is sending UTF-8. We see the expected result because I'm running a modern distribution of Linux and ssh-ing back to it, but in the wild, you may be dealing with some other encoding.
You're able to get away with skipping the calls to decode and encode because Perl's internal format happens to match those of the hosts you're using. In general, however, cutting corners can get you into trouble:
What if I don't decode?
What if I don't encode?
Finally, the code!
#! /usr/bin/env perl
use strict;
use utf8;
use warnings;
use Encode;
use Net::OpenSSH;
my %ssh_options = ();
my $ssh = Net::OpenSSH->new('localhost', %ssh_options);
# Create an array and capture the ESX\ESXi output from the current server
#my @getallvms = $ssh->capture('vim-cmd vmsvc/getallvms');
my @getallvms = $ssh->capture(<<EOEcho);
echo -e 'JUNK\n416 TEST Box åäö!"'\\'\\''*# [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ _''_+Iw.vmx slesGuest vmx-04'
EOEcho
shift @getallvms;

for (@getallvms) {
    $_ = decode "utf8", $_, Encode::FB_CROAK;

    if (/^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix) {
        my $id   = $+{id};
        my $name = $+{name};
        print encode("utf8", $id), "\n",
              encode("utf8", $name), "\n",
              "\n";
    }
    else {
        print "no match\n";
    }
}
Output:
416
TEST Box åäö!"''*#
If you know the string you work on is UTF-8 and Net::OpenSSH doesn't (and hence doesn't mark it as such), you can convert it to an internal representation Perl can work on with one of:
use Encode;
$in_place = decode_utf8( $in_place );   # decode_utf8 returns the decoded string; it does not modify its argument
$decoded = decode_utf8( $raw );
So you have to make sure that Perl understands those names as UTF-8 encoded strings; so far I don't think it does. A comprehensive overview of UTF-8 in Perl is worth reading.
You can test your strings unicodeness with Encode::is_utf8 and decode them with Encode::decode('UTF-8', $your_string).
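As a small sketch of that check-then-decode step (the byte string below is an invented stand-in for what arrives over SSH, and is_utf8 only reports Perl's internal flag, not whether the bytes are valid UTF-8):
use strict;
use warnings;
use Encode qw(decode is_utf8);

my $raw  = "TEST Box \xC3\xA5\xC3\xA4\xC3\xB6";            # UTF-8 octets, e.g. straight off the wire
my $name = is_utf8($raw) ? $raw : decode('UTF-8', $raw);   # decode only if not already character data

binmode STDOUT, ':encoding(UTF-8)';
print "$name\n";                                           # TEST Box åäö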
UTF-8 is still pretty messy in Perl, IMHO. You have to be pretty patient with it.
To print UTF-8 strings out in a pretty way, you should use something like this in your script:
BEGIN {
    binmode(STDOUT, ':encoding(UTF-8)');
    binmode(STDERR, ':encoding(UTF-8)'); # Error messages
}
If you get Perl to understand your UTF-8 names, you can regex them properly too.
Recent Net::OpenSSH releases have native support for charset encoding/decoding in capture methods:
my @getallvms = $ssh->capture({stream_encoding => 'utf8'},
                              'vim-cmd vmsvc/getallvms');

How can I get the file extensions from relative links in HTML text using Perl?

For example, scanning the contents of an HTML page with a Perl regular expression, I want to match all file extensions but not TLDs in domain names. To do this I am making the assumption that all file extensions must be within double quotes.
I came up with the following, and it is working, however, I am failing to figure out a way to exclude the TLDs in the domains. This will return "com", "net", etc.
m/"[^<>]+\.([0-9A-Za-z]*)"/g
Is it possible to negate the match if there is more than one period between the quotes that are separated by text? (ie: match foo.bar.com but not ./ or ../)
Edit: I am using $1 to return the value within the parentheses.
#!/usr/bin/perl
use strict; use warnings;
use File::Basename;
use HTML::TokeParser::Simple;
use URI;
my $parser = HTML::TokeParser::Simple->new( \*DATA );
while ( my $tag = $parser->get_tag('a') ) {
    my $uri = URI->new( $tag->get_attr('href') );
    my $ext = ( fileparse $uri->path, qr/\.\w+\z/ )[2];
    print "$ext\n";
}
__DATA__
<p>link link on example.com
</p>
First of all, extract the names with an HTML parser of your choice. You should then have something like an array containing the names, as if produced like this:
my @names = ("http://foo.bar.net/quux",
             "boink.bak",
             "mms://three.two.one",
             "hello.jpeg");
The only way to distinguish domain names from file extensions seems to be that in "file names", there is at least one more slash between the :// part and the extension. Also, a file extension can only be the last thing in the string.
So, your regular expression would be something like this (untested):
^(?:(?:\w+://)?(?:\w+\.)+\w+/)?.*\.(\w+)$
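A quick way to exercise that (still untested) pattern against the list above; running it suggests the mms:// entry still slips through as "one", so treat the regex as a starting point rather than a finished solution:
use strict;
use warnings;

my @names = ("http://foo.bar.net/quux",
             "boink.bak",
             "mms://three.two.one",
             "hello.jpeg");

for my $name (@names) {
    if ( $name =~ m{^(?:(?:\w+://)?(?:\w+\.)+\w+/)?.*\.(\w+)$} ) {
        print "$name -> $1\n";
    }
    else {
        print "$name -> no extension found\n";
    }
}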
#!/usr/bin/perl -w
use strict;
while (<>) {
    if (m/(?<=(?:ref=|src=|rel=))"([^<>"]+?\.([0-9A-Za-z]+?))"/g) {
        if ($1 !~ /:\/\//) {
            print $2 . "\n";
        }
    }
}
Used positive lookbehind to get only the stuff between double quotes behind one of the 'link' attributes (src=, rel=, href=).
Fixed to look at "://" for recognizing URLs, and to allow files with absolute paths.
@Structure: There is no proper way to protect against someone leaving off the protocol part, as it would just turn into a legitimate pathname: http://www.noo.com/afile.cfg -> www.noo.com/afile.cfg. You would need to wget (or something) all of the links to make sure they are actually there. And that's an entirely different question...
Yes, I know I should use a proper parser, but am just not feeling like it right now :P