Perl regular expression pattern matching URLs

I have some old Perl code that I need to improve by debugging it on an Apache server, but it contains regular expressions that I can't fully figure out, as I am new to Perl. Could someone please explain what the following code does?
my $target = " ";
$target = $1 if( $url =~ m|^$shorturl(\/.*)$|);
Here,
url is http://127.0.0.1/test.pl/content/dist/hale_bopp_2.mpg
shorturl is http://127.0.0.1/test.pl

It extracts the "path info" component of the URL: the extra path segments that follow the path to the script.
http://127.0.0.1/test.pl/content/dist/hale_bopp_2.mpg
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(It should really be $target = uri_unescape($1), using URI::Escape, to handle escaped characters.)

From the language perspective, it matches $url against the regexp enclosed in m| |; if the match succeeds, it puts the first capture (the part of the regex in parentheses) into $target.
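For comparison, here is a rough Python sketch of the same extraction. The function name extract_path_info is made up for this example. It also uses re.escape so that the dots in shorturl are matched literally, which the original Perl pattern does not do (in Perl, \Q$shorturl\E would have the same effect).

```python
import re

def extract_path_info(url, shorturl):
    """Return the part of url that follows shorturl (starting with '/'), or None."""
    # extract_path_info is a hypothetical helper, not part of the original script.
    # re.escape treats shorturl as literal text rather than as a pattern.
    m = re.match(re.escape(shorturl) + r'(/.*)$', url)
    return m.group(1) if m else None

print(extract_path_info(
    "http://127.0.0.1/test.pl/content/dist/hale_bopp_2.mpg",
    "http://127.0.0.1/test.pl",
))
```

When the URL has nothing after the script path, the function returns None, which corresponds to $target keeping its initial value in the Perl code.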

Related

Regex code for finding a URL containing a domain and adding a parameter

Help needed with a WordPress search and replace.
I have a bunch of links to amazon.com/productname/dp/productid or similar.
I want a regex that finds any URL containing amazon.com and appends "?tag=tagname" to the end of the URL.
Thanks.
This should do the trick:
Search: (<a[^<>]*href=["']amazon.com.*?)(?=["'])
Replace: $0?tag=tagname
It basically looks for anything that starts with <a, then skips over any other attributes until it reaches an href whose value begins with amazon.com. Finally it appends ?tag=tagname before the closing " or '.
Assuming you're using PHP:
$re = '/(<a[^<>]*href=["\']amazon.com.*?)(?=["\'])/';
$str = '<a data-foo="href=\'amazon.com/\'" href=\'amazon.com/productname/dp/productid\' data-foo2=\'bar\'/>';
$subst = '$0?tag=tagname';
$result = preg_replace($re, $subst, $str);
echo "The result of the substitution is ".$result;
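It does not stop at the data-foo="href='amazon.com/'" attribute because [^<>]* is greedy: it first consumes the whole tag and then backtracks, so it settles on the last href='amazon.com in the tag, which is the real href. So it works as required.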
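The same search and replace can be sketched in Python, in case it helps to test the pattern outside WordPress. This is only an illustration: add_tag and LINK_RE are names invented here, and the dot in amazon.com is escaped, which is a small tweak to the pattern above.

```python
import re

# Same pattern as above, except the dot in "amazon.com" is escaped (an added tweak).
LINK_RE = re.compile(r'''(<a[^<>]*href=["']amazon\.com.*?)(?=["'])''')

def add_tag(html_text, tag="tagname"):
    """Append ?tag=<tag> to amazon.com hrefs found inside <a ...> tags."""
    # m.group(0) is the whole match, the equivalent of $0 in the PHP version.
    return LINK_RE.sub(lambda m: m.group(0) + "?tag=" + tag, html_text)

sample = """<a data-foo="href='amazon.com/'" href='amazon.com/productname/dp/productid' data-foo2='bar'/>"""
print(add_tag(sample))
```

Like the original pattern, this only matches hrefs that start directly with amazon.com, and it does not check whether the URL already has a query string.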

Regex That Pulls Certain Bits From a String

So I am trying to work with regular expressions. The string I have is
Successfully created package 'C:\Users\mhopper\Documents\CreateNugetPackage\AjaxControlToolkit.3.5.50401.nupkg'
I am trying to make a regular expression that pulls "Successfully" and "C:\Users\mhopper\Documents\CreateNugetPackage\AjaxControlToolkit.3.5.50401.nupkg"
I haven't used regular expressions much and what I'm doing isn't working. What I have so far is
'.*(Successfully\.*\C\D+).*', '$1'
Regex:
^(\S*)\s.*'(.*)'$
#1 match is the status
#2 match is the path
https://regex101.com/r/eC7vX1/1
Powershell:
$line = "Successfully created package 'C:\Users\mhopper\Documents\CreateNugetPackage\AjaxControlToolkit.3.5.50401.nupkg'"
$values = $line -split "^(\S*)\s.*'(.*)'$"
$status = $values[1]
$path = $values[2]
("status:{0}`npath:{1}" -f $status,$path)
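To see what the two groups capture without PowerShell, here is a small Python sketch using the same regex. The names LINE_RE and parse_result are assumptions made for the example.

```python
import re

# Group 1: first word (the status). Group 2: text between the single quotes (the path).
LINE_RE = re.compile(r"^(\S*)\s.*'(.*)'$")

def parse_result(line):
    """Return (status, path) for a matching line, or None if the line does not match."""
    m = LINE_RE.match(line)
    return m.groups() if m else None

line = r"Successfully created package 'C:\Users\mhopper\Documents\CreateNugetPackage\AjaxControlToolkit.3.5.50401.nupkg'"
status, path = parse_result(line)
print("status:", status)
print("path:", path)
```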
You need to be a little more specific about why you need the regex.
Just getting those two values from the string doesn't really need a regex.
$Status,$PackagePath = ($String.Trim().Split(' ',4))[0,3]

Perl Regular expression to replace the last matching string

I have a string as below:
$str = "/dir1/dir2/dir3/file.txt"
I want to remove the /file.txt from this string,
so that $str becomes:
$str = "/dir1/dir2/dir3"
I am using the following regex, but it removes everything.
$str =~ s/\/.*\.txt//;
How can I make the regex look for the last '/' instead of the first?
What is the correct regular expression for this?
Please note that file.txt is not fixed name. It can be anything like file1.txt, file2.txt, etc.
If you want to get the path from that string, you can use File::Basename. It is a core module since Perl version 5.
perl -MFile::Basename -le '$str = "/dir2/dir3/file.txt"; print dirname($str);'
In script form:
use strict;
use warnings; # always use these
use File::Basename;
my $str = "/dir1/dir2/dir3/file.txt";
print dirname($str);
Your regex does not work because it is not anchored and .* is greedy, so it matches as much as it can, starting from the first slash it encounters. A working regex would look something like this:
$str =~ s#/[^/]*?\.txt$##;
Note the use of the non-greedy quantifier *?, which matches the smallest possible string. Also note that I use a different delimiter for the substitution to avoid "leaning toothpick syndrome", e.g. s/\/\/\///.
Very simple regex : s/\/[^\/]*$//
In this regex
m/(.*)\/[^\/]*$/
the first submatch is the path you are looking for.
EDIT:
If you are looking for a substitution, user1215106's solution is the way to go:
s/\/[^\/]*$//
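The difference between the greedy pattern from the question and the anchored one can be checked quickly. The sketch below uses Python rather than Perl, purely as an illustration; posixpath.dirname plays the role that File::Basename's dirname plays above.

```python
import posixpath
import re

path = "/dir1/dir2/dir3/file.txt"

# Unanchored and greedy: the match starts at the first slash and runs to ".txt".
greedy = re.sub(r'/.*\.txt', '', path)

# Anchored at the end; [^/]* cannot cross a slash, so only the last segment goes.
trimmed = re.sub(r'/[^/]*$', '', path)

# No regex at all: ask a path library for the parent directory.
parent = posixpath.dirname(path)

print(repr(greedy))
print(trimmed)
print(parent)
```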

How do I get the last directory from a URL path using a Zeus rewrite rule?

I need a regular expression that will return the last directory in a path.
e.g., from www.domain.com/shop/widgets/, return "widgets".
I have an expression that almost works.
[^/].*/([^/]+)/?$
It will return "widgets" from www.domain.com/shop/widgets/ but not from www.domain.com/widgets/
I also need to ignore any URLs that include a filename. So that www.domain.com/shop/widgets/blue_widget.html will not match.
This must be done using regular expressions as it is for the Zeus server request rewrite module.
/^www\.example\.com\/([^\/]+\/)*([^\/]+)\/$/
What does this do?
^www\.example\.com\/ matches the literal text of the domain. Adjust this as required.
([^\/]+\/)* matches any number of directories, each of which consists of non-slash characters followed by a slash.
([^\/]+) matches a string of non-slashes (the last directory, captured as $2).
\/$ matches a slash at the end of the input, thus eliminating files (since only directories end in a slash).
Implemented in Perl:
[ghoti@pc ~] cat perltest
#!/usr/local/bin/perl
@test = (
    'www.example.com/path/to/file.html',
    'www.example.com/match/',
    'www.example.com/pages/match/',
    'www.example.com/pages/widgets/thingy/',
    'www.example.com/foo/bar/baz/',
);
foreach (@test) {
    # Use $2 only when the match succeeded; otherwise it may hold a stale value.
    my $dir = m/^www\.example\.com\/([^\/]+\/)*([^\/]+)\/$/i ? $2 : '';
    printf(">> %-50s\t%s\n", $_, $dir);
}
[ghoti@pc ~] ./perltest
>> www.example.com/path/to/file.html
>> www.example.com/match/ match
>> www.example.com/pages/match/ match
>> www.example.com/pages/widgets/thingy/ thingy
>> www.example.com/foo/bar/baz/ baz
[ghoti@pc ~]
This should generally work:
/([^/.]+)/$
It matches a set of non-slash, non-period characters after the second-to-last slash in a string that must end in a slash.
The "folder name" will be in the first capture group.
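A quick way to try this pattern against the URLs from the question is the Python sketch below. It is only a test harness, not a Zeus rule, and last_dir is a name invented for the example.

```python
import re

# The pattern from the answer above: a final path segment with no dots,
# followed by the trailing slash that ends the string.
LAST_DIR_RE = re.compile(r'/([^/.]+)/$')

def last_dir(url):
    """Return the last directory name, or None if the URL does not end in one."""
    m = LAST_DIR_RE.search(url)
    return m.group(1) if m else None

for url in (
    'www.domain.com/shop/widgets/',
    'www.domain.com/widgets/',
    'www.domain.com/shop/widgets/blue_widget.html',
):
    print(url, '->', last_dir(url))
```

One consequence of excluding periods is that a directory name containing a dot, such as v1.2/, is not matched either.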
#!/usr/bin/perl
use strict;
use warnings;
$_ = 'www.domain.com/shop/widgets/';
print "$1\n" if (/\/([^\/]+)\/$/);
$_ = 'www.domain.com/shop/widgets/blue_widget.html';
print "$1\n" if (/\/([^\/]+)\/$/);
You don't want a Perl regular expression. You want a regular expression that Zeus will understand. Although they might call that PCRE, not even PCRE handles all Perl regular expressions.
Most of the answers here are wrong because they aren't thinking about the different sorts of URLs that you can get as input.
Get just the path portion of the URL
Match against the path portion to find what you need
Distinguish between paths that end in a filename and those that don't
There are some examples that you can use as a start. I don't use Zeus and don't want to, so the next part is up to you:
Zeus Rewrite Rules
Mod Rewrite rule to Zeus Server rule (Codeigniter)
http://www.names.co.uk/support/support_centre_home/528-zeus_rewrite_rules_user_guide.html
http://drupal.org/node/46508
I've read that you can pass the request to a Perl program through Perl Extensions for ZWS, but I'd be surprised if you needed to do that. If you have to resort to that, I'd use the URI module to parse the URI and extract the path. Once you have that, split up the path into its components:
use URI;
my $uri = URI->new( ... ); # I don't know how Zeus passes data
my $path = $uri->path;
# undef to handle the leading /
my( undef, @parts ) = split m{/}, $path;
Once you are this far, you have to decide how you want to recognize something as a directory. If you're mapping directly onto a filesystem structure, that is just a matter of popping elements off @parts until you find the directories, then counting back the number you want to skip.
However, I cringe at doing that, no matter what I put in the Perl program. I'd try really hard to get it done just in the Zeus rules first. Show us what you have so far.
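The three steps above (get the path, split it, tell directories from filenames) can be sketched outside Zeus as follows. This is a Python illustration under stated assumptions: last_directory is a hypothetical helper, the input may lack a scheme (as in the question's examples), and a path counts as a directory only when it ends in a slash, which is a simplification rather than a filesystem check.

```python
from urllib.parse import urlsplit

def last_directory(url):
    """Return the last directory in the URL path, or None for file-like paths."""
    # urlsplit only recognises the host when a scheme or a leading "//" is
    # present, so add "//" to scheme-less input such as www.domain.com/shop/.
    if '://' not in url and not url.startswith('//'):
        url = '//' + url
    path = urlsplit(url).path
    # Assumption: only paths that end in "/" name a directory.
    if not path.endswith('/'):
        return None
    parts = [p for p in path.split('/') if p]
    return parts[-1] if parts else None

print(last_directory('www.domain.com/shop/widgets/'))
print(last_directory('www.domain.com/shop/widgets/blue_widget.html'))
```

Parsing the URL first means query strings and fragments never reach the path logic, which a single regex over the whole URL tends to get wrong.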

How can I extract URLs from plain text with Perl?

I need a Perl regex to parse plain text input and convert all links to valid HTML HREF links. I've tried 10 different versions I found on the web but none of them seem to work correctly. I also tested other solutions posted on StackOverflow, none of which seem to work. The correct solution should be able to find any URL in the plain text input and convert it to:
<a href="$1">$1</a>
Some cases other regular expressions I tried didn't handle correctly include:
URLs at the end of a line which are followed by returns
URLs that included question marks
URLs that start with 'https'
I'm hoping that another Perl guy out there will already have a regular expression they are using for this that they can share. Thanks in advance for your help!
You want URI::Find. Once you extract the links, you should be able to handle the rest of the problem just fine.
This is answered in perlfaq9's answer to "How do I extract URLs?", by the way. There is a lot of good stuff in the perlfaqs. :)
Besides URI::Find, also check out the big regular expression database Regexp::Common; its Regexp::Common::URI module gives you something as easy as:
my ($uri) = $str =~ /$RE{URI}{-keep}/;
If you want different pieces (hostname, query parameters etc) in that uri, see the doc of Regexp::Common::URI::http for what's captured in the $RE{URI} regular expression.
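As a rough illustration of what "different pieces" means, here is a Python sketch that splits one of the sample URLs with the standard library instead of Regexp::Common captures. It is a swapped-in technique, not the Regexp::Common API, and url_pieces is a name invented for the example.

```python
from urllib.parse import parse_qs, urlsplit

def url_pieces(url):
    """Split a URL into scheme, host, path and parsed query parameters."""
    parts = urlsplit(url)
    return {
        'scheme': parts.scheme,
        'host': parts.hostname,
        'path': parts.path,
        # parse_qs maps each parameter name to a list of its values.
        'query': parse_qs(parts.query),
    }

print(url_pieces('http://example.org/?test=one&another=2'))
```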
When I tried URI::Find::Schemeless with the following text:
Here is a URL and one bare URL with
https: https://www.example.com and another with a query
http://example.org/?test=one&another=2 and another with parentheses
http://example.org/(9.3)
Another one that appears in quotation marks "http://www.example.net/s=1;q=5"
etc. A link to an ftp site: ftp://user@example.org/test/me
How about one without a protocol www.example.com?
it messed up http://example.org/(9.3). So, I came up with the following with the help of Regexp::Common:
#!/usr/bin/perl
use strict; use warnings;
use CGI 'escapeHTML';
use Regexp::Common qw/URI/;
use URI::Find::Schemeless;
my $heuristic = URI::Find::Schemeless->schemeless_uri_re;
my $pattern = qr{
$RE{URI}{HTTP}{-scheme=>'https?'} |
$RE{URI}{FTP} |
$heuristic
}x;
local $/ = '';
while ( my $par = <DATA> ) {
    chomp $par;
    $par =~ s/</&lt;/g;   # escape "<" before adding our own markup
    $par =~ s/( $pattern ) / linkify($1) /gex;
    print "<p>$par</p>\n";
}
sub linkify {
    my ($str) = @_;
    # Add a scheme to bare URLs such as www.example.com.
    $str = "http://$str" unless $str =~ /^[fh]t(?:p|tp)/;
    $str = escapeHTML($str);
    sprintf q|<a href="%s">%s</a>|, ($str) x 2;
}
This worked for the input shown. Of course, life is never that easy as you can see by trying (http://example.org/(9.3)).
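For readers who want to experiment with the idea without installing the Perl modules, below is a deliberately simple Python sketch of the same linkify step. It is not a port of the code above: URL_RE is a naive hand-written pattern (an assumption for this example, much cruder than Regexp::Common or URI::Find), it only handles URLs with an explicit http, https or ftp scheme, and it does nothing about trailing punctuation.

```python
import html
import re

# Naive pattern: a scheme followed by anything up to whitespace, <, > or ".
URL_RE = re.compile(r'\b(?:https?|ftp)://[^\s<>"]+')

def linkify(text):
    """Escape text for HTML and wrap each URL found in an <a href> element."""
    out = []
    pos = 0
    for m in URL_RE.finditer(text):
        # Escape the plain text between the previous URL and this one.
        out.append(html.escape(text[pos:m.start()]))
        url = html.escape(m.group(0))
        out.append('<a href="{0}">{0}</a>'.format(url))
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return ''.join(out)

print(linkify('Here is a URL: https://www.example.com and text after it'))
```

Escaping the non-URL text and the URL separately avoids double-escaping, and it means a URL with a query string still ends up with & written as &amp; inside the href. Wrapping a URL in parentheses, as in the (http://example.org/(9.3)) case above, still trips this version up, because the closing parenthesis is taken as part of the URL.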
Here is sample code showing how to extract URLs.
It reads lines from stdin,
checks whether each input line contains a valid HTTP URL,
and prints the URL if it finds one.
use strict;
use warnings;
use Regexp::Common qw /URI/;
while (1)
{
    # Get the input from stdin.
    print "Enter the line: \n";
    my $line = <>;
    chomp ($line); # remove the trailing newline character
    my ($uri)= $line =~ /$RE{URI}{HTTP}{-keep}/ and print "Contains an HTTP URI.\n";
    print "URL : $uri\n" if ($uri);
}
The sample output I get is as follows:
Enter the line:
http://stackoverflow.com/posts/2565350/
Contains an HTTP URI.
URL : http://stackoverflow.com/posts/2565350/
Enter the line:
this is not valid url line
Enter the line:
www.google.com
Enter the line:
http://
Enter the line:
http://www.google.com
Contains an HTTP URI.
URL : http://www.google.com