Perl: Extract domain name - regex

Extract the domain name from a URL
Yet another request to parse a URL, but I have found many incomplete or theoretical examples. I would like something that works in Perl for sure.
I have the following URLs:
https://vimdoc.sourceforge.net/htmldoc/pattern.html
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html
http://www.catonmat.net/download/perl1line.txt
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM
http://www.gnu.org/software/coreutils/manual/coreutils.html
http://www.catonmat.net/download/perl1line.txt
https://feedly.com/i/my
http://vimhelp.appspot.com/
https://git-scm.com/doc
https://read.amazon.com/
https://github.com/netsamir/following
https://scotch.io/
https://servicios.dgi.gub.uy/
https://sourcemaking.com/
https://stackedit.io/editor
https://stripe.com/be
https://toolbelt.heroku.com/
https://training.github.com/
https://vimeo.com/54505525
https://vimeo.com/tag:drew+neil
https://web.whatsapp.com/
https://www.ctan.org/
https://www.eff.org/
https://www.mybeluga.com/
https://www.solveforx.com/
https://www.symynd.com/
https://www.symynd.com/#
https://www.tizen.org/
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS
I want to extract only the domain name. For instance:
linksyssmartwifi.com
amazon.com
github.com
I have tried with Perl and Vim but could not accomplish the task. My best approximation is the following:
perl -pe 's!(^https?\://.*[\.](.+\..+?)/.*$)!$1 -- [$2] !g' all_urls_sorted.txt
Some of them are parsed correctly (see the result in []), others not:
https://sites.google.com/site/steveyegge2/singleton-considered-stupid -- [google.com]
https://sourcemaking.com/
https://stackedit.io/editor
https://stripe.com/be
https://toolbelt.heroku.com/ -- [heroku.com]
https://training.github.com/ -- [github.com]
https://vimeo.com/54505525
https://vimeo.com/tag:drew+neil
https://web.whatsapp.com/ -- [whatsapp.com]
https://wiki.haskell.org/GHC -- [haskell.org]
As my tests show, URLs whose domain starts right after the // (in https?://), with no subdomain in front of it, are being excluded.
If you know how to solve this problem I would be very happy.
Thanks!

Use the URI module:
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
use URI;
while (<DATA>) {
    chomp;
    my $uri  = URI->new($_);
    my $host = $uri->host;
    my ($domain) = $host =~ m/([^.]+\.[^.]+$)/;
    say $domain;
}
__DATA__
https://vimdoc.sourceforge.net/htmldoc/pattern.html
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html
http://www.catonmat.net/download/perl1line.txt
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM
http://www.gnu.org/software/coreutils/manual/coreutils.html
http://www.catonmat.net/download/perl1line.txt
https://feedly.com/i/my
http://vimhelp.appspot.com/
https://git-scm.com/doc
https://read.amazon.com/
https://github.com/netsamir/following
https://scotch.io/
https://servicios.dgi.gub.uy/
https://sourcemaking.com/
https://stackedit.io/editor
https://stripe.com/be
https://toolbelt.heroku.com/
https://training.github.com/
https://vimeo.com/54505525
https://vimeo.com/tag:drew+neil
https://web.whatsapp.com/
https://www.ctan.org/
https://www.eff.org/
https://www.mybeluga.com/
https://www.solveforx.com/
https://www.symynd.com/
https://www.symynd.com/#
https://www.tizen.org/
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS
Outputs:
sourceforge.net
linksyssmartwifi.com
catonmat.net
github.com
google.com
gnu.org
catonmat.net
feedly.com
appspot.com
git-scm.com
amazon.com
github.com
scotch.io
gub.uy
sourcemaking.com
stackedit.io
stripe.com
heroku.com
github.com
vimeo.com
vimeo.com
whatsapp.com
ctan.org
eff.org
mybeluga.com
solveforx.com
symynd.com
symynd.com
tizen.org
workforall.net

My best approximation is URI::URL:
foreach my $uri (@filecontents) {
    my $uriobj = URI::URL->new($uri);
    my $host   = $uriobj->host;
    my @parts  = split /\./, $host;
    print "$uri -- $parts[-2].$parts[-1]\n";
}
Hope that helps.

A regex solution is:
//(?:[^./]+[.])*([^/.]+[.][^/.]+)/
If the trailing slash is optional, just add a ?:
//(?:[^./]+[.])*([^/.]+[.][^/.]+)/?
This should be used with the global modifier and a delimiter other than /.
Essentially, it's looking between the // and the next /.
If there are any extra sub-domains, they will be caught by the (?:[^./]+[.])*. The main domain will fall into the capture group ([^/.]+[.][^/.]+).
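For example, here is a minimal sketch (mine, not part of the answer above) that applies the suggested pattern with m{...} braces as the delimiter, run against a few of the URLs from the question; add the /g modifier if you are scanning text that contains several URLs per line.
#!/usr/bin/env perl
use strict;
use warnings;

while (my $url = <DATA>) {
    chomp $url;
    # The pattern from the answer, with braces as the delimiter so the
    # slashes need no escaping, and the trailing slash made optional.
    if ($url =~ m{//(?:[^./]+[.])*([^/.]+[.][^/.]+)/?}) {
        print "$url -- [$1]\n";
    }
}

__DATA__
https://toolbelt.heroku.com/
https://vimeo.com/54505525
http://www.catonmat.net/download/perl1line.txt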

Related

How can I get the host name from a URL in Perl?

I have a URL like "www.google.com/aabc/xyz". How can I get the host name from this? I used this code:
my $referer = URI->new('www.google.com/aabc/xyz');
my $host = $referer->host; # compiler error
I'm getting error at the second line.
use URI;
use URI::Heuristic qw(uf_uristr);
my $referrer = URI->new( uf_uristr('www.google.com/aabc/xyz') );
print $referrer->host;
The question changed significantly since my first answer, which I've deleted. With high enough rep you can see it.
You have in the code (it's better to post complete programs):
my $referer = URI->new('www.google.com/aabc/xyz');
my $host = $referer->host; # compiler error
You say that you're getting a compiler error, but it's really a runtime error:
Can't locate object method "host" via package "URI::_generic"
When you made the new object, you gave URI a string. From that, it's going to guess what sort of URI it is. Since there's no scheme, such as http://, in front of it, it doesn't guess that it's that sort of URI. Instead, it falls back to a "generic" class URI::_generic. By the underscore in its name and the fact there's no documentation for it, you may surmise it's not meant for you to know about.
But, here it is complaining. It thinks the URI is a path (and some other things). The part you recognize as the host it parses as a path:
use v5.10;
use URI;
my $referer = URI->new('www.google.com/aabc/xyz');
my $path = $referer->path;
say "path is $path";
Now you see what it did:
path is www.google.com/aabc/xyz
The generic URI doesn't know anything about a host, so when you call host on its object, it blows up. It would be nicer for it to return undef, perhaps, but that's not what it does.
oanders already has an interesting answer that guesses for you to fill in schemes when it thinks they might be missing, but there's another thing you can do. Before you call host, check that the object can respond to it:
use v5.10;
use URI;
my $url = 'www.google.com/aabc/xyz';
my $referer = URI->new( $url );
if( $referer->can( 'host' ) ) {
    say "Host is " . $referer->host;
}
else {
    say "Weird hostless URL: $referer";
}
Now your program shouldn't blow up for the same reason and you can look at the output to discover strings that you couldn't process.
$ echo -e "http://www.google.www.com/abc/xyz\nhttps://google.com\nwww.google.www.com"
http://www.google.www.com/abc/xyz
https://google.com
www.google.www.com
$ echo -e "http://www.google.www.com/abc/xyz\nhttps://google.com\nwww.google.www.com" | perl -pe "s/^(http(s)?:\/\/)?(www\.)?//"
google.www.com/abc/xyz
google.com
google.www.com
You can do it much more simply than the above.
CODE
use strict;
use warnings;
while (<DATA>) {
    $_ =~ s/^(https?:\/\/)?(www\.)?\b//;
    print $_;
}
__DATA__
http://www.google.com/abc/xyz
https://google.com
www.google.com
Results
google.com/abc/xyz
google.com
google.com

Perl taint mode with domain name input for CGI resulting in “Insecure dependency in eval”

With the following CGI script running under Perl's taint mode, I have not been able to get past this error:
tail /etc/httpd/logs/error_log
/usr/local/share/perl5/Net/DNS/Dig.pm line 906 (#1)
(F) You tried to do something that the tainting mechanism didn't like.
The tainting mechanism is turned on when you're running setuid or
setgid, or when you specify -T to turn it on explicitly. The
tainting mechanism labels all data that's derived directly or indirectly
from the user, who is considered to be unworthy of your trust. If any
such data is used in a "dangerous" operation, you get this error. See
perlsec for more information.
[Mon Jan 6 16:24:21 2014] dig.cgi: Insecure dependency in eval while running with -T switch at /usr/local/share/perl5/Net/DNS/Dig.pm line 906.
Code:
#!/usr/bin/perl -wT
use warnings;
use strict;
use IO::Socket::INET;
use Net::DNS::Dig;
use CGI;
$ENV{"PATH"} = ""; # Latest attempted fix
my $q = CGI->new;
my $domain = $q->param('domain');
if ( $domain =~ /(^\w+)\.(\w+\.?\w+\.?\w+)$/ ) {
    $domain = "$1\.$2";
}
else {
    warn("TAINTED DATA SENT BY $ENV{'REMOTE_ADDR'}: $domain: $!");
    $domain = "";    # successful match did not occur
}
my $dig = new Net::DNS::Dig(
    Timeout   => 15,      # default
    Class     => 'IN',    # default
    PeerAddr  => $domain,
    PeerPort  => 53,      # default
    Proto     => 'UDP',   # default
    Recursion => 1,       # default
);
my @result = $dig->for( $domain, 'NS' )->to_text->rdata();
@result = sort @result;
print @result;
I normally use Data::Validate::Domain to check for a “valid” domain name, but could not deploy it in a way that avoided the tainted-variable error.
I read that in order to untaint a variable you have to pass it through a regex with capture groups and then join the capture groups to sanitize it, so I deployed $domain =~ /(^\w+)\.(\w+\.?\w+\.?\w+)$/. As shown here it is not the best regex for untainting a domain name and covering all possible domains, but it meets my needs. Unfortunately my script is still producing taint failures and I cannot figure out why.
Regexp::Common does not provide a domain regex, and modules don’t seem to help with untainting the variable, so I am at a loss.
How do I get this to pass taint checking?
$domain is not tainted
I verified that your $domain is not tainted. This is the only variable you use that could be tainted, in my opinion.
perl -T <(cat <<'EOF'
use Scalar::Util qw(tainted);
sub p_t($) {
    if (tainted $_[0]) {
        print "Tainted\n";
    } else {
        print "Not tainted\n";
    }
}
my $domain = shift;
p_t($domain);
if ($domain =~ /(^\w+)\.(\w+\.?\w+\.?\w+)$/) {
    $domain = "$1\.$2";
} else {
    warn("$domain\n");
    $domain = "";
}
p_t($domain);
EOF
) abc.def
It prints
Tainted
Not tainted
What Net::DNS::Dig does
See Net::DNS::Dig line 906. It is the beginning of the to_text method.
sub to_text {
    my $self = shift;
    my $d = Data::Dumper->new([$self],['tobj']);
    $d->Purity(1)->Deepcopy(1)->Indent(1);
    my $tobj;
    eval $d->Dump; # line 906
    …
From the definition of new I know that $self is just a hashref containing the values from the new parameters plus several others filled in by the constructor. The eval'ed code produced by $d->Dump sets $tobj to a deep copy of $self (Deepcopy(1)), with correctly set self-references (Purity(1)) and basic pretty-printing (Indent(1)).
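The reason line 906 blows up under -T is that taint propagates from the tainted value (here PeerAddr, which came from CGI input) into the string Dump returns, and eval of a tainted string is one of the operations taint mode forbids. Here is a minimal sketch of my own, not taken from the module, that reproduces the failure under the assumption (which the error log supports) that the dumped string ends up tainted; run it as perl -T repro.pl example.com.
#!/usr/bin/perl -T
use strict;
use warnings;
use Data::Dumper;

# Anything read from @ARGV is tainted under -T.
my $peer = shift // die "usage: perl -T repro.pl hostname\n";

my $d = Data::Dumper->new( [ { PeerAddr => $peer } ], ['tobj'] );
$d->Purity(1)->Deepcopy(1)->Indent(1);

my $tobj;
eval $d->Dump;    # "Insecure dependency in eval while running with -T switch"
die $@ if $@;     # surface the taint error either way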
Where is the problem, how to debug
From what I found out about &Net::DNS::Dig::to_text, it is clear that the problem is at least one tainted item inside $self. So you have a straightforward way to debug your problem further: after constructing the $dig object in your script, check which of its items is tainted. You can dump the whole structure to stdout using print Data::Dumper::Dumper($dig);, which gives roughly the same output as the eval'ed code, and check suspicious items with &Scalar::Util::tainted.
I have no idea how far this is from making Net::DNS::Dig work in taint mode. I do not use it, I was just curious and wanted to find out, where the problem is. As you managed to solve your problem otherwise, I leave it at this stage, allowing others to continue debugging the issue.
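As a concrete starting point for that debugging step, here is a rough sketch of my own (untested against Net::DNS::Dig) which assumes $dig has been built exactly as in the question and reports which of its top-level scalar entries are tainted:
use Scalar::Util qw(tainted);

# $dig is the Net::DNS::Dig object from the question; its top-level entries
# are the constructor parameters plus whatever new() filled in.
for my $key ( sort keys %$dig ) {
    my $value = $dig->{$key};
    next if ref $value;    # this sketch only checks plain scalar entries
    printf "%-12s %s\n", $key, tainted($value) ? 'TAINTED' : 'clean';
}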
As a resolution to this question, in case anyone comes across it in the future: it was indeed the module I was using that caused the taint checks to fail, teaching me an important lesson about trusting modules in a CGI environment. I switched to Net::DNS, figuring it would not hit this issue, and sure enough it does not. My code is provided below for reference, in case anyone wants to accomplish the same thing I set out to do: locate the nameservers defined for a domain within its own zone file.
#!/usr/bin/perl -wT
use warnings;
use strict;
use IO::Socket::INET;
use Net::DNS;
use CGI;
$ENV{"PATH"} = ""; // Latest attempted fix
my $q = CGI->new;
my $domain = $q->param('domain');
my #result;
if ( $domain =~ /(^\w+)\.(\w+\.?\w+\.?\w+)$/ ) {
$domain = "$1\.$2";
}
else {
warn("TAINTED DATA SENT BY $ENV{'REMOTE_ADDR'}: $domain: $!");
$domain = ""; # successful match did not occur
}
my $ip = inet_ntoa(inet_aton($domain));
my $res = Net::DNS::Resolver->new(
nameservers => [($ip)],
);
my $query = $res->query($domain, "NS");
if ($query) {
foreach my $rr (grep { $_->type eq 'NS' } $query->answer) {
push(#result, $rr->nsdname);
}
}
else {
warn "query failed: ", $res->errorstring, "\n";
}
#result = sort #result;
print #result;
Thanks for the comments assisting me in this matter, and to SO for teaching me more than any other resource I have come across.

Perl: match a substring in a string, ignoring whitespace

I have a string
$str = "xxxxxx Code File(s) Name:Some_thing.c CodeFile(s) Version:27 Design Document:some_other_design.doc Module Design Document Version:43 somexxxxxxxxxx Compiler Version:9.5 Doc Type:Word xxxxxx";
where xxxxx represents any characters, but I am only interested in extracting the values of each attribute.
That is, I want to save
$fileName = Some_thing.c;
$fileVersion = 27;
$designDocName = some_other_design.doc;
$designDocVersion = 43;
$compilerVersion = 9.5;
Right now I feel like I have a messed-up solution. I am just wondering if there is a cleaner way to do this. This might also fail if I have multiple files with multiple versions.
First I remove all whitespace, and then I split the string in two repeatedly until I get all the values:
$str =~ s/\s*//g;
($temp,$temp2) = split(/CodeFile\(s\)Name:/,$str,2);
($fileName,$temp) = split(/CodeFile\(s\)Version:/,$temp2,2);
($fileVersion,$temp2) = split(/DesignDocument:/,$temp,2);
($designDocName,$temp) = split(/DesignDocumentVersion:/,$temp2,2);
($designDocVersion,$temp2) = split(/some/,$temp,2);
($testedCompilerVersion,$temp) = split(/CompilerVersion:/,$temp2,2);
($testedCompilerVersion,$temp2) = split(/DocType:/,$temp,2);
Please point me to a link or an efficient solution.
Thanks in advance.
PS: Please also check the comment below the question.
Perhaps the following will be helpful:
use strict;
use warnings;
use Data::Dumper;
my $str = "xxxxxx Code File(s) Name:Some_thing.c CodeFile(s) Version:27 Design Document:some_other_design.doc Module Design Document Version:43 somexxxxxxxxxx Compiler Version:9.5 Doc Type:Word xxxxxx";
my @labels = qw/fileName fileVersion designDocName designDocVersion compilerVersion docType/;
my ($i, %items) = 0;
$items{$labels[$i++]} = $1 while $str =~ /.+?:(\S+)\s+?/g;
print Dumper \%items;
Output:
$VAR1 = {
'designDocName' => 'some_other_design.doc',
'fileName' => 'Some_thing.c',
'docType' => 'Word',
'designDocVersion' => '43',
'fileVersion' => '27',
'compilerVersion' => '9.5'
};
Although I would go with @Kenosis's solution, I still wanted to show you how your script could be simplified.
#!/usr/bin/perl
use v5.14;
use warnings;
my $str = "xxxxxx Code File(s) Name:Some_thing.c CodeFile(s) Version:27 Design Document:some_other_design.doc Module Design Document Version:43 somexxxxxxxxxx Compiler Version:9.5 Doc Type:Word xxxxxx";
my ($fileName,
$fileVersion,
$designDocName,
$designDocVersion,
$compilerVersion) = $str =~ /:(\S+)/g;
say "$fileName, $fileVersion, $designDocName, $designDocVersion, $compilerVersion";
#Some_thing.c, 27, some_other_design.doc, 43, 9.5
my ($fileName, $fileVersion, $designDocName, $designDocVersion, $compilerVersion) =
$str =~ /Code File\(s\) Name:(.*) CodeFile\(s\) Version:(.*) Design Document:(.*) Module Design Document Version:(.*) somexxxxxxxxxx Compiler Version:(.*) Doc Type:(.*) xxxxxx/;

SOAP::Lite: pass a string argument

I have a problem passing a string argument using Perl. The following code
#!/usr/bin/perl -w
use SOAP::Lite;
my $service = SOAP::Lite->service('http://localhost:8080/greeting?wsdl');
print $service->greetClient('perl wooooo'), "\n";
Results in
Greeting null! Have a nice day...
A similar Python script
from suds.client import Client
client = Client('http://localhost:8080/greeting?wsdl')
print client.service.greetClient('python wooooo')
works perfectly
Greeting python wooooo! Have a nice day...
I tried to set different encodings
print $service->encoding('utf-8')->greetClient("perl wooooo"), "\n";
with the same result.
A SOAP Monitor shows that there is no arg0 in the case of Perl:
<greetClient xsi:nil="true" xsi:type="tns:greetClient" />
which is present in the case of Python:
<ns0:greetClient>
<arg0>python wooooo</arg0>
</ns0:greetClient>
What could be the problem?
Why is it so complicated to implement a SOAP client in Perl compared to Python?
EDIT:
SOLUTION
Finally, the following solution works:
#!/usr/bin/perl -w
use strict;
use warnings;
use XML::Compile::SOAP11;
use XML::Compile::WSDL11;
use XML::Compile::Transport::SOAPHTTP;
my $soap = XML::Compile::WSDL11->new('c:/temp/greeting.wsdl');
my $call = $soap->compileClient('greetClient');
print $call->(arg0 => 'perl wooooo')->{'greetClientResponse'}{'return'}, "\n";
SOAP::Lite can be infuriatingly bad. You might give XML::Compile::SOAP a try:
use strict;
use warnings;
use XML::Compile::SOAP11;
use XML::Compile::WSDL11;
use XML::Compile::Transport::SOAPHTTP;
my $soap = XML::Compile::WSDL11->new(
    'http://localhost:8080/greeting?wsdl',
    schema_dirs => [
        'c:/soft/Perl/site/lib/XML/Compile/SOAP11/xsd',
        'c:/soft/Perl/site/lib/XML/Compile/XOP/xsd',
        'c:/soft/Perl/site/lib/XML/Compile/xsd',
    ],
);
$soap->compileCalls;
my ( $response, $trace ) = $soap->call( 'greetClient', arg0 => 'perl wooooo' );
$trace->printResponse;
$response will be the call response converted to a hashref via XML::Simple, which may be all you need. The $trace object is handy to see what the raw XML response looks like.
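For instance, to eyeball what came back before writing proper accessors, something along these lines should work; the greetClientResponse and return element names are copied from the asker's EDIT above, not verified against the WSDL:
use Data::Dumper;

my ( $response, $trace ) = $soap->call( 'greetClient', arg0 => 'perl wooooo' );

print Dumper($response);                               # inspect the decoded hashref
print $response->{greetClientResponse}{return}, "\n";  # element names per the asker's edit
$trace->printRequest;                                  # raw XML that was sent
$trace->printResponse;                                 # raw XML that came back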
Unfortunately, I can't see your WSDL.
But in regard to SOAP::Lite, I don't see you setting up either a proxy (endpoint) or a uri.
You're also probably going to have to change the on_action behavior as well. By default, SOAP::Lite wants to use the '#' concatenation.
So something along these lines might work.
$service->proxy( $uri_of_my_end_point );
$service->uri( $schema_namespace );
$service->on_action( sub {
    my ( $uri, $method ) = @_;
    my $slash = $uri =~ m{/$} ? '' : '/';
    return qq{"$uri$slash$method"};
});

Querying a website with Perl LWP::Simple to Process Online Prices

In my free time, I've been trying to improve my perl abilities by working on a script that uses LWP::Simple to poll one specific website's product pages to check the prices of products (I'm somewhat of a perl noob). This script also keeps a very simple backlog of the last price seen for that item (since the prices change frequently).
I was wondering if there was any way I could further automate the script so that I don't have to explicitly add the page's URL to the initial hash (i.e. keep an array of key terms and run a search query on Amazon to find the page or price). Is there any way I could do this that doesn't involve me just copying Amazon's search URL and parsing in my keywords? (I'm aware that processing HTML with regex is generally bad form; I only used it since I need one small piece of data.)
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my %oldPrice;
my %nameURL = (
"Archer Season 1" => "http://www.amazon.com/Archer-Season-H-Jon-Benjamin/dp/B00475B0G2/ref=sr_1_1?ie=UTF8&qid=1297282236&sr=8-1",
"Code Complete" => "http://www.amazon.com/Code-Complete-Practical-Handbook-Construction/dp/0735619670/ref=sr_1_1?ie=UTF8&qid=1296841986&sr=8-1",
"Intermediate Perl" => "http://www.amazon.com/Intermediate-Perl-Randal-L-Schwartz/dp/0596102062/ref=sr_1_1?s=books&ie=UTF8&qid=1297283720&sr=1-1",
"Inglorious Basterds (2-Disc)" => "http://www.amazon.com/Inglourious-Basterds-Two-Disc-Special-Brad/dp/B002T9H2LK/ref=sr_1_3?ie=UTF8&qid=1297283816&sr=8-3"
);
if (-e "backlog.txt"){
open (LOG, "backlog.txt");
while(){
chomp;
my #temp = split(/:\s/);
$oldPrice{$temp[0]} = $temp[1];
}
close(LOG);
}
print "\nChecking Daily Amazon Prices:\n";
open(LOG, ">backlog.txt");
foreach my $key (sort keys %nameURL){
    my $content = get $nameURL{$key} or die;
    $content =~ m{\s*\$(\d+\.\d+)} || die;
    if (exists $oldPrice{$key} && $oldPrice{$key} != $1){
        print "$key: \$$1 (Was $oldPrice{$key})\n";
    }
    else{
        print "\n$key: $1\n";
    }
    print LOG "$key: $1\n";
}
close(LOG);
Yes, the design can be improved. It's probably best to delete everything and start over with an existing full-featured web scraping application or framework, but since you want to learn:
The name-to-URL map is configuration data. Retrieve it from outside of the program.
Store the historic data in a database.
Learn XPath and use it to extract data from HTML; it's easy if you already grok CSS selectors (see the sketch below).
Other stackers, if you want to amend my post with the rationale for each piece of advice, go ahead and edit it.
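To illustrate the XPath point, here is a minimal sketch of my own using HTML::TreeBuilder::XPath; the /dp/0596102062 URL is taken from the question's hash, but the newPrice class name is a guess at Amazon's markup, so inspect the page and adjust the expression before relying on it.
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder::XPath;

my $url  = 'http://www.amazon.com/dp/0596102062';    # Intermediate Perl, from the hash above
my $html = get($url) or die "could not fetch $url";

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# The XPath expression replaces the hand-rolled price regex; the class
# name is a placeholder and needs checking against the real page.
my $price = $tree->findvalue('//span[@class="newPrice"]/span');
print $price ne '' ? "$price\n" : "no price found\n";

$tree->delete;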
I made a simple script to demonstrate Amazon search automation. The search URL for all departments is built with the escaped search term. The rest of the code is simple parsing with HTML::TreeBuilder. The structure of the HTML in question can easily be examined with the dump method (see the commented-out line).
use strict; use warnings;
use LWP::Simple;
use URI::Escape;
use HTML::TreeBuilder;
use Try::Tiny;
my $look_for = "Archer Season 1";
my $contents
= get "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords="
. uri_escape($look_for);
my $html = HTML::TreeBuilder->new_from_content($contents);
for my $item ($html->look_down(id => qr/result_\d+/)) {
    # $item->dump; # find out structure of HTML
    my $title = try { $item->look_down(class => 'productTitle')->as_trimmed_text };
    my $price = try { $item->look_down(class => 'newPrice')->find('span')->as_text };
    print "$title\n$price\n\n";
}
$html->delete;