Search for pattern in variable with html-code in perl

I get a variable $response_code with the whole HTML code of a website via curl. I want to search this string for a specific pattern with a regex. But this doesn't work: I get no match at all, no matter what I search for. When I write something different into $response_code it works; otherwise the array is empty...
What's wrong with this?
use strict;
use warnings;
use WWW::Curl::Easy;
my $curl = WWW::Curl::Easy->new;
$curl->setopt(CURLOPT_HEADER,1);
$curl->setopt(CURLOPT_URL, 'https://fahrplaner.vbn.de/hafas/stboard.exe/dn?protocol=https:&input=8003760&boardType=dep&time=09$
# A filehandle, reference to a scalar or reference to a typeglob can be used here.
my $response_body;
$curl->setopt(CURLOPT_WRITEDATA,\$response_body);
# Starts the actual request
my $retcode = $curl->perform;
# Looking at the results...
if ($retcode == 0) {
    # print("Transfer went ok\n");
    my $response_code = $curl->getinfo(CURLINFO_HTTP_CODE);
    # judge result and next action based on $response_code
    # print("Received response: $response_body\n");
    my @test = ($response_code =~ m/ontime/g);
    print($test[0]);
}
else {
    # Error code, type of error, error message
    print("An error happened: $retcode ".$curl->strerror($retcode)." ".$curl->errbuf."\n");
}
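One thing worth checking, as a reading of the code above rather than a confirmed diagnosis: the match runs against $response_code, which holds the numeric HTTP status returned by getinfo(CURLINFO_HTTP_CODE), while the page content is written to $response_body. A minimal sketch, with a made-up HTML string standing in for the fetched page (the sample markup, the status value and the helper name find_matches are assumptions for illustration):

```perl
use strict;
use warnings;

# Hypothetical stand-ins for what the curl request would fill in.
my $response_code = 200;    # numeric HTTP status
my $response_body = '<td class="ontime">09:15</td><td class="ontime">09:45</td>';

# Return every occurrence of $pattern in $text (list-context /g match).
sub find_matches {
    my ($text, $pattern) = @_;
    return $text =~ /$pattern/g;
}

my @in_code = find_matches($response_code, qr/ontime/);
my @in_body = find_matches($response_body, qr/ontime/);

print "matches in status code: ", scalar(@in_code), "\n";
print "matches in body: ", scalar(@in_body), "\n";
```

With these sample values the status code yields no matches and the body yields two, which would explain an empty array when the pattern is applied to the status code.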

Related

How can I get the host name from a URL in Perl?

I have a URL like "www.google.com/aabc/xyz". How can I get the host name from it? I used this code:
my $referer = URI->new('www.google.com/aabc/xyz');
my $host = $referer->host; # compiler error
I'm getting an error on the second line.

use URI;
use URI::Heuristic qw(uf_uristr);
my $referrer = URI->new( uf_uristr('www.google.com/aabc/xyz') );
print $referrer->host;

The question changed significantly since my first answer, which I've deleted. With high enough rep you can see it.
You have in the code (it's better to post complete programs):
my $referer = URI->new('www.google.com/aabc/xyz');
my $host = $referer->host; # compiler error
You say that you're getting a compiler error, but it's really a runtime error:
Can't locate object method "host" via package "URI::_generic"
When you made the new object, you gave URI a string. From that, it's going to guess what sort of URI it is. Since there's no scheme, such as http://, in front of it, it doesn't guess that it's that sort of URI. Instead, it falls back to a "generic" class URI::_generic. By the underscore in its name and the fact there's no documentation for it, you may surmise it's not meant for you to know about.
But, here it is complaining. It thinks the URI is a path (and some other things). The part you recognize as the host it parses as a path:
use v5.10;
use URI;
my $referer = URI->new('www.google.com/aabc/xyz');
my $path = $referer->path;
say "path is $path";
Now you see what it did:
path is www.google.com/aabc/xyz
The generic URI doesn't know anything about a host, so when you call host on its object, it blows up. It would be nicer for it to return undef, perhaps, but that's not what it does.
oanders already has an interesting answer that fills in a scheme for you when it guesses that one is missing, but there's another thing you can do. Before you call host, check that the object can respond to it:
use v5.10;
use URI;
my $url = 'www.google.com/aabc/xyz';
my $referer = URI->new( $url );
if( $referer->can( 'host' ) ) {
    say "Host is " . $referer->host;
}
else {
    say "Weird hostless URL: $referer";
}
Now your program shouldn't blow up for the same reason and you can look at the output to discover strings that you couldn't process.
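To see the class selection described above, a small sketch can print the class each string ends up in. It assumes the URI module from CPAN is installed; the helper name uri_class is made up for illustration.

```perl
use strict;
use warnings;
use URI;

# Report which URI subclass a string is blessed into.
sub uri_class {
    my ($str) = @_;
    return ref( URI->new($str) );
}

for my $str ('www.google.com/aabc/xyz', 'http://www.google.com/aabc/xyz') {
    my $uri  = URI->new($str);
    # Only call host when the object actually supports it.
    my $host = $uri->can('host') ? $uri->host : '(no host method)';
    printf "%-32s %-14s %s\n", $str, uri_class($str), $host;
}
```

The string without a scheme lands in URI::_generic, while the one with http:// becomes a URI::http object that knows about hosts.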

$ echo -e "http://www.google.www.com/abc/xyz\nhttps://google.com\nwww.google.www.com"
http://www.google.www.com/abc/xyz
https://google.com
www.google.www.com
$ echo -e "http://www.google.www.com/abc/xyz\nhttps://google.com\nwww.google.www.com" | perl -pe "s/^(http(s)?:\/\/)?(www\.)?//"
google.www.com/abc/xyz
google.com
google.www.com

You can do it much more simply than the above.
CODE
use strict;
use warnings;
while (<DATA>) {
    # strip an optional scheme and an optional leading "www."
    $_ =~ s/^(https?:\/\/)?(www\.)?\b//;
    print $_;
}
__DATA__
http://www.google.com/abc/xyz
https://google.com
www.google.com
Results
google.com/abc/xyz
google.com
google.com

Pattern doesn't remove special characters which are by themselves on a website

So I am currently taking user input in the form of a URL, parsing the page, and then printing the other pages that website links to. The package that I am using is:
LWP::Simple
I take the link from the command line via $ARGV[0] and store it in a variable.
Then I make another variable and call get on the variable where I stored the website.
Then I make an array variable and apply this regex to the fetched content:
/\shref="?([^\s>"]+)/gi;
The array stores the matches found in the result of the get call. Then I run a foreach loop over the array to print out the results.
However, while it does print links, it also ends up printing standalone special characters such as / and # when there is nothing after them.
So if there is something like /blabalbla, it prints that, but if there is just a standalone special character such as /, \, or #, it prints that too. Is there any way I can modify the regex so that special characters that are not followed by a string are not printed? I'm new to Perl and not so talented at regex.

I can't help you with your specific problem without further information, but in the meantime I suggest that you look at HTML::LinkExtor, which was written for this purpose.
Here's some example code and its output. It lists only <a> elements that have an href attribute.
use strict;
use warnings;
use 5.010;
use LWP;
use HTML::LinkExtor;
my $ua = LWP::UserAgent->new;
my $resp = $ua->get('http://www.bbc.co.uk/');
my $extor = HTML::LinkExtor->new(undef, $resp->base);
$extor->parse($resp->decoded_content);
for my $link ($extor->links) {
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' and $attr{href};
    say $attr{href};
}
output
http://m.bbc.co.uk
http://www.bbc.co.uk/
http://www.bbc.co.uk/#h4discoveryzone
http://www.bbc.co.uk/accessibility/
https://ssl.bbc.co.uk/id/status
http://www.bbc.co.uk/news/
http://www.bbc.com/news/
http://www.bbc.co.uk/sport/
http://www.bbc.co.uk/weather/
http://shop.bbc.com/
http://www.bbc.com/earth/
http://www.bbc.com/travel/
http://www.bbc.com/capital/
http://www.bbc.co.uk/iplayer/
http://www.bbc.com/culture/
http://www.bbc.com/autos/
http://www.bbc.com/future/
http://www.bbc.co.uk/tv/
http://www.bbc.co.uk/radio/
http://www.bbc.co.uk/cbbc/
http://www.bbc.co.uk/cbeebies/
http://www.bbc.co.uk/arts/
http://www.bbc.co.uk/ww1/
http://www.bbc.co.uk/food/
http://www.bbc.co.uk/history/
http://www.bbc.co.uk/learning/
http://www.bbc.co.uk/music/
http://www.bbc.co.uk/science/
http://www.bbc.co.uk/nature/
http://www.bbc.com/earth/
http://www.bbc.co.uk/local/
http://www.bbc.co.uk/travel/
http://www.bbc.co.uk/a-z/
http://www.bbc.co.uk/#orb-footer
http://search.bbc.co.uk/search
http://www.bbc.co.uk/privacy/cookies/managing/cookie-settings.html
http://www.bbc.co.uk/locator/default/desktop/en-GB?ptrt=%2F
http://www.bbc.co.uk/#
http://www.bbc.co.uk/#
http://www.bbc.co.uk/weather/2643743?day=0
http://www.bbc.co.uk/weather/2643743?day=0
http://www.bbc.co.uk/weather/2643743?day=1
http://www.bbc.co.uk/weather/2643743?day=1
http://www.bbc.co.uk/weather/2643743?day=2
http://www.bbc.co.uk/weather/2643743?day=2
http://www.bbc.co.uk/locator/default/desktop/en-GB?ptrt=%2F
http://www.bbc.co.uk/weather/2643743
http://www.bbc.co.uk/news/science-environment-30311816
http://www.bbc.co.uk/news/science-environment-30311822
http://www.bbc.co.uk/news/science-environment-30311818
http://www.bbc.co.uk/news/magazine-30282261
http://www.bbc.co.uk/news/science-environment-30311816
http://www.bbc.co.uk/news/uk-politics-30291460
http://www.bbc.co.uk/news/
http://www.bbc.co.uk/news/uk-england-kent-30319549
http://www.bbc.co.uk/news/world-europe-30306106
http://www.bbc.co.uk/news/world-europe-30306992
http://www.bbc.co.uk/news/uk-30306145
http://www.bbc.co.uk/news/local/
http://www.bbc.co.uk/news/england/london/
http://www.bbc.co.uk/news/uk-england-london-30308694
http://www.bbc.co.uk/news/uk-england-london-30315650
http://www.bbc.co.uk/news/uk-england-london-30321504
http://www.bbc.co.uk/sport/live/football/29959148
http://www.bbc.co.uk/sport/0/
http://www.bbc.co.uk/sport/live/snooker/29618359
http://www.bbc.co.uk/sport/football/30204433
http://www.bbc.co.uk/sport/cricket/30308980
http://www.bbc.co.uk/sport/football/30204434
http://www.bbc.co.uk/sport/0/football/
http://www.bbc.co.uk/sport/football/30204459
http://www.bbc.co.uk/sport/football/30204511
http://www.bbc.co.uk/sport/football/28647040
http://www.bbc.co.uk/?dzf=sport
http://www.bbc.co.uk/?dzf=entertainment
http://www.bbc.co.uk/?dzf=bbcnow
http://www.bbc.co.uk/?dzf=entertainment
http://www.bbc.co.uk/?dzf=news
http://www.bbc.co.uk/?dzf=lifestyle
http://www.bbc.co.uk/?dzf=knowledge
http://www.bbc.co.uk/?dzf=sport
http://www.bbc.co.uk/news/
http://www.bbc.com/news/
http://www.bbc.co.uk/sport/
http://www.bbc.co.uk/weather/
http://shop.bbc.com/
http://www.bbc.com/earth/
http://www.bbc.com/travel/
http://www.bbc.com/capital/
http://www.bbc.co.uk/iplayer/
http://www.bbc.com/culture/
http://www.bbc.com/autos/
http://www.bbc.com/future/
http://www.bbc.co.uk/tv/
http://www.bbc.co.uk/radio/
http://www.bbc.co.uk/cbbc/
http://www.bbc.co.uk/cbeebies/
http://www.bbc.co.uk/arts/
http://www.bbc.co.uk/ww1/
http://www.bbc.co.uk/food/
http://www.bbc.co.uk/history/
http://www.bbc.co.uk/learning/
http://www.bbc.co.uk/music/
http://www.bbc.co.uk/science/
http://www.bbc.co.uk/nature/
http://www.bbc.com/earth/
http://www.bbc.co.uk/local/
http://www.bbc.co.uk/travel/
http://www.bbc.co.uk/a-z/
http://www.bbc.co.uk/
http://www.bbc.co.uk/terms/
http://www.bbc.co.uk/aboutthebbc/
http://www.bbc.co.uk/privacy/
http://www.bbc.co.uk/privacy/cookies/about
http://www.bbc.co.uk/accessibility/
http://www.bbc.co.uk/guidance/
http://www.bbc.co.uk/contact/
http://www.bbc.co.uk/bbctrust/
http://www.bbc.co.uk/complaints/
http://www.bbc.co.uk/help/web/links/
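
For the original question about standalone / and # entries, one option is to filter the raw href values before printing them. A minimal sketch, assuming the links have already been collected into an array by the question's regex (the sample list and the helper name is_bare_symbol are made up for illustration):

```perl
use strict;
use warnings;

# True when an href consists only of /, \ or # characters.
sub is_bare_symbol {
    my ($href) = @_;
    return $href =~ m{^[/\\#]+$} ? 1 : 0;
}

# Hypothetical hrefs as the question's regex might capture them.
my @hrefs = ('/blabalbla', '/', '#', 'http://example.com/page', '\\', '/news/#top');

# Keep everything that is more than a bare special character.
my @kept = grep { !is_bare_symbol($_) } @hrefs;
print "$_\n" for @kept;
```

Filtering after extraction keeps the matching regex simple; links such as /news/#top survive because they contain more than the special characters alone.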

Perl taint mode with domain name input for CGI resulting in “Insecure dependency in eval”

Given the following CGI script, running Perl in taint mode, I have not been able to get past this error:
tail /etc/httpd/logs/error_log
/usr/local/share/perl5/Net/DNS/Dig.pm line 906 (#1)
(F) You tried to do something that the tainting mechanism didn't like.
The tainting mechanism is turned on when you're running setuid or
setgid, or when you specify -T to turn it on explicitly. The
tainting mechanism labels all data that's derived directly or indirectly
from the user, who is considered to be unworthy of your trust. If any
such data is used in a "dangerous" operation, you get this error. See
perlsec for more information.
[Mon Jan 6 16:24:21 2014] dig.cgi: Insecure dependency in eval while running with -T switch at /usr/local/share/perl5/Net/DNS/Dig.pm line 906.
Code:
#!/usr/bin/perl -wT
use warnings;
use strict;
use IO::Socket::INET;
use Net::DNS::Dig;
use CGI;
$ENV{"PATH"} = ""; # Latest attempted fix
my $q = CGI->new;
my $domain = $q->param('domain');
if ( $domain =~ /(^\w+)\.(\w+\.?\w+\.?\w+)$/ ) {
    $domain = "$1\.$2";
}
else {
    warn("TAINTED DATA SENT BY $ENV{'REMOTE_ADDR'}: $domain: $!");
    $domain = ""; # successful match did not occur
}
my $dig = new Net::DNS::Dig(
    Timeout => 15, # default
    Class => 'IN', # default
    PeerAddr => $domain,
    PeerPort => 53, # default
    Proto => 'UDP', # default
    Recursion => 1, # default
);
my @result = $dig->for( $domain, 'NS' )->to_text->rdata();
@result = sort @result;
print @result;
I normally use Data::Validate::Domain to do checking for a “valid” domain name, but could not deploy it in a way in which the tainted variable error would not occur.
I read that in order to untaint a variable you have to pass it through a regex with capture groups and then join the capture groups to sanitize it. So I deployed $domain =~ /(^\w+)\.(\w+\.?\w+\.?\w+)$/. As shown here, it is not the best regex for untainting a domain name and it does not cover all possible domains, but it meets my needs. Unfortunately my script is still producing taint failures and I cannot figure out why.
Regexp-Common does not provide a domain regex, and modules don’t seem to work for untainting variables, so I am at a loss now.
How can I get this to pass taint checking?

$domain is not tainted
I verified that your $domain is not tainted. This is the only variable you use that could be tainted, in my opinion.
perl -T <(cat <<'EOF'
use Scalar::Util qw(tainted);
sub p_t($) {
    if (tainted $_[0]) {
        print "Tainted\n";
    } else {
        print "Not tainted\n";
    }
}
my $domain = shift;
p_t($domain);
if ($domain =~ /(^\w+)\.(\w+\.?\w+\.?\w+)$/) {
    $domain = "$1\.$2";
} else {
    warn("$domain\n");
    $domain = "";
}
p_t($domain);
EOF
) abc.def
It prints
Tainted
Not tainted
What Net::DNS::Dig does
See Net::DNS::Dig line 906. It is at the beginning of the to_text method.
sub to_text {
    my $self = shift;
    my $d = Data::Dumper->new([$self],['tobj']);
    $d->Purity(1)->Deepcopy(1)->Indent(1);
    my $tobj;
    eval $d->Dump; # line 906
…
From the definition of new I know that $self is just a hashref containing the values passed to new plus several others filled in by the constructor. The evaled code produced by $d->Dump sets $tobj to a deep copy of $self (Deepcopy(1)), with correctly set self-references (Purity(1)) and basic pretty-printing (Indent(1)).
Where is the problem, how to debug
From what I found out about &Net::DNS::Dig::to_text, it is clear that the problem is at least one tainted item inside $self. So you have a straightforward way to debug your problem further: after constructing the $dig object in your script, check which of its items is tainted. You can dump the whole structure to stdout using print Data::Dumper::Dumper($dig);, which is roughly the same as the evaled code, and check suspicious items using &Scalar::Util::tainted.
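One way to do that check, sketched with core modules only, is to walk the structure and report where tainted scalars sit. The helper name tainted_paths and the '$dig' label are made up for illustration, and the sketch assumes the object is built from plain hashes and arrays, as described above.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted reftype refaddr);

# Walk nested hashes/arrays and return the paths of all tainted scalars.
sub tainted_paths {
    my ($data, $path, $seen) = @_;
    $seen ||= {};
    my $type = reftype($data) // '';
    if ($type eq 'HASH' or $type eq 'ARRAY') {
        # Skip containers already visited, so self-references do not loop.
        return () if $seen->{ refaddr($data) }++;
        my @found;
        if ($type eq 'HASH') {
            push @found, tainted_paths($data->{$_}, $path . "->{$_}", $seen)
                for sort keys %$data;
        }
        else {
            push @found, tainted_paths($data->[$_], $path . "->[$_]", $seen)
                for 0 .. $#$data;
        }
        return @found;
    }
    # Leaf: report it only if it is a plain tainted scalar.
    return (!ref($data) && tainted($data)) ? ($path) : ();
}

# Stand-in structure; in the CGI script this would be the real $dig object.
my $dig = { PeerAddr => 'example.com', PeerPort => 53, Extra => [ 'a', 'b' ] };
print "tainted: $_\n" for tainted_paths($dig, '$dig');
```

This only reports something when run under -T; without taint mode, tainted() returns false for every value and the list stays empty.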
I have no idea how far this is from making Net::DNS::Dig work in taint mode. I do not use it; I was just curious and wanted to find out where the problem is. As you managed to solve your problem another way, I will leave it at this stage and let others continue debugging the issue.

As a resolution to this question, in case anyone comes across it in the future: it was indeed the module I was using that caused the taint checks to fail, which taught me an important lesson about trusting modules in a CGI environment. I switched to Net::DNS, as I figured it would not encounter this issue, and sure enough it does not. My code is provided below for reference in case anyone wants to accomplish the same thing I set out to do, which is to locate the nameservers defined for a domain within its own zone file.
#!/usr/bin/perl -wT
use warnings;
use strict;
use IO::Socket::INET;
use Net::DNS;
use CGI;
$ENV{"PATH"} = ""; # Latest attempted fix
my $q = CGI->new;
my $domain = $q->param('domain');
my @result;
if ( $domain =~ /(^\w+)\.(\w+\.?\w+\.?\w+)$/ ) {
    $domain = "$1\.$2";
}
else {
    warn("TAINTED DATA SENT BY $ENV{'REMOTE_ADDR'}: $domain: $!");
    $domain = ""; # successful match did not occur
}
my $ip = inet_ntoa(inet_aton($domain));
my $res = Net::DNS::Resolver->new(
    nameservers => [($ip)],
);
my $query = $res->query($domain, "NS");
if ($query) {
    foreach my $rr (grep { $_->type eq 'NS' } $query->answer) {
        push(@result, $rr->nsdname);
    }
}
else {
    warn "query failed: ", $res->errorstring, "\n";
}
@result = sort @result;
print @result;
Thanks for the comments assisting me in this matter, and to SO for teaching me more than any other resource I have come across.

Pattern Match Timed-out

I use Perl's Net::Telnet to connect to my router and change some options, but I get this error:
pattern match timed-out
Everything is correct (user, password, pattern, etc.), and I am going crazy looking for the source of this error. My code is:
use Net::Telnet;
$telnet = new Net::Telnet ( Timeout=>10, Errmode=>'die');
$telnet->open('192.168.1.1');
$telnet->waitfor('/login[: ]$/i');
$telnet->print('admin');
$telnet->waitfor('/password[: ]$/i');
$telnet->print('admin');
$telnet->waitfor('/\$ $/i' );
$telnet->print('list');
$output = $telnet->waitfor('/\$ $/i');
print $output;
What should I do now? Is there an alternative way?
Thank you

Maybe try logging in using the example at the top of the Net::Telnet page?
use Net::Telnet ();
$t = new Net::Telnet (Timeout => 10, Errmode=>'die');
$t->open($host);
$t->login($username, $passwd);
@lines = $t->cmd("who");
print @lines;
That seems to work for me, while your code snippet times out at the first waitfor when trying to log in.
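One possible reason for the timeout at the first waitfor, offered as a guess since the router's actual output is not shown: if the prompt is "login: " (a colon followed by a space), the pattern /login[: ]$/i cannot match, because [: ] consumes exactly one character before the end of the input. A minimal sketch with made-up prompt strings compares it with a variant that allows any run of colons and spaces:

```perl
use strict;
use warnings;

# Hypothetical prompts a device might send; the real ones may differ.
my @prompts = ('login:', 'login: ', 'Login:  ');

my $strict  = qr/login[: ]$/i;     # pattern from the question
my $relaxed = qr/login[: ]*$/i;    # allows zero or more trailing ':' or ' '

for my $prompt (@prompts) {
    printf "%-10s strict=%s relaxed=%s\n",
        "'$prompt'",
        ($prompt =~ $strict  ? 'match' : 'no match'),
        ($prompt =~ $relaxed ? 'match' : 'no match');
}
```

If that turns out to be the cause, waitfor('/login[: ]*$/i') would be the corresponding change. Enabling Net::Telnet's input_log would show what the router really sends.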

Set different Lighttpd vhost for internal LAN clients - possibly just RegEx required...?

I want Lighttpd to display a different page for internal clients and the default page for everyone else.
Between these two links, I have an idea of what I want to do but am not sure of the RegEx I would need to restrict clients using a hostname of [http://]192.168.0.? or [http://]192.168.?.? to a different page. I've been using the following code in lighttpd.conf:
server.document-root = "/var/www/sites"
$HTTP["host"] == "RegExHere" {
    server.document-root = "/var/www/setup"
}
...where for 'RegExHere' I have tried a variety of attempts such as:
192\.168\.0\.\d{1,3}(\s|$))+
192\.168\.
[192.168.[0-9]+.]
192\.168\.[0-9]+.[0-9]+$
...and various combinations thereof. I have no idea whether I'm close, but regardless it only shows me the default page.
Can anyone advise where I may be going wrong please?
Thanks in advance!

You have to use the =~ operator to match a regex. Change $HTTP["host"] == "RegExHere" to $HTTP["host"] =~ "RegExHere" and one of those regexes should work. ^192\.168\.\d{1,3}\.\d{1,3}$ should do it.
Found this article on it http://blog.evanweaver.com/2006/06/07/regular-expressions-in-lighttpd-host-redirects/
edit: I think you need to use $HTTP["remoteip"] instead of $HTTP["host"] and it looks like you can do it without regexes.
$HTTP["remoteip"] == "10.0.0.0/8" { url.access-deny = ("") }
$HTTP["remoteip"] == "127.0.0.0/8" { url.access-deny = ("") }
http://forum.lighttpd.net/topic/27
