Parse HTML using perl regex

I created a Perl script that would use an online website to crack MD5 hashes after the user inputs the hashes. I am partially successful as I am able to get the response from the website, though I need to parse the HTML and display the hash, and corresponding password in clear text to the user. The following is the output snippet I get now:
<strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>
Using RegexBuddy, I found that the expression [a-z0-9]{32} matches the hash part alone. I need the final output in the following format:
21232f297a57a5a743894a0e4a801fc3: admin
Any help would be appreciated. Thank you!

I think you'd be much better off using HTML::Parser to simply/reliably parse that HTML. Otherwise you're into the nightmare of parsing HTML with regexps, and you'll find that doesn't work reliably.
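As a rough illustration of the parser-based approach, here is a minimal sketch in Python rather than Perl, using the standard library's html.parser in place of HTML::Parser. The class name HashLineParser and its attributes are made up for this example, and it assumes the response always has the shape <strong>HASH</strong>: PASSWORD shown in the question.

```python
from html.parser import HTMLParser


class HashLineParser(HTMLParser):
    """Collect (hash, password) pairs from '<strong>HASH</strong>: PASSWORD' markup."""

    def __init__(self):
        super().__init__()
        self.in_strong = False      # True while inside a <strong> element
        self.current_hash = None    # text of the most recent <strong> element
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        if self.in_strong:
            self.current_hash = data.strip()
        elif self.current_hash is not None:
            # The text right after </strong> looks like ': admin'
            password = data.strip().lstrip(":").strip()
            if password:
                self.pairs.append((self.current_hash, password))
            self.current_hash = None


parser = HashLineParser()
parser.feed("<strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>")
parser.close()
for md5_hash, password in parser.pairs:
    print(f"{md5_hash}: {password}")
```

The same event-driven idea (start tag, text, end tag handlers) should carry over to HTML::Parser, though the handler registration there looks different.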

There are a few tools that can handle both fetching and parsing the page for you available on CPAN. One of them is Web::Scraper. Tell it what page to fetch and which nodes (in xpath or CSS syntax) you want, and it will get them for you. I'll not give an example as I don't know your URL.
There is a good blogpost about this on blogs.perl.org by stas that uses a different module that might also be helpful.

Here it is:
my $str = q{<strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>};
my @arr = $str =~ m{<strong>(.+)</strong>(.+)</p>};
print(join("", @arr), "\n");
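If a regex is used anyway, it may help to make the pattern as strict as the data allows. The following is a sketch in Python rather than Perl; the names HASH_LINE and extract_pairs are invented for this example, and the pattern assumes the hash is always 32 lowercase hex characters inside <strong> tags, followed by a colon and the password.

```python
import re

# 32 hex characters inside <strong>, then ':' and the password up to the next tag.
HASH_LINE = re.compile(r"<strong>([a-f0-9]{32})</strong>:\s*([^<]*)")


def extract_pairs(html):
    """Return a list of (hash, password) tuples found in the HTML text."""
    return [(md5_hash, password.strip()) for md5_hash, password in HASH_LINE.findall(html)]


snippet = "<strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>"
for md5_hash, password in extract_pairs(snippet):
    print(f"{md5_hash}: {password}")
```

Because the groups cannot run past a tag boundary, this version should also cope with several results in one response, which a greedy (.+) would merge.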

Related

Perl 5: How to improve regex for URL parsing

I'm trying to parse a text file of tweets and remove URLs and put them into a urls.txt file. At the moment, I have this regex:
($line =~ /((?:https?|ftp|telnet|gopher|file|imap):\/\/[\w\-\.\~\!\*\'\(\)\;\:\@\&\=\+\$\,\/\\\?\%\#\[\]]*)/)
But as I want to build on it further, and it's quite unwieldy even now, I'm wondering if there's any way I can check for valid URL characters (the [\w\-\.\~\!\*\'\(\)\;\:\@\&\=\+\$\,\/\\\?\%\#\[\]]* part) using something like an array or a hash. Or anything that doesn't make it so unnecessarily verbose.
The rest of my code can be provided if needed for whatever reason.

If you want to validate a URL, why not use a module from CPAN to do the hard work for you?
use URI;
my $uri = URI->new("http://www.perl.com");
See the details of the URI module here.
As recommended by Sobrique, you could also use:
use Data::Validate::URI qw(is_uri);
if (is_uri("http://www.perl.com")) {
...
}
See the details of the Data::Validate::URI module here.
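The general idea of "find candidates loosely, then let a library decide" can be sketched as follows. This is Python rather than Perl, with urllib.parse standing in for the CPAN modules; the names ALLOWED_SCHEMES, CANDIDATE and extract_urls, and the sample tweet, are assumptions made up for this example.

```python
import re
from urllib.parse import urlparse

# Schemes taken from the regex in the question.
ALLOWED_SCHEMES = {"http", "https", "ftp", "telnet", "gopher", "file", "imap"}

# Deliberately loose: anything that looks like scheme://non-space-characters.
CANDIDATE = re.compile(r"[a-z][a-z0-9+.-]*://\S+", re.IGNORECASE)


def extract_urls(line):
    """Return the URLs in a line whose scheme is allowed and that parse cleanly."""
    urls = []
    for token in CANDIDATE.findall(line):
        token = token.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        try:
            parts = urlparse(token)
        except ValueError:
            continue  # malformed, e.g. an unbalanced IPv6 bracket
        if parts.scheme in ALLOWED_SCHEMES and (parts.netloc or parts.scheme == "file"):
            urls.append(token)
    return urls


tweet = "Reading http://www.perl.com, then ftp://example.org/pub and foo://bar"
print(extract_urls(tweet))
```

Keeping the scheme list in a set rather than inside the pattern makes it easier to extend, which is roughly what the question asks for.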

How to match plain text URL in a markdown?

I'm currently trying to match all plain text links in a markdown text.
Example of the markdown text:
Dude, look at this url http://www.google.com .. it's a great search engine
I would like it to be converted into
Dude, look at this url <http://www.google.com> .. it's a great search engine
So in short, processing url should produce <url>, but processing an existing <url> shouldn't produce <<url>>. Also, the link in the markdown can be in the form of (url), so we'll have to avoid matching the normal brackets too.
So my working regex for matching the plain text url in Java is:
"[^(\\<|\\(](https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|][^(\\>|\\)]",
with [^(\\<|\\(] and [^(\\>|\\)] to avoid matching the wrapping brackets.
But here lies one problem: I also do not want to match this kind of url:
[1]: http://slashdot.org
So, if the markdown text is
Dude, look at this url http://www.google.com .. it's a great search engine
[1]: http://slashdot.org
I want only http://www.google.com to be matched, but not the http://slashdot.org.
What pattern would meet these criteria?

What you have here is a parsing problem. Regexes are fine, but just using regexes here will make it a mess (supposing you achieve it). After you fix this problem, you'll probably find yourself facing other ones, like URL in code (between ` or in lines starting with tabs or four spaces) that you don't want to replace.
A solution would be to split the text into lines and then:
detect patterns (for example ^\[\d+\]:\s+)
apply your replacements (for example, this URL-to-link change) only on lines that don't match an incompatible pattern
That's the logic I use in this small pseudo-markdown parser that you can test here.
Note that there's always the option of using an existing, proven markdown parser; there are many of them.
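The line-by-line approach described above can be sketched roughly like this. It is written in Python rather than Java, and the names REFERENCE_DEF, CODE_LINE, BARE_URL, autolink_line and autolink are invented for this example. It only handles the cases mentioned in this thread (reference definitions, indented code lines, URLs already wrapped in <...> or (...)), not markdown in general.

```python
import re

REFERENCE_DEF = re.compile(r"^\s*\[[^\]]+\]:\s+")   # e.g. "[1]: http://slashdot.org"
CODE_LINE = re.compile(r"^(?: {4}|\t)")             # indented code block line
# A URL not directly preceded by '<' or '(' and not ending in punctuation.
BARE_URL = re.compile(r"(?<![<(])\b((?:https?|ftp|file)://[^\s<>()]*[^\s<>().,;:!?])")


def autolink_line(line):
    """Wrap bare URLs in <...> unless the line matches an incompatible pattern."""
    if REFERENCE_DEF.match(line) or CODE_LINE.match(line):
        return line
    return BARE_URL.sub(r"<\1>", line)


def autolink(text):
    """Apply autolink_line to every line of a markdown text."""
    return "\n".join(autolink_line(line) for line in text.split("\n"))


sample = (
    "Dude, look at this url http://www.google.com .. it's a great search engine\n"
    "[1]: http://slashdot.org"
)
print(autolink(sample))
```

Deciding per line first keeps the URL pattern itself small; each new exception (inline code spans, for instance) becomes another line-level or span-level check instead of another alternation in one large regex.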

Quick regex help: grab text from html

I have the following html snippet:
<h1 class="header" itemprop="name">Some text here<span class="nobr">
I would like to get the text between the HTML tags. I've been struggling with this for hours now, please help! What regex would solve my problem?

You should not use regex for that, but some HTML parser. As you didn't specify language, it is hard to help, but you will find it by googling...
If you need it just for this one case, you can use regex />(.*?)</
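Since the question does not name a language, here is one possible parser-based sketch in Python using the standard library's html.parser. The class name HeaderText is made up for this example, and it assumes the wanted element is an h1 with class "header", as in the snippet.

```python
from html.parser import HTMLParser


class HeaderText(HTMLParser):
    """Collect all text found inside <h1 class="header"> elements."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and dict(attrs).get("class") == "header":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.chunks.append(data)


parser = HeaderText()
parser.feed('<h1 class="header" itemprop="name">Some text here<span class="nobr">')
parser.close()
header = "".join(parser.chunks).strip()
print(header)
```

Unlike the one-off regex, this also tolerates attributes in a different order and text that is split across nested tags.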

In JavaScript you can access that info via:
document.getElementsByTagName("h1").item(0).textContent
or
document.getElementsByClassName("header").item(0).textContent

As others have said, you shouldn't be using regular expressions for parsing HTML. But with that aside, the following will grab that text for you:
(?<=\>).+(?=\<)

How can I manipulate just part of a Perl string?

I'm trying to write some Perl to convert some HTML-based text over to MediaWiki format and hit the following problem: I want to search and replace within a delimited subsection of some text and wondered if anyone knew of a neat way to do it. My input stream is something like:
Please mail <a href="mailto:help@myco.com&Subject=Please help&Body=Please can some one help me out here">support.</a> if you want some help.
and I want to change Please help and Please can some one help me out here to Please%20help and Please%20can%20some%20one%20help%20me%20out%20here respectively, without changing any of the other spaces on the line.
Naturally, I also need to be able to cope with more than one such link on a line so splicing isn't such a good option.
I've taken a good look round Perl tutorial sites (it's not my first language) but didn't come across anything like this as an example. Can anyone advise an elegant way of doing this?

Your task has two parts. Find and replace the mailto URIs - use a HTML parsing module for that. This topic is covered thoroughly on Stack Overflow.
The other part is to canonicalise the URI. The module URI is suitable for this purpose.
use URI::mailto;
my @hrefs = ('mailto:help@myco.com&Subject=Please help&Body=Please can some one help me out here');
print URI::mailto->new($_)->as_string for @hrefs;
__END__
mailto:help@myco.com&Subject=Please%20help&Body=Please%20can%20some%20one%20help%20me%20out%20here

Why don't you just search for the "Body=" tag up to the closing quote and replace every space with %20?
I would not even use regular expressions for that, since I don't find them useful for anything except mass changes where everything on the line is changed.
A simple loop might be the best solution.
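One way to change only the delimited part is a substitution whose replacement is computed per match. The sketch below is Python rather than Perl and uses a regex callback instead of a hand-written loop. The names MAILTO_HREF and encode_mailto_spaces are invented for this example, and the sample line is a guess at what the input looks like, assuming each link is written as href="mailto:..." with double quotes.

```python
import re

# Three groups: the opening 'href="', the mailto value, and the closing quote.
MAILTO_HREF = re.compile(r'(href=")(mailto:[^"]*)(")')


def encode_mailto_spaces(line):
    """Replace spaces with %20 inside mailto hrefs only; leave other spaces alone."""
    def fix(match):
        return match.group(1) + match.group(2).replace(" ", "%20") + match.group(3)
    return MAILTO_HREF.sub(fix, line)


line = (
    'Please mail <a href="mailto:help@myco.com&Subject=Please help">support</a> '
    'or <a href="mailto:ops@myco.com&Subject=Need a hand">ops</a> if you want some help.'
)
print(encode_mailto_spaces(line))
```

Because the callback runs once per match, several links on one line are handled without any splicing. In Perl the same idea is usually written with the /e modifier on s///.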

How can I get only href value from link

I have many links in my page.
For example <a href="/promotions/download/schools/australia.aspx">Australia</a>
Now I want only the href with its value, i.e. (href="/promotions/download/schools/australia.aspx"), using a VBScript regular expression.

My regex would be something like:
href="([^"]*)"
Might need escaping in your context but that (or something very much like it) should work.
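To show what the capture group yields, here is the same pattern exercised in Python rather than VBScript. The name HREF and the sample markup are assumptions for this example, and the pattern only covers double-quoted href values.

```python
import re

# Same pattern as above; group 1 is the value between the quotes.
HREF = re.compile(r'href="([^"]*)"', re.IGNORECASE)

html = (
    '<a href="/promotions/download/schools/australia.aspx">Australia</a> '
    '<a href="/promotions/download/schools/canada.aspx">Canada</a>'
)
values = HREF.findall(html)
for value in values:
    print(f'href="{value}"')
```

In VBScript the equivalent would presumably go through the RegExp object with Global set to True and read the first submatch of each match.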

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). Luckily, you should have access to the best parser available: the web browser. Modern browsers create a Document Object Model which is a tree structure that contains all of the information about the page. One of the methods you can call on the DOM is links. I don't really know vbscript, but this code looks like it should work:
For i = 0 To document.links.length - 1
    document.write(document.links(i).href & "<BR>")
Next
