extract out-links of an html file in c++/g++ - c++

I want to list all the URLs in an HTML file using g++. I tried grep, but without success. Can anyone help me do that?
Thanks a lot.

You could use boost::regex to find URLs in the HTML code.
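As a rough illustration of the regex approach, here is a minimal sketch that collects href values from <a> tags. It uses std::regex from the standard library as a stand-in for boost::regex (the two have similar interfaces), and the function name extract_links is made up for this example. It only handles quoted attribute values and will also pick up tags inside comments or scripts, so treat it as a heuristic rather than a parser.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the href values of <a ...> tags in `html`.
// Heuristic only: quoted values, no awareness of comments or <script> blocks.
std::vector<std::string> extract_links(const std::string& html) {
    // <a, a word boundary, any attributes, then href="..." or href='...'.
    static const std::regex href_re(
        R"re(<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'))re",
        std::regex::icase);

    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), href_re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // Group 1 holds a double-quoted value, group 2 a single-quoted one.
        links.push_back(m[1].matched ? m[1].str() : m[2].str());
    }
    return links;
}
```

Reading the file into a std::string and passing it to extract_links gives the list of link targets; relative URLs are returned as written and would still need resolving against the page's base URL.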

See the related question about C++ libraries for HTML parsing; you could use one of them to parse your HTML and then look for <a> tags.

Related

Parse HTML using perl regex

I created a Perl script that would use an online website to crack MD5 hashes after the user inputs the hashes. I am partially successful as I am able to get the response from the website, though I need to parse the HTML and display the hash, and corresponding password in clear text to the user. The following is the output snippet I get now:
<strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>
Using RegexBuddy, I was able to match the hash part alone with the expression [a-z0-9]{32}. I need the final output in the following format:
21232f297a57a5a743894a0e4a801fc3: admin
Any help would be appreciated. Thank you!
I think you'd be much better off using HTML::Parser to simply/reliably parse that HTML. Otherwise you're into the nightmare of parsing HTML with regexps, and you'll find that doesn't work reliably.
There are a few tools that can handle both fetching and parsing the page for you available on CPAN. One of them is Web::Scraper. Tell it what page to fetch and which nodes (in xpath or CSS syntax) you want, and it will get them for you. I'll not give an example as I don't know your URL.
There is a good blogpost about this on blogs.perl.org by stas that uses a different module that might also be helpful.
Here it is:
my $str = q{<strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>};
# Capture the hash inside <strong> and the text that follows it, up to </p>.
my @arr = $str =~ m{<strong>(.+)</strong>(.+)</p>};
print(join("", @arr), "\n");

Quick regex help: grab text from html

I have the following html snippet:
<h1 class="header" itemprop="name">Some text here<span class="nobr">
I would like to get the text between the HTML tags. I've been struggling with this for hours now, please help me! What regex would solve my problem?
You should not use regex for that, but some HTML parser. As you didn't specify language, it is hard to help, but you will find it by googling...
If you need it just for this one case, you can use regex />(.*?)</
In Javascript you can access that info via:
document.getElementsByTagName("h1").item(0).textContent
or
document.getElementsByClassName("header").item(0).textContent
As others have said, you shouldn't use regular expressions for parsing HTML. That aside, the following will grab that text for you:
(?<=\>).+(?=\<)

Searching HTML Files using Regex

I have a pool of HTML files and want to search through them for the same target text. The search should cover their text content only, ignoring all HTML tags, headers, scripts, etc.
I tried QRegExp, the regex class in Qt, but could not find a good pattern to do what I'm after.
I’d appreciate any help in this regard.
Thank you.
This may or may not be a good answer for you, but have you considered using a DOM-parser instead? That will eliminate the need to filter out what is text and what is HTML markup. Sadly I can't recommend a good one for C++ though.

parsing robots.txt file using c++

Is there a library to parse robots.txt? If not, how can I write a parser in C++ with Boost.Regex?
Check out the examples in the Boost Regex library. If you edit your question to give a better idea of what exactly you are looking for in your robots.txt file, someone can help you with the Regex syntax.
For example, if you are trying to find the names of all User-agents in the file, you could use an expression like this.
boost::regex expression("^User-agent:\\s*(.*)");
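To show how such an expression might be applied, here is a small sketch that reads robots.txt content line by line and collects the User-agent names. It uses std::regex instead of boost::regex so that it needs no external library, and the function name user_agents is an assumption for this example. Matching one line at a time sidesteps the question of how ^ behaves on multi-line input; the pattern also tolerates trailing # comments and different letter case, which goes slightly beyond the expression above.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the value of every "User-agent:" line.
std::vector<std::string> user_agents(const std::string& robots_txt) {
    // Whole-line pattern: field name, colon, value, optional "# comment".
    static const std::regex ua_re(
        R"(\s*User-agent\s*:\s*([^#]*?)\s*(?:#.*)?)",
        std::regex::icase);

    std::vector<std::string> agents;
    std::istringstream in(robots_txt);
    std::string line;
    while (std::getline(in, line)) {
        // Drop the '\r' left behind by CRLF line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();

        std::smatch m;
        if (std::regex_match(line, m, ua_re)) {
            agents.push_back(m[1].str());
        }
    }
    return agents;
}
```

This only lists the agent names; grouping the Allow and Disallow rules under each agent would need a little more state in the loop.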

Getting alt tags with regex

I am parsing some HTML source. Is there a regex script to find out whether alt tags in a html document are empty?
I want to see if the alt tags are empty or not.
Is regex suitable for this or should I use string manipulation in C#?
You have to parse the HTML and check the tags. The following link includes a C# library for parsing HTML tags; you can loop through the tags and count them: Parsing HTML tags.
If this is valid XHTML, why do you need Regex at all? If you simply search for the string:
alt=""
... you should be able to find all empty alt tags.
In any case, it shouldn't be too complicated to construct a Regex for the search too, taking into account poorly written HTML markup (especially with spaces):
alt\s*=\s*"\s*"
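For illustration, here is a minimal sketch that counts matches of that kind of pattern, written in C++ with std::regex rather than C#. The function name count_empty_alt is made up for this example, and the pattern is extended (as an assumption) to accept single quotes and any letter case. It is still a text search, so it would also count matches inside comments or scripts.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts alt attributes whose value is empty or blank.
std::size_t count_empty_alt(const std::string& html) {
    // alt, optional spaces around '=', then "" or '' with only whitespace inside.
    static const std::regex empty_alt_re(
        R"re(\balt\s*=\s*(?:"\s*"|'\s*'))re",
        std::regex::icase);

    std::sregex_iterator first(html.begin(), html.end(), empty_alt_re);
    std::sregex_iterator last;  // default-constructed end-of-sequence iterator
    return static_cast<std::size_t>(std::distance(first, last));
}
```

A result greater than zero means at least one empty alt attribute was found; images with no alt attribute at all are not counted by this search.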
If you want to do it just looking at the page then CSS selectors might be better, assuming your browser supports the :not selector.
Install the SelectorGadget bookmarklet. Activate it on your page, then put the following selector in the input box and press Enter.
img:not([alt])
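If you are automating it and have access to the DOM for the HTML, you could use the same selector. Note that img:not([alt]) matches images that have no alt attribute at all; img[alt=""] matches images whose alt attribute is present but empty.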
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.