Parsing a robots.txt file using C++

Is there any library to parse robots.txt? If not, how can I write a parser in C++ with Boost Regex?

Check out the examples in the Boost Regex library. If you edit your question to give a better idea of what exactly you are looking for in your robots.txt file, someone can help you with the Regex syntax.
For example, if you are trying to find the names of all User-agents in the file, you could use an expression like this (note that the backslash must be doubled inside a C++ string literal):
boost::regex expression("^User-agent:\\s*(.*)");
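To show how such an expression might be applied, here is a minimal sketch that reads a robots.txt stream line by line and collects the User-agent values. It uses std::regex as a stand-in for boost::regex (the two interfaces are close, but this is a substitution, not the Boost API), and the function name user_agents is hypothetical. Comment stripping and case-insensitive field names are assumptions about what a robots.txt reader usually wants.

```cpp
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the value of every "User-agent:" line.
// std::regex is used here in place of boost::regex.
std::vector<std::string> user_agents(std::istream& in) {
    // Field names are matched case-insensitively; group 1 is the value.
    static const std::regex line_re("^\\s*User-agent\\s*:\\s*(.*)$",
                                    std::regex::icase);
    std::vector<std::string> agents;
    std::string line;
    while (std::getline(in, line)) {
        // Drop a trailing "# comment" (find() returns npos if there is none).
        line = line.substr(0, line.find('#'));
        // Trim trailing whitespace, including the '\r' left by CRLF files.
        while (!line.empty() && (line.back() == '\r' || line.back() == ' ' ||
                                 line.back() == '\t')) {
            line.pop_back();
        }
        std::smatch m;
        if (std::regex_match(line, m, line_re)) {
            agents.push_back(m[1].str());
        }
    }
    return agents;
}
```

Matching one line at a time avoids relying on how "^" behaves in the middle of a multi-line string, which differs between regex engines and flags.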

Related

Is there a function to create a regex pattern from a string input?

I'm lousy at regular expressions but occasionally they're the only thing that's the right solution for a problem.
Is there something in the .NET framework that allows you to input an unencoded string and get a pattern from it? Which you could then modify as required?
e.g. I want to remove a CDATA section that contains a file from some XML but I can't work out what the right pattern is for <![CDATA[hugepileofrandombinarydataherethatalsoneedstogo]]> and I don't want to ask for help each time I'm stuck on a regex pattern.
Such tools exist; search for "regex generator".
But, as suggested in the comments, it is better to learn regex. Simple patterns are easy. In your case, something like <!\[.*?]]> would do (with RegexOptions.Singleline if the data can contain line breaks).
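As an illustration of the same idea outside .NET, here is a small C++ sketch that removes CDATA sections with std::regex. The function name strip_cdata is hypothetical, and the pattern is a slightly stricter variant of the one above that spells out "CDATA" and uses [\s\S] so that the match can span line breaks.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes every <![CDATA[ ... ]]> section from xml.
std::string strip_cdata(const std::string& xml) {
    // [\s\S]*? is a non-greedy "any character, including newlines", so each
    // match stops at the first "]]>" after its opening marker.
    static const std::regex cdata_re(R"(<!\[CDATA\[[\s\S]*?\]\]>)");
    return std::regex_replace(xml, cdata_re, "");
}
```

For very large payloads, a plain search for the opening and closing markers with std::string::find may be more robust, since some regex implementations cope poorly with very long matches.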
There are regex design tools such as Expresso:
http://www.ultrapico.com/expresso.htm
It's not perfect, but since there is no suitable .NET component, the text-to-regex page at txt2re.com is the best I've seen for people who occasionally need to build a regex to match a string but don't have the time to relearn regex each time they want to use one.

extract out-links of an html file in c++/g++

I want to list all the URLs in an HTML file using g++. I tried to use grep, but without success. Can anyone help me do that?
Thanks a lot.
You could use boost::regex to find URLs in HTML code.
See the related question about C++ libraries for HTML parsing; you could use one of those to parse your HTML and then look for <a> tags.
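A rough sketch of the regex route follows, using std::regex in place of boost::regex. The function name extract_links is hypothetical, and the pattern only handles quoted href attributes on <a> tags; it is an approximation and will miss unquoted attributes or links built by scripts, which is where a real HTML parser does better.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the href value of every <a ...> tag.
std::vector<std::string> extract_links(const std::string& html) {
    // Group 1 captures a double-quoted value, group 2 a single-quoted one.
    static const std::regex href_re(
        R"re(<a\s[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)'))re",
        std::regex::icase);
    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), href_re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        links.push_back(m[1].matched ? m[1].str() : m[2].str());
    }
    return links;
}
```

Calling extract_links on the contents of a file read into a std::string yields the raw attribute values; relative links would still need to be resolved against the page URL.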

Boost RegEx to parse url (RFC 1738) to extract domain name

Can someone please post a regex to extract the domain from a URL conforming to RFC 1738 (http://www.ietf.org/rfc/rfc1738.txt)?
PROTOCOL://USERNAME:PASSWORD@DOMAINNAME:PORT/QUERYSTRING
Example:
https://abc:password@answers.yahoo.com:777/question/index?qid=20100728205639
Thanks,
Sumit
You can find one such regular expression here. You can probably simplify it, but that depends entirely on your needs.
You can also use a library which provides functions for parsing URLs. A good starting point is this Stack Overflow thread:
Easy way to parse a url in C++ cross platform?
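For a sense of what a simplified expression could look like, here is a sketch using std::regex in place of boost::regex. It assumes the RFC 1738 layout scheme://user:password@host:port/path, with '@' separating the credentials from the host. The UrlParts struct and parse_url function are hypothetical names, and the pattern deliberately ignores cases such as IPv6 literals.

```cpp
#include <regex>
#include <string>

// Hypothetical container for the pieces of a URL.
struct UrlParts {
    std::string scheme, user, password, host, port, path;
};

// Hypothetical helper: returns false if url does not fit the simplified
// pattern scheme://[user[:password]@]host[:port][/path].
bool parse_url(const std::string& url, UrlParts& out) {
    static const std::regex url_re(
        R"(^([A-Za-z][A-Za-z0-9+.-]*)://(?:([^:@/]*)(?::([^@/]*))?@)?([^:/?#]+)(?::(\d+))?(/.*)?$)");
    std::smatch m;
    if (!std::regex_match(url, m, url_re)) {
        return false;
    }
    // Unmatched optional groups yield empty strings.
    out = UrlParts{m[1].str(), m[2].str(), m[3].str(),
                   m[4].str(), m[5].str(), m[6].str()};
    return true;
}
```

The domain name ends up in the host field; the other fields are filled in only when the corresponding part is present.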

Searching HTML Files using Regex

I have a pool of HTML files and want to search through them for some target text. The search must cover only their text content, ignoring all HTML tags, headers, scripts, etc.
I tried QRegExp, the regex class in Qt, but could not find a good pattern to do what I'm after.
I’d appreciate any help in this regard.
Thank you.
This may or may not be a good answer for you, but have you considered using a DOM-parser instead? That will eliminate the need to filter out what is text and what is HTML markup. Sadly I can't recommend a good one for C++ though.

Needed C++ HTML parser + regular expression support

I'm working on a C++ project and I need to find an external library that provides an HTML parser and regular expression support.
The project targets two operating systems: iOS and Android.
I was thinking of using libxml2, which has an HTML parser module and XML regular expressions.
Can I use the XML regular expression module on an HTML page?
In addition, I need some basic HTML function support in C++, like these two PHP functions: rawurlencode and urlencode.
I'm open to different libraries.
Thanks
I've never used libxml2 to parse HTML, but I remember that it was easy to use for XML parsing, so it's probably worth a try.
For the regular expressions, I would instead suggest using Boost Regex.
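The two PHP functions mentioned in the question are small enough to write by hand. The sketch below is an approximation of their documented behaviour as I understand it: rawurlencode leaves letters, digits and -_.~ untouched and percent-encodes everything else, while urlencode leaves only letters, digits and -_. untouched and encodes a space as '+'. The names percent_encode, raw_url_encode and url_encode are hypothetical.

```cpp
#include <string>

// Hypothetical helper shared by both encoders. form_style selects the
// urlencode behaviour (space becomes '+', '~' is encoded).
std::string percent_encode(const std::string& in, bool form_style) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        // Explicit ASCII ranges avoid locale-dependent classification.
        const bool alnum = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
                           (c >= 'a' && c <= 'z');
        if (alnum || c == '-' || c == '_' || c == '.' ||
            (c == '~' && !form_style)) {
            out += static_cast<char>(c);
        } else if (c == ' ' && form_style) {
            out += '+';
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Approximates PHP's rawurlencode.
std::string raw_url_encode(const std::string& s) {
    return percent_encode(s, false);
}

// Approximates PHP's urlencode.
std::string url_encode(const std::string& s) {
    return percent_encode(s, true);
}
```

Both functions work byte by byte, so a UTF-8 string is encoded one byte at a time, which is the usual convention for percent-encoding.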