Needed C++ HTML parser + regular expression support

I'm working on a C++ project and I need to find an external library that
provides an HTML parser and regular expression support.
The project targets two platforms: iOS and Android.
I was thinking of using libxml2, which has an HTML parser module and an XML regular expression module.
Can I use the XML regular expression module on an HTML page?
In addition, I need some basic HTML/URL helper functions in C++,
like these two PHP functions: rawurlencode and urlencode.
I'm open to different libraries.
Thanks

I've never used libxml2 to parse HTML, but I remember that it was easy to use for XML parsing, so it's probably worth a try.
For regular expressions, I would suggest using Boost.Regex instead.
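For the rawurlencode and urlencode part of the question, a small hand-written helper may be enough. The following is a minimal sketch, not a library API: the names percent_encode, raw_url_encode and url_encode are made up for illustration. It assumes the documented PHP behaviour, where rawurlencode leaves letters, digits and -_.~ untouched, while urlencode leaves letters, digits and -_. untouched and turns spaces into plus signs.

```cpp
#include <string>

// Hypothetical helper: percent-encode every byte that is neither an ASCII
// letter/digit nor listed in `safe`. When `space_as_plus` is true, a space
// becomes '+' instead of "%20".
static std::string percent_encode(const std::string& in,
                                  const std::string& safe,
                                  bool space_as_plus) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(in.size() * 3);
    for (unsigned char c : in) {
        const bool alnum = (c >= '0' && c <= '9') ||
                           (c >= 'A' && c <= 'Z') ||
                           (c >= 'a' && c <= 'z');
        if (alnum || safe.find(static_cast<char>(c)) != std::string::npos) {
            out += static_cast<char>(c);      // keep unreserved bytes as-is
        } else if (space_as_plus && c == ' ') {
            out += '+';                       // form-style space encoding
        } else {
            out += '%';
            out += hex[c >> 4];               // high nibble
            out += hex[c & 0x0F];             // low nibble
        }
    }
    return out;
}

// Assumed to mirror PHP's rawurlencode (RFC 3986 unreserved set).
std::string raw_url_encode(const std::string& s) {
    return percent_encode(s, "-_.~", false);
}

// Assumed to mirror PHP's urlencode (space -> '+', '~' is encoded).
std::string url_encode(const std::string& s) {
    return percent_encode(s, "-_.", true);
}
```

Both functions work byte by byte, so UTF-8 input is encoded one byte at a time, which is also what the PHP functions do.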

Related

Can xPath in LibXML be regex type

We usually write our search path in the findnodes() function as below:
//parentNode[subNode/text() = 'CPUUSAGE']/subNode
What if I want to match only part of the text here and find all the matching nodes?
something like
//parentNode[subNode/text() =~ '/CPUUSAGE'/]/subNode
Obviously this is invalid XPath...
Any thoughts on how to achieve this?
I know I can first find the nodes and then try to match the textContent. But can we do that in one shot, directly in findnodes()?
XPath 1.0 (which libxml implements) doesn’t include any built-in support for regular expressions. In the example you give, which uses a fairly simple regular expression, you could use the contains function to achieve a similar result:
//parentNode[subNode[contains(text(), 'CPUUSAGE')]]/subNode
(As an aside that’s an odd expression – you’d probably really want something like //parentNode/subNode[contains(text(), 'CPUUSAGE')] but I realise it’s just an example.)
There are some other string functions that could be useful in creating other simple queries.
You could create your own custom XPath function to filter nodes based on a regular expression; in fact, the docs for the Perl LibXML module include an example of doing just that.
XPath 2.0 does have support for using regular expressions with a group of string functions, but unless you have an XPath 2.0 processor that will not be much use.
XML::Twig has support for regular expressions in its xpaths.
The following is an xpath that I used in an answer to this SO question: Updating xml attribute value based on other with Perl
project[string(path) =~ /\bopensource\b/]/revision
I also created a second answer so that I could experiment with how XML::LibXML could be used to solve the same problem, and in that case I just iterated over all projects and did the regex filtering manually.
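The "find the nodes first, then filter by regex" approach mentioned above is language-independent. As a rough C++ sketch: the vector of strings here merely stands in for the text content of nodes that an XPath query has already returned, and filter_by_regex is a hypothetical helper name, not part of libxml or XML::LibXML.

```cpp
#include <regex>
#include <string>
#include <vector>

// Keep only those node texts in which `pattern` matches somewhere.
// `node_texts` is assumed to hold the text content of already-selected nodes.
std::vector<std::string> filter_by_regex(const std::vector<std::string>& node_texts,
                                         const std::regex& pattern) {
    std::vector<std::string> kept;
    for (const auto& text : node_texts) {
        if (std::regex_search(text, pattern)) {
            kept.push_back(text);
        }
    }
    return kept;
}
```

Calling it with std::regex(R"(\bopensource\b)") would reproduce the word-boundary filter from the XML::Twig expression above.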

Can regular expressions be used in Protege OWL query?

I have an OWL class Verse which has a data property named hasContent. The property's range is string. Using a DL query, e.g. Verse and hasContent "complete text of a verse", I can find the verse that contains the specified text. I now want to find all instances of verses that contain some word.
Can regular expressions be used in Protege OWL query? Is there any example? Or I need to use the more complicated query language, SPARQL?
You can use XSD facets directly within the OWL Manchester syntax (the syntax you use in Protege). With a facet you can achieve some of the things you can do with a regex, via the pattern construct. The implementation is reasoner-specific, so it might work sometimes and sometimes not :-/
Some links to learn more about it:
The answers to restrict xsd:string to [A-Z] for rdfs:range contain examples of facet use.
Facets specs.
Or SPARQL as suggested by other answers.
SPARQL is the query language for RDF and there is a reason for it. If you use plain regex (without SPARQL) you would not be able to define your target (instances, classes, properties etc.) and you would not exploit the benefits of using an ontology. Regular expressions are fine for plain text, but an ontology is not plain text and you shouldn't handle it as such. I would strongly suggest using SPARQL, which already supports regular expressions for restricting string values.
Another solution (which I would not recommend in any case) is to export your target ontology as an RDF/XML document and run a regular expression search on it as if it were a simple document.
Hope I helped!

How to implement Regex

I'm working on a database server software product (see my profile) and we see the need to implement free-text searching in our software. The query language standard we are using only supports free-text search using a BT type Regex. The only way we can use our free-text database indexes together with Regex seems to be to implement our own. My questions to SO are:
Where can I find papers/examples/patterns on how to implement a BT style Regex?
Is it worth looking into taking one of the open source C/C++ Regex libraries and altering the code to fit our needs?
If I'm not wrong, SPARQL uses the XPath/XQuery regular expression syntax, which is based on Perl regular expressions (at least that is what the W3C docs say).
If this is indeed the case then you can use PCRE from http://www.pcre.org/
It is licensed under the BSD license, so you will be able to use it in a commercial product.
If your syntax is slightly modified you can probably write a small routine to normalize it to the Perl syntax used by PCRE.
There are two papers I have found online on the subject of Regex indexing: one from Bell Labs and one from UCLA/IBM. I'm still not sure whether to use an existing Regex library and modify it or write one from scratch.

Regular expression to match style="whatever:0; morestuff:1; otherstuff:3"

I'm trying to match anything between and including style=""
eg: style="whatever:0; morestuff:1; otherstuff:3"
The pattern will be /style="([^"]*)"/, but may vary a bit depending on what language you're using.
Also, if you're trying to do this in JavaScript, jQuery would make this as easy as
$("#element-id").attr("style");
If you're trying to do this from another language, use an HTML parsing lib as HTML isn't regular. BeautifulSoup for Python is quite nice.
String under test
style="whatever:0; morestuff:1; otherstuff:3"
Regex
style\s*=\s*"([^"]*)"
Contents of group 1
whatever:0; morestuff:1; otherstuff:3
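In C++ the same pattern can be tried with std::regex, which uses ECMAScript syntax by default. This is only a sketch for simple, well-formed input; extract_style is a hypothetical helper name, and the caveat about regex-based HTML handling applies here too.

```cpp
#include <regex>
#include <string>

// Return the contents of the first style="..." attribute found in `html`
// (capture group 1), or an empty string when there is no match.
std::string extract_style(const std::string& html) {
    // A custom raw-string delimiter is needed because the pattern contains )"
    static const std::regex re(R"rx(style\s*=\s*"([^"]*)")rx");
    std::smatch m;
    if (std::regex_search(html, m, re)) {
        return m[1].str();
    }
    return "";
}
```

Note that an empty result is ambiguous here: it can mean either no style attribute or an empty one. A caller that needs to tell them apart would have to return the match status as well.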
Notice!
It is very hard to write a regex-based HTML parser that is correct, secure, and maintainable. If you need to write a program that deals with HTML in a robust, reliable, and secure way, you should use a real HTML parsing library like jsoup (Java) or Html Agility Pack (C#). To find an HTML parser for your favorite language, Google: yourlanguage html parser.
If you need to remove all style attributes from HTML (clean inline styles entirely), use this as the regexp:
style=\"[^\"]*\"
This works for me in Sublime Text 2-3
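Outside an editor, the same search-and-replace can be sketched in C++ with std::regex_replace. The helper name strip_style is hypothetical, and the pattern is a slight variation on the one above: it also consumes the whitespace in front of the attribute (an assumption made so that no double space is left behind, and so that attributes such as data-style are not touched).

```cpp
#include <regex>
#include <string>

// Remove every double-quoted style="..." attribute, together with the
// whitespace that precedes it. Single-quoted or unquoted values are not handled.
std::string strip_style(const std::string& html) {
    static const std::regex re(R"rx(\s+style\s*=\s*"[^"]*")rx");
    return std::regex_replace(html, re, "");
}
```

As with the extraction case, this is fine for quick clean-ups of known markup but is not a substitute for a real HTML parser.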
/(style="([^"]*)")/
for the whole string (untested). Do you want the key/value pairs retrieved as well?

A Regex builder for CSS queries

I've got a problem I need solved using regular expressions; it involves taking a CSS selector and compiling a regex that matches the string representation of the nodes inside an HTML document. The point is to avoid parsing the HTML as XML and then making either XPath or DOM queries to apply style attributes.
Does anyone know of a project that already implements something like this in any language? The target platform would be .NET 3.5.
Html Agility Pack
Regular expressions seem like an amazingly bad way of matching those nodes. I'm not sure I follow your problem - why not just use something like jQuery to pick out those nodes? E.g. given a CSS selector 'div>span.red:first-child',
$('div>span.red:first-child')
would return an array of those matching nodes.
EDIT: Oh, wait - are you trying to do this 'offline', as it were - not in a user's browser? Yeah, ignore my advice. (Even so, I'd still suggest that regular expressions aren't going to help you. Why are you against generating an xml-document representation of the page?)