Can't get node with libxml2 and xpath

I have the following piece of code to find all elements from an HTML page:
string AParser::cleanHTMLDocument(const string& aDoc) {
    vector<xmlNodePtr> nodesToRemove;
    xmlDocPtr doc = xmlParseDoc((xmlChar *)aDoc.c_str());
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = xmlXPathEvalExpression(
        (const xmlChar *)string("//link").c_str(), context);
    if (xmlXPathNodeSetIsEmpty(result->nodesetval)) {
        xmlXPathFreeObject(result);
        xmlXPathFreeContext(context);
        xmlFreeDoc(doc);
        LOG(WARNING)<< "XPath is invalid, bailing out.";
        return string();
    }
    const int size = result->nodesetval->nodeNr;
    for(int i = size - 1; i >= 0; i--) {
        LOG(DEBUG)<< result->nodesetval->nodeTab[i]->name;
    }
}
But for some reason xmlXPathNodeSetIsEmpty is always true. Am I missing something here?
Update: Input Document
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta name="generator" content="HTML Tidy for Linux (vers 7 December 2008), see www.w3.org"/>
<title>The Republic, by Plato</title>
<link href="0.css" type="text/css" rel="stylesheet"/>
<link href="1.css" type="text/css" rel="stylesheet"/>
<link href="pgepub.css" type="text/css" rel="stylesheet"/>
<meta content="EpubMaker 0.3.20a6 by Marcello Perathoner <webmaster#gutenberg.org>" name="generator"/>
</head>
<body>
<div xml:space="preserve" class="pgmonospaced pgheader"><br/>The Project Gutenberg EBook of The Republic, by Plato<br/><br/>This eBook is for the use of anyone anywhere at no cost and with<br/>almost no restrictions whatsoever.  You may copy it, give it away or<br/>re-use it under the terms of the Project Gutenberg License included<br/>with this eBook or online at www.gutenberg.org<br/><br/><br/>Title: The Republic<br/><br/>Author: Plato<br/><br/>Translator: B. Jowett<br/><br/>Release Date: August 27, 2008 [EBook #1497]<br/>Last Updated: November 5, 2012<br/><br/>Language: English<br/><br/><br/>*** START OF THIS PROJECT GUTENBERG EBOOK THE REPUBLIC ***<br/><br/><br/><br/><br/>Produced by Sue Asscher, and David Widger<br/><br/><br/><br/><br/><br/></div>
<p><br/>
<br/></p>
<h1 id="pgepubid00000">THE REPUBLIC</h1>
<p><br/></p>
<h2>By Plato</h2>
<p><br/></p>
<h3 id="pgepubid00001">Translated by Benjamin Jowett</h3>
<p><br/>
<br/>
<br/>
Note: See also "The Republic" by Plato, Jowett, etext #150<br/>
<br/></p>
<hr/>
<p><br/>
<br/></p>
<h2 id="pgepubid00002">Contents</h2>
<table summary="">
<tbody><tr>
<td>
<p class="toc">INTRODUCTION AND ANALYSIS.</p>
<br/>
<p class="toc"><a class="c1 pginternal" href="#public#vhost#g#gutenberg#html#files#1497#1497-h#1497-h-7.htm.html#link2H_4_0002">THE REPUBLIC.</a></p>
<p class="toc">PERSONS OF THE DIALOGUE.</p>
<p class="toc">BOOK I.</p>
<p class="toc">BOOK II.</p>
<p class="toc">BOOK III.</p>
<p class="toc">BOOK IV.</p>
<p class="toc">BOOK V.</p>
<p class="toc">BOOK VI.</p>
<p class="toc">BOOK VII.</p>
<p class="toc">BOOK VIII.</p>
<p class="toc">BOOK IX.</p>
<p class="toc">BOOK X.</p>
</td>
</tr>
</tbody></table>
<p><br/>
<br/></p>
<hr/>
<p><br/>
<br/>
<a id="link2H_INTR"><!-- H2 anchor --></a></p>
<h2 id="pgepubid00003">INTRODUCTION AND ANALYSIS.</h2>
<p>The Republic of Plato is the longest of his works with the exception of the Laws, and is certainly the greatest of them. There are nearer approaches to modern metaphysics in the Philebus and in the Sophist; the Politicus or Statesman is more ideal; the form and institutions of the State are more clearly drawn out in the Laws; as works of art, the Symposium and the Protagoras are of higher excellence. But no other Dialogue of Plato has the same largeness of view and the same perfection of style; no other shows an equal knowledge of the world, or contains more of those thoughts which are new as well as old, and not of one age only but of all. Nowhere in Plato is there a deeper irony or a greater wealth of humour or imagery, or more dramatic power. Nor in any other of his writings is the attempt made to interweave life and speculation, or to connect politics with philosophy. The Republic is the centre around which the other Dialogues may be grouped; here philosophy reaches the highest point (cp, especially in Books V, VI, VII) to which ancient thinkers ever attained. Plato among the Greeks, like Bacon among the moderns, was the first who conceived a method of knowledge, although neither of them always distinguished the bare outline or form from the substance of truth; and both of them had to be content with an abstraction of science which was not yet realized. He was the greatest metaphysical genius whom the world has seen; and in him, more than in any other ancient thinker, the germs of future knowledge are contained. The sciences of logic and psychology, which have supplied so many instruments of thought to after-ages, are based upon the analyses of Socrates and Plato. 
The principles of definition, the law of contradiction, the fallacy of arguing in a circle, the distinction between the essence and accidents of a thing or notion, between means and ends, between causes and conditions; also the division of the mind into the rational, concupiscent, and irascible elements, or of pleasures and desires into necessary and unnecessary—these and other great forms of thought are all of them to be found in the Republic, and were probably first invented by Plato. The greatest of all logical truths, and the one of which writers on philosophy are most apt to lose sight, the difference between words and things, has been most strenuously insisted on by him (cp. Rep.; Polit.; Cratyl), although he has not always avoided the confusion of them in his own writings (e.g. Rep.). But he does not bind up truth in logical formulae,—logic is still veiled in metaphysics; and the science which he imagines to 'contemplate all truth and all existence' is very unlike the doctrine of the syllogism which Aristotle claims to have discovered (Soph. Elenchi).</p>
</body></html>

The document you're querying uses an XML namespace. You have to either ignore the namespace or register and use it.
To ignore namespaces, query for all nodes and compare the local name (the name without its namespace) in a predicate, as in //*[local-name(.) = 'link'].
To register a namespace, call xmlXPathRegisterNs on the XPath context and afterwards prefix every element name in that namespace with [ns-prefix]:. For example:
xmlXPathContextPtr context = xmlXPathNewContext(doc);
// C string literals need double quotes; BAD_CAST is libxml2's cast to xmlChar *.
xmlXPathRegisterNs(context, BAD_CAST "xhtml", BAD_CAST "http://www.w3.org/1999/xhtml");
xmlXPathObjectPtr result = xmlXPathEvalExpression(
    (const xmlChar *)string("//xhtml:link").c_str(), context);
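
The same distinction can be reproduced outside libxml2. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree rather than libxml2 itself; the trimmed document and the variable names are assumptions made for illustration. It shows an unprefixed query finding nothing, while a prefix mapped to the XHTML namespace, or a comparison on the local name, finds both link elements.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Trimmed-down stand-in for the input document (assumption: two link elements).
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>The Republic, by Plato</title>
    <link href="0.css" type="text/css" rel="stylesheet"/>
    <link href="1.css" type="text/css" rel="stylesheet"/>
  </head>
  <body/>
</html>"""

root = ET.fromstring(DOC)

# Unprefixed name: this asks for "link" in no namespace, so nothing matches.
plain = root.findall(".//link")

# Registered prefix: "xhtml" is mapped to the namespace URI for this query.
prefixed = root.findall(".//xhtml:link", {"xhtml": XHTML_NS})

# Namespace-agnostic: compare only the local part of each tag name.
by_local_name = [el for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "link"]

print(len(plain), len(prefixed), len(by_local_name))  # prints: 0 2 2
```

The prefix used in the query does not have to match anything in the document; only the namespace URI it is bound to matters.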

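As an alternative to the solution posted by Jens, you could parse the document using libxml2's HTML parser, which does not put elements into namespaces, so the original //link query matches. All you have to do is replace xmlParseDoc with htmlParseDoc (declared in libxml/HTMLparser.h). Note that htmlParseDoc takes a second argument, the encoding, which may be NULL.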

Related

addition of not requested code into index.html

When I run "Build Book", index.html picks up "automatic contributions" that I did not ask for. An example of such a "contribution":
</div>
<div id="prerequisites" class="section level1">
<h1><span class="header-section-number"> 1</span> Prerequisites</h1>
<p>This is a <em>sample</em> book written in <strong>Markdown</strong>. You can use anything that
Pandoc’s Markdown supports, e.g., a math equation <span class="math inline">\(a^2 + b^2 = c^2\)
</span>
...
</p>class="uri">https://yihui.org/tinytex/</a>.</p>
This "gift" shows up in the EPUB output, but not in the HTML output.
It seems wrong to have to edit index.html to remove this "addition".
Has anyone been able to avoid this effect?
Thank you

Regex: Remove a <p></p> paragraph that has curly brackets inside

I would like to remove any paragraph from the article body that has curly brackets inside.
For example, from this piece of content:
<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology & What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five − = 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>
I would like to remove this part:
<p>five − = 2 .hide-if-no-js { display: none !important; } </p>
Using the following regex: <p>.*?\{.*?\}.*?</p>
It removes the whole article instead of this paragraph that contains curly braces, for some strange reason...
What am I doing wrong with the regex code?
Thanks!

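Lazy and greedy quantifiers do not always work as intended. A lazy quantifier only keeps the match short from the point where it starts: the match still begins at the first <p>, and .*? then expands across the following tags until it reaches a {. Instead, match any character except <, so the match cannot leave the paragraph. This works for me: <p>[^<]*\{[^<]*</p>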
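
To make the difference concrete, here is a small sketch in Python's re module. The shortened input string and the variable names are assumptions for illustration, not the original article. It applies the lazy pattern from the question and the negated-class pattern from this answer to the same text.

```python
import re

# Shortened stand-in for the article body (assumption).
html = "<p>Name</p> <p>Email</p> <p>five { display: none; } </p><h2>Recent Posts</h2>"

# Pattern from the question: lazy, but free to run across tags.
lazy = re.compile(r"<p>.*?\{.*?\}.*?</p>")

# Pattern from the answer: [^<]* cannot cross a tag boundary.
bounded = re.compile(r"<p>[^<]*\{[^<]*</p>")

lazy_result = lazy.sub("", html)
bounded_result = bounded.sub("", html)

print(lazy_result)     # <h2>Recent Posts</h2>
print(bounded_result)  # <p>Name</p> <p>Email</p> <h2>Recent Posts</h2>
```

The lazy pattern starts at the first <p> and swallows every paragraph up to the one with the braces, while the bounded pattern removes only the paragraph that contains them.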

Try this:
var str = '<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology & What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five − = 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>';
// [^<]* keeps the match inside a single paragraph; the g flag removes every such paragraph
var result = str.replace(/<p>[^<]*\{[^<]*<\/p>/g, '');
console.log(result);
Regex Demo

I'd suggest a two step approach (parsing and analyzing the text node).
Below you'll find examples for both Python and PHP (could be adopted for other languages, obviously):
Python:
# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup
html = """
<html>
<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology & What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five − = 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
# match anything between curly brackets
regex = r'{[^}]+}'
for p in soup.find_all('p', string=re.compile(regex)):
    # remove the whole <p> element from the tree
    p.decompose()
print(soup)
PHP:
<?php
$html = "<html>
<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology & What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five − = 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>
</html>";
$html = str_replace('&nbsp;', ' ', $html); // only because of the &nbsp; entities
$xml = simplexml_load_string($html);
# look for p tags
$lines = $xml->xpath("//p");
# the actual regex - match anything between curly brackets
$regex = '~{[^}]+}~';
for ($i = 0; $i < count($lines); $i++) {
    if (preg_match($regex, $lines[$i]->__toString())) {
        # unset it if it matches
        unset($lines[$i][0]);
    }
}
// vanished without a trace...
print_r($xml);
// convert it back to a string
$html = $xml->asXML();
echo $html;
?>

Regex in google link params

I have no experience with regular expressions.
I want to extract from the following text
http://news.google.com/news/url?sa=t&fd=R&ct2=it&usg=AFQjCNG4x7juUilTtEDL5ae1ecsNh7E-yQ&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778905305151&ei=2_utVbj7MsHS1QaH3YHQBA&url=http://time.com/3964691/yoga-dogs-and-cats/ tag:news.google.com,2005:cluster=http://time.com/3964691/yoga-dogs-and-cats/ Mon, 20 Jul 2015 17:44:50 GMT <table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;"><tr><td width="80" align="center" valign="top"><font style="font-size:85%;font-family:arial,sans-serif"><img src="//t0.gstatic.com/images?q=tbn:ANd9GcSPm8SUGKyWdqCih-LdFBEVfcJI2B86tVNolZJLoeWesaK1Jss7lbJsPKhaqLe8Pap7kYdL2Xw" alt="" border="1" width="80" height="80"><br><font size="-2">TIME</font></font></td><td valign="top" class="j"><font style="font-size:85%;font-family:arial,sans-serif"><br><div style="padding-top:0.8em;"><img alt="" height="1" width="1"></div><div class="lh"><b>Watch <b>cats</b> and dogs interrupt yoga routines - Time</b><br><font size="-1"><b><font color="#6f6f6f">TIME</font></b></font><br><font size="-1">The compilation above shows many a yoga routine getting interrupted. And it really never gets old watching a dog rush to the aid of his owner trapped in a headstand or for a a pet to think pigeon pose is an invitation for kisses. There's also the <b>cat</b> <b>...</b></font><br><font size="-1"><b>Cats</b> And Dogs Interrupting Yoga - Huffington Post UK<font size="-1" color="#6f6f6f"><nobr>Huffington Post UK</nobr></font></font><br><font size="-1" class="p"></font><br><font class="p" size="-1"><a class="p" href="http://news.google.com/news/story?ncl=dtJjhOioeLRtSJMzD7u9ebMAVfF0M&ned=it&hl=en"><nobr><b>tutte le notizie (3) »</b></nobr></a></font></div></font></td></tr></table>
the following string, which appears in the text above:
http://time.com/3964691/yoga-dogs-and-cats/

You can get this text using
(?<=url=)http[^\s"]+
See demo
Note that your (?<=url=).+?(?= ) regex matches more than the URL you need to extract.
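
As a quick check of the lookbehind pattern, here is a minimal sketch in Python's re module. The shortened input text and the variable names are assumptions for illustration. The pattern starts right after "url=" and stops at the first whitespace or double quote.

```python
import re

# Shortened stand-in for the feed text from the question (assumption).
text = (
    "http://news.google.com/news/url?sa=t&fd=R&ct2=it"
    "&url=http://time.com/3964691/yoga-dogs-and-cats/ "
    "tag:news.google.com,2005:cluster=http://time.com/3964691/yoga-dogs-and-cats/"
)

# (?<=url=) requires "url=" immediately before the match without consuming it.
pattern = re.compile(r'(?<=url=)http[^\s"]+')

match = pattern.search(text)
extracted = match.group(0) if match else None

print(extracted)  # http://time.com/3964691/yoga-dogs-and-cats/
```

The earlier "url?sa=t" in the text is skipped because the lookbehind needs the literal "url=".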

Try this:
(?<=url=).+?(?= )
Play around with it here: https://regex101.com/r/pO4cT3/1

Zurb Foundation 5 hiding content in Offcanvas

If you copy and paste Zurb's "Basic" code implementation of Foundation's Offcanvas layout, the paragraph content doesn't scroll. Doesn't this defeat the utility of this functionality? What am I missing here?
http://foundation.zurb.com/docs/components/offcanvas.html
This is the code I'm copy-pasting from that page:
<div class="off-canvas-wrap">
<div class="inner-wrap">
<a class="left-off-canvas-toggle" >Menu</a>
<!-- Off Canvas Menu -->
<aside class="left-off-canvas-menu">
<!-- whatever you want goes here -->
<ul>
<li>Item 1</li>
...
</ul>
</aside>
<!-- main content goes here -->
<p>THESE PARAGRAPHS WILL NOT SCROLL Set in the year 0 F.E. ("Foundation Era"), The Psychohistorians opens on Trantor, the capital of the 12,000-year-old Galactic Empire. Though the empire appears stable and powerful, it is slowly decaying in ways that parallel the decline of the Western Roman Empire. Hari Seldon, a mathematician and psychologist, has developed psychohistory, a new field of science and psychology that equates all possibilities in large societies to mathematics, allowing for the prediction of future events.</p>
<p>THESE PARAGRAPHS WILL NOT SCROLL Set in the year 0 F.E. ("Foundation Era"), The Psychohistorians opens on Trantor, the capital of the 12,000-year-old Galactic Empire. Though the empire appears stable and powerful, it is slowly decaying in ways that parallel the decline of the Western Roman Empire. Hari Seldon, a mathematician and psychologist, has developed psychohistory, a new field of science and psychology that equates all possibilities in large societies to mathematics, allowing for the prediction of future events.</p>
<p>THESE PARAGRAPHS WILL NOT SCROLL Set in the year 0 F.E. ("Foundation Era"), The Psychohistorians opens on Trantor, the capital of the 12,000-year-old Galactic Empire. Though the empire appears stable and powerful, it is slowly decaying in ways that parallel the decline of the Western Roman Empire. Hari Seldon, a mathematician and psychologist, has developed psychohistory, a new field of science and psychology that equates all possibilities in large societies to mathematics, allowing for the prediction of future events.</p>
<!-- close the off-canvas menu -->
<a class="exit-off-canvas"></a>
</div>
</div>

Make sure to include the JavaScript files.
In the head section:
<script src="/js/vendor/custom.modernizr.js"></script>
Just before the closing body tag:
<script src="/js/vendor/jquery.js"></script>
<script src="/js/vendor/fastclick.js"></script>
<script src="/js/foundation.js"></script>
and the right dependency; in your case, add this:
<script src="/js/foundation.offcanvas.js"></script>
For more details, see http://foundation.zurb.com/docs/javascript.html
Regards

CSS3 class match letter range [a-z]+?

Is it possible to write a CSS rule for any element whose class is "icon-" followed by a set of letters, but not numbers?
According to this article, something like:
[class^='/icon\-([a-zA-Z]+)/'] {}
should work. But for some reason it doesn't.
In particular, I need a style definition for all elements like "icon-user", "icon-ok", etc., but not "icon-16" or "icon-32".
Is it possible at all?

CSS attribute selectors do not support regular expressions.
If you actually read that article closely:
Regex Matching Attribute Selectors
They don’t exist, but wouldn’t that be so cool? I’ve no idea how hard it would be to implement, or how to expensive to parse, but wouldn’t it just be the bomb?
Notice the first three words. They don't exist. That article is nothing more than a blog post lamenting the absence of regex support in CSS attribute selectors.
But if you're using jQuery, James Padolsey's :regex selector for jQuery may interest you. Your given CSS selector might look like this for example:
$(":regex(class, ^icon\-[a-zA-Z]+)")
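
To see what that pattern accepts, here is a small sketch using Python's re module, not jQuery or CSS. The helper name has_letter_icon_class is hypothetical, and the trailing $ anchor is an added assumption so that names such as icon-a16 are rejected as well.

```python
import re

# "icon-" followed only by letters; $ is an added assumption (see lead-in).
ICON_CLASS = re.compile(r"^icon-[a-zA-Z]+$")


def has_letter_icon_class(class_attr):
    """Return True if any class name in the attribute value is icon-<letters>."""
    return any(ICON_CLASS.match(name) for name in class_attr.split())


for value in ["icon-user", "icon-ok", "icon-16", "btn icon-ok", "icon-32 large"]:
    print(value, has_letter_icon_class(value))
```

Splitting the attribute value on whitespace means each class name is tested on its own, which a plain prefix test on the whole attribute would not do.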

I answered this one on facebook but thought I'd best share here too :)
I haven't tested this, so don't shoot me if it doesn't work :) but my guess would be to explicitly target elements that contain the word icon in the class name, and to instruct the browser not to include those classes containing numbers.
Example code:
div[class|=icon]:not(.icon-16, .icon-32, .icon-64, .icon-96) {.....}
Reference:
attribute selectors... (http://www.w3.org/TR/CSS2/selector.html#attribute-selectors):
[att|=val]
Represents an element with the att attribute, its value either being exactly "val" or beginning with "val" immediately followed by "-" (U+002D).
:not selector...
(http://kilianvalkhof.com/2008/css-xhtml/the-css3-not-selector/)
Hope this helps,
Waseem

I tested my previous solution and can confirm that it DOES NOT work (see comment from BoltClock). This however does:
OP: "In particular I need to create style definition for all elements like "icon-user", "icon-ok" etc but not "icon-16" or "icon-32""
The required CSS code would look something like this:
/* target every element where the class name begins with ( ^= ) "icon" but NOT those that begin with ( ^= ) "icon-16" or "icon-32" */
*[class^="icon"]:not([class^="icon-16"]):not([class^="icon-32"]) {.....}
or
/* target every element where the class name begins with ( ^= ) "icon" but NOT those that contain ( *= ) the number "16" or the number "32" */
*[class^="icon"]:not([class*="16"]):not([class*="32"]) { ...... }
Test code:
<!DOCTYPE html>
<html>
<head>
  <style>
    div{border:1px solid #999;margin-bottom:1em;height:100px;}
    *[class|=icon]:not([class|=icon-16]):not([class|=icon-32]) {background:red;color:white;}
  </style>
</head>
<body>
  <div class="icon-something">
    <h4>icon-something</h4>
    <p><strong>IS</strong> targeted therefore background colour will be red</p>
  </div>
  <div class="icon-anotherthing">
    <h4>icon-anotherthing</h4>
    <p><strong>IS</strong> targeted therefore background colour will be red</p>
  </div>
  <div class="icon-16-install">
    <h4>icon-16-install</h4>
    <p>Is <strong>NOT</strong> targeted therefore no background colour</p>
  </div>
  <div class="icon-16-redirect">
    <h4>icon-16-redirect</h4>
    <p>Is <strong>NOT</strong> targeted therefore no background colour</p>
  </div>
  <div class="icon-16-login">
    <h4>icon-16-login</h4>
    <p>Is <strong>NOT</strong> targeted therefore no background colour</p>
  </div>
  <div class="icon-32-install">
    <h4>icon-32-install</h4>
    <p>Is <strong>NOT</strong> targeted therefore no background colour</p>
  </div>
  <div class="icon-32-redirect">
    <h4>icon-32-redirect</h4>
    <p>Is <strong>NOT</strong> targeted therefore no background colour</p>
  </div>
  <div class="icon-32-login">
    <h4>icon-32-login</h4>
    <p>Is <strong>NOT</strong> targeted therefore no background colour</p>
  </div>
</body>
</html>
