pugixml: selecting nodes fails - c++

I'm using pugixml to parse the following xml:
<td class="title">
<div class="random" />
Link1
</td>
<td class="title">
<div class="random" />
Link2
</td>
etc...
For every td with class="title" (there is an indeterminate number of them), I want the href value of the first a element inside it, and only the first.
I am using the following code to try and get these values:
pugi::xpath_node_set link_nodes = list_doc.select_nodes("//td[@class='title']");
for (pugi::xpath_node_set::const_iterator it = link_nodes.begin(); it != link_nodes.end(); ++it)
{
    pugi::xpath_node single_link_node = *it;
    std::cout << single_link_node.node().select_single_node("//a").node().attribute("href").value() << std::endl;
}
This doesn't seem to work: it prints the right number of lines, but each line shows a value that doesn't even appear within that td element.
Thanks.

"//a" selects all "a" nodes in the document; you probably meant ".//a" that selects all "a" nodes in the subtree.
You can also use one XPath expression instead of multiple:
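//td[@class='title']//a[1]
This selects the first a tag for each td, i.e. [1] only applies to //a, not to the full expression. (Strictly, a[1] is evaluated per parent element, so it matches every a that is the first a child of its own parent inside the td.)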

Related

xpath descendant and descendant-or-self work completely different

I am trying to find all second tds among the descendants of the div with the specified id, i.e. 22 and 222. The first solution that came to mind was:
//div[@id='indicator']//td[2]
but it selects only the first such table cell, i.e. 22, rather than both 22 and 222.
Then I replaced // with /descendant-or-self::node()/ and got the same result (obviously). But when I removed '-or-self', the XPath expression started to work as expected:
test1 = test_tree.xpath(u"//div[@id='indicator']/descendant-or-self::node()/td[2]")
print len(test1) #prints 1 (first one: 22)
test1 = test_tree.xpath(u"//div[@id='indicator']/descendant::node()/td[2]")
print len(test1) #prints 2 (22 and 222)
Here is test HTML
<html>
<body>
<div id='indicator'>
<table>
<tbody>
<tr>
<th>1</th>
<th>2</th>
<th>3</th>
</tr>
<tr>
<td>11</td>
<td>22</td>
<td>33</td>
</tr>
<tr>
<td>111</td>
<td>222</td>
<td>333</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
I'm wondering why the two expressions don't behave identically, since all the tds are descendants of the div element whether or not the div itself is included.
I think you have found a bug in your XPath processor.
I think I've found the cause of this issue:
http://www.w3.org/TR/xpath20/#id-errors-and-opt
"In some cases, a processor can determine the result of an expression without accessing all the data that would be implied by the formal expression semantics. For example, the formal description of filter expressions suggests that $s[1] should be evaluated by examining all the items in sequence $s, and selecting all those that satisfy the predicate position()=1. In practice, many implementations will recognize that they can evaluate this expression by taking the first item in the sequence and then exiting."
So there is no remedy; it depends on the XPath processor implementation. However, I still don't understand why //div[@id='indicator']/descendant-or-self::node()/td[2] and //div[@id='indicator']/descendant::node()/td[2] produce different results.
I built a web page containing the HTML you provided in your question.
When you use this xpath:
.//div[@id='indicator']//tr/td[2]
It works as expected and the result is:
[u'<td>22</td>', u'<td>222</td>']
However, according to your comment, you were asking why .//td[2] doesn't work. The reason is that .//td gives you a list of all the tds in your DOM. Adding an index such as [2] will result in the second td in that list.
To sum up:
These are the results of applying .//td and .//td[2] respectively:
And if you want to take the text inside these tds, you should add /text() as follows:
Update:
The OP said:
So why then does //div[@id='indicator']/descendant::node()/td[2] produce ['22', '222']? According to your comment, "Adding an index such as [2] will result in the second td in that list", it should return only ['22'].
I will try to explain what is going on here:
descendant::node() is not equivalent to //
The equivalent of // is /descendant-or-self::node()/
This is explained in the W3C specification:
I hope this code could help you:

How to write a schematron to ensure list-items are alphanumeric?

Is it possible to use schematron to ensure that the list items are in alphanumeric order?
<ul>
<li>1</li>
<li>a</li>
<li>d</li>
<li>g</li>
</ul>
Many thanks!
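Yes, it is possible. You can use something like this example rule, which reports every <li> element whose value is lower than (lt) the value of its previous <li> sibling. Note that lt is an XPath 2.0 operator, so the schema needs an XPath 2.0 query binding (for example queryBinding="xslt2").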
<sch:rule context="li">
<sch:report test=". lt preceding-sibling::li[1]">
This li value is lower than its previous li sibling value.
</sch:report>
</sch:rule>

Get text but exclude node if it has a certain child in a for-each loop in XSLT?

I'm trying to get the following text
Divided into:
Bonaire, Sint Eustatius and Saba (BQ, BES, 535)
Curaçao (CW, CUW, 531)
Sint Maarten (Dutch part) (SX, SXM, 534)
out of this source (excerpt):
<td>
Divided into:<br />
Bonaire, Sint Eustatius and Saba (<tt>BQ</tt>, <tt>BES</tt>, <tt>535</tt>) <sup id="cite_ref-7" class="reference">
<a href="#cite_note-7">
<span>[</span>note 4<span>]</span>
</a>
</sup><br />
Curaçao (<tt>CW</tt>, <tt>CUW</tt>, <tt>531</tt>)<br />
Sint Maarten (Dutch part) (<tt>SX</tt>, <tt>SXM</tt>, <tt>534</tt>)
</td>
This is easily done with <xsl:value-of select="td[4]"/> (it's the 4th td element, and I'm looping over the surrounding trs).
But I want to exclude the text [note 4], i.e. every a that has span children.
I tried td[4]/node()[not(descendant::span)], but it only left Divided into:. td[4][not(//span)] always gives empty strings.
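When you select td[4]/node()[not(descendant::span)], you get the child nodes of the fourth td that have no span descendant. xsl:value-of in XSLT 1.0 outputs only the first node of that set, which is why only "Divided into:" is left. In td[4][not(//span)], //span is evaluated against the whole document, so the predicate is false whenever the document contains any span, and you get empty results.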
What you need is a template that matches the nodes inside td[4] and outputs their text:
<xsl:template match="td[4]/node()"> ... </xsl:template> <!-- match child nodes of td[4] -->
and another template to specifically catch the span node:
<xsl:template match="span | text()[preceding-sibling::span] | text()[following-sibling::span]"/>

xpath expression dependent upon siblings

<a id ="1">
...<c>
......<b/>
......<f/>
......<b/>
......<f/>
...</c>
</a>
<a id="2">
...<c>
......<b/>
......<f/>
......<f/>
...</c>
</a>
If any b element is immediately followed by two or more consecutive f elements, return the enclosing a node. I would prefer a plain XPath 2.0 solution, if possible. What XPath will get me a2 but not a1? I have tried following-sibling, position() and the like, to no avail.
With XPath 1.0:
a[.//b[following-sibling::*[1]/self::f and following-sibling::*[2]/self::f]]
You could do:
//a[.//b/following-sibling::*[1][self::f]/following-sibling::*[1][self::f]]
This says to find the a element that contains a b element which is immediately followed by an f element, which in turn is immediately followed by another f element.
Something like below will work using XPATH 1.0:
//b[following-sibling::*[
position()=1 and self::f
and
./following-sibling::*[
position()= 1 and self::f
]
]
]/ancestor::a[1]
output
<a id="2">
...<c>
......<b/>
......<f/>
......<f/>
...</c>
</a>
You can get a2 using this:
//a[@id=2]
just by using the id attribute.
I came up with this:
/a//c[f/following-sibling::*[1] = f]

C++, subtract certain strings?

This is homework, so I hope you won't give me direct answers or code, but rather guide me to the solution.
My problem is that I have this XXX.html file containing thousands of lines of markup, and I only need to extract this portion:
<html>
...
<table>
<thead>
<tr>
<th class="xxx">xxx</th>
<th>xxx</th> <th>xxx</th> </tr>
</thead>
<tbody>
<tr class=xxx>
<td class="xxx"><a href="xxx" >ZZZ ZZ ZZZ</a></td>
<td>ZZZZ</td> <td class="xxx">ZZZZ</td> </tr> <tr class=xxx>
<td class="xxx"><a href="xxx" >ZZZ ZZ ZZZ</a></td>
<td>ZZZZ</td> <td class="xxx">ZZZZ</td> </tr> <tr class=xxx>
<td class="xxxx"><a href="xxxx" >ZZZ ZZ ZZZ</a></td>
<td>ZZZZ</td> <td class="xxxx">zzzz</td> </tr> <tr class=xxx>
<td class="xxx"><a href="xxxx" >ZZZ ZZ ZZZ</a></td>
... and so on
This is my code so far:
// after open the file
while(!fileOpened.eof()){
getline(fileOpened, reader);
if(reader.find("ZZZ")){
cout << reader << endl;
}
}
"reader" is a string variable that holds each line of the HTML file. The ZZZZ values are live data, so they will change; what method should I use instead of "find" in that case? (I am really sorry for not mentioning this part.)
But instead of displaying the value that I want, it displays other portions of the HTML file. Why? Is my method wrong? If so, how do I extract the ZZZZ value?
std::string::find does not return a boolean value. It returns an index into the string where the substring match occurs if it is successful, else it returns std::string::npos.
So you would want to say:
if (reader.find("ZZZ") != std::string::npos){
cout << reader << endl;
}
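To make the difference concrete, here is a minimal sketch that compares the corrected test with the one from the question. The helper names contains and buggy_contains are made up for this illustration; only std::string::find and std::string::npos come from the standard library.

```cpp
#include <string>

// Correct test: find() reports "no match" by returning std::string::npos.
bool contains(const std::string& line, const std::string& needle) {
    return line.find(needle) != std::string::npos;
}

// Mirrors the condition from the question: the index returned by find()
// is converted to bool, so only index 0 is treated as "false".
bool buggy_contains(const std::string& line, const std::string& needle) {
    return static_cast<bool>(line.find(needle));
}
```

With these helpers, buggy_contains("<td>abc</td>", "ZZZ") is true although nothing matches (npos is a large non-zero value), and buggy_contains("ZZZ first", "ZZZ") is false although the match is at index 0. That inversion is consistent with the original loop printing unrelated lines.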
In general, plain string matching is not a reliable way to extract values from an HTML file. A proper HTML parser is required; several are available for C++ as libraries.
Otherwise I'd suggest using a regex library (boost::regex, or std::regex from C++11 onwards). You'll be able to write better expressions to capture the part of the file you are interested in.
Reading by line probably won't work, since an HTML file could be one large line. Outputting each matching line would then simply emit the entire file. Instead, try regexes that look for small sections of the markup and output only those. The regex library will have a "match all" facility (I forget the exact name).
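As a rough sketch of the "match all" idea with the standard library, std::sregex_iterator walks over every match in a string. The function name extract_cells and the pattern are assumptions made for this example: the pattern only captures td cells that contain plain text and no nested tags, and it is not a substitute for a real HTML parser.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects the text of every <td ...>text</td> cell that has no nested tags.
// Hypothetical helper; the pattern is an assumption about the markup.
std::vector<std::string> extract_cells(const std::string& html) {
    static const std::regex cell_re("<td[^>]*>([^<]*)</td>");
    std::vector<std::string> cells;
    // A default-constructed sregex_iterator marks the end of the matches.
    for (std::sregex_iterator it(html.begin(), html.end(), cell_re), end;
         it != end; ++it) {
        cells.push_back((*it)[1].str());  // capture group 1: the cell text
    }
    return cells;
}
```

Because the whole input is searched at once, this works the same whether the HTML is spread over many lines or sits on a single one. Cells that wrap their text in another element, such as the a links in the question, are skipped by this particular pattern.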
The skeleton code for reading lines from a file should look like this:
if (!file.good())
    throw "opening file failed!";
for (;;) {
    std::string line;
    // test the read itself, so a last line without a trailing newline is kept
    if (!std::getline(file, line))
        break;
    // reading succeeded, process line
}
if (!file.eof())
    throw "read error before reaching EOF";
(That funny-looking loop is one that checks its ending condition in the middle of the loop body. There is no dedicated construct for that in C++, so you have to use an endless loop with a break in the middle.)
However, as I said in a comment to your question, reading HTML code line by line isn't necessarily useful, as HTML doesn't rely on specific whitespace.
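The line-reading skeleton above is also commonly written with the result of std::getline as the loop condition. The following is a minimal sketch of that variant; read_lines is a hypothetical helper name, and it takes any std::istream so that it can be tried on a string stream as well as on a file.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads all lines from `in`. std::getline returns the stream, which
// converts to false once a read fails, so the loop stops at end of input
// or on an error.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);  // reading succeeded, keep the line
    }
    return lines;
}
```

After the call, in.eof() tells a normal end of input apart from an earlier read error, just like the final check in the skeleton. Testing the read itself also keeps a final line that has no trailing newline.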