C++, subtract certain strings? - c++

This is homework, so I hope you won't give me the direct answer or code, but guide me to the solution.
My problem is that I have this XXX.html file containing thousands of lines of markup, but what I need is to extract this portion:
<html>
...
<table>
<thead>
<tr>
<th class="xxx">xxx</th>
<th>xxx</th> <th>xxx</th> </tr>
</thead>
<tbody>
<tr class=xxx>
<td class="xxx"><a href="xxx" >ZZZ ZZ ZZZ</a></td>
<td>ZZZZ</td> <td class="xxx">ZZZZ</td> </tr> <tr class=xxx>
<td class="xxx"><a href="xxx" >ZZZ ZZ ZZZ</a></td>
<td>ZZZZ</td> <td class="xxx">ZZZZ</td> </tr> <tr class=xxx>
<td class="xxxx"><a href="xxxx" >ZZZ ZZ ZZZ</a></td>
<td>ZZZZ</td> <td class="xxxx">zzzz</td> </tr> <tr class=xxx>
<td class="xxx"><a href="xxxx" >ZZZ ZZ ZZZ</a></td>
... and so on
This is my code so far:
// after open the file
while(!fileOpened.eof()){
getline(fileOpened, reader);
if(reader.find("ZZZ")){
cout << reader << endl;
}
}
"reader" is a string variable that holds each line of the HTML file. The ZZZZ values are live data, so they will change; what should I use instead of the "find" method in that case? (I am really sorry for not mentioning this part earlier.)
But instead of displaying the values I want, it displays some other portions of the HTML file. Why? Is my method wrong? If so, how do I extract the ZZZZ values?

std::string::find does not return a boolean value. It returns an index into the string where the substring match occurs if it is successful, else it returns std::string::npos.
So you would want to say:
if (reader.find("ZZZ") != std::string::npos){
cout << reader << endl;
}
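To see why the original test misfires, here is a small sketch contrasting the two checks. The helper names naive_test and contains are made up for illustration; only std::string::find and std::string::npos come from the standard library.

```cpp
#include <string>

// The check from the question: the returned index is converted to bool.
// Index 0 (a match at the very start) becomes false, while npos
// (no match, a very large value) becomes true.
bool naive_test(const std::string& line, const std::string& needle) {
    return static_cast<bool>(line.find(needle));
}

// The corrected check: compare against npos explicitly.
bool contains(const std::string& line, const std::string& needle) {
    return line.find(needle) != std::string::npos;
}
```

With these, naive_test("<html>", "ZZZ") is true although the line has no match, and naive_test("ZZZ first", "ZZZ") is false although it does, which would explain why unrelated parts of the file get printed.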

In general, plain string matching is too fragile for extracting values from an HTML file. A proper HTML parser is the robust choice; several are available for C++ as off-the-shelf libraries.
Otherwise I'd suggest using a regex library (boost::regex, or std::regex from C++11 onwards). You'll be able to write better expressions to capture the part of the file you are interested in.
Reading by line probably won't work, since an HTML file could be one large line. Outputting each line you find would then simply emit the entire file. Instead, try regexes that look for small sections of the markup and output only those. The regex library will have a way to iterate over all matches (I forgot the exact name).
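As a rough sketch of the regex route, assuming a C++11 compiler with std::regex: std::sregex_iterator walks over every match in a string, which is one way to get "match all" behaviour. The function name extract_cells and the pattern are illustrative assumptions, and the pattern deliberately captures only cells whose content is plain text (cells that wrap a link or other tag are skipped).

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects the text of every <td ...>text</td> cell whose content
// contains no nested tags.
std::vector<std::string> extract_cells(const std::string& html) {
    static const std::regex cell_re(R"(<td[^>]*>([^<]*)</td>)");
    std::vector<std::string> cells;
    const std::sregex_iterator end;  // default-constructed = end of matches
    for (std::sregex_iterator it(html.begin(), html.end(), cell_re);
         it != end; ++it) {
        cells.push_back((*it)[1].str());  // capture group 1 is the cell text
    }
    return cells;
}
```

Because the pattern matches on the tags rather than on the cell contents, it keeps working when the live values change. It is still a regex over HTML, so it will break on markup it was not written for.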

The skeleton code for reading lines from a file should look like this:
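if( !file.good() )
    throw "opening file failed!";
for(;;) {
    std::string line;
    // getline returns the stream, which tests false once a read fails;
    // a last line without a trailing newline still counts as read
    if( !std::getline(file, line) )
        break;
    // reading succeeded, process line
}
if( !file.eof() )
    throw "error before reaching EOF!";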
(That funny-looking loop checks its ending condition in the middle of the loop body. C++ has no dedicated construct for that, so you have to use an endless loop with a break in the middle.)
However, as I said in a comment to your question, reading HTML code line-by-line isn't necessarily useful, as HTML doesn't rely on specific whitespace.
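If line boundaries carry no meaning, one alternative is to read the whole stream into a single string and search that instead. A minimal sketch, where the function name read_all is an assumption for illustration:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads everything remaining in the stream into one string,
// line breaks included.
std::string read_all(std::istream& in) {
    std::ostringstream buffer;
    buffer << in.rdbuf();  // copy the stream's underlying buffer wholesale
    return buffer.str();
}
```

This works with any std::istream, so an opened std::ifstream can be passed in directly.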

Related

Is there documentation for uncommon arguments in R Shiny renderTable?

I am using the solution to the following question:
How can symbols be used in a Shiny table header?
My question is: does anyone know where there might be some reference material for the uncommon arguments? I've looked at the R documentation and have come up short.
I'm referring to arguments such as 'include.colnames' and 'add.to.row' from @Minnow's code in the answer to the original question. Here is the code:
output$mytable2 <- renderTable({mytable()},include.colnames=FALSE,
add.to.row = list(pos = list(0),
command = " <tr> <th> &#931 </th> <th> σ</th> <th> ẟ</th> <th> 🂡</th> <th> ☺ </th> </tr>" ))
Any breadcrumbs are appreciated!
Yes, there is more documentation inside shiny and the packages it uses, but it's a bit hidden. If you look at help(renderTable), you see that besides the explained arguments there is a ... argument. This means that the function passes further arguments on to functions it calls. The help page specifies that renderTable will pass these additional arguments to xtable::xtable() and xtable::print.xtable(). So it's a good idea to look at those help pages, and indeed you find the documentation for add.to.row there.

xpath descendant and descendant-or-self work completely different

I'm trying to find all the second tds among the descendants of the div with the specified id, i.e. 22 and 222. The first solution that came to mind was:
//div[@id='indicator']//td[2]
but it selects only the first table cell, i.e. 22 but not both 22 and 222.
Then I replaced // with /descendant-or-self::node()/ and got the same result (obviously). But when I removed '-or-self' the xpath expression started to work as expected
test1 = test_tree.xpath(u"//div[@id='indicator']/descendant-or-self::node()/td[2]")
print len(test1) #prints 1 (first one: 22)
test1 = test_tree.xpath(u"//div[@id='indicator']/descendant::node()/td[2]")
print len(test1) #prints 2 (22 and 222)
Here is test HTML
<html>
<body>
<div id='indicator'>
<table>
<tbody>
<tr>
<th>1</th>
<th>2</th>
<th>3</th>
</tr>
<tr>
<td>11</td>
<td>22</td>
<td>33</td>
</tr>
<tr>
<td>111</td>
<td>222</td>
<td>333</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
I'm wondering why both expressions don't work identically since all the tds are descendants of div element no matter div included or not.
I think you have found a bug in your XPath processor.
I think I've found the cause of this issue:
http://www.w3.org/TR/xpath20/#id-errors-and-opt
"In some cases, a processor can determine the result of an expression without accessing all the data that would be implied by the formal expression semantics. For example, the formal description of filter expressions suggests that $s[1] should be evaluated by examining all the items in sequence $s, and selecting all those that satisfy the predicate position()=1. In practice, many implementations will recognize that they can evaluate this expression by taking the first item in the sequence and then exiting."
So there is no remedy. It depends on the XPath processor implementation. However, I still don't understand why //div[@id='indicator']/descendant-or-self::node()/td[2] and //div[@id='indicator']/descendant::node()/td[2] produce different results.
I built a web page containing the HTML you provided in your question.
When you use this xpath:
.//div[@id='indicator']//tr/td[2]
It works as expected and the result is:
[u'<td>22</td>', u'<td>222</td>']
However, according to your comment, you were asking why .//td[2] doesn't work. The reason is that .//td gives you a list of all the tds in your DOM. Adding an index such as [2] will result in the second td in that list.
To sum up:
These are the results of applying .//td and .//td[2] respectively:
and if you want to take the text inside these tds, you should add /text() as the following:
Update:
The OP said:
So why then //div[@id='indicator']/descendant::node()/td[2] produces ['22', '222']? According to your comment: "Adding an index such as [2] will result in the second td in that list" it should populate only ['22'].
I will try to explain what is going on here:
descendant::node() is not equivalent to //
the equivalent of // is /descendant-or-self::node()/
It is explained at W3C specification:
I hope this code could help you:

pugixml: selecting nodes fails

I'm using pugixml to parse the following xml:
<td class="title">
<div class="random" />
Link1
</td>
<td class="title">
<div class="random" />
Link2
</td>
etc...
I want the value of every 'a href' in a td class ="title" (which appears an indeterminate number of times) but only the first such instance.
I am using the following code to try and get these values:
pugi::xpath_node_set link_nodes = list_doc.select_nodes("//td[@class='title']");
for (pugi::xpath_node_set::const_iterator it = link_nodes.begin();it != link_nodes.end();++it)
{
pugi::xpath_node single_link_node = *it;
std::cout << single_link_node.node().select_single_node("//a").node().attribute("href").value()<<std::endl;
}
which doesn't seem to work (it outputs number of times but with a value that doesn't even seem to appear within that element).
Thanks.
"//a" selects all "a" nodes in the document; you probably meant ".//a" that selects all "a" nodes in the subtree.
You can also use one XPath expression instead of multiple:
//td[@class='title']//a[1]
This selects the first a tag for each td - i.e. [1] only applies to //a, not to the full expression.

how to filter an xml file using the expressions < or > in xslt?

Hi I'm trying to display a number of things that match a certain criteria. In my xml file i have a bunch of 'suppliers'
<Suppliers>
<ASupplier>
<SupplierId> 12 </SupplierId>
<SupplierName> Amazon </SupplierName>
<Email> Something@live.com </Email>
<StartDate> 01/11/2010 </StartDate>
<ContractLength> 6 </ContractLength>
<AnnualTurnover> 1233.32 </AnnualTurnover>
</ASupplier>
</Suppliers>
This is the code from my xslt file
<xsl:if test="$SearchType = 'Length'">
<xsl:for-each select="ASupplier[$SupplierFilter >= ContractLength]">
<tr>
<td>
<xsl:value-of select="../SupplierName"/>
</td>
</tr>
</xsl:for-each>
</xsl:if>
the 'SearchType' is a parameter.
The problem I'm having is that I'm getting back empty cells instead of a table with the names inside. It returns the correct number of rows but without any data: I have 3 suppliers with a contract length of 6 or less, and if I type 6 and search, it brings back 3 table cells but without any data. Any thoughts?
P.S. I have the functions starts-with and contains that use similar code and work just fine.
The input you show lost its XML tags, so it is hard to see what the XSLT is acting on (I'll put them back).
That said, I would guess that
<xsl:value-of select="../SupplierName"/>
should be
<xsl:value-of select="SupplierName"/>
If SupplierName is a child of ASupplier

Regular expression example: find table row based on matched table data

I want to extract the value of the 3rd td where the 1st td has the value 'Total (A)+(B)+(C)'
<td class="tbmain" height="25"><b>Total (A)+(B)+(C)</b></td>
<td class="tbmain" align="right"><b>100,000</b></td>
<td class="tbmain" align="right"><b>111,111,111</b></td>
<td class="tbmain" align="right"><b>101,101</b></td>
</tr>
You can do this easily with jQuery:
alert($("table tr td:contains('Total (A)+(B)+(C)')").siblings("td:eq(1)").html());
This will return <b>111,111,111</b>, which is the value of the 3rd td where the 1st td has Total (A)+(B)+(C) in its value.
You can do this too when you get the table as a string (example).
But if you really want to do this with a regex, this can help:
<tr>(\s+)?<td.*?>(.*?)?</td>(\s+)?<td.*?>.*?</td>(\s+)?<td.*?>(.*?)</td>
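For comparison, the same idea can be sketched in C++ with std::regex (C++11 assumed). This is only an illustration under stated assumptions, not the approach from the answer above: the function name third_cell_for is made up, each cell is assumed to sit on a single line, and row boundaries are not checked. The label is located with a plain substring search, so its parentheses and plus signs need no escaping.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the inner HTML of the cell two positions after the first cell
// that contains `label`, or an empty string if there is no such cell.
std::string third_cell_for(const std::string& html, const std::string& label) {
    static const std::regex cell_re(R"(<td[^>]*>(.*?)</td>)");
    std::vector<std::string> cells;
    const std::sregex_iterator end;  // default-constructed = end of matches
    for (std::sregex_iterator it(html.begin(), html.end(), cell_re);
         it != end; ++it) {
        cells.push_back((*it)[1].str());  // capture group 1 is the cell body
    }
    for (std::size_t i = 0; i + 2 < cells.size(); ++i) {
        if (cells[i].find(label) != std::string::npos) {
            return cells[i + 2];  // 1st td is cells[i], 3rd td is cells[i + 2]
        }
    }
    return "";
}
```

Called on the four cells from the question with the label Total (A)+(B)+(C), this returns <b>111,111,111</b>, matching the jQuery result.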