Regular expression example: find table row based on matched table data - regex

I want to extract the value of the 3rd td in the row whose 1st td has the value 'Total (A)+(B)+(C)':
<td class="tbmain" height="25"><b>Total (A)+(B)+(C)</b></td>
<td class="tbmain" align="right"><b>100,000</b></td>
<td class="tbmain" align="right"><b>111,111,111</b></td>
<td class="tbmain" align="right"><b>101,101</b></td>
</tr>

You can do this easily with jQuery:
alert($("table tr td:contains('Total (A)+(B)+(C)')").siblings("td:eq(1)").html());
This returns <b>111,111,111</b>, which is the content of the 3rd td in the row whose 1st td contains Total (A)+(B)+(C).
You can do the same when you have the table as a string.
But if you really want to do this with a regex, this can help:
<tr>(\s+)?<td.*?>(.*?)?</td>(\s+)?<td.*?>.*?</td>(\s+)?<td.*?>(.*?)</td>
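The pattern above matches the first three cells of any row, so it does not by itself pick the row labelled Total (A)+(B)+(C). A minimal sketch of one way to anchor the match on that label, written in Python and assuming the row looks exactly like the snippet in the question (the opening <tr> and the variable names are illustrative assumptions):

```python
import re

# Sample row modelled on the question; the opening <tr> is assumed.
html = """<tr>
<td class="tbmain" height="25"><b>Total (A)+(B)+(C)</b></td>
<td class="tbmain" align="right"><b>100,000</b></td>
<td class="tbmain" align="right"><b>111,111,111</b></td>
<td class="tbmain" align="right"><b>101,101</b></td>
</tr>"""

# Escape the label so that (, ) and + are matched literally.
label = re.escape("Total (A)+(B)+(C)")

pattern = re.compile(
    r"<td[^>]*>\s*<b>" + label + r"</b>\s*</td>"  # 1st td: the label
    r"\s*<td[^>]*>.*?</td>"                       # 2nd td: skipped
    r"\s*<td[^>]*>\s*<b>(.*?)</b>\s*</td>",       # 3rd td: captured
    re.DOTALL,
)

match = pattern.search(html)
third_value = match.group(1) if match else None
print(third_value)
```

This only holds while the markup keeps this exact shape; nested tables or reordered cells would need a real HTML parser.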

Related

Tables in R markdown

I would like to create a table manually in R Markdown. I am aiming for the final output to look as follows:
I tried the following code but it did not work:
Authority | Responsibility | Period
:----- | :---- | :-----
MOIWR | Text 1 | 2010
^^ | Text 2 | 2011
^^ | Text 3 | 2012
IWC | Text 4 | 2013
SGB | Text 5 | |
Could you please help me figure out how to do that?
Pandoc, the converter used in R Markdown, does not yet support Markdown tables with cells spanning multiple rows and/or columns. A good workaround is to write the table in HTML and to parse it in a Lua filter.
The following filter detects HTML tables and makes sure they can be converted to different output formats:
function RawBlock (raw)
  if raw.format:match 'html' and raw.text:match '^%s*%<table' then
    return pandoc.read(raw.text, 'html').blocks
  end
end
Use the filter like this:
---
output:
  html_document:
    pandoc_args:
      - '--lua-filter=html-table.lua'
---
``` {=html}
<table>
<tr>
<td>column 1</td>
<td>column 2</td>
</tr>
<tr>
<td colspan="2">column 1 and 2</td>
</tr>
</table>
```

ImportXML "//tr/td[@class='X']" only when inside "//tr" there is also "//tr/td[@class='Y']"

Example: the page I want to query with IMPORTXML (a Google Sheets function) has the following structure (Status & Score → td class):
<tr class="odd expanded match no-date-repetition">
<td class="date no-repetition">
<td class="score-time status">
<tr class="odd expanded match no-date-repetition">
<td class="date no-repetition">
<td class="score-time score">
<tr class="odd expanded match no-date-repetition">
<td class="date no-repetition">
<td class="score-time status">
I do not need the data in td class = "date no-repetition" when it is not followed by td class = "score-time status". Is there any way to filter the import so that "home" is imported only when the same "TR" contains both the "date no-repetition" and "score-time status" classes? The site has no fixed data location, so I can't pick date no-repetition[1], date no-repetition[3], date no-repetition[5] to define which "home" values to import.
=ARRAYFORMULA(TRANSPOSE(SPLIT(CONCATENATE("♦"&TEXT(REGEXEXTRACT(
IMPORTXML(A4, A3), "\d+/\d+/\d+"), "dd/mm/yyyy")&"♦"&IMPORTXML(A4, A3)), "♦")))
or, more simply:
=ARRAYFORMULA({TEXT(REGEXEXTRACT(
IMPORTXML(A4, A3), "\d+/\d+/\d+"), "dd/mm/yy"),IMPORTXML(A4, A3)})
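The row-level condition asked about in the question is usually written in XPath 1.0 as a predicate on the tr, for example //tr[td[@class='score-time status']]/td[@class='date no-repetition']. The sketch below shows the same filter in Python, using only the standard library. The sample markup and cell contents are hypothetical stand-ins for the real page, and the date cell is selected only because it is the one named in the question's structure; any other cell of the same row could be picked the same way.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page structure in the question.
page = """<table>
<tr class="odd expanded match no-date-repetition">
  <td class="date no-repetition">01/02/2020</td>
  <td class="score-time status">19:00</td>
</tr>
<tr class="odd expanded match no-date-repetition">
  <td class="date no-repetition">02/02/2020</td>
  <td class="score-time score">2 - 1</td>
</tr>
<tr class="odd expanded match no-date-repetition">
  <td class="date no-repetition">03/02/2020</td>
  <td class="score-time status">20:30</td>
</tr>
</table>"""

root = ET.fromstring(page)

dates = []
for tr in root.iter("tr"):
    # Keep the row only if it also has a "score-time status" cell.
    if tr.find("td[@class='score-time status']") is None:
        continue
    date_cell = tr.find("td[@class='date no-repetition']")
    if date_cell is not None:
        dates.append(date_cell.text)

print(dates)
```

ElementTree supports only a subset of XPath, which is why the row test is done in a loop here rather than in a single nested predicate.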

xpath descendant and descendant-or-self work completely different

I am trying to find every second td among the descendants of the div with the specified id, i.e. 22 and 222. The first solution that came to mind was:
//div[@id='indicator']//td[2]
but it selects only the first such cell, i.e. 22, rather than both 22 and 222.
Then I replaced // with /descendant-or-self::node()/ and got the same result (obviously). But when I removed '-or-self', the XPath expression started to work as expected:
test1 = test_tree.xpath(u"//div[@id='indicator']/descendant-or-self::node()/td[2]")
print(len(test1))  # prints 1 (first one: 22)
test1 = test_tree.xpath(u"//div[@id='indicator']/descendant::node()/td[2]")
print(len(test1))  # prints 2 (22 and 222)
Here is test HTML
<html>
<body>
<div id='indicator'>
<table>
<tbody>
<tr>
<th>1</th>
<th>2</th>
<th>3</th>
</tr>
<tr>
<td>11</td>
<td>22</td>
<td>33</td>
</tr>
<tr>
<td>111</td>
<td>222</td>
<td>333</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
I'm wondering why the two expressions don't behave identically, since all the tds are descendants of the div element whether or not the div itself is included.
I think you have found a bug in your XPath processor.
I think I've found the cause of this issue:
http://www.w3.org/TR/xpath20/#id-errors-and-opt
"In some cases, a processor can determine the result of an expression without accessing all the data that would be implied by the formal expression semantics. For example, the formal description of filter expressions suggests that $s[1] should be evaluated by examining all the items in sequence $s, and selecting all those that satisfy the predicate position()=1. In practice, many implementations will recognize that they can evaluate this expression by taking the first item in the sequence and then exiting."
So there is no remedy: it depends on the XPath processor implementation. However, I still don't understand why //div[@id='indicator']/descendant-or-self::node()/td[2] and //div[@id='indicator']/descendant::node()/td[2] produce different results.
I built a web page containing the HTML you provided in your question.
When you use this XPath:
.//div[@id='indicator']//tr/td[2]
It works as expected and the result is:
[u'<td>22</td>', u'<td>222</td>']
However, according to your comment, you were asking when .//td[2] doesn't work. The reason is .//td gives you a list of all the td(s) in your DOM. Adding an index such as [2] will result in the second td in that list
To sum up:
These are the results of applying .//td and .//td[2] respectively:
and if you want to take the text inside these tds, you should add /text() as the following:
Update:
The OP said:
So why then does //div[@id='indicator']/descendant::node()/td[2] produce ['22', '222']? According to your comment: "Adding an index such as [2] will result in the second td in that list" it should populate only ['22'].
I will try to explain what is going on here:
descendant::node() is not equivalent to //
the equivalent of // is: /descendant-or-self::node()/
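For comparison, under the XPath 1.0 rules a positional predicate such as [2] applies to the step it is attached to, so td[2] after // means "a td that is the second td child of its own parent", not "the second td in the whole result list". The sketch below contrasts the two readings using Python's standard xml.etree.ElementTree, which implements only a subset of XPath; the variable names are illustrative, and other processors may behave differently, as the question shows.

```python
import xml.etree.ElementTree as ET

# The div from the test HTML in the question, reduced to well-formed XML.
html = """<div id="indicator"><table><tbody>
<tr><th>1</th><th>2</th><th>3</th></tr>
<tr><td>11</td><td>22</td><td>33</td></tr>
<tr><td>111</td><td>222</td><td>333</td></tr>
</tbody></table></div>"""

div = ET.fromstring(html)

# Per-parent position: every td that is the 2nd td child of its parent row.
per_parent = [td.text for td in div.findall(".//td[2]")]

# Position in the flattened list: collect all tds first, then take the 2nd.
flattened_second = div.findall(".//td")[1].text

print(per_parent)
print(flattened_second)
```

The first form is what the asker expected from //td[2]; the second corresponds to the parenthesised XPath form (//td)[2].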
It is explained at W3C specification:
I hope this code could help you:

pugixml: selecting nodes fails

I'm using pugixml to parse the following xml:
<td class="title">
<div class="random" />
Link1
</td>
<td class="title">
<div class="random" />
Link2
</td>
etc...
I want the value of every 'a href' in a td class ="title" (which appears an indeterminate number of times) but only the first such instance.
I am using the following code to try and get these values:
pugi::xpath_node_set link_nodes = list_doc.select_nodes("//td[@class='title']");
for (pugi::xpath_node_set::const_iterator it = link_nodes.begin(); it != link_nodes.end(); ++it)
{
    pugi::xpath_node single_link_node = *it;
    std::cout << single_link_node.node().select_single_node("//a").node().attribute("href").value() << std::endl;
}
which doesn't seem to work (it prints the expected number of lines, but with a value that doesn't even seem to appear within that element).
Thanks.
"//a" selects all "a" nodes in the document; you probably meant ".//a" that selects all "a" nodes in the subtree.
You can also use one XPath expression instead of multiple:
//td[@class='title']//a[1]
This selects the first a tag for each td, i.e. [1] only applies to //a, not to the full expression.
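The difference between a document-wide search and a search relative to the current node can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration of the idea, not pugixml code, and the href values and wrapping table are hypothetical because the links were not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the href values are invented for illustration.
xml = """<table><tr>
<td class="title"><div class="random"/><a href="/first">Link1</a></td>
<td class="title"><div class="random"/><a href="/second">Link2</a></td>
</tr></table>"""

root = ET.fromstring(xml)

# Searching from the document root always finds the same first link,
# which is what an absolute "//a" does regardless of the context node.
document_first = root.find(".//a").get("href")

# Searching relative to each td (".//a") finds that cell's own first link.
hrefs = [
    td.find(".//a").get("href")
    for td in root.findall(".//td[@class='title']")
]

print(document_first)
print(hrefs)
```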

C++, subtract certain strings?

This is homework, so I hope you won't give me direct answers/code, but rather guide me to the solution.
My problem: I have this XXX.html file containing thousands of lines of code, but I only need to extract this portion:
<html>
...
<table>
<thead>
<tr>
<th class="xxx">xxx</th>
<th>xxx</th> <th>xxx</th> </tr>
</thead>
<tbody>
<tr class=xxx>
<td class="xxx"><a href="xxx" >ZZZ ZZ ZZZ</a></td>
<td>ZZZZ</td> <td class="xxx">ZZZZ</td> </tr> <tr class=xxx>
<td class="xxx"><a href="xxx" >ZZZ ZZ ZZZ</a></td>
<td>ZZZZ</td> <td class="xxx">ZZZZ</td> </tr> <tr class=xxx>
<td class="xxxx"><a href="xxxx" >ZZZ ZZ ZZZ</a></td>
<td>ZZZZ</td> <td class="xxxx">zzzz</td> </tr> <tr class=xxx>
<td class="xxx"><a href="xxxx" >ZZZ ZZ ZZZ</a></td>
... and so on
This is my current code so far:
// after opening the file
while(!fileOpened.eof()){
    getline(fileOpened, reader);
    if(reader.find("ZZZ")){
        cout << reader << endl;
    }
}
The "reader" is a string variable that holds each line of the HTML file. The value of ZZZZ is live data and will change, so what method should I use instead of "find"? (I am really sorry for not mentioning this part.)
But instead of displaying the value that I want, it displays some other portions of the HTML file. Why? Is my method wrong? If so, how do I extract the ZZZZZ value?
std::string::find does not return a boolean value. It returns an index into the string where the substring match occurs if it is successful, else it returns std::string::npos.
So you would want to say:
if (reader.find("ZZZ") != std::string::npos){
    cout << reader << endl;
}
In general, plain string matching is not a reliable way to extract values from an HTML file. A proper HTML parser would be required; they are available for C++ as third-party libraries.
Otherwise I'd suggest using a regex library (boost::regex, or std::regex since C++11). You'll be able to write better expressions to capture the part of the file you are interested in.
Reading by line probably won't work since an HTML file could be one large line. Outputting each line that matches would then simply emit the entire file. So try regexes, look for small sections of the markup, and output those. The regex library will have a "match all" facility (I forget the exact name).
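To illustrate the "match all" idea without handing over the C++ homework solution, here is a short sketch in Python; in C++ the corresponding tool is an iterator such as std::sregex_iterator. The sample string is a shortened copy of one row from the question, and the pattern assumes td cells are never nested.

```python
import re

# One row from the question, all on a single line as HTML allows.
html = ('<tr class=xxx> <td class="xxx"><a href="xxx" >ZZZ ZZ ZZZ</a></td> '
        '<td>ZZZZ</td> <td class="xxx">ZZZZ</td> </tr>')

# Find every td body in the whole text, regardless of line breaks.
raw_cells = re.findall(r"<td[^>]*>(.*?)</td>", html, re.DOTALL)

# Drop any remaining inner tags (such as the <a ...> wrapper).
cells = [re.sub(r"<[^>]+>", "", cell).strip() for cell in raw_cells]

print(cells)
```

Because the search runs over the whole text rather than line by line, it does not matter where the file breaks its lines.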
The skeleton code for reading lines from a file should look like this:
if( !file.good() )
    throw "opening file failed!";
for(;;) {
    std::string line;
    std::getline(file, line);
    // fail() is set only when no line could be read, so a final line
    // without a trailing newline is still processed
    if( file.fail() )
        break;
    // reading succeeded, process line
}
if( !file.eof() )
    throw "read error before reaching EOF!";
(That funny-looking loop checks its ending condition in the middle of the loop body. C++ has no dedicated construct for this, so you have to use an endless loop with a break in the middle.)
However, as I said in a comment to your question, reading HTML code line-by-line isn't necessarily useful, as HTML doesn't rely on specific whitespace.