Tree-like matches in regex with a fixed chain - regex

i have a very specific task to achieve with a single regex.
Here's the pattern of the text i have to extract the data from (note i'm parsing HTML-like code, stored in an immutable file) :
<tr>
<td > <a ><img /></a>
</td>
<td > <a ><span >RootData</span></a>
</td>
<td > Data1.1
</td>
<td > <a ><img /></a>
</td>
<td > <a ><span >Data1.2</span></a>
</td>
<td >  
</td></tr>
<tr>
<td > Data2.1
</td>
<td > <a ><img /></a>
</td>
<td > <a ><span >Data2.2</span></a>
</td>
<td >  
</td></tr>
...
First there's a root contained inside the first "tr". Still inside this one, there's some datq (Data1.1 and Data1.2) to extract.
Then comes a finite number of "tr" block each containing data to extract.
I'd like the matches to be like this :
match 1 : 'RootData' 'Data1.1' 'Data1.2'
match 2 : 'RootData' 'Data2.1' 'Data2.2'
etc
So far i see what to do with 2 regex and 2 loops (like 1 searching for the Root, and the other to find all datas from this root) but i'd like it to be in a single regex.
If some of you already encountered that and could help, that'd be nice :)
Thanks in advance.

If I understand you correctly, you'd like to have a single regular expression provide more than one match for the same input. Regular expressions do not work that way, and are probably just not the right tool for the problem you're trying to solve.

Related

XPath to find specific rows and related values based on a passed parameter

I'm trying to grab the value from the lights node, based on a house number set in a parameter. The problem is, based on certain conditions, houses may be in different row positions.
If the parameter being sent to me for the house number is House237, then how to I get the number of lights located within the row-2-Lights node?
Also, how do I do the same if the next run, the house number is House867? Below is my HTML:
<?xml version='1.0' encoding='utf-8'?>
<table id="neighborhood">
<tr onmouseover="leave('1')">
<td id="row-1-house">
<div class="houseCol">
<a href="#" onClick="goHome('867');return false">
House867
</a>
</div>
</td>
<td id="row-1-Lights">
<div class="decimal">14</div>
</td>
</tr>
<tr onmouseover="leave('2')">
<td id="row-2-house">
<div class="houseCol">
<a href="#" onClick="goHome('237');return false">
House237
</a>
</div>
</td>
<td id="row-2-Lights">
<div class="decimal">12</div>
</td>
</tr>
</table>
You can try the following XPath-1.0 expression. The parameter is the 'HouseXXX' string, the child of the a element.
/table[#id='neighborhood']/tr[td/div[#class='houseCol']/a[normalize-space(text())='House237']]/td[contains(#id,'Lights')]/div[#class='decimal']/text()
The output of this is
12
In this example the parameter is set to 'House237'. How you incorporate the parameter into the XPath expression depends on your usecase scenario.
For example, in XSLT you would replace 'House237' with a variable like $HouseNumber to set the parameter.

How to extract parent and nested items in single regular expression?

I want to use regular expression to match the following html table:
<tbody class=\"DocTableBody \">
<tr data-fastRow=\"1\" class=\"DataRow TDRE\">
<td id=\"g-f-1\" class=\"TDC FieldDisabled Field TCLeft CellText g-f\" >
<div class=\"DTC\">
<label id=\"c_g-f-1\" class=\"DCC\" >01-Apr-2015</label>
</div>
</td>
<td id=\"g-g-1\" class=\"TDC FieldDisabled Field TCLeft CellTextHtml g-g\" >
<div class=\"DTC\">
<label id=\"c_g-g-1\" class=\"DCC\" >ACTIVE</label>
</div>
</td>
</tr>
<tr data-fastRow=\"2\" class=\"DataRow TDRO\">
<td id=\"g-f-2\" class=\"TDC FieldDisabled Field TCLeft CellText g-f\" >
<div class=\"DTC\">
<label id=\"c_g-f-2\" class=\"DCC\" >01-Apr-2015</label>
</div>
</td>
<td id=\"g-g-2\" class=\"TDC FieldDisabled Field TCLeft CellTextHtml g-g\" >
<div class=\"DTC\">
<label id=\"c_g-g-2\" class=\"DCC\" >ACTIVE</label>
</div>
</td>
</tr>
</tbody>
I expected to extract the following value:
"1"
01-Apr-2015
ACTIVE
"2"
01-Apr-2015
ACTIVE
I tried the following to extract the value in data-fastRow:
(?sUi)<tr data-fastRow=\\"(\d+)\\".+>.*<\/tr>
But I couldn't extract the nested items in <label.+>(.*)</label> in single regular expression.
Is that possible to extract parent and nested items in single regular expression?
It's a really bad idea to parse HTML with regular expressions.
Each languahe has its own libraries to parse HTML.
In Python for example you have BeautifulSoup.
It's by far much better to use such libraries.
Usually, such libraries has jQuery-Selector-like interface (or something like that), which allows you to find your data with extremely easy queries.

regular expression: what's wrong with my expression?

I have a difficulty building a regex.
Suppose there is a html clip as below.
I want to use Javascript to cut the <tbody> part with the link of "apple"(which <a> is inside of the <td class="by">)
I construct the following expression :
/<tbody.*?text[\s\S]*?<td class="by"[\s\S]*?<a.*?>apple<\/a>[\s\S]*?<\/tbody>/g
But the result is different from what I wanted. Each match contains more than one block of <tbody>. How it should be? Regards!!!!
(I tested with https://regex101.com/ and get the unexpected selection. Please forgive me I can't figure out the problem :( )
<tbody id="text_0">
<td class="by">
...lots of other tags
cat
...lots of other tags
</td>
</tbody>
<tbody id="text_1">
...lots of other tags
<td class="by">
apple
</td>
...lots of other tags
</tbody>
<tbody id="text_2">
...lots of other tags
<td class="by">
cat
</td>
...lots of other tags
</tbody>
<tbody id="text_3">
...lots of other tags
<td class="by">
...lots of other tags
tiger
</td>
...lots of other tags
</tbody>
<tbody id="text_4">
<td class="by">
banana
</td>
</tbody>
<tbody id="text_5">
<td class="by">
peach
</td>
</tbody>
<tbody id="text_6">
<td class="by">
apple
</td>
</tbody>
<tbody id="text_7">
<td class="by">
banana
</td>
</tbody>
And this is what i expect to get
<tbody id="text_1">
<td class="by">
apple
</td>
</tbody>
<tbody id="text_6">
<td class="by">
apple
</td>
</tbody>
This is not an answer to the regex part of the question, but shouldn't the td elements be embedded in tr elements? tr stands for "table row", while tbody stands for "table body". tbody usually groups the table rows. It is not prohibited to have more than one tbody in the same table, but it is usually not necessary. (tbody is actually optional; you can have tr directly inside the table element.)
First, Regex is not a good solution for parsing anything like HTML or XML.
I can fix your pattern to work with this specific example but I can't guarantee that it will work in all cases. Regex just is not the right tool for the job.
But anyway, replace the first 2 instances of [\s\S] in your pattern with [^<].
<tbody.*?text[^<]*?<td class="by"[^<]*?<a.*?>apple<\/a>[\s\S]*?</tbody>
Start with this working regexp and go from there:
/<a href="(.*?)">apple<\/a>/g
If that is too broad and you want to make it more specific, add the next surrounding tag:
/<td.*?>\s*<a href="(.*?)">apple<\/a>/g
Then continue:
/<tbody.*?>\s*<td.*?>\s*<a href="(.*?)">apple<\/a>/g
Also, consider an alternate solution such as XPATH. Regular expressions can't really parse all variations of HTML.

Regular expression with multiple results

What's wrong with my regex ?
"/Blabla\(2\) :.*<tr><td class=\"generic\">(.*)<\/td>.+<\/tr>/Uis"
....
<tr>
<td class="aaa">Blabla(1) :</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1</td><td class="generic">word2 </td><td class="generic">word3</td></tr>
<tr><td class="generic">word4</td><td class="generic">word5 </td><td class="generic">word6</td></tr>
</tbody></table>
</td>
</tr>
<tr>
<td class="aaa">Blabla(2) :</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1b</td><td class="generic">word2b </td><td class="generic">word3b</td></tr>
<tr><td class="generic">word4b</td><td class="generic">word5b </td><td class="generic">word6b</td></tr>
</tbody></table>
</td>
</tr
What I want to do is to get the content of the FIRST TD of each TR from the block beginning with Blabla(2).
So the expected answer is word1b AND word4b
But only the first is returned...
Thank you for your help. Please don't answer me to use a DOM navigator, it's not possible in my case.
That's an interesting regex, in which I learned about the ungreedy flag, nice!
And for your problem, you might make use of \G to match immediately after the previous match and the flag g, assuming PCRE engine:
/(?:Blabla\(2\) :|(?<!^)\G).*<tr><td class=\"generic\">(.*)<\/td>.+<\/tr>/Uisg
regex101 demo
Or a little shorter with different delimiters:
'~(?:Blabla\(2\) :|(?<!^)\G).*<tr><td class="generic">(.*)</td>.+</tr>~Uisg'
Thanks to #Jerry, I learn today new tricks:
(Blabla\(2\) :.*?|\G)<tr><td class=\"generic\">\K([^<]+).+?<\/tr>\r\n

Python 27 - BeautifulSoup and Tables

I have this source:
<tr id="bitstampUSD">
<td class="arrow" change="up" latest_trade="1363480722">
<span class="down">▼</span>
</td>
<td class="symbol">
<nobr>
bitstampUSD
</nobr>
<span class="sub">USD (SEPA converted)</span>
</td>
<td>46.74
<span class="sub">41 min ago</span>
</td>
<td class="minichart break">
<span volume="**whole heaps of number here that I want**"
print="**more numbers I want**"
avg="**more numbers I want**"
class="marketsparkline"></span>
</td>
<td>**36.39**
<span class="sub change">**10.35 28.46%**</span>
</td>
<td>**141,043.10**
<span class="sub">**5,132,052.22 USD**</span>
</td>
<td>**25.25**
<span class="sub">**46.58** (24h)</span>
</td>
<td>**49.17**
<span class="sub">47 (24h)</span>
</td>
<td class="break">**46.7**</td>
<td>**46.74**</td>
<td class="break">**46.78**
<span class="sub change">-0.04 -0.09%</span>
</td>
<td>**819.54**
<span class="sub">**38,340.96** USD</span>
</td>
</tr>
So I want to get the data in bold. (Well, it's supposed to be in bold, I guess the code tags stop that from happening. The data inside two asterisks.
I managed to figure out how to get the bits in code which I didn't include here, because it was inside the classes. But here, some of it is outside the classes so I don't know how to grab it.
It may help to look at the entire source, if you want http://bitcoincharts.com/markets/
It's laid out differently than other table code I've seen before.
Well, this outputs a bit more than you requested, but should get you started:
soup = BeautifulSoup(f)
for td in soup.find_all('td', class_='minichart break'):
avg = td.span['avg']
print_ = td.span['print']
volume = td.span['volume']
print avg, print_, volume
for td in soup.find_all('td'):
print 'TD', td.text.split()
On your example I obtain:
**more numbers I want** **more numbers I want** **whole heaps of number here that I want**
[u'\u25bc']
[u'bitstampUSD', u'USD', u'(SEPA', u'converted)']
[u'46.74', u'41', u'min', u'ago']
[]
[u'**36.39**', u'**10.35', u'28.46%**']
[u'**141,043.10**', u'**5,132,052.22', u'USD**']
[u'**25.25**', u'**46.58**', u'(24h)']
[u'**49.17**', u'47', u'(24h)']
[u'**46.7**']
[u'**46.74**']
[u'**46.78**', u'-0.04', u'-0.09%']
[u'**819.54**', u'**38,340.96**', u'USD']