Clean html code with a Regex

I am parsing some transactions; for example, three transactions look like this:
<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD>DEPOSITO 0140959158</TD>
<TD>0140959158</TD>
<TD align=right>336,00</TD>
<TD align=center>+</TD>
<TD align=right>16.210,60</TD></TR>
<TR class=DefGVAltRow>
<TD>29/04/2013</TD>
<TD>RETIRO ATM CTA/CTE</TD>
<TD>1171029739</TD>
<TD align=right>600,00</TD>
<TD align=center>-</TD>
<TD align=right>15.610,60</TD></TR>
<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD>C.SERV.CAJERO AUT.</TD>
<TD>1171029739</TD>
<TD align=right>3,25</TD>
<TD align=center>-</TD>
<TD align=right>15.607,35</TD></TR>
And my current Regex is:
<TR class=\w+>
<TD>(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>\d{4})</TD>
<TD>(?<description>.+?)</TD>
<TD>(?<id>\d{3,30})</TD>
<TD.+?>(?<amount>[\d\.]{1,20},\d{1,10})</TD>
<TD.+?>(?<info>.+?)</TD>
<TD.+?>(?<balance>[\d\.]{1,20},\d{1,10})</TD></TR>
How can I edit this line:
<TD>(?<description>.+?)</TD>
so that it handles an optional tag around the description, which would otherwise end up inside the captured group? (Basically: how do I ignore the A tag when capturing the group?)
Thanks!

This is a very common problem. Please check this epic answer and stop using regular expressions to "parse" HTML; instead, use a proper parser and get what you need with XPath or even a CSS selector.

This removes the 'optional' link:
<TR class=\w+>
<TD>(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>\d{4})</TD>
<TD>(?:<A href=".*>)?(?<description>.+?)(?:</A>)?</TD>
<TD>(?<id>\d{3,30})</TD>
<TD.+?>(?<amount>[\d\.]{1,20},\d{1,10})</TD>
<TD.+?>(?<info>.+?)</TD>
<TD.+?>(?<balance>[\d\.]{1,20},\d{1,10})</TD></TR>
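
The optional-link idea can be tried out with a minimal Python sketch. Several things here are assumptions rather than part of the original posts: Python's `(?P<name>...)` syntax replaces `(?<name>...)`, the `href` value is made up (the original link was not preserved in the sample), `\s*` stands in for the line breaks between cells, and `[^>]*` is used instead of `.*` so the tag match cannot run past the first `>`.

```python
import re

# Assumed Python translation of the pattern above; not the poster's exact regex.
ROW = re.compile(
    r'<TR class=\w+>\s*'
    r'<TD>(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})</TD>\s*'
    # Optional opening <A ...> and closing </A> stay outside the named group.
    r'<TD>(?:<A\b[^>]*>)?(?P<description>.+?)(?:</A>)?</TD>\s*'
    r'<TD>(?P<id>\d{3,30})</TD>\s*'
    r'<TD[^>]*>(?P<amount>[\d.]{1,20},\d{1,10})</TD>\s*'
    r'<TD[^>]*>(?P<info>.+?)</TD>\s*'
    r'<TD[^>]*>(?P<balance>[\d.]{1,20},\d{1,10})</TD></TR>'
)

# Hypothetical input: the first row wraps the description in a link.
html = """<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD><A href="detail.aspx?id=1">DEPOSITO 0140959158</A></TD>
<TD>0140959158</TD>
<TD align=right>336,00</TD>
<TD align=center>+</TD>
<TD align=right>16.210,60</TD></TR>
<TR class=DefGVAltRow>
<TD>29/04/2013</TD>
<TD>RETIRO ATM CTA/CTE</TD>
<TD>1171029739</TD>
<TD align=right>600,00</TD>
<TD align=center>-</TD>
<TD align=right>15.610,60</TD></TR>"""

descriptions = [m.group("description") for m in ROW.finditer(html)]
balances = [m.group("balance") for m in ROW.finditer(html)]
print(descriptions)
print(balances)
```

Both rows should match, with the link markup left out of the description for the first one.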

Related

regular expression: what's wrong with my expression?

I am having difficulty building a regex.
Suppose there is an HTML snippet like the one below.
I want to use JavaScript to extract each <tbody> block that contains the link "apple" (the <a> is inside the <td class="by">).
I constructed the following expression:
/<tbody.*?text[\s\S]*?<td class="by"[\s\S]*?<a.*?>apple<\/a>[\s\S]*?<\/tbody>/g
But the result is different from what I wanted: each match contains more than one <tbody> block. How should the expression be written?
(I tested it at https://regex101.com/ and got the unexpected selection, but I can't figure out the problem.)
<tbody id="text_0">
<td class="by">
...lots of other tags
cat
...lots of other tags
</td>
</tbody>
<tbody id="text_1">
...lots of other tags
<td class="by">
apple
</td>
...lots of other tags
</tbody>
<tbody id="text_2">
...lots of other tags
<td class="by">
cat
</td>
...lots of other tags
</tbody>
<tbody id="text_3">
...lots of other tags
<td class="by">
...lots of other tags
tiger
</td>
...lots of other tags
</tbody>
<tbody id="text_4">
<td class="by">
banana
</td>
</tbody>
<tbody id="text_5">
<td class="by">
peach
</td>
</tbody>
<tbody id="text_6">
<td class="by">
apple
</td>
</tbody>
<tbody id="text_7">
<td class="by">
banana
</td>
</tbody>
And this is what I expect to get:
<tbody id="text_1">
<td class="by">
apple
</td>
</tbody>
<tbody id="text_6">
<td class="by">
apple
</td>
</tbody>

This is not an answer to the regex part of the question, but shouldn't the td elements be embedded in tr elements? tr stands for "table row", while tbody stands for "table body". tbody usually groups the table rows. It is not prohibited to have more than one tbody in the same table, but it is usually not necessary. (tbody is actually optional; you can have tr directly inside the table element.)

First, regex is not a good solution for parsing anything like HTML or XML.
I can fix your pattern to work with this specific example, but I can't guarantee that it will work in all cases. Regex is just not the right tool for the job.
But anyway, replace the first two instances of [\s\S] in your pattern with [^<]:
<tbody.*?text[^<]*?<td class="by"[^<]*?<a.*?>apple<\/a>[\s\S]*?</tbody>
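
To see why each match swallows more than one block, here is a small Python sketch. It is an illustration under assumptions, not the answerer's method: the sample is shortened, the `<a href="#">` tags are made up (the question's pattern expects them, but they do not appear in the posted sample), and the second pattern uses a different technique, a negative lookahead that stops every "anything" run at the next `</tbody>`.

```python
import re

# Hypothetical, shortened sample; the <a> tags are an assumption.
html = (
    '<tbody id="text_0">\n<td class="by"><a href="#">cat</a></td>\n</tbody>\n'
    '<tbody id="text_1">\n<td class="by"><a href="#">apple</a></td>\n</tbody>\n'
    '<tbody id="text_2">\n<td class="by"><a href="#">banana</a></td>\n</tbody>\n'
)

# The pattern from the question: lazy quantifiers still grow until "apple"
# is found, even if that means crossing into the next <tbody>.
original = re.compile(
    r'<tbody.*?text[\s\S]*?<td class="by"[\s\S]*?<a.*?>apple</a>[\s\S]*?</tbody>'
)

# Any run of characters that never steps over a closing </tbody>.
INSIDE = r'(?:(?!</tbody>)[\s\S])*?'
bounded = re.compile(
    r'<tbody\b[^>]*>' + INSIDE + r'<td class="by"[^>]*>' + INSIDE
    + r'<a\b[^>]*>apple</a>' + INSIDE + r'</tbody>'
)

spanning_matches = original.findall(html)
bounded_matches = bounded.findall(html)
print(len(spanning_matches), spanning_matches[0].count('<tbody '))
print(bounded_matches)
```

The same lookahead construct is also available in JavaScript regular expressions, so the bounded pattern should carry over to a `/.../g` literal once the slashes are escaped.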

Start with this working regexp and go from there:
/<a href="(.*?)">apple<\/a>/g
If that is too broad and you want to make it more specific, add the next surrounding tag:
/<td.*?>\s*<a href="(.*?)">apple<\/a>/g
Then continue:
/<tbody.*?>\s*<td.*?>\s*<a href="(.*?)">apple<\/a>/g
Also, consider an alternative solution such as XPath. Regular expressions can't really parse all variations of HTML.

Regular expression with multiple results

What's wrong with my regex?
"/Blabla\(2\) :.*<tr><td class=\"generic\">(.*)<\/td>.+<\/tr>/Uis"
....
<tr>
<td class="aaa">Blabla(1) :</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1</td><td class="generic">word2 </td><td class="generic">word3</td></tr>
<tr><td class="generic">word4</td><td class="generic">word5 </td><td class="generic">word6</td></tr>
</tbody></table>
</td>
</tr>
<tr>
<td class="aaa">Blabla(2) :</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1b</td><td class="generic">word2b </td><td class="generic">word3b</td></tr>
<tr><td class="generic">word4b</td><td class="generic">word5b </td><td class="generic">word6b</td></tr>
</tbody></table>
</td>
</tr>
What I want to do is get the content of the FIRST TD of each TR in the block beginning with Blabla(2).
So the expected result is word1b AND word4b.
But only the first is returned...
Thank you for your help. Please don't tell me to use a DOM navigator; it's not possible in my case.

That's an interesting regex; I learned about the ungreedy flag from it, nice!
As for your problem, you might use \G, which matches immediately after the previous match, together with the g flag, assuming a PCRE engine:
/(?:Blabla\(2\) :|(?<!^)\G).*<tr><td class=\"generic\">(.*)<\/td>.+<\/tr>/Uisg
regex101 demo
Or a little shorter with different delimiters:
'~(?:Blabla\(2\) :|(?<!^)\G).*<tr><td class="generic">(.*)</td>.+</tr>~Uisg'

Thanks to @Jerry, I learned some new tricks today:
(Blabla\(2\) :.*?|\G)<tr><td class=\"generic\">\K([^<]+).+?<\/tr>\r\n
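
For engines without \G (Python's built-in `re` module is one), a two-step approach gives the same result. This is a swapped-in technique, not what the answers above do: first isolate the block that starts at Blabla(2), then collect the first cell of each row inside it. The sketch assumes the block ends at the first `</table>` after the label.

```python
import re

html = """<tr>
<td class="aaa">Blabla(1) :</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1</td><td class="generic">word2 </td><td class="generic">word3</td></tr>
<tr><td class="generic">word4</td><td class="generic">word5 </td><td class="generic">word6</td></tr>
</tbody></table>
</td>
</tr>
<tr>
<td class="aaa">Blabla(2) :</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1b</td><td class="generic">word2b </td><td class="generic">word3b</td></tr>
<tr><td class="generic">word4b</td><td class="generic">word5b </td><td class="generic">word6b</td></tr>
</tbody></table>
</td>
</tr>"""

# Step 1: cut out the block from the label to the end of its table.
block = re.search(r'Blabla\(2\) :.*?</table>', html, re.DOTALL)

# Step 2: inside that block only, take the first generic cell of each row.
first_cells = (
    re.findall(r'<tr><td class="generic">(.*?)</td>', block.group(0))
    if block else []
)
print(first_cells)
```

The header row is skipped because it opens with `<tr class="ccc">`, which the second pattern does not match.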

Extract attribute value from html element

Been struggling with this for a couple of hours now...
I have the following regex:
(?<=\bdata-video-id=""."">)(.*?)(title=.*?>)
The following input:
<div class="cameras">
<table class="results">
<colgroup>
<col class="col0">
<col class="col1">
</colgroup>
<thead>
<tr>
<th title="Name">
Name
</th>
<th title="Date">
Date
</th>
</tr>
</thead>
<tbody>
<tr data-video-id="1">
<td title="149 - Cam123">
149 - Cam123
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
<tr data-video-id="2">
<td title="150 - Cam456">
150 - Cam456
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
</tbody>
</table>
</div>
The regex outputs this:
<td title="149 - Cam123">
<td title="150 - Cam456">
But what I'd like to get is the contents of the title attribute of the 1st cell from every table row:
149 - Cam123
150 - Cam456
The number of rows may obviously vary but the number of columns is fixed.
Please help me fine tune the above regex.
Thanks
NOTE: The solution MUST be a regular expression. I do not have access to the code base therefore an HTML parser or any other kind of code intervention is not possible. The only way I can hook into the application is by injecting a different regex.

Based on the OP's requirement that it MUST be a regex, my suggestion would be to add a capturing group around the inner title value:
(?<=\bdata-video-id=""."">).*?title="(.*?)">
Otherwise, the general solution is to not use a regex:
Why are you using a regex? Because of the complexity of the tags, the typical solution is to use an HTML parser.
Here is an SO question about this topic.
Here is another, even more popular, response on using regex for XHTML, which was pointed out by Jeff Atwood in this blog post.
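
The capturing-group suggestion can be checked with a short Python sketch. It is only an illustration: the pattern is written with plain quotes instead of the doubled `""` used in the question, it drops the lookbehind in favour of `[^"]*` so ids longer than one character also work, and it assumes the application can report group 1 rather than the whole match.

```python
import re

html = """<tbody>
<tr data-video-id="1">
<td title="149 - Cam123">
149 - Cam123
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
<tr data-video-id="2">
<td title="150 - Cam456">
150 - Cam456
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
</tbody>"""

# Anchor on the row attribute, then capture only the title of the cell
# that immediately follows the opening <tr ...> tag.
pattern = re.compile(r'data-video-id="[^"]*">\s*<td title="([^"]*)"')

titles = pattern.findall(html)
print(titles)
```

Because the pattern requires the `<td` to follow the `<tr ...>` tag directly, the second cell of each row is never picked up.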

jmeter grab value from response data

I have a question about grabbing a certain value from the HTML response data in JMeter.
I've been trying both a regular expression and the XPath extractor (see below), but with no luck.
This is part of the response data I receive:
<table border="0" cellpadding="2" cellspacing="1" style="border-collapse: collapse" id="AutoNumber2" bordercolorlight="#999999" bordercolordark="#999999" width="100%">
<tr>
<td class="head" align="center" colspan="2">Routing Sheet</td>
</tr>
<tr class="altrow">
<td align="right" width="50%" class="formtext">Today's Date:</td>
<td valign="top" width="50%" class="formtext">06/19/2012</td>
</tr>
<tr class="altrow">
<td align="right" width="50%" class="formtext"> HCSC Received Date:</td>
<td valign="top" width="50%" class="formtext">06/19/2012</td>
</tr>
<tr class="tablerow">
<td align="right" width="50%" class="formtext"> Package Log Date:</td>
<td valign="top" width="50%" class="formtext">06/19/2012 04:21PM</td>
</tr>
<tr class="altrow">
<td align="right" width="50%" class="formtext"> Group Specialist:</td>
<td valign="top" width="50%" class="formtext">WATTS, JOHN</td>
</tr>
<tr class="tablerow">
<td align="right" width="50%" class="formtext"> Case Underwriter:</td>
<td valign="top" width="50%" class="formtext">N/A</td>
</tr>
<tr class="altrow">
<td align="right" width="50%" class="formtext"> Medical Underwriter:</td>
<td valign="top" width="50%" class="formtext">N/A</td>
</tr>
<tr class="tablerow">
<td align="right" width="50%" class="formtext">Case Number:</td>
<td valign="top" width="50%" class="formtext">7402628</td>
</tr>
And I'm trying to grab the case number.
I have been trying the regex extractor:
Case Number:</td><td valign="top" width="50%" class="formtext">(.+?)</td>
But got a null value back.
And for the XPath extractor I tried this:
//table[@id='AutoNumber2']/tbody/tr[8]/td[2]
but it's not working either.
I've been thinking of using Beanshell to grab the source code as a string and parse the number.
Is there any better way of grabbing that number?
And how can I use beanshell to grab the source code of the response data?
I tried using an XPath of /html but had no luck.
Thanks a lot

Try this, I tested it on your sample and it works :
Let me know if that works for you

Try this XPath:
//table[@id='AutoNumber2']/tr[8]/td[2]

If you are using the XPath Extractor to parse an HTML (not XML!) response, ensure that the Use Tidy (tolerant parser) option is CHECKED.
Your XPath query should return the value you want to extract.
Refine your XPath query, e.g.
//table[@id='AutoNumber2']/tbody/tr[td/text()='Case Number:']/td[2]/text()
To use Beanshell for parsing, look into this: Using jmeter variables in xpath extractor.
You can first test your XPath query using any other tool, such as these Firefox add-ons:
XPath Checker
XPather
XPath Finder
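
Outside JMeter, the label-based query can be tried with Python's standard `xml.etree.ElementTree`, which supports a small XPath subset. This is a sketch under assumptions: the table is shortened, wrapped in `<html><body>` and closed with `</table>` so that it parses as XML, and the result says nothing definite about how JMeter's Tidy option treats the same markup. It does show that a `/tbody/` step finds nothing when the source contains no `<tbody>` element and the parser does not add one.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the response (an assumption for the demo).
page = """<html><body>
<table border="0" id="AutoNumber2" width="100%">
<tr>
<td class="head" align="center" colspan="2">Routing Sheet</td>
</tr>
<tr class="altrow">
<td align="right" width="50%" class="formtext">Today's Date:</td>
<td valign="top" width="50%" class="formtext">06/19/2012</td>
</tr>
<tr class="tablerow">
<td align="right" width="50%" class="formtext">Case Number:</td>
<td valign="top" width="50%" class="formtext">7402628</td>
</tr>
</table>
</body></html>"""

root = ET.fromstring(page)

# Select the row by its label instead of by position, then take the 2nd cell.
cell = root.find(".//table[@id='AutoNumber2']/tr[td='Case Number:']/td[2]")
case_number = cell.text if cell is not None else None

# The same query with a tbody step matches nothing in this document.
with_tbody = root.find(
    ".//table[@id='AutoNumber2']/tbody/tr[td='Case Number:']/td[2]"
)
print(case_number, with_tbody)
```

Selecting the row by its label also avoids depending on a fixed index such as tr[8].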

You can use the View Results Tree listener to test and tweak your regular expression on your actual response data.
To find out what happens at runtime, use the Debug component.
At first glance, I see that it doesn't match because your regular expression does not account for the newline that follows Case Number:</td>.
See here for special characters that match a newline.
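
The effect of the missing line break is easy to reproduce in Python. This is a sketch, not a JMeter configuration: it compares the expression from the question with a variant that allows optional whitespace (`\s*`, which includes newlines) between the two cells.

```python
import re

# The relevant rows of the response, with the line break between the cells.
response = """<tr class="tablerow">
<td align="right" width="50%" class="formtext">Case Number:</td>
<td valign="top" width="50%" class="formtext">7402628</td>
</tr>"""

# Expression from the question: expects the two cells on one line.
strict = re.compile(
    r'Case Number:</td><td valign="top" width="50%" class="formtext">(.+?)</td>'
)

# Same expression, but tolerating whitespace and newlines between the cells.
tolerant = re.compile(
    r'Case Number:</td>\s*<td valign="top" width="50%" class="formtext">(.+?)</td>'
)

strict_match = strict.search(response)
tolerant_match = tolerant.search(response)
case_number = tolerant_match.group(1) if tolerant_match else None
print(strict_match, case_number)
```

The strict pattern finds nothing, which corresponds to the null value reported in the question.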

Something about regular expression

If I want to get the current price, 416.00, from the following content, what regexp can I use? There are several places in the web page with similar content, but the one I want has the word Discount a few lines after the current price. 416, 520 and 20% are variable. Thanks.
<tr>
<td class="txt_11px_b_EB6495" width="50" nowrap>Current Price?</td>
<td class="txt_11px_b_EB6495">HK$ 416.00</td>
</tr>
<tr>
<td class="txt_11px_n_999999">Original price?</td>
<td class="txt_11px_n_999999">HK$ 520.00</td>
</tr>
<tr>
<td class="txt_9px_n_999999"> </td>
<td class="txt_9px_n_999999">Discount 20%</td>
</tr>

You can use
" (\d+\.\d*)</td>"
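That will match 520.00, 2.00, 123.1, and 123. (with the trailing dot). The dot is required, so a bare 123 will not match, and the value must be preceded by a space and followed directly by </td>.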
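
A quick Python check shows what this pattern accepts. The sample strings below are made up for illustration.

```python
import re

pattern = re.compile(r' (\d+\.\d*)</td>')

# Hypothetical cells covering the cases discussed above.
samples = [
    '<td>HK$ 520.00</td>',
    '<td> 123.1</td>',
    '<td> 123.</td>',
    '<td> 123</td>',
]

results = []
for cell in samples:
    match = pattern.search(cell)
    # Store the captured number, or None when the pattern does not match.
    results.append(match.group(1) if match else None)
print(results)
```

Note that the pattern matches every price cell, so on the page from the question it would return both the current and the original price.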

Use an HTML parser to get the text node, then extract the price using a regex.
You would use something like...
\d+(?:\.\d{2}|%)
I just tested it and it matched...
416.00
520.00
20%
I assumed (it was unclear to me) you want the prices and the percentage discount. I also matched the % so you can tell what are the percentages in the matches.
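
If only the current price that is followed by a Discount line is wanted, one possible regex-only sketch in Python is shown below. It rests on assumptions that are not in the question: every price block starts with the label Current Price?, the discount text has the form "Discount NN%", and the first block in the sample (300.00, no discount) is invented to show that it is skipped. A lookahead checks for a Discount line before the next Current Price label.

```python
import re

# Invented block without a discount, followed by the block from the question.
html = """<tr>
<td class="txt_11px_b_EB6495" width="50" nowrap>Current Price?</td>
<td class="txt_11px_b_EB6495">HK$ 300.00</td>
</tr>
<tr>
<td class="txt_11px_b_EB6495" width="50" nowrap>Current Price?</td>
<td class="txt_11px_b_EB6495">HK$ 416.00</td>
</tr>
<tr>
<td class="txt_11px_n_999999">Original price?</td>
<td class="txt_11px_n_999999">HK$ 520.00</td>
</tr>
<tr>
<td class="txt_9px_n_999999"> </td>
<td class="txt_9px_n_999999">Discount 20%</td>
</tr>"""

pattern = re.compile(
    r'Current Price\?</td>\s*'
    r'<td[^>]*>HK\$ (\d+\.\d{2})</td>'
    # Require a Discount line before the next "Current Price" label.
    r'(?=(?:(?!Current Price)[\s\S])*?Discount \d+%)'
)

discounted_prices = pattern.findall(html)
print(discounted_prices)
```

The inner negative lookahead keeps the search for Discount from running into the next price block.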
