Something about regular expression - regex

I want to get the current price, 416.00, from the following content. What regex can I use? There are several places on the webpage with similar content, but only the one I want has the word Discount a few lines after the current price. 416, 520 and 20% are variable. Thanks.
<tr>
<td class="txt_11px_b_EB6495" width="50" nowrap>Current Price?</td>
<td class="txt_11px_b_EB6495">HK$ 416.00</td>
</tr>
<tr>
<td class="txt_11px_n_999999">Original price?</td>
<td class="txt_11px_n_999999">HK$ 520.00</td>
</tr>
<tr>
<td class="txt_9px_n_999999"> </td>
<td class="txt_9px_n_999999">Discount 20%</td>
</tr>

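You can use
" (\d+\.\d*)</td>"
That matches values such as 520.00, 2.00, 123.1 and 123. (with a trailing dot), but not a bare 123, because the decimal point is required. Note that it captures every price that ends a cell, so on your sample it returns both 416.00 and 520.00.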

Use an HTML parser to get the text nodes, then extract the price using a regex.
You could use something like...
\d+(?:\.\d{2}|%)
I just tested it and it matched...
416.00
520.00
20%
It was unclear to me, so I assumed you want both prices and the percentage discount. The pattern also matches the % sign so you can tell which of the matches are percentages.
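To pick only the current price that is followed by a Discount line, the parser-then-regex idea can be sketched roughly as below. This is a minimal sketch, not a definitive implementation: it uses Python's standard html.parser, the names CellCollector and discounted_price are made up for illustration, and the first row with 300.00 is a hypothetical extra block added to show that prices without a discount are skipped.

```python
import re
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


def discounted_price(html):
    """Return the current price that precedes the first 'Discount' cell."""
    parser = CellCollector()
    parser.feed(html)
    cells = [c.strip() for c in parser.cells]
    for i, text in enumerate(cells):
        if not text.startswith("Discount"):
            continue
        # Walk backwards to the nearest "Current Price" label.
        for j in range(i - 1, -1, -1):
            if cells[j].startswith("Current Price") and j + 1 < len(cells):
                match = re.search(r"\d+\.\d{2}", cells[j + 1])
                return match.group(0) if match else None
    return None


# The first row is hypothetical; the rest is the sample from the question.
sample = """
<tr><td>Current Price?</td><td>HK$ 300.00</td></tr>
<tr>
<td class="txt_11px_b_EB6495" width="50" nowrap>Current Price?</td>
<td class="txt_11px_b_EB6495">HK$ 416.00</td>
</tr>
<tr>
<td class="txt_11px_n_999999">Original price?</td>
<td class="txt_11px_n_999999">HK$ 520.00</td>
</tr>
<tr>
<td class="txt_9px_n_999999"> </td>
<td class="txt_9px_n_999999">Discount 20%</td>
</tr>
"""

price = discounted_price(sample)
print(price)
```

Matching on cell text rather than on the raw markup means the class names and attribute order can change without breaking the extraction.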

Related

Regex to match start and end line [duplicate]

I have a CSV document littered with thousands of instances of a table that I need to remove. I assume I can use REGEX, but I can't seem to find an expression to remove it. I attached a sample at the bottom.
I thought <table(.*)</table> would work, but that seems to ignore line breaks. Is there somebody who can help me remove these?
<table cellpadding=""5"" align=""center"" class=""shippingcost"" style=""width: 525px;"">
<tbody>
<tr>
<td colspan=""2"" style=""text-align: center;"">Shipping:
</td>
</tr>
<tr class=""shippingcostrow"">
<td>
<div align=""right"">Domestic
Canada
International
</td>
<td width=""400"">
<div align=""left"">Insured shipping is included to all U.S. destinations.
Canadian buyers pay $28 for EMS insured shipping.
All International Buyers pay $35 for EMS insured shipping.
</td>
</tr>
</tbody>
</table>
Got it. The (?s) inline flag makes . match line breaks as well, and the lazy .*? stops at the first </table>, so each table is matched separately.
(?s)<table(.*?)</table>
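For anyone doing the same clean-up outside an editor, here is a small Python sketch of the same pattern. The function name strip_tables and the two-line sample are made up for illustration; Python's re module accepts the (?s) inline flag at the start of a pattern.

```python
import re

# (?s) lets "." cross line breaks; ".*?" is lazy, so every table is
# removed on its own instead of everything between the first <table
# and the last </table> in the file.
TABLE_RE = re.compile(r"(?s)<table(.*?)</table>")


def strip_tables(text):
    """Remove every <table ...>...</table> block, including line breaks."""
    return TABLE_RE.sub("", text)


# Hypothetical two-record CSV with doubled quotes, as in the question.
sample = (
    'sku1,"Intro <table class=""shippingcost"">\n'
    "<tr>\n<td>Shipping:\n</td>\n</tr>\n"
    '</table> outro"\n'
    'sku2,"<table>\n</table>"\n'
)

cleaned = strip_tables(sample)
print(cleaned)
```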

Find a table's last cell by regular expression

I want to use a regular expression (PCRE-compatible) to select a table
cell in an XML or HTML file. The cell spans several lines and contains
other elements with their attributes and values. It is always the cell in the last column of its row.
For various reasons I can't, and don't want to, use the ". matches newline" option.
For example, in this code:
EDITED:
<table colcount="4">
<tr>
<td colspan="2">
<para><text> Mike</text></para>
</td>
<td>
<tab />
</td>
<td1>
<para><text>Jack</text></para>
<para><text>Sarah</text></para>
</td>
</tr1>
<tr>
<td>
<para><text>Bob</text></para>
<para><text>Rita</text></para>
</td>
<td2 colspan="3" with>
<para><text>Helen</text></para>
</td>
</tr2>
<tr>
<td style="with:445px;">
<para><text>Sam</text></para>
</td>
<td>
<para><text>Emma</text></para>
<para><text>George</text></para>
</td>
<td>
</td>
<td3 colspan="">
<tab />
</td>
</tr3>
</table>
/EDITED
I want to find and select the whole last cell, including its start and end tags (<td and </td>),
together with the end tag of the corresponding row (</tr>), that is:
EDITED:
Here is what I want to select in a table like the one above using a regex:
either from <td1 to </tr1>, or from <td2 to </tr2>, or from <td3 to </tr3>
/EDITED
The formatting (indentation and line breaks) has to be preserved; I mean I can't, for example, put
</tr> in front of the closing tag of the cell (</td>).
Indentation consists of space characters only.
Thanks for any help...
The best you can do with a regex is:
<td(([^<]|<(?!\/td>))*)<\/td>\s*<\/tr>(?!(.|\r|\n)*<tr)
But this is kinda ugly and resource intensive, and it breaks when you have nested tables. A better route is indeed to use an XML or HTML parser for whichever programming language you're using.
If you want to select the last cell of EVERY row, as your updated question suggests, leave out the negative lookahead, like so:
<td(([^<]|<(?!\/td>))*)<\/td>\s*<\/tr>
Working example here: http://refiddle.com/gt2
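Both variants can be tried quickly in Python. This is only a sketch: the groups are rewritten as non-capturing so that findall returns whole matches, [\s\S] stands in for (.|\r|\n), and the small table is a cleaned-up stand-in for the one in the question. Neither pattern needs the ". matches newline" option.

```python
import re

# A cell body is any run of characters that never starts a "</td>".
CELL = r"<td(?:[^<]|<(?!/td>))*</td>\s*</tr>"

# Last cell of every row.
LAST_CELL_PER_ROW = re.compile(CELL)
# Last cell of the last row only: no further "<tr" may follow.
LAST_CELL_IN_DOC = re.compile(CELL + r"(?![\s\S]*<tr)")

doc = """<table>
<tr>
<td>A</td>
<td>
<para><text>B</text></para>
</td>
</tr>
<tr>
<td>C</td>
<td colspan="2">D</td>
</tr>
</table>"""

per_row = LAST_CELL_PER_ROW.findall(doc)
last_only = LAST_CELL_IN_DOC.findall(doc)

for match in per_row:
    print(repr(match))
```

Because a match includes the original line breaks and indentation, the selected text can be replaced or moved without reformatting it.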

Clean html code with a Regex

I am parsing some transactions, for example 3 transactions look like this:
<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD>DEPOSITO 0140959158</TD>
<TD>0140959158</TD>
<TD align=right>336,00</TD>
<TD align=center>+</TD>
<TD align=right>16.210,60</TD></TR>
<TR class=DefGVAltRow>
<TD>29/04/2013</TD>
<TD>RETIRO ATM CTA/CTE</TD>
<TD>1171029739</TD>
<TD align=right>600,00</TD>
<TD align=center>-</TD>
<TD align=right>15.610,60</TD></TR>
<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD>C.SERV.CAJERO AUT.</TD>
<TD>1171029739</TD>
<TD align=right>3,25</TD>
<TD align=center>-</TD>
<TD align=right>15.607,35</TD></TR>
And my current Regex is:
<TR class=\w+>
<TD>(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>\d{4})</TD>
<TD>(?<description>.+?)</TD>
<TD>(?<id>\d{3,30})</TD>
<TD.+?>(?<amount>[\d\.]{1,20},\d{1,10})</TD>
<TD.+?>(?<info>.+?)</TD>
<TD.+?>(?<balance>[\d\.]{1,20},\d{1,10})</TD></TR>
How can I edit the
<TD>(?<description>.+?)</TD>
part so that it tolerates optional tags inside the cell without capturing them? (Basically: how do I ignore an A tag when capturing the group?)
Thanks!
This is a very common problem. Please check this epic answer and stop using regexp to "parse" HTML. Instead, use a proper parser and get what you need with XPath or even a CSS selector.
This removes the 'optional' link:
<TR class=\w+>
<TD>(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>\d{4})</TD>
<TD>(?:<A href=".*>)?(?<description>.+?)(?:</A>)?</TD>
<TD>(?<id>\d{3,30})</TD>
<TD.+?>(?<amount>[\d\.]{1,20},\d{1,10})</TD>
<TD.+?>(?<info>.+?)</TD>
<TD.+?>(?<balance>[\d\.]{1,20},\d{1,10})</TD></TR>
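If you do stay with a regex, here is a rough Python sketch of the same idea. A few things are assumptions on my side: Python needs the (?P<name>...) spelling for named groups instead of (?<name>...), the literal line breaks between cells are written as \s*, the optional link is matched with <A\b[^>]*> rather than the greedy <A href=".*>, and the href value in the first row is invented for the demo.

```python
import re

ROW_RE = re.compile(
    r"<TR class=\w+>\s*"
    r"<TD>(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})</TD>\s*"
    # Optional <A ...> wrapper around the description; it is not captured.
    r"<TD>(?:<A\b[^>]*>)?(?P<description>.+?)(?:</A>)?</TD>\s*"
    r"<TD>(?P<id>\d{3,30})</TD>\s*"
    r"<TD[^>]*>(?P<amount>[\d.]{1,20},\d{1,10})</TD>\s*"
    r"<TD[^>]*>(?P<info>.+?)</TD>\s*"
    r"<TD[^>]*>(?P<balance>[\d.]{1,20},\d{1,10})</TD>\s*</TR>"
)

# First row carries a hypothetical link; the second is from the question.
sample = """<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD><A href="detail.aspx?id=1">DEPOSITO 0140959158</A></TD>
<TD>0140959158</TD>
<TD align=right>336,00</TD>
<TD align=center>+</TD>
<TD align=right>16.210,60</TD></TR>
<TR class=DefGVAltRow>
<TD>29/04/2013</TD>
<TD>RETIRO ATM CTA/CTE</TD>
<TD>1171029739</TD>
<TD align=right>600,00</TD>
<TD align=center>-</TD>
<TD align=right>15.610,60</TD></TR>"""

rows = [m.groupdict() for m in ROW_RE.finditer(sample)]
for row in rows:
    print(row["description"], row["amount"], row["info"], row["balance"])
```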

Extract attribute value from html element

Been struggling with this for a couple of hours now...
I have the following regex:
(?<=\bdata-video-id=""."">)(.*?)(title=.*?>)
The following input:
<div class="cameras">
<table class="results">
<colgroup>
<col class="col0">
<col class="col1">
</colgroup>
<thead>
<tr>
<th title="Name">
Name
</th>
<th title="Date">
Date
</th>
</tr>
</thead>
<tbody>
<tr data-video-id="1">
<td title="149 - Cam123">
149 - Cam123
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
<tr data-video-id="2">
<td title="150 - Cam456">
150 - Cam456
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
</tbody>
</table>
</div>
The regex outputs this:
<td title="149 - Cam123">
<td title="150 - Cam456">
But what I'd like to get is the contents of the title attribute of the 1st cell from every table row:
149 - Cam123
150 - Cam456
The number of rows may obviously vary but the number of columns is fixed.
Please help me fine tune the above regex.
Thanks
NOTE: The solution MUST be a regular expression. I do not have access to the code base therefore an HTML parser or any other kind of code intervention is not possible. The only way I can hook into the application is by injecting a different regex.
Since the OP requires that it MUST be a regex, my suggestion is to wrap the title value in a capture group:
(?<=\bdata-video-id=""."">).*?title="(.*?)">
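Here is a quick way to check the grouped pattern, sketched in Python. Several assumptions: the doubled quotes in the pattern are taken to be string escaping and are written as single quotes here, the DOTALL flag is switched on because the title sits on the line after the <tr ...> tag, and the host application is assumed to read the first capture group. The markup is a shortened copy of the input from the question.

```python
import re

# The fixed-width lookbehind anchors the match right after a <tr> that
# carries a one-character data-video-id; group 1 is the title value of
# the first cell that follows.
TITLE_RE = re.compile(r'(?<=\bdata-video-id=".">).*?title="(.*?)">', re.DOTALL)

html = """<thead>
<tr>
<th title="Name">
Name
</th>
</tr>
</thead>
<tbody>
<tr data-video-id="1">
<td title="149 - Cam123">
149 - Cam123
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
<tr data-video-id="2">
<td title="150 - Cam456">
150 - Cam456
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
</tbody>"""

titles = TITLE_RE.findall(html)
print(titles)
```

Note that the single . in the lookbehind only accepts one-character ids, so a row with data-video-id="10" would be skipped.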
Otherwise, the general solution is not to use a regex:
Why are you using a regex? Because of the complexities of the tags, the typical solution is to use an HTML parser.
Here is an SO question about this topic
Here is another, even more popular answer on using regex for XHTML, which Jeff Atwood pointed out in this blog post