Extract attribute value from html element - regex

Been struggling with this for a couple of hours now...
I have the following regex:
(?<=\bdata-video-id=""."">)(.*?)(title=.*?>)
The following input:
<div class="cameras">
<table class="results">
<colgroup>
<col class="col0">
<col class="col1">
</colgroup>
<thead>
<tr>
<th title="Name">
Name
</th>
<th title="Date">
Date
</th>
</tr>
</thead>
<tbody>
<tr data-video-id="1">
<td title="149 - Cam123">
149 - Cam123
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
<tr data-video-id="2">
<td title="150 - Cam456">
150 - Cam456
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
</tbody>
</table>
</div>
The regex outputs this:
<td title="149 - Cam123">
<td title="150 - Cam456">
But what I'd like to get is the contents of the title attribute of the 1st cell from every table row:
149 - Cam123
150 - Cam456
The number of rows may obviously vary but the number of columns is fixed.
Please help me fine tune the above regex.
Thanks
NOTE: The solution MUST be a regular expression. I do not have access to the code base therefore an HTML parser or any other kind of code intervention is not possible. The only way I can hook into the application is by injecting a different regex.

Given the OP's requirement that it MUST be a regex, my suggestion is to wrap the value of the title attribute in a capture group:
(?<=\bdata-video-id=""."">).*?title="(.*?)">
Otherwise, the general solution is to not use a regex:
Why are you using a regex? Because of the complexities of the tags, the typical solution is to use an HTML parser.
Here is an SO question about this topic.
Here is another, even more popular, response on using regex for XHTML, which Jeff Atwood pointed out in this blog post.
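For illustration, a minimal Python sketch of the capture-group suggestion above. It assumes the doubled quotes in the pattern are string-literal escaping and stand for single quote characters, and it uses re.DOTALL because the row tag and its first cell sit on separate lines in the sample. The asker's actual regex engine and its options are unknown, so both points are assumptions.

```python
import re

# Sample rows copied from the question (trimmed to the <tbody> part).
html = """<tbody>
<tr data-video-id="1">
<td title="149 - Cam123">
149 - Cam123
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
<tr data-video-id="2">
<td title="150 - Cam456">
150 - Cam456
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
</tbody>"""

# The lookbehind anchors the match right after the <tr ...> tag;
# group 1 holds the value of the first title attribute that follows.
pattern = re.compile(r'(?<=\bdata-video-id=".">).*?title="(.*?)">', re.DOTALL)

titles = pattern.findall(html)
print(titles)
```

Note that the single . in the lookbehind only accepts one-character ids, exactly as in the original pattern; rows with longer ids would need a different anchor.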

Related

XPath - Retrieving text value when condition contains a tag

I have a section of a table and I am trying to get the value "Distributor 10":
<table class="d">
<tr>
<td class="ah">supplier<td>
<td class="ad">
Supplier 10
</td>
</tr>
<tr>
<td class="ah">distributor<pre><td>
<td class="ad">
Distributor 10
</td>
</tr>
</table>
In Chrome Developer Tools, I get this value with the following XPath string:
//tr/td[text()="distributor"]/following-sibling::td[@class="ad"]/a/text()
But when I code this in Python, it returns an empty list. From what I can see, it is because of the <pre> tag next to "distributor".
When I amend the XPath above to look for "supplier" instead of "distributor", it works perfectly well.
Any suggestions would be welcome.
Assuming you're using lxml, you can use one of the following XPaths to get this working:
//tr[contains(.,"distributor")]//a/text()
//a[parent::td[@class="ad"] and starts-with(@href,"/D")]/text()
Piece of code:
from lxml import etree
from io import StringIO
html = '''<table class="d">
<tr>
<td class="ah">supplier<td>
<td class="ad">
Supplier 10
</td>
</tr>
<tr>
<td class="ah">distributor<pre><td>
<td class="ad">
Distributor 10
</td>
</tr>
</table>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
data = tree.xpath('//tr[contains(.,"distributor")]//a/text()')
print (data)
Output : ['Distributor 10']
Alternative: use lxml's html Cleaner class (its remove_tags option) to strip the pre element from your page.
References:
https://lxml.de/api/lxml.html.clean.Cleaner-class.html
https://lxml.de/lxmlhtml.html#cleaning-up-html
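As a dependency-free illustration of the same idea (find the label cell, then read the td with class "ad" that follows it), here is a sketch using Python's standard html.parser instead of lxml. This is a swapped-in technique, not the answer's method; the class ValueAfterLabel and the helper value_after are hypothetical names. Because it reacts to tag events rather than building a tree, the stray <pre> and unclosed <td> tags do not get in the way.

```python
from html.parser import HTMLParser

html = '''<table class="d">
<tr>
<td class="ah">supplier<td>
<td class="ad">
Supplier 10
</td>
</tr>
<tr>
<td class="ah">distributor<pre><td>
<td class="ad">
Distributor 10
</td>
</tr>
</table>'''


class ValueAfterLabel(HTMLParser):
    """Collect the text of the first td.ad cell that follows a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.armed = False      # label seen, waiting for the td.ad cell
        self.capturing = False  # currently inside the wanted cell
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == "td" and self.armed and dict(attrs).get("class") == "ad":
            self.armed = False
            self.capturing = True
            self.value = ""

    def handle_data(self, data):
        if self.capturing:
            self.value += data
        elif self.value is None and data.strip() == self.label:
            self.armed = True

    def handle_endtag(self, tag):
        if tag == "td" and self.capturing:
            self.capturing = False
            self.value = self.value.strip()


def value_after(label, markup):
    parser = ValueAfterLabel(label)
    parser.feed(markup)
    parser.close()
    return parser.value


print(value_after("distributor", html))
```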

How to parse specific contents from table with Scrapy

I'm trying to parse certain contents from a table like the one below:
<table class="dataTbl col-4">
<tr>
<th scope="row">Rent</th>
<td>5.5</td>
<th scope="row">Management</th>
<td>3.3</td>
</tr>
<tr>
<th scope="row">Deposit</th>
<td>No</td>
<th scope="row">Other</th>
<td>No</td>
</tr>
<tr>
<th scope="row">Other2</th>
<td>No</td>
<th scope="row">Insurance</th>
<td>Yes</td>
</tr>
</table>
My goal is to find a specific row header (for example, Rent) and, if there is a match, extract the content of the next <td> tag (for example, 5.5).
How can I do this in Python?
I'm using Python3/Scrapy 1.3.0.
Thanks
In [9]: Selector(text=html).xpath('//th[text()="Rent"]/following-sibling::td[1]').extract()
Out[9]: ['<td>5.5</td>']
Use text()="Rent" to identify the th tag.
Use following-sibling:: to get its siblings, and [1] to select the first one.
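The "match the th, then take the next td" logic can also be sketched without Scrapy. The example below uses the standard library's xml.etree.ElementTree as a stand-in, which only works here because the sample table happens to be well-formed XML; value_after_header is a hypothetical helper name, and real pages would normally still go through Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

html = """<table class="dataTbl col-4">
<tr>
<th scope="row">Rent</th>
<td>5.5</td>
<th scope="row">Management</th>
<td>3.3</td>
</tr>
<tr>
<th scope="row">Deposit</th>
<td>No</td>
<th scope="row">Other</th>
<td>No</td>
</tr>
<tr>
<th scope="row">Other2</th>
<td>No</td>
<th scope="row">Insurance</th>
<td>Yes</td>
</tr>
</table>"""


def value_after_header(table_markup, label):
    """Return the text of the td that directly follows the th named label."""
    root = ET.fromstring(table_markup)
    for row in root.iter("tr"):
        cells = list(row)
        # Walk adjacent pairs: (th, td), (td, th), (th, td), ...
        for cell, nxt in zip(cells, cells[1:]):
            is_header = cell.tag == "th" and (cell.text or "").strip() == label
            if is_header and nxt.tag == "td":
                return (nxt.text or "").strip()
    return None


print(value_after_header(html, "Rent"))
```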
Using a Python regular expression:
r'\>text\<.+\n +\<td\>(\d+\.\d+)'
In your case, replace text with Rent. Also, this is a useful web page for debugging regular expressions.
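A short sketch of how that pattern might be used. Two things here are assumptions rather than part of the answer: the label is passed through re.escape, and the " +" before the td is relaxed to \s* so that the pattern also matches when the td line is not indented. The helper name value_for is hypothetical.

```python
import re

html = """<tr>
<th scope="row">Rent</th>
<td>5.5</td>
<th scope="row">Management</th>
<td>3.3</td>
</tr>
<tr>
<th scope="row">Deposit</th>
<td>No</td>
</tr>"""


def value_for(label, markup):
    # re.escape guards against labels containing regex metacharacters.
    pattern = r'>' + re.escape(label) + r'<.+\n\s*<td>(\d+\.\d+)'
    match = re.search(pattern, markup)
    return match.group(1) if match else None


print(value_for("Rent", html))
```

Because the capture group is (\d+\.\d+), only decimal values are picked up; a cell such as No or Yes yields no match.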

Regex to match start and end line [duplicate]

I have a CSV document littered with thousands of instances of a table that I need to remove. I assume I can use REGEX, but I can't seem to find an expression to remove it. I attached a sample at the bottom.
I thought <table(.*)</table> would work, but that seems to ignore line breaks. Is there somebody who can help me remove these?
<table cellpadding=""5"" align=""center"" class=""shippingcost"" style=""width: 525px;"">
<tbody>
<tr>
<td colspan=""2"" style=""text-align: center;"">Shipping:
</td>
</tr>
<tr class=""shippingcostrow"">
<td>
<div align=""right"">Domestic
Canada
International
</td>
<td width=""400"">
<div align=""left"">Insured shipping is included to all U.S. destinations.
Canadian buyers pay $28 for EMS insured shipping.
All International Buyers pay $35 for EMS insured shipping.
</td>
</tr>
</tbody>
</table>
Got it. Sublime Text apparently supports inline flags for regex; (?s) makes . match line breaks as well.
(?s)<table(.*?)</table>
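The same inline flag works in Python's re module, which makes the difference easy to see. A small sketch with a shortened, made-up CSV field (the surrounding text "Nice lamp." and "Ships fast." is invented for the example):

```python
import re

# A CSV field with an embedded multi-line table; quotes are doubled as in the sample.
csv_field = (
    'Nice lamp. <table class=""shippingcost"">\n'
    '<tbody>\n'
    '<tr>\n'
    '<td>Shipping:\n'
    '</td>\n'
    '</tr>\n'
    '</tbody>\n'
    '</table> Ships fast.'
)

# Without (?s), . stops at line breaks, so nothing is removed.
without_flag = re.sub(r'<table(.*?)</table>', '', csv_field)

# With (?s), . also matches line breaks; the lazy .*? stops at the first </table>.
with_flag = re.sub(r'(?s)<table(.*?)</table>', '', csv_field)

print(with_flag)
```

The lazy quantifier matters when a document holds thousands of tables: a greedy .* with (?s) would swallow everything from the first <table to the last </table>.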

Validating html-style table data with XSLT

I need to be able to check xml with html-style table data to ensure that it's "rectangular". For example this is rectangular (2x2)
<table>
<tr>
<td>Foo</td>
<td>Bar</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
</tr>
</table>
This is not
<table>
<tr>
<td>Foo</td>
<td>Bar</td>
</tr>
<tr>
<td>Baz</td>
</tr>
</table>
This is complicated by row and column spans and the fact that I need to accept two styles of markup, either where spanned cells are included as empty td or where span cells are omitted.
<!-- good (3x2), spanned cells included -->
<table>
<tr>
<td colspan="2">Foo</td>
<td/>
<td rowspan="2">Bar</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
<td/>
</tr>
</table>
<!-- also good (3x2), spanned cells omitted -->
<table>
<tr>
<td colspan="2">Foo</td>
<td rowspan="2">Bar</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
</tr>
</table>
Here are a bunch of examples of bad tables where it's ambiguous how to deal with them
<!-- bad, looks like spanned cells are included but more cells in row 1 than 2 -->
<table>
<tr>
<td colspan="2">Foo</td>
<td/>
<td rowspan="2">Bar</td>
<td>BAD</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
<td/>
</tr>
</table>
<!-- bad, looks like spanned cells are omitted but more cells in row 1 than 2 -->
<table>
<tr>
<td colspan="2">Foo</td>
<td rowspan="2">Bar</td>
<td>BAD</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
</tr>
</table>
<!-- bad, can't tell if spanned cells are included or omitted -->
<table>
<tr>
<td colspan="2">Foo</td>
<td rowspan="2">Bar</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
<td/>
</tr>
</table>
<!-- bad, looks like spanned cells are omitted but a non-empty cell is overspanned -->
<table>
<tr>
<td colspan="2">Foo</td>
<td rowspan="2">Bar</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
<td>BAD</td>
</tr>
</table>
I already have a working XSLT 2.0 solution for this problem that involves normalizing the data to the "spanned cells included" style then validating, however, my solution is cumbersome and starts to perform poorly for tables with an area of greater than 1000 cells. My normalization and validation routines involve iterating sequentially over the cells and passing along a param of cells that should be created by spans and inserting them when I pass their coordinates in the table. I'm not happy with either of them.
I'm looking for suggestions about cleverer ways to achieve this validation that would hopefully have better performance profiles on large tables. I need to account for th and td but omitted th from the examples for the sake of simplicity; they can be included or ignored in any answers. I'm not checking whether thead, tbody, and/or tfoot have the same width; this can also be included or omitted. I'm currently using XSLT 2.0, but I'd be interested in 3.0 solutions if they were significantly better than a solution implemented in 2.0.
I don't think this kind of problem is suited to XSLT, especially if you have to process very large tables.
I'd suggest developing a solution in a procedural language, maybe using XSLT to pre- or post-process the XML.
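To make the procedural suggestion concrete, here is a rough Python sketch. It covers only the "spanned cells omitted" style: every cell claims its slots on a grid, and the table passes if the claimed slots form a full rectangle with no overlaps. It does not try to recognise the "spanned cells included" style or the ambiguous cases from the question, and the function name is_rectangular is hypothetical. Each slot is visited once, so the cost grows linearly with the table area.

```python
import xml.etree.ElementTree as ET


def is_rectangular(table_xml):
    """Check a table written in the 'spanned cells omitted' style."""
    rows = ET.fromstring(table_xml).findall(".//tr")
    occupied = set()  # (row, col) slots already claimed by some cell
    for r, row in enumerate(rows):
        col = 0
        for cell in row:
            if cell.tag not in ("td", "th"):
                continue
            # Skip slots covered by a rowspan coming from a row above.
            while (r, col) in occupied:
                col += 1
            colspan = int(cell.get("colspan", 1))
            rowspan = int(cell.get("rowspan", 1))
            for dr in range(rowspan):
                for dc in range(colspan):
                    slot = (r + dr, col + dc)
                    if slot in occupied:
                        return False  # two cells claim the same slot
                    occupied.add(slot)
            col += colspan
    if not occupied:
        return True
    height = len(rows)
    width = max(c for _, c in occupied) + 1
    # Rectangular means every slot of the height x width grid is claimed,
    # and no rowspan reaches past the last row.
    return occupied == {(r, c) for r in range(height) for c in range(width)}


good = """<table>
<tr><td colspan="2">Foo</td><td rowspan="2">Bar</td></tr>
<tr><td>Baz</td><td>Qux</td></tr>
</table>"""
print(is_rectangular(good))
```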

Something about regular expression

If I want to get the current price, 416.00, from the following content, what regexp can I use? There are several places in the webpage with similar content, except that the one I want has the word Discount a few lines after the current price. 416, 520 and 20% are variable. Thanks.
<tr>
<td class="txt_11px_b_EB6495" width="50" nowrap>Current Price?</td>
<td class="txt_11px_b_EB6495">HK$ 416.00</td>
</tr>
<tr>
<td class="txt_11px_n_999999">Original price?</td>
<td class="txt_11px_n_999999">HK$ 520.00</td>
</tr>
<tr>
<td class="txt_9px_n_999999"> </td>
<td class="txt_9px_n_999999">Discount 20%</td>
</tr>
You can use
" (\d+\.\d*)</td>"
That will match 520.00, 2.00, 123.1, and "123." (the decimal point is required; the digits after it are optional).
Use a HTML parser to get the text node, then extract the price using a regex.
You would use something like...
\d+(?:\.\d{2}|%)
I just tested it and it matched...
416.00
520.00
20%
I assumed (it was unclear to me) that you want both the prices and the percentage discount. I also matched the % so you can tell which of the matches are percentages.
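If only the current price of the discounted block is wanted, one possible regex-only sketch is shown below. It anchors on the "Current Price?" label and uses a lookahead to require a "Discount NN%" cell before any later "Current Price" label. The exact label text, the HK$ prefix and the two-decimal format are taken from the sample and assumed to be stable; discounted_prices is a hypothetical helper name.

```python
import re

page = """<tr>
<td class="txt_11px_b_EB6495" width="50" nowrap>Current Price?</td>
<td class="txt_11px_b_EB6495">HK$ 416.00</td>
</tr>
<tr>
<td class="txt_11px_n_999999">Original price?</td>
<td class="txt_11px_n_999999">HK$ 520.00</td>
</tr>
<tr>
<td class="txt_9px_n_999999"> </td>
<td class="txt_9px_n_999999">Discount 20%</td>
</tr>"""

CURRENT_PRICE = re.compile(
    r'Current Price\?</td>\s*'
    r'<td[^>]*>HK\$\s*(\d+\.\d{2})</td>'
    # Lookahead: "Discount NN%" must appear before any later "Current Price".
    r'(?=(?:(?!Current Price).)*?Discount\s+\d+%)',
    re.DOTALL,
)


def discounted_prices(markup):
    """Return the current prices of blocks that are followed by a Discount cell."""
    return CURRENT_PRICE.findall(markup)


print(discounted_prices(page))
```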