Python 2.7 - BeautifulSoup and Tables


I have this source:
<tr id="bitstampUSD">
<td class="arrow" change="up" latest_trade="1363480722">
<span class="down">▼</span>
</td>
<td class="symbol">
<nobr>
bitstampUSD
</nobr>
<span class="sub">USD (SEPA converted)</span>
</td>
<td>46.74
<span class="sub">41 min ago</span>
</td>
<td class="minichart break">
<span volume="**whole heaps of number here that I want**"
print="**more numbers I want**"
avg="**more numbers I want**"
class="marketsparkline"></span>
</td>
<td>**36.39**
<span class="sub change">**10.35 28.46%**</span>
</td>
<td>**141,043.10**
<span class="sub">**5,132,052.22 USD**</span>
</td>
<td>**25.25**
<span class="sub">**46.58** (24h)</span>
</td>
<td>**49.17**
<span class="sub">47 (24h)</span>
</td>
<td class="break">**46.7**</td>
<td>**46.74**</td>
<td class="break">**46.78**
<span class="sub change">-0.04 -0.09%</span>
</td>
<td>**819.54**
<span class="sub">**38,340.96** USD</span>
</td>
</tr>
So I want to get the data in bold. (Well, it was supposed to be in bold, but I guess the code tags stop that from happening, so it is the data wrapped in double asterisks.)
In other code, which I didn't include here, I managed to work out how to get the values because they were inside classed elements. But here some of the data sits outside any class, so I don't know how to grab it.
It may help to look at the entire source: http://bitcoincharts.com/markets/
It's laid out differently from other table code I've seen before.

Well, this outputs a bit more than you requested, but should get you started:
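from bs4 import BeautifulSoup

# f is an open file (or a string) containing the page's HTML
soup = BeautifulSoup(f)
# the sparkline span carries its numbers as attributes
for td in soup.find_all('td', class_='minichart break'):
    avg = td.span['avg']
    print_ = td.span['print']
    volume = td.span['volume']
    print avg, print_, volume
# text of every cell, split on whitespace
for td in soup.find_all('td'):
    print td.text.split()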
On your example I obtain:
**more numbers I want** **more numbers I want** **whole heaps of number here that I want**
[u'\u25bc']
[u'bitstampUSD', u'USD', u'(SEPA', u'converted)']
[u'46.74', u'41', u'min', u'ago']
[]
[u'**36.39**', u'**10.35', u'28.46%**']
[u'**141,043.10**', u'**5,132,052.22', u'USD**']
[u'**25.25**', u'**46.58**', u'(24h)']
[u'**49.17**', u'47', u'(24h)']
[u'**46.7**']
[u'**46.74**']
[u'**46.78**', u'-0.04', u'-0.09%']
[u'**819.54**', u'**38,340.96**', u'USD']
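
If you only want the number that sits directly inside each td, without the text of the nested span, one option is to track nesting depth while parsing. The following is a minimal sketch that swaps in the standard library's html.parser (Python 3) instead of BeautifulSoup. The CellParser class, the trimmed ROW fixture, and the sparkline values 1,2,3 / 4,5,6 / 7,8,9 are illustrative assumptions, since the real numbers are elided in the question.

```python
from html.parser import HTMLParser

# Trimmed copy of the row from the question; the sparkline values are
# placeholders, because the real numbers are elided in the question.
ROW = """
<tr id="bitstampUSD">
<td class="symbol"><nobr>bitstampUSD</nobr><span class="sub">USD (SEPA converted)</span></td>
<td>46.74 <span class="sub">41 min ago</span></td>
<td class="minichart break"><span volume="1,2,3" print="4,5,6" avg="7,8,9" class="marketsparkline"></span></td>
<td>36.39 <span class="sub change">10.35 28.46%</span></td>
<td class="break">46.7</td>
</tr>
"""


class CellParser(HTMLParser):
    """Collects each <td>'s own text and the sparkline span's attributes."""

    def __init__(self):
        super().__init__()
        self.cells = []      # text sitting directly inside each <td>
        self.sparkline = {}  # volume / print / avg attributes
        self._depth = 0      # 0 = outside any <td>, 1 = directly inside one

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._depth = 1
            self.cells.append("")
        elif self._depth:
            self._depth += 1  # a child element such as <span> or <nobr>
            attrs = dict(attrs)
            if tag == "span" and attrs.get("class") == "marketsparkline":
                self.sparkline = {k: attrs[k] for k in ("volume", "print", "avg")}

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 1:  # ignore text nested in child elements
            self.cells[-1] += data


parser = CellParser()
parser.feed(ROW)
parser.close()
own_text = [cell.strip() for cell in parser.cells]
print(own_text)
print(parser.sparkline)
```

The depth counter assumes every child element inside a td has a closing tag, which holds for this row; markup with void tags such as br inside a cell would need extra handling.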

Related

XPath to find specific rows and related values based on a passed parameter

I'm trying to grab the value from the lights node, based on a house number set in a parameter. The problem is, based on certain conditions, houses may be in different row positions.
If the house number passed to me in the parameter is House237, how do I get the number of lights located in the row-2-Lights node?
Also, how do I do the same if, on the next run, the house number is House867? Below is my HTML:
<?xml version='1.0' encoding='utf-8'?>
<table id="neighborhood">
<tr onmouseover="leave('1')">
<td id="row-1-house">
<div class="houseCol">
<a href="#" onClick="goHome('867');return false">
House867
</a>
</div>
</td>
<td id="row-1-Lights">
<div class="decimal">14</div>
</td>
</tr>
<tr onmouseover="leave('2')">
<td id="row-2-house">
<div class="houseCol">
<a href="#" onClick="goHome('237');return false">
House237
</a>
</div>
</td>
<td id="row-2-Lights">
<div class="decimal">12</div>
</td>
</tr>
</table>

You can try the following XPath-1.0 expression. The parameter is the 'HouseXXX' string, the child of the a element.
/table[@id='neighborhood']/tr[td/div[@class='houseCol']/a[normalize-space(text())='House237']]/td[contains(@id,'Lights')]/div[@class='decimal']/text()
The output of this is
12
In this example the parameter is set to 'House237'. How you incorporate the parameter into the XPath expression depends on your use case.
For example, in XSLT you would replace 'House237' with a variable like $HouseNumber to set the parameter.
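
Outside XSLT, the parameter can simply be a function argument. The sketch below is one possible way to do the lookup in Python 3 with the standard library's xml.etree.ElementTree. That module supports only a subset of XPath (no normalize-space or contains), so the text comparison is done in Python instead of in the expression. The function name lights_for and the trimmed XML fixture are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the table from the question (event attributes dropped).
XML = """<table id="neighborhood">
  <tr>
    <td id="row-1-house"><div class="houseCol"><a href="#">
      House867
    </a></div></td>
    <td id="row-1-Lights"><div class="decimal">14</div></td>
  </tr>
  <tr>
    <td id="row-2-house"><div class="houseCol"><a href="#">
      House237
    </a></div></td>
    <td id="row-2-Lights"><div class="decimal">12</div></td>
  </tr>
</table>"""


def lights_for(root, house):
    """Return the lights value for the row whose link text equals `house`."""
    for tr in root.findall("tr"):
        link = tr.find("td/div[@class='houseCol']/a")
        # strip() plays the role of normalize-space() in the XPath above
        if link is None or (link.text or "").strip() != house:
            continue
        for td in tr.findall("td"):
            if "Lights" in td.get("id", ""):
                div = td.find("div[@class='decimal']")
                return div.text if div is not None else None
    return None


root = ET.fromstring(XML)
print(lights_for(root, "House237"))
print(lights_for(root, "House867"))
```

Because the row is found by its link text rather than its position, the result does not depend on which row a house ends up in.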

Python Selenium Click on object inside specific row

I have a site whose HTML looks like this:
<tr role="row" class="odd">
<td class="sorting_1">555</td>
<td>
FruitType1 : Fruit1
</td>
<td>Fruit1</td>
<td>FruitType1</td>
<td>Somwhere</td>
<td></td>
<td>0</td>
<td>
<button class="copy_button btn_gray_inverse" id="555">Copy</button>
</td>
<td>
<button class="fruit check_btn" id="555" href="" value="0">
<i class="bt_check"></i>
</button>
</td>
<td>
<a class="fruit remove_btn" id="555" href="#">
<i class="bt_remove">
::before
</i>
</a>
</td>
</tr>
I'm trying to click the button (<button class="fruit check_btn" id="555" href="" value="0">) inside this specific row. Rows differ only in the text under tr (Fruit1, Fruit2). I tried the following code, but it's not working:
FruitList = self.driver.find_elements_by_xpath(
    '//tr[@role="row" and contains(., "Fruit1")]')
for Fruit in FruitList:
    Enabled = Fruit.find_element_by_xpath('//i[@class="bt_check"]')
    Enabled.click()
It always clicks the button in the first available row on the page, not the one in the row containing the text "Fruit1".
Please help.

To find the check button for Fruit1, I suggest doing it in two steps.
First, find the row for Fruit1:
fruit_row = driver.find_element_by_xpath("//td[text()='Fruit1']/..")
Note the /.. at the end of the XPath: it selects the tr element that contains the td with the text 'Fruit1', rather than the td itself.
Second, find the button inside that row and click it:
fruit_row.find_element_by_class_name("check_btn").click()
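
The key point in both steps is that the second search is scoped to the row. In XPath, an expression starting with // is evaluated from the document root even when it is called on an element, which is why the question's inner lookup always hits the first row; a row-relative expression starts with .// instead. The same "find the row, then search only inside it" idea can be tried without a browser. The sketch below uses the standard library's xml.etree.ElementTree (Python 3) rather than Selenium; the two-row TABLE fixture, the second row's id 556, and the helper check_button_for are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row table modelled on the question's markup.
# The Fruit1 row is deliberately NOT the first row.
TABLE = """<table>
  <tr role="row" class="odd">
    <td>Fruit2</td>
    <td><button class="fruit check_btn" id="556" value="0"><i class="bt_check"/></button></td>
  </tr>
  <tr role="row" class="even">
    <td>Fruit1</td>
    <td><button class="fruit check_btn" id="555" value="0"><i class="bt_check"/></button></td>
  </tr>
</table>"""


def check_button_for(root, fruit):
    """Step 1: find the row for `fruit`. Step 2: search only inside that row."""
    for row in root.iter("tr"):
        cells = [(td.text or "").strip() for td in row.findall("td")]
        if fruit in cells:
            # './/' keeps the search relative to this row
            return row.find(".//button[@class='fruit check_btn']")
    return None


root = ET.fromstring(TABLE)
button = check_button_for(root, "Fruit1")
print(button.get("id"))
```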

To locate the <td> whose text is Fruit1, you can use the following line of code:
driver.find_element_by_xpath("//td/a[contains(@href,'/fruit_list/555')]//following::td[text()='Fruit1']")
A more generic version is:
driver.find_element_by_xpath("//td/a[contains(@href,'/fruit_list/555')]//following::td[1]")

How to extract parent and nested items in single regular expression?

I want to use regular expression to match the following html table:
<tbody class=\"DocTableBody \">
<tr data-fastRow=\"1\" class=\"DataRow TDRE\">
<td id=\"g-f-1\" class=\"TDC FieldDisabled Field TCLeft CellText g-f\" >
<div class=\"DTC\">
<label id=\"c_g-f-1\" class=\"DCC\" >01-Apr-2015</label>
</div>
</td>
<td id=\"g-g-1\" class=\"TDC FieldDisabled Field TCLeft CellTextHtml g-g\" >
<div class=\"DTC\">
<label id=\"c_g-g-1\" class=\"DCC\" >ACTIVE</label>
</div>
</td>
</tr>
<tr data-fastRow=\"2\" class=\"DataRow TDRO\">
<td id=\"g-f-2\" class=\"TDC FieldDisabled Field TCLeft CellText g-f\" >
<div class=\"DTC\">
<label id=\"c_g-f-2\" class=\"DCC\" >01-Apr-2015</label>
</div>
</td>
<td id=\"g-g-2\" class=\"TDC FieldDisabled Field TCLeft CellTextHtml g-g\" >
<div class=\"DTC\">
<label id=\"c_g-g-2\" class=\"DCC\" >ACTIVE</label>
</div>
</td>
</tr>
</tbody>
I expected to extract the following value:
"1"
01-Apr-2015
ACTIVE
"2"
01-Apr-2015
ACTIVE
I tried the following to extract the value in data-fastRow:
(?sUi)<tr data-fastRow=\\"(\d+)\\".+>.*<\/tr>
But I couldn't extract the nested items in <label.+>(.*)</label> in single regular expression.
Is that possible to extract parent and nested items in single regular expression?

Parsing HTML with regular expressions is a really bad idea.
Every language has its own libraries for parsing HTML.
In Python, for example, you have BeautifulSoup.
It's far better to use such a library.
These libraries usually offer a jQuery-selector-like interface (or something similar), which lets you find your data with very simple queries.
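
As a rough illustration of the parser-based approach, here is a minimal sketch that pulls the row number and the nested label texts in one pass. It uses html.parser from the Python 3 standard library rather than BeautifulSoup, so it runs without extra installs. The class name RowLabelParser is made up for this example, and the fixture assumes the backslash-escaped quotes in the question have already been unescaped and trims the class lists. Note that html.parser lowercases attribute names, so data-fastRow arrives as data-fastrow.

```python
from html.parser import HTMLParser

# The table from the question, with the \" escapes removed and classes trimmed.
HTML = """<tbody class="DocTableBody ">
<tr data-fastRow="1" class="DataRow TDRE">
<td id="g-f-1" class="TDC g-f" ><div class="DTC"><label id="c_g-f-1" class="DCC" >01-Apr-2015</label></div></td>
<td id="g-g-1" class="TDC g-g" ><div class="DTC"><label id="c_g-g-1" class="DCC" >ACTIVE</label></div></td>
</tr>
<tr data-fastRow="2" class="DataRow TDRO">
<td id="g-f-2" class="TDC g-f" ><div class="DTC"><label id="c_g-f-2" class="DCC" >01-Apr-2015</label></div></td>
<td id="g-g-2" class="TDC g-g" ><div class="DTC"><label id="c_g-g-2" class="DCC" >ACTIVE</label></div></td>
</tr>
</tbody>"""


class RowLabelParser(HTMLParser):
    """Groups <label> texts under the data-fastRow value of their <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of (row_number, [label texts])
        self._in_label = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # html.parser lowercases attribute names
        if tag == "tr" and "data-fastrow" in attrs:
            self.rows.append((attrs["data-fastrow"], []))
        elif tag == "label" and self.rows:
            self._in_label = True

    def handle_endtag(self, tag):
        if tag == "label":
            self._in_label = False

    def handle_data(self, data):
        if self._in_label and data.strip():
            self.rows[-1][1].append(data.strip())


parser = RowLabelParser()
parser.feed(HTML)
parser.close()
for row_number, labels in parser.rows:
    print(row_number, labels)
```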

Tree-like matches in regex with a fixed chain

I have a very specific task to achieve with a single regex.
Here's the pattern of the text I have to extract the data from (note that I'm parsing HTML-like code, stored in an immutable file):
<tr>
<td > <a ><img /></a>
</td>
<td > <a ><span >RootData</span></a>
</td>
<td > Data1.1
</td>
<td > <a ><img /></a>
</td>
<td > <a ><span >Data1.2</span></a>
</td>
<td >  
</td></tr>
<tr>
<td > Data2.1
</td>
<td > <a ><img /></a>
</td>
<td > <a ><span >Data2.2</span></a>
</td>
<td >  
</td></tr>
...
First there's a root contained inside the first "tr". Still inside this one, there's some data (Data1.1 and Data1.2) to extract.
Then comes a finite number of "tr" blocks, each containing data to extract.
I'd like the matches to look like this:
match 1 : 'RootData' 'Data1.1' 'Data1.2'
match 2 : 'RootData' 'Data2.1' 'Data2.2'
etc.
So far I can see how to do it with 2 regexes and 2 loops (one searching for the root, the other finding all the data under this root), but I'd like it to be a single regex.
If any of you have already encountered this and could help, that would be nice :)
Thanks in advance.

If I understand you correctly, you'd like to have a single regular expression provide more than one match for the same input. Regular expressions do not work that way, and are probably just not the right tool for the problem you're trying to solve.
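
One alternative, sketched here under assumptions, is to read the rows with a parser and attach the root value afterwards. The example uses html.parser from the Python 3 standard library; the names RowTextParser and root_matches are invented for illustration, and it assumes the root is the first non-empty text in the first row, with everything else in each row being data.

```python
from html.parser import HTMLParser

SNIPPET = """
<tr>
<td > <a ><img /></a>
</td>
<td > <a ><span >RootData</span></a>
</td>
<td > Data1.1
</td>
<td > <a ><img /></a>
</td>
<td > <a ><span >Data1.2</span></a>
</td>
<td > &nbsp;
</td></tr>
<tr>
<td > Data2.1
</td>
<td > <a ><img /></a>
</td>
<td > <a ><span >Data2.2</span></a>
</td>
<td > &nbsp;
</td></tr>
"""


class RowTextParser(HTMLParser):
    """Collects the non-empty text chunks of each <tr>, in document order."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])

    def handle_data(self, data):
        text = data.strip()  # also drops cells that hold only &nbsp;
        if text and self.rows:
            self.rows[-1].append(text)


def root_matches(html):
    """Return one (root, data...) tuple per row, repeating the root each time."""
    parser = RowTextParser()
    parser.feed(html)
    parser.close()
    rows = [row for row in parser.rows if row]
    if not rows:
        return []
    # Assumption: the first text in the first row is the root.
    root, first = rows[0][0], rows[0][1:]
    return [(root, *cells) for cells in [first] + rows[1:]]


for match in root_matches(SNIPPET):
    print(match)
```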

How to extract data by matching a variable with the tag value in python

!--This is the first table from where i get 4 id's (abc1---abc4) which i need to match with the table below and get the required data--!
<table width="100%" border="0" class=""BigClass">
<tbody>..</tbody>
</table>
!--This is the second table --!
<table width="100%" border="0" class=""BigClass">
<tbody>
<tr align="left">
<td valign="top" colspan="2">
<strong> 1.
First Topic
</strong>
<a name="abc1" id="abc1"></a>
</td>
</tr>
!--This is the place where the first speaker and his/her text comes---!
<tr align="left">
<td style="text-align:justify;line-height:2;padding-right:10px;" colspan="2">
<strong> " First Speaker " </strong>
<br>
" Some Text "
</td>
</tr>
!--This is where the second speaker comes in---!
<tr align="left">
<td style="text-align:justify;line-height:2;padding-right:10px;" colspan="2">
<strong> " Second Speaker " </strong>
<br>
" Some Text "
</td>
</tr>
<tr><td colspan="2"><br></td></tr>
<tr><td colspan="2"><br></td></tr>
!--Then here comes the row with another id--!
<tr align="left">
<td valign="top" colspan="2">
<strong> 2.
Second Topic
</strong>
<a name="abc2" id="abc2"></a>
</td>
</tr>
!--Just like before, this will also have set of speakers who have some text--!
I have two tables with the same class name, BigClass. From the first table I extracted 4 ids: abc1, abc2, abc3, abc4.
Now I want to check whether these ids are present in the second table (which they are).
Once an id matches one in the second table, I want to extract the speakers and the text of those speakers.
You can see above the code structure of the second table, from which I want to extract the data.

It seems the best way to extract speaker and text information is to extract all ids in a list and all speaker info in another list. Then just cross-reference the ids needed and get the corresponding speaker info.
I create a dictionary here with key as ids and value as speaker info. I found the speaker info by the condition that the td field has a style attribute defined in all fields containing speaker info.
For extracting info from HTML, I am using the BeautifulSoup library.
from bs4 import BeautifulSoup
from itertools import izip

soup = BeautifulSoup(open('table.html'))
idList = []
speakerList = []
idsRequired = ['abc1', 'abc2']
# collect the id of every anchor that has one
for a in soup.findAll('a'):
    if 'id' in a.attrs.keys():
        idList.append(a.attrs['id'])
# speaker cells are the td elements that carry a style attribute
for i in soup.findAll('td'):
    if 'style' in i.attrs.keys():
        speakerList.append(i.text)
# pair ids with speaker cells by position and print the required ones
for key, value in izip(idList, speakerList):
    if key in idsRequired:
        print value
This gives me the output as:
" First speaker "
" Some text "
" Second speaker "
" Some text "

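Pairing the two lists by position works only while each id lines up with exactly one speaker cell. When a topic has several speakers, a variation is to walk the table in document order and file each speaker cell under the most recently seen id. The sketch below shows that idea with html.parser from the Python 3 standard library instead of BeautifulSoup. The class name SpeakerParser, the shortened style values, and the "Third Speaker" row under abc2 are assumptions added for illustration.

```python
from html.parser import HTMLParser

# Condensed version of the second table; the abc2 speaker row is hypothetical.
TABLE = """<table width="100%" border="0" class="BigClass"><tbody>
<tr><td colspan="2"><strong>1. First Topic</strong><a name="abc1" id="abc1"></a></td></tr>
<tr><td style="text-align:justify;" colspan="2"><strong>First Speaker</strong><br>Some Text</td></tr>
<tr><td style="text-align:justify;" colspan="2"><strong>Second Speaker</strong><br>Some Text</td></tr>
<tr><td colspan="2"><br></td></tr>
<tr><td colspan="2"><strong>2. Second Topic</strong><a name="abc2" id="abc2"></a></td></tr>
<tr><td style="text-align:justify;" colspan="2"><strong>Third Speaker</strong><br>Other Text</td></tr>
</tbody></table>"""


class SpeakerParser(HTMLParser):
    """Files each styled <td> under the id of the last <a id=...> seen."""

    def __init__(self):
        super().__init__()
        self.by_topic = {}   # id -> list of speaker-cell texts
        self._topic = None   # id of the current topic
        self._buf = None     # text chunks of the speaker cell being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "id" in attrs:
            self._topic = attrs["id"]
            self.by_topic.setdefault(self._topic, [])
        elif tag == "td" and "style" in attrs and self._topic is not None:
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._buf is not None:
            # join the chunks and collapse runs of whitespace
            text = " ".join(" ".join(self._buf).split())
            self.by_topic[self._topic].append(text)
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


parser = SpeakerParser()
parser.feed(TABLE)
parser.close()
ids_required = ["abc1", "abc2"]
for topic_id in ids_required:
    print(topic_id, parser.by_topic.get(topic_id, []))
```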