How to extract data by matching a variable with the tag value in python - python-2.7

!--This is the first table from where i get 4 id's (abc1---abc4) which i need to match with the table below and get the required data--!
<table width="100%" border="0" class="BigClass">
<tbody>..</tbody>
</table>
!--This is the second table --!
<table width="100%" border="0" class="BigClass">
<tbody>
<tr align="left">
<td valign="top" colspan="2">
<strong> 1.
First Topic
</strong>
<a name="abc1" id="abc1"></a>
</td>
</tr>
!--This is the place where the first speaker and his/her text comes---!
<tr align="left">
<td style="text-align:justify;line-height:2;padding-right:10px;" colspan="2">
<strong> " First Speaker " </strong>
<br>
" Some Text "
</td>
</tr>
!--This is where the second speaker comes in---!
<tr align="left">
<td style="text-align:justify;line-height:2;padding-right:10px;" colspan="2">
<strong> " Second Speaker " </strong>
<br>
" Some Text "
</td>
</tr>
<tr><td colspan="2"><br></td></tr>
<tr><td colspan="2"><br></td></tr>
!--Then here comes the row with another id--!
<tr align="left">
<td valign="top" colspan="2">
<strong> 2.
Second Topic
</strong>
<a name="abc2" id="abc2"></a>
</td>
</tr>
!--Just like before, this will also have set of speakers who have some text--!
I have two tables with the same class name, BigClass. From the first table I extracted 4 ids: abc1, abc2, abc3 and abc4.
Now I want to check whether these ids are present in the second table (they are).
Once an id matches in the second table, I want to extract the speakers listed under it and the text of those speakers.
The code above shows the structure of the second table, from which I want to extract the data.

It seems the best way to extract the speaker and text information is to collect all ids in one list and all speaker info in another, then cross-reference the required ids to get the corresponding speaker info.
Here I pair the ids with the speaker info by position. I identify the speaker info by the condition that the td element has a style attribute, which is set on every cell containing speaker info.
For extracting info from the HTML, I am using the BeautifulSoup library.
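from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(open('table.html'), 'html.parser')
idList = []
speakerList = []
idsRequired = ['abc1', 'abc2']

# Collect the id of every anchor that has one
for a in soup.findAll('a'):
    if 'id' in a.attrs:
        idList.append(a.attrs['id'])

# Speaker cells are the td elements that carry a style attribute
for i in soup.findAll('td'):
    if 'style' in i.attrs:
        speakerList.append(i.text)

# zip pairs the n-th id with the n-th speaker cell
for key, value in zip(idList, speakerList):
    if key in idsRequired:
        print(value)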
This gives me the output as:
" First speaker "
" Some text "
" Second speaker "
" Some text "

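Pairing the two lists by position only lines up when every id has exactly one speaker cell. As a hedged alternative, here is a minimal sketch that groups each speaker cell under the most recent anchor id seen before it. It swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages; the class name SpeakerCollector and the SAMPLE markup are made up for illustration and only mimic the structure in the question.

```python
from collections import defaultdict
from html.parser import HTMLParser


class SpeakerCollector(HTMLParser):
    """Group styled td cells under the most recent <a id="..."> anchor."""

    def __init__(self):
        super().__init__()
        self.current_id = None          # id of the last anchor seen
        self.in_speaker_cell = False    # True while inside a styled td
        self.buffer = []                # text pieces of the current cell
        self.speakers_by_id = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "id" in attrs:
            self.current_id = attrs["id"]
        elif tag == "td" and "style" in attrs:
            self.in_speaker_cell = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_speaker_cell:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_speaker_cell:
            self.in_speaker_cell = False
            # Collapse runs of whitespace left over from the markup
            text = " ".join(" ".join(self.buffer).split())
            if self.current_id is not None:
                self.speakers_by_id[self.current_id].append(text)


# Hypothetical sample that mimics the second table in the question
SAMPLE = """
<table>
<tr><td><strong>1. First Topic</strong><a name="abc1" id="abc1"></a></td></tr>
<tr><td style="text-align:justify;"><strong>First Speaker</strong><br>Some Text</td></tr>
<tr><td style="text-align:justify;"><strong>Second Speaker</strong><br>Other Text</td></tr>
<tr><td><strong>2. Second Topic</strong><a name="abc2" id="abc2"></a></td></tr>
<tr><td style="text-align:justify;"><strong>Third Speaker</strong><br>More Text</td></tr>
</table>
"""

collector = SpeakerCollector()
collector.feed(SAMPLE)
collector.close()

ids_required = ["abc1", "abc2"]
result = {key: collector.speakers_by_id.get(key, []) for key in ids_required}
for key in ids_required:
    print(key, result[key])
```

This relies on the anchor appearing in the document before the speaker rows it belongs to, which is the case in the markup shown in the question.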
Related

using jsoup to modify data

I have successfully used jsoup to get the HTML from the website, but I am having some trouble displaying the data.
Here is my generated code:
<tr class="2" id="AS 2238_2022-10-18T08:50:00"> <td id=" Air"> <img src="/webfids/logos/AS.jpg" width="138" height="31" title=" Air" alt=" Air"> </td> <td id="2238"> 2238</td> <td id="Phoenix"> Phoenix</td> <td id="1666108200000"> 8:50A 10-18-22</td> <td id="AS 2238_2022-10-18T08:50:00_status"> <font class="default"> On Time </font></td> <td id="AS 2238_2022-10-18T08:50:00_gate">2A</td> <td id="AS 2238_2022-10-18T08:50:00_terminal"> </td> <td id="AS 2238_2022-10-18T08:50:00_codeShares"> </td> <td id="AS 2238_2022-10-18T08:50:00_CDS"> </td> <td id="marker" style="display: none">0</td> </tr>
I am trying to remove the last TD of every row. I have many rows, so I am looping over them.
Here is my code:
rows = TheTable.select("tr");
for ( row in rows ){
    writedump(row.ToString());
    writeoutput('<br><br><br>');
    row.select('##marker').remove();
    row.select("td:eq(0)").attr("rel", "nofollow");
    // writeoutput(image.toString());
}
I am trying to remove the last TD.
I also want to remove the img and keep only its text, such as the title or alt attribute.
for( row in rows ){
    // get the first image object
    image = row.select( "img" )[ 1 ]
    // extract the alt or title text
    imageAlt = image.attr( "alt" )?:image.attr( "title" )?:""
    // replace the image with the extracted text
    image.parent().append( imageAlt )
    image.remove()
    // remove the last column
    row.select( "td" ).last().remove()
}

Xpath - Retrieving Text value when condition contains a tag

I have a section of a table and I am trying to get the value "Distributor 10":
<table class="d">
<tr>
<td class="ah">supplier<td>
<td class="ad">
Supplier 10
</td>
</tr>
<tr>
<td class="ah">distributor<pre><td>
<td class="ad">
Distributor 10
</td>
</tr>
</table>
In Chrome Developer Tools, I get this value by using the following XPath string:
//tr/td[text()="distributor"]/following-sibling::td[@class="ad"]/a/text()
But when I code this in Python, it returns an empty list. From what I can see, it is because of the <pre> tag next to "distributor".
When I amend the above XPath to look for "supplier" instead of "distributor", it works perfectly well.
Any suggestions would be welcome.
Assuming you're using lxml, you can use one of the following XPaths to get this working:
//tr[contains(.,"distributor")]//a/text()
//a[parent::td[@class="ad"] and starts-with(@href,"/D")]/text()
Piece of code:
from lxml import etree
from io import StringIO
html = '''<table class="d">
<tr>
<td class="ah">supplier<td>
<td class="ad">
Supplier 10
</td>
</tr>
<tr>
<td class="ah">distributor<pre><td>
<td class="ad">
Distributor 10
</td>
</tr>
</table>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
data = tree.xpath('//tr[contains(.,"distributor")]//a/text()')
print (data)
Output : ['Distributor 10']
Alternative : use lxml html cleaner class ("remove_tags") to remove the pre element from your page.
References :
https://lxml.de/api/lxml.html.clean.Cleaner-class.html
https://lxml.de/lxmlhtml.html#cleaning-up-html
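The contains(., "distributor") test above works on the row's string value, that is, all of its text joined together, so an extra tag inside the label cell does not get in the way. A minimal sketch of the same idea in plain Python, using xml.etree.ElementTree from the standard library instead of lxml. The TABLE markup is a hypothetical, well-formed stand-in for the table in the question (the links are assumed, and value_for_label is a made-up helper name):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the table in the question
TABLE = """<table class="d">
  <tr><td class="ah">supplier</td><td class="ad"><a>Supplier 10</a></td></tr>
  <tr><td class="ah">distributor<pre></pre></td><td class="ad"><a>Distributor 10</a></td></tr>
</table>"""


def value_for_label(root, label):
    """Return the link text from the first row whose full text contains label."""
    for row in root.iter("tr"):
        # "".join(itertext()) is the row's string value, the same thing
        # that contains(., ...) is tested against in XPath
        if label in "".join(row.itertext()):
            link = row.find(".//a")
            if link is not None:
                return (link.text or "").strip()
    return None


root = ET.fromstring(TABLE)
print(value_for_label(root, "distributor"))
```

Like contains(), the substring test is case sensitive, which is why "distributor" does not match the "Distributor 10" text in the supplier row's neighbour.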

Selenium Python I want to check an element does not have a value I get the error NoSuchElementException: Message: Unable to find element with xpath

I have an HTML table with some rows and columns. I can get the value I want for a row from column 3, which has the value "14".
When a user deletes a record from the GUI, I would like to check that 14 is no longer present.
I get the error:
NoSuchElementException: Message: Unable to find element with xpath == //table[@id="reporting_view_report_dg_main_body"]//tr//td[3]/div/span[@title="14"]
My XPath to find the value is:
usn_id_element = self.get_element(By.XPATH, '//table[@id="reporting_view_report_dg_main_body"]//tr//td[3]/div/span[@title="14"]')
My routine to check that the value is not there is:
def is_usn_id_not_displayed_in_all_records_report_results(self, usn_id):
    # When a record has been disconnected call this method to check the record for usn id is not there anymore.
    usn_id_element = self.get_element(By.XPATH, '//table[@id="reporting_view_report_dg_main_body"]//tr//td[3]/div/span[@title="14"]')
    print("usn_id_element")
    print(usn_id_element)
    print(usn_id_element.text)
    if usn_id not in usn_id_element:
        return True
get_element routine:
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# returns the element if found
def get_element(self, how, what):
    # params how: By locator type
    # params what: locator value
    try:
        element = self.driver.find_element(by=how, value=what)
    except NoSuchElementException as e:
        print(what)
        print("Element not found ")
        print(e)
        # Build a unique screenshot name from the locator type, the locator value and the current date and time
        screenshot_name = how + what + get_datetime_now()
        self.save_screenshot(screenshot_name)
        raise
    return element
The HTML snippet is:
<table id="reporting_view_report_dg_main_body" cellspacing="0" style="table-layout: fixed; width: 100%; margin-bottom: 17px;">
<colgroup>
<tbody>
<tr class="GFNQNVHJM" __gwt_subrow="0" __gwt_row="0"\>
<tr class="GFNQNVHIN" __gwt_subrow="0" __gwt_row="1"\>
<div __gwt_cell="cell-gwt-uid-9530" style="outline-style:none;">
<span title="14" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">14</span>
</div>
</td>
<td class="GFNQNVHIM GFNQNVHKM"\>
<td class="GFNQNVHIM GFNQNVHKM"\>
</tr>
<tr class="GFNQNVHIN" __gwt_subrow="0" __gwt_row="13">
<td class="GFNQNVHIM GFNQNVHJN GFNQNVHLM">
<td class="GFNQNVHIM GFNQNVHJN">
<td class="GFNQNVHIM GFNQNVHJN">
<div __gwt_cell="cell-gwt-uid-9530" style="outline-style:none;">
<span class="" title="14" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">14</span>
</div>
</td>
<td class="GFNQNVHIM GFNQNVHJN"\>
<td class="GFNQNVHIM GFNQNVHJN"\>
</tr>
<tr class="GFNQNVHJM" __gwt_subrow="0" __gwt_row="14"\>
<tr class="GFNQNVHIN" __gwt_subrow="0" __gwt_row="15"\>
</tbody>
</table>
How can I check that the value is not there?
Thanks, Riaz
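One common way to test for absence is to call find_elements (plural), which returns an empty list when nothing matches, instead of find_element, which raises NoSuchElementException. A minimal sketch under assumptions: is_usn_id_absent is a made-up helper name, the string "xpath" is the value of selenium's By.XPATH constant, and FakeDriver is a hypothetical stand-in so the example runs without a browser. With a real WebDriver you would pass self.driver instead.

```python
# "xpath" is the string value of selenium's By.XPATH constant, used here
# so the sketch runs without selenium installed
XPATH = "xpath"


def is_usn_id_absent(driver, usn_id):
    """Return True when no span in column 3 carries the given title."""
    locator = (
        '//table[@id="reporting_view_report_dg_main_body"]'
        '//tr//td[3]/div/span[@title="%s"]' % usn_id
    )
    # find_elements returns an empty list when nothing matches,
    # whereas find_element raises NoSuchElementException
    return len(driver.find_elements(XPATH, locator)) == 0


class FakeDriver(object):
    """Hypothetical stand-in for a WebDriver, only for this demonstration."""

    def __init__(self, titles):
        self.titles = titles

    def find_elements(self, by, value):
        # Pretend a span exists for each known title named in the locator
        return [t for t in self.titles if '@title="%s"' % t in value]


print(is_usn_id_absent(FakeDriver(["13", "14"]), "14"))
print(is_usn_id_absent(FakeDriver(["13"]), "14"))
```

If the grid refreshes asynchronously after the delete, the check may need to be retried or wrapped in an explicit wait before it can be trusted.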
Right now you are checking that the 'title' attribute has a value of 14, not the contents of the cell. What happens after the delete occurs? Does the span remain in the cell? Does the value of the cell become blank, and does the value of the 'title' attribute also become blank?
The XPath below checks that the value of the cell is blank after deletion, assuming you get a blank cell after deletion.
"//table[@id='reporting_view_report_dg_main_body']//tr//td[3]/div/span[.='']"
If you want to check the value of title after deletion:
"//table[@id='reporting_view_report_dg_main_body']//tr//td[3]/div/span[not(@title='14')]"

get values from table with BeautifulSoup Python

I have a table from which I am extracting links and text, although I can only do one or the other. Any idea how to get both?
Essentially I need to pull the text: "TEXT TO EXTRACT HERE"
for tr in rows:
    cols = tr.findAll('td')
    count = len(cols)
    if len(cols) >1:
        third_column = tr.findAll('td')[2].contents
        third_column_text = str(third_column)
        third_columnSoup = BeautifulSoup(third_column_text)
        #issue starts here. How can I get either the text of the elm <td>text here</td> or the href texttext here
        for elm in third_columnSoup.findAll("a"):
            #print elm.text, third_columnSoup
            item = { "code": random.upper(),
                     "name": elm.text }
            items.insert(item )
The HTML Code is the following
<table cellpadding="2" cellspacing="0" id="ListResults">
<tbody>
<tr class="even">
<td colspan="4">sort results: <a href=
"/~/search/af.aspx?some=LOL&Category=All&Page=0&string=&s=a"
rel="nofollow" title=
"sort results in alphabetical order">alphabetical</a> | <strong>rank</strong> ?</td>
</tr>
<tr class="even">
<th>aaa</th>
<th>vvv.</th>
<th>gdfgd</th>
<td></td>
</tr>
<tr class="odd">
<td align="right" width="32">******</td>
<td nowrap width="60"><a href="/aaa.html" title=
"More info and direct link for this meaning...">AAA</a></td>
<td>TEXT TO EXTRACT HERE</td>
<td width="24"></td>
</tr>
<tr class="even">
<td align="right" width="32">******</td>
<td nowrap width="60"><a href="/someLink.html"
title="More info and direct link for this meaning...">AAA</a></td>
<td><a href=
"http://www.fdssfdfdsa.com/aaa">TEXT TO EXTRACT HERE</a></td>
<td width="24">
<a href=
"/~/search/google.aspx?q=lhfjl&f=a&cx=partner-pub-2259206618774155:1712475319&cof=FORID:10&ie=UTF-8"><img border="0"
height="21" src="/~/st/i/find2.gif" width="21"></a>
</td>
</tr>
<tr>
<td width="24"></td>
</tr>
<tr>
<td align="center" colspan="4" style="padding-top:6pt">
<b>Note:</b> We have 5575 other definitions for <strong><a href=
"http://www.ddfsadfsa.com/aaa.html">aaa</a></strong> in our
database</td>
</tr>
</tbody>
</table>
You can just use the text property on a td element:
from bs4 import BeautifulSoup

html = """HERE GOES THE HTML"""
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all('tr'):
    columns = tr.find_all('td')
    # Only rows with at least three td cells have a third column
    if len(columns) > 2:
        print(columns[2].text)
prints:
TEXT TO EXTRACT HERE
TEXT TO EXTRACT HERE
Hope that helps.
The way to do it is by doing the following:
third_column = tr.find_all('td')[2].contents
third_column_text = str(third_column)
third_columnSoup = BeautifulSoup(third_column_text)
if third_columnSoup:
    print(third_columnSoup.text)
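Since the question asks for both the text and the link, here is a hedged sketch that collects the two together for the third td of each row. It uses the standard library's html.parser rather than BeautifulSoup so it runs without extra packages; ThirdColumnParser is a made-up name and ROWS is a trimmed, hypothetical copy of the two data rows from the question.

```python
from html.parser import HTMLParser


class ThirdColumnParser(HTMLParser):
    """Collect the text and any link target of the third td in each row."""

    def __init__(self):
        super().__init__()
        self.td_index = 0       # position of the current td within its row
        self.in_third = False   # True while inside the third td
        self.current = None
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.td_index = 0
        elif tag == "td":
            self.td_index += 1
            if self.td_index == 3:
                self.in_third = True
                self.current = {"name": "", "href": None}
        elif tag == "a" and self.in_third:
            self.current["href"] = dict(attrs).get("href")

    def handle_data(self, data):
        if self.in_third:
            self.current["name"] += data

    def handle_endtag(self, tag):
        if tag == "td" and self.in_third:
            self.in_third = False
            self.current["name"] = self.current["name"].strip()
            self.items.append(self.current)


# Trimmed, hypothetical copy of the two data rows from the question
ROWS = """
<table>
<tr class="odd"><td>******</td><td><a href="/aaa.html">AAA</a></td><td>TEXT TO EXTRACT HERE</td><td></td></tr>
<tr class="even"><td>******</td><td><a href="/someLink.html">AAA</a></td><td><a href="http://www.fdssfdfdsa.com/aaa">TEXT TO EXTRACT HERE</a></td><td></td></tr>
</table>
"""

parser = ThirdColumnParser()
parser.feed(ROWS)
parser.close()
for item in parser.items:
    print(item)
```

Cells without a link simply keep href as None, so the same loop handles both kinds of row. Header cells written as th are not counted by this sketch.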

How to handle dynamically changing id's with similar starting name using Webdriver

I am automating tests for a web application. I have a scenario for creating an admin, for which I have to fill in the name, email address and phone number text boxes. But the ids of these text boxes are dynamic.
userName, id='oe-field-input-41'
Email, id='oe-field-input-42'
phone number, id='oe-field-input-43'
First query:
The numbers in the ids are dynamic and keep changing.
I tried to use XPath to handle the dynamic value:
xpath = //*[starts-with(@id,'oe-field-input-')]
With this, the text is entered into the first text box successfully.
Second query:
I am not able to use the same XPath for the next two text boxes, as it enters the email and phone number into the name field only.
Please help me handle these dynamic values.
Edited: added the HTML code.
<table class="oe_form_group " cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr class="oe_form_group_row">
<td class="oe_form_group_cell oe_form_group_cell_label" width="1%" colspan="1">
<td class="oe_form_group_cell" width="99%" colspan="1">
<span class="oe_form_field oe_form_field_many2one oe_form_field_with_button">
<a class="oe_m2o_cm_button oe_e" tabindex="-1" href="#" draggable="false" style="display: inline;">/</a>
<div>
</span>
</td>
</tr>
<tr class="oe_form_group_row">
<td class="oe_form_group_cell oe_form_group_cell_label" width="1%" colspan="1">
<td class="oe_form_group_cell" width="99%" colspan="1">
<span class="oe_form_field oe_form_field_email">
<div>
<input id="oe-field-input-35" type="text" maxlength="240">
</div>
</span>
</td>
</tr>
<tr class="oe_form_group_row">
<td class="oe_form_group_cell oe_form_group_cell_label" width="1%" colspan="1">
<td class="oe_form_group_cell" width="99%" colspan="1">
<span class="oe_form_field oe_form_field_char">
<input id="oe-field-input-36" type="text" maxlength="32">
</span>
</td>
</tr>
<tr class="oe_form_group_row">
<td class="oe_form_group_cell oe_form_group_cell_label" width="1%" colspan="1">
<td class="oe_form_group_cell" width="99%" colspan="1">
<span class="oe_form_field oe_form_field_char">
<input id="oe-field-input-37" type="text" maxlength="32">
</span>
</td>
</tr>
<tr class="oe_form_group_row">
</tbody>
You can try an alternative way of locating a unique element, by label or similar. For example:
css=.oe_form_group_row:contains(case_sensitive_text) input
xpath=//tr[@class = 'oe_form_group_row'][contains(.,'case_sensitive_text')]//input
If you are using ISFW, you should create a custom component for such form fields.
You do have some classes which are good for identification, e.g. oe_form_field_email, oe_form_field_char. It's a little complicated to use them because they're not on the input fields themselves, and the second one is not unique; but it's quite possible:
.//span[contains(@class, 'oe_form_field_email')]//input
That is an xpath which identifies the Email field as being the input which is a descendant of a span with the oe_form_field_email class. You could also use the same logic in a css selector like this, more efficiently:
span.oe_form_field_email input
For the two other fields, there is no unique class which can tell them apart so you're going to have to rely on the order (I'm assuming username comes before phone number), and that means you have to use xpaths:
(//tr//span[contains(@class, 'oe_form_field_char')])[1]//input
(//tr//span[contains(@class, 'oe_form_field_char')])[2]//input
Those xpaths pick out the first and second fields respectively, which are inputs which are descendants of a span of class oe_form_field_char.
P.S. I used Firepath in firefox to verify the xpath and css locators.
The problem here is that your XPath does the correct selection, but Selenium will always pick the first one if multiple results are returned for your query.
You can select each of the input fields directly by using:
(//input)[1]
(//input)[2]
(//input)[3]
The parentheses matter: //input[2] would select inputs that are the second input child of their own parent, and here each input is alone inside its span or div.
If there are other input fields, you can tighten your selection by selecting only input nodes whose id attribute starts with oe-field-input- like this:
(//input[starts-with(@id,'oe-field-input-')])[1]
(//input[starts-with(@id,'oe-field-input-')])[2]
(//input[starts-with(@id,'oe-field-input-')])[3]
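A positional predicate such as [2] written directly on //input counts position among the siblings under each parent, not position in the whole result. Here is a small sketch of that difference using xml.etree.ElementTree, whose limited XPath subset applies the same per-parent rule. FORM is a hypothetical, well-formed reduction of the form in the question, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed reduction of the form in the question
FORM = """<table>
  <tr class="oe_form_group_row"><td><span class="oe_form_field oe_form_field_email"><div><input id="oe-field-input-35" type="text"/></div></span></td></tr>
  <tr class="oe_form_group_row"><td><span class="oe_form_field oe_form_field_char"><input id="oe-field-input-36" type="text"/></span></td></tr>
  <tr class="oe_form_group_row"><td><span class="oe_form_field oe_form_field_char"><input id="oe-field-input-37" type="text"/></span></td></tr>
</table>"""

root = ET.fromstring(FORM)

# [n] directly on the step is evaluated per parent: every input here is
# the first (and only) input under its parent, so none is ever "second"
per_parent_first = root.findall(".//input[1]")
per_parent_second = root.findall(".//input[2]")
print(len(per_parent_first), len(per_parent_second))

# Indexing the whole document-ordered result instead, which is what the
# parenthesised XPath form (//input[...])[n] does
inputs = [
    el for el in root.iter("input")
    if el.get("id", "").startswith("oe-field-input-")
]
input_ids = [el.get("id") for el in inputs]
print(input_ids)
print(inputs[1].get("id"))
```

In Selenium the equivalent is either the parenthesised XPath or calling find_elements once and indexing the returned list.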
The following XPaths work like a charm, although I don't recommend this kind of XPath. Since there is no text next to the text boxes, there is no other choice.
//div/input[contains(@id, 'oe-field-input')] - First text box
//tr[@class = 'oe_form_group_row'][2]//input - Second text box
//tr[@class = 'oe_form_group_row'][3]//input - Third text box
You can use the XPaths below.
//tr[@class = 'oe_form_group_row'][2]//input for the first text box
//tr[@class = 'oe_form_group_row'][3]//input for the second text box
//tr[@class = 'oe_form_group_row'][4]//input for the third text box
I have tested the above XPaths.
But the better way, if you have development access, is to ask the developers to standardize the markup and add recommended attributes like "name" or "value", or attach text such as Email: or Password, so that you can use these in your XPath.