get values from table with BeautifulSoup Python

get values from table with BeautifulSoup Python - python-2.7

I have a table where I am extracting links and text. Although I can only do one or the other. Any idea how to get both?
Essentially I need to pull the text: "TEXT TO EXTRACT HERE"
for tr in rows:
cols = tr.findAll('td')
count = len(cols)
if len(cols) >1:
third_column = tr.findAll('td')[2].contents
third_column_text = str(third_column)
third_columnSoup = BeautifulSoup(third_column_text)
#issue starts here. How can I get either the text of the elm <td>text here</td> or the href texttext here
for elm in third_columnSoup.findAll("a"):
#print elm.text, third_columnSoup
item = { "code": random.upper(),
"name": elm.text }
items.insert(item )
The HTML Code is the following
<table cellpadding="2" cellspacing="0" id="ListResults">
<tbody>
<tr class="even">
<td colspan="4">sort results: <a href=
"/~/search/af.aspx?some=LOL&Category=All&Page=0&string=&s=a"
rel="nofollow" title=
"sort results in alphabetical order">alphabetical</a> | <strong>rank</strong> ?</td>
</tr>
<tr class="even">
<th>aaa</th>
<th>vvv.</th>
<th>gdfgd</th>
<td></td>
</tr>
<tr class="odd">
<td align="right" width="32">******</td>
<td nowrap width="60"><a href="/aaa.html" title=
"More info and direct link for this meaning...">AAA</a></td>
<td>TEXT TO EXTRACT HERE</td>
<td width="24"></td>
</tr>
<tr class="even">
<td align="right" width="32">******</td>
<td nowrap width="60"><a href="/someLink.html"
title="More info and direct link for this meaning...">AAA</a></td>
<td><a href=
"http://www.fdssfdfdsa.com/aaa">TEXT TO EXTRACT HERE</a></td>
<td width="24">
<a href=
"/~/search/google.aspx?q=lhfjl&f=a&cx=partner-pub-2259206618774155:1712475319&cof=FORID:10&ie=UTF-8"><img border="0"
height="21" src="/~/st/i/find2.gif" width="21"></a>
</td>
</tr>
<tr>
<td width="24"></td>
</tr>
<tr>
<td align="center" colspan="4" style="padding-top:6pt">
<b>Note:</b> We have 5575 other definitions for <strong><a href=
"http://www.ddfsadfsa.com/aaa.html">aaa</a></strong> in our
database</td>
</tr>
</tbody>
</table>

You can just use the text property on a td element:
from bs4 import BeautifulSoup
html = """HERE GOES THE HTML"""
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all('tr'):
columns = tr.find_all('td')
if len(columns) > 2:
print columns[2].text
prints:
TEXT TO EXTRACT HERE
TEXT TO EXTRACT HERE
Hope that helps.

The way to do it is by doing the following:
third_column = tr.find_all('td')[2].contents
third_column_text = str(third_column)
third_columnSoup = BeautifulSoup(third_column_text)
if third_columnSoup:
print third_columnSoup.text

Related

How to find the element part of the anchor tag

I am totally new to the selenium. Please accept apologies for asking daft or silly question.
I have below on the website. What I am interested is that how can I get the data-selectdate value using selenium + python . Once I have the data-selectdate value, I would like to compare this against the the date I am interested in.
You help is deeply appreciated.
Note: I am not using Beautiful soup or anything.
Cheers
<table role="grid" tabindex="0" summary="October 2018">
<thead>
<tr>
<th role="columnheader" id="dayMonday"><abbr title="Monday">Mon</abbr></th>
<th role="columnheader" id="dayTuesday"><abbr title="Tuesday">Tue</abbr></th>
<th role="columnheader" id="dayWednesday"><abbr title="Wednesday">Wed</abbr></th>
<th role="columnheader" id="dayThursday"><abbr title="Thursday">Thur</abbr></th>
<th role="columnheader" id="dayFriday"><abbr title="Friday">Fri</abbr></th>
<th role="columnheader" id="daySaturday"><abbr title="Saturday">Sat</abbr></th>
</tr>
</thead>
<tbody>
<tr>
<td role="gridcell" headers="dayMonday">
<a data-selectdate="2018-10-22T00:00:00+01:00" data-selected="false" id="day22"
class="day-appointments-available">22</a>
</td>
<td role="gridcell" headers="dayTuesday">
<a data-selectdate="2018-10-23T00:00:00+01:00" data-selected="false" id="day23"
class="day-appointments-available">23</a>
</td>
<td role="gridcell" headers="dayWednesday">
<a data-selectdate="2018-10-24T00:00:00+01:00" data-selected="false" id="day24"
class="day-appointments-available">24</a>
</td>
<td role="gridcell" headers="dayThursday">
<a data-selectdate="2018-10-25T00:00:00+01:00" data-selected="false" id="day25"
class="day-appointments-available">25</a>
</td>
<td role="gridcell" headers="dayFriday">
<a data-selectdate="2018-10-26T00:00:00+01:00" data-selected="false" id="day26"
class="day-appointments-available">26</a>
</td>
<td role="gridcell" headers="daySaturday">
<a data-selectdate="2018-10-27T00:00:00+01:00" data-selected="false" id="day27"
class="day-appointments-available">27</a>
</td>
</tr>
</tbody>
</table>

To get the values of the attribute data-selectdate you can use the following solution:
elements = driver.find_elements_by_css_selector("table[summary='October 2018'] tbody td[role='gridcell'][headers^='day']>a")
for element in elements:
print(element.get_attribute("data-selectdate"))

You can use get_attribute api of element class to read attribute value of element.
css_locator = "table tr:nth-child(1) > td[headers='dayMonday'] > a"
ele = driver.find_element_by_css_selector(css_locator)
selectdate = ele.get_attribute('data-selectdate')

Multidimensional list in Thymeleaf (Java) - List<List<Object>>

Have anyone of you any suggestion, how to iterate through a multidimensional list in Thymeleaf?
My multidimensional list looks as follow:
#Override
public List<List<PreferredZone>> findZonesByPosition(List<Position> positionList) {
List <PreferredZone> prefZone = new ArrayList<>();
List<List<PreferredZone>> listPrefZone = new ArrayList<>();
long positionId = 0;
for (int i = 0; i < positionList.size(); i++) {
positionId = positionList.get(i).getPositionId();
prefZone = prefZoneDAO.findFilteredZone(positionId);
listPrefZone.add(prefZone);
}
return listPrefZone;
}
In my controller as attribute:
List<List<PreferredZone>> prefZoneList = prefZoneService.findZonesByPosition(positionList);
model.addAllAttributes(prefZoneList);
Finally I try to iterate this two dimensional list in a HTML table:
<table th:each="prefList :#{prefZoneList}" class="table table-striped display hover">
<thead>
<tr>
<th>ISO</th>
<th>Name</th>
<th>Ausschluss</th>
</tr>
</thead>
<!-- Loop für die Daten -->
<tr th:each="row, iterState :${prefList}" class="clickable-row">
<td th:text="${row[__${iterState.index}__]}.zoneIso"></td>
<td th:text="${row[__${iterState.index}__]}.zoneName"></td>
<td style="text-align:center;">
<input type="checkbox" th:value="${${row[__${iterState.index}__]}.zoneId}" id="zone" class="checkbox-round" />
</td>
</tr>
</table>
It doesn't work however. I don't have any other idea how to solve this.
I have to have a multidimensional list, because I have got a table with multiple records and each record contains a button to open a modal window. Each of this windows contains either a HTML table where I have to display the records.
Have you got any suggestion for me?

You have a mistake in #{prefZoneList} and (as noted in comments) in using iterState.index
Try it:
<table th:each="prefList : ${prefZoneList}" class="table table-striped display hover">
<thead>
<tr>
<th>ISO</th>
<th>Name</th>
<th>Ausschluss</th>
</tr>
</thead>
<tr th:each="row : ${prefList}" class="clickable-row">
<td th:text="${row.zoneIso}"></td>
<td th:text="${row.zoneName}"></td>
<td style="text-align:center;">
<input type="checkbox" th:value="${row.zoneId}" id="zone" class="checkbox-round" />
</td>
</tr>
</table>
Syntax #{...} - a message Expressions
iterState.index is the current iteration index, starting with 0, using like ${prefList[__${iterState.index}__].element} where element - filed in prefList.

BeautifulSoup my for loop is printing all the data from the td tag. I would like to exclude the last section of the td tag

I have the following sample HTML table from a html file.
<table>
<tr>
<th>Class</th>
<th class="failed">Fail</th>
<th class="failed">Error</th>
<th>Skip</th>
<th>Success</th>
<th>Total</th>
</tr>
<tr>
<td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td>
<td class="failed">1</td>
<td class="failed">9</td>
<td>0</td>
<td>219</td>
<td>229</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td class="failed">1</td>
<td class="failed">9</td>
<td>0</td>
<td>219</td>
<td>229</td>
</tr>
</table>
I am trying to print the text from the <td> tags where <td> starts from:
Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2
I do not want to include the text from the <td> tags where <td> starts from:
<td><strong>Total</strong></td>
My code is printing the text from every single <td> tag:
def extract_data_from_report():
html_report = open(r"E:\SeleniumTestReport.html",'r').read()
soup = BeautifulSoup(html_report, "html.parser")
th = soup.find_all('th')
td = soup.find_all('td')
for item in th:
print item.text,
print "\n"
for item in td:
print item.text,
My desired output:
Class Fail Error Skip Success Total
Regression_TestCase 1 9 0 219 229

You can find all rows (tr elements) except the first one (to skip the headers) and the last one - the "total" row. Sample implementation that produces a list of dictionaries as a result:
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<table>
<tr>
<th>Class</th>
<th class="failed">Fail</th>
<th class="failed">Error</th>
<th>Skip</th>
<th>Success</th>
<th>Total</th>
</tr>
<tr>
<td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td>
<td class="failed">1</td>
<td class="failed">9</td>
<td>0</td>
<td>219</td>
<td>229</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td class="failed">1</td>
<td class="failed">9</td>
<td>0</td>
<td>219</td>
<td>229</td>
</tr>
</table>"""
soup = BeautifulSoup(data, "html.parser")
headers = [header.get_text(strip=True) for header in soup.find_all("th")]
rows = [dict(zip(headers, [td.get_text(strip=True) for td in row.find_all("td")]))
for row in soup.find_all("tr")[1:-1]]
pprint(rows)
Prints:
[{u'Class': u'Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2',
u'Error': u'9',
u'Fail': u'1',
u'Skip': u'0',
u'Success': u'219',
u'Total': u'229'}]

Repeat Groups to form objects

I have a html table like this:
<table style="width:100%">
<tr>
<td class="country">Germany</td>
</tr>
<tr>
<td class="city">Berlin</td>
</tr>
<tr>
<td class="city">Cologne</td>
</tr>
<tr>
<td class="city">Munich</td>
</tr>
<tr>
<td class="country">France</td>
</tr>
<tr>
<td class="city">Paris</td>
</tr>
<tr>
<td class="country">USA</td>
</tr>
<tr>
<td class="city">New York</td>
</tr>
<tr>
<td class="city">Las Vegas</td>
</tr>
</table>
From this table, I want to generate Objects like the classes Country and City. Country would have a List of Cities.
Now to the problem:
It's easy to create a regex to get all countries and all cities, but i wonder if i can get groups for the cities to repeat until the next country starts? I need to do this, because I can't figure out programmatically which city belongs to which country if I have them in seperated regex-matches.
It should be like (quick&dirty solution):
country">([\w]*)<{.*\n.*\n.*\n.*"city">([\w]*)}
the curly braces should be repeated until the next country item shows up.
If you have a completely different idea on how to get objects out of a html table in c#, let me know!
Thanks in advance!

Agree that for any non-trivial HTML a HTML parser like HtmlAgilityPack should be used. With that said, if your HTML is as simple as the snippet above, this works, even if there are multiple line breaks in the string:
string HTML = #"
<table style='width:100%'>
<tr><td class='country'>Germany</td></tr>
<tr><td class='city'>Berlin</td></tr>
<tr><td class='city'>Cologne</td></tr>
<tr><td class='city'>Munich</td></tr>
<tr><td class='country'>France</td></tr>
<tr><td class='city'>Paris</td></tr>
<tr><td class='country'>USA</td></tr>
<tr><td class='city'>New York</td></tr>
<tr><td class='city'>Las Vegas</td></tr>
</table>";
var regex = new Regex(
#"
class=[^>]*?
(?<class>[-\w\d_]+)
[^>]*>
(?<text>[^<]+)
<
",
RegexOptions.Compiled | RegexOptions.IgnoreCase
| RegexOptions.IgnorePatternWhitespace
);
var country = string.Empty;
var Countries = new Dictionary<string, List<string>>();
foreach (Match match in regex.Matches(HTML))
{
string countryCity = match.Groups["class"].Value.Trim();
string text = match.Groups["text"].Value.Trim();
if (countryCity.Equals("country", StringComparison.OrdinalIgnoreCase))
{
country = text;
Countries.Add(text, new List<string>());
}
else
{
Countries[country].Add(text);
}
}

JQuery - Problem with selectors (siblings, parents...)

I got a coldfusion query where the result is grouped on country names. With a click on this one, I try to open or close the list under the country. But i cannot work correctly with this siblings and this parents. The result is, if i click on a country name, the fourth one, for example, it close all childrens, and the three country name which are before too.
Can someone help me to choose the right selectors ?
Thank you in advance ,
Michel
The code:
<script type="text/javascript" language="javascript">
$(document).ready(function(){
var toggleMinus = '<cfoutput>#variables.strWebAddress#</cfoutput>/images/bullet_toggle_minus.png';
var togglePlus = '<cfoutput>#variables.strWebAddress#</cfoutput>/images/bullet_toggle_plus.png';
var $subHead = $('table#categorylist tbody th:first-child');
$subHead.prepend('<img src="' +toggleMinus+ '" alt="collapse this section" /> ');
$('img', $subHead).addClass('clickable').click(function(){
var toggleSrc = $(this).attr('src');
if(toggleSrc == toggleMinus){
$(this).attr('src',togglePlus).parents('.country').siblings().fadeOut('fast');
}else{
$(this).attr('src',toggleMinus).parents('.country').siblings().fadeIn('fast');
}
});
});
</script>
<table width="95%" border="0" cellspacing="2" cellpadding="2" align="center id="categorylist">
<thead>
<tr>
<th class="text3" width="15%">
<cfmodule template="../custom_tags/get_message.cfm" keyName="L_ACTOR_CODENUMBER">
</th>
<th class="text3" width="15%">
<cfmodule template="../custom_tags/get_message.cfm" keyName="L_ACTOR_CODE">
</th>
<th class="text3" width="55%">
<cfmodule template="../custom_tags/get_message.cfm" keyName="L_ACTOR_NAME">
</th>
<th class="text3" width="15%">
<cfmodule template="../custom_tags/get_message.cfm" keyName="L_ACTIVE">
</th>
</tr>
</thead>
<tbody id="content">
<cfoutput query="qryCategoryUrl" group="country_name" groupcasesensitive="false">
<tr class="country">
<th style="font-weight:bold; text-align:left;" colspan="4">#country_name#</th>
</tr>
<cfoutput>
<tr>
<td valign="top" class="text3">#Replace(ACTOR_CODENUMBER, Chr(13) & Chr(10), "<br>", "ALL")# </td>
<td valign="top" class="text3">#Replace(ACTOR_CODE, Chr(13) & Chr(10), "<br>", "ALL")# </td>
<td valign="top" class="text3">#Replace(ACTOR_NAME, Chr(13) & Chr(10), "<br>", "ALL")# </td>
<td valign="top" class="text3"><cfmodule template="../custom_tags//get_message.cfm" keyName="#ACTIVE_display(qryCategoryUrl.ACTIVE)#"></td>
</tr>
</cfoutput>
</cfoutput>
</tbody>
</table>

Instead of:
.parents('.country').siblings().fadeOut('fast');
Try this:
.closest('.country').nextUntil('.country').fadeOut('fast');
And of course, apply the same change to the .fadeIn(). You might also look into .fadeToggle()docs.
Here's a (reduced) example: http://jsfiddle.net/redler/5sqJz/. While it doesn't affect the example, presumably you would be setting the initial state of those detail rows as hidden.

woah all that cfmodule usage, cfmodule can be a memory hog.
Although what I always recommend is that people try their pages in whatever browser, and use the SelectorGadget bookmarklet at http://www.selectorgadget.com/
This makes it easier to test and check the correct selector, for your app needs.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

get values from table with BeautifulSoup Python - python-2.7

The way to do it is by doing the following: third_column = tr.find_all('td')[2].contents third_column_text = str(third_column) third_columnSoup = BeautifulSoup(third_column_text) if third_columnSoup: print third_columnSoup.text

Related

How to find the element part of the anchor tag

Multidimensional list in Thymeleaf (Java) - List<List<Object>>

BeautifulSoup my for loop is printing all the data from the td tag. I would like to exclude the last section of the td tag

Repeat Groups to form objects

JQuery - Problem with selectors (siblings, parents...)

Categories

Resources