I've been trying different methods of scraping data from this site (http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=WR&college=) and can't get any of them to work. I've tried playing with the indices given, with no luck. I think I've tried too many things at this point, so if someone could point me in the right direction I would really appreciate it.
I would like to pull all of the information and export it to a .csv file, but at this point I'm just trying to get the name and position to print to get started.
Here's my code:
import urllib2
from bs4 import BeautifulSoup
import re
url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')
for row in table.findAll('tr')[0:]:
    col = row.findAll('tr')
    name = col[1].string
    position = col[3].string
    player = (name, position)
    print "|".join(player)
Here's the error I'm getting:
line 14, in <module>
    name = col[1].string
IndexError: list index out of range
--UPDATE--
Ok, I've made a little progress. It now allows me to go from start to finish, but it requires knowing how many rows are in the table. How would I get it to just go through them until the end?
Updated Code:
import urllib2
from bs4 import BeautifulSoup
import re
url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')
for row in table.findAll('tr')[1:250]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    player = (name, position)
    print "|".join(player)
I figured it out after only 8 hours or so. Learning is fun. Thanks for the help, Kevin!
The script now includes the code to output the scraped data to a .csv file. Next up is taking that data and filtering it for certain positions....
Here's my code:
import urllib2
from bs4 import BeautifulSoup
import csv
url = ('http://nflcombineresults.com/nflcombinedata.php?year=2000&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')
f = csv.writer(open("2000scrape.csv", "w"))
f.writerow(["Name", "Position", "Height", "Weight", "40-yd", "Bench", "Vertical", "Broad", "Shuttle", "3-Cone"])
# count the rows; the slice below skips the header row and the trailing row
x = len(table.findAll('tr')) - 1
# run through every data row
for row in table.findAll('tr')[1:x]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    height = col[4].getText()
    weight = col[5].getText()
    forty = col[7].getText()
    bench = col[8].getText()
    vertical = col[9].getText()
    broad = col[10].getText()
    shuttle = col[11].getText()
    threecone = col[12].getText()
    player = (name, position, height, weight, forty, bench, vertical, broad, shuttle, threecone)
    f.writerow(player)
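For reference, counting the rows isn't strictly required: an open-ended slice walks every remaining row, and a length check can skip the header or a blank trailing row. A minimal variant of the loop above (same column indices, untested against the live page):
for row in table.findAll('tr')[1:]:
    col = row.findAll('td')
    # skip header or blank rows instead of pre-counting; 13 matches the indices used above
    if len(col) < 13:
        continue
    f.writerow([col[i].getText() for i in (1, 3, 4, 5, 7, 8, 9, 10, 11, 12)])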
I can't run your script due to firewall permissions, but I believe the problem is on this line:
col = row.findAll('tr')
row is a tr tag, and you're asking BeautifulSoup to find all tr tags within that tr tag. You probably meant to do:
col = row.findAll('td')
Furthermore, since the actual text isn't directly inside the td elements but is hidden within nested div and a tags, it may be useful to use the getText method instead of .string:
name = col[1].getText()
position = col[3].getText()
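Putting both of those changes together, the whole loop would look something like this (again, I can't run it from here, so treat it as a sketch):
for row in table.findAll('tr'):
    col = row.findAll('td')       # cells within this row, not nested rows
    if len(col) < 4:              # header rows contain th cells, so skip them
        continue
    name = col[1].getText()       # getText collects text nested in divs/anchors
    position = col[3].getText()
    print "|".join((name, position))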
A simple way to parse the table column-wise:
def table_to_list(table):
    data = []
    all_th = table.find_all('th')
    all_heads = [th.get_text() for th in all_th]
    for tr in table.find_all('tr'):
        all_th = tr.find_all('th')
        if all_th:
            continue  # skip header rows
        all_td = tr.find_all('td')
        data.append([td.get_text() for td in all_td])
    return list(zip(all_heads, *data))
import requests
from bs4 import BeautifulSoup

url = 'http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college='
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder UA; some sites reject the default one
r = requests.get(url, headers=headers)
bs = BeautifulSoup(r.text)
all_tables = bs.find_all('table')
table_to_list(all_tables[0])
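If the goal is still a .csv file, the column-wise tuples can be transposed back into rows before writing; a rough sketch (the output filename is just a placeholder):
import csv

columns = table_to_list(all_tables[0])   # [(header, value1, value2, ...), ...]
rows = zip(*columns)                     # transpose: header row first, then data rows
with open('combine.csv', 'w', newline='') as out:
    csv.writer(out).writerows(rows)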
Related
The following is my Python 3 program to display the 12 subcategories of a Wikipedia category; it prints all 12. For now, I want to print only the first 3 subcategories. How? Later, while developing my program, I am going to write all 12 subcategories to a file.
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Category:proprietary software'
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')
noOFsubcategories = soup.find('p')
print('------------------------------------------------------------------')
print(noOFsubcategories.text+'------------------------------------------------------------------')
tag = soup.find('div', {'class' : 'mw-category'})
links = tag.findAll('a')
counter = 1
for link in links:
    print(str(counter) + " " + link.text)
    counter = counter + 1
You can simply do for link in links[:3]: to display only the first three elements from a list.
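For example, keeping the rest of the script unchanged (and letting enumerate handle the counter), printing the first three and writing all of them to a file might look like this; the output filename is just a placeholder:
# print only the first three subcategories
for counter, link in enumerate(links[:3], start=1):
    print(str(counter) + " " + link.text)

# later: write all subcategories to a file
with open('subcategories.txt', 'w', encoding='utf-8') as f:
    for link in links:
        f.write(link.text + '\n')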
I am trying to scrape a list of all the restaurants in Hong Kong and their corresponding URLs. Currently, with my code below, I am able to scrape the 1st and 2nd pages. But I want the for loop towards the bottom to be a bit more dynamic and keep scraping until it hits the number of entries I specified in range().
I am still a novice at this so any help would be awesome.
#import libraries
import requests
from bs4 import BeautifulSoup
import csv
#scrape the first page, because this URL is different than the URLs for the later pages
url0 = 'https://www.tripadvisor.com/Restaurants-g294217-Hong_Kong.html#EATERY_LIST_CONTENTS'
r = requests.get(url0)
data = r.text
soup = BeautifulSoup(r.text, "html.parser")
for link in soup.findAll('a', {'property_title'}):
    print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
    print link.string

#loop to move into the next pages. entries are in increments of 30 per page
for i in range(0, 120, 30):
    entries = str(30)
    #url format offsets the restaurants in increments of 30 after the oa; hence entries as variable
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + entries + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string
    break
Ended up using the loop variable itself to build each page's URL (the earlier version always requested the same offset), which got it to loop the way I wanted. Hope this helps people in the future.
for i in range(30, 120, 30):
    entries = str(i)
    #url format offsets the restaurants in increments of 30 after the oa; hence entries as variable
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + entries + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string
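To avoid hard-coding 120 entirely, one option (a sketch, not tested against the live site) is to keep requesting the next offset until a page yields no property_title links:
offset = 30
while True:
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + str(offset) + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    soup1 = BeautifulSoup(requests.get(url1).text, "html.parser")
    links = soup1.findAll('a', {'property_title'})
    if not links:
        break  # no more restaurants on this page, so stop paging
    for link in links:
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string
    offset += 30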
This is pretty simple code; I've just completed Charles Severance's Python for Informatics course, so if possible please help me keep it simple.
I'm trying to find duplicate documents in folders.
What I'm having trouble with is printing out the original and the duplicate so I can manually check the accuracy of what it found. Later I'll look at how to automate deleting duplicates, looking for other file types, etc.
A similarly structured piece of code worked well for iTunes, but here I'm putting originals into a dictionary, and it seems I'm not getting the info back out.
Please keep it simple so I can learn. I know I can copy code to do the job, but I'm more interested in learning where I've gone wrong.
cheers
jeff
import os
from os.path import join
import re
import hashlib
location = '/Users/jeff/desktop/typflashdrive'
doccount = 0
dupdoc = 0
d = dict()
for (dirname, dirs, files) in os.walk(location):
    for x in files:
        size = hashlib.md5(x).hexdigest()
        item = os.path.join(dirname, x)
        #print os.path.getsize(item), item
        #size = os.path.getsize(item)
        if item.endswith('.doc'):
            doccount = doccount + 1
            if size not in d:
                original = item
                d[size] = original
            else:
                copy = item
                for key in d: print key, d[size], '\n', size, copy, '\n', '\n',
                #print item,'\n', copy,'\n','\n',
                dupdoc = dupdoc + 1
print '.doc Files:', doccount, '.', 'You have', dupdoc, 'duplicate .doc files:',
Your biggest mistake is that you're taking the hash of the filenames instead of the file content.
I have corrected that and also cleaned up the rest of the code:
import os
import hashlib
location = '/Users/jeff/desktop/typflashdrive'
doc_count = 0
dup_doc_count = 0
hash_vs_file = {}
for (dirname, dirs, files) in os.walk(location):
    for filename in files:
        file_path = os.path.join(dirname, filename)
        file_hash = hashlib.md5(open(file_path, 'rb').read()).hexdigest()  # hash the bytes of the content
        if filename.endswith('.doc'):
            doc_count = doc_count + 1
            if file_hash not in hash_vs_file:
                hash_vs_file[file_hash] = [file_path]
            else:
                dup_doc_count += 1
                hash_vs_file[file_hash].append(file_path)

print 'doc_count = ', doc_count
print 'dup_doc_count = ', dup_doc_count

for file_hash in hash_vs_file:
    print file_hash
    for file_path in hash_vs_file[file_hash]:
        print file_path
    print "\n\n\n"
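One caveat with open(file_path, 'rb').read(): it loads each whole file into memory. For large files, hashing in chunks is safer; a drop-in helper (the chunk size is arbitrary):
def md5_of_file(path, chunk_size=8192):
    # hash the file in fixed-size chunks so big files don't exhaust memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()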
I am parsing an HTML file located on my disk to get some data out of it. I can locate the data, but I can't add it all to the list; only half of the items are successfully appended. The HTML structure does not change.
from bs4 import BeautifulSoup
import urllib2

Numeric = []
x1 = []

dara = urllib2.urlopen("file:///C:/Users/user/Desktop/SuperLoto_Results__539-796.htm").read()
soup = BeautifulSoup(dara, 'lxml')
for row in soup.find_all('tr'):
    col = row.find_all('td')
    x1.extend(col[4])
    Numeric.extend(col[0])
html file I parsed
I ran it successfully in Python 3.4. Here is my code and the output. Please note that I changed x1.extend(col[4]) to x1.extend(col[3]), because you indicated you wanted the data in the fourth cell (indexing is zero-based).
import urllib.request
from bs4 import BeautifulSoup

Numeric = []
x1 = []

soup = BeautifulSoup(urllib.request.urlopen("file:///C:/Users/Home/Downloads/SuperLoto_Results__539-796.htm").read(), 'lxml')
for row in soup.find_all('tr'):
    try:
        col = row.find_all('td')
        x1.extend(col[3])
        Numeric.extend(col[0])
    except:
        print("error")
print(x1.__len__())
print(Numeric.__len__())
The output is:
error
259
259
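That "error" most likely comes from a header row, which has th cells rather than td cells, so col[0] raises an IndexError. A guard on the cell count makes this explicit instead of swallowing every exception (same structure as above, a sketch):
for row in soup.find_all('tr'):
    col = row.find_all('td')
    if len(col) < 4:  # header or malformed rows have no (or too few) td cells
        continue
    x1.extend(col[3])
    Numeric.extend(col[0])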
I'm trying to scrape data for the Miami Heat and their opponent from a table at http://www.scoresandodds.com/grid_20111225.html. The problem I have is that the tables for the NBA, the NFL, and other sports are all identically marked, and all the data I get is from the NFL table. Another problem is that I would like to scrape data for the entire season, and the number of tables changes, as does Miami's position within them. This is the code I've been using for different tables till now:
So why is this not getting the job done? Thanks for your patience; I'm a real beginner, and I've been trying to solve this problem for some days now, to no effect.
def tableSnO(htmlSnO):
    gameSections = soup.findAll('div', 'gameSection')
    for gameSection in gameSections:
        header = gameSection.find('div', 'header')
        if header.get('id') == 'nba':
            rows = gameSections.findAll('tr')

    def parse_string(el):
        text = ''.join(el.findAll(text=True))
        return text.strip()

    for row in rows:
        data = map(parse_string, row.findAll('td'))
        return data
Lately I decided to try a different approach: if I scrape the entire page and get the index of the data in question (this is where it stops; see below), I could just get the next set of data from the list, since the structure of the table never changes. I could also get the opponent's team name the same way I get the htmlSnO. It feels like such basic stuff, and it's killing me that I can't get it right.
def tableSnO(htmlSnO):
    oddslist = soupSnO.find('table', {"width": "100%", "cellspacing": "0", "cellpadding": "0"})
    rows = oddslist.findAll('tr')

    def parse_string(el):
        text = ''.join(el.findAll(text=True))
        return text.strip()

    for row in rows:
        data = map(parse_string, row.findAll('td'))
        for teamName in data:
            if re.match("(.*)MIAMI HEAT(.*)", teamName):
                return teamName
                return data.index(teamName)
New and final answer with working code:
The section of the page you want has this:
<div class="gameSection">
    <div class="header" id="nba">
This should let you get at the NBA tables:
def tableSnO(htmlSnO):
    gameSections = soup.findAll('div', 'gameSection')
    for gameSection in gameSections:
        header = gameSection.find('div', 'header')
        if header.get('id') == 'nba':
            # process this gameSection
            print gameSection.prettify()
As a complete example, here's the full code I used to test:
import sys
import urllib2
from bs4 import BeautifulSoup

f = urllib2.urlopen('http://www.scoresandodds.com/grid_20111225.html')
html = f.read()
soup = BeautifulSoup(html)

gameSections = soup.findAll('div', 'gameSection')
for gameSection in gameSections:
    header = gameSection.find('div', 'header')
    if header.get('id') == 'nba':
        table = gameSection.find('table', 'data')
        print table.prettify()
This prints the NBA data table.
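From there, the rows can be pulled out of that table with the parse_string idea from the question; a rough sketch (it assumes the team name appears uppercase in a cell's text, as the question's regex suggests):
def parse_string(el):
    return ''.join(el.findAll(text=True)).strip()

for row in table.findAll('tr'):
    data = [parse_string(td) for td in row.findAll('td')]
    # print any row whose cells mention the Heat
    if any('MIAMI HEAT' in cell for cell in data):
        print '|'.join(data)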