I am trying to scrape a list of all the restaurants in Hong Kong and their corresponding URLs. Currently, the code below scrapes the 1st and 2nd pages, but I want the for loop towards the bottom to be more dynamic and keep scraping until it hits the number of entries I specified in range().
I am still a novice at this so any help would be awesome.
#import libraries
import requests
from bs4 import BeautifulSoup
import csv
#scrape the first page because its URL is different from the URLs used once you start moving to the next pages
url0 = 'https://www.tripadvisor.com/Restaurants-g294217-Hong_Kong.html#EATERY_LIST_CONTENTS'
r = requests.get(url0)
data = r.text
soup = BeautifulSoup(r.text, "html.parser")
for link in soup.findAll('a', {'property_title'}):
    print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
    print link.string
#loop to move into the next pages. entries are in increments of 30 per page
for i in range(0, 120, 30):
    entries = str(30)
    #url format offsets the restaurants in increments of 30 after the oa; hence entries as variable
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + entries + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string
    break
Ended up adding a while loop that got it to iterate the way I wanted. Hope this helps people in the future.
i = 30
while i <= 120:
    #url format offsets the restaurants in increments of 30 after the oa; hence i as the offset
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + str(i) + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string
    i = i + 30
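For later readers, here is a more compact sketch of the same idea (mine, untested): handle the first page and the offset pages in one loop, with the total entry count as the only thing you change. It assumes the -oa- offset URL pattern used above keeps working.

import requests
from bs4 import BeautifulSoup

total = 120  # how many entries to cover, in steps of 30 per page
for offset in range(0, total, 30):
    if offset == 0:
        # the first page has no -oa- offset in its URL
        url = 'https://www.tripadvisor.com/Restaurants-g294217-Hong_Kong.html#EATERY_LIST_CONTENTS'
    else:
        url = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + str(offset) + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for link in soup.findAll('a', {'property_title'}):
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string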
The following is my Python 3 programme to display the 12 subcategories of a Wikipedia category; it prints all 12 subcategories. Now I want to print only the first 3 subcategories. How can I do that? (Later, while developing my programme, I am going to write all 12 subcategories to a file.)
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Category:proprietary software'
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')
noOFsubcategories = soup.find('p')
print('------------------------------------------------------------------')
print(noOFsubcategories.text+'------------------------------------------------------------------')
tag = soup.find('div', {'class' : 'mw-category'})
links = tag.findAll('a')
counter = 1
for link in links:
    print(str(counter) + " " + link.text)
    counter = counter + 1
You can simply do for link in links[:3]: to display only the first three elements from a list.
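As a small sketch of both parts (the filename subcategories.txt is just an example of mine): the slice handles the print-only-three case, and the same loop shape writes all of them to a file later.

# print only the first three subcategories
for counter, link in enumerate(links[:3], start=1):
    print(str(counter) + " " + link.text)

# later: write every subcategory to a file
with open('subcategories.txt', 'w', encoding='utf-8') as f:
    for link in links:
        f.write(link.text + '\n')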
I am parsing an HTML file located on my disk to get some data out of it. I can locate the data, but I can't add all of it to my lists; only half of the values are successfully appended. The HTML structure has not changed.
from bs4 import BeautifulSoup
import urllib2

Numeric = []
x1 = []

data = urllib2.urlopen("file:///C:/Users/user/Desktop/SuperLoto_Results__539-796.htm").read()
soup = BeautifulSoup(data, 'lxml')
for row in soup.find_all('tr'):
    col = row.find_all('td')
    x1.extend(col[4])
    Numeric.extend(col[0])
html file I parsed
I ran it successfully in Python 3.4. Here is my code and the output. Please note that I changed x1.extend(col[4]) to x1.extend(col[3]) because you indicated you wanted the data in the fourth cell.
import urllib.request
from bs4 import BeautifulSoup

Numeric = []
x1 = []
soup = BeautifulSoup(urllib.request.urlopen("file:///C:/Users/Home/Downloads/SuperLoto_Results__539-796.htm").read(), 'lxml')
for row in soup.find_all('tr'):
    try:
        col = row.find_all('td')
        x1.extend(col[3])
        Numeric.extend(col[0])
    except:
        print("error")

print(len(x1))
print(len(Numeric))
The output is:
error
259
259
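A side note of mine (not from the original answer): extend(col[3]) appends the td tag's child nodes rather than its text. If the goal is the cell text itself, a sketch along these lines, reusing the same soup, collects strings and skips short rows instead of swallowing the error:

x1 = []
Numeric = []
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) >= 4:  # skip header rows or rows without enough cells
        x1.append(cells[3].get_text(strip=True))
        Numeric.append(cells[0].get_text(strip=True))
print(len(x1), len(Numeric))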
I am trying to code a program in Python 2.7.9 to crawl and gather the club names, addresses and phone numbers from the website http://tennishub.co.uk/
The following code gets the job done, except that it doesn't move on to the subsequent pages for each location, such as
/Berkshire/1
/Berkshire/2
/Berkshire/3
...and so on.
import requests
from bs4 import BeautifulSoup
def tennis_club():
    url = 'http://tennishub.co.uk/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    for link in soup.select('div.countylist a'):
        href = 'http://tennishub.co.uk' + link.get('href')
        pages_data(href)

def pages_data(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text)
    g_data = soup.select('table.display-table')
    for item in g_data:
        print item.contents[1].text
        print item.contents[3].findAll('td')[1].text
        try:
            print item.contents[3].find_all('td',{'class':'telrow'})[0].text
        except:
            pass
        try:
            print item.contents[5].findAll('td',{'class':'emailrow'})[0].text
        except:
            pass
        print item_url

tennis_club()
I have tried tweaking the code to the best of my understanding, but it doesn't work at all.
Can someone please advise what I need to do so that the program goes through all the pages of a location, collects the data, and then moves on to the next location, and so on?
You are going to need to put another for loop into this code:
for link in soup.select('div.countylist a'):
    href = 'http://tennishub.co.uk' + link.get('href')
    # new for loop goes here #
    pages_data(href)
If you want to brute force it, you can just have the for loop run as many times as the area with the most clubs (Surrey); however, you would double, triple, quadruple, etc. count the last clubs for many of the areas. This is ugly, but you can get away with it if you are using a database where you don't insert duplicates. It is unacceptable if you are writing to a file, though. In that case you will need to pull the number in parentheses after the area, e.g. Berkshire (39). To get that number you can do a get_text() on the div.countylist, which would change the above to:
for link in soup.select('div.countylist'):
    for endHref in link.find_all('a'):
        numClubs = endHref.next
        # need to clean up numClubs here to remove spaces and parens, e.g. ' (39)' -> 39
        numClubs = int(numClubs.strip().strip('()'))
        numPages = numClubs // 10 + 1  # add one because // gives the floor
        for page in range(1, numPages + 1):
            href = 'http://tennishub.co.uk' + endHref.get('href') + '/' + str(page)
            pages_data(href)
(Disclaimer: I didn't run this through bs4, so there might be errors, and you might need to use something other than .next, but the logic should help you.)
I've been trying different methods of scraping data from this site (http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=WR&college=) and can't seem to get any of them to work. I've tried playing with the indices given, but I can't make it work. I think I've tried too many things at this point, so if someone could point me in the right direction I would really appreciate it.
I would like to pull all of the information and export it to a .csv file, but at this point I'm just trying to get the name and position to print to get started.
Here's my code:
import urllib2
from bs4 import BeautifulSoup
import re
url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')
for row in table.findAll('tr')[0:]:
    col = row.findAll('tr')
    name = col[1].string
    position = col[3].string
    player = (name, position)
    print "|".join(player)
Here's the error I'm getting:
line 14, in <module>
    name = col[1].string
IndexError: list index out of range
--UPDATE--
Ok, I've made a little progress. It now allows me to go from start to finish, but it requires knowing how many rows are in the table. How would I get it to just go through them until the end?
Updated Code:
import urllib2
from bs4 import BeautifulSoup
import re
url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')
for row in table.findAll('tr')[1:250]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    player = (name, position)
    print "|".join(player)
I figured it out after only 8 hours or so. Learning is fun. Thanks for the help, Kevin!
It now includes the code to output the scraped data to a csv file. Next up is taking that data and filtering it for certain positions....
Here's my code:
import urllib2
from bs4 import BeautifulSoup
import csv
url = ('http://nflcombineresults.com/nflcombinedata.php?year=2000&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')
f = csv.writer(open("2000scrape.csv", "w"))
f.writerow(["Name", "Position", "Height", "Weight", "40-yd", "Bench", "Vertical", "Broad", "Shuttle", "3-Cone"])
# variable to check length of rows
x = (len(table.findAll('tr')) - 1)
# set to run through x
for row in table.findAll('tr')[1:x]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    height = col[4].getText()
    weight = col[5].getText()
    forty = col[7].getText()
    bench = col[8].getText()
    vertical = col[9].getText()
    broad = col[10].getText()
    shuttle = col[11].getText()
    threecone = col[12].getText()
    player = (name, position, height, weight, forty, bench, vertical, broad, shuttle, threecone, )
    f.writerow(player)
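A possible sketch of that position filtering (my addition, untested, assuming the Position column uses abbreviations such as WR or QB): check the value before writing the row.

wanted_positions = ('WR', 'QB')  # hypothetical filter list
for row in table.findAll('tr')[1:x]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    if position in wanted_positions:  # only keep the positions of interest
        f.writerow((name, position))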
I can't run your script due to firewall permissions, but I believe the problem is on this line:
col = row.findAll('tr')
row is a tr tag, and you're asking BeautifulSoup to find all tr tags within that tr tag. You probably meant to do:
col = row.findAll('td')
Furthermore, since the actual text isn't directly inside the tds but is instead nested within divs and a tags, it may be useful to use the getText method instead of .string:
name = col[1].getText()
position = col[3].getText()
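To make the difference concrete, here is a small example of mine (not part of the original answer):

from bs4 import BeautifulSoup

cell = BeautifulSoup("<td>\n<div><a>John Smith</a></div>\n</td>", "html.parser").td
print cell.string              # None, because the td has more than one child node
print cell.getText().strip()   # John Smith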
A simple way to parse the table column-wise:
def table_to_list(table):
    data = []
    all_th = table.find_all('th')
    all_heads = [th.get_text() for th in all_th]
    for tr in table.find_all('tr'):
        all_th = tr.find_all('th')
        if all_th:
            continue
        all_td = tr.find_all('td')
        data.append([td.get_text() for td in all_td])
    return list(zip(all_heads, *data))
import requests
from bs4 import BeautifulSoup

# url is the combine-results page above; headers can carry e.g. a User-Agent if the site needs one
r = requests.get(url, headers=headers)
bs = BeautifulSoup(r.text)
all_tables = bs.find_all('table')
table_to_list(all_tables[0])
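For what it's worth, zip(all_heads, *data) pairs each header with every value beneath it, so the result holds one tuple per column rather than per row. A tiny illustration with made-up values:

heads = ['Name', 'Position']
data = [['Player A', 'WR'], ['Player B', 'QB']]
print(list(zip(heads, *data)))
# [('Name', 'Player A', 'Player B'), ('Position', 'WR', 'QB')]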
I'm trying to create a data-scraping file for a class, and the data I have to scrape requires that I use while loops to get the right data into separate arrays (i.e. for states, SAT averages, etc.).
However, once I set up the while loops, my regex that cleared the majority of the html tags from the data broke, and I am getting an error that reads:
AttributeError: 'NoneType' object has no attribute 'groups'
My Code is:
import re, util
from BeautifulSoup import BeautifulStoneSoup
# create a comma-delineated file
delim = ", "
#base url for sat data
base = "http://www.usatoday.com/news/education/2007-08-28-sat-table_N.htm"
#get webpage object for site
soup = util.mysoupopen(base)
#get column headings
colCols = soup.findAll("td", {"class":"vaTextBold"})
#get data
dataCols = soup.findAll("td", {"class":"vaText"})
#append data to cols
for i in range(len(dataCols)):
    colCols.append(dataCols[i])
#open a csv file to write the data to
fob=open("sat.csv", 'a')
#initiate the 5 arrays
states = []
participate = []
math = []
read = []
write = []
#split into 5 lists for each row
for i in range(len(colCols)):
    if i%5 == 0:
        states.append(colCols[i])
i = 1
while i <= 250:
    participate.append(colCols[i])
    i = i + 5

i = 2
while i <= 250:
    math.append(colCols[i])
    i = i + 5

i = 3
while i <= 250:
    read.append(colCols[i])
    i = i + 5

i = 4
while i <= 250:
    write.append(colCols[i])
    i = i + 5
#write data to the file
for i in range(len(states)):
    states = str(states[i])
    participate = str(participate[i])
    math = str(math[i])
    read = str(read[i])
    write = str(write[i])

    #regex to remove html from data scraped
    #remove <td> tags
    line = re.search(">(.*)<", states).groups()[0] + delim + re.search(">(.*)<", participate).groups()[0] + delim + re.search(">(.*)<", math).groups()[0] + delim + re.search(">(.*)<", read).groups()[0] + delim + re.search(">(.*)<", write).groups()[0]

    #append data point to the file
    fob.write(line)
Any ideas why this error suddenly appeared? The regex was working fine until I tried to split the data into different lists. I have already tried printing the various strings inside the final "for" loop to see if any of them were None for the first i value (0), but they were all the strings they were supposed to be.
Any help would be greatly appreciated!
It looks like the regex search is failing on (one of) the strings, so it returns None instead of a MatchObject.
Try the following instead of the very long #remove <td> tags line:
import sys

out_list = []
for item in (states, participate, math, read, write):
    try:
        out_list.append(re.search(">(.*)<", item).groups()[0])
    except AttributeError:
        print "Regex match failed on", item
        sys.exit()
line = delim.join(out_list)
That way, you can find out where your regex is failing.
Also, I suggest you use .group(1) instead of .groups()[0]. The former is more explicit.
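A tiny illustration of that suggestion (my example, not from the original answer):

import re

m = re.search(">(.*)<", "<td>Alabama</td>")
print m.group(1)  # Alabama -- the same value as m.groups()[0]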