Loop stopped on half of data - list

I am parsing an HTML file located on my disk to get some data out of it. I can locate the data, but I can't add all of it to the list; only half of the items are successfully appended. The HTML structure does not change.
from bs4 import BeautifulSoup
import urllib2
Numeric = []
x1 = []
data = urllib2.urlopen("file:///C:/Users/user/Desktop/SuperLoto_Results__539-796.htm").read()
soup = BeautifulSoup(data, 'lxml')
for row in soup.find_all('tr'):
    col = row.find_all('td')
    x1.extend(col[4])
    Numeric.extend(col[0])
html file I parsed

I ran it successfully in Python 3.4. Here is my code and the output. Please note that I changed x1.extend(col[4]) to x1.extend(col[3]) because you indicated you wanted the data in the fourth cell.
import urllib.request
from bs4 import BeautifulSoup

Numeric = []
x1 = []
soup = BeautifulSoup(urllib.request.urlopen("file:///C:/Users/Home/Downloads/SuperLoto_Results__539-796.htm").read(), 'lxml')
for row in soup.find_all('tr'):
    try:
        col = row.find_all('td')
        x1.extend(col[3])
        Numeric.extend(col[0])
    except IndexError:  # rows without enough <td> cells (e.g. the header row) raise here
        print("error")
print(len(x1))
print(len(Numeric))
The output is:
error
259
259
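If you prefer to avoid the exception entirely, a minimal sketch (assuming the same file layout, where the problem rows are simply ones without enough td cells) is to check the cell count before indexing:

from bs4 import BeautifulSoup
import urllib.request

# hypothetical path; point this at wherever the saved results file lives
URL = "file:///C:/Users/user/Desktop/SuperLoto_Results__539-796.htm"

Numeric = []
x1 = []
soup = BeautifulSoup(urllib.request.urlopen(URL).read(), 'lxml')
for row in soup.find_all('tr'):
    col = row.find_all('td')
    if len(col) < 4:
        continue  # header/separator rows do not have enough cells
    x1.extend(col[3])
    Numeric.extend(col[0])

print(len(x1), len(Numeric))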

Related

For Loop trying to scrape TripAdvisor Restaurant data

I am trying to scrape a list of all the restaurants in Hong Kong and their corresponding URLs. Currently, in my code below, I am able to scrape the 1st and 2nd pages. But I want my for loop towards the bottom to be a bit more dynamic and keep scraping until it hits the number of entries I specified in range().
I am still a novice at this so any help would be awesome.
# import libraries
import requests
from bs4 import BeautifulSoup
import csv

# scrape the first page because this URL is different than the one used for subsequent pages
url0 = 'https://www.tripadvisor.com/Restaurants-g294217-Hong_Kong.html#EATERY_LIST_CONTENTS'
r = requests.get(url0)
data = r.text
soup = BeautifulSoup(r.text, "html.parser")
for link in soup.findAll('a', {'property_title'}):
    print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
    print link.string
# loop to move into the next pages. entries are in increments of 30 per page
for i in range(0, 120, 30):
    entries = str(30)
    # url format offsets the restaurants in increments of 30 after the oa; hence entries as a variable
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + entries + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string
    break
I ended up adding a while loop that pages through the offsets the way I wanted. Hope this helps people in the future.
# loop to move into the next pages. entries are in increments of 30 per page
i = 30
while i < 120:
    # url format offsets the restaurants in increments of 30 after the oa
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + str(i) + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string
    i += 30
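If you do not know how many pages there are up front, a minimal sketch of a more dynamic version (an assumption on my part: that the site accepts an oa0 offset for the first page and keeps using the property_title class on restaurant links) is to keep requesting the next offset until a page comes back with no matching links:

import requests
from bs4 import BeautifulSoup

base = 'https://www.tripadvisor.com/Restaurants-g294217-oa{}-Hong_Kong.html#EATERY_LIST_CONTENTS'
offset = 0
while True:
    r = requests.get(base.format(offset))
    soup = BeautifulSoup(r.text, "html.parser")
    links = soup.findAll('a', {'class': 'property_title'})
    if not links:
        break  # no restaurant links on this page, so we have run out of results
    for link in links:
        print('https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href'))
        print(link.string)
    offset += 30  # listings are paged in increments of 30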

How to remove unwanted items from a parse file

from googlefinance import getQuotes
import json
import time as t
import re

List = ["A", "AA", "AAB"]
Time = t.localtime()  # sets variable Time to retrieve date/time info
Date2 = ('%d-%d-%d %dh:%dm:%dsec' % (Time[0], Time[1], Time[2], Time[3], Time[4], Time[5]))  # formats time stamp
while True:
    for i in List:
        try:  # allows elements to be called and, on an error, moves to the next step
            Data = json.dumps(getQuotes(i.lower()), indent=1)  # retrieves data from Google Finance
            regex = ('"LastTradePrice": "(.+?)",')  # sets parse
            pattern = re.compile(regex)  # compiles parse
            price = re.findall(pattern, Data)  # retrieves parse
            print(i)
            print(price)
        except:  # sets error handling
            Error = (i + ' Failed to load on: ' + Date2)
            print(Error)
It will display the quote as: ['(number)'].
I would like it to only display the number, which means removing the brackets and quotes.
Any help would be great.
Changing:
print(price)
into:
print(price[0])
prints this:
A
42.14
AA
10.13
AAB
0.110
Try using the type() function to find out the data type, in your case type(price).
If the data type is a list, use print(price[0]) and you will get just the number. As for the brackets, you need to check the Google data and your regex.
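To spell out why indexing works: re.findall always returns a list of the captured groups, so printing the whole list shows brackets and quotes, while printing a single element shows just the matched text. A minimal sketch, assuming the dumped JSON looks like the snippet the regex was written for:

import re

data = '{\n "LastTradePrice": "42.14",\n "Symbol": "A"\n}'  # sample of the dumped JSON
price = re.findall('"LastTradePrice": "(.+?)",', data)
print(price)     # ['42.14']  -> a list, hence the brackets and quotes
print(price[0])  # 42.14      -> just the matched string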

Is it possible to do a times and date plot by reading the data from temporary file?

I have Twitter data in a text file in the following format: "RT Find out how AZ is targeting escape pathways to further personalise breastcancer treatment SABCS14 Thu Dec 11 03:09:12 +0000 2014". From this I need to find tweets with the same text, get their time and date, and then make a time and date plot. Below is the code I have tried.
import matplotlib
import os
import tempfile

temp = tempfile.TemporaryFile(mode='w+t')
count = 0
f = open('bcd.txt', 'r')
lines = f.read().splitlines()
for i in range(len(lines)):
    line = lines[i]
    try:
        next_line = lines[i + 1]
    except IndexError:
        continue
    if line == next_line:
        count += 1
        temp.writelines([line + '\n'])
temp.seek(0)
for line in temp:
    line.split('\t')
First I compared each line with the next line, then I wrote the duplicates into a temporary file and tried to go through that file to extract the time and date of the particular tweets, but I was unsuccessful. Any kind of help would be appreciated. Thanks in advance.
link to data file
https://drive.google.com/file/d/0BxE3rA3-6F8eVGVwOFlVRjNFTUE/view?usp=sharing
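For the date/time part, here is a minimal sketch of one way it could be done, assuming each line ends with a timestamp of the form "Thu Dec 11 03:09:12 +0000 2014" (the last six whitespace-separated fields) and that Python 3 is available for the %z directive; it simply counts tweets per day and plots the counts, rather than matching identical texts:

from collections import Counter
from datetime import datetime
import matplotlib.pyplot as plt

counts = Counter()
with open('bcd.txt') as f:
    for line in f:
        parts = line.rsplit(None, 6)  # split off the six trailing timestamp fields
        if len(parts) < 7:
            continue  # line does not end with a full timestamp
        stamp = ' '.join(parts[1:])   # e.g. "Thu Dec 11 03:09:12 +0000 2014"
        when = datetime.strptime(stamp, '%a %b %d %H:%M:%S %z %Y')
        counts[when.date()] += 1      # tally tweets per day

dates = sorted(counts)
plt.plot(dates, [counts[d] for d in dates])
plt.xlabel('date')
plt.ylabel('number of tweets')
plt.show()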

Parsing HTML Tables with BS4

I've been trying different methods of scraping data from this site (http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=WR&college=) and can't seem to get any of them to work. I've tried playing with the indices given, but can't seem to make it work. I think I've tried too many things at this point, so if someone could point me in the right direction I would really appreciate it.
I would like to pull all of the information and export it to a .csv file, but at this point I'm just trying to get the name and position to print to get started.
Here's my code:
import urllib2
from bs4 import BeautifulSoup
import re

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')
for row in table.findAll('tr')[0:]:
    col = row.findAll('tr')
    name = col[1].string
    position = col[3].string
    player = (name, position)
    print "|".join(player)
Here's the error I'm getting:
line 14, in name = col[1].string
IndexError: list index out of range.
--UPDATE--
Ok, I've made a little progress. It now allows me to go from start to finish, but it requires knowing how many rows are in the table. How would I get it to just go through them until the end?
Updated Code:
import urllib2
from bs4 import BeautifulSoup
import re

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')
for row in table.findAll('tr')[1:250]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    player = (name, position)
    print "|".join(player)
I figured it out after only 8 hours or so. Learning is fun. Thanks for the help, Kevin!
The code now outputs the scraped data to a CSV file. Next up is taking that data and filtering for certain positions....
Here's my code:
import urllib2
from bs4 import BeautifulSoup
import csv

url = ('http://nflcombineresults.com/nflcombinedata.php?year=2000&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')

f = csv.writer(open("2000scrape.csv", "w"))
f.writerow(["Name", "Position", "Height", "Weight", "40-yd", "Bench", "Vertical", "Broad", "Shuttle", "3-Cone"])

# variable to check length of rows
x = (len(table.findAll('tr')) - 1)
# set to run through x
for row in table.findAll('tr')[1:x]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    height = col[4].getText()
    weight = col[5].getText()
    forty = col[7].getText()
    bench = col[8].getText()
    vertical = col[9].getText()
    broad = col[10].getText()
    shuttle = col[11].getText()
    threecone = col[12].getText()
    player = (name, position, height, weight, forty, bench, vertical, broad, shuttle, threecone)
    f.writerow(player)
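As an aside on the earlier update: the row count does not have to be computed first. Slicing with an open end skips the header row and runs to the last row of the table, so the x bookkeeping can be dropped. A minimal sketch of just the loop header under that approach (the body stays the same as above):

for row in table.findAll('tr')[1:]:  # open-ended slice: skips the header, runs to the last row
    col = row.findAll('td')
    # ... same column extraction and f.writerow(player) as above ...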
I can't run your script due to firewall permissions, but I believe the problem is on this line:
col = row.findAll('tr')
row is a tr tag, and you're asking BeautifulSoup to find all tr tags within that tr tag. You probably meant to do:
col = row.findAll('td')
Furthermore, since the actual text isn't directly inside the <td> elements but is hidden within nested <div> and <a> tags, it may be useful to use the getText method instead of .string:
name = col[1].getText()
position = col[3].getText()
A simple way to parse the table column-wise:

import requests
from bs4 import BeautifulSoup

def table_to_list(table):
    data = []
    all_th = table.find_all('th')
    all_heads = [th.get_text() for th in all_th]
    for tr in table.find_all('tr'):
        all_th = tr.find_all('th')
        if all_th:
            continue  # skip heading rows
        all_td = tr.find_all('td')
        data.append([td.get_text() for td in all_td])
    return list(zip(all_heads, *data))

url = 'http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=WR&college='
headers = {'User-Agent': 'Mozilla/5.0'}  # example headers; some sites refuse requests without a browser-like agent
r = requests.get(url, headers=headers)
bs = BeautifulSoup(r.text)
all_tables = bs.find_all('table')
table_to_list(all_tables[0])
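Each entry in the returned list is one column: the header followed by that column's values down the rows. A short usage sketch (assuming the table parsed above):

columns = table_to_list(all_tables[0])
for col in columns[:3]:
    # col is a tuple like (header, value_row1, value_row2, ...)
    print(col[0], col[1:4])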

Attribute Error for strings created from lists

I'm trying to create a data-scraping file for a class, and the data I have to scrape requires that I use while loops to get the right data into separate arrays-- i.e. for states, and SAT averages, etc.
However, once I set up the while loops, my regex that cleared the majority of the html tags from the data broke, and I am getting an error that reads:
Attribute Error: 'NoneType' object has no attribute 'groups'
My Code is:
import re, util
from BeautifulSoup import BeautifulStoneSoup

# create a comma-delineated file
delim = ", "

# base url for sat data
base = "http://www.usatoday.com/news/education/2007-08-28-sat-table_N.htm"

# get webpage object for site
soup = util.mysoupopen(base)

# get column headings
colCols = soup.findAll("td", {"class": "vaTextBold"})

# get data
dataCols = soup.findAll("td", {"class": "vaText"})

# append data to cols
for i in range(len(dataCols)):
    colCols.append(dataCols[i])

# open a csv file to write the data to
fob = open("sat.csv", 'a')

# initiate the 5 arrays
states = []
participate = []
math = []
read = []
write = []

# split into 5 lists for each row
for i in range(len(colCols)):
    if i % 5 == 0:
        states.append(colCols[i])

i = 1
while i <= 250:
    participate.append(colCols[i])
    i = i + 5

i = 2
while i <= 250:
    math.append(colCols[i])
    i = i + 5

i = 3
while i <= 250:
    read.append(colCols[i])
    i = i + 5

i = 4
while i <= 250:
    write.append(colCols[i])
    i = i + 5

# write data to the file
for i in range(len(states)):
    states = str(states[i])
    participate = str(participate[i])
    math = str(math[i])
    read = str(read[i])
    write = str(write[i])

    # regex to remove html from scraped data
    # remove <td> tags
    line = re.search(">(.*)<", states).groups()[0] + delim + re.search(">(.*)<", participate).groups()[0] + delim + re.search(">(.*)<", math).groups()[0] + delim + re.search(">(.*)<", read).groups()[0] + delim + re.search(">(.*)<", write).groups()[0]

    # append data point to the file
    fob.write(line)
Any ideas regarding why this error suddenly appeared? The regex was working fine until I tried to split the data into different lists. I have already tried printing the various strings inside the final "for" loop to see if any of them were "None" for the first i value (0), but they were all the strings they were supposed to be.
Any help would be greatly appreciated!
It looks like the regex search is failing on (one of) the strings, so it returns None instead of a MatchObject.
Try the following instead of the very long #remove <td> tags line:
import sys

out_list = []
for item in (states, participate, math, read, write):
    try:
        out_list.append(re.search(">(.*)<", item).groups()[0])
    except AttributeError:
        print "Regex match failed on", item
        sys.exit()
line = delim.join(out_list)
That way, you can find out where your regex is failing.
Also, I suggest you use .group(1) instead of .groups()[0]. The former is more explicit.
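As a side note, here is a sketch of an alternative that avoids both the regex and the manual while loops: BeautifulSoup can give you each cell's text directly, and step slicing splits the flat list of cells into the five columns. This assumes colCols holds the cells in the same state/participation/math/read/write order as above, and that each cell contains a single text node (otherwise .string returns None):

texts = [cell.string for cell in colCols]  # plain text of each <td>, no tags to strip
states = texts[0::5]
participate = texts[1::5]
math = texts[2::5]
read = texts[3::5]
write = texts[4::5]

for row in zip(states, participate, math, read, write):
    fob.write(delim.join(row) + "\n")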