I am attempting to scrape data from a website using a combination of urllib2 and BeautifulSoup. At the moment, here is my code:
import urllib2
from bs4 import BeautifulSoup

site2 = 'http://football.fantasysports.yahoo.com/archive/nfl/2008/619811/draftresults'
players = []
teams = []
response = urllib2.urlopen(site2)
html = response.read()
soup = BeautifulSoup(html)
playername = soup.find_all('a', class_="name")
teamname = soup.find_all('td', class_="last")
My problem is that when I view the source code in Chrome, these tags are readily available, but when I run the program, the tags are no longer there.
One hint may be that the first line of the source code in Chrome reads as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
Meanwhile, if I print my soup or html object, the first line is <!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">.
It appears that urllib2 is being served the mobile version of the page. If that is not what this means, or if you do know how to have urllib2 open the URL the way a browser (preferably Chrome) would, please let me know! Please also be quite specific about how I can solve the problem, as I am a novice coder and admittedly my depth of knowledge is shallow at best!
Thanks everyone!
The website tries to figure out what browser the request is coming from using the User-Agent header. According to the urllib2 docs, the default user agent is Python-urllib/2.6. You could try setting it to that of a browser using OpenerDirector. Again, from the docs:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')
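Applied to your code, a minimal sketch (assuming bs4; the 'Mozilla/5.0' value is just an example, and a fuller desktop browser User-Agent string may be needed):
import urllib2
from bs4 import BeautifulSoup

site2 = 'http://football.fantasysports.yahoo.com/archive/nfl/2008/619811/draftresults'

# Present a desktop browser User-Agent so the server returns the desktop page
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
html = opener.open(site2).read()

soup = BeautifulSoup(html)
playername = soup.find_all('a', class_="name")
teamname = soup.find_all('td', class_="last")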
Related
I am scraping a webpage using BeautifulSoup and requests in Python 3.5. The problem is that when I try to parse the email addresses in the p tags, it gives me [email protected]. I have tried the other links but to no avail; the cf_email tag is not even there. I am parsing with this:
email_addresses = []
for email_address in detail.findAll('p'):
    email_addresses.append(email_address.text)

information = {}
information['email'] = email_addresses
The emails are in the <p> tags. This is the HTML I see when inspecting the element:
<div class="email">
<p>test1#hotmail.com</p>
<p>test2#yahoo.com</p>
<p>test3#yahoo.com</p>
</div>
When I open the page source, however, I notice this:
<p>[email protected]</p>
The page does not actually contain the email address. This is probably being done as a protection against spammers; there will be some JavaScript that replaces the holding text with the actual value.
In other words, the site is trying to stop people doing exactly what you are trying to do.
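If the placeholder comes from Cloudflare's email protection (the usual source of [email protected]), the real address is often stored in a data-cfemail attribute as a hex string whose first byte is an XOR key for the remaining bytes. A sketch of the decoding, assuming that attribute is actually present in the HTML you fetched (you note the cf_email element may be missing entirely, in which case the address is simply not in the source and you would need a JavaScript-capable tool such as Selenium):
def decode_cfemail(encoded):
    # The first hex byte is the XOR key; each following byte decodes to one character
    key = int(encoded[:2], 16)
    return ''.join(chr(int(encoded[i:i + 2], 16) ^ key)
                   for i in range(2, len(encoded), 2))

# decode every element carrying the (assumed) data-cfemail attribute
for tag in detail.find_all(attrs={'data-cfemail': True}):
    print(decode_cfemail(tag['data-cfemail']))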
I am using Goose to read the title/text body of an article from a URL. However, this does not work with a Twitter URL, I guess due to the different HTML tag structure. Is there a way to read the tweet text from such a link?
One such example of a tweet (shortened link) is as follows:
https://twitter.com/UniteAlbertans/status/899468829151043584/photo/1
NOTE: I know how to read tweets through the Twitter API. However, I am not interested in that. I just want to get the text by parsing the HTML source, without all the Twitter authentication hassle.
Scrape it yourself
Open the URL of the tweet, pass it to an HTML parser of your choice and extract the elements you are interested in via XPath.
Scraping is discussed in: http://docs.python-guide.org/en/latest/scenarios/scrape/
XPaths can be obtained by right-clicking the element you want, selecting "Inspect", right-clicking the highlighted line in the inspector and selecting "Copy" > "Copy XPath", provided the structure of the site is always the same. Otherwise, choose properties that uniquely identify the object you want.
In your case:
//div[contains(@class, 'permalink-tweet-container')]//strong[contains(@class, 'fullname')]/text()
will get you the name of the author and
//div[contains(@class, 'permalink-tweet-container')]//p[contains(@class, 'tweet-text')]//text()
will get you the content of the Tweet.
The full working example:
from lxml import html
import requests

page = requests.get('https://twitter.com/UniteAlbertans/status/899468829151043584')
tree = html.fromstring(page.content)
tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')
results in:
['Breaking:\n10 sailors missing, 5 injured after USS John S. McCain collides with merchant vessel near Singapore...\n\n', 'https://www.', 'washingtonpost.com/world/another-', 'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1', u'\xa0', u'\u2026', 'pic.twitter.com/UiGEZq7Eq6']
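Since //text() returns the tweet in fragments, as seen above, you can join them into a single string:
fragments = tree.xpath('//div[contains(@class, "permalink-tweet-container")]'
                       '//p[contains(@class, "tweet-text")]//text()')
print(''.join(fragments))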
I am new to Python and trying my hand at building some small web crawlers. I am trying to write a program in Python 2.7 with BeautifulSoup that would extract all profile URLs from this page and the subsequent pages:
http://www.bda-findadentist.org.uk/pagination.php?limit=50&page=1
Here I am trying to scrape all the URLs that link to the details pages, such as this:
http://www.bda-findadentist.org.uk/practice_details.php?practice_id=6034&no=61881
However, I am lost as to how to make my program recognize these URLs. They are not within a div class or ID; rather, they are wrapped in a td tag with a bgcolor attribute:
<td bgcolor="E7F3F1">View Details</td>
Please advise on how I can make my program identify these URLs and scrape them. I tried the following, but none of them worked:
for link in soup.select('td bgcolor=E7F3F1 a'):
for link in soup.select('td#bgcolor#E7F3F1 a'):
for link in soup.findAll('a[practice_id=*]'):
My full program is as follows:
import requests
from bs4 import BeautifulSoup

def bda_crawler(pages):
    page = 1
    while page <= pages:
        url = 'http://www.bda-findadentist.org.uk/pagination.php?limit=50&page=' + str(page)
        code = requests.get(url)
        text = code.text
        soup = BeautifulSoup(text)
        for link in soup.findAll('a[practice_id=*]'):
            href = "http://www.bda-findadentist.org.uk" + link.get('href')
            print(href)
        page += 1

bda_crawler(2)
Please help
Many thanks
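One hint: findAll('a[practice_id=*]') fails because BeautifulSoup treats the whole string as a literal tag name; attribute conditions have to be passed separately. A sketch of one approach, matching the href with a regular expression (untested against the live site, so treat it as a starting point):
import re
import requests
from bs4 import BeautifulSoup

def bda_crawler(pages):
    page = 1
    while page <= pages:
        url = 'http://www.bda-findadentist.org.uk/pagination.php?limit=50&page=' + str(page)
        soup = BeautifulSoup(requests.get(url).text)
        # keep only anchors whose href points at a practice details page
        for link in soup.findAll('a', href=re.compile(r'practice_details\.php')):
            print("http://www.bda-findadentist.org.uk" + link.get('href'))
        page += 1

bda_crawler(2)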
I'm still new to using BeautifulSoup to scrape information from websites. For this piece of code, I'm specifically trying to grab this value and others like it and display them back to the user in a more condensed, easy-to-read form. Below is a screenshot I took, with the div and class I am trying to parse highlighted:
This is the code I'm using:
import urllib2
from bs4 import BeautifulSoup
a =("http://forecast.weather.gov/MapClick.php?lat=39.32196712788175&lon=-82.10190859830237&site=all&smap=1#.VQM_kOGGP7l")
website = urllib2.urlopen(a)
html = website.read()
soup = BeautifulSoup(html)
x = soup.find_all("div",{"class": "point-forecast-icons-low"})
print x
However, once it runs it returns "[]". I get no errors, but nothing happens. What I thought at first was that maybe it couldn't find anything inside the <div> I told it to search for, but usually I would get back None if nothing was found. My best guess now is that it's not opening the div up to pull the other content from inside it.
You are getting [] because point-forecast-icons-low is not a class of the div; rather, it's on the p tag. Try this instead:
x = soup.find_all("p", attrs={"class": "point-forecast-icons-low"})
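From there, something like this should print the text of each matching paragraph:
x = soup.find_all("p", attrs={"class": "point-forecast-icons-low"})
for para in x:
    # get_text() collapses a tag's children into one readable string
    print para.get_text().strip()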
At exam.com there is information about the weather:
Tokyo: 25°C
I want to use Django 1.1 and lxml to get information from the website. I want to get only the "25" value.
The HTML structure of exam.com is as follows:
<p id="resultWeather">
<b>Weather</b>
Tokyo:
<b>25</b>°C
</p>
I'm a student doing a small project with my friends. Please explain in a way that is easy to understand. Thank you very much!
BeautifulSoup is more suitable for HTML parsing than lxml. Something like this can be helpful:
import urllib
from BeautifulSoup import BeautifulSoup

def get_weather():
    # fetch the page and parse it
    data = urllib.urlopen('http://exam.com/').read()
    soup = BeautifulSoup(data)
    # the temperature is the last <b> inside <p id="resultWeather">
    return soup.find('p', {'id': 'resultWeather'}).findAll('b')[-1].string
Get the page contents with urllib, parse them with BeautifulSoup, find the p with id=resultWeather, find the last b in that p and get its content.
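If you would rather stick with lxml, as the question mentions, the equivalent is roughly this (a sketch assuming the HTML structure shown above):
import urllib
from lxml import html

def get_weather():
    data = urllib.urlopen('http://exam.com/').read()
    tree = html.fromstring(data)
    # the temperature is the text of the last <b> inside <p id="resultWeather">
    return tree.xpath('//p[@id="resultWeather"]/b[last()]/text()')[0]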