I am new to Python and am trying my hand at building some small web crawlers. I am trying to write a program in Python 2.7 with BeautifulSoup that extracts all profile URLs from this page and the subsequent pages
http://www.bda-findadentist.org.uk/pagination.php?limit=50&page=1
Here I am trying to scrape all the URLs that link to the details pages, such as this one:
http://www.bda-findadentist.org.uk/practice_details.php?practice_id=6034&no=61881
However, I am lost as to how to make my program recognize these URLs. They are not within a DIV class or ID; rather, they are wrapped in a TD with a bgcolor attribute:
<td bgcolor="E7F3F1"><a href="practice_details.php?practice_id=6034&no=61881">View Details</a></td>
Please advise on how I can make my program identify these URLs and scrape them. I tried the following, but none of them worked:
for link in soup.select('td bgcolor=E7F3F1 a'):
for link in soup.select('td#bgcolor#E7F3F1 a'):
for link in soup.findAll('a[practice_id=*]'):
My full program is as follows:
import requests
from bs4 import BeautifulSoup
def bda_crawler(pages):
    page = 1
    while page <= pages:
        url = 'http://www.bda-findadentist.org.uk/pagination.php?limit=50&page=' + str(page)
        code = requests.get(url)
        text = code.text
        soup = BeautifulSoup(text)
        for link in soup.findAll('a[practice_id=*]'):
            href = "http://www.bda-findadentist.org.uk" + link.get('href')
            print (href)
        page += 1
bda_crawler(2)
Please help
Many thanks
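For reference, the CSS-selector spelling BeautifulSoup expects for an attribute match is soup.select('td[bgcolor="E7F3F1"] a'), and matching the href itself is usually more robust than matching a cell colour. Below is a minimal, dependency-free sketch of that href-matching idea, written against Python 3's standard-library html.parser and fed sample markup shaped like the page (it is not fetched from the live site):

```python
import re
from html.parser import HTMLParser

# Invented sample markup imitating one row of the pagination page.
SAMPLE = '''
<table>
  <tr><td bgcolor="E7F3F1">
    <a href="practice_details.php?practice_id=6034&amp;no=61881">View Details</a>
  </td></tr>
  <tr><td><a href="pagination.php?limit=50&amp;page=2">Next</a></td></tr>
</table>
'''

class DetailLinkParser(HTMLParser):
    """Collect hrefs that contain a practice_id query parameter."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if re.search(r'practice_id=\d+', href):
                # Turn the relative link into an absolute one.
                self.links.append('http://www.bda-findadentist.org.uk/' + href)

parser = DetailLinkParser()
parser.feed(SAMPLE)
print(parser.links)
```

With BeautifulSoup installed, the equivalent filter would be soup.find_all('a', href=re.compile('practice_id')), since find_all accepts a compiled regex as an attribute value.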
I primarily work in Python 2.7. I'm trying to extract the written content (body text) of hundreds of articles from their respective URLs. To simplify things, I've started by trying to extract the text from just one website in my list, and I've been able to do so successfully using BeautifulSoup4. My code looks like this:
import urllib2
from bs4 import BeautifulSoup
url = 'http://insertaddresshere'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
response = urllib2.urlopen(request)
soup = BeautifulSoup(response, "html.parser")
texts = soup.find_all("p")
for item in texts:
    print item.text
This gets me the body text of a single article. I know how to iterate through a csv file and write to a new one, but the sites I need to iterate through are all from different domains, so the HTML varies a lot. Is there any way to find body text from multiple articles that have different element labels (here, "p") for said body text? Is it possible to use BeautifulSoup to do this?
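Since the body-text tag differs per site, one hedged approach is a heuristic: collect the text under each candidate tag and keep whichever tag accumulates the most. A rough sketch of that idea using only Python 3's standard library (the sample HTML below is invented, not one of the actual sites):

```python
from collections import defaultdict
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Accumulate text per enclosing tag so we can pick the densest one."""
    def __init__(self):
        super().__init__()
        self.stack = []                      # currently open tags
        self.text_by_tag = defaultdict(str)  # tag name -> concatenated text

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            self.text_by_tag[self.stack[-1]] += data

SAMPLE = ('<html><body><h1>Title</h1>'
          '<article>This is the long body text of the article, '
          'much longer than anything else on the page.</article>'
          '<footer>(c) 2015</footer></body></html>')

collector = TextCollector()
collector.feed(SAMPLE)
# The tag holding the most text is our best guess at the body container.
best_tag = max(collector.text_by_tag, key=lambda t: len(collector.text_by_tag[t]))
print(best_tag)
```

With BeautifulSoup you could apply the same scoring to soup.find_all(True) (which matches every tag) instead of writing a parser by hand.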
I'm still new to using BeautifulSoup to scrape information from a website. For this piece of code I'm specifically trying to grab this value and others like it, and display them back to the user in a more condensed, easy-to-read form. Below is a screenshot I took with the div and class I am trying to parse highlighted:
This is the code I'm using:
import urllib2
from bs4 import BeautifulSoup
a = "http://forecast.weather.gov/MapClick.php?lat=39.32196712788175&lon=-82.10190859830237&site=all&smap=1#.VQM_kOGGP7l"
website = urllib2.urlopen(a)
html = website.read()
soup = BeautifulSoup(html)
x = soup.find_all("div", {"class": "point-forecast-icons-low"})
print x
However, once it runs it returns "[]". I get no errors, but nothing happens. At first I thought maybe it couldn't find anything inside the <div> I told it to search for, but in that case I would usually get back None saying nothing was found. My best guess now is that it isn't opening the div up to pull the other content from inside it, but that is just my best guess at the moment.
You are getting [] because point-forecast-icons-low is not a class of the div; rather, it is a class of the p tag. Try this instead:
x = soup.find_all("p", attrs={"class": "point-forecast-icons-low"})
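A self-contained way to convince yourself of this, using only Python 3's standard library (the markup below imitates the forecast page rather than being fetched from it): searching for that class turns up a p, not a div.

```python
from html.parser import HTMLParser

# Invented markup shaped like the forecast page: the class sits on the <p>.
SAMPLE = ('<div id="seven-day-forecast">'
          '<p class="point-forecast-icons-low">Low: 31 &deg;F</p>'
          '</div>')

class ClassFinder(HTMLParser):
    """Record which tag names carry a given class attribute."""
    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.hits = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may list several classes, so split on whitespace.
        if self.cls in dict(attrs).get('class', '').split():
            self.hits.append(tag)

finder = ClassFinder('point-forecast-icons-low')
finder.feed(SAMPLE)
print(finder.hits)
```

So a find_all on "div" with that class filter matches nothing, while the same filter on "p" succeeds.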
I would like to be able to scrape all the messages from the message pages of Yahoo Finance for a specific stock.
Here is an example page:
http://finance.yahoo.com/mb/AMD/
I would like to be able to get all the messages there.
If I click on the "Messages" button on the above link I go to this link:
http://finance.yahoo.com/mb/forumview/?&v=m&bn=d56b9fc4-b0f1-3e88-b1f5-e1c40c0067e7
which has more than 10 pages.
How can I use Python code to scrape this data by just knowing the stock symbol "AMD"?
The basics:
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()  # assumption: 'br' in the original snippet is a mechanize Browser
tickers = ['AMD', 'AAPL', 'GOOG']
for t in tickers:
    url = 'http://finance.yahoo.com/mb/' + t + '/'
    r = br.open(url)
    html = r.read()
    soup = BeautifulSoup(html)
    print soup
The content you want is located within particular HTML tags. Use soup.find_all to get what you want. To move between pages, use Selenium.
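The paging logic can be sketched separately from Yahoo's markup. In the sketch below, fetch_page is a hypothetical stub standing in for "request one forumview page and run soup.find_all on it"; the loop shows the usual pattern of advancing a page counter until a page comes back empty:

```python
def fetch_page(page):
    """Stub standing in for an HTTP request plus soup.find_all on one page.

    Returns the messages found on that page, or [] once we are past the
    last page. The data here is fabricated purely for illustration.
    """
    fake_site = {1: ['msg1', 'msg2'], 2: ['msg3']}
    return fake_site.get(page, [])

def scrape_all_messages():
    messages = []
    page = 1
    while True:
        batch = fetch_page(page)
        if not batch:       # an empty page means we have run out of pages
            break
        messages.extend(batch)
        page += 1
    return messages

print(scrape_all_messages())
```

In the real crawler you would replace fetch_page with the HTTP request (or the Selenium navigation) and keep the loop unchanged.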
I am attempting to scrape data off of a website using a combination of urllib2 and beautifulsoup. At the moment, here is my code:
import urllib2
from bs4 import BeautifulSoup

site2 = 'http://football.fantasysports.yahoo.com/archive/nfl/2008/619811/draftresults'
players = []
teams = []
response = urllib2.urlopen(site2)
html = response.read()
soup = BeautifulSoup(html)
playername = soup.find_all('a', class_="name")
teamname = soup.find_all('td', class_="last")
My problem is that when I view the source code in Chrome, these tags are readily available, but when I run the program they are no longer there.
One hint may be that the first line of the source code in Chrome reads as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
whereas if I print my soup or html object, the first line is <!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">.
It appears that the site serves a mobile version of the page when I scrape it with urllib2. If that is not what this means, or if you know how to make urllib2 open the URL the way a browser (preferably Chrome) would, please let me know! Please also be quite specific as to how I can solve the problem, as I am a novice coder and admittedly my depth of knowledge is shallow at best!
Thanks everyone!
The website tries to figure out which browser the request is coming from using the 'User-agent' header. According to the urllib2 docs, the default user agent is Python-urllib/2.6. You could try setting it to that of a real browser using an OpenerDirector. Again, from the docs:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')
Suppose exam.com reports the weather like this:
Tokyo: 25°C
I want to use Django 1.1 and lxml to get this information from the website. I want to extract only the value "25".
The HTML structure of exam.com is as follows:
<p id="resultWeather">
<b>Weather</b>
Tokyo:
<b>25</b>°C
</p>
I'm a student doing a small project with my friends. Please explain it in a way that is easy to understand. Thank you very much!
BeautifulSoup is more suitable for HTML parsing than lxml. Something like this can be helpful:
def get_weather():
    import urllib
    from BeautifulSoup import BeautifulSoup
    data = urllib.urlopen('http://exam.com/').read()
    soup = BeautifulSoup(data)
    return soup.find('p', {'id': 'resultWeather'}).findAll('b')[-1].string
Get the page contents with urllib, parse them with BeautifulSoup, find the <p> with id=resultWeather, find the last <b> inside that <p>, and get its content.
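For a snippet this small you do not even need BeautifulSoup; Python's standard-library html.parser can pull out the last <b> on its own. A sketch (Python 3, using the HTML from the question):

```python
from html.parser import HTMLParser

# The HTML structure given in the question.
HTML = '''<p id="resultWeather">
<b>Weather</b>
Tokyo:
<b>25</b>&deg;C
</p>'''

class WeatherParser(HTMLParser):
    """Collect the text of every <b> inside <p id="resultWeather">."""
    def __init__(self):
        super().__init__()
        self.in_result = False   # inside the target <p>?
        self.in_b = False        # inside a <b> within it?
        self.bold_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and dict(attrs).get('id') == 'resultWeather':
            self.in_result = True
        elif tag == 'b' and self.in_result:
            self.in_b = True
            self.bold_texts.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_result = False
        elif tag == 'b':
            self.in_b = False

    def handle_data(self, data):
        if self.in_b:
            self.bold_texts[-1] += data

parser = WeatherParser()
parser.feed(HTML)
print(parser.bold_texts[-1])  # the temperature is the last <b>
```

This mirrors the BeautifulSoup answer above: findAll('b')[-1] and bold_texts[-1] both say "take the last bold element in the paragraph".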