Scraping a redirected site using requests and BeautifulSoup - python-2.7

I am using requests and BeautifulSoup4 to scrape an NBA website.
from bs4 import BeautifulSoup
import requests
r = requests.get('http://www.nba.com/games/20111225/BOSNYK/boxscore.html')
soup = BeautifulSoup(r.text)
When the URL is entered into a browser it actually leads to 'http://www.nba.com/games/20111225/BOSNYK/gameinfo.html#nbaGIboxscore', and I thought that using requests was the proper way of simulating this.
The trouble is that I don't know the right keywords for this behavior, so I'm having trouble finding solutions online.

requests follows HTTP redirects automatically, but this page appears to redirect via an HTML <meta> refresh tag, which requests does not follow. You can use a regex or bs4 to find the redirect target and then use requests to scrape it.
For example:
import bs4
import requests

original_url = 'http://www.nba.com/games/20111225/BOSNYK/'
old_suffix = 'boxscore.html'
r = requests.get(original_url + old_suffix)
site_content = bs4.BeautifulSoup(r.text, 'lxml')

# The page announces its redirect target in a <meta> refresh tag whose
# content typically looks like "0;url=gameinfo.html#nbaGIboxscore"
meta = site_content.find_all('meta')[0]
meta_content = meta.attrs.get('content')
new_suffix = meta_content[6:]  # strip the leading "0;url=" (6 characters)
new_url_to_scrape = original_url + new_suffix
Then scrape new_url_to_scrape, for example as shown below.
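For completeness, a minimal sketch of that last step, reusing the variables built above (the r2 and box_score names are just illustrative):
r2 = requests.get(new_url_to_scrape)
box_score = bs4.BeautifulSoup(r2.text, 'lxml')  # parse the resolved game info / box score page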
Enjoy!

Related

Issue scraping website with bs4 (beautiful soup) python 2.7

What I am attempting to accomplish is a simple Python web scraping script for Google Trends, but I am running into an issue when grabbing the class:
from bs4 import BeautifulSoup
import requests
results = requests.get("https://trends.google.com/trends/trendingsearches/daily?geo=US")
soup = BeautifulSoup(results.text, 'lxml')
keyword_list = soup.find_all('.details-top')
for keyword in keyword_list:
    print(keyword)
When printing keyword I receive an empty result; however, when I print soup I receive the entire HTML document. My goal is to print out the text of each keyword that was searched, for the page https://trends.google.com/trends/trendingsearches/daily?geo=AU, which has a list of results:
1. covid-19
2. Woolworths jobs
If you use the browser's developer tools, select Inspect, and hover over a title, you will see div.details-top.
How would I print just the text of each title?
I can see that data being dynamically retrieved from an API call in the dev tools network tab. You can issue an XHR to that URL and then use a regex on the response text to parse out the query titles.
import requests, re

# Request the same API endpoint the page itself calls (visible in the network tab)
r = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
p = re.compile(r'"query":"(.*?)"')
titles = p.findall(r)
print(titles)  # Python 2.7: print titles
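If you would rather not rely on the regex, the same response appears to be JSON behind a short anti-hijacking prefix. Here is a hedged sketch of parsing it that way; the ")]}'," first line and the trendingSearchesDays / trendingSearches / title / query keys are assumptions you should verify against the actual response in the network tab:
import json
import requests

raw = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
payload = json.loads(raw.split('\n', 1)[1])  # drop the assumed ")]}'," prefix line
titles = [search['title']['query']
          for day in payload['default']['trendingSearchesDays']
          for search in day['trendingSearches']]
print(titles)  # Python 2.7: print titles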

How to crawl multiple domains using single crawler?

How can I crawl data from multiple domains using a single crawler? I have done crawling of single sites using Beautiful Soup but couldn't figure out how to create a generic one.
Well, this question is flawed; the sites that you want to scrape have to have something in common. For instance:
from bs4 import BeautifulSoup
import urllib2  # Python 2.7; on Python 3+ use urllib.request instead

for counter in range(0, 10):
    # site = input("Type the name of your website")  # Python 3+
    site = raw_input("Type the name of your website")
    # Takes the website you typed and stores it in the site variable
    make_request_to_site = urllib2.urlopen(site).read()
    # Makes a request to the site that we stored in the variable
    soup = BeautifulSoup(make_request_to_site, "html.parser")
    # Pass the response through BeautifulSoup, in this case with html.parser
    # Next, loop over all the links found in the page
    for link in soup.findAll('a'):
        print link['href']
As mentioned, each site has its own distinct markup and selectors (tag names, classes, ids, and so on). A single generic crawler won't be able to go into a URL and intuitively understand what to scrape.
BeautifulSoup might not be the best choice for this type of task. Scrapy is a web crawling framework that's a bit more robust than BS4.
Similar question here on stackoverflow: Scrapy approach to scraping multiple URLs
Scrapy Documentation:
https://doc.scrapy.org/en/latest/intro/tutorial.html
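For illustration, here is a minimal Scrapy spider sketch that covers several domains with one crawler; the start URLs and the link-extraction rule are placeholders, not anything taken from the original question:
import scrapy

class MultiDomainSpider(scrapy.Spider):
    name = "multi_domain"
    # Placeholder start pages: one crawler, several domains
    start_urls = [
        "http://example.com/",
        "http://example.org/",
    ]

    def parse(self, response):
        # Yield every link on the page, resolved to an absolute URL
        for href in response.css("a::attr(href)").extract():
            yield {"url": response.urljoin(href)}
You can run it with scrapy runspider multi_domain_spider.py -o links.json (the file names here are hypothetical).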

Regular expression to find precise pdf links in a webpage

Given url='http://normanpd.normanok.gov/content/daily-activity', the website has three types of reports: arrests, incidents, and case summaries. I was asked to use regular expressions to discover the URL strings of all the Incident pdf documents in Python.
The pdfs are to be downloaded to a defined location.
I have gone through the link and found that the Incident pdf file URLs are of the form:
normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
I have written this code:
import urllib.request
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$',text)
But the values in the urls list are empty.
I am a beginner with Python 3 and regex. Can anyone help me?
This is not an advisable method. Instead, use an HTML parsing library like bs4 (BeautifulSoup) to find the links, and use a regex only to filter the results.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)'))
for el in links:
    print("http://normanpd.normanok.gov" + el['href'])
Output :
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf
But if you were asked to use only regexes, then try something simpler:
import urllib.request
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)',text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
Using BeautifulSoup this is an easy way:
from urllib.request import urlopen
from bs4 import BeautifulSoup

base = "http://normanpd.normanok.gov"  # site root; the hrefs are root-relative
open_page = urlopen(base + "/content/daily-activity").read()
soup = BeautifulSoup(open_page, 'html.parser')
links = []
for link in soup.find_all('a'):
    current = link.get('href')
    if current and current.endswith('pdf') and "Incident" in current:
        links.append('{0}{1}'.format(base, current))
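The original question also says the PDFs are to be downloaded to a defined location, which neither snippet shows. A minimal sketch of that step, assuming the links list built above and a hypothetical downloads directory:
import os
from urllib.request import urlretrieve  # Python 3; on 2.7 use urllib.urlretrieve

download_dir = "downloads"  # hypothetical target directory
if not os.path.isdir(download_dir):
    os.makedirs(download_dir)
for pdf_url in links:
    filename = pdf_url.rsplit('/', 1)[-1]  # e.g. 2017-02-19%20Daily%20Incident%20Summary.pdf
    urlretrieve(pdf_url, os.path.join(download_dir, filename))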

BeautifulSoup doesn't return any value

I am new to BeautifulSoup and seem to have encountered a problem. The code I wrote is correct to my knowledge, but the output is empty; it doesn't show any value.
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.nseindia.com/")
soup = BeautifulSoup(url.content, "html.parser")
nifty = soup.find_all("span", {"id": "lastPriceNIFTY"})
for x in nifty:
    print x.text
The page is rendered by JavaScript. requests will fail to get content that is loaded by JavaScript; it only gets the partial page, before the JavaScript rendering. You can use the dryscrape library for this, like so:
import dryscrape
from bs4 import BeautifulSoup
sess = dryscrape.Session()
sess.visit("https://www.nseindia.com/")
soup = BeautifulSoup(sess.body(), "lxml")
nifty = soup.select("span[id^=lastPriceNIFTY]")
print nifty[0:2]  # print a sample, i.e. the first two entries
Output:
[<span class="number" id="lastPriceNIFTY 50"><span class="change green">8,792.80 </span></span>, <span class="value" id="lastPriceNIFTY 50 Pre Open" style="color:#000000"><span class="change green">8,812.35 </span></span>]
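To get just the displayed price text, as the original loop intended, a small follow-up to the snippet above:
for span in nifty[0:2]:
    print span.text.strip()  # e.g. 8,792.80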

Unable to find all links with BeautifulSoup to extract links from a website (Link identification)

I'm using the code found here (retrieve links from web page using python and BeautifulSoup) to extract all links from a website.
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer
http = httplib2.Http()
status, response = http.request('http://www.bestwestern.com.au')
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_attr('href'):
        print link['href']
I’m using this site http://www.bestwestern.com.au as test.
Unfortunately, I noticed that the code is not extracting some links, for example this one: http://www.bestwestern.com.au/about-us/careers/. I don't know why.
In the page's source code, this is what I found:
<li><a href="http://www.bestwestern.com.au/about-us/careers/">Careers</a></li>
I think the extractor should normally identify it.
On the BeautifulSoup documentation I can read: “The most common type of unexpected behavior is that you can’t find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python’s built-in HTML parser, which sometimes skips tags it doesn’t understand. Again, the solution is to install lxml or html5lib.”
So I installed html5lib. But I still have the same behavior.
Thank you for your help
OK, so this is an old question, but I stumbled upon it in my search and it seems like it should be relatively simple to accomplish. I did switch from httplib2 to requests.
import requests
from bs4 import BeautifulSoup, SoupStrainer

baseurl = 'http://www.bestwestern.com.au'
SEEN_URLS = []

def get_links(url):
    response = requests.get(url)
    for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a', href=True)):
        href = link['href']
        print(href)
        # Only recurse into pages on the same site that we haven't visited yet
        if baseurl in href and href not in SEEN_URLS:
            SEEN_URLS.append(href)
            get_links(href)

if __name__ == '__main__':
    get_links(baseurl)
One problem is that you are using BeautifulSoup version 3, which is not maintained anymore. You need to upgrade to BeautifulSoup version 4:
pip install beautifulsoup4
Another problem is that there is no "careers" link on the main page, but there is one on the "sitemap" page. Request it and parse it with the default html.parser parser, and you'll see the "careers" link printed among others:
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://www.bestwestern.com.au/sitemap/')
for link in BeautifulSoup(response.content, "html.parser", parse_only=SoupStrainer('a', href=True)):
    print(link['href'])
Note how I've moved the "has to have href" rule to the soup strainer.