Python and Beautiful Soup Web Scraping - python-2.7

I am trying to scrape the stats off the table on this webpage: http://stats.nba.com/teams/traditional/ but I am unable to find the html for the table. This is in python 2.7.10.
from bs4 import BeautifulSoup
import json
import urllib
html = urllib.urlopen('http://stats.nba.com/teams/traditional/').read()
soup = BeautifulSoup(html, "html.parser")
for table in soup.find_all('tr'):
    print(table)
This is the code I have now, but nothing is printed.
If I try this with different elements on the page it works fine.

The table is loaded dynamically, so when you grab the html, there are no tr tags in it to be found.

The table you're looking for is NOT in that specific page/URL.
The stats you're trying to scrape come from this url:
http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=
When you open a webpage in a modern browser, it makes more requests "behind the scenes" than the original URL in order to fully render the page.
I know this sounds counter-intuitive; you can check out this answer for a more detailed explanation.
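As a rough sketch of what you could do with that URL's response: assuming it returns JSON with a top-level `resultSets` list whose entries carry `headers` and `rowSet` (that layout is an assumption, so check the actual response first), each row can be zipped with the headers. The helper name `rows_from_result_set` and the sample payload are made up for illustration; real data would come from fetching the URL above.

```python
import json


def rows_from_result_set(payload, index=0):
    """Pair every row of one result set with its column headers.

    Hypothetical helper: assumes payload["resultSets"][index] holds
    "headers" (column names) and "rowSet" (a list of rows).
    """
    result_set = payload["resultSets"][index]
    headers = result_set["headers"]
    return [dict(zip(headers, row)) for row in result_set["rowSet"]]


# Made-up stand-in for the decoded JSON body of the stats URL.
sample = json.loads(
    '{"resultSets": [{"headers": ["TEAM_NAME", "W"],'
    ' "rowSet": [["Team A", 10], ["Team B", 7]]}]}'
)

rows = rows_from_result_set(sample)
for row in rows:
    print("%s %s" % (row["TEAM_NAME"], row["W"]))
```

With a real response you would pass the decoded JSON (for example `json.loads(body)`) instead of `sample`; no HTML parsing is needed in that case.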

Try this code. It is giving me the HTML code. I am using requests to obtain information.
import requests
from bs4 import BeautifulSoup

url = "http://stats.nba.com/teams/traditional/"
data = requests.get(url)
# Only parse the body when the request did not fail (status below 400)
if data.status_code < 400:
    print("AUTHENTICATED:STATUS_CODE" + " " + str(data.status_code))
    sample = data.content
    soup = BeautifulSoup(sample, 'html.parser')
    print soup

You can use selenium and PhantomJS (or chromedriver, firefox etc.) to load the page, thereby also loading all the javascript. All you need is to download selenium and the PhantomJS webdriver, then place a sleep timer after the get(url) to ensure that the page loads (actually, using a function such as WebDriverWait would be much better than sleep, but you can look more into that if you need it). Now your soup content will look exactly like what you see when looking at the site through your browser.
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep
url = 'http://stats.nba.com/teams/traditional/'
browser = webdriver.PhantomJS('*path to PhantomJS driver')
browser.get(url)
sleep(10)
soup = BeautifulSoup(browser.page_source, "html.parser")
for table in soup.find_all('tr'):
    print(table)

Related

Cannot find a link

I am trying to click a tab (Regulatory Regional) on a webpage: https://www5.fdic.gov/idasp/advSearchLanding.asp
However, it does not recognize the command. Here, I have attached the code.
import urllib2
import urllib
from bs4 import BeautifulSoup
import subprocess
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome("/usr/local/bin/chromedriver")
import time
s1_url = 'https://www5.fdic.gov/idasp/advSearchLanding.asp'
browser.get(s1_url)
Problem: I try to select the Regulatory Regional tab, but it is not clicked.
browser.find_element_by_xpath('//*[@id="Banks_Regulatory_Tab"]/a').click()
I get an exception:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="Banks_Regulatory_Tab"]/a"}

The required element is located inside an iframe. To handle it, you need to switch to that iframe first:
browser.switch_to.frame("content")
browser.find_element_by_link_text("Regulatory Regional").click()

StaleElementReferenceException occurs during scraping infinite scroll with Selenium in Python

I am trying to scroll down an infinite scroll page and get the links of the news items. The problem is that after I scroll down the page, say, 100 times and then try to get the links, Python raises an error that says: "StaleElementReferenceException: Message: stale element reference: element is not attached to the page document". I think it's because the page gets updated and the scrolled page is not available any more. Here is my code for scrolling the page with Selenium Webdriver:
from __future__ import print_function  # must come before any other import
import urllib2
from bs4 import BeautifulSoup
from selenium import webdriver #open webdriver for specific browser
from selenium.webdriver.common.keys import Keys # for necessary browser action
from selenium.webdriver.common.by import By # For selecting html code
import time
driver = webdriver.Chrome('C:\\Program Files (x86)\\Google\\Chrome\\chromedriver.exe')
driver.get('http://seekingalpha.com/market-news/top-news')
for i in range(0,100):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(15)
URL = driver.find_elements_by_class_name('market_current_title')
print(URL)
And the code for getting the URLs:
for a in URL:
    links = a.get_attribute('href')
    print(links)
Is there any way to solve this problem, or is it possible to get the URLs for this specific page with the requests library? I couldn't manage that.
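
One thing that may help with the stale references, offered as a sketch rather than a confirmed fix: stop scrolling once the page height no longer grows, locate the elements only after that, and turn them into plain strings right away so that no element handle is kept while the page changes. The helper name `scroll_and_collect_hrefs` and its parameters are hypothetical. It takes an already created driver and uses the generic `find_elements("class name", ...)` form, where `"class name"` is the string value of `By.CLASS_NAME`.

```python
import time


def scroll_and_collect_hrefs(driver, class_name, max_scrolls=100, pause=2.0):
    """Scroll to the bottom repeatedly, then read every href in one pass.

    Hypothetical helper: `driver` is any Selenium WebDriver instance.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop early
        last_height = new_height

    # Locate the elements only now and convert them to strings immediately,
    # so no element reference is held across later page updates.
    elements = driver.find_elements("class name", class_name)
    return [element.get_attribute("href") for element in elements]
```

Usage would look like `links = scroll_and_collect_hrefs(driver, 'market_current_title')`, followed by printing the returned strings. The pause length is a guess and probably needs tuning for the page.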

How can I go to a specific page of a website, fetch the desired data using Python, and save it into an Excel sheet? This code needs the URL up to the desired page

import requests
from bs4 import BeautifulSoup
import xlrd
# Raw string so the backslashes in the Windows path are not treated as escapes
file = r"C:\Users\Ashadeep\PycharmProjects\untitled1\xlwt.ashadee.xls"
workbook = xlrd.open_workbook(file)
sheet = workbook.sheet_by_index(0)
print(sheet.cell_value(0, 0))
r = requests.get(sheet.cell_value(0, 0))
r.content
soup = BeautifulSoup(r.content, "html.parser")
g_data = soup.find_all("div", {"class": "admissionhelp-left"})
print(g_data)
text = soup.find_all("Tel")
for item in g_data:
    print(item.text)
Are you trying to download an Excel file from the web and save it to your hard drive? I don't see any URL, but you can try one of these 3 ideas.
# Idea 1: urllib
import urllib
dls = "http://www.muellerindustries.com/uploads/pdf/UW SPD0114.xls"
urllib.urlretrieve(dls, "test.xls")
# Idea 2: requests
import requests
dls = "http://www.muellerindustries.com/uploads/pdf/UW SPD0114.xls"
resp = requests.get(dls)
with open('test.xls', 'wb') as output:
    output.write(resp.content)
Or, if you don't necessarily need to go through the browser, you can use the urllib module to save a file to a specified location.
import urllib
url = 'http://www.example.com/file/processing/path/excelfile.xls'
local_fname = '/home/John/excelfile.xls'
filename, headers = urllib.urlretrieve(url, local_fname)

BeautifulSoup: Get all product links from specific category

I want to get all the product links from specific category by using BeautifulSoup in Python.
I have tried the following but don't get a result:
import lxml
import urllib2
from bs4 import BeautifulSoup
html=urllib2.urlopen("http://www.bedbathandbeyond.com/store/category/bedding/bedding/quilts-coverlets/12018/1-96?pagSortOpt=DEFAULT-0&view=grid")
br= BeautifulSoup(html.read(),'lxml')
for links in br.findAll('a', class_='prodImg'):
    print links['href']

You are using urllib2 incorrectly. Create a Request first, then open it and read the response:
import lxml
import urllib2
from bs4 import BeautifulSoup
#create a http request
req=urllib2.Request("http://www.bedbathandbeyond.com/store/category/bedding/bedding/quilts-coverlets/12018/1-96?pagSortOpt=DEFAULT-0&view=grid")
# send the request
response = urllib2.urlopen(req)
# read the content of the response
html = response.read()
br= BeautifulSoup(html,'lxml')
for links in br.findAll('a', class_='prodImg'):
    print links['href']
from bs4 import BeautifulSoup
import requests
html=requests.get("http://www.bedbathandbeyond.com/store/category/bedding/bedding/quilts-coverlets/12018/1-96?pagSortOpt=DEFAULT-0&view=grid")
br= BeautifulSoup(html.content,"lxml")
data=br.findAll('div',attrs={'class':'productShadow'})
for div in br.find_all('a'):
    print div.get('href')
Try this code.

Web scraping with Python modules urllib2 and BeautifulSoup

Recently I tried to use urllib2 and BeautifulSoup to extract the source code of a web page, but the output came out garbled.
The script is as follows (run in Python IDLE)
import urllib2
from bs4 import BeautifulSoup
web = "http://www.qq.com"
page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser")
print soup.prettify()
I found that the charset of "http://www.qq.com" is gb2312, so added something in the above script like this:
import urllib2
from bs4 import BeautifulSoup
web = "http://www.qq.com"
page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser", from_encoding="gb2312")
print soup.prettify()
But the result is frustrating. Is there any solution available?
[Screenshot of the error message]
Last weekend I added the sys module to the code above, but now it prints nothing, and without a warning this time.
#coding=utf-8
import urllib2
from bs4 import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('gbk')
web = "http://www.qq.com"
page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser")
print soup.prettify()
Can you post the error message? Or is the problem that it's just not displaying Chinese characters to the screen?
Try switching to gb18030 encoding. Even though the page says its charset is gb2312, there must be a character that's messing up the decoding. Switching encodings turned my terminal output from garbage to Chinese characters (Source)
import urllib2
from bs4 import BeautifulSoup
web = "http://www.qq.com"
page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser", from_encoding="gb18030")
print soup.prettify()
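
A small sketch of why the switch can help, assuming the page contains at least one character outside GB2312: GB18030 is a superset of GB2312, so bytes that fail to decode as gb2312 may still decode as gb18030. The sample string is only an illustration and is not taken from qq.com.

```python
# -*- coding: utf-8 -*-
# The middle character, U+9555, is not defined in GB2312 but is in GB18030.
text = u"\u6731\u9555\u57fa"
raw = text.encode("gb18030")  # stand-in for the bytes a server would send

try:
    raw.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False

decoded = raw.decode("gb18030")

print(gb2312_ok)        # whether the strict gb2312 decode succeeded
print(decoded == text)  # whether gb18030 round-trips the sample
```

The same idea applies to `from_encoding`: declaring the wider encoding lets characters outside the narrower set through.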