Web scraping with Python modules urllib2 and BeautifulSoup - python-2.7

Recently I tried to use urllib2 and BeautifulSoup to extract the source code of a web page, but it failed, producing garbled output.
The script is as follows (run in Python IDLE):
import urllib2
from bs4 import BeautifulSoup
web = "http://www.qq.com"
page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser")
print soup.prettify()
I found that the charset of "http://www.qq.com" is gb2312, so I added the encoding to the script above like this:
import urllib2
from bs4 import BeautifulSoup
web = "http://www.qq.com"
page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser", from_encoding="gb2312")
print soup.prettify()
But the result is frustrating. Is there any solution available?
A screenshot of the error message was attached (not reproduced here).
Last weekend I added the sys module to the code as below, but it printed nothing, without any warning this time.
#coding=utf-8
import urllib2
from bs4 import BeautifulSoup
import sys
reload(sys)                    # needed to re-expose setdefaultencoding
sys.setdefaultencoding('gbk')  # force the default encoding (a widely discouraged hack)
web = "http://www.qq.com"
page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser")
print soup.prettify()

Can you post the error message? Or is the problem that it's just not displaying Chinese characters to the screen?
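Before blaming the parser, it may help to check what encoding the server and BeautifulSoup each report; a quick diagnostic sketch (Python 2):
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.qq.com")
print page.headers.getparam('charset')     # charset declared in the HTTP header, if any
soup = BeautifulSoup(page.read(), "html.parser")
print soup.original_encoding               # the encoding BeautifulSoup detected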
Try switching to gb18030 encoding. Even though the page says its charset is gb2312, there must be a character that's messing up the decoding. Switching encodings turned my terminal output from garbage into Chinese characters. (Source)
import urllib2
from bs4 import BeautifulSoup
web = "http://www.qq.com"
page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser", from_encoding="gb18030")
print soup.prettify()
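If prettify() still prints garbage, it can also help to decode the raw bytes yourself before handing them to the parser; a minimal sketch, assuming gb18030 covers the page (it is a superset of gb2312):
import urllib2
from bs4 import BeautifulSoup
raw = urllib2.urlopen("http://www.qq.com").read()
text = raw.decode("gb18030", "replace")    # replace any undecodable bytes
soup = BeautifulSoup(text, "html.parser")
print soup.prettify().encode("utf-8")      # encode explicitly for the console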

Related

Python fetch data from website

The Python code below fails to fetch the data from the given link. Please help me figure out how to make it work.
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://www.smartvidya.co.in/2016/11/ugc-net-paper-1-previous-year-questions_14.html'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('div', attrs={'class': 'MsoNormal'})
print name_box
Try this (the question's find() returns only the first matching div, while findAll() iterates over every one):
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://www.smartvidya.co.in/2016/11/ugc-net-paper-1-previous-year-questions_14.html'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
for name_box in soup.findAll('div', attrs={'class': 'MsoNormal'}):
    print name_box.text
Hope this helps!

Python and Beautiful Soup Web Scraping

I am trying to scrape the stats off the table on this webpage: http://stats.nba.com/teams/traditional/ but I am unable to find the HTML for the table. This is in Python 2.7.10.
from bs4 import BeautifulSoup
import json
import urllib
html = urllib.urlopen('http://stats.nba.com/teams/traditional/').read()
soup = BeautifulSoup(html, "html.parser")
for table in soup.find_all('tr'):
    print(table)
This is the code I have now, but nothing is being outputted.
If I try this with different elements on the page it works fine.
The table is loaded dynamically, so when you grab the HTML there are no tr tags in it to be found.
The table you're looking for is NOT in that specific page/URL.
The stats you're trying to scrape come from this URL:
http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=
When you browse a webpage in a modern browser, more requests are made "behind the scenes", beyond the original URL, to fully render the whole page.
I know this sounds counter-intuitive; you can check out this answer for a more detailed explanation.
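A sketch of hitting that endpoint directly with requests; the query string is copied from the URL above, and the resultSets/headers/rowSet layout is an assumption about the JSON this API returned at the time, so adjust if it has changed:
import requests
url = ("http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo="
       "&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location="
       "&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N"
       "&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N"
       "&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season"
       "&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=")
headers = {"User-Agent": "Mozilla/5.0"}      # the stats API tends to refuse bare clients
data = requests.get(url, headers=headers).json()
result = data["resultSets"][0]
print result["headers"]                      # column names
for row in result["rowSet"]:
    print row                                # one team's stats per row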
Try this code; it gives me the HTML. I am using requests to obtain the page.
import requests
from bs4 import BeautifulSoup

url = "http://stats.nba.com/teams/traditional/"
data = requests.get(url)
if data.status_code < 400:
    print("AUTHENTICATED:STATUS_CODE" + " " + str(data.status_code))
    sample = data.content
    soup = BeautifulSoup(sample, 'html.parser')
    print soup
You can use selenium and PhantomJS (or chromedriver, Firefox, etc.) to load the page, thereby also running all the JavaScript. All you need is to download selenium and the PhantomJS webdriver, then place a sleep timer after get(url) to ensure that the page loads (using a function such as WebDriverWait would be much better than sleep, as sketched after the code below, but you can look into that if you need it). Now your soup content will look exactly like what you see when viewing the site in your browser.
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep
url = 'http://stats.nba.com/teams/traditional/'
browser = webdriver.PhantomJS('*path to PhantomJS driver')
browser.get(url)
sleep(10)
soup = BeautifulSoup(browser.page_source, "html.parser")
for table in soup.find_all('tr'):
    print(table)
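As mentioned, WebDriverWait is more robust than a fixed sleep; a minimal sketch, assuming the rows appear as tr elements once the JavaScript has run:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'http://stats.nba.com/teams/traditional/'
browser = webdriver.PhantomJS('*path to PhantomJS driver')
browser.get(url)
# wait up to 20 seconds for at least one table row to be present
WebDriverWait(browser, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'tr')))
soup = BeautifulSoup(browser.page_source, "html.parser")
for table in soup.find_all('tr'):
    print(table)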

How can I go to a specific page of a website, fetch the desired data using Python, and save it into an Excel sheet? This code needs the URL of the desired page.

import requests
from bs4 import BeautifulSoup
import xlrd

file = "C:\Users\Ashadeep\PycharmProjects\untitled1\xlwt.ashadee.xls"
workbook = xlrd.open_workbook(file)
sheet = workbook.sheet_by_index(0)
print(sheet.cell_value(0, 0))
r = requests.get(sheet.cell_value(0, 0))       # the URL is stored in cell (0, 0)
soup = BeautifulSoup(r.content, "html.parser")
g_data = soup.find_all("div", {"class": "admissionhelp-left"})
print(g_data)
text = soup.find_all("Tel")
for item in g_data:
    print(item.text)
Are you trying to download an Excel file from the web and save it to your HDD? I don't see any URL, but you can try one of these 3 ideas.
import urllib
dls = "http://www.muellerindustries.com/uploads/pdf/UW SPD0114.xls"
urllib.urlretrieve(dls, "test.xls")
import requests
dls = "http://www.muellerindustries.com/uploads/pdf/UW SPD0114.xls"
resp = requests.get(dls)
with open('test.xls', 'wb') as output:
    output.write(resp.content)
Or, if you don't necessarily need to go through the browser, you can use the urllib module to save a file to a specified location.
import urllib
url = 'http://www.example.com/file/processing/path/excelfile.xls'
local_fname = '/home/John/excelfile.xls'
filename, headers = urllib.urlretrieve(url, local_fname)
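The question also asks about writing the fetched data into an Excel sheet, which none of the snippets above cover. A minimal sketch using xlwt (the writing counterpart of the xlrd module in the question); the sheet name and output path are just examples:
import xlwt
book = xlwt.Workbook()
out = book.add_sheet("scraped")           # example sheet name
for row, item in enumerate(g_data):       # g_data from the question's code
    out.write(row, 0, item.text)          # one scraped text per row
book.save("scraped_data.xls")             # example output path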

BeautifulSoup: Get all product links from specific category

I want to get all the product links from specific category by using BeautifulSoup in Python.
I have tried the following but don't get a result:
import lxml
import urllib2
from bs4 import BeautifulSoup
html=urllib2.urlopen("http://www.bedbathandbeyond.com/store/category/bedding/bedding/quilts-coverlets/12018/1-96?pagSortOpt=DEFAULT-0&view=grid")
br= BeautifulSoup(html.read(),'lxml')
for links in br.findAll('a', class_='prodImg'):
    print links['href']
You're using urllib2 incorrectly:
import lxml
import urllib2
from bs4 import BeautifulSoup
#create a http request
req=urllib2.Request("http://www.bedbathandbeyond.com/store/category/bedding/bedding/quilts-coverlets/12018/1-96?pagSortOpt=DEFAULT-0&view=grid")
# send the request
response = urllib2.urlopen(req)
# read the content of the response
html = response.read()
br= BeautifulSoup(html,'lxml')
for links in br.findAll('a', class_='prodImg'):
    print links['href']
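If the request still fails, the usual culprit with urllib2 is the default User-Agent, which some sites block; a hedged sketch that sends a browser-like header (the header value is just an example):
import urllib2
from bs4 import BeautifulSoup
req = urllib2.Request(
    "http://www.bedbathandbeyond.com/store/category/bedding/bedding/quilts-coverlets/12018/1-96?pagSortOpt=DEFAULT-0&view=grid",
    headers={'User-Agent': 'Mozilla/5.0'})   # example browser-like User-Agent
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, 'lxml')
for link in soup.findAll('a', class_='prodImg'):
    print link['href']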
from bs4 import BeautifulSoup
import requests
html = requests.get("http://www.bedbathandbeyond.com/store/category/bedding/bedding/quilts-coverlets/12018/1-96?pagSortOpt=DEFAULT-0&view=grid")
br = BeautifulSoup(html.content, "lxml")
data = br.findAll('div', attrs={'class': 'productShadow'})
for div in br.find_all('a'):
    print div.get('href')
Try this code.

How to get all application links on a page?

I have this code:
from bs4 import BeautifulSoup
import urllib
url = 'http://www.brothersoft.com/windows/mp3_audio/midi_tools/'
html = urllib.urlopen(url)
soup = BeautifulSoup(html)
for a in soup.select('div.freeText dl a[href]'):
    print "http://www.borthersoft.com" + a['href'].encode('utf-8', 'replace')
What I get is:
http://www.borthersoft.com/synthfont-159403.html
http://www.borthersoft.com/midi-maker-23747.html
http://www.borthersoft.com/keyboard-music-22890.html
http://www.borthersoft.com/mp3-editor-for-free-227857.html
http://www.borthersoft.com/midipiano---midi-file-player-recorder-61384.html
http://www.borthersoft.com/notation-composer-32499.html
http://www.borthersoft.com/general-midi-keyboard-165831.html
http://www.borthersoft.com/digital-music-mentor-31262.html
http://www.borthersoft.com/unisyn-250033.html
http://www.borthersoft.com/midi-maestro-13002.html
http://www.borthersoft.com/music-editor-free-139151.html
http://www.borthersoft.com/midi-converter-studio-46419.html
http://www.borthersoft.com/virtual-piano-65133.html
http://www.borthersoft.com/yamaha-9000-drumkit-282701.html
http://www.borthersoft.com/virtual-midi-keyboard-260919.html
http://www.borthersoft.com/anvil-studio-6269.html
http://www.borthersoft.com/midicutter-258103.html
http://www.borthersoft.com/softick-audio-gateway-55913.html
http://www.borthersoft.com/ipmidi-161641.html
http://www.borthersoft.com/d.accord-keyboard-chord-dictionary-28598.html
There should be 526 application links printed out, but I only get twenty.
What is missing from the code?
There are only 20 application links on each page.
You have to iterate over all the pages to get all the links:
from bs4 import BeautifulSoup
import urllib

for page in range(1, 27 + 1):  # currently there are 27 pages
    url = 'http://www.brothersoft.com/windows/mp3_audio/midi_tools/{}.html'.format(page)
    html = urllib.urlopen(url)
    soup = BeautifulSoup(html)
    for a in soup.select('div.freeText dl a[href]'):
        print "http://www.borthersoft.com" + a['href'].encode('utf-8', 'replace')