Python: fetch data from a website - python-2.7

The Python code below is not fetching the data from the given link. Please help me figure out how to make it work.
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://www.smartvidya.co.in/2016/11/ugc-net-paper-1-previous-year-questions_14.html'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('div', attrs={'class': 'MsoNormal'})
print name_box

Try this:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://www.smartvidya.co.in/2016/11/ugc-net-paper-1-previous-year-questions_14.html'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
for name_box in soup.findAll('div', attrs={'class': 'MsoNormal'}):
    print name_box.text
Hope this helps!
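If you want to keep the extracted text instead of just printing it, here is a minimal sketch along the same lines (the output file name questions.txt is my own choice here):
import urllib2
from bs4 import BeautifulSoup

quote_page = 'http://www.smartvidya.co.in/2016/11/ugc-net-paper-1-previous-year-questions_14.html'
soup = BeautifulSoup(urllib2.urlopen(quote_page), 'html.parser')

# write the text of every matching div, one block per line
with open('questions.txt', 'w') as f:
    for name_box in soup.find_all('div', attrs={'class': 'MsoNormal'}):
        f.write(name_box.get_text().encode('utf-8') + '\n')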

Related

How to Run Multiple URLs in Requests from a File

I'm trying to scrape multiple websites from URLs in a txt file. There's one URL per line.
My code is:
import requests
from bs4 import BeautifulSoup
file = open('url.txt', 'r')
filelines = file.readline()
urllist = requests.get(filelines)
soup = BeautifulSoup(urllist.content, "html.parser")
content = soup.find_all("span", {"class": "title-main-info"})
print content
But it prints only the content of one URL (the last line). What am I doing wrong?
Thanks
Try this. It should work:
import requests
from bs4 import BeautifulSoup
with open('url.txt', 'r') as f:
    for links in f.readlines():
        urllist = requests.get(links.strip())
        soup = BeautifulSoup(urllist.content, "html.parser")
        content = soup.find_all("span", {"class": "title-main-info"})
        print content
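If some lines in url.txt turn out to be blank or unreachable, here is a slightly more defensive sketch of the same loop (the timeout and error handling are my own additions):
import requests
from bs4 import BeautifulSoup

with open('url.txt', 'r') as f:
    for line in f:
        url = line.strip()
        if not url:  # skip blank lines
            continue
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()  # raise on 4xx/5xx responses
        except requests.RequestException as e:
            print 'failed: %s (%s)' % (url, e)
            continue
        soup = BeautifulSoup(resp.content, 'html.parser')
        for span in soup.find_all('span', {'class': 'title-main-info'}):
            print span.get_text(strip=True).encode('utf-8')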

Why can't I append data on a new line in Python 2.7?

This is the code that I have written:
import urllib2
import codecs
import urllib
import re
from bs4 import BeautifulSoup
from lxml.html import fromstring
url="http://www.thehindu.com/sci-tech/science/iit-bombay-birds-eye-view-and-quantum-biology/article18191268.ece"
htmltext = urllib.urlopen(url).read()
resp = urllib.urlopen(url)
respData = resp.read()
paras = re.findall(r'<p>(.*?)</p>',str(respData))
soup = BeautifulSoup(htmltext,"lxml")
webpage_title = soup.find_all('h1', attrs = {"class": "title"})
webpage_title = webpage_title[0].get_text(strip=True)
with codecs.open("E:\\Crawler_paras_sorted_test_webpages_complete.txt", "w+", encoding="utf-8") as f:
    f.write(webpage_title)
soup = BeautifulSoup(htmltext,"lxml")
ut_container = soup.find("div", {"class": "ut-container"})
time = ut_container.find("none").text.strip()
with codecs.open("E:\\Crawler_paras_sorted_test_webpages_complete.txt", "a+", encoding="utf-8") as f:
    f.write(time)
The output that is written to the file is:
IIT Bombay: Bird’s eye view and quantum biologyApril 22, 2017 18:56 IST
I want the output to be saved like this:
IIT Bombay: Bird’s eye view and quantum biology
April 22, 2017 18:56 IST
Since this is very general, I am just giving an idea for this context.
You just need to write a newline after writing webpage_title:
f.write(webpage_title)
f.write("\n")
I used the Windows-style "\r\n". It works like a charm:
import urllib2
import codecs
import urllib
import re
from bs4 import BeautifulSoup
from lxml.html import fromstring
url="http://www.thehindu.com/sci-tech/science/iit-bombay-birds-eye-view-and-quantum-biology/article18191268.ece"
htmltext = urllib.urlopen(url).read()
resp = urllib.urlopen(url)
respData = resp.read()
paras = re.findall(r'<p>(.*?)</p>',str(respData))
soup = BeautifulSoup(htmltext,"lxml")
webpage_title = soup.find_all('h1', attrs = {"class": "title"})
webpage_title = webpage_title[0].get_text(strip=True)
with codecs.open("E:\\Crawler_paras_sorted_test_webpages_complete.txt", "w+", encoding="utf-8") as f:
    f.write(webpage_title + "\r\n")
soup = BeautifulSoup(htmltext,"lxml")
ut_container = soup.find("div", {"class": "ut-container"})
time = ut_container.find("none").text.strip()
with codecs.open("E:\\Crawler_paras_sorted_test_webpages_complete.txt", "a+", encoding="utf-8") as f:
    f.write(time)

Web scraping with Python modules urllib2 and BeautifulSoup

Recently I've tried to use urllib2 and BeautifulSoup to extract the source code of a web page; however, it failed and the output was improperly decoded.
The script is as follows (run in Python IDLE):
import urllib2
from bs4 import BeautifulSoup
web = "http://www.qq.com"
page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser")
print soup.prettify()
I found that the charset of "http://www.qq.com" is gb2312, so I added the encoding to the above script like this:
import urllib2
from bs4 import BeautifulSoup
web = "http://www.qq.com"
page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser", from_encoding="gb2312")
print soup.prettify()
But the result is frustrating. Is there any solution available?
[screenshot of the error message]
Last weekend I added the sys module to the above code, but it prints nothing, without a warning this time:
#coding=utf-8
import urllib2
from bs4 import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('gbk')
web = "http://www.qq.com"
page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser")
print soup.prettify()
Can you post the error message? Or is the problem that it's just not displaying Chinese characters to the screen?
Try switching to the gb18030 encoding. Even though the page says its charset is gb2312, there must be a character that's messing up the decoding. Switching encodings turned my terminal output from garbage to Chinese characters.
import urllib2
from bs4 import BeautifulSoup
web = "http://www.qq.com"
page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser", from_encoding="gb18030")
print soup.prettify()
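If you prefer to decode explicitly before parsing, here is a minimal sketch along the same lines (it assumes the response bytes really are GB-encoded and substitutes anything undecodable):
import urllib2
from bs4 import BeautifulSoup

raw = urllib2.urlopen("http://www.qq.com").read()
text = raw.decode("gb18030", "replace")  # 'replace' substitutes stray undecodable bytes
soup = BeautifulSoup(text, "html.parser")
print soup.prettify().encode("utf-8")    # encode explicitly before printing in Python 2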

How can I get all the software links?

I have this code:
import urllib
import urlparse
from bs4 import BeautifulSoup
url = "http://www.downloadcrew.com/?act=search&cat=51"
pageHtml = urllib.urlopen(url)
soup = BeautifulSoup(pageHtml)
for a in soup.select("div.productListingTitle a[href]"):
    try:
        print (a["href"]).encode("utf-8","replace")
    except:
        print "no link"
But when I run it, I only get 20 links. The output should be more than 20 links.
That's because you only download the first page of content.
Just use a loop to download all pages:
import urllib
import urlparse
from bs4 import BeautifulSoup
for i in xrange(3):
    url = "http://www.downloadcrew.com/?act=search&page=%d&cat=51" % i
    pageHtml = urllib.urlopen(url)
    soup = BeautifulSoup(pageHtml)
    for a in soup.select("div.productListingTitle a[href]"):
        try:
            print (a["href"]).encode("utf-8","replace")
        except:
            print "no link"
If you don't know the number of pages, you can loop until a page comes back empty:
import urllib
import urlparse
from bs4 import BeautifulSoup
i = 0
while 1:
    url = "http://www.downloadcrew.com/?act=search&page=%d&cat=51" % i
    pageHtml = urllib.urlopen(url)
    soup = BeautifulSoup(pageHtml)
    has_more = 0
    for a in soup.select("div.productListingTitle a[href]"):
        has_more = 1
        try:
            print (a["href"]).encode("utf-8","replace")
        except:
            print "no link"
    if has_more:
        i += 1
    else:
        break
I ran it on my computer and it got 60 links from three pages.
Good luck~
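The same open-ended loop can also collect the links instead of printing them as it goes; a sketch, still assuming that an empty page marks the end of the results:
import urllib
from bs4 import BeautifulSoup

links = []
i = 0
while True:
    url = "http://www.downloadcrew.com/?act=search&page=%d&cat=51" % i
    soup = BeautifulSoup(urllib.urlopen(url))
    anchors = soup.select("div.productListingTitle a[href]")
    if not anchors:  # an empty page means there are no more results
        break
    links.extend(a["href"] for a in anchors)
    i += 1

print "%d links found" % len(links)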

How to get all the application links on a page?

I have this code:
from bs4 import BeautifulSoup
import urllib
url = 'http://www.brothersoft.com/windows/mp3_audio/midi_tools/'
html = urllib.urlopen(url)
soup = BeautifulSoup(html)
for a in soup.select('div.freeText dl a[href]'):
    print "http://www.borthersoft.com"+a['href'].encode('utf-8','replace')
What I get is:
http://www.borthersoft.com/synthfont-159403.html
http://www.borthersoft.com/midi-maker-23747.html
http://www.borthersoft.com/keyboard-music-22890.html
http://www.borthersoft.com/mp3-editor-for-free-227857.html
http://www.borthersoft.com/midipiano---midi-file-player-recorder-61384.html
http://www.borthersoft.com/notation-composer-32499.html
http://www.borthersoft.com/general-midi-keyboard-165831.html
http://www.borthersoft.com/digital-music-mentor-31262.html
http://www.borthersoft.com/unisyn-250033.html
http://www.borthersoft.com/midi-maestro-13002.html
http://www.borthersoft.com/music-editor-free-139151.html
http://www.borthersoft.com/midi-converter-studio-46419.html
http://www.borthersoft.com/virtual-piano-65133.html
http://www.borthersoft.com/yamaha-9000-drumkit-282701.html
http://www.borthersoft.com/virtual-midi-keyboard-260919.html
http://www.borthersoft.com/anvil-studio-6269.html
http://www.borthersoft.com/midicutter-258103.html
http://www.borthersoft.com/softick-audio-gateway-55913.html
http://www.borthersoft.com/ipmidi-161641.html
http://www.borthersoft.com/d.accord-keyboard-chord-dictionary-28598.html
There should be 526 application links printed out, but I only get twenty.
What is missing in the code?
There are only 20 application links on each page.
You have to iterate over all the pages to get all the links:
from bs4 import BeautifulSoup
import urllib
for page in range(1, 27+1): # currently there are 27 pages.
    url = 'http://www.brothersoft.com/windows/mp3_audio/midi_tools/{}.html'.format(page)
    html = urllib.urlopen(url)
    soup = BeautifulSoup(html)
    for a in soup.select('div.freeText dl a[href]'):
        # note: the question's code misspells the host as "borthersoft"
        print "http://www.brothersoft.com"+a['href'].encode('utf-8','replace')