How I can get the name of url with BeautifulSoup.
I've this code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
html_page = urllib2.urlopen("http://www.youtube.com")
soup = BeautifulSoup(html_page)
list = soup.findAll('div', attrs={'class':'profileBox'})
for div in list:
print div.a['href']
---------------------------------
sam utx
-------------------------------------
This print the href("/sam") but I need is the url's name (sam utx).
How I can make this?
You can select the text inside the div itself with this :
div.a.string
You can read more about this here.
Related
I have a problem on API. Its turns to me empty list
I tried to search browser but none is my answer.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib import re
site = "http://www.hurriyet.com.tr"
regex = "<span class='news-title'>(.+?)</span>"
comp = re.compile(regex)
print(comp) print(regex)
htmlkod = urllib.urlopen(site).read()
titles = re.findall(regex, htmlkod)
print(titles)
i=1
for title in titles:
print str(i), title.decode("iso8859-9")
print(title)
i+=1
I Expect the its turn to me news titles but its turn me "[]" empty list
I recommend to use BeautifulSoup instead of regex like :
from urllib import urlopen
from bs4 import BeautifulSoup
site = "http://www.hurriyet.com.tr"
openurl = urlopen(site)
soup = BeautifulSoup(openurl, "html.parser")
getTitle = soup.findAll('span', attrs={'class': 'news-title'})
for title in getTitle:
print title.text
I want to download all csv files, any idea how I do this?
from bs4 import BeautifulSoup
import requests
url = requests.get('http://www.football-data.co.uk/englandm.php').text
soup = BeautifulSoup(url)
for link in soup.findAll("a"):
print link.get("href")
Something like this should work:
from bs4 import BeautifulSoup
from time import sleep
import requests
if __name__ == '__main__':
url = requests.get('http://www.football-data.co.uk/englandm.php').text
soup = BeautifulSoup(url)
for link in soup.findAll("a"):
current_link = link.get("href")
if current_link.endswith('csv'):
print('Found CSV: ' + current_link)
print('Downloading %s' % current_link)
sleep(10)
response = requests.get('http://www.football-data.co.uk/%s' % current_link, stream=True)
fn = current_link.split('/')[0] + '_' + current_link.split('/')[1] + '_' + current_link.split('/')[2]
with open(fn, "wb") as handle:
for data in response.iter_content():
handle.write(data)
You just need to filter the hrefs which you can do with a css selector,a[href$=.csv] which will find the href's ending in .csv then join each to the base url, request and finally write the content:
from bs4 import BeautifulSoup
import requests
from urlparse import urljoin
from os.path import basename
base = "http://www.football-data.co.uk/"
url = requests.get('http://www.football-data.co.uk/englandm.php').text
soup = BeautifulSoup(url)
for link in (urljoin(base, a["href"]) for a in soup.select("a[href$=.csv]")):
with open(basename(link), "w") as f:
f.writelines(requests.get(link))
Which will give you five files, E0.csv, E1.csv, E2.csv, E3.csv, E4.csv with all the data inside.
I want to get all the product links from specific category by using BeautifulSoup in Python.
I have tried the following but don't get a result:
import lxml
import urllib2
from bs4 import BeautifulSoup
html=urllib2.urlopen("http://www.bedbathandbeyond.com/store/category/bedding/bedding/quilts-coverlets/12018/1-96?pagSortOpt=DEFAULT-0&view=grid")
br= BeautifulSoup(html.read(),'lxml')
for links in br.findAll('a', class_='prodImg'):
print links['href']
You use urllib2 wrong.
import lxml
import urllib2
from bs4 import BeautifulSoup
#create a http request
req=urllib2.Request("http://www.bedbathandbeyond.com/store/category/bedding/bedding/quilts-coverlets/12018/1-96?pagSortOpt=DEFAULT-0&view=grid")
# send the request
response = urllib2.urlopen(req)
# read the content of the response
html = response.read()
br= BeautifulSoup(html,'lxml')
for links in br.findAll('a', class_='prodImg'):
print links['href']
from bs4 import BeautifulSoup
import requests
html=requests.get("http://www.bedbathandbeyond.com/store/category/bedding/bedding/quilts-coverlets/12018/1-96?pagSortOpt=DEFAULT-0&view=grid")
br= BeautifulSoup(html.content,"lxml")
data=br.findAll('div',attrs={'class':'productShadow'})
for div in br.find_all('a'):
print div.get('href')
try this code
I have this code:
import urllib
import urlparse
from bs4 import BeautifulSoup
url = "http://www.downloadcrew.com/?act=search&cat=51"
pageHtml = urllib.urlopen(url)
soup = BeautifulSoup(pageHtml)
for a in soup.select("div.productListingTitle a[href]"):
try:
print (a["href"]).encode("utf-8","replace")
except:
print "no link"
pass
But when I run it, I only get 20 links only. The output should be more than 20 links.
Because you only download the first page of content.
Just use a loop to donwload all pages:
import urllib
import urlparse
from bs4 import BeautifulSoup
for i in xrange(3):
url = "http://www.downloadcrew.com/?act=search&page=%d&cat=51" % i
pageHtml = urllib.urlopen(url)
soup = BeautifulSoup(pageHtml)
for a in soup.select("div.productListingTitle a[href]"):
try:
print (a["href"]).encode("utf-8","replace")
except:
print "no link"
if you do'nt know the count of pages, you can
import urllib
import urlparse
from bs4 import BeautifulSoup
i = 0
while 1:
url = "http://www.downloadcrew.com/?act=search&page=%d&cat=51" % i
pageHtml = urllib.urlopen(url)
soup = BeautifulSoup(pageHtml)
has_more = 0
for a in soup.select("div.productListingTitle a[href]"):
has_more = 1
try:
print (a["href"]).encode("utf-8","replace")
except:
print "no link"
if has_more:
i += 1
else:
break
I run it on my computer and it get 60 link of three pages.
Good luck~
I have this code:
from bs4 import BeautifulSoup
import urllib
url = 'http://www.brothersoft.com/windows/mp3_audio/midi_tools/'
html = urllib.urlopen(url)
soup = BeautifulSoup(html)
for a in soup.select('div.freeText dl a[href]'):
print "http://www.borthersoft.com"+a['href'].encode('utf-8','replace')
What I get is:
http://www.borthersoft.com/synthfont-159403.html
http://www.borthersoft.com/midi-maker-23747.html
http://www.borthersoft.com/keyboard-music-22890.html
http://www.borthersoft.com/mp3-editor-for-free-227857.html
http://www.borthersoft.com/midipiano---midi-file-player-recorder-61384.html
http://www.borthersoft.com/notation-composer-32499.html
http://www.borthersoft.com/general-midi-keyboard-165831.html
http://www.borthersoft.com/digital-music-mentor-31262.html
http://www.borthersoft.com/unisyn-250033.html
http://www.borthersoft.com/midi-maestro-13002.html
http://www.borthersoft.com/music-editor-free-139151.html
http://www.borthersoft.com/midi-converter-studio-46419.html
http://www.borthersoft.com/virtual-piano-65133.html
http://www.borthersoft.com/yamaha-9000-drumkit-282701.html
http://www.borthersoft.com/virtual-midi-keyboard-260919.html
http://www.borthersoft.com/anvil-studio-6269.html
http://www.borthersoft.com/midicutter-258103.html
http://www.borthersoft.com/softick-audio-gateway-55913.html
http://www.borthersoft.com/ipmidi-161641.html
http://www.borthersoft.com/d.accord-keyboard-chord-dictionary-28598.html
There should be 526 application links to be printed out.
But I only get twenty?
What is not enough with the code?
There's only 20 application links in a page.
You have to iterate all pages to get all links:
from bs4 import BeautifulSoup
import urllib
for page in range(1, 27+1): # currently there are 27 pages.
url = 'http://www.brothersoft.com/windows/mp3_audio/midi_tools/{}.html'.format(page)
html = urllib.urlopen(url)
soup = BeautifulSoup(html)
for a in soup.select('div.freeText dl a[href]'):
print "http://www.borthersoft.com"+a['href'].encode('utf-8','replace')