Python/BeautifulSoup: Retrieving 'href' attribute - python-2.7

I am trying to get the href attribute from a website I am scraping. My script:
from bs4 import BeautifulSoup
import requests
import csv
i = 1
for i in range(1, 2, 1):
    i = str(i)
    baseurl = "https://www.quandoo.nl/amsterdam?page=" + i
    r1 = requests.get(baseurl)
    data = r1.text
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.findAll('span', {'class', "merchant-title", 'itemprop', "name", 'a'}):
        print link
Returns the following:
<span class="merchant-title" itemprop="name">Ristorante Due Napoletani</span>
<span class="merchant-title" itemprop="name">YamYam</span>
<span class="merchant-title" itemprop="name">The Golden Temple</span>
<span class="merchant-title" itemprop="name">Sampurna</span>
<span class="merchant-title" itemprop="name">Motto Sushi</span>
<span class="merchant-title" itemprop="name">Takumi-Ya</span>
<span class="merchant-title" itemprop="name">Casa di David</span>
(This is only part of the output; I didn't want to bombard you with all of it.) I have no issue pulling out the string with the restaurant's name, but I can't find a configuration that gives me just the href attribute, and the .strip() method doesn't seem feasible with my current configuration. Any help would be great.

Try this code; it works for me:
from bs4 import BeautifulSoup
import requests
import csv
import re
for i in range(1, 2, 1):
    baseurl = "https://www.quandoo.nl/amsterdam?page=" + str(i)
    r1 = requests.get(baseurl)
    data = r1.text
    soup = BeautifulSoup(data, "html.parser")
    # attrs must be a dict of attribute name -> value
    for link in soup.findAll('span', {'class': "merchant-title", 'itemprop': "name"}):
        # Search the tag's own markup for an href; re.search returns None if there is none
        match = re.search(r'href=[\'"]?([^\'" >]+)', str(link))
        if match:
            print match.group(1)  # group(1) is the URL without the leading href=
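The spans printed in the question carry no href of their own, so the link most likely has to come from an <a> element around (or near) each span. Here is a minimal sketch of that idea, assuming each title span sits inside an anchor. The sample markup, the URLs, and the helper name extract_links are invented for illustration and are not taken from the real page; the code is written for Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page structure may differ.
SAMPLE_HTML = """
<a href="/place/yamyam-123"><span class="merchant-title" itemprop="name">YamYam</span></a>
<a href="/place/sampurna-456"><span class="merchant-title" itemprop="name">Sampurna</span></a>
<span class="merchant-title" itemprop="name">No Link Here</span>
"""


def extract_links(html):
    """Return (name, href) pairs for title spans wrapped in an <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for span in soup.find_all("span", attrs={"class": "merchant-title", "itemprop": "name"}):
        anchor = span.find_parent("a")  # nearest enclosing <a>, or None
        if anchor is not None and anchor.has_attr("href"):
            results.append((span.get_text(strip=True), anchor["href"]))
    return results


if __name__ == "__main__":
    for name, href in extract_links(SAMPLE_HTML):
        print(name, href)
```

Reading the attribute from the parsed tag avoids running a regex over markup, and spans without an enclosing link are simply skipped.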

Related

get picture from dynamic content python

I'm trying to get the href of the picture from a URL without using Selenium:
def():
    try:
        page = urllib2.urlopen('')
    except httplib.IncompleteRead, e:
        page = e.partial
    response = BeautifulSoup(page)
    print response
    var = response.find("div", {"id":"il_m"}).find('p')
But I got None as a result. What should I do to get the href?
You can also get the link from an anchor tag with a download attribute:
In [2]: from bs4 import BeautifulSoup
In [3]: import urllib2
In [4]: r = urllib2.urlopen('http://icecat.us/index.php/product/image_gallery?num=9010647&id=9409545&lang=us&imgrefurl=philips.com')
In [5]: soup = BeautifulSoup(r,"html.parser")
In [6]: print(soup.select_one("p a[download]")["href"])
http://images.icecat.biz/img/gallery/9010647-Philips-_FP.jpg
You should also take note of the text "Images may be subject to copyright." on the page.
You're not targeting the right p tag:
First of all, you want to extract the href from the <a> node, not from <p>.
The first <p> child element that is found is this one: <p class="il_r" id="url_domain"></p>
What you could do is target the <a> inside the 5th <p> element, which is the image link. One way of doing this is var = response.find("div", id="il_m").find_all('p')[4].find('a'), after which var['href'] holds the link.
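Both answers come down to selecting the <a> rather than the first <p>. The sketch below shows the difference on a small, made-up fragment; the markup, the example.com URL, and the function names are assumptions for illustration, not the real page. It is written for Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on the answers above.
SAMPLE_HTML = """
<div id="il_m">
  <p class="il_r" id="url_domain"></p>
  <p><a download href="http://example.com/picture.jpg">Download</a></p>
</div>
"""


def first_p_anchor(html):
    """Mimic the question: look for a link inside the first <p> only."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", id="il_m").find("p").find("a")


def image_href(html):
    """Select the <a> with a download attribute and return its href, or None."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("div#il_m p a[download]")
    return link["href"] if link is not None else None


if __name__ == "__main__":
    print(first_p_anchor(SAMPLE_HTML))  # the first <p> holds no link
    print(image_href(SAMPLE_HTML))
```

A CSS selector keyed on the download attribute is less fragile than indexing the fifth <p>, since it keeps working if paragraphs are added or removed.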

Why can't I extract the subheading of a page using BeautifulSoup?

I am trying to extract the name and subheading of this page (for example). I have no problem extracting the name, but it's unsuccessful for the subheading. Using inspect element in Chrome, I identified that the subheading text "Canada Census, 1901" is embedded as follows:
<div class="person-info">
<div class="title ng-binding">Helen Brad in household of Geo Wilcock</div>
<div class="subhead ng-scope ng-binding" data-ng-if="!recordPersonCentric">Canada Census, 1901</div>
So I coded my script as follows:
import urllib2
import re
import csv
from bs4 import BeautifulSoup
import time
def get_FamSearch():
    link = "https://example.org/pal:/MM9.1.1/KH11-999"
    openLink = urllib2.urlopen(link)
    Soup_FamSearch = BeautifulSoup(openLink, "html")
    openLink.close()
    NameParentTag = Soup_FamSearch.find("tr", class_="result-item highlight-person")
    if NameParentTag:
        Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
        name_decode = Name.encode("ascii", "ignore")
        print name_decode
    SubheadTag = Soup_FamSearch.find("div", class_="subhead ng-scope ng-binding")
    if SubheadTag:
        print SubheadTag.get_text(strip=True)
get_FamSearch()
These are the results; the script fails to locate and extract the subheading:
Helen Brad
[Finished in 2.2s]
The page you are getting via urllib2 doesn't contain the div with the subhead class. The actual heading is constructed asynchronously by JavaScript executed in the browser, so it never appears in the HTML that urllib2 downloads.
The data you need is presented differently in the raw HTML; here's what works for me:
print Soup_FamSearch.find('dt', text='Title').find_next_sibling('dd').text.strip()
Prints:
Canada Census, 1901
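The one-liner above relies on a label/value pair stored as adjacent <dt> and <dd> elements. A minimal sketch of that lookup with a guard for missing labels follows; the sample markup and the helper name value_for_label are assumptions for illustration, and the real page may be laid out differently. It is written for Python 3, where the string= argument is the current spelling of text=.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a definition list holding label/value pairs.
SAMPLE_HTML = """
<dl>
  <dt>Name</dt><dd>Helen Brad</dd>
  <dt>Title</dt><dd> Canada Census, 1901 </dd>
</dl>
"""


def value_for_label(html, label):
    """Return the text of the <dd> that follows the <dt> whose text equals label."""
    soup = BeautifulSoup(html, "html.parser")
    dt = soup.find("dt", string=label)
    if dt is None:
        return None
    dd = dt.find_next_sibling("dd")
    return dd.get_text(strip=True) if dd is not None else None


if __name__ == "__main__":
    print(value_for_label(SAMPLE_HTML, "Title"))
```

Returning None instead of chaining calls directly avoids an AttributeError when a record lacks the label.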

Python, BeautifulSoup code seems to work, but no data in the CSV?

I have about 500 html files in a directory, and I want to extract data from them and save the results in a CSV.
The code I'm using doesn't get any error messages, and seems to be scanning all the files, but the resulting CSV is empty except for the top row.
I'm fairly new to python and I'm clearly doing something wrong. I hope someone out there can help!
from bs4 import BeautifulSoup
import csv
import urllib2
import os
def processData( pageFile ):
    f = open(pageFile, "r")
    page = f.read()
    f.close()
    soup = BeautifulSoup(page)
    metaData = soup.find_all('div class="item_details"')
    priceData = soup.find_all('div class="price_big"')
    # define where we will store info
    vendors = []
    shipsfroms = []
    shipstos = []
    prices = []
    for html in metaData:
        text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "")
        vendors.append(text.split("vendor:")[1].split("ships from:")[0].strip())
        shipsfroms.append(text.split("ships from:")[1].split("ships to:")[0].strip())
        shipstos.append(text.split("ships to:")[1].strip())
    for price in priceData:
        prices.append(BeautifulSoup(str(price)).get_text().encode("utf-8").strip())
    csvfile = open('drugs.csv', 'ab')
    writer = csv.writer(csvfile)
    for shipsto, shipsfrom, vendor, price in zip(shipstos, shipsfroms, vendors, prices):
        writer.writerow([shipsto, shipsfrom, vendor, price])
    csvfile.close()
dir = "drugs"
csvFile = "drugs.csv"
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Vendors", "ShipsTo", "ShipsFrom", "Prices"])
csvfile.close()
fileList = os.listdir(dir)
totalLen = len(fileList)
count = 1
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile)
    processData(path)
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..."
    count = count + 1
I suspect that I'm telling BS to look in the wrong part of the html code? But I can't see what it should be instead. Here's an excerpt of the html code with the info I need:
</div>
<div class="item" style="overflow: hidden;">
    <div class="item_image" style="width: 180px; height: 125px;" id="image_255"></div>
    <div class="item_body">
        <div class="item_title">200mg High Quality DMT</div>
        <div class="item_details">
            vendor: ringo deathstarr<br>
            ships from: United States<br>
            ships to: Worldwide
        </div>
    </div>
    <div class="item_price">
        <div class="price_big">฿0.031052</div>
        add to cart
    </div>
</div>
Disclaimer: the information is for a research project about online drug trade.
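The way you are calling find_all is wrong: it takes the tag name and the attributes as separate arguments, not one raw string of markup, so your calls match nothing and the lists stay empty. Here is a working example:
metaData = soup.find_all("div", {"class":"item_details"})
priceData = soup.find_all("div", {"class":"price_big"})
You can find more about its usage in the BeautifulSoup documentation.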
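To show the corrected find_all calls end to end, here is a minimal sketch that parses one listing and writes it as CSV. The sample markup uses placeholder values shaped like the excerpt in the question, the function names are hypothetical, and the CSV is written to an in-memory buffer rather than a file so the example stays self-contained. It is written for Python 3.

```python
import csv
import io

from bs4 import BeautifulSoup

# Placeholder listing shaped like the excerpt in the question.
SAMPLE_HTML = """
<div class="item">
  <div class="item_body">
    <div class="item_title">Sample item</div>
    <div class="item_details">
      vendor: some vendor<br>
      ships from: United States<br>
      ships to: Worldwide
    </div>
  </div>
  <div class="item_price">
    <div class="price_big">0.031052</div>
  </div>
</div>
"""


def extract_rows(html):
    """Return one [vendor, ships_to, ships_from, price] row per listing."""
    soup = BeautifulSoup(html, "html.parser")
    details = soup.find_all("div", class_="item_details")
    prices = soup.find_all("div", class_="price_big")
    rows = []
    for detail, price in zip(details, prices):
        # Join the text pieces around the <br> tags with single spaces.
        text = detail.get_text(" ", strip=True)
        vendor = text.split("vendor:")[1].split("ships from:")[0].strip()
        ships_from = text.split("ships from:")[1].split("ships to:")[0].strip()
        ships_to = text.split("ships to:")[1].strip()
        rows.append([vendor, ships_to, ships_from, price.get_text(strip=True)])
    return rows


def to_csv(rows):
    """Write the header and rows to an in-memory CSV and return it as text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Vendors", "ShipsTo", "ShipsFrom", "Prices"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_rows(SAMPLE_HTML)))
```

The sketch writes each row in the same column order as the header. The script in the question writes shipsto, shipsfrom, vendor, price under a header that starts with Vendors, so its columns would be mislabeled even once data appears.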

Using beautifulsoup to extract data from html content - HTML Parsing

The output of my script, which uses the BeautifulSoup library, is as follows:
<meta content="Free" itemprop="price" />
and
<div class="content" itemprop="datePublished">November 4, 2013</div>
I want to pull the words Free and November 4, 2013 from that output. Will a regex help, or does BeautifulSoup have attributes that will pull these out directly? Here is the code I used:
from BeautifulSoup import BeautifulSoup
import urllib
import re
pageFile = urllib.urlopen("https://play.google.com/store/apps/details?id=com.ea.game.fifa14_na")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
item = soup.find("meta", {"itemprop":"price"})
print item
items = soup.find("div",{"itemprop":"datePublished"})
print items
OK, got it! Just access the values as follows (for the above case):
from BeautifulSoup import BeautifulSoup
import urllib
pageFile = urllib.urlopen("https://play.google.com/store/apps/details?id=com.ea.game.fifa14_na")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
item = soup.find("meta", {"itemprop":"price"}) # meta content="Free" itemprop="price"
print item['content']
items = soup.find("div",{"itemprop":"datePublished"})
print items.string
No need for a regex. Reading through the documentation would have helped.
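The same two lookups can be sketched against the snippets from the question without fetching the live page. This version assumes Python 3 and the bs4 package rather than the older BeautifulSoup 3 import used above; the function name price_and_date is hypothetical.

```python
from bs4 import BeautifulSoup

# The two tags quoted in the question.
SAMPLE_HTML = """
<meta content="Free" itemprop="price" />
<div class="content" itemprop="datePublished">November 4, 2013</div>
"""


def price_and_date(html):
    """Return (price, date): an attribute value and a tag's text."""
    soup = BeautifulSoup(html, "html.parser")
    price_tag = soup.find("meta", attrs={"itemprop": "price"})
    date_tag = soup.find("div", attrs={"itemprop": "datePublished"})
    # Attribute values are read with .get(); element text with get_text().
    price = price_tag.get("content") if price_tag is not None else None
    date = date_tag.get_text(strip=True) if date_tag is not None else None
    return price, date


if __name__ == "__main__":
    print(price_and_date(SAMPLE_HTML))
```

The price lives in an attribute while the date is the element's text, which is why the two values are read differently.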

Using Beautiful Soup4 can I get the <a> embedded text?

I am using Beautiful Soup 4 with Python and have gotten as far as the code below. How can I get the values Dog, Cat, and Horse, and also the ids? Please help!
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
soup.find_all('a')
# [<a class="Animal" href="http://example.com/elsie" id="link1">Dog</a>,
# <a class="Animal" href="http://example.com/lacie" id="link2">Cat</a>,
# <a class="Animal" href="http://example.com/tillie" id="link3">Horse</a>]
Documentation
for a in soup.find_all('a'):
    id = a.get('id')
    value = a.string # This is a NavigableString
    unicode_value = unicode(value)
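For reference, a Python 3 sketch of the same loop, where str replaces unicode and get_text() returns a plain string directly. The html_doc contents are reconstructed from the find_all output shown in the question, and the helper name ids_and_text is hypothetical.

```python
from bs4 import BeautifulSoup

# Reconstructed from the find_all output shown in the question.
HTML_DOC = """
<a class="Animal" href="http://example.com/elsie" id="link1">Dog</a>
<a class="Animal" href="http://example.com/lacie" id="link2">Cat</a>
<a class="Animal" href="http://example.com/tillie" id="link3">Horse</a>
"""


def ids_and_text(html):
    """Return (id, text) pairs for every <a class="Animal"> element."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get("id"), a.get_text(strip=True))
        for a in soup.find_all("a", class_="Animal")
    ]


if __name__ == "__main__":
    for link_id, text in ids_and_text(HTML_DOC):
        print(link_id, text)
```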