Get picture from dynamic content in Python - python-2.7

I'm trying to get the href of the picture from a URL without using Selenium:

    import urllib2
    import httplib
    from bs4 import BeautifulSoup

    def get_href():
        try:
            page = urllib2.urlopen('')
        except httplib.IncompleteRead, e:
            page = e.partial
        response = BeautifulSoup(page)
        print response
        var = response.find("div", {"id": "il_m"}).find('p')

but I got None as a result. What should I do to get the href?

You can also get the link from an anchor tag with a download attribute:
In [2]: from bs4 import BeautifulSoup
In [3]: import urllib2
In [4]: r = urllib2.urlopen('http://icecat.us/index.php/product/image_gallery?num=9010647&id=9409545&lang=us&imgrefurl=philips.com')
In [5]: soup = BeautifulSoup(r,"html.parser")
In [6]: print(soup.select_one("p a[download]")["href"])
http://images.icecat.biz/img/gallery/9010647-Philips-_FP.jpg
You should also take note of the text "Images may be subject to copyright." on the page.
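The select_one pattern can be checked against a literal snippet, independent of the live page (the markup below is a guess at the relevant fragment, using the image URL from the output above):

```python
from bs4 import BeautifulSoup

snippet = ('<p><a download '
           'href="http://images.icecat.biz/img/gallery/9010647-Philips-_FP.jpg">'
           'Download</a></p>')
soup = BeautifulSoup(snippet, "html.parser")
# "p a[download]" matches an <a> with a download attribute inside a <p>
print(soup.select_one("p a[download]")["href"])
# http://images.icecat.biz/img/gallery/9010647-Philips-_FP.jpg
```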

You're not targeting the right <p> tag:

First of all, you want to extract the href from the <a> node, not the <p>.
The first <p> child element that is found is this one: <p class="il_r" id="url_domain"></p>
What you could do is target the <a> inside the 5th <p> element, which is the image. One way of doing this is:

    var = response.find("div", id="il_m").find_all('p')[4].find('a')
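The indexing idea can be sketched on made-up HTML; the ids and structure here are illustrative, not the real page:

```python
from bs4 import BeautifulSoup

html_doc = """
<div id="il_m">
  <p class="il_r" id="url_domain"></p>
  <p></p><p></p><p></p>
  <p><a href="http://example.com/image.jpg">image</a></p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# find_all('p') returns all five <p> tags; [4] is the one wrapping the <a>
link = soup.find("div", id="il_m").find_all("p")[4].find("a")
print(link["href"])
# http://example.com/image.jpg
```

Note that positional indexing like this is fragile: if the page layout changes, the index breaks.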

Related

Python/BeautifulSoup: Retrieving 'href' attribute

I am trying to get the href attribute from a website I am scraping. My script:

    from bs4 import BeautifulSoup
    import requests
    import csv

    for i in range(1, 2, 1):
        i = str(i)
        baseurl = "https://www.quandoo.nl/amsterdam?page=" + i
        r1 = requests.get(baseurl)
        data = r1.text
        soup = BeautifulSoup(data, "html.parser")
        for link in soup.findAll('span', {'class', "merchant-title", 'itemprop', "name", 'a'}):
            print link
Returns the following:
<span class="merchant-title" itemprop="name">Ristorante Due Napoletani</span>
<span class="merchant-title" itemprop="name">YamYam</span>
<span class="merchant-title" itemprop="name">The Golden Temple</span>
<span class="merchant-title" itemprop="name">Sampurna</span>
<span class="merchant-title" itemprop="name">Motto Sushi</span>
<span class="merchant-title" itemprop="name">Takumi-Ya</span>
<span class="merchant-title" itemprop="name">Casa di David</span>
(This is only part of it; I didn't want to bombard you with the entire output.) I have no issue pulling out the string with the restaurant's name, but I can't find a configuration that gives me just the href attribute. And the .strip() method doesn't seem feasible with my current configuration. Any help would be great.
Try this code; note that group(1), not group(0), gives you just the captured URL, without the href= prefix:

    from bs4 import BeautifulSoup
    import requests
    import csv
    import re

    for i in range(1, 2, 1):
        i = str(i)
        baseurl = "https://www.quandoo.nl/amsterdam?page=" + i
        r1 = requests.get(baseurl)
        data = r1.text
        soup = BeautifulSoup(data, "html.parser")
        for link in soup.findAll('span', {'class', "merchant-title", 'itemprop', "name", 'a'}):
            match = re.search(r'href=[\'"]?([^\'" >]+)', str(link))
            if match:  # guard against tags that carry no href at all
                print match.group(1)
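Alternatively, instead of running a regex over the serialized tag, you can read the href directly from the enclosing anchor element. A minimal sketch on made-up markup (the span-inside-anchor structure and the /place/... URL are assumptions, not taken from the real site):

```python
from bs4 import BeautifulSoup

html_doc = ('<a class="merchant-title" href="/place/yamyam-123">'
            '<span class="merchant-title" itemprop="name">YamYam</span></a>')
soup = BeautifulSoup(html_doc, "html.parser")
for span in soup.find_all("span", class_="merchant-title"):
    a = span.find_parent("a")  # climb to the enclosing <a>, if any
    if a and a.has_attr("href"):
        print(a["href"])
# /place/yamyam-123
```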

Reading a webpage using lxml and xpath

I'm trying to get the latest prices for some of the markets from PredictIt. For example, the "Will Donald Trump win the 2016 Republican presidential nomination?" market found at https://www.predictit.org/contract/838/ I specifically want the text that will be "Latest Price: ??"
Chrome tells me that the xpath is /html/body/div[7]/div/div[2]/div[2]/p[1]/strong/text()
import urllib2
url = 'https://www.predictit.org/Contract/838/'
page = urllib2.urlopen(url)
data = page.read()
from lxml import html
etree = html.fromstring(data)
price = etree.xpath('/html/body/div[7]/div/div[2]/div[2]/p[1]/strong/text()')
Everything seems ok, but
print price
returns an empty list.
Any ideas?
If you can rely on the string 'Latest Price' being inside a <strong> tag, then you could use:
In [305]: root.xpath('//strong[contains(text(), "Latest Price:")]/text()')
Out[305]: ['Latest Price: 34']
Or, perhaps more robustly, you could search all <p> tags and their descendants for text which includes the string 'Latest Price':
In [312]: root.xpath('//p/descendant-or-self::*[contains(text(), "Latest Price")]/text()')
Out[312]: ['Latest Price: 34']
    import urllib2
    url = 'https://www.predictit.org/Contract/838/'
    page = urllib2.urlopen(url)
    data = page.read()

    import lxml.html as LH
    root = LH.fromstring(data)

    price = None
    for text in root.xpath('//p/descendant-or-self::*[contains(text(), "Latest Price:")]/text()'):
        price = float(text.split(':', 1)[-1])
    print(price)
    # 35
The reason why the XPath /html/body/div[7]/div/div[2]/div[2]/p[1]/strong/text() may be failing is because the HTML received from urllib2.urlopen(url).read() may be different than the HTML received by Chrome. Chrome's browser processes JavaScript which may change the DOM. urllib2 does not process JavaScript. If you needed the DOM after executing JavaScript, then you would need an automated browser like Selenium instead of urllib2. Happily in this case, the content you are looking for is not supplied by JavaScript. However, an overly specific XPath such as /html/body/div[7]/div/div[2]/div[2]/p[1]/strong/text() may trip you up.
Using the HTML returned by urllib2, there appear to be only 6 <div> tags:
In [315]: root.xpath('/html/body/div')
Out[315]:
[<Element div at 0x7f0bd63632b8>,
<Element div at 0x7f0bd6363310>,
<Element div at 0x7f0bd6363368>,
<Element div at 0x7f0bd63633c0>,
<Element div at 0x7f0bd6363418>,
<Element div at 0x7f0bd6363470>]
Trying to access the 7th <div> tag yields an empty list:
In [316]: root.xpath('/html/body/div[7]')
Out[316]: []
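The text-matching XPath can be tried on a literal snippet, independent of the live page (the HTML here is made up to mirror the structure the answer assumes):

```python
from lxml import html

snippet = '<html><body><p><strong>Latest Price: 34</strong></p></body></html>'
root = html.fromstring(snippet)
# contains() matches on the text content, so the surrounding structure
# can change without breaking the query
print(root.xpath('//strong[contains(text(), "Latest Price:")]/text()'))
# ['Latest Price: 34']
```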

Why can't I extract the subheading of a page using BeautifulSoup?

I am trying to extract the name and subheading of this page (for example). I have no problem extracting the name, but I can't extract the subheading. Using inspect element in Chrome, I identified that the subheading text "Canada Census, 1901" is embedded as follows:
<div class="person-info">
<div class="title ng-binding">Helen Brad in household of Geo Wilcock</div>
<div class="subhead ng-scope ng-binding" data-ng-if="!recordPersonCentric">Canada Census, 1901</div>
So I coded my script as follows:

    import urllib2
    import re
    import csv
    from bs4 import BeautifulSoup
    import time

    def get_FamSearch():
        link = "https://example.org/pal:/MM9.1.1/KH11-999"
        openLink = urllib2.urlopen(link)
        Soup_FamSearch = BeautifulSoup(openLink, "html")
        openLink.close()

        NameParentTag = Soup_FamSearch.find("tr", class_="result-item highlight-person")
        if NameParentTag:
            Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
            name_decode = Name.encode("ascii", "ignore")
            print name_decode

        SubheadTag = Soup_FamSearch.find("div", class_="subhead ng-scope ng-binding")
        if SubheadTag:
            print SubheadTag.get_text(strip=True)

    get_FamSearch()
This is the results, without able to locate and extract the subheading:
Helen Brad
[Finished in 2.2s]
The page you are getting via urllib2 doesn't contain the div with the subhead class. The actual heading is constructed asynchronously by JavaScript executed on the browser side.
The data you need is presented differently, here's what works for me:
print Soup_FamSearch.find('dt', text='Title').find_next_sibling('dd').text.strip()
Prints:
Canada Census, 1901
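The find_next_sibling pattern can be sketched on a literal snippet; the dt/dd structure below is an assumption inferred from the answer's selector, not copied from the real page:

```python
from bs4 import BeautifulSoup

html_doc = '<dl><dt>Title</dt><dd>Canada Census, 1901</dd></dl>'
soup = BeautifulSoup(html_doc, 'html.parser')
# find the <dt> whose text is 'Title', then step to the <dd> next to it
print(soup.find('dt', string='Title').find_next_sibling('dd').text.strip())
# Canada Census, 1901
```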

Using beautifulsoup to extract data from html content - HTML Parsing

The output of my script, which uses the BeautifulSoup library, includes the following:
<meta content="Free" itemprop="price" />
and
<div class="content" itemprop="datePublished">November 4, 2013</div>
I want to pull the words Free and November 4, 2013 from that output. Will using a regex help, or does BeautifulSoup have attributes that will pull this out directly? Here is the code I used:
from BeautifulSoup import BeautifulSoup
import urllib
import re
pageFile = urllib.urlopen("https://play.google.com/store/apps/details?id=com.ea.game.fifa14_na")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
item = soup.find("meta", {"itemprop":"price"})
print item
items = soup.find("div",{"itemprop":"datePublished"})
print items
Ok, got it! Just access the values as follows (for the above case):

    from BeautifulSoup import BeautifulSoup
    import urllib

    pageFile = urllib.urlopen("https://play.google.com/store/apps/details?id=com.ea.game.fifa14_na")
    pageHtml = pageFile.read()
    pageFile.close()

    soup = BeautifulSoup("".join(pageHtml))
    item = soup.find("meta", {"itemprop": "price"})  # <meta content="Free" itemprop="price" />
    print item['content']
    items = soup.find("div", {"itemprop": "datePublished"})
    print items.string

No need for a regex; a read through the documentation would help.
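The same attribute access works in bs4 (BeautifulSoup 4) on a literal snippet; the markup here is copied from the fragments quoted in the question:

```python
from bs4 import BeautifulSoup

snippet = ('<meta content="Free" itemprop="price" />'
           '<div class="content" itemprop="datePublished">November 4, 2013</div>')
soup = BeautifulSoup(snippet, 'html.parser')
# tag['attr'] reads an attribute; tag.string reads the enclosed text
print(soup.find('meta', {'itemprop': 'price'})['content'])     # Free
print(soup.find('div', {'itemprop': 'datePublished'}).string)  # November 4, 2013
```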

Using Beautiful Soup4 can I get the <a> embedded text?

I am using Beautiful Soup 4 with Python, and so far I have got to the point below. Now, how can I get the values Dog, Cat and Horse, and also the ids? Please help!
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
soup.find_all('a')
# [<a class="Animal" href="http://example.com/elsie" id="link1">Dog</a>,
# <a class="Animal" href="http://example.com/lacie" id="link2">Cat</a>,
# <a class="Animal" href="http://example.com/tillie" id="link3">Horse</a>]
Documentation
    for a in soup.find_all('a'):
        id = a.get('id')
        value = a.string  # this is a NavigableString
        unicode_value = unicode(value)
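Put together, the loop can be run on the anchors shown above (the unicode() call is Python-2-only and is dropped here; a.string is already text in Python 3):

```python
from bs4 import BeautifulSoup

html_doc = ('<a class="Animal" href="http://example.com/elsie" id="link1">Dog</a>'
            '<a class="Animal" href="http://example.com/lacie" id="link2">Cat</a>'
            '<a class="Animal" href="http://example.com/tillie" id="link3">Horse</a>')
soup = BeautifulSoup(html_doc, 'html.parser')
for a in soup.find_all('a'):
    # a.get('id') is the id attribute, a.string the embedded text
    print(a.get('id'), a.string)
# link1 Dog
# link2 Cat
# link3 Horse
```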