Reading a webpage using lxml and xpath - python-2.7

I'm trying to get the latest prices for some of the markets from PredictIt. For example, the "Will Donald Trump win the 2016 Republican presidential nomination?" market found at https://www.predictit.org/contract/838/. I specifically want the text that reads "Latest Price: ??".
Chrome tells me that the xpath is /html/body/div[7]/div/div[2]/div[2]/p[1]/strong/text()
import urllib2
url = 'https://www.predictit.org/Contract/838/'
page = urllib2.urlopen(url)
data = page.read()
from lxml import html
etree = html.fromstring(data)
price = etree.xpath('/html/body/div[7]/div/div[2]/div[2]/p[1]/strong/text()')
Everything seems ok, but
print price
returns an empty list.
Any ideas?

If you can rely on the string 'Latest Price' being inside a <strong> tag, then you could use:
In [305]: root.xpath('//strong[contains(text(), "Latest Price:")]/text()')
Out[305]: ['Latest Price: 34']
Or, perhaps more robustly, you could search all <p> tags and their descendants for text which includes the string 'Latest Price':
In [312]: root.xpath('//p/descendant-or-self::*[contains(text(), "Latest Price")]/text()')
Out[312]: ['Latest Price: 34']
import urllib2
url = 'https://www.predictit.org/Contract/838/'
page = urllib2.urlopen(url)
data = page.read()
import lxml.html as LH
root = LH.fromstring(data)
price = None
for text in root.xpath('//p/descendant-or-self::*[contains(text(), "Latest Price:")]/text()'):
    price = float(text.split(':', 1)[-1])
print(price)
# 35
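The float(text.split(':', 1)[-1]) call above raises ValueError when the text after the colon is not a number. Below is a minimal sketch of a more defensive parser, assuming the text always has the form "Latest Price: <number>". The names PRICE_RE and parse_latest_price are made up for this illustration and use only the standard library.

```python
import re

# Matches "Latest Price: 34" and captures the number (integer or decimal).
PRICE_RE = re.compile(r'Latest Price:\s*(\d+(?:\.\d+)?)')


def parse_latest_price(text):
    """Return the price as a float, or None if the text has no numeric price."""
    match = PRICE_RE.search(text)
    return float(match.group(1)) if match else None


print(parse_latest_price('Latest Price: 34'))   # 34.0
print(parse_latest_price('Latest Price: ??'))   # None
```

Returning None instead of raising lets the caller decide what to do when the page layout changes.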
The XPath /html/body/div[7]/div/div[2]/div[2]/p[1]/strong/text() may be failing because the HTML received from urllib2.urlopen(url).read() can differ from the HTML that Chrome shows. Chrome executes JavaScript, which may change the DOM; urllib2 does not. If you needed the DOM after JavaScript has run, you would need an automated browser such as Selenium instead of urllib2. Happily, in this case the content you are looking for is not supplied by JavaScript. However, an overly specific XPath such as /html/body/div[7]/div/div[2]/div[2]/p[1]/strong/text() can still trip you up.
Using the HTML returned by urllib2, there appear to be only 6 <div> tags directly under <body>:
In [315]: root.xpath('/html/body/div')
Out[315]:
[<Element div at 0x7f0bd63632b8>,
<Element div at 0x7f0bd6363310>,
<Element div at 0x7f0bd6363368>,
<Element div at 0x7f0bd63633c0>,
<Element div at 0x7f0bd6363418>,
<Element div at 0x7f0bd6363470>]
Trying to access the 7th <div> tag yields an empty list:
In [316]: root.xpath('/html/body/div[7]')
Out[316]: []
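One quick way to tell whether the text you want is in the HTML as served, or is filled in later by JavaScript, is to search the raw response for it. The sketch below uses two made-up response strings instead of a live request; served_statically is a hypothetical helper name.

```python
def served_statically(raw_html, needle):
    """True if `needle` is already present in the HTML as served."""
    return needle in raw_html


# Made-up responses: one with the text in the markup, one filled in by a script.
static_page = '<p><strong>Latest Price: 34</strong></p>'
scripted_page = '<div id="price"></div><script>loadPrice()</script>'

print(served_statically(static_page, 'Latest Price'))    # True
print(served_statically(scripted_page, 'Latest Price'))  # False
```

If the check is True, as it is for the PredictIt page according to the answer above, the problem is the XPath and not missing JavaScript.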

Related

get picture from dynamic content python

I'm trying to get the href of the picture from an url without using selenium
def():
    try:
        page = urllib2.urlopen('')
    except httplib.IncompleteRead, e:
        page = e.partial
    response = BeautifulSoup(page)
    print response
    var = response.find("div", {"id":"il_m"}).find('p')
but I got None as a result. What should I do to get the href?
You can also get the link from an anchor tag with a download attribute:
In [2]: from bs4 import BeautifulSoup
In [3]: import urllib2
In [4]: r = urllib2.urlopen('http://icecat.us/index.php/product/image_gallery?num=9010647&id=9409545&lang=us&imgrefurl=philips.com')
In [5]: soup = BeautifulSoup(r,"html.parser")
In [6]: print(soup.select_one("p a[download]")["href"])
http://images.icecat.biz/img/gallery/9010647-Philips-_FP.jpg
You should also take note of the text "Images may be subject to copyright." on the page.
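For comparison, the same idea (pick the anchor that carries a download attribute) can be sketched with only the standard library. The HTML below is made up, DownloadLinkParser is a hypothetical name, and unlike the "p a[download]" selector this version does not check that the anchor sits inside a <p>.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class DownloadLinkParser(HTMLParser):
    """Collect the href of every <a> that carries a download attribute."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # a bare attribute such as `download` maps to None
        if tag == 'a' and 'download' in attrs and 'href' in attrs:
            self.links.append(attrs['href'])


# Made-up markup: one ordinary link and one download link.
sample_html = (
    '<p><a href="http://example.com/page.html">Visit page</a></p>'
    '<p><a href="http://example.com/photo.jpg" download>Download</a></p>'
)

parser = DownloadLinkParser()
parser.feed(sample_html)
print(parser.links)
```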
You're not targeting the right p tag:
First of all, you want to extract the href from <a> node and not <p>
The first <p> child element that is found is this one: <p class="il_r" id="url_domain"></p>
What you could do is to target the 5th <p> element's <a> which is the image. One way of doing this is var = response.find("div", id = "il_m").find_all('p')[4].find('a')

Why can't I extract the subheading of a page using BeautifulSoup?

I am trying to extract the name and subheading of this page (for example). I have no problem extracting the name, but it's unsuccessful for the subheading. Using inspect element in Chrome, I identified that the subheading text "Canada Census, 1901" is embedded as follows:
<div class="person-info">
<div class="title ng-binding">Helen Brad in household of Geo Wilcock</div>
<div class="subhead ng-scope ng-binding" data-ng-if="!recordPersonCentric">Canada Census, 1901</div>
So I coded my script as follows:
import urllib2
import re
import csv
from bs4 import BeautifulSoup
import time
def get_FamSearch():
    link = "https://example.org/pal:/MM9.1.1/KH11-999"
    openLink = urllib2.urlopen(link)
    Soup_FamSearch = BeautifulSoup(openLink, "html")
    openLink.close()
    NameParentTag = Soup_FamSearch.find("tr", class_="result-item highlight-person")
    if NameParentTag:
        Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
        name_decode = Name.encode("ascii", "ignore")
        print name_decode
    SubheadTag = Soup_FamSearch.find("div", class_="subhead ng-scope ng-binding")
    if SubheadTag:
        print SubheadTag.get_text(strip=True)
get_FamSearch()
These are the results; the subheading is not located or extracted:
Helen Brad
[Finished in 2.2s]
The page you are getting via urllib2 doesn't contain the div with the subhead class. The actual heading is constructed asynchronously by JavaScript executed in the browser.
The data you need is presented differently, here's what works for me:
print Soup_FamSearch.find('dt', text='Title').find_next_sibling('dd').text.strip()
Prints:
Canada Census, 1901
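The lookup above relies on the title being stored as a <dt>Title</dt> term followed by a <dd> value. As a rough illustration of that pairing, here is a standard-library sketch that collects every dt/dd pair into a dict. The sample markup is made up (the real page layout is not shown here) and DefinitionListParser is a hypothetical name.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class DefinitionListParser(HTMLParser):
    """Map the text of each <dt> to the text of the <dd> that follows it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.pairs = {}
        self._current_tag = None  # 'dt' or 'dd' while inside one of them
        self._last_term = None    # text of the most recent <dt>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ('dt', 'dd'):
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current_tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return
        text = ''.join(self._buffer).strip()
        if tag == 'dt':
            self._last_term = text
        elif self._last_term is not None:
            self.pairs[self._last_term] = text
        self._current_tag = None


# Made-up markup standing in for the page's definition list.
sample_html = """
<dl>
  <dt>Title</dt><dd> Canada Census, 1901 </dd>
  <dt>Name</dt><dd>Helen Brad</dd>
</dl>
"""

parser = DefinitionListParser()
parser.feed(sample_html)
print(parser.pairs.get('Title'))
```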

django xml parsing -parse img src attribute inside the xml tag

I need the URL inside the description tag of RSS file. I am trying to parse the images in the following link.
"ibnlive.in.com/ibnrss/rss/shows/worldview.xml"
I need the image link in that. I am using urllib and beautiful soup to parse details.
I am trying to parse the title,description,link and images inside the item tag. I can parse the title, description and link. But I can't parse image inside the description tag.
XML:
<item>
<title>World View: US shutdown ends, is the relief only temporary?</title>
<link>http://ibnlive.in.com/videos/429157/world-view-us-shutdown-ends-is-the-relief-only-temporary.html</link>
<description>&lt;img src='http://static.ibnlive.in.com/ibnlive/pix/sitepix/10_2013/worldview_1810a_90x62.jpg' width='90' height='62'&gt;The US Senate overwhelmingly approved a deal on Wednesday to end a political crisis that partially shut down the federal government and brought the world's biggest economy to the edge of a debt default that could have threatened financial calamity.</description>
<pubDate>Fri, 18 Oct 2013 09:34:32 +0530</pubDate>
<guid>http://ibnlive.in.com/videos/429157/world-view-us-shutdown-ends-is-the-relief-only-temporary.html</guid>
<copyright>IBNLive</copyright>
<language>en-us</language>
</item>
views.py
from django.conf import settings
from django.shortcuts import render
from django.http import HttpResponse
from django.utils.html import strip_tags
from os.path import basename, splitext
import os
import urllib
from bs4 import BeautifulSoup
def international(request):
    arr=[]
    #asianage,oneinindia-papers
    a=["http://news.oneindia.in/rss/news-international-fb.xml","http://www.asianage.com/rss/37"]
    for i in a:
        source_txt=urllib.urlopen(i)
        b=BeautifulSoup(source_txt.read())
        for q in b.findAll('item'):
            d={}
            d['desc']=strip_tags(q.description.string).strip('&nbsp')
            if q.guid:
                d['link']=q.guid.string
            else:
                d['link']=strip_tags(q.comments)
            d['title']=q.title.string
            for r in q.findAll('description'):
                d['image']=r['src']
            arr.append(d)
    return render(request,'feedpars.html',{'arr':arr})
HTML
<html>
<head></head>
<body>
{% for i in arr %}
<p>{{i.title}}</p>
<p>{{i.desc}}</p>
<p>{{i.guid}}</p>
<img src="{{i.image}}" style="width:100px;height:100px;"><hr>
{% endfor %}
</body>
</html>
Nothing gets displayed in my output.
1/ As I already told you in "How to get the url of image in descripttion tag of xml file while parsing?": this is an RSS feed, so use the appropriate tool: https://pypi.python.org/pypi/feedparser
2/ There is no proper "img" tag in the description; the HTML markup has been entity-encoded. To get the URL, either decode the description's content (to get the tag back) and pass the resulting HTML fragment to your HTML parser, or, since it will probably not be as complex as a full HTML document, just use a plain regexp on the decoded content.
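A minimal sketch of the second suggestion, assuming the description arrives entity-encoded as described: decode it with the standard library, then pull the src out with a regular expression. The helper name first_image_url and the pattern IMG_SRC_RE are made up for this example, and the regexp only covers simple, quoted src attributes.

```python
import re

try:
    from html import unescape  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape

# Captures the quoted value of the first src attribute inside an <img> tag.
IMG_SRC_RE = re.compile(r"""<img[^>]+src\s*=\s*['"]([^'"]+)['"]""", re.IGNORECASE)


def first_image_url(description):
    """Decode an entity-encoded description and return its first image URL, or None."""
    decoded = unescape(description)
    match = IMG_SRC_RE.search(decoded)
    return match.group(1) if match else None


# Shortened copy of the description from the feed item, in its encoded form.
sample = ("&lt;img src='http://static.ibnlive.in.com/ibnlive/pix/sitepix/10_2013/"
          "worldview_1810a_90x62.jpg' width='90' height='62'&gt;"
          "The US Senate overwhelmingly approved a deal on Wednesday")

print(first_image_url(sample))
```

In the view, the result could be stored as d['image'] in place of r['src'], which fails because the <description> element itself has no src attribute.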

How to get large amounts of href links of very large contents of website with Beautifulsoup

I am parsing a large HTML page that has over 1000 href links. I am using BeautifulSoup to get all the links, but when I run the program again to find all of the 'td' tags, BeautifulSoup cannot handle it. How can I overcome this problem? I can load the HTML page with urllib, but not all the links are printed. When I use find to get a single 'td' tag, it works.
Tag = self.__Page.find('table', {'class':'RSLTS'}).findAll('td')
print Tag
for a in Tag.find('a', href= True):
    print "found", a['href']
It works when I fetch a single 'td' with find:
Tag = self.__Page.find('table', {'class':'RSLTS'}).find('td')
print Tag
for a in Tag.find('a', href= True):
    print "found", a['href']
You need to iterate over them:
tds = self.__Page.find('table', class_='RSLTS').find_all('td')
for td in tds:
    a = td.find('a', href=True)
    if a:
        print "found", a['href']
Although I'd just use lxml if you have a ton of stuff:
root.xpath('//table[contains(@class, "RSLTS")]//td/a/@href')
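To illustrate the path-based approach without installing anything, here is a sketch that uses the standard library's xml.etree.ElementTree on a small, made-up, well-formed snippet. ElementTree supports only a subset of XPath and will not parse messy real-world HTML, so treat it as a stand-in for lxml, not a replacement.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for the real page.
sample = """
<html><body>
<table class="RSLTS">
  <tr><td><a href="/item/1">One</a></td><td>no link here</td></tr>
  <tr><td><a href="/item/2">Two</a></td></tr>
</table>
</body></html>
"""

root = ET.fromstring(sample)
# //td/a visits every <td> under the table, not just the first one.
hrefs = [a.get('href')
         for a in root.findall(".//table[@class='RSLTS']//td/a")
         if a.get('href')]
print(hrefs)
```

The path does the iteration over every cell for you, which is the step that calling find on the findAll result skipped.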

Issue with html tags while scraping data using beautiful soup

Common piece of code:
# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from itertools import islice
page = urllib2.urlopen('http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html').read()
soup = BeautifulSoup(page)
prices = soup.findAll('div', {"class": "price"})
After this I tried the following snippets to get the data:
Code 1:
for price in prices:
    print unicode(price.string).encode('utf8')
Output1: No output; the code runs without any error and prints nothing.
Code 2:
for price in prices:
    textcontent3= u' '.join(price.stripped_strings)
    if textcontent3:
        print textcontent3
Output2: No output again, same situation as in Output1.
Code 3:
for price in prices:
    fonttag = price.find('div')
    if fonttag is not None:
        print unicode(fonttag.string).encode('utf8').strip()
Output3: No output, same as in Output1.
After this I tried printing the relevant part of the HTML:
Code 4:
print prices
Output4:
</span></div>, <div class="price">
<span id="price"><br/>
</span></div>, <div class="price">
<span id="price"><br/>
</span></div>]
As can be seen from Output4, the HTML that Beautiful Soup receives contains no price value, while on the web page the same structure looks like this:
<div class="price"><span id="price">49,90 €</span><br>einmalig</div>
Beautiful Soup is not extracting the price values shown on the HTML page, so I am not able to scrape the price data.
Please help me solve this issue, and pardon my ignorance as I am new to programming.
The page uses a large JavaScript structure to load the prices. You can load just that structure:
scripts = soup.find_all('script')
script = next(s.text for s in scripts if s.string and 'window.rates' in s.string)
datastring = script.split('phones=')[1].split(';window.')[0]
This results in a large JavaScript structure, starting with:
{sku844082:{name:"Samsung Galaxy SII",image:"/images/m677391_300468.jpg",deliveryTime:"Vorauss. verfügbar ab Anfang Januar",sku1444291:{p:"prod954312",e:"19.90"},sku1444286:{p:"prod954312",e:"19.90"},sku1444283:{p:"prod954312",e:"39.90"},sku1444275:{p:"prod954312",e:"59.90"},sku1104261:{p:"prod954312",e:"99.90"}},sku894279:{name:"BlackBerry Torch 9810",image:"/images/m727477_300464.jpg",deliveryTime:"Lieferbar innerhalb 48 Stunden",sku1444275:{p:"prod1004495",e:"179.90"},sku1104261:{p:"prod1004495",e:"259.90"},sku1444291:{p:"prod1004495",e:"29.90"},sku1444286:{p:"prod1004495",e:"29.90"},sku1444283:{p:"prod1004495",e:"49.90"}},sku864221:{name:"BlackBerry Bold 9900",image:"/images/m707491_300465.jpg",deliveryTime:"Lieferbar innerhalb 48 Stunden",sku1444275:{p:"prod974431",e:"129.90"},sku1104261:{p:"prod974431",e:"169.90"},sku1444291:{p:"prod974431",e:"49.90"},sku1444286:{p:"prod974431",e:"49.90"},sku1444283:{p:"prod974431",e:"89.90"}}
Unfortunately, that's not directly loadable with the json module; although it is valid JavaScript, without quoting around the keys it is not valid JSON. You'd need to use regular expressions to clean that up further, or grab the e:"someprice" information directly from that string.
Luckily the structure can be fixed with a small amount of regular expression magic:
import re
import json
datastring = re.sub(ur'([{,])([a-z]\w*):', ur'\1"\2":', datastring)
data = json.loads(datastring)
This gives you a large dictionary, with SKU keys and dictionaries with nested dicts as data, including nested SKUs with p product codes and e prices:
>>> from pprint import pprint
>>> pprint(data['sku864221'])
{u'deliveryTime': u'Lieferbar innerhalb 48 Stunden',
u'image': u'/images/m707491_300465.jpg',
u'name': u'BlackBerry Bold 9900',
u'sku1104261': {u'e': u'169.90', u'p': u'prod974431'},
u'sku1444275': {u'e': u'129.90', u'p': u'prod974431'},
u'sku1444283': {u'e': u'89.90', u'p': u'prod974431'},
u'sku1444286': {u'e': u'49.90', u'p': u'prod974431'},
u'sku1444291': {u'e': u'49.90', u'p': u'prod974431'}}
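Putting the steps together, the sketch below runs the same split, key-quoting and json.loads sequence on a made-up, shortened script string and then collects the e prices per phone. The script text, including the window.phones name and the trailing window.other, is an assumption for illustration; the real page's script is much larger. The patterns use r'' strings so the example also runs on Python 3.

```python
import json
import re

# Made-up, shortened stand-in for the text of the <script> element.
script = ('window.rates={};window.phones={sku864221:{name:"BlackBerry Bold 9900",'
          'sku1104261:{p:"prod974431",e:"169.90"},'
          'sku1444291:{p:"prod974431",e:"49.90"}}};window.other=1;')

# Cut out the object literal, then quote the bare keys so it becomes JSON.
datastring = script.split('phones=')[1].split(';window.')[0]
datastring = re.sub(r'([{,])([a-z]\w*):', r'\1"\2":', datastring)
data = json.loads(datastring)

# For each phone, map every nested SKU to its price (the "e" field) as a float.
prices = {}
for phone in data.values():
    prices[phone['name']] = {
        sku: float(entry['e'])
        for sku, entry in phone.items()
        if isinstance(entry, dict) and 'e' in entry
    }

print(prices)
```

The isinstance check skips the plain string fields such as name, image and deliveryTime, so only the nested SKU dictionaries contribute prices.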