Why can't I extract the subheading of a page using BeautifulSoup? - python-2.7

I am trying to extract the name and subheading of this page (for example). I have no problem extracting the name, but it's unsuccessful for the subheading. Using inspect element in Chrome, I identified that the subheading text "Canada Census, 1901" is embedded as follows:
<div class="person-info">
<div class="title ng-binding">Helen Brad in household of Geo Wilcock</div>
<div class="subhead ng-scope ng-binding" data-ng-if="!recordPersonCentric">Canada Census, 1901</div>
So I coded my script as follows:
import urllib2
import re
import csv
from bs4 import BeautifulSoup
import time
def get_FamSearch():
    link = "https://example.org/pal:/MM9.1.1/KH11-999"
    openLink = urllib2.urlopen(link)
    Soup_FamSearch = BeautifulSoup(openLink, "html")
    openLink.close()
    NameParentTag = Soup_FamSearch.find("tr", class_="result-item highlight-person")
    if NameParentTag:
        Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
        name_decode = Name.encode("ascii", "ignore")
        print name_decode
    SubheadTag = Soup_FamSearch.find("div", class_="subhead ng-scope ng-binding")
    if SubheadTag:
        print SubheadTag.get_text(strip=True)
get_FamSearch()
This is the result; the script does not locate or extract the subheading:
Helen Brad
[Finished in 2.2s]

The page you get via urllib2 doesn't contain the div with the subhead class. The actual heading is built asynchronously by JavaScript running in the browser.
The data you need is present in a different form. Here's what works for me:
print Soup_FamSearch.find('dt', text='Title').find_next_sibling('dd').text.strip()
Prints:
Canada Census, 1901

Related

get picture from dynamic content python

I'm trying to get the href of the picture from a URL without using Selenium:
def():
    try:
        page = urllib2.urlopen('')
    except httplib.IncompleteRead, e:
        page = e.partial
    response = BeautifulSoup(page)
    print response
    var = response.find("div", {"id":"il_m"}).find('p')
But I got None as a result. What should I do to get the href?
You can also get the link from an anchor tag with a download attribute:
In [2]: from bs4 import BeautifulSoup
In [3]: import urllib2
In [4]: r = urllib2.urlopen('http://icecat.us/index.php/product/image_gallery?num=9010647&id=9409545&lang=us&imgrefurl=philips.com')
In [5]: soup = BeautifulSoup(r,"html.parser")
In [6]: print(soup.select_one("p a[download]")["href"])
http://images.icecat.biz/img/gallery/9010647-Philips-_FP.jpg
You should also take note of the text "Images may be subject to copyright." on the page.
You're not targeting the right p tag:
First of all, you want to extract the href from the <a> node, not from the <p>.
The first <p> child element that is found is this one: <p class="il_r" id="url_domain"></p>
What you could do is target the <a> inside the 5th <p> element, which is the image. One way of doing this is var = response.find("div", id = "il_m").find_all('p')[4].find('a')

django xml parsing -parse img src attribute inside the xml tag

I need the URL inside the description tag of an RSS file. I am trying to parse the images in the following feed:
"ibnlive.in.com/ibnrss/rss/shows/worldview.xml"
I need the image link in it. I am using urllib and Beautiful Soup to parse the details.
I am trying to parse the title, description, link and image inside each item tag. I can parse the title, description and link, but I can't parse the image inside the description tag.
XML:
<item>
<title>World View: US shutdown ends, is the relief only temporary?</title>
<link>http://ibnlive.in.com/videos/429157/world-view-us-shutdown-ends-is-the-relief-only-temporary.html</link>
<description><img src='http://static.ibnlive.in.com/ibnlive/pix/sitepix/10_2013/worldview_1810a_90x62.jpg' width='90' height='62'>The US Senate overwhelmingly approved a deal on Wednesday to end a political crisis that partially shut down the federal government and brought the world's biggest economy to the edge of a debt default that could have threatened financial calamity.</description>
<pubDate>Fri, 18 Oct 2013 09:34:32 +0530</pubDate>
<guid>http://ibnlive.in.com/videos/429157/world-view-us-shutdown-ends-is-the-relief-only-temporary.html</guid>
<copyright>IBNLive</copyright>
<language>en-us</language>
</item>
views.py
from django.conf import settings
from django.shortcuts import render
from django.http import HttpResponse
from django.utils.html import strip_tags
from os.path import basename, splitext
import os
import urllib
from bs4 import BeautifulSoup
def international(request):
    arr=[]
    #asianage,oneinindia-papers
    a=["http://news.oneindia.in/rss/news-international-fb.xml","http://www.asianage.com/rss/37"]
    for i in a:
        source_txt=urllib.urlopen(i)
        b=BeautifulSoup(source_txt.read())
        for q in b.findAll('item'):
            d={}
            d['desc']=strip_tags(q.description.string).strip('&nbsp')
            if q.guid:
                d['link']=q.guid.string
            else:
                d['link']=strip_tags(q.comments)
            d['title']=q.title.string
            for r in q.findAll('description'):
                d['image']=r['src']
            arr.append(d)
    return render(request,'feedpars.html',{'arr':arr})
HTML
<html>
<head></head>
<body>
{% for i in arr %}
<p>{{i.title}}</p>
<p>{{i.desc}}</p>
<p>{{i.guid}}</p>
<img src="{{i.image}}" style="width:100px;height:100px;"><hr>
{% endfor %}
</body>
</html>
Nothing gets displayed in my output.
1/ As I already told you in "How to get the url of image in descripttion tag of xml file while parsing?": this is an RSS feed, so use the appropriate tool: https://pypi.python.org/pypi/feedparser
2/ There's no proper "img" tag in the description; the HTML markup has been entity-encoded. To get the URL, you have to either decode the description's content (to get the tag back) and pass the resulting HTML fragment to your HTML parser or, since it will probably not be as complex as a full HTML document, just use a plain regexp on the encoded content.
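The decode-then-match route from point 2 could look roughly like the sketch below. It is written for Python 3 with only the standard library (the question itself uses Python 2), and the sample string is an assumption about how the entity-encoded description arrives in the raw feed. The names IMG_SRC, first_image_url and text_without_tags are made up for illustration.

```python
import re
from html import unescape

# Assumed raw form of a <description> body: the <img> markup is entity-encoded.
# The text after the tag is shortened for this example.
encoded_description = (
    "&lt;img src='http://static.ibnlive.in.com/ibnlive/pix/sitepix/10_2013/"
    "worldview_1810a_90x62.jpg' width='90' height='62'&gt;"
    "The US Senate overwhelmingly approved a deal on Wednesday."
)

# Matches the src attribute of the first <img> tag, with single or double quotes.
IMG_SRC = re.compile(r"""<img\b[^>]*?\bsrc\s*=\s*['"]([^'"]+)['"]""", re.IGNORECASE)


def first_image_url(description):
    """Decode the entities, then return the first <img> src, or None."""
    match = IMG_SRC.search(unescape(description))
    return match.group(1) if match else None


def text_without_tags(description):
    """Decode the entities and drop any tags, leaving only the prose."""
    return re.sub(r"<[^>]+>", "", unescape(description)).strip()


image_url = first_image_url(encoded_description)
summary = text_without_tags(encoded_description)
print(image_url)
print(summary)
```

A regular expression is only reasonable here because the fragment is tiny and predictable; for anything more complex, hand the decoded fragment to a real HTML parser instead.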

Issue with html tags while scraping data using beautiful soup

Common piece of code:
# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from itertools import islice
page = urllib2.urlopen('http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html').read()
soup = BeautifulSoup(page)
prices = soup.findAll('div', {"class": "price"})
After this I tried the following code to get the data:
Code 1:
for price in prices:
    print unicode(price.string).encode('utf8')
Output1: No Output, code runs without any error and prints nothing.
Code 2:
for price in prices:
    textcontent3= u' '.join(price.stripped_strings)
    if textcontent3:
        print textcontent3
Output2: No output again, same situation as in Output1.
Code 3:
for price in prices:
    fonttag = price.find('div')
    if fonttag is not None:
        print unicode(fonttag.string).encode('utf8').strip()
Output3: No output, same as in Output1
After this I tried printing the concerned part of the html:
Code 4:
print prices
Output4:
</span></div>, <div class="price">
<span id="price"><br/>
</span></div>, <div class="price">
<span id="price"><br/>
</span></div>]
As Output4 shows, the HTML that Beautiful Soup scrapes for me contains no price value, while on the web page the same structure looks like this:
<div class="price"><span id="price">49,90 €</span><br>einmalig</div>
Beautiful Soup is not extracting the price values shown in the page's HTML, so I am not able to scrape the price data.
Please help me solve this issue, and pardon my ignorance as I am new to programming.
The page uses a large JavaScript structure to load the prices. You can load just that structure:
scripts = soup.find_all('script')
script = next(s.text for s in scripts if s.string and 'window.rates' in s.string)
datastring = script.split('phones=')[1].split(';window.')[0]
This results in a large JavaScript structure, starting with:
{sku844082:{name:"Samsung Galaxy SII",image:"/images/m677391_300468.jpg",deliveryTime:"Vorauss. verfügbar ab Anfang Januar",sku1444291:{p:"prod954312",e:"19.90"},sku1444286:{p:"prod954312",e:"19.90"},sku1444283:{p:"prod954312",e:"39.90"},sku1444275:{p:"prod954312",e:"59.90"},sku1104261:{p:"prod954312",e:"99.90"}},sku894279:{name:"BlackBerry Torch 9810",image:"/images/m727477_300464.jpg",deliveryTime:"Lieferbar innerhalb 48 Stunden",sku1444275:{p:"prod1004495",e:"179.90"},sku1104261:{p:"prod1004495",e:"259.90"},sku1444291:{p:"prod1004495",e:"29.90"},sku1444286:{p:"prod1004495",e:"29.90"},sku1444283:{p:"prod1004495",e:"49.90"}},sku864221:{name:"BlackBerry Bold 9900",image:"/images/m707491_300465.jpg",deliveryTime:"Lieferbar innerhalb 48 Stunden",sku1444275:{p:"prod974431",e:"129.90"},sku1104261:{p:"prod974431",e:"169.90"},sku1444291:{p:"prod974431",e:"49.90"},sku1444286:{p:"prod974431",e:"49.90"},sku1444283:{p:"prod974431",e:"89.90"}}
Unfortunately, that's not directly loadable with the json module; although it is valid JavaScript, the keys are unquoted, so it is not valid JSON. You'd need to use regular expressions to clean that up further, or grab the e:"someprice" information directly from that string.
Luckily the structure can be fixed with a small amount of regular expression magic:
import re
import json
datastring = re.sub(ur'([{,])([a-z]\w*):', ur'\1"\2":', datastring)
data = json.loads(datastring)
This gives you a large dictionary, with SKU keys and dictionaries with nested dicts as data, including nested SKUs with p product codes and e prices:
>>> from pprint import pprint
>>> pprint(data['sku864221'])
{u'deliveryTime': u'Lieferbar innerhalb 48 Stunden',
u'image': u'/images/m707491_300465.jpg',
u'name': u'BlackBerry Bold 9900',
u'sku1104261': {u'e': u'169.90', u'p': u'prod974431'},
u'sku1444275': {u'e': u'129.90', u'p': u'prod974431'},
u'sku1444283': {u'e': u'89.90', u'p': u'prod974431'},
u'sku1444286': {u'e': u'49.90', u'p': u'prod974431'},
u'sku1444291': {u'e': u'49.90', u'p': u'prod974431'}}
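For readers on Python 3, the same key-quoting idea can be sketched end to end as follows. The sample string is a shortened, hand-made excerpt in the style of the page data, and quote_keys and prices_for are hypothetical helper names. The substitution assumes that no string value contains a comma or brace directly followed by a word and a colon; if one did, it would be mangled.

```python
import json
import re

# Shortened sample in the same unquoted-key style as the structure shown above.
sample = (
    '{sku864221:{name:"BlackBerry Bold 9900",'
    'deliveryTime:"Lieferbar innerhalb 48 Stunden",'
    'sku1444275:{p:"prod974431",e:"129.90"},'
    'sku1104261:{p:"prod974431",e:"169.90"}}}'
)


def quote_keys(js_object):
    """Put double quotes around bare keys that follow '{' or ','."""
    return re.sub(r'([{,])([a-z]\w*):', r'\1"\2":', js_object)


def prices_for(phone):
    """Map each nested SKU to its 'e' price as a float, skipping plain fields."""
    return {
        sku: float(entry["e"])
        for sku, entry in phone.items()
        if isinstance(entry, dict)
    }


data = json.loads(quote_keys(sample))
bold_prices = prices_for(data["sku864221"])
print(bold_prices)
```

Plain string fields such as name and deliveryTime are skipped by the isinstance check, so only the nested SKU entries contribute prices.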

Using Beautiful Soup4 can I get the <a> embedded text?

I am using Beautiful Soup 4 with Python and have got as far as the code below. How can I get the values Dog, Cat and Horse, and also the ids? Please help!
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
soup.find_all('a')
# [<a class="Animal" href="http://example.com/elsie" id="link1">Dog</a>,
# <a class="Animal" href="http://example.com/lacie" id="link2">Cat</a>,
# <a class="Animal" href="http://example.com/tillie" id="link3">Horse</a>]
Documentation
for a in soup.find_all('a'):
    id = a.get('id')
    value = a.string # This is a NavigableString
    unicode_value = unicode(value)

How to use lxml to get a message from a website?

The site exam.com shows a note about the weather:
Tokyo: 25°C
I want to use Django 1.1 and lxml to get this information from the website. I only want the value "25".
The HTML structure of exam.com is as follows:
<p id="resultWeather">
<b>Weather</b>
Tokyo:
<b>25</b>°C
</p>
I'm a student doing a small project with my friends. Please explain in a way that is easy to understand. Thank you very much!
BeautifulSoup is more suitable for html parsing than lxml.
Something like this can be helpful:
def get_weather():
    import urllib
    from BeautifulSoup import BeautifulSoup
    data = urllib.urlopen('http://exam.com/').read()
    soup = BeautifulSoup(data)
    return soup.find('p', {'id': 'resultWeather'}).findAll('b')[-1].string
In short: get the page contents with urllib, parse them with BeautifulSoup, find the P with id=resultWeather, find the last B in that P and get its content.
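If installing a third-party parser is not an option, the same steps (find the P with the given id, take the last B inside it) can be sketched with Python 3's standard html.parser module. This swaps in a different technique from the BeautifulSoup answer above and runs on a copy of the markup from the question rather than a live page; the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class BoldInParagraph(HTMLParser):
    """Collect the text of every <b> inside the <p> with a given id."""

    def __init__(self, paragraph_id):
        super().__init__()
        self.paragraph_id = paragraph_id
        self.in_paragraph = False
        self.in_bold = False
        self.bold_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("id") == self.paragraph_id:
            self.in_paragraph = True
        elif tag == "b" and self.in_paragraph:
            self.in_bold = True
            self.bold_texts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False
        elif tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        # Only text inside a tracked <b> is kept.
        if self.in_bold:
            self.bold_texts[-1] += data


def last_bold_text(html, paragraph_id):
    """Return the text of the last <b> in the target paragraph, or None."""
    parser = BoldInParagraph(paragraph_id)
    parser.feed(html)
    parser.close()
    return parser.bold_texts[-1].strip() if parser.bold_texts else None


# Markup copied from the question.
sample_html = """<p id="resultWeather">
<b>Weather</b>
Tokyo:
<b>25</b>°C
</p>"""

temperature = last_bold_text(sample_html, "resultWeather")
print(temperature)
```

The event-driven parser needs more bookkeeping than a tree-based one, which is the main reason the answer above reaches for BeautifulSoup.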