The page at exam.com shows weather information:
Tokyo: 25°C
I want to use Django 1.1 and lxml to read this information from the website. I only need the value "25".
The HTML structure of exam.com is as follows:
<p id="resultWeather">
<b>Weather</b>
Tokyo:
<b>25</b>°C
</p>
I'm a student doing a small project with my friends, so please explain it in a way that is easy to understand. Thank you very much!
BeautifulSoup is more suitable than lxml for this kind of HTML parsing.
Something like this can be helpful:
def get_weather():
    import urllib
    from BeautifulSoup import BeautifulSoup
    data = urllib.urlopen('http://exam.com/').read()
    soup = BeautifulSoup(data)
    return soup.find('p', {'id': 'resultWeather'}).findAll('b')[-1].string
This gets the page contents with urllib, parses them with BeautifulSoup, finds the <p> with id=resultWeather, then takes the last <b> inside that <p> and returns its content.
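For readers on Python 3, the same lookup can be sketched with the bs4 package. This is only a minimal sketch, assuming bs4 is installed. The HTML is the snippet from the question pasted in as a string, and `sample_html` and `extract_temperature` are illustrative names, so no network request is made.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; a real page would be downloaded first.
sample_html = """
<p id="resultWeather">
<b>Weather</b>
Tokyo:
<b>25</b>°C
</p>
"""


def extract_temperature(html):
    """Return the text of the last <b> inside <p id="resultWeather">."""
    soup = BeautifulSoup(html, "html.parser")
    paragraph = soup.find("p", id="resultWeather")
    if paragraph is None:
        return None  # the page layout differs from what we assumed
    bold_tags = paragraph.find_all("b")
    if not bold_tags:
        return None
    return bold_tags[-1].get_text(strip=True)


temperature = extract_temperature(sample_html)
print(temperature)
```

Returning None instead of raising keeps a Django view from crashing if the site changes its layout.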
I'm using Django and Python 3.7. I want to speed up my HTML parsing. Currently, I'm looking for three types of elements in my document, like so:
req = urllib2.Request(fullurl, headers=settings.HDR)
html = urllib2.urlopen(req).read()
comments_soup = BeautifulSoup(html, features="html.parser")
score_elts = comments_soup.findAll("div", {"class": "score"})
comments_elts = comments_soup.findAll("a", attrs={'class': 'comments'})
bad_elts = comments_soup.findAll("span", text=re.compile("low score"))
I have read that SoupStrainer is one way to improve performance: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document . However, all the examples only talk about parsing an HTML doc with a single strainer. In my case, I have three. How can I pass three strainers into my parsing, or would that actually give worse performance than just doing it the way I'm doing it now?
I don't think you can pass multiple Strainers into the BeautifulSoup constructor. What you can instead do is to wrap all your conditions into one Strainer and pass it to the BeautifulSoup Constructor.
For simple cases, such as matching only on tag names, you can pass a list into the SoupStrainer:
html="""
<a>yes</a>
<p>yes</p>
<span>no</span>
"""
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
custom_strainer = SoupStrainer(["a","p"])
soup=BeautifulSoup(html, "lxml", parse_only=custom_strainer)
print(soup)
Output
<a>yes</a><p>yes</p>
To specify more complex logic, you can also pass in a custom function (you may have to do this):
html="""
<html class="test">
<a class="wanted">yes</a>
<a class="not-wanted">no</a>
<p>yes</p>
<span>no</span>
</html>
"""
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
def my_function(elem, attrs):
    # attrs.get avoids a KeyError for <a> tags that have no class attribute
    if elem == 'a' and attrs.get('class') == "wanted":
        return True
    elif elem == 'p':
        return True
custom_strainer= SoupStrainer(my_function)
soup=BeautifulSoup(html, "lxml", parse_only=custom_strainer)
print(soup)
Output
<a class="wanted">yes</a><p>yes</p>
As specified in the documentation:
Parsing only part of a document won’t save you much time parsing the
document, but it can save a lot of memory, and it’ll make searching
the document much faster.
I think you should check out the Improving performance section of the documentation.
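Applied to the three searches in the question, one possible approach is to strain on the tag names only and then run the original searches against the smaller tree. This is a sketch under assumptions: the sample markup and the name `only_candidates` are made up for illustration, and the "low score" text test is left to `find_all` because a tag's text is not yet known when the parser decides whether to keep its opening tag.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page fragment standing in for the downloaded HTML.
sample_html = """
<html><body>
<div class="score">42</div>
<div class="other">skip me</div>
<a class="comments" href="/c/1">10 comments</a>
<span>low score</span>
<p>never needed</p>
</body></html>
"""

# One strainer covering every tag name the three searches care about.
only_candidates = SoupStrainer(["div", "a", "span"])
soup = BeautifulSoup(sample_html, "html.parser", parse_only=only_candidates)

# The original three searches now run against the reduced tree.
score_elts = soup.find_all("div", class_="score")
comments_elts = soup.find_all("a", class_="comments")
bad_elts = soup.find_all("span", string=re.compile("low score"))

print(len(score_elts), len(comments_elts), len(bad_elts))
```

Whether this is faster than the unstrained version depends on the page, so it is worth timing both on real documents.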
from bs4 import BeautifulSoup
import re
import HTMLParser
import urllib
url = raw_input('enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
scripts=soup.find_all('script')
for tag in scripts:
    try:
        Script = tag["src"]
        print Script
    except:
        print "No source"
Using this code, I'm not getting all the JavaScript from the HTML document.
I have checked your code and it seems that you are getting all the JavaScript; at least, you are visiting every <script> tag. Some of the JavaScript may be embedded directly in the HTML, in which case the tag has no src attribute and the code sits between the <script>...</script> tags. You can get that embedded JavaScript using tag.contents in your loop.
Furthermore, I would advise specifying a parser explicitly. If you don't, bs4 picks the best parser that happens to be installed (lxml if available, then html5lib, then Python's built-in html.parser), so results can differ between machines. Check out: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
from bs4 import BeautifulSoup
import urllib2
r = urllib2.urlopen('<your url>').read()
soup = BeautifulSoup(r, 'html.parser')
for s in soup.findAll('script'):
    print s.get('src')
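To make the difference between external and embedded scripts concrete, here is a small Python 3 sketch, assuming bs4 is installed. The markup and the names `external_sources` and `inline_code` are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical document with one external and one embedded script.
sample_html = """
<html><head>
<script src="app.js"></script>
<script>alert("inline");</script>
</head></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

external_sources = []  # values of src attributes
inline_code = []       # JavaScript written between the tags

for tag in soup.find_all("script"):
    src = tag.get("src")  # None when the attribute is missing
    if src is not None:
        external_sources.append(src)
    else:
        inline_code.append(tag.string or "")

print(external_sources)
print(inline_code)
```

Using `tag.get("src")` avoids the bare `except` from the question, which would also hide unrelated errors.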
I am trying to extract the name and subheading of this page (for example). I have no problem extracting the name, but extracting the subheading fails. Using inspect element in Chrome, I identified that the subheading text "Canada Census, 1901" is embedded as follows:
<div class="person-info">
<div class="title ng-binding">Helen Brad in household of Geo Wilcock</div>
<div class="subhead ng-scope ng-binding" data-ng-if="!recordPersonCentric">Canada Census, 1901</div>
So I coded my script as follows:
import urllib2
import re
import csv
from bs4 import BeautifulSoup
import time
def get_FamSearch():
    link = "https://example.org/pal:/MM9.1.1/KH11-999"
    openLink = urllib2.urlopen(link)
    Soup_FamSearch = BeautifulSoup(openLink, "html")
    openLink.close()
    NameParentTag = Soup_FamSearch.find("tr", class_="result-item highlight-person")
    if NameParentTag:
        Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
        name_decode = Name.encode("ascii", "ignore")
        print name_decode
    SubheadTag = Soup_FamSearch.find("div", class_="subhead ng-scope ng-binding")
    if SubheadTag:
        print SubheadTag.get_text(strip=True)
get_FamSearch()
This is the result; the script does not locate or extract the subheading:
Helen Brad
[Finished in 2.2s]
The page you are getting via urllib2 doesn't contain the div with the subhead class. The actual heading is constructed asynchronously by JavaScript running in the browser.
The data you need is presented differently, here's what works for me:
print Soup_FamSearch.find('dt', text='Title').find_next_sibling('dd').text.strip()
Prints:
Canada Census, 1901
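The lookup above can be tried offline on a small stand-in for the raw HTML. This is a sketch under assumptions: the `<dl>` markup below is invented to mirror the dt/dd pairing that the one-liner relies on, it is not the real page source, and the `string=` argument needs a reasonably recent bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical static markup, as served before any JavaScript runs.
static_html = """
<dl>
  <dt>Name</dt>
  <dd>Helen Brad</dd>
  <dt>Title</dt>
  <dd> Canada Census, 1901 </dd>
</dl>
"""

soup = BeautifulSoup(static_html, "html.parser")

# Find the <dt> label, then step to the <dd> that follows it.
label = soup.find("dt", string="Title")
subheading = None
if label is not None:
    value_tag = label.find_next_sibling("dd")
    if value_tag is not None:
        subheading = value_tag.get_text(strip=True)

# The div that Chrome's inspector shows is absent from the raw HTML.
missing = soup.find("div", class_="subhead")

print(subheading)
print(missing)
```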
I'm using soup.findAll('table') to try to find the table in an HTML file, but it does not appear.
The table does exist in the file, and with a regex I'm able to locate it this way:
import sys
import urllib2
from bs4 import BeautifulSoup
import re
webpage = open(r'd:\samplefile.html', 'r').read()
soup = BeautifulSoup(webpage)
print re.findall("TABLE",webpage) #works, prints ['TABLE','TABLE']
print soup.findAll("TABLE") # prints an empty list []
I know I am correctly generating the soup since when I do:
print [tag.name for tag in soup.findAll(align=None)]
It correctly prints the tags that it finds. I have also tried different spellings of "TABLE", such as "table", "Table", etc.
Also, if I open the file in a text editor, it does contain "TABLE".
Why doesn't BeautifulSoup find the table?
Context
python 2.x
BeautifulSoup HTML parser
Problem
BeautifulSoup's findAll does not return all the expected tags, or returns none at all, even though the user knows that the tag exists in the markup
Solution
Try specifying the exact parser when calling the BeautifulSoup constructor
## BEFORE
soup = BeautifulSoup(webpage)
## AFTER
soup = BeautifulSoup(webpage, "html5lib")
Rationale
The target markup may include malformed HTML, and different parsers handle it with varying degrees of success.
See also
related post by Martijn that addresses the same issue
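One way to check whether the parser is the culprit is to parse the same markup with each installed parser and compare the results. The following Python 3 sketch assumes bs4 is installed; lxml and html5lib are optional and are skipped if missing. The sample markup and the name `table_counts` are illustrative only.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup with upper-case tags, as in the question's file.
sample_html = "<HTML><BODY><TABLE><TR><TD>cell</TD></TR></TABLE></BODY></HTML>"

table_counts = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(sample_html, parser)
    except FeatureNotFound:
        continue  # this parser is not installed; skip it
    # Search with the lower-case name, since parsed tag names are lower-case.
    table_counts[parser] = len(soup.find_all("table"))

print(table_counts)
```

If the counts differ between parsers on the real file, the markup is probably malformed in a way that only some parsers recover from.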
I am using Beautiful Soup 4 with Python and have got as far as the code below. How can I get the values Dog, Cat and Horse, and also the ids? Please help!
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
soup.find_all('a')
# [<a class="Animal" href="http://example.com/elsie" id="link1">Dog</a>,
# <a class="Animal" href="http://example.com/lacie" id="link2">Cat</a>,
# <a class="Animal" href="http://example.com/tillie" id="link3">Horse</a>]
Documentation
for a in soup.find_all('a'):
    id = a.get('id')
    value = a.string # This is a NavigableString
    unicode_value = unicode(value)
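On Python 3 there is no `unicode` built-in, so the loop can be sketched as below. This assumes bs4 is installed; `html_doc` is rebuilt here from the three links shown in the question, and `animals` is an illustrative name.

```python
from bs4 import BeautifulSoup

# The three links from the question, reconstructed as a string.
html_doc = """
<a class="Animal" href="http://example.com/elsie" id="link1">Dog</a>
<a class="Animal" href="http://example.com/lacie" id="link2">Cat</a>
<a class="Animal" href="http://example.com/tillie" id="link3">Horse</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect (id, text) pairs; get_text returns a plain str.
animals = [
    (a.get("id"), a.get_text(strip=True))
    for a in soup.find_all("a", class_="Animal")
]

print(animals)
```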