Extract the HTML from between two HTML tags in BeautifulSoup 4.6 - python-2.7

I want to get the HTML between two tags with bs4. Is there a way to do JavaScript's .innerHTML in Beautiful Soup?
This is code that finds a span with the class "title" and gets the text out of it:
def get_title(soup):
    title = soup.find('span', {'class': 'title'})
    return title.text.encode('utf-8')
This function returns the text of the span with the subscript tags stripped: 'Title about H2O and CO2'
The following code is the result of title = soup.find('span', {'class': 'title'}):
<span class="title">Title about H<sub>2</sub>O and CO<sub>2</sub></span>
How do I get the result without the original span?
Desired result: 'Title about H<sub>2</sub>O and CO<sub>2</sub>'

After finding out that JavaScript has .innerHTML, I was able to google the way to do it in Beautiful Soup. I found the answer in this question.
After selecting the element with BS4, you can use .decode_contents(formatter="html") to get the innerHTML:
element.decode_contents(formatter="html")
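A minimal sketch, using the HTML snippet from the question:

```python
from bs4 import BeautifulSoup

html = '<span class="title">Title about H<sub>2</sub>O and CO<sub>2</sub></span>'
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('span', {'class': 'title'})

# .decode_contents() returns the inner HTML as a string,
# keeping the <sub> tags but dropping the outer <span>.
inner = title.decode_contents()
print(inner)  # Title about H<sub>2</sub>O and CO<sub>2</sub>
```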

Related

Cleaning text after beautiful soup removing specific patterns

The ultimate goal is to have a clean plain text for voice processing. That means I need to remove sub-headers, links, bullet points etc. The code below shows steps I have taken to clean one example url bit by bit. I'm stuck now with two things which are common and always have the same structure.
'By Name of correspondent, city'
'Read more: link'
I'm not good at regex but I think it might help removing these two parts. Or maybe someone could suggest another way of dealing with these patterns. Thanks!
My code:
import requests
from bs4 import BeautifulSoup
import translitcodec
import codecs
def get_text(url):
    page_class = 'story-body__inner'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # remove unwanted parts by class
    try:
        soup.find('div', class_='social-embed-post social-embed-twitter').decompose()
        soup.find('div', class_='social-embed').decompose()
        soup.find('a', class_='off-screen jump-link').decompose()
        soup.find('p', class_='off-screen').decompose()
        soup.find('a', class_='embed-report-link').decompose()
        soup.find('a', class_='story-body__link').decompose()
    except AttributeError:  # find() returns None when a class is absent
        pass
    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style', 'table', 'ul', 'h2', 'blockquote']):
        s.decompose()
    # use separator to separate paragraphs and subtitles!
    article_soup = [s.get_text(separator="\n", strip=True)
                    for s in soup.find_all('div', {'class': page_class})]
    text = '\n'.join(article_soup)
    text = codecs.encode(text, 'translit/one').encode('ascii', 'replace')  # replace translit with ascii
    text = u"{}".format(text)  # coerce to unicode
    print text
    return text

url = 'http://www.bbc.co.uk/news/world-us-canada-41724827'
get_text(url)
You don't need regex for this.
Since you only want the main content of the news article (not even headings, since you removed the h2 tags in your code), it's much easier to find all the p elements first and then filter out the items you don't need.
The three things you want to remove are:
Newsreader's details: These are contained within strong tags inside the paragraphs. As far as I've seen, there are no other paragraphs containing strong elements.
Citations to other articles: those beginning with "Read more: " followed by a link. Luckily, there's a fixed string before the a element inside paragraphs like this. So you don't need regex. You can simply find using p.find(text='Read more: ').
Text from Twitter posts: These don't appear in a web browser. After each Twitter image embedded in the page, there's a p element that contains the text "End of Twitter post by #some_twitter_id". You don't want this, obviously.
Edit:
The main news content can be found in a single div with a class value of story-body__inner.
I've updated the code to fix the issue of paragraphs containing links not being printed. The 'and' inside the second condition had to be replaced with 'or'. I've also added the condition 'and not p.has_attr('dir')', since the paragraphs containing Twitter posts have a dir attribute in them.
paragraphs = soup.find('div', {'class': 'story-body__inner'}).findAll('p')
for p in paragraphs:
    if p.find('strong') is None \
            and (p.find(text='Read more: ') is None or p.find('a') is None) \
            and not (p.has_attr('class') and 'off-screen' in p['class']) \
            and not p.has_attr('dir'):
        print(p.text.strip())
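The filtering logic can be checked in isolation on a made-up snippet that mimics the article structure described above (the paragraph contents here are invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented HTML mimicking the BBC article structure discussed above.
html = """
<div class="story-body__inner">
  <p><strong>By Jane Doe, London</strong></p>
  <p>First real paragraph.</p>
  <p>Read more: <a href="#">related story</a></p>
  <p dir="ltr">End of Twitter post by @someone</p>
  <p>Second real paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

kept = []
for p in soup.find('div', {'class': 'story-body__inner'}).find_all('p'):
    if p.find('strong') is None \
            and (p.find(text='Read more: ') is None or p.find('a') is None) \
            and not p.has_attr('dir'):
        kept.append(p.text.strip())

print(kept)  # ['First real paragraph.', 'Second real paragraph.']
```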

BeautifulSoup: Get generic tags from a specific class only

I get all the text I want from an HTML file when I use beautifulsoup like this:
category = soup.find_all("ol", {"class":"breadcrumb"})
catname = BeautifulSoup(str(category).strip()).get_text().encode("utf-8")
Output:
Home
Digital Goods
E-Books
BUT I want to skip the first category, i.e. 'Home'. I know that I can simply replace that word with "", but my question is really about how to get BeautifulSoup to select a very specific tag in the location I have singled out above.
The HTML code looks like this:
<ol class="breadcrumb">
<li>Home</li>
<li>Digital Goods</li>
<li>E-Books</li>
</ol>
Is there anything I can do to get the second and third 'li' tags from this 'breadcrumb' section, and not others in the file?
Example (which does not work but illustrates what I'm looking for):
category = soup.find_all("ol", {"class":"breadcrumb"}), find_all("li")[1:]
what about this:
category = soup.find("ol", {"class":"breadcrumb"}).findAll('li')[1:]
catname = BeautifulSoup(str(category).strip()).get_text().encode("utf-8")
?
My output is then:
[Digital Goods, E-Books]
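A slightly cleaner variant (a sketch, using the breadcrumb HTML from the question) skips the BeautifulSoup-around-str round trip and reads the text of each li directly:

```python
from bs4 import BeautifulSoup

html = """
<ol class="breadcrumb">
  <li>Home</li>
  <li>Digital Goods</li>
  <li>E-Books</li>
</ol>
"""
soup = BeautifulSoup(html, 'html.parser')

# Slice off the first li ('Home') and read each remaining item's text.
items = [li.get_text(strip=True)
         for li in soup.find('ol', {'class': 'breadcrumb'}).find_all('li')[1:]]
print(items)  # ['Digital Goods', 'E-Books']
```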

Scrapy Extract number from page text with regex

I have been looking for a few hours on how to search all text on a page and if it matches a regex then extract it. I have my spider set up as follows:
def parse(self, response):
    title = response.xpath('//title/text()').extract()
    units = response.xpath('//body/text()').re(r"Units: (\d)")
    print title, units
I would like to pull out the number after "Units: " on the pages. When I run Scrapy on a page with Units: 351 in the body, I only get the title of the page (with a bunch of escape characters before and after it) and nothing for units.
I am new to Scrapy and have a little Python experience. Any help with how to extract the integer after Units: and remove the extra escape characters ("u'\r\n\t...") from the title would be much appreciated.
EDIT:
As per the comment, here is a partial HTML extract of an example page. Note this could be within different tags aside from the p in this example:
<body>
<div> Some content and multiple Divs here <div>
<h1>This is the count for Dala</h1>
<p><strong>Number of Units:</strong> 801</p>
<p>We will have other content here and more divs beyond</p>
</body>
Based on the answer below this is what got most of the way there. Still working on removing Units: and extra escape characters.
units = response.xpath('string(//body)').re("(Units: [\d]+)")
Try:
response.xpath('string(//body)').re(r"Units:\s*(\d+)")
string(//body) concatenates all the text nodes under body (//body/text() only returns text that is a direct child of body, which is why you got nothing), and the capturing group (\d+) returns just the number, without the "Units: " label.
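Outside Scrapy, the regex part of this can be checked with plain re. The string below is roughly what string(//body) would produce for the snippet in the question, flattened to running text:

```python
import re

# Roughly what string(//body) yields for the question's HTML snippet.
body_text = ("Some content and multiple Divs here "
             "This is the count for Dala "
             "Number of Units: 801 "
             "We will have other content here and more divs beyond")

# \s* tolerates whitespace after the colon; the group captures only digits.
match = re.search(r"Units:\s*(\d+)", body_text)
units = int(match.group(1)) if match else None
print(units)  # 801
```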

Scraperwiki scrape query: using lxml to extract links

I suspect this is a trivial query but hope someone can help me with a query I've got using lxml in a scraper I'm trying to build.
https://scraperwiki.com/scrapers/thisisscraper/
I'm working line by line through tutorial 3 and have got so far with trying to extract the next-page link. I can use cssselect to identify the link, but I can't work out how to isolate just the href attribute rather than the whole anchor tag.
Can anyone help?
def scrape_and_look_for_next_link(url):
    html = scraperwiki.scrape(url)
    print html
    root = lxml.html.fromstring(html)  # turn the HTML into an lxml object
    scrape_page(root)
    next_link = root.cssselect('ol.pagination li a')[-1]
    attribute = lxml.html.tostring(next_link)
    attribute = lxml.html.fromstring(attribute)
    # works up until this point
    attribute = attribute.xpath('/#href')
    attribute = lxml.etree.tostring(attribute)
    print attribute
CSS selectors can select elements that have an href attribute (e.g. a[href]), but they cannot extract the attribute value by themselves.
Once you have the element from cssselect, you can use next_link.get('href') to get the value of the attribute.
link = link.attrib['href']
should work
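A minimal sketch of the fix (the pagination HTML here is invented for illustration; the XPath expression is equivalent to the cssselect('ol.pagination li a') call in the question):

```python
import lxml.html

# Invented HTML mimicking the pagination structure the scraper targets.
html = """
<ol class="pagination">
  <li><a href="/page/1">1</a></li>
  <li><a href="/page/2">2</a></li>
  <li><a href="/page/3">next</a></li>
</ol>
"""
root = lxml.html.fromstring(html)

# cssselect/xpath return Element objects; .get('href') reads the
# attribute value directly, with no tostring/fromstring round trip.
next_link = root.xpath('//ol[@class="pagination"]//li/a')[-1]
href = next_link.get('href')
print(href)  # /page/3
```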

Add a newline after each closing html tag in web2py

Original
I want to parse a string of html code and add newlines after closing tags + after the initial form tag. Here's the code so far. It's giving me an error in the "re.sub" line. I don't understand why the regex fails.
def user():
    tags = "<form><label for=\"email_field\">Email:</label><input type=\"email\" name=\"email_field\"/><label for=\"password_field\">Password:</label><input type=\"password\" name=\"password_field\"/><input type=\"submit\" value=\"Login\"/></form>"
    result = re.sub("(</.*?>)", "\1\n", tags)
    return dict(form_code=result)
PS. I have a feeling this might not be the best way... but I still want to learn how to do this.
EDIT
I was missing "import re" from my default.py. Thanks ruakh for this.
import re
Now my page source code shows up like this (inspected in client browser). The actual page shows the form code as text, not as UI elements.
<form><label for="email_field">Email:</label>
<input type="email" name="email_field"/><label
for="password_field">Password:</label>
<input type="password" name="password_field"/><input
type="submit" value="Login"/></form>
EDIT 2
The form code is rendered as UI elements after adding XML() helper into default.py. Thanks Anthony for helping. Corrected line below:
return dict(form_code=XML(result))
FINAL EDIT
I figured out the regex fix myself. This is not an optimal solution, but at least it works. The final code:
import re

def user():
    tags = "<form><label for=\"email_field\">Email:</label><input type=\"email\" name=\"email_field\"/><label for=\"password_field\">Password:</label><input type=\"password\" name=\"password_field\"/><input type=\"submit\" value=\"Login\"/></form>"
    tags = re.sub(r"(<form>)", r"<form>\n ", tags)
    tags = re.sub(r"(</.*?>)", r"\1\n ", tags)
    tags = re.sub(r"(/>)", r"/>\n ", tags)
    tags = re.sub(r"( </form>)", r"</form>\n", tags)
    return dict(form_code=XML(tags))
The only issue I see is that you need to change "\1\n" to r"\1\n" (using the "raw" string notation); otherwise \1 is interpreted as an octal escape (meaning the character U+0001). But that shouldn't give you an error, per se. What error message are you getting?
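The difference is easy to verify (a minimal sketch with a short tag string invented for illustration):

```python
import re

tags = "<b>bold</b><i>italic</i>"

# Non-raw "\1\n": Python turns \1 into the control character U+0001
# before re.sub ever sees it, so the closing tag is replaced, not kept.
bad = re.sub("(</.*?>)", "\1\n", tags)

# Raw r"\1\n": \1 reaches re.sub intact and means "backreference to group 1".
good = re.sub(r"(</.*?>)", r"\1\n", tags)

print(repr(bad))   # '<b>bold\x01\n<i>italic\x01\n'
print(repr(good))  # '<b>bold</b>\n<i>italic</i>\n'
```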
By default, web2py escapes all text inserted in the view for security reasons. To avoid that, simply use the XML() helper, either in the controller:
return dict(form_code=XML(result))
or in the view:
{{=XML(form_code)}}
Don't do this unless the code is coming from a trusted source -- otherwise it could contain malicious JavaScript.