Cleaning text after beautiful soup removing specific patterns - regex

The ultimate goal is to get clean plain text for voice processing. That means I need to remove sub-headers, links, bullet points, etc. The code below shows the steps I have taken to clean one example URL bit by bit. I'm now stuck on two things which are common and always have the same structure:
'By Name of correspondent, city'
'Read more: link'
I'm not good at regex, but I think it might help remove these two parts. Or maybe someone could suggest another way of dealing with these patterns. Thanks!
My code:
import requests
from bs4 import BeautifulSoup
import translitcodec
import codecs

def get_text(url):
    page_class = 'story-body__inner'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # remove unwanted parts by class; guard each find() separately,
    # since any of them may return None on a given page
    for tag, cls in [('div', 'social-embed-post social-embed-twitter'),
                     ('div', 'social-embed'),
                     ('a', 'off-screen jump-link'),
                     ('p', 'off-screen'),
                     ('a', 'embed-report-link'),
                     ('a', 'story-body__link')]:
        try:
            soup.find(tag, class_=cls).decompose()
        except AttributeError:
            pass  # element not present on this page
    # delete unwanted tags
    for s in soup(['figure', 'script', 'style', 'table', 'ul', 'h2', 'blockquote']):
        s.decompose()
    # use separator to separate paragraphs and subtitles
    article_soup = [s.get_text(separator="\n", strip=True)
                    for s in soup.find_all('div', {'class': page_class})]
    text = '\n'.join(article_soup)
    text = codecs.encode(text, 'translit/one').encode('ascii', 'replace')  # transliterate, then force ASCII
    text = u"{}".format(text)  # back to unicode
    print text
    return text

url = 'http://www.bbc.co.uk/news/world-us-canada-41724827'
get_text(url)

You don't need regex for this.
Since you only want the main content of the news article (not even headings, since you removed the h2 tags in your code), it's much easier finding all the p elements first and then filtering out the items you don't need.
The three things you want to remove are:
Newsreader's details: these are contained within strong tags inside the paragraphs. As far as I've seen, there are no other paragraphs containing strong elements.
Citations to other articles: those beginning with "Read more: " followed by a link. Luckily, there's a fixed string before the a element inside paragraphs like this, so you don't need regex. You can simply find it using p.find(text='Read more: ').
Text from Twitter posts: these don't appear in the web browser. After each Twitter image embedded in the page, there's a p element that contains the text "End of Twitter post by #some_twitter_id". You don't want this, obviously.
Edit:
The main news content can be found in a single div with a class value of story-body__inner.
I've updated the code to fix the issue of the non-printing of paragraphs containing links. The and inside the second condition had to be replaced with or. I've added another condition and not (p.has_attr('dir')), since the paragraphs containing Twitter posts have a dir attribute in them.
paragraphs = soup.find('div', {'class': 'story-body__inner'}).findAll('p')
for p in paragraphs:
    if p.find('strong') == None \
            and (p.find(text='Read more: ') == None or p.find('a') == None) \
            and not (p.has_attr('class') and 'off-screen' in p['class']) \
            and not p.has_attr('dir'):
        print(p.text.strip())
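If you do still want a regex fallback for those two patterns, here is a rough sketch that filters the cleaned text line by line. The byline pattern is an assumption based on the single example in the question ('By Name of correspondent, city'), so test it against your real data before relying on it:

```python
import re

# Assumed line shapes (from the question); adjust to your real data:
#   'By Jane Doe, Washington'
#   'Read more: Some related story'
byline_re = re.compile(r"^By [A-Z][\w.'-]+(?: [A-Z][\w.'-]+)*, .+$")
readmore_re = re.compile(r"^Read more: .+$")

def drop_patterns(text):
    """Drop byline and 'Read more' lines from plain article text."""
    kept = [line for line in text.splitlines()
            if not byline_re.match(line) and not readmore_re.match(line)]
    return '\n'.join(kept)

sample = "Headline\nBy Jane Doe, Washington\nBody text.\nRead more: Related story"
print(drop_patterns(sample))  # keeps only 'Headline' and 'Body text.'
```

Note the anchors (^ and $) keep the patterns from eating ordinary sentences that merely contain "By" or "Read more" mid-text.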

Related

Is it possible to extract text from a Wagtail streamfield block and strip formatting

I'm trying to extract the text from a Wagtail PostPage, and display that text in a Django page outside of Wagtail.
The relevant part of my PostPage model is:
body = StreamField([
    ('heading', blocks.CharBlock()),
    ('paragraph', blocks.RichTextBlock()),
    ('image', ImageChooserBlock()),
], blank=True)
In my template, I am displaying the posts that get passed like so:
{% for blog in blogs %}
    <a class="blog-post-link" href="{% pageurl blog %}">{{ blog.title }}</a>
    {{ blog.body|truncatewords_html:10 }}
    Read More
{% endfor %}
This works, but the text that comes back retains all the formatting that was applied in the Draftail editor.
Is there any way to pull just the text and pass it to the template from the Wagtail side, or would the text have to be reformatted in the template using custom template tags or something else?
Extra question: I was concerned about blog.body pulling in the heading or the image defined in the streamfield, but so far it seems to jump to the first paragraph when looking for what to display. This is good, but is there a way to guarantee this behavior?
For this display, I don't think there is a distinction between pages routed by Wagtail's page router and those routed by a specific Django page. What you may want to do is use Wagtail's richtext filter and then truncate, more or less as you are doing with {% pageurl blog %}.
If you are concerned about getting headings and images in this other context, is it possible to create a new field for use in this display? If you already have a bunch of blog posts, you might need to write a script to fill in the text once you have added the field, but in that script it should be easier to skip the tags you don't want. And then you can allow authors to edit the summary/blurb field so it makes better sense in the 10-word summary.
If you just want plain text, you can use the streamfield render_as_block() method then parse the html with BeautifulSoup to grab the inner text and clean out any unwanted text from there.
I use this function to do some NLP on a StreamField corpus:
import re
from html import unescape
from bs4 import BeautifulSoup

def get_streamfield_text(
    streamfield,
    strip_newlines=True,
    strip_punctuation=True,
    lowercase=False,
    strip_tags=['style', 'script', 'code']
):
    html = streamfield.render_as_block()
    soup = BeautifulSoup(unescape(html), "html.parser")
    # strip unwanted tags (e.g. ['code', 'script', 'style'])
    # <style> & <script> by default
    if strip_tags:
        for script in soup(strip_tags):
            script.extract()
    inner_text = ' '.join(soup.findAll(text=True))
    # replace non-breaking space with a normal space
    inner_text = inner_text.replace('\xa0', ' ')
    # replace & with and
    inner_text = inner_text.replace(' & ', ' and ')
    # strip Font Awesome class text
    inner_text = re.sub(r'\bfa-[^ ]*', '', inner_text)
    if strip_newlines:
        inner_text = re.sub(r'([\n]+.?)+', ' ', inner_text)
    if strip_punctuation:
        # replace xx/yy with xx yy, leave fractions (1/2)
        inner_text = re.sub(r'(?<=\D)/(?=\D)', ' ', inner_text)
        # strip full stops, leave decimal points and point separators
        inner_text = re.sub(r'\.(?=\s)', '', inner_text)
        punctuation = '!"#$%&\'()*+,-:;<=>?#[\\]^_`{|}~“”‘’–«»‹›¿¡'
        inner_text = inner_text.translate(str.maketrans('', '', punctuation))
    if lowercase:
        inner_text = inner_text.lower()
    # strip excess whitespace
    inner_text = re.sub(r' +', ' ', inner_text).strip()
    return inner_text
Set the parameters to True/False as needed.
I don't do this during rendering though; it's too much overhead. I fire it off in an after edit/create hook and save the result to a read-only field.

Extract the HTML from between two HTML tags in BeautifulSoup 4.6

I want to get the HTML between two tags with bs4. Is there a way to do javascript's .innerHTML in Beautiful Soup?
This is code that finds a span with the class "title", and gets the text out of it.
def get_title(soup):
    title = soup.find('span', {'class': 'title'})
    return title.text.encode('utf-8')
This function returns the text of the span with the subscript markup stripped: 'Title about H2O and CO2'.
The following code is the result of title = soup.find('span', {'class': 'title'}):
<span class="title">Title about H<sub>2</sub>O and CO<sub>2</sub></span>
How do I get the contents without the wrapping span?
Desired result: 'Title about H<sub>2</sub>O and CO<sub>2</sub>'
After finding out that JavaScript has .innerHTML, I was able to google the way to do it in Beautiful Soup. I found the answer in this question.
After selecting the element with BS4, you can use .decode_contents(formatter='html') to get the innerHTML:
element.decode_contents(formatter="html")
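A minimal, self-contained demonstration using the span from the question (html.parser here, but any parser works):

```python
from bs4 import BeautifulSoup

html = '<span class="title">Title about H<sub>2</sub>O and CO<sub>2</sub></span>'
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('span', {'class': 'title'})

# decode_contents() serializes only the tag's children --
# the equivalent of JavaScript's element.innerHTML
inner = title.decode_contents()
print(inner)  # Title about H<sub>2</sub>O and CO<sub>2</sub>
```

Compare title.text, which would give 'Title about H2O and CO2' with the sub tags discarded.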

Methods in Python 2.7 that enable text extraction from multiple HTML pages with different element tags?

I primarily work in Python 2.7. I'm trying to extract the written content (body text) of hundreds of articles from their respective URLs. To simplify things, I've started by trying to extract the text from just one website in my list, and I've been able to do so successfully using BeautifulSoup4. My code looks like this:
import urllib2
from bs4 import BeautifulSoup

url = 'http://insertaddresshere'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
response = urllib2.urlopen(request)
soup = BeautifulSoup(response, "html.parser")
texts = soup.find_all("p")
for item in texts:
    print item.text
This gets me the body text of a single article. I know how to iterate through a csv file and write to a new one, but the list of sites I need to iterate through are all from different domains, so the HTML code varies a lot. Is there any way to find body text from multiple articles that have different element labels (here, "p") for said body text? Is it possible to use BeautifulSoup to do this?
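There is no single selector that works across arbitrary domains, but a common heuristic is to score each candidate container by how much paragraph text it holds and pick the densest one. This is only a sketch (boilerplate such as comment sections can fool it), not a robust extractor:

```python
from bs4 import BeautifulSoup

def guess_body_text(html):
    """Pick the element whose direct <p> children carry the most text."""
    soup = BeautifulSoup(html, 'html.parser')
    best, best_len = None, 0
    for p in soup.find_all('p'):
        parent = p.parent
        # total text length of this container's direct paragraphs
        length = sum(len(q.get_text()) for q in parent.find_all('p', recursive=False))
        if length > best_len:
            best, best_len = parent, length
    if best is None:
        return ''
    return '\n'.join(q.get_text(strip=True) for q in best.find_all('p', recursive=False))

html = ('<div><p>Nav link</p></div>'
        '<div><p>First paragraph of the story.</p><p>Second paragraph.</p></div>')
print(guess_body_text(html))
```

For production use, purpose-built libraries for article extraction will do much better than this, but the scoring idea above is the core of how most of them work.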

Scrapy Extract number from page text with regex

I have been looking for a few hours on how to search all text on a page and if it matches a regex then extract it. I have my spider set up as follows:
def parse(self, response):
    title = response.xpath('//title/text()').extract()
    units = response.xpath('//body/text()').re(r"Units: (\d)")
    print title, units
I would like to pull out the number after "Units: " on the pages. When I run scrapy on a page with Units: 351 in the body I only get the title of the page with a bunch of escapes before and after it and nothing for units.
I am new to scrapy and have a little python experience. Any help with how to extract the integer after Units: and remove the extra escape characters "u'\r\n\t..." from the title would be much appreciated.
EDIT:
As per the comment, here is a partial HTML extract of an example page. Note this could be within different tags aside from the p in this example:
<body>
<div> Some content and multiple divs here </div>
<h1>This is the count for Dala</h1>
<p><strong>Number of Units:</strong> 801</p>
<p>We will have other content here and more divs beyond</p>
</body>
Based on the answer below, this is what got me most of the way there. I'm still working on removing 'Units: ' and the extra escape characters.
units = response.xpath('string(//body)').re("(Units: [\d]+)")
Try:
response.xpath('string(//body)').re(r"Units:\s*(\d+)")
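For the remaining issues: a capture group of (\d) matches only a single digit, so use (\d+), and because .re() returns only the captured groups, keeping "Units:" outside the parentheses drops the label automatically. The same regex in plain Python (the body text below is a made-up stand-in for what string(//body) would flatten the page to):

```python
import re

# stand-in for the flattened page text produced by string(//body)
body_text = "This is the count for Dala Number of Units: 801 We will have more content"

# (\d+) captures the full number; 'Units:' stays outside the group
m = re.search(r"Units:\s*(\d+)", body_text)
print(m.group(1))  # 801
```

The escape characters in the title ("u'\r\n\t...") are just whitespace; a .strip() on each extracted string removes them.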

Add a newline after each closing html tag in web2py

Original
I want to parse a string of html code and add newlines after closing tags + after the initial form tag. Here's the code so far. It's giving me an error in the "re.sub" line. I don't understand why the regex fails.
def user():
    tags = "<form><label for=\"email_field\">Email:</label><input type=\"email\" name=\"email_field\"/><label for=\"password_field\">Password:</label><input type=\"password\" name=\"password_field\"/><input type=\"submit\" value=\"Login\"/></form>"
    result = re.sub("(</.*?>)", "\1\n", tags)
    return dict(form_code=result)
PS. I have a feeling this might not be the best way... but I still want to learn how to do this.
EDIT
I was missing "import re" from my default.py. Thanks ruakh for this.
import re
Now my page source code shows up like this (inspected in client browser). The actual page shows the form code as text, not as UI elements.
<form><label for="email_field">Email:</label>
<input type="email" name="email_field"/><label
for="password_field">Password:</label>
<input type="password" name="password_field"/><input
type="submit" value="Login"/></form>
EDIT 2
The form code is rendered as UI elements after adding XML() helper into default.py. Thanks Anthony for helping. Corrected line below:
return dict(form_code=XML(result))
FINAL EDIT
I figured out the regex fix myself. This is not an optimal solution, but at least it works. The final code:
import re

def user():
    tags = "<form><label for=\"email_field\">Email:</label><input type=\"email\" name=\"email_field\"/><label for=\"password_field\">Password:</label><input type=\"password\" name=\"password_field\"/><input type=\"submit\" value=\"Login\"/></form>"
    tags = re.sub(r"(<form>)", r"<form>\n ", tags)
    tags = re.sub(r"(</.*?>)", r"\1\n ", tags)
    tags = re.sub(r"(/>)", r"/>\n ", tags)
    tags = re.sub(r"( </form>)", r"</form>\n", tags)
    return dict(form_code=XML(tags))
The only issue I see is that you need to change "\1\n" to r"\1\n" (using the "raw" string notation); otherwise \1 is interpreted as an octal escape (meaning the character U+0001). But that shouldn't give you an error, per se. What error-message are you getting?
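The pitfall is easy to demonstrate: in a non-raw string, "\1" is the octal escape for the character U+0001, so re.sub inserts that control character literally instead of the captured group:

```python
import re

tags = "<p>a</p><p>b</p>"

bad = re.sub("(</.*?>)", "\1\n", tags)     # "\1" == "\x01", not a backreference
good = re.sub(r"(</.*?>)", r"\1\n", tags)  # raw strings keep the backslash intact

print(repr(bad))   # the closing tags get replaced by '\x01\n'
print(repr(good))  # '<p>a</p>\n<p>b</p>\n'
```

With the raw string, re.sub sees the two characters backslash and 1 and interprets them as a backreference to group 1, which is what the original code intended.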
By default, web2py escapes all text inserted in the view for security reasons. To avoid that, simply use the XML() helper, either in the controller:
return dict(form_code=XML(result))
or in the view:
{{=XML(form_code)}}
Don't do this unless the code is coming from a trusted source -- otherwise it could contain malicious Javascript.