Scrapy: extract a number from page text with regex

I have spent a few hours looking for how to search all the text on a page and extract whatever matches a regex. I have my spider set up as follows:
def parse(self, response):
    title = response.xpath('//title/text()').extract()
    units = response.xpath('//body/text()').re(r"Units: (\d)")
    print title, units
I would like to pull out the number after "Units: " on the pages. When I run Scrapy on a page with Units: 351 in the body, I get only the title of the page, with a bunch of escapes before and after it, and nothing for units.
I am new to Scrapy and have a little Python experience. Any help with how to extract the integer after Units: and remove the extra escape characters ("u'\r\n\t...") from the title would be much appreciated.
EDIT:
As per the comment, here is a partial HTML extract of an example page. Note this could be within different tags aside from the p in this example:
<body>
<div> Some content and multiple divs here </div>
<h1>This is the count for Dala</h1>
<p><strong>Number of Units:</strong> 801</p>
<p>We will have other content here and more divs beyond</p>
</body>
Based on the answer below, this is what got me most of the way there. Still working on removing "Units: " and the extra escape characters.
units = response.xpath('string(//body)').re("(Units: [\d]+)")

Try:
response.xpath('string(//body)').re(r"Units:\s*(\d+)")
string(//body) flattens all of the body's text nodes into one string (your //body/text() only returned the text directly under body, which is why units came back empty). The capture group (\d+) returns only the digits, so "Units: " is left out; note \d+ rather than \d, so the whole number (351, not just 3) is matched.
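As a sanity check, the capture behaviour can be reproduced with plain re on a flattened body string (the text below is made up for illustration; Scrapy's .re() applies the same pattern to the selector's extracted text):

```python
import re

# Made-up flattened body text, like what string(//body) would produce
body_text = "This is the count for Dala Number of Units: 801 We will have other content here"

# The capture group returns only the digits, so "Units: " is left out
match = re.search(r"Units:\s*(\d+)", body_text)
print(match.group(1))  # -> 801
```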

Related

Cleaning text after beautiful soup removing specific patterns

The ultimate goal is to have a clean plain text for voice processing. That means I need to remove sub-headers, links, bullet points etc. The code below shows steps I have taken to clean one example url bit by bit. I'm stuck now with two things which are common and always have the same structure.
'By Name of correspondent, city'
'Read more: link'
I'm not good at regex but I think it might help removing these two parts. Or maybe someone could suggest another way of dealing with these patterns. Thanks!
My code:
import requests
from bs4 import BeautifulSoup
import translitcodec
import codecs
def get_text(url):
    page_class = 'story-body__inner'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # remove unwanted parts by class
    try:
        soup.find('div', class_='social-embed-post social-embed-twitter').decompose()
        soup.find('div', class_='social-embed').decompose()
        soup.find('a', class_='off-screen jump-link').decompose()
        soup.find('p', class_='off-screen').decompose()
        soup.find('a', class_='embed-report-link').decompose()
        soup.find('a', class_='story-body__link').decompose()
    except AttributeError:
        pass
    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style', 'table', 'ul', 'h2', 'blockquote']):
        s.decompose()
    # use separator to separate paragraphs and subtitles!
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': page_class})]
    text = '\n'.join(article_soup)
    text = codecs.encode(text, 'translit/one').encode('ascii', 'replace')  # replace translit with ascii
    text = u"{}".format(text)  # coerce to unicode
    print text
    return text
url = 'http://www.bbc.co.uk/news/world-us-canada-41724827'
get_text(url)
You don't need regex for this.
Since you only want the main content of the news article (not even headings, since you removed the h2 tags in your code), it's much easier finding all the p elements first and then filtering out the items you don't need.
Three things you want to remove are:
Newsreader's details: These are contained within strong tags inside the paragraphs. As far as I've seen, there are no other paragraphs containing strong elements.
Citations to other articles: those beginning with "Read more: " followed by a link. Luckily, there's a fixed string before the a element inside paragraphs like this. So you don't need regex. You can simply find using p.find(text='Read more: ').
Text from Twitter post: These don't appear on the web browser. After each twitter image embedded in the page, there's a p element that contains the text "End of Twitter post by #some_twitter_id". You don't want this, obviously.
Edit:
The main news content can be found in a single div with a class value of story-body__inner.
I've updated the code to fix the issue of the non-printing of paragraphs containing links. The and inside the second condition had to be replaced with or. I've added another condition and not (p.has_attr('dir')), since the paragraphs containing Twitter posts have a dir attribute in them.
paragraphs = soup.find('div', {'class': 'story-body__inner'}).findAll('p')
for p in paragraphs:
    if p.find('strong') is None \
            and (p.find(text='Read more: ') is None or p.find('a') is None) \
            and not (p.has_attr('class') and 'off-screen' in p['class']) \
            and not p.has_attr('dir'):
        print(p.text.strip())
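To see the filter in action without fetching the live page, here is a self-contained sketch against made-up HTML that mimics the article structure described above (class names from the question, everything else invented for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking the article structure described in the answer
html = """
<div class="story-body__inner">
  <p><strong>By Someone, Somewhere</strong></p>
  <p>First real paragraph.</p>
  <p>Read more: <a href="#">related article</a></p>
  <p dir="ltr">End of Twitter post by @someone</p>
  <p class="off-screen">Skip link</p>
  <p>Second real paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

kept = []
for p in soup.find('div', {'class': 'story-body__inner'}).find_all('p'):
    if p.find('strong') is None \
            and (p.find(text='Read more: ') is None or p.find('a') is None) \
            and not (p.has_attr('class') and 'off-screen' in p['class']) \
            and not p.has_attr('dir'):
        kept.append(p.get_text(strip=True))
print(kept)  # -> ['First real paragraph.', 'Second real paragraph.']
```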

Extract the HTML from between two HTML tags in BeautifulSoup 4.6

I want to get the HTML between two tags with bs4. Is there a way to do javascript's .innerHTML in Beautiful Soup?
This is code that finds a span with the class "title", and gets the text out of it.
def get_title(soup):
title = soup.find('span', {'class': 'title'})
return title.text.encode('utf-8')
This function returns the text of the span but loses the subscript tags: 'Title about H2O and CO2'
The following code is the result of title = soup.find('span', {'class': 'title'}):
<span class="title">Title about H<sub>2</sub>O and CO<sub>2</sub></span>
How do I get the result without the original span?
Desired result: 'Title about H<sub>2</sub>O and CO<sub>2</sub>'?
After finding out that JavaScript has .innerHTML, I was able to google the way to do it in beautiful soup. I found the answer in this question.
After selecting the element with BS4, you can use .decode_contents(formatter="html") to get the innerHTML.
element.decode_contents(formatter="html")
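A minimal reproduction of the title-with-subscripts case, showing the difference between .text (tags lost) and .decode_contents() (inner HTML kept):

```python
from bs4 import BeautifulSoup

# Minimal reproduction of the title-with-subscripts case
html = '<span class="title">Title about H<sub>2</sub>O and CO<sub>2</sub></span>'
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('span', {'class': 'title'})

print(title.text)               # tags are lost: Title about H2O and CO2
print(title.decode_contents())  # inner HTML kept: Title about H<sub>2</sub>O and CO<sub>2</sub>
```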

Web crawler to extract data particular subset of html tags

1.
<p class="followText">Follow us</p>
<p><a class="symbol ss-social-circle ss-facebook" href="http://www.facebook.com/HowStuffWorks" target="_blank">Facebook</a></p>
2.
<p>Gyroscopes can be very perplexing objects because they move in peculiar ways and even seem to defy gravity. These special properties make ­gyroscopes extremely important in everything from your bicycle to the advanced navigation system on the space shuttle. A typical airplane uses about a dozen gyroscopes in everything from its compass to its autopilot. The Russian Mir space station used 11 gyroscopes to keep its orientation to the sun, and the Hubble Space Telescope has a batch of navigational gyros as well. Gyroscopic effects are also central to things like yo-yos and Frisbees!</p>
This is part of the source of the website http://science.howstuffworks.com/gyroscope.htm, from which I'm trying to extract the contents of the <p> tags.
This is the code I'm using to do that
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://science.howstuffworks.com/gyroscope' + str(page) + ".htm"
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('p'):
            paragraph = link.string
            print paragraph
        page += 1
But I'm getting both types of data (both 1 and 2) from the p tags.
I need only the data from the part 2 section, not part 1.
Please suggest a way to leave out the p tags that have attributes (or contain only boilerplate links) while keeping the plain p tags.
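No answer appears in this thread, but one hedged sketch of such a filter (not from the original thread, names invented for illustration) is to keep only the <p> tags that carry no attributes and contain no links:

```python
from bs4 import BeautifulSoup

# Made-up snippet combining both kinds of <p> from the question
html = '''
<p class="followText">Follow us</p>
<p><a class="symbol" href="#">Facebook</a></p>
<p>Gyroscopes can be very perplexing objects.</p>
'''
soup = BeautifulSoup(html, 'html.parser')

plain = []
for p in soup.find_all('p'):
    if not p.attrs and p.find('a') is None:  # no attributes, no embedded links
        plain.append(p.get_text())
print(plain)  # -> ['Gyroscopes can be very perplexing objects.']
```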

How to get large amounts of href links of very large contents of website with Beautifulsoup

I am parsing a large HTML website that has over 1000 href links. I am using BeautifulSoup to get all the links (findAll on specific 'td' tags), but when I run the program a second time BeautifulSoup cannot handle it. How can I overcome this problem? Although I can load the HTML page with urllib, not all the links can be printed. When I use find for a single 'td' tag, it passes.
Tag = self.__Page.find('table', {'class':'RSLTS'}).findAll('td')
print Tag
for a in Tag.find('a', href= True):
print "found", a['href']
This version works:
Tag = self.__Page.find('table', {'class':'RSLTS'}).find('td')
print Tag
for a in Tag.find('a', href= True):
print "found", a['href']
You need to iterate over them:
tds = self.__Page.find('table', class_='RSLTS').find_all('td')
for td in tds:
    a = td.find('a', href=True)
    if a:
        print "found", a['href']
Although I'd just use lxml if you have a ton of stuff:
root.xpath('//table[contains(@class, "RSLTS")]//td/a/@href')
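For example, with a made-up table of the same shape (lxml assumed to be installed; note the descendant axis //td, since the cells sit inside tr rows):

```python
from lxml import html

# Made-up table matching the structure in the question
root = html.fromstring('''
<table class="RSLTS">
  <tr><td><a href="/item/1">one</a></td></tr>
  <tr><td><a href="/item/2">two</a></td></tr>
</table>
''')
links = root.xpath('//table[contains(@class, "RSLTS")]//td/a/@href')
print(links)  # -> ['/item/1', '/item/2']
```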

Add a newline after each closing html tag in web2py

Original
I want to parse a string of html code and add newlines after closing tags + after the initial form tag. Here's the code so far. It's giving me an error in the "re.sub" line. I don't understand why the regex fails.
def user():
    tags = "<form><label for=\"email_field\">Email:</label><input type=\"email\" name=\"email_field\"/><label for=\"password_field\">Password:</label><input type=\"password\" name=\"password_field\"/><input type=\"submit\" value=\"Login\"/></form>"
    result = re.sub("(</.*?>)", "\1\n", tags)
    return dict(form_code=result)
PS. I have a feeling this might not be the best way... but I still want to learn how to do this.
EDIT
I was missing "import re" from my default.py. Thanks ruakh for this.
import re
Now my page source code shows up like this (inspected in client browser). The actual page shows the form code as text, not as UI elements.
<form><label for="email_field">Email:</label>
<input type="email" name="email_field"/><label
for="password_field">Password:</label>
<input type="password" name="password_field"/><input
type="submit" value="Login"/></form>
EDIT 2
The form code is rendered as UI elements after adding XML() helper into default.py. Thanks Anthony for helping. Corrected line below:
return dict(form_code=XML(result))
FINAL EDIT
Fixing the regex I figured myself. This is not optimal solution but at least it works. The final code:
import re

def user():
    tags = "<form><label for=\"email_field\">Email:</label><input type=\"email\" name=\"email_field\"/><label for=\"password_field\">Password:</label><input type=\"password\" name=\"password_field\"/><input type=\"submit\" value=\"Login\"/></form>"
    tags = re.sub(r"(<form>)", r"<form>\n ", tags)
    tags = re.sub(r"(</.*?>)", r"\1\n ", tags)
    tags = re.sub(r"(/>)", r"/>\n ", tags)
    tags = re.sub(r"( </form>)", r"</form>\n", tags)
    return dict(form_code=XML(tags))
The only issue I see is that you need to change "\1\n" to r"\1\n" (using the "raw" string notation); otherwise \1 is interpreted as an octal escape (meaning the character U+0001). But that shouldn't give you an error, per se. What error-message are you getting?
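The difference is easy to demonstrate on a short made-up string:

```python
import re

tags = "<p>one</p><p>two</p>"

# Without a raw string, "\1" is the octal escape for U+0001, not a backreference
broken = re.sub("(</.*?>)", "\1\n", tags)
# With a raw string, \1 refers to the captured closing tag
fixed = re.sub("(</.*?>)", r"\1\n", tags)

print(repr(broken))  # closing tags replaced by a \x01 control character
print(fixed)         # each closing tag kept and followed by a newline
```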
By default, web2py escapes all text inserted in the view for security reasons. To avoid that, simply use the XML() helper, either in the controller:
return dict(form_code=XML(result))
or in the view:
{{=XML(form_code)}}
Don't do this unless the code is coming from a trusted source -- otherwise it could contain malicious JavaScript.