Web Scraping between tags - regex

I am trying to get all of the content between tags from a webpage. The code I have is outputting empty arrays. When I print the htmltext it shows the complete contents of the page, but will not show the contents of the tags.
import urllib
import re
urlToOpen = "webAddress"
htmlfile = urllib.urlopen(urlToOpen)
htmltext = htmlfile.read()
regex = '<h5> (.*) </h5>'
pattern = re.compile(regex)
names = re.findall(pattern,htmltext)
print "The h5 tag contains: ", names

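Wrapping urlToOpen in str() is harmless but not actually required, since urlToOpen is already a string. The change that matters in the code below is the regex: it has no spaces around the group and uses a non-greedy quantifier.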
import urllib2
import re
urlToOpen = "http://stackoverflow.com/questions/25107611/web-scraping-between-tags"
htmlfile = urllib2.urlopen(str(urlToOpen))
htmltext = htmlfile.read()
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
names = re.findall(pattern,htmltext)
print names

Don't put spaces between the tags and the capture group, because literal spaces in the pattern must also appear in the page. Write it like this:
regex = '<h5>(.+?)</h5>'
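To see why the spaces and the quantifier matter, here is a minimal Python 3 sketch run against a small inline string. The sample markup is made up for illustration and is not taken from the page in the question.

```python
import re

# Hypothetical sample markup; the real page will differ.
html = "<h5>Alice</h5><h5>Bob</h5>"

# Literal spaces in the pattern must also be present in the text, so nothing matches.
with_spaces = re.findall(r"<h5> (.*) </h5>", html)

# Greedy .* runs to the last </h5> on the line and swallows the tags in between.
greedy = re.findall(r"<h5>(.*)</h5>", html)

# Non-greedy .+? stops at the first </h5>, giving one result per tag.
non_greedy = re.findall(r"<h5>(.+?)</h5>", html)

print(with_spaces)  # []
print(greedy)       # ['Alice</h5><h5>Bob']
print(non_greedy)   # ['Alice', 'Bob']
```

If the real tags do contain surrounding whitespace or span several lines, adding \s* around the group and passing the re.DOTALL flag are the usual adjustments.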

Related

Extracting URL from a string

I'm just starting with regular expressions in Python and came across this problem, where I'm supposed to extract URLs from the string:
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
The code I have is:
import re
url = re.findall('<tag>(.*)</tag>', str)
print(url)
returns:
['http://example-1.com</tag><tag>http://example-2.com']
If anyone could point me in the right direction on how to approach this problem, I would be most appreciative!
Thanks everyone!
You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.
You can use BeautifulSoup to parse HTML.
For example:
from bs4 import BeautifulSoup
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
soup = BeautifulSoup(str, 'html.parser')
tags = soup.find_all('tag')
for tag in tags:
    print(tag.text)
Using only the re package:
import re
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
url = re.findall('<tag>(.*?)</tag>', str)
print(url)
returns:
['http://example-1.com', 'http://example-2.com']
Hope it helps!

BeautifulSoup and regexp: Attribute error

I am trying to extract information from a page using beautifulsoup4 together with regular expressions.
But I get the following error:
AttributeError: 'NoneType' object has no attribute 'group'
I do not understand what is wrong. I am trying to:
get the Typologie name: 'herenhuizen'
get the weblink
Here is my code:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://inventaris.onroerenderfgoed.be/erfgoedobjecten/4778'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
text = soup.prettify()
##block
p = re.compile('(?s)(?<=(Typologie))(.*?)(?=(</a>))', re.VERBOSE)
block = p.search(text).group(2)
##typo_url
p = re.compile('(?s)(?<=(href=\"))(.*?)(?=(\">))', re.VERBOSE)
typo_url = p.search(block).group(2)
## typo_name
p = re.compile('\b(\w+)(\W*?)$', re.VERBOSE)
typo_name = p.search(block).group(1)
Does anyone have an idea where the mistake is?
I would change this:
## typo_name
block_reverse = block[::-1]
p = re.compile('(\w+)', re.VERBOSE)
typo_name_reverse = p.search(block_reverse).group(1)
typo_name = typo_name_reverse[::-1]
print(typo_name)
Sometimes it's easier to just reverse the string if you are looking for stuff at the end. This just finds the name at the end of your block. There are a number of ways to find what you are looking for, and we could come up with all kinds of clever regexes, but if this works that's probably enough :)
Update
However, I just noticed why the original regex was not working: in a normal string literal, \b is a backspace character, so to use it as a word boundary it needs to be escaped like \\b or written in a raw string like this:
## typo_name
p = re.compile(r'\b(\w+)(\W*?)$', re.VERBOSE)
typo_name = p.search(block).group(1)
There is a good follow-up Q&A here: Does Python re module support word boundaries (\b)?
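As a minimal sketch of the difference between the two spellings (Python 3; the block string is a made-up stand-in for the scraped text, not the real page content):

```python
import re

# Hypothetical stand-in for the scraped block of HTML.
block = 'href="/example">\n   herenhuizen\n'

# In a normal string literal, "\b" is a single backspace character (0x08).
plain = "\bherenhuizen"
# In a raw string, r"\b" is a backslash plus "b", which re reads as a word boundary.
raw = r"\bherenhuizen"

print(len("\b"), len(r"\b"))           # 1 2
print(re.search(plain, block))         # None: the text contains no backspace
print(re.search(raw, block).group(0))  # herenhuizen

# search() returns None when nothing matches, so check before calling .group();
# calling .group() on None is what raises the AttributeError.
m = re.search(plain, block)
typo_name = m.group(0) if m else None
print(typo_name)                       # None
```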

Beautifulsoup: exclude unwanted parts

I realise it's probably a very specific question but I'm struggling to get rid of some parts of text I get using the code below. I need a plain article text which I locate by finding "p" tags under 'class':'mol-para-with-font'. Somehow I get lots of other stuff like author's byline, date stamp, and most importantly text from adverts on the page. Examining html I cannot see them containing the same 'class':'mol-para-with-font' so I'm puzzled (or maybe I've been staring at it for too long...). I know there are lots of html gurus here so I'll be grateful for your help.
My code:
import requests
import translitcodec
import codecs
from bs4 import BeautifulSoup
def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style', 'table']):
        s.decompose()
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all( ['p', {'class':'mol-para-with-font'}])]
    article = '\n'.join(article_soup)
    text = codecs.encode(article, 'translit/one').encode('ascii', 'replace') #replace translit with ascii
    text = u"{}".format(text) #encode to unicode
    print text
url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
get_text(url)
Do you want only the p tags with class="mol-para-with-font"?
This will give it to you:
import requests
from bs4 import BeautifulSoup as BS
url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
r = requests.get(url)
soup = BS(r.content, "lxml")
for i in soup.find_all('p', class_='mol-para-with-font'):
    print(i.text)
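The difference between filtering on the tag name alone and on the tag name plus class can be sketched on a small inline document. The markup below is invented for illustration, and html.parser is used only so the sketch does not depend on lxml. In the question's call the class dictionary sits inside the list of tag names rather than being passed as an attribute filter, so it presumably does not restrict the result by class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the article page.
html = """
<p class="byline">By A. Reporter</p>
<p class="mol-para-with-font">First paragraph.</p>
<p class="mol-para-with-font">Second paragraph.</p>
<p>Advert text</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name only: every <p> in the document is returned.
all_p = [p.get_text() for p in soup.find_all("p")]

# Tag name plus class_ keyword: only <p> tags carrying that CSS class.
article_p = [p.get_text() for p in soup.find_all("p", class_="mol-para-with-font")]

print(len(all_p))  # 4
print(article_p)   # ['First paragraph.', 'Second paragraph.']
```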

How to extract URLs matching a pattern

I'm trying to extract URLs from a webpage with the following pattern :
'http://www.realclearpolitics.com/epolls/????/governor/??/-.html'
My current code extracts all the links. How could I change my code to only extract URLs that match the pattern? Thank you!
import requests
from bs4 import BeautifulSoup
def find_governor_races(html):
    url = html
    base_url = 'http://www.realclearpolitics.com/'
    page = requests.get(html).text
    soup = BeautifulSoup(page,'html.parser')
    links = []
    for a in soup.findAll('a', href=True):
        links.append(a['href'])
find_governor_races('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html')
You can pass a compiled regular expression as the href argument to .find_all(), so that only links whose href matches the pattern are returned:
import re
pattern = re.compile(r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html")
links = soup.find_all("a", href=pattern)
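To check what such a pattern accepts without fetching the page, it can be tried on a plain list of strings. This is only a sketch: the hrefs below are made up, the dots are escaped so that they match literal dots only, and pattern.search is used on the assumption that it mirrors how Beautiful Soup applies a regex filter to attribute values.

```python
import re

pattern = re.compile(
    r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html"
)

# Hypothetical hrefs, invented to show one match and two non-matches.
hrefs = [
    "http://www.realclearpolitics.com/epolls/2010/governor/ca/example_race-1.html",
    "http://www.realclearpolitics.com/epolls/2010/senate/ca/example_race-2.html",
    "http://www.realclearpolitics.com/about.html",
]

# Keep only the hrefs in which the pattern is found.
matching = [h for h in hrefs if pattern.search(h)]
print(matching)
```

Only the first href survives, because it is the only one with a year, the governor segment, and a further path segment before the .html file name.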

Parsing webpages, using re, how to determine the line of a found string?

I am looking at a website in Python using code like this:
import urllib
import urllib2
import re
aResp = urllib2.urlopen("http://stackoverflow.com/")
web_pg = aResp.read()
pattern = "<title>Stack Overflow</title>"
m = re.search(pattern, web_pg)
if m:
    print "found"
else:
    print "Nothing found"
I am trying to look at the tag after this one and get the text inside it. Is there an easy way to find this information?
If it is simpler I could make do with something that just gives the line number of m and a way to get the HTML code of that line.
To capture text, wrap the part you want in parentheses (a capture group), like so:
import urllib
import urllib2
import re
aResp = urllib2.urlopen("http://stackoverflow.com/")
web_pg = aResp.read()
pattern = "<title>(.*?)</title>"
m = re.search(pattern, web_pg)
if m:
    print m.group(1)
else:
    print "Nothing found"
re.search() returns the first match in the text, and m.group(1) returns the text captured by the first pair of parentheses in that match.
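For the line-number part of the question, one possible approach is to count the newlines that come before the start of the match. The sketch below is Python 3 and uses a made-up page source in place of the downloaded web_pg; line_no and line_html are names introduced here for illustration.

```python
import re

# Hypothetical page source standing in for the downloaded web_pg.
web_pg = "<html>\n<head>\n<title>Stack Overflow</title>\n</head>\n</html>"

m = re.search(r"<title>(.*?)</title>", web_pg)
if m:
    # Newlines before the match start, plus one, give the 1-based line number.
    line_no = web_pg.count("\n", 0, m.start()) + 1
    # splitlines() is 0-based, so subtract one to fetch that line's HTML.
    line_html = web_pg.splitlines()[line_no - 1]
    print(line_no, line_html)  # 3 <title>Stack Overflow</title>
    print(m.group(1))          # Stack Overflow
else:
    print("Nothing found")
```

In Python 3, read() on the response returns bytes, so the page would need to be decoded before it is matched against a str pattern like this one.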