BeautifulSoup and regexp: AttributeError - regex

I'm trying to extract information with BeautifulSoup4 methods by means of regular expressions,
but I get the following error:
AttributeError: 'NoneType' object has no attribute 'group'
I do not understand what is wrong. I am trying to:
get the Typologie name: 'herenhuizen'
get the weblink
Here is my code:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://inventaris.onroerenderfgoed.be/erfgoedobjecten/4778'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
text = soup.prettify()
##block
p = re.compile('(?s)(?<=(Typologie))(.*?)(?=(</a>))', re.VERBOSE)
block = p.search(text).group(2)
##typo_url
p = re.compile('(?s)(?<=(href=\"))(.*?)(?=(\">))', re.VERBOSE)
typo_url = p.search(block).group(2)
## typo_name
p = re.compile('\b(\w+)(\W*?)$', re.VERBOSE)
typo_name = p.search(block).group(1)
Does anyone have an idea where the mistake is?

I would change this:
## typo_name
block_reverse = block[::-1]
p = re.compile('(\w+)', re.VERBOSE)
typo_name_reverse = p.search(block_reverse).group(1)
typo_name = typo_name_reverse[::-1]
print(typo_name)
Sometimes it's easier to just reverse the string if you are looking for stuff at the end. This just finds the name at the end of your block. There are a number of ways to find what you are looking for, and we could come up with all kinds of clever regexes, but if this works that's probably enough :)
Update
However, I just noticed why the original regex was not working: to use \b it either needs to be escaped like \\b, or the pattern must be a raw string, like this:
## typo_name
p = re.compile(r'\b(\w+)(\W*?)$', re.VERBOSE)
typo_name = p.search(block).group(1)
There is some good follow-up Q&A here: Does Python re module support word boundaries (\b)?
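To make the difference concrete: in a plain string literal '\b' is the backspace character, while a raw string preserves the backslash for the regex engine. A minimal sketch, reusing the 'herenhuizen' value from the question:

```python
import re

# '\b' in a plain string is the backspace character (0x08), so the
# regex engine looks for a literal backspace, not a word boundary.
plain = re.compile('\b' + r'(\w+)$')
# In a raw string, \b reaches the regex engine as a word boundary.
raw = re.compile(r'\b(\w+)$')

text = 'typologie/herenhuizen'
print(plain.search(text))         # None - the text contains no backspace
print(raw.search(text).group(1))  # herenhuizen
```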

Related

Python 2 regex search only for https and export

I have a list with many links in it (http and https). Now I just want all the URLs with https.
Is there a regex for that? I found only one that matches both.
The URLs are in "". Maybe this makes it easier?
Does someone have any idea?
Yes, regular expressions are very capable of matching all kinds of strings.
The following example program works as you suggest:
import re
links = ["http://www.x.com", "https://www.y.com", "http://www.a.com", "https://www.b.com",]
r = re.compile("^https")
httpslinks = list(filter(r.match, links))
print(httpslinks)
This prints out only the https links.
The regular expression looks for strings that start with https. The caret ^ anchors the match at the start of the string, so only strings beginning with "https" can match.
If you are facing a space-delimited string, as you somewhat suggested in the comments, then you can just convert the links to a list using split like so:
links = "http://www.x.com https://www.y.com http://www.a.com https://www.b.com"
r = re.compile("^https")
httpslinks = list(filter(r.match, links.split(" ")))
You can read more on regular expressions here.
The list(filter(...)) wrapper is needed in Python 3, where filter() returns a lazy iterator; in Python 2, filter() already returned a list, so the list() call was redundant there.
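Here is the filtering as a self-contained sketch; in Python 3 the list() call materializes filter()'s lazy iterator into an ordinary list:

```python
import re

links = ["http://www.x.com", "https://www.y.com",
         "http://www.a.com", "https://www.b.com"]
r = re.compile("^https")

# filter() keeps only the links where r.match succeeds, i.e. the
# strings starting with "https"; list() materializes the iterator.
httpslinks = list(filter(r.match, links))
print(httpslinks)  # ['https://www.y.com', 'https://www.b.com']
```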
Now it works:
Thanks to everyone.
import re
from bs4 import BeautifulSoup
with open('copyfromfile.txt', 'r') as file:
    text = file.read()
text = text.replace('"Url":', '[<a href=')
text = text.replace(',"At"', '</a>] ')
soup = BeautifulSoup(text, 'html.parser')
for link in soup.find_all('a'):
    link2 = link.get('href')
    if link2.find("video") == -1:
        link3 = 0
    else:
        f = open("C:/users/%Username%/desktop/copy.txt", "a+")
        f.write(str(link2))
        f.write("\n")
        f.close()

Extracting URL from a string

I'm just starting regular expression for python and came across this problem where I'm supposed to extract URLs from the string:
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
The code I have is:
import re
url = re.findall('<tag>(.*)</tag>', str)
print(url)
returns:
['http://example-1.com</tag><tag>http://example-2.com']
If anyone could point me in the direction of how I might approach this problem, it would be most appreciated!
Thanks everyone!
You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.
You can use BeautifulSoup to parse HTML.
For example:
from bs4 import BeautifulSoup
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
soup = BeautifulSoup(str, 'html.parser')
tags = soup.find_all('tag')
for tag in tags:
    print(tag.text)
Using only re package:
import re
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
url = re.findall('<tag>(.*?)</tag>', str)
print(url)
returns:
['http://example-1.com', 'http://example-2.com']
Hope it helps!
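The key fix in the re version is the lazy quantifier. A minimal sketch showing both behaviours side by side on the question's sample string:

```python
import re

s = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

# Greedy: (.*) runs to the LAST </tag>, swallowing everything between.
greedy = re.findall('<tag>(.*)</tag>', s)
print(greedy)  # ['http://example-1.com</tag><tag>http://example-2.com']

# Lazy: (.*?) stops at the FIRST </tag> after each opening <tag>.
lazy = re.findall('<tag>(.*?)</tag>', s)
print(lazy)    # ['http://example-1.com', 'http://example-2.com']
```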

Beautifulsoup: exclude unwanted parts

I realise it's probably a very specific question, but I'm struggling to get rid of some parts of the text I get using the code below. I need the plain article text, which I locate by finding "p" tags with 'class':'mol-para-with-font'. Somehow I get lots of other stuff, like the author's byline, the date stamp and, most importantly, text from adverts on the page. Examining the HTML, I cannot see them containing the same 'class':'mol-para-with-font', so I'm puzzled (or maybe I've been staring at it for too long...). I know there are lots of HTML gurus here, so I'll be grateful for your help.
My code:
import requests
import translitcodec
import codecs
from bs4 import BeautifulSoup

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style', 'table']):
        s.decompose()
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all(['p', {'class':'mol-para-with-font'}])]
    article = '\n'.join(article_soup)
    text = codecs.encode(article, 'translit/one').encode('ascii', 'replace')  # replace translit with ascii
    text = u"{}".format(text)  # encode to unicode
    print text

url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
get_text(url)
Only 'p'-s with class="mol-para-with-font" ?
This will give it to you:
import requests
from bs4 import BeautifulSoup as BS
url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
r = requests.get(url)
soup = BS(r.content, "lxml")
for i in soup.find_all('p', class_='mol-para-with-font'):
    print(i.text)
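The same filtering can also be written with a CSS selector via select(). A minimal sketch on a made-up snippet (the sample HTML below is an assumption, not taken from the live page):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mimicking the article structure.
html = '''
<p class="mol-para-with-font">First paragraph.</p>
<p class="byline">Author byline - not wanted.</p>
<p class="mol-para-with-font">Second paragraph.</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# CSS selector: only <p> elements carrying that exact class.
texts = [p.get_text(strip=True) for p in soup.select('p.mol-para-with-font')]
print(texts)  # ['First paragraph.', 'Second paragraph.']
```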

Python web-crawling and regular expression

I'm crawling game player names with a regular expression on the "op.gg" web site.
I used the regexr.com website to check my regular expression against what I want to get, and it found 200 players.
But my Python code doesn't work: I intended to insert the 200 entries into a list, but the list is empty.
I think the single quotation mark (') doesn't work in my Python code.
Here is my piece of code:
import requests
from bs4 import BeautifulSoup
import re
user_name = input()
def hex_user_name(user_name):
    hex_user_name = [hex(x) for x in user_name.encode('utf-8')]
    for i, j in enumerate(hex_user_name):
        hex_user_name[i] = '%' + j[2:].upper()
    return ''.join(hex_user_name)

def get_user_name(user_name):
    q = re.compile('k\'>([^<]{1,16})', re.M)
    site = 'http://www.op.gg/summoner/userName=' + user_name
    source_code = requests.get(site)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')
    name = soup.find_all('a')
    listB = q.findall(re.sub('[\s\n,]*', '', str(name)))
    print(listB)
get_user_name(hex_user_name(user_name))
I strongly suspect that this line
q = re.compile('k\'>([^<]{1,16})', re.M)
has a problem, but I couldn't find any mistake.
This is what I want to use as the regular expression: k\'>([^<]*)
And 이곳은지옥인가 (a Korean name) is the data I want to get from the HTML code:
<div class="SummonerName">
<a href="//www.op.gg/summoner/userName=%EC%9D%B4%EA%B3%B3%EC%9D%80%EC%A7%80%EC%98%A5%EC%9D%B8%EA%B0%80" class="Link" target='_blank'>이곳은지옥인가</a>
</div>
I really appreciate you guys helping me out..
Your regex is working:
>>> x = ('<a href="//www.op.gg/summoner/userName=%EC%9D%B4%EA%B3%B3%EC%9D'
'%80%EC%A7%80%EC%98%A5%EC%9D%B8%EA%B0%80" class="Link" '
'target=\'_blank\'>이곳은지옥인가</a>')
>>> import re
>>> q = re.compile('k\'>([^<]{1,16})', re.M)
>>> q.findall(x)
['이곳은지옥인가']
It is probably enough to pass plain_text to your regex:
listB = q.findall(re.sub('[\s\n,]*', '', plain_text))
because soup.find_all('a') returns a list, so you would need to loop through that.
Coercing the list into a str can get messy, because it will escape the ' and/or ":
>>> li = ['k\'b"n', 'sdd']
>>> str(li)
'[\'k\\\'b"n\', \'sdd\']'
>>>
>>> li
['k\'b"n', 'sdd']
>>>
>>>
>>> li = ["k'b\"n", 'sdd']
>>> li
['k\'b"n', 'sdd']
>>> str(li)
'[\'k\\\'b"n\', \'sdd\']'
>>>
That would easily break your regex.
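Looping over the tags, as suggested above, avoids the str() escaping problem entirely. A minimal sketch using the HTML snippet from the question as a standalone sample:

```python
from bs4 import BeautifulSoup

# The markup from the question, as a standalone sample.
html = '''
<div class="SummonerName">
<a href="//www.op.gg/summoner/userName=%EC%9D%B4%EA%B3%B3%EC%9D%80%EC%A7%80%EC%98%A5%EC%9D%B8%EA%B0%80" class="Link" target='_blank'>이곳은지옥인가</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Read each anchor's text directly instead of regexing str(list).
names = [a.get_text() for a in soup.find_all('a', class_='Link')]
print(names)  # ['이곳은지옥인가']
```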

Web Scraping between tags

I am trying to get all of the content between tags from a webpage. The code I have is outputting empty lists. When I print htmltext it shows the complete contents of the page, but the contents of the tags are not found.
import urllib
import re
urlToOpen = "webAddress"
htmlfile = urllib.urlopen(urlToOpen)
htmltext = htmlfile.read()
regex = '<h5> (.*) </h5>'
pattern = re.compile(regex)
names = re.findall(pattern,htmltext)
print "The h5 tag contains: ", names
You made a mistake when calling the string urlToOpen. Write str(urlToOpen) instead of urlToOpen.
import urllib2
import re
urlToOpen = "http://stackoverflow.com/questions/25107611/web-scraping-between-tags"
htmlfile = urllib2.urlopen(str(urlToOpen))
htmltext = htmlfile.read()
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
names = re.findall(pattern,htmltext)
print names
Don't put spaces between the tags and the regex capture group. Write it like this:
regex = '<h5>(.+?)</h5>'
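Putting the two fixes together (no stray spaces inside the pattern, and a lazy quantifier), a minimal sketch on a made-up snippet with several h5 tags:

```python
import re

# Hypothetical sample HTML with multiple <h5> tags.
htmltext = "<h5>First</h5><p>ignored</p><h5>Second</h5>"

# No spaces around the capture group, and (.+?) stops at each
# closing tag instead of running to the last one.
names = re.findall('<h5>(.+?)</h5>', htmltext)
print(names)  # ['First', 'Second']
```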