Python web-crawling and regular expression - regex

I'm crawling game players' names from the "op.gg" website with a regular expression.
I used regexr.com to check that my regular expression matches what I want, and it found 200 players.
But my Python code doesn't work. I intended to insert the 200 names into a list, but the list is empty.
I think the single quotation mark (') doesn't work in my Python code.
Here is my code:
import requests
from bs4 import BeautifulSoup
import re
user_name = input()
def hex_user_name(user_name):
    hex_user_name = [hex(x) for x in user_name.encode('utf-8')]
    for i, j in enumerate(hex_user_name):
        hex_user_name[i] = '%' + j[2:].upper()
    return ''.join(hex_user_name)
def get_user_name(user_name):
    q = re.compile('k\'>([^<]{1,16})', re.M)
    site = 'http://www.op.gg/summoner/userName=' + user_name
    source_code = requests.get(site)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')
    name = soup.find_all('a')
    listB = q.findall(re.sub('[\s\n,]*', '', str(name)))
    print(listB)
get_user_name(hex_user_name(user_name))
I strongly suspect that this line
q = re.compile('k\'>([^<]{1,16})', re.M)
has a problem, but I couldn't find any mistake.
This is the regular expression I want to use: k\'>([^<]*)
And 이곳은지옥인가 (Korean word) is the data I want to get from this HTML code:
<div class="SummonerName">
<a href="//www.op.gg/summoner/userName=%EC%9D%B4%EA%B3%B3%EC%9D%80%EC%A7%80%EC%98%A5%EC%9D%B8%EA%B0%80" class="Link" target='_blank'>이곳은지옥인가</a>
</div>
I'd really appreciate any help.

Your regex is working
>>> x = ('<a href="//www.op.gg/summoner/userName=%EC%9D%B4%EA%B3%B3%EC%9D'
'%80%EC%A7%80%EC%98%A5%EC%9D%B8%EA%B0%80" class="Link" '
'target=\'_blank\'>이곳은지옥인가</a>')
>>> import re
>>> q = re.compile('k\'>([^<]{1,16})', re.M)
>>> q.findall(x)
['이곳은지옥인가']
It is probably enough to pass plain_text to your regex:
listB = q.findall(re.sub('[\s\n,]*', '' , plain_text))
This is because soup.find_all('a') returns a list, so you would otherwise need to loop through it.
Coercing the list into a str can get messy, because it will escape the ' and/or ":
>>> li = ['k\'b"n', 'sdd']
>>> str(li)
'[\'k\\\'b"n\', \'sdd\']'
>>>
>>> li
['k\'b"n', 'sdd']
>>>
>>>
>>> li = ["k'b\"n", 'sdd']
>>> li
['k\'b"n', 'sdd']
>>> str(li)
'[\'k\\\'b"n\', \'sdd\']'
>>>
That would easily break your regex.
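A minimal sketch of the looping approach (Python 3, standard library only). The links list is a hypothetical stand-in for the markup of each <a> element, and extract_names is an illustrative helper, not part of the original code. It applies the pattern to each item and, for comparison, to str() of the whole list:

```python
import re

# Hypothetical stand-ins for the markup of each <a> element on the page.
links = [
    "<a class=\"Link\" target='_blank'>이곳은지옥인가</a>",
    "<a class=\"Link\" target='_blank'>Hide on bush</a>",
]

pattern = re.compile(r"k'>([^<]{1,16})")


def extract_names(items):
    """Apply the pattern to each item separately instead of to str(list)."""
    names = []
    for item in items:
        names.extend(pattern.findall(item))
    return names


names = extract_names(links)

# str(links) uses the repr of each element, which escapes the single quotes,
# so the text becomes k\'> and the pattern k'> no longer matches.
from_joined = pattern.findall(str(links))

print(names)
print(from_joined)
```

Here names holds both player names, while from_joined is empty. Beautiful Soup may also rewrite attribute quotes when it serializes a tag, so running the regex on plain_text sidesteps both problems.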

Related

Find and Replace using Beautiful Soup

I'm trying to do a replacement using Beautiful Soup and regex. Soup finds what I need; I then turn that into a string and use regex to replace it, but it's not working. Is there a find and replace in Beautiful Soup?
soup = BeautifulSoup(file_data, 'html.parser')
found_data = soup.find(class_='front_page_feature')
found_data = str(found_data)
print(re.sub(found_data, mysql_data,file_data,flags = re.DOTALL))
You want to replace the text contents of a node you found. Just set the .string property with the mysql_data to change it:
>>> from bs4 import BeautifulSoup
>>> file_data = """<p class="front_page_feature">Some data</p><p class="else">Other data</p>"""
>>> soup = BeautifulSoup(file_data, 'html.parser')
>>> found_data = soup.find(class_='front_page_feature')
>>> mysql_data = "New Text"
>>> found_data.string = mysql_data
>>> soup
# => <p class="front_page_feature">New Text</p><p class="else">Other data</p>
# ^^^^^^^^
You may certainly manipulate the text in any other way, even using a regex then.
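If you do want to stay with the string-based approach from the question, one thing to watch is that re.sub() treats its first argument as a pattern, so characters such as ( ) . in the found markup are read as regex syntax. A minimal sketch (Python 3), assuming hypothetical sample strings, that escapes the search text with re.escape():

```python
import re

# Hypothetical sample data; the "(v1.0)" part contains regex metacharacters.
file_data = ('<p class="front_page_feature">Some data (v1.0)</p>'
             '<p class="else">Other data</p>')
found_data = '<p class="front_page_feature">Some data (v1.0)</p>'
mysql_data = '<p class="front_page_feature">New Text</p>'

# Unescaped: the parentheses become a group, so the literal text is not found
# and file_data comes back unchanged.
unescaped = re.sub(found_data, mysql_data, file_data, flags=re.DOTALL)

# Escaped: the search text is matched literally. A function is used as the
# replacement so backslashes in mysql_data would not be interpreted.
escaped = re.sub(re.escape(found_data), lambda m: mysql_data, file_data)

print(unescaped == file_data)
print(escaped)
```

For a purely literal replacement, file_data.replace(found_data, mysql_data) does the same job without a regex. Either way, this only works if the string you search for appears in file_data exactly as written.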

BeautifulSoup and regexp: Attribute error

I am trying to extract information from a page with beautifulsoup4 and regular expressions.
But I get the following error:
AttributeError: 'NoneType' object has no attribute 'group'
I do not understand what is wrong. I am trying to:
get the Typologie name: 'herenhuizen'
get the weblink
Here is my code:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://inventaris.onroerenderfgoed.be/erfgoedobjecten/4778'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
text = soup.prettify()
##block
p = re.compile('(?s)(?<=(Typologie))(.*?)(?=(</a>))', re.VERBOSE)
block = p.search(text).group(2)
##typo_url
p = re.compile('(?s)(?<=(href=\"))(.*?)(?=(\">))', re.VERBOSE)
typo_url = p.search(block).group(2)
## typo_name
p = re.compile('\b(\w+)(\W*?)$', re.VERBOSE)
typo_name = p.search(block).group(1)
Does anyone know where the mistake is?
I would change this:
## typo_name
block_reverse = block[::-1]
p = re.compile('(\w+)', re.VERBOSE)
typo_name_reverse = p.search(block_reverse).group(1)
typo_name = typo_name_reverse[::-1]
print(typo_name)
Sometimes it's easier to just reverse the string if you are looking for stuff at the end. This just finds the name at the end of your block. There are a number of ways to find what you are looking for, and we could come up with all kinds of clever regexes, but if this works that's probably enough :)
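Update
However, I just noticed why the original regex was not working: in a normal string literal, '\b' is a backspace character, not a word boundary. To use \b as a word boundary it needs to be escaped as \\b, or the pattern needs to be a raw string like this: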
## typo_name
p = re.compile(r'\b(\w+)(\W*?)$', re.VERBOSE)
typo_name = p.search(block).group(1)
There is a good follow-up Q&A here: Does Python re module support word boundaries (\b)?
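A small sketch (Python 3) of the difference, using a hypothetical text value rather than the real page. The same pattern is compiled once from a normal string and once from a raw string. The search() result is also checked before calling .group(), which is what avoids the AttributeError:

```python
import re

# Hypothetical block of text ending in the word we want.
text = "stadswoningen herenhuizen  "

# In a normal string literal '\b' is the backspace character (one character).
backspace_pattern = re.compile('\b(\\w+)(\\W*?)$')
# In a raw string r'\b' reaches the regex engine as a word boundary.
boundary_pattern = re.compile(r'\b(\w+)(\W*?)$')

no_match = backspace_pattern.search(text)   # None: text has no backspace
match = boundary_pattern.search(text)

# Guard against None before calling .group().
last_word = match.group(1) if match else None

print(len('\b'), len(r'\b'))
print(no_match)
print(last_word)
```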

How would I access this information via Beautifulsoup?

How would I find a value, for example context, with BeautifulSoup?
This is part of what I get when I print my BeautifulSoup variable in Python.
<script>
(function (root) {
root['__playIT'] = {"context":{"dispatcher":{"stores"}
}(this));
</script>
With BeautifulSoup, you can only locate the desired script element. Then, to extract the actual context value, you can use, for example, regular expressions:
import re
from bs4 import BeautifulSoup
data = """
<script>
(function (root) {
root['__playIT'] = {"context":{"dispatcher":{"stores"}
}(this));
</script>"""
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r'"context":(\{.*?)$', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
result = pattern.search(script.text).group(1)
print(result)
Prints:
{"dispatcher":{"stores"}
Note that if the value were a valid JSON string, you could load it with json.loads().
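A minimal sketch of the json.loads() route (Python 3), assuming a hypothetical, well-formed version of the script body. The snippet in the question is truncated, so the braces here are invented for illustration:

```python
import json
import re

# Hypothetical, complete version of the script text (valid JSON on the right).
script_text = """
(function (root) {
root['__playIT'] = {"context":{"dispatcher":{"stores":{}}}};
}(this));
"""

# Capture everything between "= " and the ";" that ends the line.
pattern = re.compile(r"root\['__playIT'\] = (\{.*\});$", re.MULTILINE)

match = pattern.search(script_text)
payload = json.loads(match.group(1))   # now a regular Python dict
context = payload["context"]

print(context)
```

Once the value is a dict, nested keys can be reached with normal indexing instead of further regexes.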

Web Scraping between tags

I am trying to get all of the content between tags from a webpage. The code I have is outputting empty arrays. When I print the htmltext it shows the complete contents of the page, but will not show the contents of the tags.
import urllib
import re
urlToOpen = "webAddress"
htmlfile = urllib.urlopen(urlToOpen)
htmltext = htmlfile.read()
regex = '<h5> (.*) </h5>'
pattern = re.compile(regex)
names = re.findall(pattern,htmltext)
print "The h5 tag contains: ", names
urlToOpen is already a string, so wrapping it in str(urlToOpen) as below is optional. Here is a working version that uses urllib2 and a non-greedy pattern:
import urllib2
import re
urlToOpen = "http://stackoverflow.com/questions/25107611/web-scraping-between-tags"
htmlfile = urllib2.urlopen(str(urlToOpen))
htmltext = htmlfile.read()
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
names = re.findall(pattern,htmltext)
print names
Don't put spaces between the tags and the capture group unless the page really contains them. Write it like this:
regex = '<h5>(.+?)</h5>'
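To illustrate why the spaces matter, here is a small sketch (Python 3) on a hypothetical HTML string. The pattern with literal spaces only matches tags that contain those exact spaces, while a pattern with \s* tolerates optional whitespace:

```python
import re

# Hypothetical sample markup: one tag without padding, one with.
html = "<h5>Alice</h5>\n<h5> Bob </h5>"

# Literal spaces in the pattern must appear in the page as well.
strict = re.findall(r"<h5> (.*) </h5>", html)

# \s* allows any amount of whitespace (including none) around the content.
tolerant = re.findall(r"<h5>\s*(.*?)\s*</h5>", html)

print(strict)
print(tolerant)
```

The strict pattern finds only Bob, while the tolerant one finds both names.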

Parsing webpages, using re, how to determine the line of a found string?

I am looking at a website in Python using code like this:
import urllib
import urllib2
import re
aResp = urllib2.urlopen("http://stackoverflow.com/")
web_pg = aResp.read()
pattern = "<title>Stack Overflow</title>"
m = re.search(pattern, web_pg)
if m:
    print "found"
else:
    print "Nothing found"
And I am trying to look at the tag after this and get the text inside of it. Is there an easy way to find this information?
If it is simpler, I could make do with something that just gives the line number of m and a way to get the HTML code of that line.
To capture text, use parentheses () like so:
import urllib
import urllib2
import re
aResp = urllib2.urlopen("http://stackoverflow.com/")
web_pg = aResp.read()
pattern = "<title>(.*?)</title>"
m = re.search(pattern, web_pg)
if m:
    print m.group(1)
else:
    print "Nothing found"
re.search() returns the first match in the string, and m.group(1) returns the text captured by the first pair of parentheses.
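For the line-number part of the question, one possible approach is to count the newlines before m.start(). A minimal sketch (Python 3), assuming a hypothetical web_pg string in place of the downloaded page:

```python
import re

# Hypothetical page content standing in for the downloaded HTML.
web_pg = "<html>\n<head>\n<title>Stack Overflow</title>\n</head>\n</html>"

m = re.search(r"<title>(.*?)</title>", web_pg)

if m:
    # Lines are 1-based: count the newlines that come before the match.
    line_number = web_pg.count("\n", 0, m.start()) + 1
    # Fetch the full source line that contains the match.
    line_text = web_pg.splitlines()[line_number - 1]
    print(m.group(1), line_number, line_text)
else:
    print("Nothing found")
```

m.start() is the character offset of the match, so this works no matter how long the lines are. It assumes "\n" line endings in the page.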