Find and Replace using Beautiful Soup - regex

I'm trying to replace text using Beautiful Soup and regex. Soup finds what I need; I then turn that into a string and use regex to replace it, but it's not working. Is there a find-and-replace using Beautiful Soup?
soup = BeautifulSoup(file_data, 'html.parser')
found_data = soup.find(class_='front_page_feature')
found_data = str(found_data)
print(re.sub(found_data, mysql_data, file_data, flags=re.DOTALL))

You want to replace the text contents of a node you found. Just set the node's .string property to mysql_data to change it. (Your re.sub call is fragile: str(found_data) is BeautifulSoup's re-serialization of the tag, which may not match the original source byte for byte, and any regex metacharacters in it are treated as pattern syntax; you would need re.escape for that approach.)
>>> from bs4 import BeautifulSoup
>>> file_data = """<p class="front_page_feature">Some data</p><p class="else">Other data</p>"""
>>> soup = BeautifulSoup(file_data, 'html.parser')
>>> found_data = soup.find(class_='front_page_feature')
>>> mysql_data = "New Text"
>>> found_data.string = mysql_data
>>> soup
<p class="front_page_feature">New Text</p><p class="else">Other data</p>
You can certainly manipulate the text in any other way first, even with a regex, before assigning it back.
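For example, a minimal sketch that runs a regex over the node's text and writes the result back (the pattern here is only an illustration):
import re
from bs4 import BeautifulSoup

file_data = '<p class="front_page_feature">Some data</p>'
soup = BeautifulSoup(file_data, 'html.parser')
node = soup.find(class_='front_page_feature')
node.string = re.sub(r'Some', 'New', node.string)  # transform the text, then assign it back
print(soup)  # <p class="front_page_feature">New data</p>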

Related

Beautifulsoup: exclude unwanted parts

I realise it's probably a very specific question, but I'm struggling to get rid of some parts of the text I get using the code below. I need the plain article text, which I locate by finding "p" tags with 'class':'mol-para-with-font'. Somehow I get lots of other stuff as well: the author's byline, the date stamp and, most importantly, text from the adverts on the page. Examining the HTML, I cannot see them carrying the same 'class':'mol-para-with-font', so I'm puzzled (or maybe I've been staring at it for too long...). I know there are lots of HTML gurus here, so I'll be grateful for your help.
My code:
import requests
import translitcodec
import codecs
from bs4 import BeautifulSoup

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style', 'table']):
        s.decompose()
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all(['p', {'class': 'mol-para-with-font'}])]
    article = '\n'.join(article_soup)
    text = codecs.encode(article, 'translit/one').encode('ascii', 'replace')  # replace translit with ascii
    text = u"{}".format(text)  # encode to unicode
    print text

url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
get_text(url)
Only 'p'-s with class="mol-para-with-font"? This will give it to you:
import requests
from bs4 import BeautifulSoup as BS
url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
r = requests.get(url)
soup = BS(r.content, "lxml")
for i in soup.find_all('p', class_='mol-para-with-font'):
    print(i.text)
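If you still want the single article string the original code was building, a minimal sketch joining just those paragraphs (reusing the soup from above):
article = '\n'.join(p.get_text(strip=True)
                    for p in soup.find_all('p', class_='mol-para-with-font'))
print(article)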

Regular expression to find precise pdf links in a webpage

Given url='http://normanpd.normanok.gov/content/daily-activity', the website lists three types of documents: arrests, incidents, and case summaries. I was asked to use regular expressions in Python to discover the URL strings of all the Incident pdf documents.
The pdfs are to be downloaded in a defined location.
I have gone through the link and found that Incident pdf files URLs are in the form of:
normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
I have written code :
import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read()  # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', text)
But the values in the urls list are empty.
I am a beginner in Python 3 and regex commands. Can anyone help me?
This is not an advisable method. Instead, use an HTML parsing library like bs4 (BeautifulSoup) to find the links, and use a regex only to filter the results.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)'))
for el in links:
    print("http://normanpd.normanok.gov" + el['href'])
Output:
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf
But if you were asked to use only regexes, then try something simpler:
import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read()  # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)', text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
Using BeautifulSoup, this is an easy way:
soup = BeautifulSoup(open_page, 'html.parser')  # open_page holds the fetched HTML
links = []
for link in soup.find_all('a'):
    current = link.get('href', '')  # default '' avoids None for anchors without href
    if current.endswith('pdf') and "Incident" in current:
        links.append('{0}{1}'.format(url, current))
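The question also asks for the PDFs to be downloaded to a defined location, which none of the snippets above show. A minimal sketch building on the links list of absolute URLs (the target directory name is a hypothetical choice):
import os
from urllib.request import urlretrieve

download_dir = 'incident_pdfs'  # hypothetical target directory
os.makedirs(download_dir, exist_ok=True)
for link in links:
    filename = os.path.join(download_dir, link.rsplit('/', 1)[-1])  # last path segment as file name
    urlretrieve(link, filename)  # fetch each PDF to disk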

BeautifulSoup doesn't return any value

I am new to BeautifulSoup and seem to have encountered a problem. The code I wrote is correct to my knowledge, but the output is empty. It doesn't show any value.
import requests
from bs4 import BeautifulSoup

url = requests.get("https://www.nseindia.com/")
soup = BeautifulSoup(url.content, "html.parser")
nifty = soup.find_all("span", {"id": "lastPriceNIFTY"})
for x in nifty:
    print(x.text)
The page seems to be rendered by JavaScript. requests cannot fetch content that is loaded by JavaScript; it only gets the partial page as served, before any JavaScript runs. You can use the dryscrape library for this, like so:
import dryscrape
from bs4 import BeautifulSoup
sess = dryscrape.Session()
sess.visit("https://www.nseindia.com/")
soup = BeautifulSoup(sess.body(), "lxml")
nifty = soup.select("span[id^=lastPriceNIFTY]")
print(nifty[0:2])  # printing a sample, i.e. the first two entries
Output:
[<span class="number" id="lastPriceNIFTY 50"><span class="change green">8,792.80 </span></span>, <span class="value" id="lastPriceNIFTY 50 Pre Open" style="color:#000000"><span class="change green">8,812.35 </span></span>]
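dryscrape can be awkward to install; the same idea works with any JavaScript-capable browser driver. A minimal sketch with Selenium instead (assumes a local Chrome driver on your PATH):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.nseindia.com/")
soup = BeautifulSoup(driver.page_source, "lxml")  # the page after JavaScript has run
driver.quit()
print(soup.select("span[id^=lastPriceNIFTY]")[0:2])  # sample: first two entries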

Python web-crawling and regular expression

I'm crawling game player names with a regular expression on the "op.gg" website.
I used regexr.com to check my regular expression against what I want to get, and it found 200 players.
But my Python code doesn't work. I intended to insert 200 entries into a list, but the list is empty.
I think a single quotation mark (') doesn't work in my Python code.
Here is my code:
import requests
from bs4 import BeautifulSoup
import re

user_name = input()

def hex_user_name(user_name):
    hex_user_name = [hex(x) for x in user_name.encode('utf-8')]
    for i, j in enumerate(hex_user_name):
        hex_user_name[i] = '%' + j[2:].upper()
    return ''.join(hex_user_name)

def get_user_name(user_name):
    q = re.compile('k\'>([^<]{1,16})', re.M)
    site = 'http://www.op.gg/summoner/userName=' + user_name
    source_code = requests.get(site)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')
    name = soup.find_all('a')
    listB = q.findall(re.sub('[\s\n,]*', '', str(name)))
    print(listB)

get_user_name(hex_user_name(user_name))
I strongly suspect that this line
q = re.compile('k\'>([^<]{1,16})', re.M)
has a problem, but I couldn't find any mistake.
This is the regular expression I want to use: k\'>([^<]*)
And 이곳은지옥인가 (a Korean name) is the data I want to extract from this HTML:
<div class="SummonerName">
    <a href="//www.op.gg/summoner/userName=%EC%9D%B4%EA%B3%B3%EC%9D%80%EC%A7%80%EC%98%A5%EC%9D%B8%EA%B0%80" class="Link" target='_blank'>이곳은지옥인가</a>
</div>
I really appreciate you guys helping me out..
Your regex is working:
>>> x = ('<a href="//www.op.gg/summoner/userName=%EC%9D%B4%EA%B3%B3%EC%9D'
'%80%EC%A7%80%EC%98%A5%EC%9D%B8%EA%B0%80" class="Link" '
'target=\'_blank\'>이곳은지옥인가</a>')
>>> import re
>>> q = re.compile('k\'>([^<]{1,16})', re.M)
>>> q.findall(x)
['이곳은지옥인가']
It's probably enough to pass plain_text to your regex:
listB = q.findall(re.sub('[\s\n,]*', '', plain_text))
because soup.find_all('a') returns a list, which you would need to loop through instead.
Coercing the list into a str can get messy, because it will escape the ' and/or ":
>>> li = ['k\'b"n', 'sdd']
>>> str(li)
'[\'k\\\'b"n\', \'sdd\']'
>>> li
['k\'b"n', 'sdd']
>>> li = ["k'b\"n", 'sdd']
>>> li
['k\'b"n', 'sdd']
>>> str(li)
'[\'k\\\'b"n\', \'sdd\']'
That would easily break your regex.
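A hedged alternative that sidesteps the quoting problem entirely: let BeautifulSoup read the anchor text directly, assuming the target links carry class="Link" as in the HTML shown above:
from bs4 import BeautifulSoup

html = ('<div class="SummonerName">'
        '<a href="//www.op.gg/summoner/userName='
        '%EC%9D%B4%EA%B3%B3%EC%9D%80%EC%A7%80%EC%98%A5%EC%9D%B8%EA%B0%80" '
        'class="Link" target=\'_blank\'>이곳은지옥인가</a>'
        '</div>')
soup = BeautifulSoup(html, 'lxml')
for a in soup.find_all('a', class_='Link'):
    print(a.get_text(strip=True))  # -> 이곳은지옥인가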

Web Scraping between tags

I am trying to get all of the content between tags from a webpage. The code I have outputs empty lists. When I print htmltext it shows the complete contents of the page, but the pattern will not match the contents of the tags.
import urllib
import re
urlToOpen = "webAddress"
htmlfile = urllib.urlopen(urlToOpen)
htmltext = htmlfile.read()
regex = '<h5> (.*) </h5>'
pattern = re.compile(regex)
names = re.findall(pattern,htmltext)
print "The h5 tag contains: ", names
The mistake is not in how you open the URL; it is the whitespace around your capture group. Here is the same approach with a pattern that matches:
import urllib2
import re

urlToOpen = "http://stackoverflow.com/questions/25107611/web-scraping-between-tags"
htmlfile = urllib2.urlopen(urlToOpen)
htmltext = htmlfile.read()
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
names = re.findall(pattern, htmltext)
print names
Don't put spaces between the tags and the expression. Write it like this:
regex = '<h5>(.+?)</h5>'
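Applied to a small sample, the corrected pattern works; the \s* here is an added tolerance for optional whitespace inside the tags:
import re

htmltext = "<h5> Alice </h5><h5>Bob</h5>"  # stand-in for the fetched page
names = re.findall(r'<h5>\s*(.+?)\s*</h5>', htmltext)
print(names)  # ['Alice', 'Bob']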