How would I access this information via Beautifulsoup? - python-2.7

How would I find the value of, for example, context with BeautifulSoup?
This is some of what I get when I print my BeautifulSoup variable in Python.
<script>
(function (root) {
root['__playIT'] = {"context":{"dispatcher":{"stores"}
}(this));
</script>

BeautifulSoup can only locate the desired script element; it does not parse the JavaScript inside it. To extract the actual context value, you can use, for example, a regular expression:
import re
from bs4 import BeautifulSoup
data = """
<script>
(function (root) {
root['__playIT'] = {"context":{"dispatcher":{"stores"}
}(this));
</script>"""
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r'"context":(\{.*?)$', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
result = pattern.search(script.text).group(1)
print(result)
Prints:
{"dispatcher":{"stores"}
Note that if the value were a valid JSON string, you could load it with json.loads().
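If the script held a complete JSON object, the captured text could be parsed directly. A minimal sketch, assuming a hypothetical, well-formed payload (the script_text value below is invented for illustration and is not the page's real data):

```python
import json
import re

# Hypothetical script body with a complete, valid JSON object on one line.
script_text = """
(function (root) {
root['__playIT'] = {"context": {"dispatcher": {"stores": {}}}};
}(this));
"""

# Capture the object literal between the assignment and the closing ";"
# at the end of that line.
pattern = re.compile(r"root\['__playIT'\]\s*=\s*(\{.*\});\s*$", re.MULTILINE)
match = pattern.search(script_text)

payload = json.loads(match.group(1))  # now an ordinary dict
context = payload["context"]
print(context)
```

Once the text is parsed, nested values are reached with ordinary dictionary lookups rather than further regular expressions.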

Related

I have a list of strings and want to apply a filter so that I get only certain items from the list. How do I do that?

I am trying to obtain the image data from the following website.
However, I am getting a list that also contains links I do not need. I want to filter it so that I only get the entries that start with /PIAimages. How do I apply such a filter?
import requests
from bs4 import BeautifulSoup
import csv
result = []
response = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
assert response.ok
page = BeautifulSoup(response.text, "html.parser")
for des in page.find_all('img'):
    image = des.get('src')
    print(image)
Expected output:
/PIAimages/0531313_PE647261_S1.JPG
/PIAimages/0513228_PE638849_S1.JPG
/PIAimages/0618875_PE688687_S1.JPG
/PIAimages/0325432_PE517964_S1.JPG
/PIAimages/0690287_PE723209_S1.JPG
/PIAimages/0513996_PE639275_S1.JPG
/PIAimages/0325450_PE517970_S1.JPG
Actual output:
/ms/img/header/ikea-logo.svg
/ms/en_SA/img/header/ikea-store.png
/ms/img/header/main_menu_shadow.gif
/sa/en/images/products/strandmon-wing-chair-beige__0513996_PE639275_S4.JPG
/PIAimages/0531313_PE647261_S1.JPG
/PIAimages/0513228_PE638849_S1.JPG
/PIAimages/0618875_PE688687_S1.JPG
/PIAimages/0325432_PE517964_S1.JPG
/PIAimages/0690287_PE723209_S1.JPG
/PIAimages/0513996_PE639275_S1.JPG
/PIAimages/0325450_PE517970_S1.JPG
/ms/img/static/loading.gif
/ms/img/static/stock_check_green.gif
/ms/img/ads/services/ways_to_shop/20172_otav20a_assembly_20x20.jpg
/ms/en_SA/img/icons/picking-with-delivery.jpg
/ms/img/ads/services/ways_to_shop/20172_otav24a_pickingdelivery_20x20.jpg
/sa/en/images/products/strandmon-wing-chair-beige__0739100_PH147003_S4.JPG
https://smetrics.ikea.com/b/ss/ikeaallnojavascriptprod/5/?c8=sa&pageName=nojavascript
Use an if clause, then append the matching data to the list.
import requests
from bs4 import BeautifulSoup
result = []
response = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
assert response.ok
page = BeautifulSoup(response.text, "html.parser")
for des in page.find_all('img'):
    image = des.get('src')
    # src may be missing, in which case get() returns None
    if image and 'PIAimages' in image:
        result.append(image)
print(result)
Or pass a regular expression to find_all, so that only the matching tags are returned in the first place.
import requests
import re
from bs4 import BeautifulSoup
result = []
response = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
assert response.ok
page = BeautifulSoup(response.text, "html.parser")
for des in page.find_all('img', src=re.compile("PIAimages")):
    image = des.get('src')
    result.append(image)
print(result)
I think it is faster and more concise to use a CSS attribute selector with the starts-with operator (^=). You specify the leading substring of src in the selector, so only qualifying elements are returned.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
page = BeautifulSoup(response.text, "lxml")
images = [item['src'] for item in page.select('img[src^="/PIAimages"]')]
print(images)
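The question asks for paths that start with /PIAimages, whereas a substring test also accepts a path that merely contains that text somewhere. A minimal sketch of the prefix filter on plain strings, using a few values copied from the question's output (srcs and wanted are illustrative names; the None entry stands in for an img tag without a src attribute):

```python
srcs = [
    "/ms/img/header/ikea-logo.svg",
    "/sa/en/images/products/strandmon-wing-chair-beige__0513996_PE639275_S4.JPG",
    "/PIAimages/0531313_PE647261_S1.JPG",
    "/PIAimages/0513228_PE638849_S1.JPG",
    None,
    "/ms/img/static/loading.gif",
]

# Keep only non-empty values whose path begins with the wanted prefix.
wanted = [src for src in srcs if src and src.startswith("/PIAimages")]
print(wanted)
```

The same condition can replace the substring test inside the loop over page.find_all('img').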

Regular expression to find precise pdf links in a webpage

Given url='http://normanpd.normanok.gov/content/daily-activity', the website lists three types of documents: arrests, incidents, and case summaries. I was asked to use regular expressions in Python to discover the URL strings of all the Incident PDF documents.
The pdfs are to be downloaded in a defined location.
I have gone through the link and found that Incident pdf files URLs are in the form of:
normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
I have written this code:
import re
import urllib.request
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$',text)
But the values in the urls list are empty.
I am a beginner in Python 3 and regular expressions. Can anyone help me?
This is not an advisable method. Instead, use an HTML parsing library like bs4 (BeautifulSoup) to find the links, and use a regex only to filter the results.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)'))
for el in links:
    print("http://normanpd.normanok.gov" + el['href'])
Output :
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf
But if you were asked to use only regexes, then try something simpler:
import urllib.request
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)',text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
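For reference, the pattern in the question finds nothing because \s does not match the literal %20 in the links and because $ without re.MULTILINE matches only at the end of the whole text. Even on a match, re.findall would return only the captured groups, not the whole URL. A small sketch on a made-up HTML fragment (the sample string is an assumption modelled on the URL form given in the question):

```python
import re

# Invented fragment; only the URL shape is taken from the question.
sample = (
    '<a href="/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf">PDF</a>\n'
    '<a href="/filebrowser_download/657/2017-02-19%20Daily%20Arrest%20Summary.pdf">PDF</a>\n'
)

# Pattern from the question: needs real whitespace around "Incident".
original = re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', sample)

# Pattern from the answer: matches the percent-encoded spaces literally.
simpler = re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)', sample)

print(original)
print(simpler)
```

Note that the simpler pattern relies on each link sitting on its own line, since . does not cross newlines by default.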
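Using BeautifulSoup, this is an easy way:
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://normanpd.normanok.gov/content/daily-activity"
open_page = urlopen(url).read()  # assumed: open_page holds the page HTML
soup = BeautifulSoup(open_page, 'html.parser')
links = []
for link in soup.find_all('a', href=True):  # skip anchors without an href
    current = link['href']
    if current.endswith('pdf') and "Incident" in current:
        # resolve the href against the page URL instead of concatenating
        links.append(urljoin(url, current))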

BeautifulSoup doesn't return any value

I am new to BeautifulSoup and seem to have run into a problem. As far as I can tell the code I wrote is correct, but the output is empty. It doesn't show any value.
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.nseindia.com/")
soup = BeautifulSoup(url.content, "html.parser")
nifty = soup.find_all("span", {"id": "lastPriceNIFTY"})
for x in nifty:
    print(x.text)
The page seems to be rendered by JavaScript. requests cannot fetch content that is loaded by JavaScript; it only receives the partial page as it exists before any scripts run. You can use the dryscrape library for this, like so:
import dryscrape
from bs4 import BeautifulSoup
sess = dryscrape.Session()
sess.visit("https://www.nseindia.com/")
soup = BeautifulSoup(sess.body(), "lxml")
nifty = soup.select("span[id^=lastPriceNIFTY]")
print(nifty[0:2])  # print a sample, i.e. the first two entries
Output:
[<span class="number" id="lastPriceNIFTY 50"><span class="change green">8,792.80 </span></span>, <span class="value" id="lastPriceNIFTY 50 Pre Open" style="color:#000000"><span class="change green">8,812.35 </span></span>]

How to extract URLs matching a pattern

I'm trying to extract URLs from a webpage with the following pattern :
'http://www.realclearpolitics.com/epolls/????/governor/??/-.html'
My current code extracts all the links. How could I change my code to only extract URLs that match the pattern? Thank you!
import requests
from bs4 import BeautifulSoup
def find_governor_races(html):
    url = html
    base_url = 'http://www.realclearpolitics.com/'
    page = requests.get(html).text
    soup = BeautifulSoup(page,'html.parser')
    links = []
    for a in soup.findAll('a', href=True):
        links.append(a['href'])
    return links
find_governor_races('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html')
You can pass a compiled regular expression as the href argument of .find_all():
import re
pattern = re.compile(r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html")
links = soup.find_all("a", href=pattern)
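The pattern can be checked against a few strings before it is used on the live page. A minimal sketch, assuming made-up candidate URLs (the two state links in hrefs are hypothetical; the map URL is the one from the question) and a version of the pattern with the dots escaped:

```python
import re

pattern = re.compile(
    r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html"
)

hrefs = [
    # hypothetical race page under a two-letter state directory
    "http://www.realclearpolitics.com/epolls/2010/governor/ca/example_race-1234.html",
    # the map page from the question: no state directory, so no match
    "http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html",
    # hypothetical senate page: wrong section, so no match
    "http://www.realclearpolitics.com/epolls/2010/senate/ny/example_race-5678.html",
]

matching = [href for href in hrefs if pattern.search(href)]
print(matching)
```

Only the first URL has both a directory and a file name after governor/, so it is the only one kept.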

Web Scraping between tags

I am trying to get all of the content between tags from a webpage. The code I have is outputting empty arrays. When I print the htmltext it shows the complete contents of the page, but will not show the contents of the tags.
import urllib
import re
urlToOpen = "webAddress"
htmlfile = urllib.urlopen(urlToOpen)
htmltext = htmlfile.read()
regex = '<h5> (.*) </h5>'
pattern = re.compile(regex)
names = re.findall(pattern,htmltext)
print "The h5 tag contains: ", names
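Make sure the value passed to urlopen is a string; wrapping urlToOpen in str() guarantees that, although it changes nothing when the value is already a string, as it is here. The example also uses a non-greedy pattern with no spaces around the capture group: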
import urllib2
import re
urlToOpen = "http://stackoverflow.com/questions/25107611/web-scraping-between-tags"
htmlfile = urllib2.urlopen(str(urlToOpen))
htmltext = htmlfile.read()
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
names = re.findall(pattern,htmltext)
print names
Don't put spaces between the tags and the capture group in the regex. Write it like this:
regex = '<h5>(.+?)</h5>'
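The effect of the extra spaces and of the greedy .* can be seen on a small string. A sketch in Python 3 syntax, using an invented html fragment:

```python
import re

html = "<h5>Alice</h5><p>intro</p><h5>Bob</h5>"

# Literal spaces in the pattern must also be present in the page.
with_spaces = re.findall(r"<h5> (.*) </h5>", html)

# Greedy .* runs from the first <h5> to the last </h5>.
greedy = re.findall(r"<h5>(.*)</h5>", html)

# Non-greedy .+? stops at the nearest closing tag.
non_greedy = re.findall(r"<h5>(.+?)</h5>", html)

print(with_spaces)
print(greedy)
print(non_greedy)
```

Here the pattern with spaces finds nothing, the greedy one returns a single string spanning both headings, and the non-greedy one returns each heading separately.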