Regular expression to find precise pdf links in a webpage

Given url='http://normanpd.normanok.gov/content/daily-activity', the website lists three types of documents: arrests, incidents, and case summaries. I was asked to use regular expressions in Python to find the URL strings of all the Incident PDF documents.
The PDFs are to be downloaded to a defined location.
I have gone through the links and found that the Incident PDF URLs have this form:
normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
I have written this code:
import re
import urllib.request
url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', text)
But the urls list comes back empty.
I am a beginner in Python 3 and regular expressions. Can anyone help me?

Parsing HTML with a regex alone is not advisable. Instead, use an HTML parsing library like bs4 (BeautifulSoup) to find the links, and use a regex only to filter the results.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = "http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)'))
for el in links:
    print("http://normanpd.normanok.gov" + el['href'])
Output :
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf
But if you were asked to use only regexes, then try something simpler:
import urllib.request
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)', text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
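To see why the pattern in the question finds nothing, here is a minimal sketch that runs it against a small made-up HTML fragment. The href shape follows the URLs quoted above, but the Arrest and Case file names are assumptions for illustration. The original pattern expects whitespace (`\s`) around "Incident", while the URLs encode spaces as `%20`, and `$` only matches at the end of the whole text.

```python
import re

# Made-up HTML fragment; only the Incident file name is taken from the question.
sample_html = (
    '<a href="/filebrowser_download/657/2017-02-19%20Daily%20Arrest%20Summary.pdf">Arrests</a>\n'
    '<a href="/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf">Incidents</a>\n'
    '<a href="/filebrowser_download/657/2017-02-19%20Daily%20Case%20Summary.pdf">Cases</a>\n'
)

# Pattern from the question: \s never matches next to "Incident" in a URL,
# because the spaces are encoded as %20.
original = re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', sample_html)

# One capturing group around the whole path, so findall returns plain strings.
# [^"]* cannot run past the closing quote of the href attribute.
incident_paths = re.findall(r'href="([^"]*Incident%20Summary\.pdf)"', sample_html)

print(original)        # empty list
print(incident_paths)  # only the Incident link
```

Anchoring on `href="..."` keeps each match inside one attribute, which is the main thing a regex-only approach has to get right.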

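Using BeautifulSoup, this is an easy way:
# open_page is the downloaded HTML; url is assumed to be the site root that the hrefs are relative to
soup = BeautifulSoup(open_page, 'html.parser')
links = []
for link in soup.find_all('a'):
    current = link.get('href')
    # link.get returns None for anchors without an href, so check it first
    if current and current.endswith('pdf') and "Incident" in current:
        links.append('{0}{1}'.format(url, current))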
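The question also asks for the PDFs to be downloaded to a defined location. A minimal sketch of that step, using only the standard library, could look like the following. The names `local_path_for` and `download_all` are hypothetical helpers, and `BASE_URL` is assumed to be the site root used in the answers above.

```python
import os
import urllib.request
from urllib.parse import unquote, urljoin

BASE_URL = "http://normanpd.normanok.gov"  # assumed site root

def local_path_for(href, dest_dir):
    """Return the path inside dest_dir where the PDF behind href would be saved."""
    # The last path segment is the file name; unquote turns %20 back into spaces.
    filename = unquote(href.rsplit("/", 1)[-1])
    return os.path.join(dest_dir, filename)

def download_all(hrefs, dest_dir):
    """Download every href (relative or absolute) into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for href in hrefs:
        # urljoin resolves relative hrefs against the site root.
        full_url = urljoin(BASE_URL, href)
        urllib.request.urlretrieve(full_url, local_path_for(href, dest_dir))
```

Calling `download_all(links, "incident_pdfs")` with the collected hrefs would then fetch each file; the directory name here is only an example.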

Related

Issue scraping website with bs4 (beautiful soup) python 2.7

What I am attempting to accomplish is a simple Python web scraping script for Google Trends, and I am running into an issue when grabbing the class:
from bs4 import BeautifulSoup
import requests
results = requests.get("https://trends.google.com/trends/trendingsearches/daily?geo=US")
soup = BeautifulSoup(results.text, 'lxml')
keyword_list = soup.find_all('.details-top')
for keyword in keyword_list:
    print(keyword)
When I print the tags I get an empty result; however, when I print soup I get the entire HTML document. My goal is to print the text of each "Keyword" that was searched for on the page https://trends.google.com/trends/trendingsearches/daily?geo=AU
which has a list of results:
1. covid-19
2. Woolworths jobs
If you open the browser developer tools, select Inspect, and hover over a title, you will see div.details-top.
How would I print just the text of each title?

I can see that the data is retrieved dynamically from an API call in the network tab of the dev tools. You can issue a request to that URL and then use a regex on the response text to parse out the query titles.
import requests, re
from bs4 import BeautifulSoup as bs
r = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
p = re.compile(r'"query":"(.*?)"')
titles = p.findall(r)
print(titles) # 2.7 use print titles
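As a rough illustration of what the regex does, here is a sketch run against a made-up fragment. The fragment is only shaped like text containing "query" keys and is not actual API output; the two titles are the ones listed in the question.

```python
import re

# Made-up fragment for illustration, not a real API response.
sample_response = '{"title":{"query":"covid-19"}},{"title":{"query":"Woolworths jobs"}}'

# (.*?) is non-greedy, so each match stops at the first closing quote.
pattern = re.compile(r'"query":"(.*?)"')
titles = pattern.findall(sample_response)

# A greedy (.*) would instead run on to the last quote in the text.
greedy = re.findall(r'"query":"(.*)"', sample_response)

print(titles)
print(len(greedy))
```

The non-greedy quantifier is what keeps each title separate; with the greedy version everything between the first and last quote ends up in a single match.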

I have the following list of strings, and I want to apply a filter so that I get only certain items from the list. How to do that?

I am trying to obtain the image data from the following website.
However, I am getting a list that also contains links I do not need. I want to apply a filter so that I only get the entries that start with /PIAimages. How do I apply such a filter?
import requests
from bs4 import BeautifulSoup
import csv
result = []
response = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
assert response.ok
page = BeautifulSoup(response.text, "html.parser")
for des in page.find_all('img'):
    image = des.get('src')
    print(image)
Expected output:
/PIAimages/0531313_PE647261_S1.JPG
/PIAimages/0513228_PE638849_S1.JPG
/PIAimages/0618875_PE688687_S1.JPG
/PIAimages/0325432_PE517964_S1.JPG
/PIAimages/0690287_PE723209_S1.JPG
/PIAimages/0513996_PE639275_S1.JPG
/PIAimages/0325450_PE517970_S1.JPG
Actual output:
/ms/img/header/ikea-logo.svg
/ms/en_SA/img/header/ikea-store.png
/ms/img/header/main_menu_shadow.gif
/sa/en/images/products/strandmon-wing-chair-beige__0513996_PE639275_S4.JPG
/PIAimages/0531313_PE647261_S1.JPG
/PIAimages/0513228_PE638849_S1.JPG
/PIAimages/0618875_PE688687_S1.JPG
/PIAimages/0325432_PE517964_S1.JPG
/PIAimages/0690287_PE723209_S1.JPG
/PIAimages/0513996_PE639275_S1.JPG
/PIAimages/0325450_PE517970_S1.JPG
/ms/img/static/loading.gif
/ms/img/static/stock_check_green.gif
/ms/img/ads/services/ways_to_shop/20172_otav20a_assembly_20x20.jpg
/ms/en_SA/img/icons/picking-with-delivery.jpg
/ms/img/ads/services/ways_to_shop/20172_otav24a_pickingdelivery_20x20.jpg
/sa/en/images/products/strandmon-wing-chair-beige__0739100_PH147003_S4.JPG
https://smetrics.ikea.com/b/ss/ikeaallnojavascriptprod/5/?c8=sa&pageName=nojavascript

Use an if clause, then append the matching data to the list.
import requests
from bs4 import BeautifulSoup
result = []
response = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
assert response.ok
page = BeautifulSoup(response.text, "html.parser")
for des in page.find_all('img'):
    image = des.get('src')
    # des.get returns None when an <img> has no src, so check it first
    if image and 'PIAimages' in image:
        result.append(image)
print(result)
Or use a regular expression, so that find_all does the filtering for you.
import requests
import re
from bs4 import BeautifulSoup
result = []
response = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
assert response.ok
page = BeautifulSoup(response.text, "html.parser")
for des in page.find_all('img', src=re.compile("PIAimages")):
    image = des.get('src')
    result.append(image)
print(result)
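Since the question asks for entries that start with /PIAimages, the filtering step can also be sketched on its own with `str.startswith`. The list below is a shortened copy of the question's actual output, with a `None` added as an assumption to stand in for an `<img>` tag without a src attribute.

```python
# Shortened src values from the question's output, plus None for a tag with no src.
sources = [
    "/ms/img/header/ikea-logo.svg",
    "/sa/en/images/products/strandmon-wing-chair-beige__0513996_PE639275_S4.JPG",
    "/PIAimages/0531313_PE647261_S1.JPG",
    "/PIAimages/0513228_PE638849_S1.JPG",
    None,
    "/ms/img/static/loading.gif",
]

# Keep only strings that begin with the wanted prefix; `src and ...` skips None.
result = [src for src in sources if src and src.startswith("/PIAimages")]
print(result)
```

Unlike a substring test, `startswith` would also reject a path that merely contains "PIAimages" somewhere in the middle.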

I think it is faster and more concise to use a CSS attribute = value selector with the starts-with operator (^=). You specify the start substring for the src in the selector, so only qualifying elements are returned.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
page = BeautifulSoup(response.text, "lxml")
images = [item['src'] for item in page.select('img[src^="/PIAimages"]')]
print(images)

BeautifulSoup doesn't return any value

I am new to BeautifulSoup and seem to have encountered a problem. The code I wrote is correct to my knowledge, but the output is empty. It doesn't show any value.
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.nseindia.com/")
soup = BeautifulSoup(url.content, "html.parser")
nifty = soup.find_all("span", {"id": "lastPriceNIFTY"})
for x in nifty:
    print(x.text)

The page seems to be rendered by JavaScript. requests cannot get content that is loaded by JavaScript; it only receives the partial page as it is before JavaScript rendering. You can use the dryscrape library for this like so:
import dryscrape
from bs4 import BeautifulSoup
sess = dryscrape.Session()
sess.visit("https://www.nseindia.com/")
soup = BeautifulSoup(sess.body(), "lxml")
nifty = soup.select("span[id^=lastPriceNIFTY]")
print(nifty[0:2]) # printing a sample, i.e. the first two entries
Output:
[<span class="number" id="lastPriceNIFTY 50"><span class="change green">8,792.80 </span></span>, <span class="value" id="lastPriceNIFTY 50 Pre Open" style="color:#000000"><span class="change green">8,812.35 </span></span>]
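A quick way to confirm this kind of diagnosis is to check whether the id appears anywhere in the raw text that was downloaded. The sketch below uses two made-up strings as stand-ins for the static and the rendered page; `appears_in_static_html` is a hypothetical helper name.

```python
def appears_in_static_html(html, marker):
    """Return True if marker occurs in the raw HTML text that was downloaded."""
    return marker in html

# Made-up stand-ins: what a plain HTTP fetch might return versus the rendered page.
static_html = '<div id="niftyBox"><script src="ticker.js"></script></div>'
rendered_html = '<div id="niftyBox"><span id="lastPriceNIFTY 50">8,792.80</span></div>'

print(appears_in_static_html(static_html, 'id="lastPriceNIFTY'))
print(appears_in_static_html(rendered_html, 'id="lastPriceNIFTY'))
```

If the same check on `requests.get(...).text` comes back False, no parser setting will find the element, and a tool that executes JavaScript is needed.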

How to extract URLs matching a pattern

I'm trying to extract URLs from a webpage with the following pattern :
'http://www.realclearpolitics.com/epolls/????/governor/??/-.html'
My current code extracts all the links. How could I change my code to only extract URLs that match the pattern? Thank you!
import requests
from bs4 import BeautifulSoup
def find_governor_races(html):
    url = html
    base_url = 'http://www.realclearpolitics.com/'
    page = requests.get(html).text
    soup = BeautifulSoup(page,'html.parser')
    links = []
    for a in soup.findAll('a', href=True):
        links.append(a['href'])
    return links
find_governor_races('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html')

You can pass a compiled regular expression pattern as the href argument of .find_all(). Escape the dots so that they match a literal "." only:
import re
pattern = re.compile(r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html")
links = soup.find_all("a", href=pattern)
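To check what such a pattern accepts without fetching the page, it can be tried on a few href strings directly. In this sketch the first URL is the map page from the question; the second is a hypothetical race page made up for illustration.

```python
import re

# Dots are escaped so they match a literal "." rather than any character.
pattern = re.compile(
    r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html"
)

hrefs = [
    # The map page from the question: no extra path segment after /governor/.
    "http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html",
    # Hypothetical race page with a two-letter state segment.
    "http://www.realclearpolitics.com/epolls/2010/governor/ca/example_race-1234.html",
    # Relative link that should be ignored.
    "/epolls/latest_polls/",
]

matching = [h for h in hrefs if pattern.search(h)]
print(matching)
```

Only the second entry survives, because the pattern requires one more path segment between /governor/ and the .html file name.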

Scraping urls from html, save in csv using BeautifulSoup

I'm trying to save all hyperlinked urls in an online forum in a CSV file, for a research project.
When I 'print' the html scraping result it seems to be working fine, in the sense that it prints all the urls I want, but I'm unable to write these to separate rows in the CSV.
I'm clearly doing something wrong, but I don't know what! So any help will be greatly appreciated.
Here's the code I've written:
import urllib2
from bs4 import BeautifulSoup
import csv
import re
soup = BeautifulSoup(urllib2.urlopen('http://forum.sex141.com/eforum/forumdisplay.php?fid=28&page=5').read())
urls = []
for url in soup.find_all('a', href=re.compile('viewthread.php')):
    print url['href']
csvfile = open('Ss141.csv', 'wb')
writer = csv.writer(csvfile)
for url in zip(urls):
    writer.writerow([url])
csvfile.close()

You need to add your matches to the urls list:
for url in soup.find_all('a', href=re.compile('viewthread.php')):
    print url['href']
    urls.append(url['href'])
and you don't need to use zip() here.
It is best to just write your URLs as you find them, instead of collecting them in a list first:
soup = BeautifulSoup(urllib2.urlopen('http://forum.sex141.com/eforum/forumdisplay.php?fid=28&page=5').read())
with open('Ss141.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    for url in soup.find_all('a', href=re.compile('viewthread.php')):
        writer.writerow([url['href']])
The with statement will close the file object for you when the block is done.
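The code above is Python 2. In Python 3 the csv module expects a text-mode file opened with newline='' instead of 'wb'. A minimal sketch of the same row-per-URL idea, with made-up thread links and an in-memory buffer standing in for the file, might be:

```python
import csv
import io

def write_hrefs(hrefs, fileobj):
    """Write each href as its own one-column CSV row."""
    writer = csv.writer(fileobj)
    for href in hrefs:
        # A one-element list, so the string is not split into characters.
        writer.writerow([href])

# Made-up thread links standing in for the scraped hrefs.
hrefs = ["viewthread.php?tid=1", "viewthread.php?tid=2"]

buffer = io.StringIO()  # stand-in for open('Ss141.csv', 'w', newline='')
write_hrefs(hrefs, buffer)
print(buffer.getvalue())
```

Passing the file object into the function keeps the writing logic testable without touching the disk.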
