Saving URLs to a list

I want to make a list of URLs I have gotten from a webpage. I know it's simple, but I can't get my head around it (had a headache since yesterday lunch!).
Anyway, here is my code:
from bs4 import BeautifulSoup
from urllib import request
webpage = request.urlopen("http://www.dsfire.gov.uk/News/Newsdesk/IncidentsPast7days.cfm?siteCategoryId=3&T1ID=26&T2ID=35")
soup = BeautifulSoup(webpage)
incidents = soup.find(id="CollapsiblePanel1")
for line in incidents.find_all('a'):
    link = line.get('href')
    Links = Links + Link
I know it isn't working, but I'm not sure how to make it work lol. Sorry, and thanks in advance!
Raif

Change this:
for line in incidents.find_all('a'):
    link = line.get('href')
    Links = Links + Link
To this:
Links = []
for line in incidents.find_all('a'):
    Links.append(line.get('href'))
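For what it's worth, the same loop can also be written as a list comprehension. The sketch below runs on a small inline HTML snippet (made up for illustration, it is not the real page) and passes href=True so that anchors without an href are skipped instead of adding None to the list.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page, so the example runs offline.
SAMPLE_HTML = """
<div id="CollapsiblePanel1">
  <a href="/incident/1">First incident</a>
  <a name="anchor-without-href">No link here</a>
  <a href="/incident/2">Second incident</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
incidents = soup.find(id="CollapsiblePanel1")

# href=True matches only <a> tags that actually carry an href attribute.
links = [a["href"] for a in incidents.find_all("a", href=True)]
print(links)
```

With the real page you would build soup from the downloaded markup, as in the question, and keep the last two statements unchanged.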

Related

Soup.find and findAll unable to find table elements on hockey-reference.com

I'm just a beginner at web scraping and Python in general, so I'm sorry if the answer is obvious, but I can't figure out why I'm unable to find any of the table elements on https://www.hockey-reference.com/leagues/NHL_2018.html.
My initial thought was that this was a result of the whole div being commented out, so following some advice I found on here in another similar post, I replaced the comment characters and confirmed that they were removed when I saved the soup.text to a text file and searched. I was still unable to find any tables however.
While searching a little further, I took the ID out of my .find and did a findAll, and table was still coming up empty.
Here's the code I was trying to use, any advice is much appreciated!
import csv
import requests
from BeautifulSoup import BeautifulSoup
import re
comm = re.compile("<!--|-->")
url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(comm.sub("", html))
table = soup.find('table', id="stats")
When searching for all of the table elements I was using
table = soup.findAll('table')
I'm also aware that there is a csv version on the site, I was just eager to practice.
Pass a parser along with your markup, for example BeautifulSoup(html, 'lxml'). Try the code below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')  # name the parser explicitly
table = soup.find_all('table')  # find_all is the bs4 spelling of findAll
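Since the question mentions that the div holding the table is commented out, here is a minimal sketch of another way to reach such tables with bs4: find the comment nodes and parse their contents as HTML. The markup below is invented for the example; whether it matches the live page's layout is an assumption.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: a table hidden inside an HTML comment.
SAMPLE_HTML = """
<div id="all_stats">
<!--
<table id="stats"><tr><td>Team A</td><td>10</td></tr></table>
-->
</div>
"""

def find_commented_tables(html):
    """Return the <table> tags found inside HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    # Comment is a NavigableString subclass, so it can be matched via string=.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        inner = BeautifulSoup(comment, "html.parser")
        tables.extend(inner.find_all("table"))
    return tables

tables = find_commented_tables(SAMPLE_HTML)
print([t.get("id") for t in tables])
```

This avoids running a regex over the raw bytes, at the cost of parsing each comment a second time.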

Regular expression to find precise pdf links in a webpage

Given url='http://normanpd.normanok.gov/content/daily-activity', the website has three types of documents: arrests, incidents, and case summaries. I was asked to use regular expressions to discover the URL strings of all the Incident PDF documents in Python.
The PDFs are to be downloaded to a defined location.
I have gone through the page and found that the Incident PDF file URLs have the form:
normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
I have written this code:
import urllib.request
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$',text)
But in the URLs list, the values are empty.
I am a beginner in python3 and regex commands. Can anyone help me?
This is not an advisable method. Instead, use an HTML parsing library like bs4 (BeautifulSoup) to find the links, and then use a regex only to filter the results.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)'))
for el in links:
    print("http://normanpd.normanok.gov" + el['href'])
Output :
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf
But if you were asked to use only regexes, then try something simpler:
import urllib.request
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)',text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
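One possible pitfall with a lazy .+? pattern is that a match can, in principle, start at an earlier link on the same line and run on until the first Incident file name. A slightly stricter sketch is shown below: it only matches inside a quoted href and uses [^"] so a match cannot cross attribute boundaries. The sample text is a made-up fragment, not the real page source, and BASE is an assumed site root.

```python
import re
from urllib.parse import urljoin

BASE = "http://normanpd.normanok.gov/"  # assumed site root

# Hypothetical fragment of page source with one incident and one arrest link.
SAMPLE = (
    '<a href="/filebrowser_download/657/2017-02-19%20Daily%20Arrest%20Summary.pdf">Arrests</a> '
    '<a href="/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf">Incidents</a>\n'
)

# Capture only the href value; [^"]* cannot run past the closing quote.
pattern = re.compile(r'href="([^"]*Daily%20Incident%20Summary\.pdf)"')

# urljoin resolves the root-relative path against the site root.
urls = [urljoin(BASE, path) for path in pattern.findall(SAMPLE)]
print(urls)
```

Note that re.findall returns the captured group rather than the whole match when the pattern contains exactly one group, which is why the href=" prefix does not end up in the results.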
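Using BeautifulSoup, this is an easy way:
from urllib.parse import urljoin
soup = BeautifulSoup(open_page, 'html.parser')  # open_page: the HTML you already downloaded
links = []
for link in soup.find_all('a', href=True):  # href=True skips anchors without an href
    current = link['href']
    if current.endswith('pdf') and "Incident" in current:
        links.append(urljoin(url, current))  # resolve the relative href against the page URL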
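The question also mentions saving the PDFs to a defined location, which none of the snippets above cover. A minimal sketch, assuming a list of absolute PDF URLs has already been collected; local_path_for, download_all, and the incident_pdfs directory name are all hypothetical names chosen for this example.

```python
import os
from urllib.parse import unquote, urlsplit
from urllib.request import urlretrieve

def local_path_for(pdf_url, target_dir):
    """Build the destination path from the last segment of the URL."""
    # unquote turns %20 back into spaces for a readable file name.
    filename = unquote(os.path.basename(urlsplit(pdf_url).path))
    return os.path.join(target_dir, filename)

def download_all(pdf_urls, target_dir):
    """Download each URL into target_dir, creating the directory if needed."""
    os.makedirs(target_dir, exist_ok=True)
    for pdf_url in pdf_urls:
        urlretrieve(pdf_url, local_path_for(pdf_url, target_dir))

# Only the path computation is exercised here; no network access happens.
example = "http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf"
print(local_path_for(example, "incident_pdfs"))
```

Calling download_all(links, "incident_pdfs") would then fetch the files, assuming the URLs are still reachable.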

working with hrefs extracted from Beautifulsoup

I am a Python beginner learning web crawling.
On this one project, I had to retrieve some hrefs and then to print out the text content within each of these href links. Here is my code so far:
import requests, bs4, os, webbrowser
url = 'http://www.constructeursdefrance.com/resultat/?dpt=53'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'html.parser')
for a in soup.select('.link'):
    links = a.find('a').attrs['href']
I tried many things with the links but it would say "unicode is not callable".
How can I work with these links and eventually iterate over them to extract the text within?
Thanks
Your code is almost done; it just needs a small change:
import requests, bs4, os, webbrowser
url = 'http://www.constructeursdefrance.com/resultat/?dpt=53'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'html.parser')
links = []
for div in soup.select('.link'):
    link = div.a.get('href')
    links.append(link)
print(links)
out:
['http://www.constructeursdefrance.com/concept-habitat/',
'http://www.constructeursdefrance.com/maisons-bois-cruard/',
'http://www.constructeursdefrance.com/passiva-concept/',
'http://www.constructeursdefrance.com/les-constructions-de-la-mayenne/',
'http://www.constructeursdefrance.com/maisonsdenfrance53/',
'http://www.constructeursdefrance.com/lemasson53/',
'http://www.constructeursdefrance.com/ecb53/',
'http://www.constructeursdefrance.com/villadeale-53/',
'http://www.constructeursdefrance.com/habitat-plus-53/']
select('.link') returns a list of div tags, each of which has a child a tag,
so you can get the a tag with div.a and then get its href with div.a.get('href')
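To go one step further and extract the text behind each collected link, a sketch along these lines may help. page_text is a hypothetical helper, and the sample page is invented so the example runs without network access.

```python
from bs4 import BeautifulSoup

def page_text(html):
    """Return the visible text of an HTML document, one chunk per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents are not reported as text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

# Hypothetical page body, standing in for requests.get(link).text.
SAMPLE_PAGE = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Concept Habitat</h1><p>Constructeur en Mayenne</p></body></html>"
)

print(page_text(SAMPLE_PAGE))
```

For the real site, something like for link in links: print(page_text(requests.get(link).text)) would apply it to every link in the list.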
Try the following:
import requests, bs4, os, webbrowser
url = 'http://www.constructeursdefrance.com/resultat/?dpt=53'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'lxml')
links = soup.find_all('a')
for link in links:
    try:
        print(link.attrs['href'])
    except KeyError:  # this <a> tag has no href attribute
        pass
Hope this helps.

Web-scraping with Python: NoneType error, can't scrape table's data

This is my first attempt at coding, so please forgive my daftness. I'm trying to learn web scraping by practising with this link:
https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0
I've honestly spent hours trying to figure out what's wrong with my code here:
import csv
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('tbody')
list_of_rows = []
for row in table.find('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        list_of_cells.append()
    list_of_rows.append(list_of_cells)
outfile = open("./indarb.csv","wb")
writer = csv.writer(outfile)
My terminal then spits out this: 'NoneType' object has no attribute 'find', saying there's an error in line 13. Not sure if it helps in queries but this is a list of what I've tried:
Different permutations of 'find'/'findAll'
    Instead of '.find', used '.findAll'
    Instead of '.findAll', used '.find'
Different permutations for line 10
    Tried soup.find('tbody')
    Tried soup.find('table')
    Opened source code, tried soup.find('table', attrs={'class':'table table-condensed'})
Different permutations for line 13
    similarly tried with just 'tr' tag; or
    tried adding 'attrs={}' stuff
I've really tried but can't figure out why I can't scrape that simple 10 row table. If anyone could post code that works, that'd be phenomenal. Thank you for your patience!
The URL you request in your code is not HTML but JSON.
You have a few mistakes. The biggest is that you are using BeautifulSoup 3, which has not been developed for years; you should be using bs4. You also need to use find_all when you want multiple tags. Also, you have not passed cell to list_of_cells.append() on line 13, which is the cause of your other error:
import requests
from bs4 import BeautifulSoup
url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
table = soup.find('table')
list_of_rows = []
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all('td'):
        list_of_cells.append(cell)
    list_of_rows.append(list_of_cells)
I am not sure exactly what you want, but that appends the tds from the first table on the page. There is also an API you can use and a downloadable CSV if you actually want the data.
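If the goal is a CSV of the cell text rather than a list of tag objects, a sketch like the following may be closer to what the question was after. The table markup and its values are placeholders invented for the example, and table_to_rows is a hypothetical helper; it also guards against find() returning None, which is what produces the 'NoneType' error.

```python
import csv
import io
from bs4 import BeautifulSoup

# Placeholder table; the values are not taken from the real dataset.
SAMPLE_HTML = """
<table class="table table-condensed">
  <tr><th>Category</th><th>Count</th></tr>
  <tr><td>Item A</td><td> 3 </td></tr>
  <tr><td>Item B</td><td>1</td></tr>
</table>
"""

def table_to_rows(html):
    """Return the first table as a list of rows of stripped cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:  # find() returns None when nothing matches
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows

rows = table_to_rows(SAMPLE_HTML)

# Write to an in-memory buffer here; a real file opened with newline="" works the same way.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```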

Scraping urls from html, save in csv using BeautifulSoup

I'm trying to save all hyperlinked urls in an online forum in a CSV file, for a research project.
When I 'print' the html scraping result it seems to be working fine, in the sense that it prints all the urls I want, but I'm unable to write these to separate rows in the CSV.
I'm clearly doing something wrong, but I don't know what! So any help will be greatly appreciated.
Here's the code I've written:
import urllib2
from bs4 import BeautifulSoup
import csv
import re
soup = BeautifulSoup(urllib2.urlopen('http://forum.sex141.com/eforum/forumdisplay.php?fid=28&page=5').read())
urls = []
for url in soup.find_all('a', href=re.compile('viewthread.php')):
    print url['href']
csvfile = open('Ss141.csv', 'wb')
writer = csv.writer(csvfile)
for url in zip(urls):
    writer.writerow([url])
csvfile.close()
You do need to add your matches to the urls list:
for url in soup.find_all('a', href=re.compile('viewthread.php')):
    print(url['href'])
    urls.append(url['href'])  # store the href string, not the whole tag
and you don't need to use zip() here.
Best just write your urls as you find them, instead of collecting them in a list first:
soup = BeautifulSoup(urllib2.urlopen('http://forum.sex141.com/eforum/forumdisplay.php?fid=28&page=5').read())
with open('Ss141.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    for url in soup.find_all('a', href=re.compile('viewthread.php')):
        writer.writerow([url['href']])
The with statement will close the file object for you when the block is done.
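The code in this thread is Python 2 (urllib2, 'wb' mode). Under Python 3 the csv module expects a text-mode file opened with newline='', so the writing part might look like the sketch below. The urls list and the temporary output path are stand-ins for the scraped hrefs and the real file name.

```python
import csv
import os
import tempfile

# Hypothetical hrefs standing in for the scraped url['href'] values.
urls = ["viewthread.php?tid=1", "viewthread.php?tid=2"]

# Write into a temporary directory so the example leaves no file behind in the cwd.
path = os.path.join(tempfile.mkdtemp(), "urls.csv")

# Python 3: text mode plus newline="" lets the csv module control line endings.
with open(path, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    for url in urls:
        writer.writerow([url])  # one URL per row, in a single column

# Read the file back to confirm each URL landed on its own row.
with open(path, newline="", encoding="utf-8") as csvfile:
    rows = list(csv.reader(csvfile))
print(rows)
```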