working with hrefs extracted from Beautifulsoup

working with hrefs extracted from Beautifulsoup - python-2.7

I am a Python beginner learning web crawling.
On this one project, I had to retrieve some hrefs and then to print out the text content within each of these href links. Here is my code so far:
import requests, bs4, os, webbrowser
url = 'http://www.constructeursdefrance.com/resultat/?dpt=53'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'html.parser')
for a in soup.select('.link'):
links = a.find('a').attrs['href']
I tried many things with the links but it would say "unicode is not callable".
How can I work with these links and eventually iterate over them to extract the text within?
Thanks

you code is almost done, just a little change
import requests, bs4, os, webbrowser
url = 'http://www.constructeursdefrance.com/resultat/?dpt=53'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'html.parser')
links = []
for div in soup.select('.link'):
link = div.a.get('href')
links.append(link)
print(links)
out:
['http://www.constructeursdefrance.com/concept-habitat/',
'http://www.constructeursdefrance.com/maisons-bois-cruard/',
'http://www.constructeursdefrance.com/passiva-concept/',
'http://www.constructeursdefrance.com/les-constructions-de-la-mayenne/',
'http://www.constructeursdefrance.com/maisonsdenfrance53/',
'http://www.constructeursdefrance.com/lemasson53/',
'http://www.constructeursdefrance.com/ecb53/',
'http://www.constructeursdefrance.com/villadeale-53/',
'http://www.constructeursdefrance.com/habitat-plus-53/']
select('.link') will return a list of div tag which has a child tag a,
so you can get a tag by div.a and than get href by div.a.get('href')

Try the following:
import requests, bs4, os, webbrowser
url = 'http://www.constructeursdefrance.com/resultat/?dpt=53'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'lxml')
links = soup.findAll('a')
for link in links:
try:
print link.attrs['href']
except:
pass
Hope this helps.

Related

Beautifulsoup: exclude unwanted parts

I realise it's probably a very specific question but I'm struggling to get rid of some parts of text I get using the code below. I need a plain article text which I locate by finding "p" tags under 'class':'mol-para-with-font'. Somehow I get lots of other stuff like author's byline, date stamp, and most importantly text from adverts on the page. Examining html I cannot see them containing the same 'class':'mol-para-with-font' so I'm puzzled (or maybe I've been staring at it for too long...). I know there are lots of html gurus here so I'll be grateful for your help.
My code:
import requests
import translitcodec
import codecs
def get_text(url):
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
# delete unwanted tags:
for s in soup(['figure', 'script', 'style', 'table']):
s.decompose()
article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all( ['p', {'class':'mol-para-with-font'}])]
article = '\n'.join(article_soup)
text = codecs.encode(article, 'translit/one').encode('ascii', 'replace') #replace traslit with ascii
text = u"{}".format(text) #encode to unicode
print text
url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
get_text(url)

Only 'p'-s with class="mol-para-with-font" ?
This will give it to you:
import requests
from bs4 import BeautifulSoup as BS
url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
r = requests.get(url)
soup = BS(r.content, "lxml")
for i in soup.find_all('p', class_='mol-para-with-font'):
print(i.text)

web scraping with BeautifulSoup and python

I'm triyng to print out all the Ip address from this website https://hidemy.name/es/proxy-list/#list
but nothing happens
code in python 2.7:
import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages): #go throw max pages of the website starting from 1
page = 0
value = 0
print('proxies')
while page <= 18:
value += 64
url = 'https://hidemy.name/es/proxy-list/?start=' + str(value) + '#list' #add page number to link
source_code = requests.get(url) #get website html code
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.findAll('td',{'class': 'tdl'}): #get the link of this class
proxy = link.string #get the string of the link
print(proxy)
page += 1
trade_spider(1)

You don't seeing any output because there is no matching elements in your soup.
I've tried to dump all the variables to output stream and figured out that this website is blocking crawlers. Try to print plain_text variable. It'll probably only contain warning message like:
It seems you are bot. If so, please use separate API interface. It
cheap and easy to use.

Regular expression to find precise pdf links in a webpage

Given url='http://normanpd.normanok.gov/content/daily-activity', the website has three types of arrests, incidents, and case summaries. I was asked to use regular expressions to discover the URL strings of all the Incidents pdf documents in Python.
The pdfs are to be downloaded in a defined location.
I have gone through the link and found that Incident pdf files URLs are in the form of:
normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
I have written code :
import urllib.request
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$',text)
But in the URLs list, the values are empty.
I am a beginner in python3 and regex commands. Can anyone help me?

This is not an advisable method. Instead, use an HTML parsing library like bs4 (BeautifulSoup) to find the links and then only regex to filter the results.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)'))
for el in links:
print("http://normanpd.normanok.gov" + el['href'])
Output :
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf
But if you were asked to use only regexes, then try something simpler:
import urllib.request
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)',text)
print(urls)
for link in urls:
print("http://normanpd.normanok.gov/" + link)

Using BeautifulSoup this is an easy way:
soup = BeautifulSoup(open_page, 'html.parser')
links = []
for link in soup.find_all('a'):
current = link.get('href')
if current.endswith('pdf') and "Incident" in current:
links.append('{0}{1}'.format(url,current))

Web-scraping with Python: NoneType error, can't scrape table's data

this is my first attempt at coding so please forgive my daftness. I'm trying to learn web scraping by practising with this link:
https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0
I've honestly spent hours trying to figure out what's wrong with my code here:
import csv
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('tbody')
list_of_rows = []
for row in table.find('tr'):
list_of_cells = []
for cell in row.findAll('td'):
list_of_cells.append()
list_of_rows.append(list_of_cells)
outfile = open("./indarb.csv","wb")
writer = csv.writer(outfile)
My terminal then spits out this: 'NoneType' object has no attribute 'find', saying there's an error in line 13. Not sure if it helps in queries but this is a list of what I've tried:
Different permutations of 'find'/'findAll'
Instead of '.find', used '.findAll'
Instead of '.findAll', used '.find'
Different permutations for line 10
Tried soup.find('tbody')
Tried soup.find('table')
Opened source code, tried soup.find('table', attrs={'class':'table table-condensed'})
Different permutations for line 13
similarly tried with just 'tr' tag; or
tried adding 'attrs={}' stuff
I've really tried but can't figure out why I can't scrape that simple 10 row table. If anyone could post code that works, that'd be phenomenal. Thank you for your patience!

The URL you request in your code is not HTML but JSON.

You have a few mistakes, the biggest is you are using BeautifulSoup3 which has not been developed for years, you should be use bs4, you also need to use find_all when you want want multiple tags. Also you have not passed cell to list_of_cells.append() on line 13 so that is the cause of your other error:
from bs4 import BeautifulSoup
url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0%27'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('table')
list_of_rows = []
for row in table.find_all('tr'):
list_of_cells = []
for cell in row.find_all('td'):
list_of_cells.append(cell)
list_of_rows.append(list_of_cells)
I am not sure exactly what you want but that appends the tds from the first table on the page. There is also and api you can use and adownloadable csv if you do actually want the data.

saving Urls to list

I want to make a list of urls I have gotten from a webpage, I know it's simple but I can't get my head around it (had a headache since yesterday lunch!)
Anyway here is my code
from bs4 import BeautifulSoup
from urllib import request
webpage = request.urlopen("http://www.dsfire.gov.uk/News/Newsdesk/IncidentsPast7days.cfm?siteCategoryId=3&T1ID=26&T2ID=35")
soup = BeautifulSoup(webpage)
incidents = soup.find(id="CollapsiblePanel1")
for line in incidents.find_all('a'):
link = line.get('href')
Links = Links + Link
I know it isn't working but I'm not sure how to make it lol sorry and thanks in advance!
Raif

Change this:
for line in incidents.find_all('a'):
link = line.get('href')
Links = Links + Link
To this:
Links = []
for line in incidents.find_all('a'):
Links.append(line.get('href'))

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

working with hrefs extracted from Beautifulsoup - python-2.7

Related

Beautifulsoup: exclude unwanted parts

web scraping with BeautifulSoup and python

Regular expression to find precise pdf links in a webpage

Web-scraping with Python: NoneType error, can't scrape table's data

saving Urls to list

Categories

Resources