Web scraping with BeautifulSoup and Python - python-2.7

I'm trying to print out all the IP addresses from this website https://hidemy.name/es/proxy-list/#list but nothing happens.
Code in Python 2.7:
import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages):  # go through max_pages pages of the website, starting from 1
    page = 0
    value = 0
    print('proxies')
    while page <= 18:
        value += 64
        url = 'https://hidemy.name/es/proxy-list/?start=' + str(value) + '#list'  # add page number to link
        source_code = requests.get(url)  # get website html code
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('td', {'class': 'tdl'}):  # get every cell with this class
            proxy = link.string  # get the cell's text
            print(proxy)
        page += 1
trade_spider(1)

You don't see any output because there are no matching elements in your soup.
I've tried dumping all the variables to the output stream and figured out that this website is blocking crawlers. Try printing the plain_text variable. It'll probably only contain a warning message like:
It seems you are bot. If so, please use separate API interface. It
cheap and easy to use.
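One thing worth trying before reaching for their API (a sketch; a browser-like User-Agent may or may not be enough to get past this particular site's bot detection): requests identifies itself as python-requests by default, which many sites block outright.
import requests

# A browser-like User-Agent; the default 'python-requests/x.y' is an easy block target
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get('https://hidemy.name/es/proxy-list/#list', headers=headers)
print(response.text[:500])  # inspect the start of the page: real HTML or the bot warning?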

Related

Issue scraping website with bs4 (beautiful soup) python 2.7

What I am attempting to accomplish is a simple Python web-scraping script for Google Trends, and I'm running into an issue when grabbing the class:
from bs4 import BeautifulSoup
import requests
results = requests.get("https://trends.google.com/trends/trendingsearches/daily?geo=US")
soup = BeautifulSoup(results.text, 'lxml')
keyword_list = soup.find_all('.details-top')
for keyword in keyword_list:
    print(keyword)
When printing the tag I receive an empty result; however, when I print soup I receive the entire HTML document. My goal is to print out the text of each "Keyword" that was searched for on the page https://trends.google.com/trends/trendingsearches/daily?geo=AU
This page has a list of results:
1. covid-19
2. Woolworths jobs
If you use Google developer tools, select Inspect, and hover over a title, you will see div.details-top.
How would I print just the text of each title?
I can see that data being dynamically retrieved from an API call in the dev tools network tab. You can issue an XHR to that URL and then use a regex on the response text to parse out the query titles.
import requests, re

r = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
p = re.compile(r'"query":"(.*?)"')
titles = p.findall(r)
print(titles)  # on Python 2.7 use: print titles
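If you'd rather not regex, the response body is JSON behind a short anti-hijacking prefix ()]}',), so you can strip the prefix and walk the parsed structure instead. A sketch; the key names (default, trendingSearchesDays, trendingSearches, title, query) are taken from the observed response and may change without notice:
import json
import requests

r = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
data = json.loads(r[r.index('{'):])  # drop the ")]}'," prefix before the first '{'
for day in data['default']['trendingSearchesDays']:  # key names observed in the response; may change
    for search in day['trendingSearches']:
        print(search['title']['query'])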

Extracting CVE Info with a Python 3 regular expression

I frequently need a list of CVEs listed on a vendor's security bulletin page. Sometimes that's simple to copy off, but often they're mixed in with a bunch of text.
I haven't touched Python in a good while, so I thought this would be a great exercise to figure out how to extract that info – especially since I keep finding myself doing it manually.
Here's my current code:
#!/usr/bin/env python3
# REQUIREMENTS
# python3
# BeautifulSoup (pip3 install beautifulsoup4)
# python 3 certificates (Applications/Python 3.x/ Install Certificates.command) <-- this one took me forever to figure out!
import sys
if sys.version_info[0] < 3:
    raise Exception("Use Python 3: python3 " + sys.argv[0])
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
#specify/get the url to scrape
#url ='https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
#url = 'https://source.android.com/security/bulletin/2020-02-01.html'
url = input("What is the URL? ") or 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
print("Checking URL: " + url)
# CVE regular expression
cve_pattern = r'CVE-\d{4}-\d{4,7}'
# query the website and return the html
page = urlopen(url).read()
# parse the html returned using beautiful soup
soup = BeautifulSoup(page, 'html.parser')
count = 0
############################################################
# ANDROID === search for CVE references within <td> tags ===
# find all <td> tags
all_tds = soup.find_all("td")
#print all_tds
for td in all_tds:
    if "cve" in td.text.lower():
        print(td.text)
############################################################
# CHROME === search for CVE reference within <span> tags ===
# find all <span> tags
all_spans = soup.find_all("span")
for span in all_spans:
    # this code returns results in triplicate
    for i in re.finditer(cve_pattern, span.text):
        count += 1
        print(count, i.group())
    # this code works, but only returns the first match
    # match = re.search(cve_pattern, span.text)
    # if match:
    #     print(match.group(0))
What I have works fine for the Android URL; the problem I'm having is with the Chrome URL. That page has the CVE info inside <span> tags, and I'm trying to leverage regular expressions to pull it out.
Using the re.finditer approach, I end up with results in triplicate.
Using the re.search approach it misses CVE-2019-19925 – they listed two CVEs on that same line.
Can you offer any advice on the best way to get this working?
I finally worked it out myself. No need for BeautifulSoup; everything is RegEx now. To work around the duplicate/triplicate results I was seeing before, I convert the re.findall list result to a dictionary (retaining order of unique values) and back to a list.
import sys
if sys.version_info[0] < 3:
    raise Exception("Use Python 3: python3 " + sys.argv[0])
import requests
import re
# Specify/get the url to scrape (included a default for easier testing)
### there is no input validation taking place here ###
url = input("What is the URL? ") or 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
print()
# CVE regular expression
cve_pattern = r'CVE-\d{4}-\d{4,7}'
# query the website and return the html
page = requests.get(url)
# initialize count to 0
count = 0
#search for CVE references using RegEx
cves = re.findall(cve_pattern, page.text)
# after several days of fiddling, I was still getting double and sometimes triple results on certain pages. This next line
# converts the list of objects returned from re.findall to a dictionary (which retains order) to get unique values, then back to a list.
# (thanks to https://stackoverflow.com/a/48028065/9205677)
# I found order to be important sometimes, as the most severely rated CVEs are often listed first on the page
cves = list(dict.fromkeys(cves))
# print the results to the screen
for cve in cves:
    print(cve)
    count += 1
print()
print(str(count) + " CVEs found at " + url)
print()
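The deduplication idiom in isolation: dict.fromkeys keeps only the first occurrence of each key, and since dicts preserve insertion order on Python 3.7+, round-tripping a list through it removes duplicates while keeping order:
cves = ['CVE-2019-19925', 'CVE-2020-6381', 'CVE-2019-19925']
print(list(dict.fromkeys(cves)))  # ['CVE-2019-19925', 'CVE-2020-6381']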

Regular expression to find precise pdf links in a webpage

Given url='http://normanpd.normanok.gov/content/daily-activity', the website has three types of documents: arrests, incidents, and case summaries. I was asked to use regular expressions in Python to discover the URL strings of all the Incident pdf documents.
The pdfs are to be downloaded in a defined location.
I have gone through the link and found that Incident pdf files URLs are in the form of:
normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
I have written this code:
import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read()  # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', text)
But the values in the urls list are empty.
I am a beginner in Python 3 and regex. Can anyone help me?
This is not an advisable method. Instead, use an HTML parsing library like bs4 (BeautifulSoup) to find the links, and use regex only to filter the results.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)'))
for el in links:
print("http://normanpd.normanok.gov" + el['href'])
Output :
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf
But if you were asked to use only regexes, then try something simpler:
import urllib.request
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)',text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
Using BeautifulSoup, this is an easy way:
soup = BeautifulSoup(open_page, 'html.parser')  # open_page: the page HTML, fetched as in the snippets above
links = []
for link in soup.find_all('a'):
    current = link.get('href')
    if current.endswith('pdf') and "Incident" in current:
        links.append('{0}{1}'.format(url, current))
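None of the snippets above actually save the files. A minimal download sketch, assuming links holds the full URLs built in the previous snippet and incident_pdfs is a hypothetical local target directory:
import os
from urllib.request import urlretrieve

target_dir = 'incident_pdfs'  # hypothetical download location
os.makedirs(target_dir, exist_ok=True)
for link in links:
    filename = os.path.join(target_dir, link.rsplit('/', 1)[-1])  # last path component, still URL-encoded
    urlretrieve(link, filename)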

Web-scraping with Python: NoneType error, can't scrape table's data

This is my first attempt at coding, so please forgive my daftness. I'm trying to learn web scraping by practising with this link:
https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0
I've honestly spent hours trying to figure out what's wrong with my code here:
import csv
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('tbody')
list_of_rows = []
for row in table.find('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        list_of_cells.append()
    list_of_rows.append(list_of_cells)
outfile = open("./indarb.csv","wb")
writer = csv.writer(outfile)
My terminal then spits out 'NoneType' object has no attribute 'find', saying there's an error on line 13. Not sure if it helps, but this is a list of what I've tried:
Different permutations of 'find'/'findAll':
- Instead of '.find', used '.findAll'
- Instead of '.findAll', used '.find'
Different permutations for line 10:
- Tried soup.find('tbody')
- Tried soup.find('table')
- Opened the source code, tried soup.find('table', attrs={'class': 'table table-condensed'})
Different permutations for line 13:
- Similarly tried with just the 'tr' tag
- Tried adding 'attrs={}' stuff
I've really tried, but I can't figure out why I can't scrape that simple 10-row table. If anyone could post code that works, that'd be phenomenal. Thank you for your patience!
The URL you request in your code is not HTML but JSON.
You have a few mistakes. The biggest is that you are using BeautifulSoup3, which has not been developed for years; you should be using bs4. You also need to use find_all when you want multiple tags. Also, you have not passed cell to list_of_cells.append() on line 13, so that is the cause of your other error:
import requests
from bs4 import BeautifulSoup

url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
list_of_rows = []
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all('td'):
        list_of_cells.append(cell)
    list_of_rows.append(list_of_cells)
I am not sure exactly what you want, but that appends the tds from the first table on the page. There is also an API you can use, and a downloadable CSV, if you do actually want the data.
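A sketch of the API route, assuming data.gov.sg exposes the standard CKAN datastore_search endpoint (the endpoint path is an assumption; the resource_id below is taken from the question's URL):
import requests

# Assumes a standard CKAN datastore; resource_id comes from the question's URL
api_url = 'https://data.gov.sg/api/action/datastore_search'
params = {'resource_id': 'c24d0d00-2d12-4f68-8fc9-4121433332e0', 'limit': 100}
result = requests.get(api_url, params=params).json()
for record in result['result']['records']:
    print(record)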

Scraping messy source page with Beautiful Soup

I'm trying to do some web scraping using Python and Beautiful Soup, but the source of the webpage is not the prettiest. The snippet below is a small part of the source page:
...717301758],"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0,...
I want to get the value 2 that follows the string 'birthdayFriends', but I have no idea how to get it. So far I have written the code below, but it only prints an empty list.
import urllib2
from bs4 import BeautifulSoup
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='myWebpage',
                          user='myUsername',
                          passwd='myPassword')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
page = urllib2.urlopen('myWebpage')
soup = BeautifulSoup(page.read())
bf = soup.findAll('birthdayFriends')
print bf
>> []
Suppose somewhere in the HTML there is a script tag like the following:
<script>
var x = {"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0}}
</script>
then your code might look something like:
script = soup.findAll('script')[0]  # or whichever index it appears at in the file
# take the json part
j = script.text.split('=')[1]
import json
# load json string to a dictionary
d = json.loads(j, strict=False)
print d["birthdayFriends"]
In case the content of the script tag is more complicated, consider looping over the script's lines, or see How can I parse Javascript variables using python?
Also, for parsing JavaScript in Python, see pynoceros.
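If all you need is that single number, a plain regex over the raw page text is a lighter-weight alternative to locating the script tag. A minimal sketch, reusing the response from the question's code (read it once and keep the text):
import re

html = page.read()  # page is the urllib2 response from the question's code
match = re.search(r'"birthdayFriends":(\d+)', html)
if match:
    print(match.group(1))  # prints: 2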