Cleaning up re.search output? - regex

I have been writing a script that retrieves CVSS3 scores when I enter a vulnerability name. I've pretty much got it working as intended, except for one minor annoying detail.
π ~/Documents/Tools/Scripts ❯ python3 CVSS3-Grabber.py
Paste Vulnerability Name: PHP 7.2.x < 7.2.21 Multiple Vulnerabilities.
Base Score: None
Vector: <re.Match object; span=(27869, 27913), match='CVSS:3.0/AV:N/AC:L/PR:N/UI:R/S:U/C:L/I:N/A:H'>
Temporal Vector: <re.Match object; span=(27986, 28008), match='CVSS:3.0/E:U/RL:O/RC:C'>
As can be seen, the output could be much neater. I would much prefer something like this:
π ~/Documents/Tools/Scripts ❯ python3 CVSS3-Grabber.py
Paste Vulnerability Name: PHP 7.2.x < 7.2.21 Multiple Vulnerabilities.
Base Score: None
Vector: CVSS:3.0/AV:N/AC:L/PR:N/UI:R/S:U/C:L/I:N/A:H
However, I have been struggling to figure out how to make the output nicer. Is there an easy part of the re module that I'm missing that can do this for me? Or would putting the output into a file first allow me to manipulate the text into the form I need?
Here is my code. I would appreciate any feedback on how to improve it, as I have recently gotten back into Python and scripting in general.
import requests
import re
from bs4 import BeautifulSoup
from googlesearch import search

def get_url():
    vuln = input("Paste Vulnerability Name: ") + "tenable"
    for url in search(vuln, tld='com', lang='en', num=1, start=0, stop=1, pause=2.0):
        return url

def get_scores(url):
    response = requests.get(url)
    html = response.text
    cvss3_temporal_v = re.search("CVSS:3.0/E:./RL:./RC:.", html)
    cvss3_v = re.search("CVSS:3.0/AV:./AC:./PR:./UI:./S:./C:./I:./A:.", html)
    cvss3_basescore = re.search("Base Score:....", html)
    print("Base Score: ", cvss3_basescore)
    print("Vector: ", cvss3_v)
    print("Temporal Vector: ", cvss3_temporal_v)

urll = get_url()
get_scores(urll)
### IMPROVEMENTS ###
# Include the base score in output
# Tidy up output
# Vulnerability list?
# modify to accept flags, i.e python3 CVSS3-Grabber.py -v VULNAME ???
# State whether it is a failing issue or Action point
Thanks!

Don't print the match object. Print the match value.
In Python the value is accessible through the .group() method. If there are no regex subgroups (or you want the entire match, like in this case), don't specify any arguments when you call it:
print("Vector: ", cvss3_v.group())


Extracting CVE Info with a Python 3 regular expression

I frequently need a list of CVEs listed on a vendor's security bulletin page. Sometimes that's simple to copy off, but often they're mixed in with a bunch of text.
I haven't touched Python in a good while, so I thought this would be a great exercise to figure out how to extract that info – especially since I keep finding myself doing it manually.
Here's my current code:
#!/usr/bin/env python3
# REQUIREMENTS
# python3
# BeautifulSoup (pip3 install beautifulsoup4)
# python 3 certificates (Applications/Python 3.x/Install Certificates.command) <-- this one took me forever to figure out!

import sys
if sys.version_info[0] < 3:
    raise Exception("Use Python 3: python3 " + sys.argv[0])

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# specify/get the url to scrape
# url = 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
# url = 'https://source.android.com/security/bulletin/2020-02-01.html'
url = input("What is the URL? ") or 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
print("Checking URL: " + url)

# CVE regular expression
cve_pattern = r'CVE-\d{4}-\d{4,7}'

# query the website and return the html
page = urlopen(url).read()

# parse the html returned using beautiful soup
soup = BeautifulSoup(page, 'html.parser')

count = 0

############################################################
# ANDROID === search for CVE references within <td> tags ===
# find all <td> tags
all_tds = soup.find_all("td")
# print(all_tds)
for td in all_tds:
    if "cve" in td.text.lower():
        print(td.text)

############################################################
# CHROME === search for CVE reference within <span> tags ===
# find all <span> tags
all_spans = soup.find_all("span")
for span in all_spans:
    # this code returns results in triplicate
    for i in re.finditer(cve_pattern, span.text):
        count += 1
        print(count, i.group())

    # this code works, but only returns the first match
    # match = re.search(cve_pattern, span.text)
    # if match:
    #     print(match.group(0))
What I have for the Android URL works fine; the problem I'm having is with the Chrome URL. They have the CVE info inside <span> tags, and I'm trying to leverage regular expressions to pull that out.
Using the re.finditer approach, I end up with results in triplicate.
Using the re.search approach, it misses CVE-2019-19925 – they listed two CVEs on that same line, and re.search only returns the first match.
Can you offer any advice on the best way to get this working?
I finally worked it out myself. No need for BeautifulSoup; everything is RegEx now. To work around the duplicate/triplicate results I was seeing before, I convert the re.findall list result to a dictionary (retaining order of unique values) and back to a list.
import sys
if sys.version_info[0] < 3:
    raise Exception("Use Python 3: python3 " + sys.argv[0])

import requests
import re

# Specify/get the url to scrape (included a default for easier testing)
### there is no input validation taking place here ###
url = input("What is the URL? ") #or 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
print()

# CVE regular expression
cve_pattern = r'CVE-\d{4}-\d{4,7}'

# query the website and return the html
page = requests.get(url)

# initialize count to 0
count = 0

# search for CVE references using RegEx
cves = re.findall(cve_pattern, page.text)

# After several days of fiddling, I was still getting double and sometimes triple results on certain pages.
# The next line converts the list returned from re.findall to a dictionary (which retains order) to get
# unique values, then back to a list. (thanks to https://stackoverflow.com/a/48028065/9205677)
# I found order to be important sometimes, as the most severely rated CVEs are often listed first on the page.
cves = list(dict.fromkeys(cves))

# print the results to the screen
for cve in cves:
    print(cve)
    count += 1

print()
print(str(count) + " CVEs found at " + url)
print()
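The dict.fromkeys round trip is the key trick here: it drops duplicates while preserving first-seen order, unlike a plain set(). A quick illustration with made-up values:

matches = ["CVE-2020-0001", "CVE-2020-0002", "CVE-2020-0001", "CVE-2020-0003"]
print(list(dict.fromkeys(matches)))   # ['CVE-2020-0001', 'CVE-2020-0002', 'CVE-2020-0003']
print(sorted(set(matches)))           # also unique, but the original page order is lost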

Obfuscating a string that can afterwards be used in a web address

I have a small simple server in Flask, and I'd like to be able to route to a user's page with the following route:
@app.route("/something/<string:username>", methods=["GET"])
When it's a cleartext username it's not a problem; however, I want to add simple obfuscation so that, given a key, it produces a new string that can still be used in a web address.
I tried my luck with several methods I found on Stack Overflow, but the output strings have various issues, like non-ASCII characters or characters that give me issues in the routing (like a / which confuses Flask).
Ideally I'd like to have two functions, obfuscate(key, string) and deobfuscate(key, string), so I'll be able to use them like so:
@app.route("/something/<string:username>", methods=["GET"])
def user_page(username):
    # username is an obfuscated string
    clear_username = deobfuscate(MY_KEY, username)
    return flask.make_response("Hi {}".format(clear_username), 200)

...
...

def create_user(username):
    # username is a clear string
    save_to_database(username)
    return obfuscate(MY_KEY, username)
To summarize, the obfuscation needs to be simple but good enough that you won't be able to figure it out by looking at the URL, and two-way so that I can figure out what the original string was and print it out.
I ended up solving the issue with itsdangerous, which is a dependency of Flask so I have it on my server anyway.
As the example here shows:
>>> from itsdangerous import URLSafeSerializer
>>> s = URLSafeSerializer('secret-key')
>>> s.dumps([1, 2, 3, 4])
'WzEsMiwzLDRd.wSPHqC0gR7VUqivlSukJ0IeTDgo'
>>> s.loads('WzEsMiwzLDRd.wSPHqC0gR7VUqivlSukJ0IeTDgo')
[1, 2, 3, 4]
It's safe to assume I won't have any surprises as the docstring says:
Works like :class:Serializer but dumps and loads into a URL safe string consisting of the upper and lowercase characters of the alphabet as well as _, - and ..
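Wrapping this in the two functions the question asked for is straightforward. A minimal sketch, assuming itsdangerous is available (the function names simply mirror the ones from the question):

from itsdangerous import URLSafeSerializer

def obfuscate(key, string):
    # Produce a URL-safe token from the cleartext string.
    return URLSafeSerializer(key).dumps(string)

def deobfuscate(key, token):
    # Reverse obfuscate(); raises itsdangerous.BadSignature if the token was tampered with.
    return URLSafeSerializer(key).loads(token)

# Example usage:
# token = obfuscate("secret-key", "alice")   # a short URL-safe token
# deobfuscate("secret-key", token)           # 'alice'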

Having issues with Python xpath scraping

I'm back again with a question for the wonderful people here :)
I've recently begun getting back into Python (50% done at Codecademy, lol) and decided to make a quick script for web-scraping the spot price of gold in CAD. This will eventually be a part of a much bigger script... but I'm VERY rusty and thought it would be a good project.
My issue:
I have been following the guide over at http://docs.python-guide.org/en/latest/scenarios/scrape/ to accomplish my goal; however, my script always returns/prints
<Element html at 0xRANDOM>
with RANDOM being a (I assume) random hex number. This happens no matter what website I seem to use.
My Code:
#!/bin/python
# Scrape current gold spot price in CAD
from lxml import html
import requests

def scraped_price():
    page = requests.get('http://goldprice.org/gold-price-canada.html')
    tree = html.fromstring(page.content)
    print "The full page is: ", tree  # added for debug WHERE ERROR OCCURS
    bid = tree.xpath("//span[@id='gpotickerLeftCAD_price']/text()")
    print "Scraped content: ", bid
    return bid

gold_scraper = scraped_price()
My research:
1) www.w3schools.com/xsl/xpath_syntax.asp
This is where I figured out to use '//span' to find all 'span' elements and then use the @id predicate to narrow it down to the one I need.
2)Scraping web content using xpath won't work
This makes me think I simply have a bad tree.xpath setup. However I cannot seem to figure out where or why.
Any assistance would be greatly appreciated.
<Element html at 0xRANDOM>
What you see printed is the string representation of lxml.html's Element class. If you want to see the actual HTML content, use tostring():
print(html.tostring(tree, pretty_print=True))
You are also getting Scraped content: [] printed, which means that no elements matched the locator. And if you look at the previously printed HTML, there is actually no element with id="gpotickerLeftCAD_price" in the downloaded source.
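You can confirm that directly on the parsed tree; a quick sketch reusing the tree variable from the question's code:

# Both queries come back empty because the element is absent from the static HTML.
print(tree.xpath("//span[@id='gpotickerLeftCAD_price']"))          # []
print(tree.xpath("//span[@id='gpotickerLeftCAD_price']/text()"))   # []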
The prices on this particular site are retrieved dynamically with continuous JSONP GET requests issued periodically. You can either look into simulating these requests, or stay on a higher level automating a browser via selenium. Demo (using PhantomJS headless browser):
>>> import time
>>> from selenium import webdriver
>>>
>>> driver = webdriver.PhantomJS()
>>> driver.get("http://goldprice.org/gold-price-canada.html")
>>> while True:
...     print(driver.find_element_by_id("gpotickerLeftCAD_price").text)
...     time.sleep(1)
...
1,595.28
1,595.28
1,595.28
1,595.28
1,595.28
1,595.19
...

Python lxml xpath no output

For educational purposes I am trying to scrape this page using lxml and requests in Python.
Specifically I just want to print the research areas of all the professors on the page.
This is what I have done till now
import requests
from lxml import html

response = requests.get('http://cse.iitkgp.ac.in/index.php?secret=d2RkOUgybWlNZzJwQXdLc28wNzh6UT09')
parsed_body = html.fromstring(response.content)
for row in parsed_body.xpath('//div[@id="maincontent"]//tr[position() mod 2 = 1]'):
    for column in row.xpath('//td[@class="fcardcls"]/tr[2]/td/font/text()'):
        print column.strip()
But it is not printing anything. I was struggling quite a bit with xpaths and was initially using the copy-XPath feature in Chrome. I followed what was done in the following SO questions/answers, cleaned up my code quite a bit, and got rid of tbody in the xpaths. Still the code returns a blank.
1. Empty List Returned
2. Python-lxml-xpath problem
First of all, the main content with the desired data inside is loaded from a different endpoint via an XHR request - simulate that in your code.
Here is the complete working code printing names and a list of research areas per name:
import requests
from lxml import html

response = requests.get('http://cse.iitkgp.ac.in/faculty4.php?_=1450503917634')
parsed_body = html.fromstring(response.content)

for row in parsed_body.xpath('.//td[@class="fcardcls"]'):
    name = row.findtext(".//a[@href]/b")
    name = ' '.join(name.split())  # getting rid of multiple spaces

    research_areas = row.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0].split(", ")

    print(name, research_areas)
The idea here is to use the fact that all "professor blocks" are located in td elements with class="fcardcls". For every block, get the name from the bold link text, and the research areas from the text node that follows the bold "Research Areas: " label.
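To make the following-sibling::text() step concrete, here is a minimal, self-contained sketch run against made-up markup shaped like one of those blocks:

from lxml import html

snippet = html.fromstring(
    '<table><tr><td class="fcardcls">'
    '<b>Research Areas: </b>Algorithms, Machine Learning'
    '</td></tr></table>')
row = snippet.xpath('.//td[@class="fcardcls"]')[0]
# The text node right after the bold "Research Areas: " label, split on commas.
print(row.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0].split(", "))
# ['Algorithms', 'Machine Learning']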

Regex to extract URLs from href attribute in HTML with Python [duplicate]

Considering a string as follows:
string = "<p>Hello World</p>More ExamplesEven More Examples"
How could I, with Python, extract the URLs inside the anchor tag's href? Something like:
>>> url = getURLs(string)
>>> url
['http://example.com', 'http://2.example']
import re

url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)
print(urls)
# ['http://example.com', 'http://2.example']
The best answer is...
Don't use a regex
The expression in the accepted answer misses many cases. Among other things, URLs can have unicode characters in them. The regex you want is here, and after looking at it, you may conclude that you don't really want it after all. The most correct version is ten-thousand characters long.
Admittedly, if you were starting with plain, unstructured text with a bunch of URLs in it, then you might need that ten-thousand-character-long regex. But if your input is structured, use the structure. Your stated aim is to "extract the URL, inside the anchor tag's href." Why use a ten-thousand-character-long regex when you can do something much simpler?
Parse the HTML instead
For many tasks, Beautiful Soup will be far faster and easier to use:
>>> from bs4 import BeautifulSoup as Soup
>>> html = Soup(s, 'html.parser') # Soup(s, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://2.example']
If you prefer not to use external tools, you can also directly use Python's own built-in HTML parsing library. Here's a really simple subclass of HTMLParser that does exactly what you want:
from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        HTMLParser.__init__(self)
        if output_list is None:
            self.output_list = []
        else:
            self.output_list = output_list

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))
Test:
>>> p = MyParser()
>>> p.feed(s)
>>> p.output_list
['http://example.com', 'http://2.example']
You could even add a method that accepts a string, calls feed, and returns output_list. This is a vastly more powerful and extensible way to extract information from HTML than regular expressions.
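A minimal sketch of that suggestion, with the method name get_urls chosen purely for illustration:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    # Same parser as above, plus the suggested convenience method.
    def __init__(self):
        HTMLParser.__init__(self)
        self.output_list = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))

    def get_urls(self, html_text):
        # Feed the string and hand back the collected hrefs in one call.
        self.feed(html_text)
        return self.output_list

# Example, using the string from the question:
# MyParser().get_urls(string)  -> ['http://example.com', 'http://2.example']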