Obfuscating a string that can afterwards be used in a web address - python-2.7

I have a small simple server in Flask, and I'd like to be able to route to a user's page with the following route:
@app.route("/something/<string:username>", methods=["GET"])
When it's a clear-text username there's no problem; however, I want to add simple obfuscation so that, given a key, the username is turned into a new string that can still be used in a web address.
I tried my luck with several methods I found on Stack Overflow, but the output strings had various issues, like non-ASCII characters or characters that break the routing (such as a /, which confuses Flask).
Ideally I'd like to have two functions, obfuscate(key, string) and deobfuscate(key, string), so I'll be able to use them like so:
@app.route("/something/<string:username>", methods=["GET"])
def user_page(username):
    # username is an obfuscated string
    clear_username = deobfuscate(MY_KEY, username)
    return flask.make_response("Hi {}".format(clear_username), 200)
...
...
def create_user(username):
    # username is a clear string
    save_to_database(username)
    return obfuscate(MY_KEY, username)
To summarize, the obfuscation needs to be simple but good enough that you won't be able to figure it out by looking at the URL, and two-way so that I can figure out what the original string was and print it out.

I ended up solving the issue with itsdangerous, which is a dependency of Flask so I have it on my server anyway.
As the example here shows:
>>> from itsdangerous import URLSafeSerializer
>>> s = URLSafeSerializer('secret-key')
>>> s.dumps([1, 2, 3, 4])
'WzEsMiwzLDRd.wSPHqC0gR7VUqivlSukJ0IeTDgo'
>>> s.loads('WzEsMiwzLDRd.wSPHqC0gR7VUqivlSukJ0IeTDgo')
[1, 2, 3, 4]
It's safe to assume I won't have any surprises as the docstring says:
Works like :class:`Serializer` but dumps and loads into a URL safe string consisting of the upper and lowercase characters of the alphabet as well as `_`, `-` and `.`.
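For completeness, the two helper functions from the question can be thin wrappers around that serializer. A minimal sketch, assuming the itsdangerous API shown above:

from itsdangerous import URLSafeSerializer, BadSignature

def obfuscate(key, string):
    # dumps produces a URL-safe, signed token encoding the string
    return URLSafeSerializer(key).dumps(string)

def deobfuscate(key, string):
    # loads verifies the signature and recovers the original string;
    # it raises BadSignature if the token was altered
    return URLSafeSerializer(key).loads(string)

One caveat on the design: the first segment of the token is only base64-encoded JSON, not encrypted, so this hides the username from casual eyes and prevents tampering, but a determined reader could decode it.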

Related

Parse &-character in GET query using Django and Vue

We are using axios to send a GET request to our Django instance, which splits it into search terms and runs the search. This has been working fine until we ran into an edge case. We use urlencode to ensure that strings do not contain spaces or other unsafe characters.
To generalize the issue: we have a TextField called "name" and we want to search for the term "A & B Company". The problem appears when the request reaches Django.
What we expected was that name=A%20&%20B%20Company&field=value would be parsed as name='A & B Company' and field='value'.
Instead, it is parsed as name='A ' 'B Company' and field='value'. The & symbol is incorrectly treated as a separator, despite being encoded.
Is there a way to indicate to Django that certain & symbols in a GET parameter are part of the value, rather than separators between fields?
You can use the urllib library:
class ModelExample(models.Model):
    name = models.TextField()

# in a view...
from urllib.parse import parse_qs

instance = ModelExample(name="name=A%20&%20B%20Company&field=value")
dict_qs = parse_qs(instance.name)
dict_qs now contains a dict with the decoded query string.
You can find more information about urllib.parse here: https://docs.python.org/3/library/urllib.parse.html
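As an aside, the root cause in the example above is that the & inside the value was never percent-encoded as %26; once it is, parse_qs behaves exactly as hoped. A minimal sketch (the parameter names mirror the question):

from urllib.parse import parse_qs, urlencode

# urlencode percent-encodes the inner '&' as '%26'
qs = urlencode({"name": "A & B Company", "field": "value"})
print(qs)            # name=A+%26+B+Company&field=value
print(parse_qs(qs))  # {'name': ['A & B Company'], 'field': ['value']}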

Cleaning up re.search output?

I have been writing a script which recovers CVSS3 scores for me when I enter a vulnerability name. I've pretty much got it working as intended, except for one minor annoying detail.
π ~/Documents/Tools/Scripts ❯ python3 CVSS3-Grabber.py
Paste Vulnerability Name: PHP 7.2.x < 7.2.21 Multiple Vulnerabilities.
Base Score: None
Vector: <re.Match object; span=(27869, 27913), match='CVSS:3.0/AV:N/AC:L/PR:N/UI:R/S:U/C:L/I:N/A:H'>
Temporal Vector: <re.Match object; span=(27986, 28008), match='CVSS:3.0/E:U/RL:O/RC:C'>
As can be seen, the output could be much neater; I would much prefer something like this:
π ~/Documents/Tools/Scripts ❯ python3 CVSS3-Grabber.py
Paste Vulnerability Name: PHP 7.2.x < 7.2.21 Multiple Vulnerabilities.
Base Score: None
Vector: CVSS:3.0/AV:N/AC:L/PR:N/UI:R/S:U/C:L/I:N/A:H
However, I have been struggling to figure out how to make the output nicer. Is there an easy part of the re module that I'm missing that can do this for me? Or perhaps writing the output to a file first would allow me to manipulate the text into the form I need.
Here is my code. I would appreciate any feedback on how to improve, as I have recently gotten back into Python and scripting in general.
import requests
import re
from bs4 import BeautifulSoup
from googlesearch import search

def get_url():
    vuln = input("Paste Vulnerability Name: ") + " tenable"
    for url in search(vuln, tld='com', lang='en', num=1, start=0, stop=1, pause=2.0):
        return url

def get_scores(url):
    response = requests.get(url)
    html = response.text
    cvss3_temporal_v = re.search("CVSS:3.0/E:./RL:./RC:.", html)
    cvss3_v = re.search("CVSS:3.0/AV:./AC:./PR:./UI:./S:./C:./I:./A:.", html)
    cvss3_basescore = re.search("Base Score:....", html)
    print("Base Score: ", cvss3_basescore)
    print("Vector: ", cvss3_v)
    print("Temporal Vector: ", cvss3_temporal_v)

urll = get_url()
get_scores(urll)

### IMPROVEMENTS ###
# Include the base score in output
# Tidy up output
# Vulnerability list?
# modify to accept flags, i.e. python3 CVSS3-Grabber.py -v VULNAME ???
# State whether it is a failing issue or Action point
Thanks!
Don't print the match object. Print the match value.
In Python the value is accessible through the .group() method. If there are no regex subgroups (or you want the entire match, like in this case), don't specify any arguments when you call it:
print("Vector: ", cvss3_v.group())

What does it mean to normalize an email address?

In this example, Django talks about normalizing an email address with self.normalize_email(email) where self is BaseUserManager. When I search for "normalizing emails" it seems to be a practice across all platforms. I see tutorials of how to do it, but nothing really explaining what it is and what it's used for.
For email addresses, foo@bar.com and foo@BAR.com are equivalent; the domain part is case-insensitive according to the RFC specs. Normalizing means providing a canonical representation, so that any two equivalent email strings normalize to the same thing.
The comments on the Django method explain:
Normalize the email address by lowercasing the domain part of it.
One application of normalizing emails is to prevent multiple signups. If your application lets the public sign up, it might attract the "unkind" types, who could attempt to sign up multiple times with the same email address by mixing symbols and upper and lower cases to make variants of the same address.
From Django's repository, the docstring of normalize_email is the following:
Normalize the email address by lowercasing the domain part of it.
What this method does is lowercase the domain part of an email, since that part is case-insensitive. Consider the following examples:
>>> from django.contrib.auth.models import BaseUserManager
>>> BaseUserManager.normalize_email("user@example.com")
'user@example.com'
>>> BaseUserManager.normalize_email("user@EXAMPLE.COM")
'user@example.com'
>>> BaseUserManager.normalize_email("user@example.COM")
'user@example.com'
>>> BaseUserManager.normalize_email("user@EXAMPLE.com")
'user@example.com'
>>> BaseUserManager.normalize_email("user@ExAmPlE.CoM")
'user@example.com'
As you can see, all of these emails are equivalent because the case after the @ is irrelevant.
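For illustration, the core of the method fits in a few lines. This is a simplified sketch of the same idea, not Django's exact source:

def normalize_email(email):
    # Lowercase only the domain part; the local part may be case-sensitive
    try:
        email_name, domain_part = (email or "").strip().rsplit("@", 1)
    except ValueError:
        return email or ""
    return email_name + "@" + domain_part.lower()

print(normalize_email("user@ExAmPlE.CoM"))  # user@example.com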

how to read only URL from txt file in MATLAB

I have a text file containing multiple URLs along with other information about each URL. How can I read the txt file and save only the URLs in an array so I can download them? I want to use
C = textscan(fileId, formatspec);
What should I put in formatspec so that it matches only URLs?
This is not a job for textscan; you should use regular expressions for this. In MATLAB, regexes are described here.
For URLs, also refer here or here for examples in other languages.
Here's an example in MATLAB:
% This string is obtained through textscan or something
str = {...
    'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
    'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};

% find URLs
C = regexpi(str, ...
    ['((http|https|ftp|file)://|www\.|ftp\.)',...
     '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'], 'match');
C{:}
Result:
ans =
'http://www.example.com/index.php?query=test&otherStuf=info'
ans =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
Note that this regex requires you to have the protocol included, or have a leading www. or ftp.. Something like example.com/universal_remote.cgi?redirect= is NOT matched.
You could go on and make the regex cover more and more cases. However, eventually you'll stumble upon the most important conclusion (as made here, for example, which is where I got my regex from): given the full definition of what precisely constitutes a valid URL, there is no single regex able to always match every valid URL. That is, there are valid URLs you can dream up that are not captured by any of the regexes shown.
But please keep in mind that this last statement is more theoretical than practical: those non-matchable URLs are valid but not often encountered in practice :) In other words, if your URLs have a pretty standard form, you're pretty much covered with the regex I gave you.
Now, I fooled around a bit with the Java suggestion by pm89. As I suspected, it is an order of magnitude slower than just a regex, since you introduce another "layer of goo" to the code (in my timings, the difference was about 40x slower, excluding the imports). Here's my version:
import java.net.URL;
import java.net.MalformedURLException;

str = {...
    'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
    'pre--URL garbage example.com/index.php?query=test&otherStuf=info more stuff here'
    'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};

% Attempt to convert each item into an URL.
for ii = 1:numel(str)
    cc = textscan(str{ii}, '%s');
    for jj = 1:numel(cc{1})
        try
            url = java.net.URL(cc{1}{jj})
        catch ME
            % rethrow any non-url related errors
            if isempty(regexpi(ME.message, 'MalformedURLException'))
                throw(ME);
            end
        end
    end
end
Results:
url =
'http://www.example.com/index.php?query=test&otherStuf=info'
url =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
I'm not too familiar with java.net.URL, but apparently, it is also unable to find URLs without leading protocol or standard domain (e.g., example.com/path/to/page).
This snippet can undoubtedly be improved upon, but I would urge you to consider why you'd want to use this longer, inherently slower and far uglier solution :)
As I suspected, you could use java.net.URL, according to this answer.
To implement the same approach in MATLAB:
First read the file into a string, using fileread for example:
str = fileread('Sample.txt');
Then split the text with respect to spaces, using strsplit:
spl_str = strsplit(str);
Finally use java.net.URL to detect the URLs:
for k = 1:length(spl_str)
    try
        url = java.net.URL(spl_str{k})
        % Store or save the URL contents here
    catch e
        % it's not a URL.
    end
end
You can write the URL contents into a file using urlwrite. But first convert the URLs obtained from java.net.URL to char:
url = java.net.URL(spl_str{k});
urlwrite(char(url), 'test.html');
Hope it helps.

Regex to extract URLs from href attribute in HTML with Python [duplicate]

Considering a string as follows:
string = "<p>Hello World</p>More ExamplesEven More Examples"
How could I, with Python, extract the URLs, inside the anchor tag's href? Something like:
>>> url = getURLs(string)
>>> url
['http://example.com', 'http://2.example']
import re

url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)

>>> print urls
['http://example.com', 'http://2.example']
The best answer is...
Don't use a regex
The expression in the accepted answer misses many cases. Among other things, URLs can have unicode characters in them. The regex you want is here, and after looking at it, you may conclude that you don't really want it after all. The most correct version is ten-thousand characters long.
Admittedly, if you were starting with plain, unstructured text with a bunch of URLs in it, then you might need that ten-thousand-character-long regex. But if your input is structured, use the structure. Your stated aim is to "extract the URL, inside the anchor tag's href." Why use a ten-thousand-character-long regex when you can do something much simpler?
Parse the HTML instead
For many tasks, Beautiful Soup will be far faster and easier to use:
>>> from bs4 import BeautifulSoup as Soup
>>> html = Soup(string, 'html.parser')  # Soup(string, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://2.example']
If you prefer not to use external tools, you can also directly use Python's own built-in HTML parsing library. Here's a really simple subclass of HTMLParser that does exactly what you want:
from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        HTMLParser.__init__(self)
        if output_list is None:
            self.output_list = []
        else:
            self.output_list = output_list

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))
Test:
>>> p = MyParser()
>>> p.feed(string)
>>> p.output_list
['http://example.com', 'http://2.example']
You could even create a new method that accepts a string, calls feed, and returns output_list. This is a vastly more powerful and extensible way than regular expressions to extract information from HTML.
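For example, a small convenience wrapper along those lines might look like this (get_urls is a hypothetical name, not part of the original answer):

def get_urls(html_string):
    # Run the HTML through MyParser and hand back the collected hrefs
    parser = MyParser()
    parser.feed(html_string)
    return parser.output_list

print(get_urls(string))  # ['http://example.com', 'http://2.example']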