How do I sort Flask/Werkzeug routes in the order in which they will be matched?

I have a function that I'm using to list routes for my Flask site and would like to be sure it is sorting them in the order in which they will be matched by Flask/Werkzeug.
Currently I have
def routes(verbose, wide, nostatic):
    """List routes supported by the application"""
    for rule in sorted(app.url_map.iter_rules()):
        if nostatic and rule.endpoint == 'static':
            continue
        if verbose:
            fmt = "{:45s} {:30s} {:30s}" if wide else "{:35s} {:25s} {:25s}"
            line = fmt.format(rule, rule.endpoint, ','.join(rule.methods))
        else:
            fmt = "{:45s}" if wide else "{:35s}"
            line = fmt.format(rule)
        print(line)
but this just seems to sort the routes in the order I've defined them in my code.
How do I sort Flask/Werkzeug routes in the order in which they will be matched?
Another approach: bind a host, match a path, and see which rule applies. (NotFound comes from werkzeug.exceptions.)
from werkzeug.exceptions import NotFound

urls = app.url_map.bind(host)
try:
    m = urls.match(match, "GET")
    z = '{}({})'.format(m[0], ','.join(["{}='{}'".format(arg, m[1][arg]) for arg in m[1]] +
                                       ["host='{}'".format(host)]))
    return z
except NotFound:
    return

On each request, Flask/Werkzeug reorders the rules via self.map.update() when needed, using route weights (see my other answer, URL routing conflicts for static files in Flask dev server).
You can see that the Map.update method sorts _rules (see https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/routing.py#L1313).
So you can do the same thing yourself:
sorted(current_app.url_map._rules, key=lambda x: x.match_compare_key())
Alternatively, once all routes are loaded and at least one request has been handled (i.e. Map.build or Map.match has been called), app.url_map._rules is already sorted. Inside a request, app.url_map._rules is always sorted, because Map.match was called before dispatching (to match a rule to a view).
You can also see that Map.update is called in Map.iter_rules, so you can simply use current_app.url_map.iter_rules() (see https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/routing.py#L1161).
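To illustrate, a minimal sketch using a bare Werkzeug Map (the routes here are made up; with Flask you would iterate over app.url_map the same way):

```python
from werkzeug.routing import Map, Rule

# A toy map; in a Flask app this would be app.url_map.
url_map = Map([
    Rule('/user/<int:user_id>', endpoint='user'),
    Rule('/', endpoint='index'),
    Rule('/static/<path:filename>', endpoint='static'),
])

# iter_rules() calls Map.update() first, which sorts _rules by
# match_compare_key(), i.e. into the order rules are matched.
for rule in url_map.iter_rules():
    print(rule.rule, '->', rule.endpoint)
```

Rules without converters sort ahead of rules with arguments of comparable weight, so this reflects the actual matching order rather than definition order.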

Related

Python3 - filter text before writing to a file

This is a function that is called by my main.py script.
Here is the issue: I have a file with a list of IP addresses, and I query this API to get reverse DNS lookups of those IPs against the (url) variable, then spit out response.text.
Well, in the response.text output,
I get No DNS A records found for 96.x.x.x
Other data I get is just a dnsname: 'subdomain.domain.com'
How do I filter my results to NOT show every 'No DNS A records found for (whatever IP shows)' line?
#!/usr/bin/env python3
import requests
import pdb

#function to use hackertarget api for reverse dns lookup
def dns_finder(file_of_ips="endpointslist"):
    print('\n\n########## Finding DNS Names ##########\n')
    targets = open("TARGETS","w")
    with open(file_of_ips) as f:
        for line in f:
            line.strip()
            url = 'https://api.hackertarget.com/reverseiplookup/?q=' + line
            response = requests.get(url)
            #pdb.set_trace()
            hits = response.text
            print(hits)
            targets.write(hits)
            return hits
You can use the re module to check the response to make sure it is a valid DNS hostname (using a regular expression pattern):
import re
# ... rest of your code
# before writing to the targets file
if re.match(r'^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])$', hits) is not None:
    targets.write(hits)
The re.match() function will return a match object if the given string matches the regex. Otherwise, it will return None. So if you check to make sure it is not None, then it should have matched the regex (i.e. it matches the pattern for a DNS hostname). You could alter the regex or abstract the check to a more generic isValidHit(hit) and put any functionality you want there to check for more than just DNS hostnames.
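A sketch of that abstraction; the helper name isValidHit is the hypothetical one suggested above, and the compiled pattern is the DNS-hostname regex from the answer:

```python
import re

# The hostname pattern from the answer, compiled once for reuse.
HOSTNAME_RE = re.compile(
    r'^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*'
    r'([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])$'
)

def isValidHit(hit):
    """Return True if the line looks like a DNS hostname."""
    return HOSTNAME_RE.match(hit.strip()) is not None
```

Error messages like "No DNS A records found for 96.x.x.x" contain spaces, so they never match the hostname pattern and are filtered out.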
Note
You want to be careful with your line targets = open("TARGETS","w"). That opens the file for writing, but you need to manually close the file elsewhere and handle any errors involved with IO to it.
You should try to use with open(filename) as blah: whenever dealing with file IO. It is much safer and easier to maintain.
Edit
Simple regex for an IPv4 address found here.
Added more explanation of the re.match function.
Edit 2
I just re-read your post and saw that you are actually not looking for IPv4 addresses, but actually DNS hostnames...
In that case you could simply switch the regex pattern to ^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])$. Source for the DNS Hostname regex.
@Tim Klein's answer is great if you want to keep only the domain names. However, as stated in the question, if you just want to remove error messages like "No DNS A records found for ...", then you can use the str.startswith method:
#!/usr/bin/env python3
import requests
import pdb

#function to use hackertarget api for reverse dns lookup
def dns_finder(file_of_ips="endpointslist"):
    print('\n\n########## Finding DNS Names ##########\n')
    targets = open("TARGETS","w")
    with open(file_of_ips) as f:
        for line in f:
            line = line.strip()  # strip() returns a new string; assign it
            url = 'https://api.hackertarget.com/reverseiplookup/?q=' + line
            response = requests.get(url)
            #pdb.set_trace()
            hits = response.text
            print(hits)
            if not hits.startswith("No DNS A records found for"):
                targets.write(hits)
            return hits  # why iterate if you return here?
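Addressing that last comment: a sketch that separates the filtering (pure string handling) from the request loop, so hits from every line are collected instead of returning on the first iteration. The helper name filter_hits is made up for illustration:

```python
def filter_hits(raw_text):
    """Keep only lines that are not the API's error message."""
    return [line for line in raw_text.splitlines()
            if line and not line.startswith("No DNS A records found for")]

# In dns_finder you would then do something like:
#     all_hits = []
#     for line in f:
#         ...
#         all_hits.extend(filter_hits(response.text))
#     return all_hits  # after the loop, not inside it
```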

'request' module not pinging web pages correctly when they are contained in list

I am using the request module to see if items in a list of words are an article at https://www.britannica.com. My current code is:
import requests

words = ['no', 'yes', 'thermodynamics', 'london', 'Max-Factor', 'be']
for word in words:
    request = requests.head('https://www.britannica.com/topic/' + word.lower())
    if request.status_code == 200:
        print(">EXISTS")
        print('https://www.britannica.com/topic/' + word.lower())
        print("<")
    else:
        print(">DOESNT EXIST")
        print('https://www.britannica.com/topic/' + word.lower())
        print("<")
'Be' is the only string that prints 'EXISTS', but 'thermodynamics', 'london', and 'Max-Factor' also exist, yet the program prints 'DOESNT EXIST' for them. If I do the operation on 'thermodynamics' alone, it correctly prints 'EXISTS'. What is the reason for the discrepancy, and is there a work-around? Possibly the loading times of the various webpages ('Be' having the smallest)?
Apparently, britannica.com uses redirects, probably for load balancing, so you'll often get status 301 instead of 200. The requests module can follow redirects if you use:
request = requests.head('https://www.britannica.com/topic/' + word.lower(),
                        allow_redirects=True)

Determine the requested content type?

I'd like to write a Django view which serves out variant content based on what's requested. For example, for "text/xml", serve XML, for "text/json", serve JSON, etc. Is there a way to determine this from a request object? Something like this would be awesome:
def process(request):
    if request.type == "text/xml":
        pass
    elif request.type == "text/json":
        pass
    else:
        pass
Is there a property on HttpRequest for this?
The 'Content-Type' header indicates the media type sent in the HTTP request. It is used for requests that carry a body (POST, PUT).
'Content-Type' should not be used to indicate the preferred response format; the 'Accept' header serves that purpose. To access it in Django, use: HttpRequest.META.get('HTTP_ACCEPT')
See more detailed description of these headers
HttpRequest.META, more specifically HttpRequest.META.get('HTTP_ACCEPT'), and not HttpRequest.META.get('CONTENT_TYPE') as mentioned earlier.
As said in other answers, this information is located in the Accept request header, available in the request as HttpRequest.META['HTTP_ACCEPT'].
However, there is not just one requested content type; this header is often a list of accepted/preferred content types. That list can be a bit annoying to exploit properly. Here is a function that does the job:
import re

def get_accepted_content_types(request):
    def qualify(x):
        parts = x.split(';', 1)
        if len(parts) == 2:
            match = re.match(r'(^|;)q=(0(\.\d{,3})?|1(\.0{,3})?)(;|$)',
                             parts[1])
            if match:
                return parts[0], float(match.group(2))
        return parts[0], 1
    raw_content_types = request.META.get('HTTP_ACCEPT', '*/*').split(',')
    qualified_content_types = map(qualify, raw_content_types)
    return (x[0] for x in sorted(qualified_content_types,
                                 key=lambda x: x[1], reverse=True))
For instance, if request.META['HTTP_ACCEPT'] is equal to "text/html;q=0.9,application/xhtml+xml,application/xml;q=0.8,*/*;q=0.7", this will yield ['application/xhtml+xml', 'text/html', 'application/xml', '*/*'] (actually a generator over those values, not a list).
Then you can iterate over the resulting list to select the first content type you know how to respond properly.
Note that this function should work for most cases, but it does not handle cases such as q=0, which means "Not acceptable".
Sources: HTTP Accept header specification and Quality Values specification
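To pick a response format from that ordered list, something like this works (the helper name choose_content_type is made up; supported is whatever formats your view can actually render):

```python
def choose_content_type(accepted, supported):
    """Return the first accepted type we can produce; '*/*' falls back
    to our preferred (first) supported type."""
    for content_type in accepted:
        if content_type in supported:
            return content_type
        if content_type == '*/*':
            return supported[0]
    return None

accepted = ['application/xhtml+xml', 'text/html', 'application/xml', '*/*']
print(choose_content_type(accepted, ['text/html', 'application/json']))
```

In a view you would pass list(get_accepted_content_types(request)) as accepted and branch on the result.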
In Django 1.10+, you can now use request.content_type, as mentioned in the docs.

list of 2ld.1ld (2nd level domain.1st(top)level domain) names?

I am looking for a list of second-level.top-level domain names (it doesn't matter if it's not complete; it just needs to be big, as it's for generating dummy data).
I'm looking for a list like
.net.nz
.co.nz
.edu.nz
.govt.nz
.com.au
.govt.au
.com
.net
Any ideas where I can locate such a list?
There are answers here. Most of them relate to the use of http://publicsuffix.org/, and implementations using it were given in some languages, like Ruby.
To get all the ICANN domains, this python code should work for you:
import requests

url = 'https://publicsuffix.org/list/public_suffix_list.dat'
page = requests.get(url)

icann_domains = []
for line in page.text.splitlines():
    if 'END ICANN DOMAINS' in line:
        break
    elif line.startswith('//'):
        continue
    else:
        domain = line.strip()
        if domain:
            icann_domains.append(domain)

print(len(icann_domains))  # 7334 as of Nov 2018
Remove the break statement to get private domains as well.
Be careful as you will get some domains like this: *.kh (e.g. http://www.mptc.gov.kh/dns_registration.htm). The * is a wildcard.
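If you need to match domains against entries that may contain wildcards, a small illustrative helper (the name matches_suffix is made up; it compares labels right to left, treating '*' as any single label):

```python
def matches_suffix(domain, suffix):
    """True if domain ends with the given public-suffix entry.
    A '*' label in the suffix matches any one label in the domain."""
    d = domain.lower().split('.')
    s = suffix.lower().split('.')
    if len(d) < len(s):
        return False
    tail = d[-len(s):]
    return all(sp == '*' or sp == dp for sp, dp in zip(s, tail))

print(matches_suffix('www.mptc.gov.kh', '*.kh'))  # True
```

For production use, a library such as publicsuffix2 or tldextract handles the full list (including '!' exception rules, which this sketch ignores).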

Django: creating/modifying the request object

I'm trying to build a URL-alias app which allows the user to create aliases for existing URLs on his website.
I'm trying to do this via middleware, where request.META['PATH_INFO'] is checked against database records of aliases:
try:
    src = request.META['PATH_INFO']
    alias = Alias.objects.get(src=src)
    view = get_view_for_this_path(request)
    return view(request)
except Alias.DoesNotExist:
    pass
return None
However, for this to work correctly it is vitally important that (at least) PATH_INFO is changed to the destination path.
Now there are some snippets which allow the developer to create test request objects (http://djangosnippets.org/snippets/963/, http://djangosnippets.org/snippets/2231/), but these state that they are intended for testing purposes.
Of course, it could be that these snippets are fit for use in a live environment, but my knowledge of Django request processing is too undeveloped to assess this.
Instead of the approach you're taking, have you considered the Redirects app?
It won't invisibly alias the path /foo/ to return the view bar(), but it will redirect /foo/ to /bar/
(posted as answer because comments do not seem to support linebreaks or other markup)
Thanks for the advice; I have the same feeling about modifying request attributes. There must be a reason the Django manual states that they should be considered read-only.
I came up with this middleware:
def process_request(self, request):
    try:
        obj = A.objects.get(src=request.path_info.rstrip('/'))  # The alias record.
        view, args, kwargs = resolve_to_func(obj.dst + '/')  # Modified http://djangosnippets.org/snippets/2262/
        request.path = request.path.replace(request.path_info, obj.dst)
        request.path_info = obj.dst
        request.META['PATH_INFO'] = obj.dst
        request.META['ROUTED_FROM'] = obj.src
        request.is_routed = True
        return view(request, *args, **kwargs)
    except A.DoesNotExist:  # No alias for this path.
        request.is_routed = False
    except TypeError:  # View does not exist.
        pass
    return None
But, considering the objections against modifying the requests' attributes, wouldn't it be a better solution to just skip that part, and only add the is_routed and ROUTED_TO (instead of routed from) parts?
Code that relies on the original path could then use that key from META.
Doing this using URLConfs is not possible, because this aliasing is aimed at enabling the end-user to configure his own URLs, with the assumption that the end-user has no access to the codebase or does not know how to write his own URLConf.
Though it would be possible to write a function that converts a user-readable-editable file (XML for example) to valid Django urls, it feels that using database records allows a more dynamic generation of aliases (other objects defining their own aliases).
Sorry to necro-post, but I just found this thread while searching for answers. My solution seems simpler. Maybe a) I'm depending on newer django features or b) I'm missing a pitfall.
I encountered this because there is a bot named "Mediapartners-Google" which asks for pages with URL parameters still encoded from a naive scrape (or double-encoded, depending on how you look at it), i.e. I have 404s in my log from it that look like:
1.2.3.4 - - [12/Nov/2012:21:23:11 -0800] "GET /article/my-slug-name%3Fpage%3D2 HTTP/1.1" 1209 404 "-" "Mediapartners-Google
Normally I'd just ignore a broken bot, but this one I want to appease because it ought to better target our ads (it's Google AdSense's bot), resulting in better revenue - if it can see our content. Rumor is it doesn't follow redirects, so I wanted a solution similar to the original question's. I do not want regular clients accessing pages by these broken URLs, so I detect the user agent. Other applications probably won't do that.
I agree a redirect would normally be the right answer.
My (complete?) solution:
from django.http import QueryDict
from django.core.urlresolvers import Resolver404, resolve

class MediapartnersPatch(object):
    def process_request(self, request):
        # short-circuit asap
        if request.META['HTTP_USER_AGENT'] != 'Mediapartners-Google':
            return None
        idx = request.path.find('?')
        if idx == -1:
            return None
        oldpath = request.path
        newpath = oldpath[0:idx]
        try:
            url = resolve(newpath)
        except Resolver404:  # resolve() raises Resolver404, not NoReverseMatch
            return None
        request.path = newpath
        request.GET = QueryDict(oldpath[idx+1:])
        response = url.func(request, *url.args, **url.kwargs)
        response['Link'] = '<%s>; rel="canonical"' % (oldpath,)
        return response