Remove the domain name from URLs - django

I need to remove the domain name from URLs with different schemes.
Example URLs:
http://www.example.org/cat1/page1
example.org/cat1/page1
https://www.example.org/cat1/page1
Desired outcome:
cat1/page1
It can be done either in a Django template or in the view.

Use the urlparse module (in Python 3 it is urllib.parse, so the import becomes from urllib.parse import urlparse):
>>> from urlparse import urlparse
>>> o = urlparse('http://www.example.org/cat1/page1')
>>> o.path
'/cat1/page1'
Note that example.org/cat1/page1 is itself a valid path: without a scheme, urlparse treats the whole string as the path, so it cannot remove the domain from it. As a workaround, you can manually add the protocol to the URL string:
>>> url = 'example.org/cat1/page1'
>>> if '//' not in url:
...     url = 'http://' + url
...
>>> o = urlparse(url)
>>> o.path
'/cat1/page1'
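As a minimal sketch, the two steps above can be combined into one helper for Python 3, where urlparse lives in urllib.parse. The name strip_domain is hypothetical, and stripping the leading slash is an assumption made to match the desired outcome cat1/page1 from the question.

```python
from urllib.parse import urlparse


def strip_domain(url):
    """Return the path part of url, without scheme, domain, or leading slash.

    strip_domain is a hypothetical helper name, not part of any library.
    """
    # Without a scheme, urlparse would treat the domain as part of the path,
    # so add one first (same workaround as in the answer above).
    if '//' not in url:
        url = 'http://' + url
    return urlparse(url).path.lstrip('/')


paths = [
    strip_domain('http://www.example.org/cat1/page1'),
    strip_domain('example.org/cat1/page1'),
    strip_domain('https://www.example.org/cat1/page1'),
]
print(paths)
```

In a Django project, a function like this could be called from the view, or wrapped in a custom template filter if the stripping has to happen in the template.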

The request object also has this information:
https://docs.djangoproject.com/en/1.7/ref/request-response/#module-django.http
HttpRequest.path
A string representing the full path to the requested page, not including the domain.
Example: "/music/bands/the_beatles/"
This only gives you the path of the current page, so it might not work in your situation.

Related

How to get URI template in django?

I need to get the original URI template (with its regex) from the resolve function in a view or middleware. For example:
def view(request):
    uri_path = resolve(request...)
    # uri_path must equal something like 'articles/<int:year>/<int:month>/'
    ...
Is this possible? I didn't find any information about this.
You are probably looking for PATH_INFO from the request object.
Try this -
request.META['PATH_INFO']
You can inspect the other information available in request.META:
meta = request.META
for k, v in meta.items():
    print(f"{k}\t{v}")
To get the resolved route, try the following:
from django.urls import resolve
r = resolve(request.META['PATH_INFO'])
print(r.route)  # prints something like office/employee/<int:eid>
Doc
So, I've found what I was searching for in the documentation. The problem is that this API is new in Django 2.2, and I have an older Django.

Regular expression to find precise pdf links in a webpage

Given url='http://normanpd.normanok.gov/content/daily-activity', the website lists three types of documents: arrests, incidents, and case summaries. I was asked to use regular expressions in Python to discover the URL strings of all the Incident PDF documents.
The pdfs are to be downloaded in a defined location.
I have gone through the link and found that Incident pdf files URLs are in the form of:
normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
I have written this code:
import re
import urllib.request
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$',text)
But the values in the urls list are empty.
I am a beginner with Python 3 and regular expressions. Can anyone help me?
This is not an advisable method. Instead, use an HTML parsing library like bs4 (BeautifulSoup) to find the links, and use a regex only to filter the results.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)'))
for el in links:
    print("http://normanpd.normanok.gov" + el['href'])
Output:
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf
But if you were asked to use only regexes, then try something simpler:
import urllib.request
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)',text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
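To see the regex-only approach work without touching the network, here is a small sketch that runs against a hand-written snippet. The sample_html string is a hypothetical stand-in for the real page, built from the URL form quoted in the question. The pattern is a slightly tightened variant of the one above, not the original: it uses [^"\s] instead of . so a match cannot spill from one link into the next. The pattern in the question likely failed because \s matches whitespace, not the literal %20 in the hrefs, and because $ without re.MULTILINE only matches at the very end of the text.

```python
import re
from urllib.parse import unquote, urljoin

# Hypothetical stand-in for the downloaded page; the hrefs follow the URL
# form quoted in the question.
sample_html = """
<a href="/filebrowser_download/657/2017-02-19%20Daily%20Arrest%20Summary.pdf">Arrests</a>
<a href="/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf">Incidents</a>
"""

# [^"\s] keeps each match inside a single href value, so it cannot run from
# one link into the next the way .+? could if two links shared a line.
pattern = re.compile(r'/filebrowser_download/[^"\s]+?Daily%20Incident[^"\s]*?\.pdf')

base = "http://normanpd.normanok.gov"
incident_urls = [urljoin(base, path) for path in pattern.findall(sample_html)]

# unquote turns %20 back into spaces, which is handy for local file names.
file_names = [unquote(u.rsplit("/", 1)[-1]) for u in incident_urls]

print(incident_urls)
print(file_names)
```

Each URL in incident_urls could then be passed to urllib.request.urlretrieve together with a path in the chosen download location.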
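Using BeautifulSoup, this is an easy way (open_page is the HTML downloaded from url):
from urllib.parse import urljoin
soup = BeautifulSoup(open_page, 'html.parser')
links = []
for link in soup.find_all('a', href=True):  # skip anchors without an href
    current = link['href']
    if current.endswith('pdf') and "Incident" in current:
        # urljoin resolves the href against the page URL
        links.append(urljoin(url, current))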

How to extract URLs matching a pattern

I'm trying to extract URLs from a webpage with the following pattern :
'http://www.realclearpolitics.com/epolls/????/governor/??/-.html'
My current code extracts all the links. How could I change my code to only extract URLs that match the pattern? Thank you!
import requests
from bs4 import BeautifulSoup
def find_governor_races(html):
    url = html
    base_url = 'http://www.realclearpolitics.com/'
    page = requests.get(html).text
    soup = BeautifulSoup(page,'html.parser')
    links = []
    for a in soup.findAll('a', href=True):
        links.append(a['href'])
find_governor_races('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html')
You can pass a compiled regular expression as the href argument to .find_all(). Escape the literal dots so they do not match arbitrary characters:
import re
pattern = re.compile(r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html")
links = soup.find_all("a", href=pattern)
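The effect of such a pattern can be checked on plain strings before wiring it into BeautifulSoup. The sketch below is a stricter variant, under the assumption that ???? in the question stands for a four-digit year and ?? for a two-letter lowercase state code. The hrefs in the list are hypothetical examples, not links taken from the real page.

```python
import re

# Assumed reading of the question's pattern: four-digit year, two-letter
# state code, then a single file name ending in .html.
pattern = re.compile(
    r"^http://www\.realclearpolitics\.com/epolls/\d{4}/governor/[a-z]{2}/[^/]+\.html$"
)

# Hypothetical hrefs of the kind such a page might contain.
hrefs = [
    "http://www.realclearpolitics.com/epolls/2010/governor/ca/california_governor_whitman_vs_brown-1113.html",
    "http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html",
    "http://www.realclearpolitics.com/epolls/2010/senate/ca/california_senate_boxer_vs_fiorina-1094.html",
]

# Keep only the hrefs that match the whole pattern.
governor_races = [h for h in hrefs if pattern.match(h)]
print(governor_races)
```

Only the first href survives: the second has no state segment and the third is a senate race. The same compiled pattern could be passed as href=pattern to find_all.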

How to set static_url_path in Flask application

I want to do something like this:
app = Flask(__name__)
app.config.from_object(mypackage.config)
app.static_url_path = app.config['PREFIX']+"/static"
When I try:
print app.static_url_path
I get the correct static_url_path.
But in my templates, when I use url_for('static'), the HTML generated by Jinja2 still has the default static URL path /static, without the PREFIX that I added.
If I hardcode the path like this:
app = Flask(__name__, static_url_path='PREFIX/static')
It works fine. What am I doing wrong?
Flask creates the URL route when you create the Flask() object. You'll need to re-add that route:
# remove old static map
url_map = app.url_map
try:
    for rule in url_map.iter_rules('static'):
        url_map._rules.remove(rule)
except ValueError:
    # no static view was created yet
    pass
# register new; the same view function is used
app.add_url_rule(
    app.static_url_path + '/<path:filename>',
    endpoint='static', view_func=app.send_static_file)
It'll be easier just to configure your Flask() object with the correct static URL path.
Demo:
>>> from flask import Flask
>>> app = Flask(__name__)
>>> app.url_map
Map([<Rule '/static/<filename>' (HEAD, OPTIONS, GET) -> static>])
>>> app.static_url_path = '/PREFIX/static'
>>> url_map = app.url_map
>>> for rule in url_map.iter_rules('static'):
...     url_map._rules.remove(rule)
...
>>> app.add_url_rule(
...     app.static_url_path + '/<path:filename>',
...     endpoint='static', view_func=app.send_static_file)
>>> app.url_map
Map([<Rule '/PREFIX/static/<filename>' (HEAD, OPTIONS, GET) -> static>])
The accepted answer is correct, but slightly incomplete. It is true that in order to change the static_url_path one must also update the app's url_map by removing the existing Rule for the static endpoint and adding a new Rule with the modified url path. However, one must also update the _rules_by_endpoint property on the url_map.
It is instructive to examine the add() method on the underlying Map in Werkzeug. In addition to adding a new Rule to its ._rules property, the Map also indexes the Rule by adding it to ._rules_by_endpoint. This latter mapping is what is used when you call app.url_map.iter_rules('static'). It is also what is used by Flask's url_for().
Here is a working example of how to completely rewrite the static_url_path, even if it was set in the Flask app constructor.
app = Flask(__name__, static_url_path='/some/static/path')
a_new_static_path = '/some/other/path'
# Set the static_url_path property.
app.static_url_path = a_new_static_path
# Remove the old rule from Map._rules.
for rule in app.url_map.iter_rules('static'):
    app.url_map._rules.remove(rule)  # There is probably only one.
# Remove the old rule from Map._rules_by_endpoint. In this case we can just
# start fresh.
app.url_map._rules_by_endpoint['static'] = []
# Add the updated rule.
app.add_url_rule(f'{a_new_static_path}/<path:filename>',
                 endpoint='static',
                 view_func=app.send_static_file)
Just to complete the answer above: we also need to clear the view_functions entry that maps the endpoint to its view function:
app.view_functions["static"] = None
I think you missed static_folder:
app = Flask(__name__)
app.config.from_object(mypackage.config)
app.static_url_path = app.config['PREFIX']+"/static"
app.static_folder = app.config['PREFIX']+"/static"

How to construct a full URL in django

I need to get Django to send an email that contains a URL like this:
http://www.mysite.org/history/
Where 'history' is obtained like so:
history_url = urlresolvers.reverse('satchmo_order_history')
history_url is a parameter that I pass on to the function that sends the email, and it correctly produces '/history/'. But how do I get the first part? (http://www.mysite.org)
Edit 1
Is there anything wrong or unportable about doing it like this?
history = urlresolvers.reverse('satchmo_order_history')
domain = Site.objects.get_current().domain
history_url = 'http://' + domain + history
If you have access to an HttpRequest instance, you can use HttpRequest.build_absolute_uri(location):
absolute_uri = request.build_absolute_uri(relative_uri)
Alternatively, you can build it using the sites framework (on Python 3, urljoin lives in urllib.parse rather than urlparse):
import urlparse
from django.contrib.sites.models import Site
domain = Site.objects.get_current().domain
absolute_uri = urlparse.urljoin('http://{}'.format(domain), relative_uri)
Re: Edit1
I tend to use urlparse.urljoin, because it's in the standard library and it's technically the most Pythonic way to combine URIs, but I think that your approach is fine too.
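For reference, here is a small sketch of how urljoin behaves, written for Python 3 where it is imported from urllib.parse. The domain and history values are hard-coded stand-ins for Site.objects.get_current().domain and the result of reverse(), so the snippet runs without a Django project.

```python
from urllib.parse import urljoin

domain = "www.mysite.org"  # stand-in for Site.objects.get_current().domain
history = "/history/"      # stand-in for reverse('satchmo_order_history')

# Join the scheme and domain with the reversed path.
absolute_uri = urljoin("http://{}".format(domain), history)
print(absolute_uri)

# A path that starts with "/" replaces any path already on the base URL,
# while a relative path is appended to the base directory.
replaced = urljoin("http://www.mysite.org/shop/", "/history/")
appended = urljoin("http://www.mysite.org/shop/", "history/")
print(replaced)
print(appended)
```

Since reverse() returns paths with a leading slash, the result is always rooted at the domain, which matches what plain string concatenation produces in the question's Edit 1.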