Specifically, my middleware needs to differentiate between a GET request on:
/admin/app/model/?
/admin/app/model/
URL #1 was initiated with a dangling question mark.
From my experiments, Django's HttpRequest swallows it, and I am unable to differentiate between the two. Is there a way to obtain the raw, unfiltered query string?
? should be escaped as %3F. So maybe you should choose another symbol that doesn't have this problem?
This may not be possible. Typically a Django application is served from behind the WSGI interface, so by the time the request gets to Django it has already been parsed into PATH_INFO (before the ?) and QUERY_STRING (after the ?). When Django runs get_full_path, it just concatenates those two things with a ? in the middle if needed.
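To make the point concrete, here is a minimal sketch of that split-and-rejoin behaviour. It is a simplified model of what is described above, not Django's or any WSGI server's actual code; the function names split_request_target and full_path are made up for illustration.

```python
def split_request_target(target):
    """Split a raw request target into (PATH_INFO, QUERY_STRING).

    Hypothetical helper: models the split on the first '?' only.
    """
    path, _, query = target.partition("?")
    return path, query


def full_path(path, query):
    """Rejoin the two parts, adding '?' only when the query is non-empty."""
    return path + ("?" + query if query else "")


for target in ("/admin/app/model/?", "/admin/app/model/"):
    path, query = split_request_target(target)
    # Both targets yield the same pair, so the dangling '?' is lost here.
    print(repr(path), repr(query), full_path(path, query))
```

Once the target has been reduced to a path and an empty query string, nothing downstream can tell the two requests apart.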
It's also a bad idea: HTTP does not expect URLs to behave differently with a trailing ?, as that just means an empty set of parameters, which is the same thing that the absence of a ? means. As well as being confusing, this may cause interoperability problems, as a proxy or web browser might drop the trailing '?' in the expectation that it should have no effect.
I have a Django form that uses a 'forms.URLField' like local_url1 = URLField(label="First Local URL", required=False). If a user inputs something like 'https://www.google.com' then the field validates without error.
However, if the user puts 'www.google.com' the field fails validation and the user sees an error. This is because the layout of a URL is scheme://host:port/absolute_path and the failing URL is missing the scheme (e.g. https), which Django's URLFieldValidation expects.
I don't care whether my users include the scheme, and neither should my form. Unfortunately, the error from Django is completely useless in indicating what is wrong, and I've had multiple users ask why it says to enter a valid URL. I'm also certain I've lost paying customers because of this.
Is there a way to have all the other validation of a URL take place, but ignore the fact that the scheme is missing? At the very least, can I change the error message to add something like "Did you include http?". I've attempted implementing my own URLField and URLFieldValidation, but unless that's the path I have to take, then that is a different StackOverflow question.
I'm using Django 1.7, by the way. Thanks for any help!
URL/URI scheme list to validate against. If not provided, the default
list is ['http', 'https', 'ftp', 'ftps']. As a reference, the IANA Web
site provides a full list of valid URI schemes.
If the valid URI schemes provided by IANA web are not what you are looking for, then I suggest you create your own field validator.
Remember that URLField is a subclass of CharField. Since www.something.com is OK with you, it's simple to add a regular expression to a regular CharField that checks whether the pattern is correct.
A regular expression like this one, for example, will accept values starting with www. or with http:// or https://, so it works with or without the scheme.
((?:https?\:\/\/|www\.)(?:[-a-z0-9]+\.)*[-a-z0-9]+.*)
www.google.com -- OK
http://www.google.com -- OK
https://www.google.com -- OK
http://google.com -- OK
https://google.com -- OK
However, this will not complain about blahwww.domain.com, so you might enhance it as you like.
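As a rough sketch of how that pattern behaves, here it is run through Python's re module. The anchored variant is my own assumption about one possible enhancement, and the helper name accepts is made up; the check uses search because that is the case where blahwww.domain.com slips through.

```python
import re

# The pattern from the answer, unchanged.
PATTERN = r"((?:https?\:\/\/|www\.)(?:[-a-z0-9]+\.)*[-a-z0-9]+.*)"

unanchored = re.compile(PATTERN)
# Assumed tightening: require the match to begin at the start of the value.
anchored = re.compile(r"^" + PATTERN)


def accepts(regex, value):
    """Return True if the regex matches anywhere in the value."""
    return regex.search(value) is not None


for candidate in ("www.google.com", "https://google.com", "blahwww.domain.com"):
    print(candidate, accepts(unanchored, candidate), accepts(anchored, candidate))
```

Note that the character classes are lowercase only, so mixed-case hostnames would need re.IGNORECASE or a lowercased input.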
Usually to retrieve a resource one uses:
GET http://ws.mydomain.com/resource/123212
But what if your item IDs are HTTP URIs?
GET http://ws.mydomain.com/resource/http://id.someotherdomain.com/SGX.3211
Browsers replace two slashes with one, and the request turns into:
GET http://ws.mydomain.com/resource/http:/id.someotherdomain.com/SGX.3211
which will not work.
URI-encoding the "http://id.someotherdomain.com/SGX.3211" part results in HTTP 400 (Bad Request).
Is there a best practice for handling this?
Edit:
Then, of course, if we needed (I don't at the moment) requests of the form:
resources/ID/collections/ID
and all IDs are HTTP URIs, things get out of hand... Possibly one could do something like this and parse the contents inside the curly braces:
resources/{http://id...}/collections/{http://id...}
Encode the other system's URI, and then pass the value as a query parameter:
GET http://ws.mydomain.com/resource?ref=http%3A%2F%2Fid.someotherdomain.com%2FSGX.3211
Ugly looking, but no one said that URIs used in a REST architecture have to be beautiful. :)
By the way, a GET actually looks like this when it's sent:
GET /resource?ref=http%3A%2F%2Fid.someotherdomain.com%2FSGX.3211 HTTP/1.1
Host: ws.mydomain.com
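For what it's worth, a small sketch of building and reading such a URL with Python's standard library. The parameter name ref and the base URL come from the example above; the helper names resource_url and item_id_from are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def resource_url(item_id, base="http://ws.mydomain.com/resource"):
    """Build the request URL, percent-encoding the item ID as the 'ref' value."""
    return base + "?" + urlencode({"ref": item_id})


def item_id_from(url):
    """Recover the original item ID from the 'ref' query parameter."""
    return parse_qs(urlsplit(url).query)["ref"][0]


url = resource_url("http://id.someotherdomain.com/SGX.3211")
print(url)
print(item_id_from(url))
```

The server side only has to decode one query parameter, so the double-slash problem in the path never comes up.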
UPDATE: apparently you no longer have to encode "/" and "?" within a query component. From RFC 3986:
The characters slash ("/") and question mark ("?") may represent data
within the query component. Beware that some older, erroneous
implementations may not handle such data correctly when it is used as
the base URI for relative references (Section 5.1), apparently
because they fail to distinguish query data from path data when
looking for hierarchical separators. However, as query components are
often used to carry identifying information in the form of "key=value"
pairs and one frequently used value is a reference to another URI, it
is sometimes better for usability to avoid percent-encoding those
characters.
So you could legally do this:
GET /resource?ref=id.someotherdomain.com/SGX.3211 HTTP/1.1
Host: ws.mydomain.com
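If you go that way, a sketch of the lighter encoding might look like this. It assumes you still want characters such as "&", "=" and "#" escaped, since they would otherwise be read as delimiters; the helper name encode_query_value and the second example URI are made up.

```python
from urllib.parse import quote


def encode_query_value(value):
    """Percent-encode a query value but leave '/', '?' and ':' readable."""
    return quote(value, safe="/?:")


# Slashes and the colon survive unchanged.
print(encode_query_value("http://id.someotherdomain.com/SGX.3211"))
# Characters that would break key=value parsing are still escaped.
print(encode_query_value("http://id.example/a?b=1&c=2"))
```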
If I define a URL pattern like "^optional/slash/?$", so that the page it is bound to is available at both versions of the URL, with and without the trailing slash, will I violate any conventions or standards by doing that?
Wouldn't a redirection be more appropriate?
If I remember correctly, trailing slashes should be used with resources that list other resources, like a directory that lists files, a list of articles or a category query (e.g. http://www.example.com/category/cakes/). Without a trailing slash, the URI should point to a single resource, like a file, an article or a complex query with parameters (e.g. http://www.example.com/search?ingredients=strawberry&taste=good).
Just use the HTTP code 302 FOUND to redirect typos to their correct URIs.
EDIT: Thanks to AndreD for pointing it out: an HTTP code 301 MOVED PERMANENTLY is more appropriate for permanently aliasing typos. Search engines and other clients should stop querying for the misspelled URL after getting a 301 code once, and Google recommends using it for changing the URL of a page in their index.
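A minimal sketch of that redirect decision, independent of any framework. The CANONICAL set and the resolve function are hypothetical names, and a real application would take the canonical paths from its URL configuration.

```python
# Assumed set of canonical paths, each ending in a slash.
CANONICAL = {"/optional/slash/"}


def resolve(path):
    """Return (status, location) for a requested path."""
    if path in CANONICAL:
        return 200, path
    if path + "/" in CANONICAL:
        # Permanently alias the slashless form to the canonical one.
        return 301, path + "/"
    return 404, None


print(resolve("/optional/slash/"))
print(resolve("/optional/slash"))
print(resolve("/something/else"))
```

This way only one of the two URLs ever answers with content, and the other one points at it.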
According to RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:
Section 6.2.4. Protocol-Based Normalization -
"Substantial effort to reduce the incidence of false negatives is
often cost-effective for web spiders. Therefore, they implement even
more aggressive techniques in URI comparison. For example, if they
observe that a URI such as
http://example.com/data
redirects to a URI differing only in the trailing slash
http://example.com/data/
they will likely regard the two as equivalent in the future. This
kind of technique is only appropriate when equivalence is clearly
indicated by both the result of accessing the resources and the
common conventions of their scheme's dereference algorithm (in this
case, use of redirection by HTTP origin servers to avoid problems
with relative references)."
My interpretation of this statement would be that making the two URIs functionally equivalent (e.g. by means of an .htaccess statement, redirect, or similar) does not violate any standard conventions. According to the RFC, web spiders are prepared to treat them functionally equivalent if they point to the same resource.
No, you are not violating any standards by doing that; you can use an optional trailing slash in your site's URLs. But you need to stay on the safe side, because servers handle the issue in different ways:
Sometimes, it doesn't matter for SEO: many web servers will just re-direct using 301 status code to the default version;
Some web servers may return a 404 page for the non-trailing-slash address = wasted link juice and efforts;
Some web servers may return 302 redirect to the correct version = wasted link juice and efforts;
Some web servers may return 200 response for both the versions = wasted link juice and efforts as well as potential duplicate content problems.
I'm trying to get the full path of the requested URL in Django. I use a URL pattern like this:
('^', myawesomeview),
It works well for domain.com/hello, domain.com/hello/sdfsdfsd and even for domain.com/hello.php/sd""^some!bullshit.index.aspx (although "^" is replaced with "%5E").
But when I try to use # in the request (e.g. http://127.0.0.1:8000/solid#url), it returns only "/solid". Is there any way to get the full path without ANY changes or replacements?
BTW, I'm getting the URL with return HttpResponse(request.path)
Thanks in advance.
The part of a URI after the '#' sign is called a fragment identifier. It is meant to be processed on the client side only and is not passed to the server. So if you really need it, you have to process it with JS, for example, and pass it as a regular parameter. Otherwise, this information will never be sent to Django.
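You can see the same split with Python's standard library. This is only a sketch of what a client does before sending the request; the helper name request_target is made up.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the part of a URL that a client puts on the HTTP request line."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # parts.fragment is deliberately left out: it stays on the client.
    return target


url = "http://127.0.0.1:8000/solid#url"
print(request_target(url))
print(urlsplit(url).fragment)
```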
I have a resource at a URL that both humans and machines should be able to read:
http://example.com/foo-collection/foo001
What is the best way to distinguish between human browsers and machines, and return either HTML or a domain-specific XML response?
(1) The Accept type field in the request?
(2) An additional bit of URL? eg:
http://example.com/foo-collection/foo001 -> returns HTML
http://example.com/foo-collection/foo001?xml -> returns, er, XML
I do not wish to oblige machines reading the resource to parse HTML (or XHTML for that matter). Machines like the googlebot should receive the HTML response.
It is reasonable to assume I control the machine readers.
If this is under your control, rather than adding a query parameter why not add a file extension:
http://example.com/foo-collection/foo001.html - return HTML
http://example.com/foo-collection/foo001.xml - return XML
Apart from anything else, that means if someone fetches it with wget or saves it from their browser, it'll have an appropriate filename without any fuss.
My preference is to make it a first-class part of the URI. This is debatable, since there are, in a sense, multiple URIs for the same resource. And is "format" really part of the URI?
http://example.com/foo-collection/html/foo001
http://example.com/foo-collection/xml/foo001
These are very easy to deal with in a web framework that has URI parsing to direct the request to the proper application.
If this is indeed the same resource with two different representations, HTTP invites you to use the Accept header as you suggest. This is probably a very reliable way to distinguish between the two scenarios. You can be quite sure that user agents (including search engine spiders) send the Accept header properly.
As for the machine agents you are going to serve XML to: are they under your control? In that case you can be doubly sure that Accept will work. If they do not set this header properly, you can give XML as the default. User agents DO set the header properly.
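Here is a rough sketch of that negotiation, assuming only two representations (text/html and application/xml) and XML as the fallback. It is deliberately simplified: wildcards such as */* are ignored, and the function names parse_accept and choose_representation are made up.

```python
def parse_accept(header):
    """Return {media_type: q} for each entry in an Accept header value."""
    weights = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        weights[media_type] = q
    return weights


def choose_representation(accept_header, default="application/xml"):
    """Pick HTML or XML by q-value; fall back to the default if neither is listed."""
    weights = parse_accept(accept_header or "")
    html_q = weights.get("text/html", 0.0)
    xml_q = weights.get("application/xml", 0.0)
    if html_q == 0.0 and xml_q == 0.0:
        return default
    return "text/html" if html_q >= xml_q else "application/xml"


browser = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
print(choose_representation(browser))
print(choose_representation("application/xml"))
print(choose_representation(None))
```

A browser-style header prefers HTML, while a client that asks for application/xml, or sends no Accept header at all, gets XML.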
I would try to use the Accept header for this, because this is exactly what the Accept header is there for.
The problem with having two different URLs is that it is not automatically apparent that the two represent the same underlying resource. This can be bad if a user finds a URL in one program, which renders HTML, and pastes it into another, which needs XML. At this point a smart user could probably change the URL appropriately, but this is just a source of error that you don't need.
I would say adding a query string parameter is your best bet. The only way to automatically detect whether your client is a browser (human) or an application would be to read the User-Agent string from the HTTP request. But any application can easily set this to mimic a browser, so you're not guaranteed that this is going to work.