Correct escaping of % in the URL with Apache - django

I have a Django project with a search page that takes input through a POST and redirects to /search/<search string>/, and that page renders the result. The percent sign (%) is used as a wildcard in the search (tes%er returns testuser, tester, etc., and the URL then looks like this: example.com/search/tes%25er/), and everything works fine with the Django development server. If I manually write tes%er in the URL, it changes to tes%25er automatically.
Now I'm deploying on an Apache server with mod_wsgi, and when my search page redirects to example.com/search/tes%er/ I get the server error: Bad Request. Your browser sent a request that this server could not understand. If I manually add '25' to the URL, so that it contains the encoded % sign as on the development server, it works fine.
Is there a way to make Apache automatically escape the % sign and create a URL that works, or understand an unescaped %, or do I need to do ugly hacks in my search page that builds the URL? (I'd rather not, because then users can't manually add % to the URL and get it to work.)
Edit: The code that sends the query from the search page to the search url.
if form.is_valid():
    if 'search_user' in request.POST:
        q = request.POST['search_user']
        return redirect('/search/'+q)

As Ignacio already suggested, you should not redirect to an invalid URL. So to answer your question:
You cannot (or perhaps it's better to say 'should not') ask your Apache server to escape your URL. The reason you escape a URL is that some characters have a special meaning. For example, take a query string:
somedomain.com/?key=value
If you wanted to use a ? or a = in your value, you would have a problem, because the server would think you are using the delimiters of the query string.
The same goes for the % symbol. When Apache sees a % symbol, it expects a percent-encoded character to follow and tries to decode it. In tes%er the % is followed by "er", which is not a valid hexadecimal pair, hence the Bad Request. If your search string is %20, Apache will translate it to a space, while you meant "wildcard followed by 20".
In summary: Apache decodes your string, so you don't want it to encode the string for you.
But this does not solve your problem. You can solve it by changing your code to the following:
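from urllib.parse import urlencode  # Python 2: from urllib import urlencode

if form.is_valid():
    if 'search_user' in request.POST:
        q = request.POST['search_user']
        # urlencode() expects a mapping (or a sequence of pairs), not a bare string
        return redirect('/search/?' + urlencode({'q': q}))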
In case you wonder what happens if a user types /search/?q=% by hand: in that case they have a problem, because they have typed an invalid address.
Hope this helps :-).
Wout
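To illustrate the encoding step discussed above, here is a minimal sketch using only the Python 3 standard library. The helper name build_search_path is hypothetical, and keeping the /search/<search string>/ path layout from the question is an assumption; the answer above uses a query string instead.

```python
from urllib.parse import quote, unquote


def build_search_path(term):
    """Build a /search/<term>/ path with the term percent-encoded.

    Hypothetical helper. safe='' makes quote() encode '/' as well,
    so a search term cannot add extra path segments.
    """
    return '/search/' + quote(term, safe='') + '/'


path = build_search_path('tes%er')
print(path)                  # /search/tes%25er/
print(unquote('tes%25er'))   # tes%er
```

With this approach the redirect target is always a valid URL, and the server-side decoding turns %25 back into the literal % wildcard.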

Related

How do I Regex this Facebook fbclid?

I migrated a client's site from a Movable Type site with posts that ended in ".php" to a WordPress site whose URLs end in a slash "/". All my 301 redirects are working great, but I found out from the client that he has links on his website's Facebook page. Those links end in ".php?fbclid=InsertRandomParamsHere". What I need to do is replace the ".php" with "/" so the pages redirect correctly while keeping the Facebook tracking parameters at the end.
I've been using a regular expression for the 301, and here is what my regex looks like so far (I'm using the Rank Math plugin for redirects):
The Source URL regex is:
^(.*)\.php(.*)
The Destination URL is:
https://www.beachwoodreporter.com/
What I get right now is, for an example link:
http://www.beachwoodreporter.com/music/you_turn_me_on_again.php?fbclid=IwAR37SDAQdPrxMqwHQEY6dcs5rle1Mt0b0WubR9dL8WbaX3zoKNqjW0J84p0
which should redirect to:
http://www.beachwoodreporter.com/music/you_turn_me_on_again/?fbclid=IwAR37SDAQdPrxMqwHQEY6dcs5rle1Mt0b0WubR9dL8WbaX3zoKNqjW0J84p0
is instead redirecting to:
http://www.beachwoodreporter.com/?fbclid=IwAR37SDAQdPrxMqwHQEY6dcs5rle1Mt0b0WubR9dL8WbaX3zoKNqjW0J84p0
so it's basically stripping out the slug portion of the URL:
/music/you_turn_me_on_again/
The client has many links like this on their Facebook page, so fixing them one at a time is out of the question. All I need is to replace the ".php" with "/", and that should fix all these problems. Can this be done, or should I tell the client I can't do it?
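The intended rewrite can be sketched outside the plugin with Python's re module. This is only an illustration of the substitution: it reuses the source pattern from the question and puts both captured groups back into the result. Whether and how Rank Math's Destination URL field accepts capture-group references is not shown in the question and would need to be checked in the plugin's documentation. The function name rewrite_php_url and the shortened fbclid value are hypothetical.

```python
import re

# Source pattern from the question: group 1 is everything before ".php",
# group 2 is everything after it (for example "?fbclid=...").
PATTERN = re.compile(r'^(.*)\.php(.*)')


def rewrite_php_url(url):
    """Replace ".php" with "/" and keep whatever followed it (hypothetical helper)."""
    return PATTERN.sub(r'\1/\2', url)


# Placeholder fbclid value, shortened for the example.
old = 'http://www.beachwoodreporter.com/music/you_turn_me_on_again.php?fbclid=EXAMPLE'
print(rewrite_php_url(old))
# http://www.beachwoodreporter.com/music/you_turn_me_on_again/?fbclid=EXAMPLE
```

A fixed destination that never refers back to group 1 would be consistent with the slug disappearing in the redirect described above.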
Image of the Rank Math regex settings:

Parse redirection URL

I am analyzing the URLs in a malicious e-mail, which I parse using BeautifulSoup. I get this URL:
https://www.google.com/url?q=http://my.%42%41%44%2e%43%4F&sa=D&usg=AFQjCNGTKogvWUF40RsyeAXrGi6uQrlhoQ
This URL makes google.com redirect to http://my.BAD.CO. Given a URL like the one above, how can I tell that it will trigger a redirect?
I want an indication that this is a redirect, and I want to get two separate URLs:
http://my.BAD.CO and https://www.google.com/url?q=http://5sr0s.%61%6b%68%6f%72%61%62%2e%72%75&sa=D&usg=AFQjCNGTKogvWUF40RsyeAXrGi6uQrlhoQ
where http://my.BAD.CO is the decoded form of the encoded target URL http://my.%42%41%44%2e%43%4F
If the only solution is a custom RegEx like this
(?i)(http|https)://(www.|)google.com/url\?q=(http|https)://(\S+)\&usg=\S+
followed by a call to urllib.parse.unquote, will it cover all corner cases?
Are there other ways to redirect besides https://www.google.com/url... ?
I found another way to redirect: via https://www.google.de/url?sa=t&url=
I ended up with a regex
(?i)^(http|https)://(www.|)google.(ac|ad|aero|ae|af|ag|ai|al|am|an|ao|aq|arpa|ar|asia|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|biz|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|cat|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|coop|com|co|cr|cu|cv|cx|cy|cz|de|dj|dk|dm|do|dz|ec|edu|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gov|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|info|int|in|io|iq|ir|is|it|je|jm|jobs|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mil|mk|ml|mm|mn|mobi|mo|mp|mq|mr|ms|mt|museum|mu|mv|mw|mx|my|mz|name|na|nc|net|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|org|pa|pe|pf|pg|ph|pk|pl|pm|pn|pro|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tel|tf|tg|th|tj|tk|tl|tm|tn|to|tp|travel|tr|tt|tv|tw|tz|ua|ug|uk|um|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|xn--0zwm56d|xn--11b5bs3a9aj6g|xn--80akhbyknj4f|xn--9t4b11yi5a|xn--deba0ad|xn--g6w251d|xn--hgbk6aj7f53bba|xn--hlcj6aya9esc7a|xn--jxalpdlp|xn--kgbechtv|xn--zckzah|ye|yt|yu|za|zm|zw)/url\?.+$
or, in a more readable form,
(?i)^(http|https)://(www.|)google.(com|de)/url\?.+$
A lot of people considered the question not worth anyone's effort; I got -4 for it. Some questions appear to be trivial. I still hope that there is a better solution to the problem. I did not find a list of websites that allow URL redirects the way google.com/url?q= does.
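As an alternative to a hand-written regex, the splitting and decoding can be sketched with urllib.parse from the Python 3 standard library. This is a minimal sketch, not a complete detector: the function name extract_redirect_target is hypothetical, the host list only covers the google.com and google.de variants mentioned above, and the parameter names q and url are the two seen in this question.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: only these hosts are treated as redirectors; extend as needed.
REDIRECT_HOSTS = {'google.com', 'www.google.com', 'google.de', 'www.google.de'}
# Query parameters observed in the question that carry the target URL.
TARGET_PARAMS = ('q', 'url')


def extract_redirect_target(url):
    """Return the decoded target URL of a known redirect link, or None."""
    parts = urlsplit(url)
    host = (parts.hostname or '').lower()
    if host not in REDIRECT_HOSTS or parts.path != '/url':
        return None
    params = parse_qs(parts.query)  # parse_qs also undoes percent-encoding
    for name in TARGET_PARAMS:
        if name in params:
            return params[name][0]
    return None


link = ('https://www.google.com/url?q=http://my.%42%41%44%2e%43%4F'
        '&sa=D&usg=AFQjCNGTKogvWUF40RsyeAXrGi6uQrlhoQ')
print(extract_redirect_target(link))  # http://my.BAD.CO
```

A return value other than None is the indication that the link is a redirect; the original link and the returned target are the two separate URLs asked for.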

apache does not pass request to Django (404 not found)

I have a custom 404 page set up for the site, which works fine:
when I hit mysite.com/fdsafsadfdsa, which doesn't exist, the custom 404 page shows up.
However, if I add a URL-encoded '/' ('%2f') at the end of the URL, as in mysite.com/fdsafsadfdsa%2f, I get the Apache 404 Not Found page instead.
It looks like Apache decides to handle this 404 itself instead of passing the request down to Django.
Does anybody have an idea why this is happening?
It turns out this is an issue in Apache/Nginx. Somebody has already submitted it to the Django project; see https://code.djangoproject.com/ticket/15718
The ticket includes a workaround; to quote from it:
After investigation I've found that the 2nd issue (404 error directly from apache) is not related to django and can be avoided by adding "AllowEncodedSlashes On" into apache config. Unfortunately apache replaces %2f with / itself, so the behavior is exactly the same as in simple http server provided by django. In Apache 2.2.18 (which is not released yet, i guess), AllowEncodedSlashes allows value NoDecode. With the value NoDecode, such URLs are accepted, but encoded slashes are not decoded but left in their encoded state. Meanwhile I'm using the workaround
request_uri = force_unicode(environ.get('REQUEST_URI', u'/'))
if u'?' in request_uri:
    path_info,query = request_uri.split('?',1)
else:
    path_info,query = request_uri,''
instead of original
path_info = force_unicode(environ.get('PATH_INFO', u'/'))
in core/handlers/wsgi.py

django and angular url patterns

Sometimes in my code I pass GET parameters in URLs. One particular scenario: if a user who is not logged in enters the URL of a page that requires login, they have to log in first.
In that case I may have a URL such as: www.example.com/home/#/main/.
The end of the URL, /#/main/, is for Angular. However, in Django I read the next parameter like this:
self.request.GET.get('next', self.redirect_url)
The problem is that in this case next contains everything but the Angular portion, so I get: www.example.com/home/.
Is there any way to get the remaining portion of the URL as well?
You have to URL-encode the URL before you add it as a parameter. The # then turns into %23 and is no longer the separator for the anchor, which is handled client-side only, as KVISH described.
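A short sketch of the difference, using only the Python 3 standard library. The /login/ path is a hypothetical example; the next value is the path from the question.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

next_url = '/home/#/main/'  # path from the question, including the Angular part

# Unencoded: everything after '#' is a fragment and is not part of the query.
raw = urlsplit('http://www.example.com/login/?next=' + next_url)
print(raw.query)      # next=/home/
print(raw.fragment)   # /main/

# Encoded: '#' becomes %23 and stays inside the query string.
query = urlencode({'next': next_url})
print(query)                        # next=%2Fhome%2F%23%2Fmain%2F
print(parse_qs(query)['next'][0])   # /home/#/main/
```

This only helps when the link carrying the next parameter is generated by code that already knows the full target, including the part after the #.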
Apparently you can't. Django doesn't even see the anchor; it's all handled on the client (browser).
How to identify an anchor in a url in Django?
The way I got around this is I use jQuery to set a hidden input field to the hash location, which can be obtained like so:
window.location.hash
The hash gets submitted with the form and I can take it from there.

How do short URLs services work?

How do services like TinyURL or Metamark work?
Do they simply associate the tiny URL key with a [virtual?] web page which merely provide an "HTTP redirect" to the original URL? or is there more "magic" to it ?
[original wording]
I often use URL shortening services like TinyURL, Metamark, and others, but every time I do, I wonder how these services work. Do they create a new file that will redirect to another page or do they use subdomains?
No, they don't use files. When you click on a link like that, an HTTP request is sent to their server with the full URL, like http://bit.ly/duSk8wK (links to this question). They read the path part (here duSk8wK), which maps to a record in their database. In the database, they find a description (sometimes), your name (sometimes) and the real URL. Then they issue a redirect, which is an HTTP 302 response with the target URL in the Location header.
This direct redirect is important. If you were to use files, or first load HTML and then redirect, the browser would add TinyUrl to the history, which is not what you want. Also, the site that is redirected to will see the referrer (the site you originally came from) as being the site the TinyUrl link is on (i.e., twitter.com, your own site, wherever the link is). This is just as important, so that site owners can see where people are coming from. This, too, would not work if a page were loaded that then redirects.
PS: there are more types of redirect. HTTP 301 means a permanent redirect. If that were used, the browser would not request the bit.ly or TinyUrl site again, and those sites want to count the hits. That's why HTTP 302 is used, which is a temporary redirect. The browser will ask TinyUrl.com or bit.ly again each time, which makes it possible to count the hits for you (some tiny URL services offer this).
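The lookup-and-redirect flow described above can be sketched as a tiny WSGI application in Python. This is an assumption-laden illustration, not how bit.ly or TinyURL are actually implemented: the in-memory LINKS dict stands in for their database, the target URL is a placeholder, and the HITS counter is a hypothetical stand-in for hit counting.

```python
from collections import Counter

# Stand-in for the database: short key -> real URL (placeholder target).
LINKS = {'duSk8wK': 'https://example.com/some/very/long/original/url'}
HITS = Counter()  # hypothetical per-key hit counter


def app(environ, start_response):
    """Minimal WSGI app: answer known keys with a 302 and a Location header."""
    key = environ.get('PATH_INFO', '/').lstrip('/')
    target = LINKS.get(key)
    if target is None:
        start_response('404 Not Found', [('Content-Type', 'text/plain')])
        return [b'unknown short link\n']
    HITS[key] += 1  # possible only because a 302 makes the browser ask every time
    start_response('302 Found', [('Location', target)])
    return [b'']
```

To try it locally, the standard library's wsgiref.simple_server.make_server('', 8000, app).serve_forever() can serve the app; requesting /duSk8wK should then produce the redirect.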
Others have answered how the redirects work, but you should also know how they generate their tiny URLs. You'll often hear, mistakenly, that they create a hash of the URL in order to generate the unique code for the shortened URL. This is incorrect in most cases: they aren't using a hashing algorithm (where you could potentially have collisions).
Most of the popular URL shortening services simply take the ID in the database of the URL and then convert it to either Base 36 [a-z0-9] (case insensitive) or Base 62 (case sensitive).
A simplified example of a TinyURL Database Table:
ID      URL                     VisitCount
1       www.google.com          26
2       www.stackoverflow.com   2048
3       www.reddit.com          64
...
20103   www.digg.com            201
20104   www.4chan.com           20
Web frameworks that allow flexible routing make handling the incoming URLs really easy (Ruby, ASP.NET MVC, etc).
So, on your webserver you might have a route action that looks like (pseudo code):
Route: www.mytinyurl.com/{UrlID}
Route Action: RouteURL(UrlID);
Which routes any incoming request to your server that has any text after your domain www.mytinyurl.com to your associated method, RouteURL. It supplies the text that is passed in after the forward slash in your URL to that method.
So, let's say you requested: www.mytinyurl.com/fif
"fif" would then be passed to your method, RouteURL(String UrlID). RouteURL would then convert "fif" to its base10 equivalent, 20103, and a database request will be made to redirect to whatever URL is stored under the ID 20103 (in this case, www.digg.com). You would also increase the visit count for Digg by one before redirecting to the correct URL.
This is a really simplified example but you should be able to get the general idea.
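The conversion step can be sketched in Python as follows. It assumes the base-36 digit order 0-9 followed by a-z, which is consistent with the "fif" example (15*36^2 + 18*36 + 15 = 20103). The function names and the in-memory URLS dict are hypothetical stand-ins for the route action and the database table above.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # base 36: 0-9 then a-z


def encode_id(number):
    """Convert a non-negative database ID to a base-36 key."""
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number > 0:
        number, remainder = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return ''.join(reversed(chars))


def decode_key(key):
    """Convert a base-36 key back to the integer ID (case insensitive)."""
    return int(key.lower(), 36)


# Two rows taken from the simplified table above (ID -> URL).
URLS = {20103: 'www.digg.com', 20104: 'www.4chan.com'}


def route_url(key):
    """Look up the target URL for a short key; returns None if unknown."""
    return URLS.get(decode_key(key))


print(decode_key('fif'))   # 20103
print(encode_id(20103))    # fif
print(route_url('fif'))    # www.digg.com
```

Because the key is derived from a unique ID rather than from a hash of the URL, two different rows can never end up with the same key.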
As an extension to @A Salcedo's answer:
Some URL shortening services (Tinyarro.ws) go to the extreme of using Unicode (UTF-8) characters in the shortened URL, which allows a larger number of sites before an additional symbol has to be added. Since most of UTF-8 is accepted for use (IRIs, RFC 3987, are handled by most browsers), that bumps the capacity from 62 sites per symbol to ~1,112,064.
To put this in perspective, one can encode 1.2366863e+12 sites with 2 symbols (1,112,064*1,112,064). In November 2009, shortened links on bit.ly were accessed 2.1 billion times (around that time, bit.ly and TinyURL were the most widely used URL-shortening services), which is ~600 times less than what fits in just 2 symbols, so for the full duration of existence of all URL shortening services it should last another 20 years minimum before a third symbol has to be added.
In simple words, a URL shortener maps an arbitrarily long sequence of characters (the original, long, crappy URL) to a short and slick sequence of characters. This is nothing but hashing, which is most commonly used to create lookup tables (HashMap), MD5 hashes for cryptographic purposes, etc.
To understand the URL-Shortening process I have created a demo project on GitHub and also a blog post. Do refer to this and let me know if it was helpful.
Blog Post : URL Shortening