How do short URL services work? - web-services

How do services like TinyURL or Metamark work?
Do they simply associate the tiny URL key with a [virtual?] web page which merely provides an "HTTP redirect" to the original URL? Or is there more "magic" to it?
[original wording]
I often use URL shortening services like TinyURL, Metamark, and others, but every time I do, I wonder how these services work. Do they create a new file that will redirect to another page or do they use subdomains?

No, they don't use files. When you click on a link like that, an HTTP request is sent to their server with the full URL, like http://bit.ly/duSk8wK (links to this question). They read the path part (here duSk8wK), which they look up in their database. There they find a description (sometimes), your name (sometimes) and the real URL. Then they issue a redirect: an HTTP 302 response with the target URL in the Location header.
This direct redirect is important. If you used files, or first loaded an HTML page and then redirected, the browser would add TinyURL to its history, which is not what you want. Also, the site being redirected to will see the referrer (the site you originally came from) as the site the TinyURL link appears on (i.e., twitter.com, your own site, wherever the link is). This is just as important, so that site owners can see where visitors are coming from. That too would break if a page were loaded first and then redirected.
PS: there are more types of redirect. HTTP 301 means "moved permanently"; if that were used, the browser would stop requesting the bit.ly or TinyURL site altogether, and those sites want to count the hits. That's why HTTP 302, a temporary redirect, is used: the browser asks TinyURL.com or bit.ly again each time, which makes it possible to count the hits for you (some URL shortening services offer this).
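To make the lookup-and-redirect step concrete, here is a minimal sketch in Python (Flask and an in-memory dict stand in for the real web server and database; the short code and target URL are made-up placeholders, not bit.ly's actual implementation):

from flask import Flask, abort, redirect

app = Flask(__name__)

# Pretend database: short code -> target URL and hit counter (placeholder data).
links = {"duSk8wK": {"url": "https://example.com/some/very/long/target", "hits": 0}}

@app.route("/<code>")
def follow(code):
    entry = links.get(code)
    if entry is None:
        abort(404)                       # unknown short code
    entry["hits"] += 1                   # count the visit before redirecting
    # 302 (temporary) so browsers keep asking the shortener and hits stay countable
    return redirect(entry["url"], code=302)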

Others have answered how the redirects work, but you should also know how these services generate their tiny URLs. You'll sometimes hear that they create a hash of the URL to generate the unique code for the shortened URL. In most cases this is incorrect: they aren't using a hashing algorithm (where you could potentially have collisions).
Most of the popular URL shortening services simply take the database ID of the URL and convert it to either base 36 [a-z0-9] (case insensitive) or base 62 [a-zA-Z0-9] (case sensitive).
A simplified example of a TinyURL Database Table:
ID      URL                     VisitCount
1       www.google.com          26
2       www.stackoverflow.com   2048
3       www.reddit.com          64
...
20103   www.digg.com            201
20104   www.4chan.com           20
Web frameworks that allow flexible routing make handling the incoming URLs really easy (Ruby, ASP.NET MVC, etc.).
So, on your webserver you might have a route action that looks like (pseudo code):
Route: www.mytinyurl.com/{UrlID}
Route Action: RouteURL(UrlID);
This routes any incoming request to your server that has text after your domain www.mytinyurl.com to your associated method, RouteURL, and supplies the text that comes after the forward slash to that method.
So, let's say you requested: www.mytinyurl.com/fif
"fif" would then be passed to your method, RouteURL(String UrlID). RouteURL would convert "fif" to its base-10 equivalent, 20103, and a database lookup would be made for whatever URL is stored under ID 20103 (in this case, www.digg.com). You would also increase the visit count for Digg by one before redirecting to the correct URL.
This is a really simplified example, but you should be able to get the general idea.
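For example, a minimal sketch of that ID-to-code conversion in Python (the alphabet and its ordering are assumptions for illustration; real services may order theirs differently):

# Base-36 conversion between a database ID and a short code.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"   # assumed ordering

def encode(n):
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, r = divmod(n, 36)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def decode(code):
    n = 0
    for ch in code.lower():
        n = n * 36 + ALPHABET.index(ch)
    return n

print(encode(20103))   # fif
print(decode("fif"))   # 20103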

As an extension to A Salcedo's answer:
Some URL shortening services (e.g. Tinyarro.ws) go to the extreme of using Unicode (UTF-8) characters in the shortened URL, which allows far more destinations before an extra symbol has to be added. Since most of the Unicode range can be used (IRIs per RFC 3987 are handled by most browsers), that bumps the alphabet from 62 possibilities per symbol to roughly 1,112,064.
To put that in perspective, two symbols can encode about 1.24 × 10^12 destinations (1,112,064 × 1,112,064). In November 2009, shortened links on bit.ly were accessed 2.1 billion times (around that time, bit.ly and TinyURL were the most widely used URL-shortening services), which is roughly 600 times fewer than fit in just two symbols, so at that rate all URL shortening services combined should last at least another 20 years before a third symbol is needed.
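To illustrate the idea (this is not Tinyarro.ws' actual scheme), the same ID-to-code conversion can be run against a much larger alphabet of Unicode code points; the base and code-point range below are arbitrary choices for the demo:

# Same idea as base 36/62, but with a large Unicode "alphabet".
BASE = 1000            # pretend alphabet size; a real service could use ~1,000,000
START = 0x4E00         # arbitrary block of visible characters, just for the demo

def encode_unicode(n):
    chars = []
    while True:
        n, r = divmod(n, BASE)
        chars.append(chr(START + r))
        if n == 0:
            break
    return "".join(reversed(chars))

print(encode_unicode(1234567))   # 3 characters here vs. 4 in base 62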

In simple words, a URL shortener maps an arbitrarily long sequence of characters (the original, long, unwieldy URL) to a short and slick sequence of characters. This is nothing but hashing, which is most commonly used to create lookup tables, HashMaps, MD5 hashes for cryptographic purposes, etc.
To understand the URL-shortening process, I have created a demo project on GitHub and also a blog post. Do refer to them and let me know if they were helpful.
Blog Post : URL Shortening

Related

Parse redirection URL

I'm analyzing the URLs in a malicious e-mail. I parse the e-mail using BeautifulSoup and get this URL:
https://www.google.com/url?q=http://my.%42%41%44%2e%43%4F&sa=D&usg=AFQjCNGTKogvWUF40RsyeAXrGi6uQrlhoQ
This URL will force google.com to redirect to http://my.BAD.CO. Given a URL like the one above, how can I know that it will trigger a redirect?
I want an indication that this is a redirect, and I want to get two separate URLs:
http://my.BAD.CO and https://www.google.com/url?q=http://5sr0s.%61%6b%68%6f%72%61%62%2e%72%75&sa=D&usg=AFQjCNGTKogvWUF40RsyeAXrGi6uQrlhoQ
where http://my.BAD.CO is the decoded form of the encoded target URL http://my.%42%41%44%2e%43%4F.
If the only solution is a custom RegEx like this
(?i)(http|https)://(www.|)google.com/url\?q=(http|https)://(\S+)\&usg=\S+
followed by a call to urllib.parse.unquote, will it cover all corner cases?
Are there other ways to redirect besides https://www.google.com/url... ?
I found another way to redirect: via https://www.google.de/url?sa=t&url=
I ended up with a regex
(?i)^(http|https)://(www.|)google.(ac|ad|aero|ae|af|ag|ai|al|am|an|ao|aq|arpa|ar|asia|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|biz|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|cat|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|coop|com|co|cr|cu|cv|cx|cy|cz|de|dj|dk|dm|do|dz|ec|edu|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gov|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|info|int|in|io|iq|ir|is|it|je|jm|jobs|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mil|mk|ml|mm|mn|mobi|mo|mp|mq|mr|ms|mt|museum|mu|mv|mw|mx|my|mz|name|na|nc|net|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|org|pa|pe|pf|pg|ph|pk|pl|pm|pn|pro|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tel|tf|tg|th|tj|tk|tl|tm|tn|to|tp|travel|tr|tt|tv|tw|tz|ua|ug|uk|um|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|xn--0zwm56d|xn--11b5bs3a9aj6g|xn--80akhbyknj4f|xn--9t4b11yi5a|xn--deba0ad|xn--g6w251d|xn--hgbk6aj7f53bba|xn--hlcj6aya9esc7a|xn--jxalpdlp|xn--kgbechtv|xn--zckzah|ye|yt|yu|za|zm|zw)/url\?.+$
or a readable form
(?i)^(http|https)://(www.|)google.(com|de)/url\?.+$
A lot of people considered this question not worth anyone's effort; I got -4 for it. Some questions appear trivial, but I still hope there is a better solution to the problem. I did not find a list of web sites that allow redirecting a URL the way google.com/url\?q does.
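For illustration, here is a rough sketch of the regex-plus-urllib.parse approach described above (the simplified pattern and the helper function are mine, not a complete solution; they only cover the google.*/url style of redirect):

import re
from urllib.parse import urlparse, parse_qs

# Simplified variant of the pattern above; not exhaustive.
REDIRECT_RE = re.compile(r"(?i)^https?://(www\.)?google\.[a-z.]+/url\?")

def extract_target(url):
    """Return the decoded redirect target, or None if url is not a redirect."""
    if not REDIRECT_RE.match(url):
        return None
    params = parse_qs(urlparse(url).query)   # parse_qs percent-decodes the values
    values = params.get("q") or params.get("url") or []
    return values[0] if values else None

print(extract_target(
    "https://www.google.com/url?q=http://my.%42%41%44%2e%43%4F&sa=D&usg=X"))
# -> http://my.BAD.CO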

Create a redirect rule for Azure CDN in Verizon which adds /index.html to the end of URL

I created an Azure CDN under a Verizon Premium subscription in the Azure portal, with an endpoint that points to my Azure Static Website URL.
I want to create a redirect rule in the Azure Verizon rules engine which adds /index.html to the end of the URL if no extension is specified or the last character of the URL is not a / symbol.
So far I have tried to use the regex (.+\/[^\.]\w+$); you can see an example of how it works here.
My first approach:
In this case, if you type the URL https://blah.com/foo/bar in the web browser
it doesn't change the URL; however, you are able to view some of the content of the existing file from https://blah.com/foo/bar/index.html, but some of the links to resources are broken. I'm not sure why I'm not getting a 404 in this case, but maybe it's because I set the index document name to index.html in the Static Website panel of the Storage account in Azure. If I open the Network tab in Chrome's developer tools I can see a lot of 404 responses, e.g.
And that's because the website tries to get resources from the https://blah.com/foo/ directory instead of https://blah.com/foo/bar/.
So, for example, loadcsh.js is in fact located at https://blah.com/foo/bar/loadcsh.js, but the website is searching for the file under the wrong directory, https://blah.com/foo/loadcsh.js.
My second approach
In this case, if you type the URL https://blah.com/foo/bar
it makes a redirect to https://blah.com/foo/bar/foo/bar/index.html
so the foo/bar/ is redundant here.
My third approach
In this case, if you type the URL https://blah.com/foo/bar
it makes a redirect to https://blah.com/index.html
I have no idea how to apply the rule which makes a redirect from https://blah.com/foo/bar
to https://blah.com/foo/bar/index.html and is generic for all such cases.
Any ideas??
Cheers
I think your regex is fine. You can add a URL Redirect rule: in the Source textbox type your regex (.+\/[^\.]\w+$), and in the Destination textbox add https://%{host}/$1/index.html. Here I used an HTTP variable for Azure CDN which can be used in Verizon; you can read more about the variables here.
In short, %{host} returns the host name, e.g. www.contoso.com.
Please keep in mind that all rule changes require a couple of hours of propagation before they take effect on the CDN.
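As a quick sanity check of which paths the source pattern (.+\/[^\.]\w+$) actually targets, you can test it outside the portal before saving the rule; a small sketch (the sample paths below are made up):

import re

# The source pattern from the rule above.
pattern = re.compile(r".+\/[^\.]\w+$")

for path in ["foo/bar",             # no extension, no trailing slash -> matches
             "foo/bar/",            # trailing slash -> no match
             "foo/bar/loadcsh.js",  # has an extension -> no match
             "index.html"]:         # no slash -> no match
    print(path, bool(pattern.search(path)))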

S3 Static Website: Return HTTP 410

Background
I have a static website on S3 with tens of thousands of HTML pages indexed by Google. I'm moving to a new version and I want to remove old pages (which may no longer exist) from the Google index. I've read online that the most efficient way to do that is to return HTTP 410 (Gone).
Problem
According to http://docs.aws.amazon.com/AmazonS3/latest/dev/CustomErrorDocSupport.html, you cannot return an HTTP 410 when using S3 static website hosting.
API Gateway
I created a mock integration in API Gateway which returns HTTP 410. Then I configured my S3 bucket to automatically redirect a specific prefix to this URL. However, the return code seen is HTTP 301 (for the first redirect). If I GET the API endpoint directly, I receive the 410 successfully; however, if I access the API through an S3 GET, the status code is 301.
What's next
If anyone has an idea on how to return HTTP 410 on a static website hosted on S3, let me know.
Additionally, if you can think of a better alternative to de-index old pages on Google (the manual tool isn't a solution as I have a large number of pages), let me know :)
I really feel that a better answer would be to put a server in front of the S3 content with a very simple database table. Your real issue is distinguishing a 410 from a 404: you know a page is gone, but how do you differentiate that from a typo or other error?
What I would envision is a table indexed by the path name (e.g. /path/to/my/file.html) with a status of some sort. The server takes in a request for the full path, does a lookup in the database, and either serves the page (assuming the page is "active" or "available") or returns a 410 if you know the page is not active. If the path can't be found in the database at all, it returns a 404.
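A bare-bones sketch of that front-end lookup (Flask and SQLite are just stand-ins here, and the table and column names are invented for the example):

import sqlite3
from flask import Flask, abort, send_file

app = Flask(__name__)
db = sqlite3.connect("pages.db", check_same_thread=False)
# Assumed schema: pages(path TEXT PRIMARY KEY, status TEXT, local_file TEXT)

@app.route("/<path:page_path>")
def serve(page_path):
    row = db.execute("SELECT status, local_file FROM pages WHERE path = ?",
                     ("/" + page_path,)).fetchone()
    if row is None:
        abort(404)                  # never heard of this page
    status, local_file = row
    if status != "active":
        abort(410)                  # page existed and is deliberately gone
    return send_file(local_file)    # or proxy the object from S3 in a real setup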
The two issues I see with this approach are:
The initial population of the database. If you've already removed the pages from S3, how will you know which paths to insert with a "not available" flag? I'm not sure how many pages we're talking about, but it could be quite a big job the first time.
Maintenance - you will likely need an administrative interface of some sort down the road for the next time you need to deactivate some number of pages.
There are content management systems that will do some of this for you, or it wouldn't be too bad to write a simple server to do this, given the issues I've outlined.

How to identify GET/POST requests made by a human, ignoring any requests that follow

I'm writing an application that listens to HTTP traffic and tries to recognize which requests were initiated by a human.
For example:
The user types cnn.com in their address bar, which starts a request. Then I want to find CNN's server response while discarding any other requests (such as XHR, etc.).
How can you tell from the header information which requests are which?
After doing some research I've found that relevant responses come with:
Content-Type: text/html
HTML with a meaningful title
Status: 200 OK
There is no way to tell from the bits on the wire. The HTTP protocol has a defined format, which all (non-broken) user agents adhere to.
You are probably thinking that the translation of a user's typing of just 'cnn.com' into 'http://www.cnn.com/' on the wire can be detected from the protocol payload. The answer is no, it can't.
To detect the user agent allowing the user such shorthand, you would have to snoop the user agent application (e.g. a browser) itself.
Actually, detecting non-human agency is the interesting problem (with spam detection as one obvious motivation). This is because HTTP belongs to the family of NVT protocols, where the basic idea, believe it or not, is that a human should be able to run the protocol "by hand" in a network terminal/console program (such as a telnet client.) In other words, the protocol is basically designed as if a human were using it.
I don't think header information can suffice to distinguish real users from bots, since bots are made to mimic real users and headers are very easy to imitate.
One thing you can do is track the path (sequence of clicks) followed by a user, which is likely to differ from one made by a bot, and do some analysis on the posted information (e.g. Bayesian filters).
A check that is very easy to implement is based on the source IP. There are databases of blacklisted IP addresses; see Project Honeypot. If you are writing your software in Java, here is an example of how to check an IP address: How to query HTTP:BL for spamming IP addresses.
What I do on my blog is this (using WordPress plugins):
check whether the IP address is in the HTTP:BL; if it is, the user is shown an HTML page with steps to whitelist their IP address. This is done in WordPress by the Bad Behavior plugin.
when the user submits some content, a Bayesian filter checks the submission, and if the comment is identified as spam a CAPTCHA is displayed before completing the submission. This is done with Akismet and Conditional CAPTCHA, and the comment is also queued for manual approval.
After being approved once, the same user is considered safe and can post without restrictions/checks.
Applying the above rules, I have no more spam on my blog, and I think similar logic can be used for any website.
The advantage of this approach is that most users don't even notice any security mechanism, since no CAPTCHA is displayed and nothing unusual happens 99% of the time. But there are still quite restrictive, and effective, checks going on under the hood.
I can't offer any code to help, but I'd say look at the Referer HTTP header. The initial GET request shouldn't have a Referer, but when you start loading the resources on the page (such as JavaScript, CSS, and so on) the Referer will be set to the URL that requested those resources.
So when I type in "stackoverflow.com" in my browser and hit enter, the browser will send a GET request with no Referer, like this:
GET / HTTP/1.1
Host: stackoverflow.com
# ... other Headers
When the browser loads the supporting static resources on the page, though, each request will have a Referer header, like this:
GET /style.css HTTP/1.1
Host: stackoverflow.com
Referer: http://www.stackoverflow.com
# ... other Headers
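Putting the signals mentioned in this thread together (200 status, Content-Type: text/html, no Referer), a rough sketch of the filter could look like this; it is only a heuristic, and it assumes you already have parsed request/response header pairs available:

# Heuristic: does this request/response pair look like a user-typed navigation?
def looks_user_initiated(request_headers, response_headers, status):
    if status != 200:
        return False
    content_type = response_headers.get("Content-Type", "")
    if not content_type.startswith("text/html"):
        return False
    # Top-level navigations typically carry no Referer header, while
    # images/CSS/JS/XHR loaded by the page usually do.
    return "Referer" not in request_headers

print(looks_user_initiated({"Host": "stackoverflow.com"},
                           {"Content-Type": "text/html; charset=utf-8"}, 200))  # True
print(looks_user_initiated({"Host": "stackoverflow.com",
                            "Referer": "http://www.stackoverflow.com"},
                           {"Content-Type": "text/css"}, 200))                  # False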

"URL with WWW and URL without WWW" -Is there any difference between them?

I have noticed one thing: when some websites are opened in a browser, the URL bar shows something like
http://www.something.com
whereas others are like
http://something.com
Here the www is missing. The same thing is happening with my blog URL:
If I type in the URL bar
http://www.shareprogrammingtips.com/
then it is automatically converted to
http://shareprogrammingtips.com/
I don't get why this is happening. Is there any difference between a URL with www and a URL without www?
Edit:
One more thing I have noticed is that the URL with www takes longer to open the website than the URL without www!
It does not matter whether you have www in the URL or not, as long as you always use the same URL. This is probably happening because your server is set up to redirect http://www.shareprogrammingtips.com/ to http://shareprogrammingtips.com/.
This makes sure that all pages are always served from http://shareprogrammingtips.com/ and that search engines index your site as http://shareprogrammingtips.com/. If your site were accessible from both http://www.shareprogrammingtips.com/ and http://shareprogrammingtips.com/, search engines would index both versions, but the page rank of your site would be divided between the two versions, because to search engines they are different sites.
In the past, every URL required the www. prefix (e.g. www.hello.com). Nowadays we have naked domains which don't require this prefix (e.g. hello.com). We still have many domains with the www. prefix for legacy reasons.
When a company wants to buy a domain name, it can buy it either with or without the prefix, or get both (for example, buy the naked domain and set up the www. prefix as a subdomain of it) and configure both to load the same website. There are technical reasons for choosing a domain with a www. prefix (it allows certain cookie-blocking policies) or a naked domain (shorter URL).
Usually, one of the two will be the canonical (real) domain, while the other will only redirect to the real domain. This redirect causes a delay but it's there for a reason.
If you code this redirect the right way, search engines will understand that both are the same website. Otherwise, if you skip the redirect and point both domains at your files directly, search engines will think they are separate websites, which will hurt your SEO (search engine optimization).
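For illustration, one way to "code this redirect the right way" is a permanent (301) redirect from the www host to the naked domain. The sketch below uses Flask and a made-up domain purely as an example; in practice this is usually configured in the web server or CDN rather than in application code:

from flask import Flask, redirect, request

app = Flask(__name__)
CANONICAL_HOST = "example.com"          # assumed canonical (naked) domain

@app.before_request
def force_canonical_host():
    # Strip any port before comparing host names.
    host = request.host.split(":")[0]
    if host == "www." + CANONICAL_HOST:
        url = request.url.replace("//" + host, "//" + CANONICAL_HOST, 1)
        # 301 (permanent), so crawlers treat the naked domain as the canonical site.
        return redirect(url, code=301)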