"Not Found: /406.shtml" from django - django

I'm running Django with Apache FCGI on a shared host. I've set it up to report 404 errors and I keep seeing Not Found: /406.shtml via emails (I'm guessing the s is because it's https only). However, I already have error documents set up in .htaccess:
ErrorDocument 406 /error/406.html
I was getting a bunch of similar 404 errors from Django before setting up an ErrorDocument for each one, but it's still happening for 406. From grepping for 406 in the Apache error log I'm seeing an occasional 406 (not 404) error for 406.shtml, such as the following, but not nearly as often as Django emails me:
[Fri ...] [error] [client ...]
ModSecurity: Access denied with code 406 (phase 1).
Pattern match "Mozilla ... AhrefsBot ...)" at REQUEST_HEADERS:User-Agent.
[file "/usr/local/apache/conf/mod_sec/mod_sec.hg.conf"] [line "126"]
[id "900165"]
[msg "AhrefsBot BOT Request"]
[hostname "www.myhostname.com"]
[uri "/406.shtml"]
[unique_id "..."]
I'm not even sure if this is Apache redirecting internally to 406.shtml and forwarding that on to Django, or if some bot is trying to fetch 406.shtml directly. The former would indicate a problem with ErrorDocument. The latter isn't really my problem, but then I should either be seeing a 404 for 406.shtml in the Apache logs or nothing at all, because Django would handle the 404. How can I track it down further?
I haven't been able to reproduce the issue just by visiting my site, but I'd like to know what's going on.

You have ModSecurity installed in your Apache. It's a WAF (web application firewall) which attempts to protect your website from attacks, bots and the like. These, like email spam, are unfortunately part and parcel of running a website nowadays.
ModSecurity is an add-on module for Apache which allows you to define rules; it then runs each request against those rules and decides whether or not to block the request.
In this case a rule (900165, defined in the file /usr/local/apache/conf/mod_sec/mod_sec.hg.conf) has decided to block this request with a 406 status based on the user agent (AhrefsBot).
Ahrefs is a service which crawls the web to build up a database of links. It's used by SEO people to see who links to your website (backlinks are very important to SEO), since Google (who you would think would be the better provider of this type of information) only gives samples of links rather than full listings.
Is AhrefsBot a danger and should it be blocked? Well, that's a matter of opinion. Assuming it really is AhrefsBot (some nefarious bots pretend to be it so as to look legitimate, so check the IP address to see the hostname it came from), it's probably wasting your resources without doing you much good. On the other hand, this is the price of an open web: your website is available to the public, and so also to those who write bots and tools (good or bad).
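If you want to verify that a request really came from AhrefsBot, the usual technique is a forward-confirmed reverse DNS lookup. A minimal Python sketch (the IP below is a documentation placeholder, and the .ahrefs.com suffix is an assumption; verify the expected hostname suffix against Ahrefs' current docs):
import socket

def is_real_bot(ip, expected_suffix):
    # Reverse lookup: IP -> hostname
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.error:
        return False
    if not hostname.endswith(expected_suffix):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP,
    # otherwise the reverse record could be spoofed.
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.error:
        return False

print(is_real_bot('203.0.113.5', '.ahrefs.com'))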
Why does it return a 406? Well, that's how your ModSecurity and/or your rule is defined, so check your Apache config. 406 is a little unusual, as you would normally expect a 403 (access denied) or 500 (internal server error).
What's the 406.shtml file? That I don't get. A .shtml file is an HTML file which also allows server-side includes, embedding other files and code into the HTML. They are not used much any more, to be honest, as PHP and other languages are more common. It could be an attack, i.e. someone attempting to upload a 406.shtml file and then cause it to be called so it "executes" and includes the contents of other files, potentially exposing files Apache can see which are not published on the webserver. Or the user requested that file (for some reason), or Apache is configured to show it for 406 errors, or the ModSecurity rule is redirecting to it.
Hopefully that gives a good bit of background. The best thing I can suggest is to go through your Apache config file, and any other config files it loads (including the mod_sec.hg.conf file, which it must load), to fully understand your setup, and then decide if you need to do anything here.
You could do one of several things:
Leave it as is. ModSecurity is doing what it was told to do and blocking this with a 406.
Turn off this rule and allow AhrefsBot through so you don't get alerted by this.
Alter the ModSecurity config/rule to return an error other than 406 so you can ignore it.
Turn off ModSecurity completely. I think it is a good tool and worthwhile, but it does take some time and effort to get the most out of it.
Set up the 406 error page properly. To do that you need to understand why it's attempting to return 406.shtml at the moment.
Also, I'm not sure which of these options are available to you, as you are on a shared host and might not have full access. If so, speak to your hosting provider for advice.
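One more option on the Django side: if the main nuisance is the email for every 404, Django can be told to skip reporting particular URLs. A minimal sketch, assuming a Django version that supports IGNORABLE_404_URLS (older releases used IGNORABLE_404_ENDS/IGNORABLE_404_STARTS instead):
# settings.py
import re

# Don't email admins when these URLs 404; this covers Apache's
# error documents like 406.shtml being passed through to Django.
IGNORABLE_404_URLS = [
    re.compile(r'^/40\d\.shtml$'),
]
This only silences the emails, of course; it doesn't explain why the requests reach Django in the first place.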

Related

Django add exemption to Same Origin Policy (only for one site)

I am getting errors thrown from google.g.doubleclick.net when I try to load Google ads on my site through plain HTML.
Blocked a frame with origin "https://googleads.g.doubleclick.net" from accessing a frame with origin "https://example.com". Protocols, domains, and ports must match.
Oddly enough I have a section of my site where I add some ads through javascript and that section does not throw any errors.
I read about adding a crossdomain.xml to the site root and I tried that (and also tried serving it with NGINX), but that does not work either.
Is there any way to add an exception to Django's CSRF rules, or any other way to get around this? It is driving me nuts. This error is only thrown in Safari (I've only tried Safari and Chrome), but it adds a LOT to the data transfer for loading the page and I do not want things to be slowed down.
This has nothing to do with CSRF; rather, it has to do with the same-origin policy security restriction, which you can fix by implementing CORS and sending the appropriate headers.
You can use django-cors-headers to help with this.
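A minimal settings.py sketch for django-cors-headers (note that setting names have changed across versions; newer releases use CORS_ALLOWED_ORIGINS where older ones used CORS_ORIGIN_WHITELIST, so check the docs for the version you install):
# settings.py
INSTALLED_APPS = [
    # ...
    'corsheaders',
]

MIDDLEWARE = [
    'corsheaders.middleware.CorsMiddleware',  # place as high as possible
    # ...
]

# Allow only the ad-serving origin rather than opening up to everyone.
CORS_ALLOWED_ORIGINS = [
    'https://googleads.g.doubleclick.net',
]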

Website is up and running but parsing it results in HTTP Error 503

I want to crawl a webpage using the urllib2 library and extract some information according to my need. I am able to freely navigate the site (going from one link to another and so on), but when I try to parse it I am getting an error
HTTP Error 503 : Service Temporarily Unavailable
I searched about it on the net and found out that this error occurs when the "web site's server is not available at that time"
I am confused after reading this: if the website's server is down, then how come it's up and running (since I am able to navigate the webpage)? And if the server is not down, then why am I getting this 503 error?
Is there a possibility that the server has done something to prevent the parsing of the web page?
Thanks in advance.
Most probably your user-agent is banned from the server, so as to keep out, well, web crawlers. That is why some websites, including Wikipedia, return a 50x error when an unwanted user-agent (such as wget, curl, urllib, …) is used.
However, changing the user-agent might be enough. At least, that's the case for Wikipedia, which works just fine with a Firefox user-agent. (The "ban" most probably relies only on the user-agent.)
Finally, there must be a reason for those websites to ban web crawlers. Depending on what you're working on, you might want to use another solution. For example, Wikipedia provides database dumps, which can be convenient if you intend to make intensive use of it.
PS. Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11 is the user-agent I use for wikipedia on a project of mine.
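Plugging that user-agent into urllib2 is straightforward; a minimal sketch (Python 2, since urllib2 is a Python 2 module):
import urllib2

url = 'http://en.wikipedia.org/wiki/Main_Page'
# Pretend to be a regular browser instead of urllib2's default agent.
req = urllib2.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; '
                  'rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11',
})
html = urllib2.urlopen(req).read()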

How to identify GET/POST requests made by a human, ignoring any requests that follow

I'm writing an application that listens to HTTP traffic and tries to recognize which requests were initiated by a human.
For example:
The user types cnn.com in their address bar, which starts a request. Then I want to find CNN's server response while discarding any other requests (such as XHR, etc.).
How can you tell from the header information which is which?
After doing some research I've found that relevant responses come with:
Content-Type: text/html
HTML that has a meaningful title
Status 200 OK
There is no way to tell from the bits on the wire. The HTTP protocol has a defined format, which all (non-broken) user agents adhere to.
You are probably thinking that the translation of a user's typing of just 'cnn.com' into 'http://www.cnn.com/' on the wire can be detected from the protocol payload. The answer is no, it can't.
To detect the user agent allowing the user such shorthand, you would have to snoop the user agent application (e.g. a browser) itself.
Actually, detecting non-human agency is the interesting problem (with spam detection as one obvious motivation). This is because HTTP belongs to the family of NVT protocols, where the basic idea, believe it or not, is that a human should be able to run the protocol "by hand" in a network terminal/console program (such as a telnet client.) In other words, the protocol is basically designed as if a human were using it.
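To see what running the protocol "by hand" means, here is the whole exchange done over a raw socket, which is exactly what you would otherwise type into a telnet client (a minimal sketch):
import socket

s = socket.create_connection(('example.com', 80))
# This is all HTTP is on the wire: lines of text, then a blank line.
s.sendall(b'GET / HTTP/1.1\r\n'
          b'Host: example.com\r\n'
          b'Connection: close\r\n'
          b'\r\n')
response = b''
while True:
    chunk = s.recv(4096)
    if not chunk:
        break
    response += chunk
s.close()
print(response.decode('utf-8', 'replace'))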
I don't think header information is sufficient to distinguish real users from bots, since bots are made to mimic real users and headers are very easy to imitate.
One thing you can do is track the path (sequence of clicks) followed by a user, which is likely to differ from one made by a bot, and do some analysis on the posted information (e.g. Bayesian filters).
A very easy check to implement is based on the source IP. There are databases of blacklisted IP addresses; see Project Honeypot. And if you are writing your software in Java, here is an example of how to check an IP address: How to query HTTP:BL for spamming IP addresses.
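The HTTP:BL lookup itself is just a DNS query, so it is easy in any language. A minimal Python sketch, assuming you have a (free) Project Honeypot access key; the octet meanings below follow their published response format, so double-check against their docs:
import socket

def httpbl_lookup(ip, access_key):
    # The IP's octets are reversed, as with other DNS blacklists.
    reversed_ip = '.'.join(reversed(ip.split('.')))
    query = '%s.%s.dnsbl.httpbl.org' % (access_key, reversed_ip)
    try:
        answer = socket.gethostbyname(query)
    except socket.gaierror:
        return None  # NXDOMAIN means the IP is not listed
    # Listed: 127.<days since last activity>.<threat score>.<visitor type>
    _, days, threat, visitor_type = (int(o) for o in answer.split('.'))
    return {'days': days, 'threat': threat, 'type': visitor_type}

# The key below is a placeholder, not a real access key.
print(httpbl_lookup('203.0.113.5', 'abcdefghijkl'))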
What I do on my blog is this (using WordPress plugins):
Check whether the IP address is in the HTTP:BL; if it is, the user is shown an HTML page with instructions to whitelist their IP address. This is done in WordPress by the Bad Behavior plugin.
When the user submits some content, a Bayesian filter checks the submission, and if the comment is identified as spam, a captcha is displayed before the submission completes. This is done with Akismet and a conditional captcha, and the comment is also enqueued for manual approval.
After being approved once, the same user is considered safe and can post without restrictions/checks.
Applying the above rules, I have no more spam on my blog, and I think that a similar logic can be used for any website.
The advantage of this approach is that most users don't even notice any security mechanism, since no captcha is displayed and nothing unusual happens 99% of the time. But quite restrictive, and effective, checks are still going on under the hood.
I'd say look at the Referer HTTP header. The initial GET request shouldn't have a Referer, but when the browser starts loading the resources on the page (such as JavaScript, CSS, and so on) the Referer will be set to the URL that requested those resources.
So when I type in "stackoverflow.com" in my browser and hit enter, the browser will send a GET request with no Referer, like this:
GET / HTTP/1.1
Host: stackoverflow.com
# ... other Headers
When the browser loads the supporting static resources on the page, though, each request will have a Referer header, like this:
GET /style.css HTTP/1.1
Host: stackoverflow.com
Referer: http://www.stackoverflow.com
# ... other Headers
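A minimal Python sketch of that heuristic, combined with the content-type and status checks mentioned in the question (the function and its arguments are illustrative; adapt them to whatever your sniffing library hands you, and remember that bots can fake every one of these headers):
def looks_human_initiated(method, req_headers, status, resp_headers):
    # Heuristic only: a top-level page load is typically a GET with no
    # Referer that returns 200 and an HTML body.
    if method != 'GET':
        return False
    if 'Referer' in req_headers:  # sub-resources carry a Referer
        return False
    content_type = resp_headers.get('Content-Type', '')
    return status == 200 and content_type.startswith('text/html')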

OpenGraph Debugger reporting bad HTTP response codes

For a number of sites that are functioning normally, when I run them through the OpenGraph debugger at developers.facebook.com/tools/debug, Facebook reports that the server returned a 502 or 503 response code.
These sites are clearly working fine on servers that are not under heavy load. URLs I've tried include but are not limited to:
http://ac.mediatemple.net
http://freespeechforpeople.org
These are in fact all sites hosted by MediaTemple. After talking to people at MediaTemple, though, they've insisted that it must be a bug in the API and not an issue on their end. Is anyone else getting unexpected 500/502/503 HTTP response codes from the Facebook Debug tool, with sites hosted by MediaTemple or anyone else? Is there a fix?
Note that I've reviewed the Apache logs on one of these and could find no evidence of Apache receiving the request from Facebook, or of a 502 response etc.
Got this response from them:
At this time, it would appear that (mt) Media Temple servers are returning 200 response codes to all requests from Facebook, including the debugger. This can be confirmed by searching your access logs for hits from the debugger. For additional information regarding viewing access logs, please review the following KnowledgeBase article:
Where are the access_log and error_log files for my server?
http://kb.mediatemple.net/questions/732/Where+are+the+access_log+and+error_log+files+for+my+server%3F#gs
You can check your access logs for hits from Facebook by using the following command:
cat <name of access log> | grep 'facebook'
This will return all hits from Facebook. In general, the debugger will specify the user-agent 'facebookplatform/1.0 (+http://developers.facebook.com),' while general hits from Facebook will specify 'facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php).'
Using this information, you can perform even further testing by using 'curl' to emulate a request from Facebook, like so:
curl -Iv -A "facebookplatform/1.0 (+http://developers.facebook.com)" http://domain.com
This should return a 200 or 206 response code.
In summary, all indications are that our servers are returning 200 response codes, so it would seem that the issue is with the way that the debugger is interpreting this response code. Bug reports have been filed with Facebook, and we are still working to obtain more information regarding this issue. We will be sure to update you as more information becomes available.
So the good news is that they are busy solving it. The bad news: it's out of our control.
There's a forum post on the matter here:
https://forum.mediatemple.net/topic/6759-facebook-503-502-same-html-different-servers-different-results/
With more than 800 views and recent activity, it shows that they are working hard on it.
I noticed that https MT sites don't even give a return code:
Error parsing input URL, no data was scraped.
RESOLUTION
MT admitted it was their fault and fixed it:
During our investigation of the Facebook debugger issue, we have found that multiple IPs used by this tool were being filtered by our firewall due to malformed requests. We have whitelisted the range of IP addresses used by the Facebook debugger tool at this time, as listed on their website, which should prevent this from occurring again.
We believe our auto-banning system has been blocking several Facebook IP addresses. This was not immediately clear upon our initial investigation and we apologize this was not caught earlier.
The reason API requests may intermittently fail is because only a handful of the many Facebook IP addresses were blocked. The API is load-balanced across several IP ranges. When our system picks up abusive patterns, like HTTP requests resulting in 404 responses or invalid PUT requests, a global firewall rule is added to mitigate the behavior. More often than not, this system works wonderfully and protects our customers from constant threats.
So, that being said, we've been in the process of whitelisting the Facebook API ranges today and confirming our system is no longer blocking these requests. We'd still like those affected to confirm if the issue persists. If for any reason you're still having problems, please open up or respond to your existing support request.

How can I do an HTTP redirect in C++

I'm making an HTTP server in C++. I notice that the way Apache works is that if you request a directory without adding a forward slash at the end, Firefox still somehow knows that it's a directory you are requesting (which seems impossible for Firefox to work out on its own, which is why I'm assuming Apache is doing a redirect).
Is that assumption right? Does Apache check to see that you are requesting a directory and then do an HTTP redirect to a request with the forward slash? If that is how Apache works, how do I implement that in C++? Thanks to anyone who replies.
Determine if the resource represents a directory; if so, reply with:
HTTP/1.X 301 Moved Permanently
Location: URI-including-trailing-slash
Using 301 allows user agents to cache the redirect.
If you wanted to do this, you would:
call stat on the pathname
determine that it is a directory
send the necessary HTTP response for a redirect (sketched below)
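Since this is protocol mechanics rather than anything C++-specific, a Python sketch of those three steps is easy to translate into a C++ server (the paths and response format here are illustrative):
import os

def maybe_redirect(request_path, docroot):
    # Serve normally if the path already ends in a slash.
    if request_path.endswith('/'):
        return None
    full = os.path.join(docroot, request_path.lstrip('/'))
    if os.path.isdir(full):  # the stat() check
        # 301 so user agents can cache the redirect
        return ('HTTP/1.1 301 Moved Permanently\r\n'
                'Location: %s/\r\n'
                'Content-Length: 0\r\n'
                '\r\n' % request_path)
    return None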
I'm not at all sure that you need to do this. Install the Firefox 'web developer' add-on to see exactly what goes back and forth.
Seriously, this should not be a problem. Suggestions for how to proceed:
Get the source code for Apache and look at what it does
Build a debug build of Apache and step through the code in a debugger in such a case; examine which pieces of code get run.
Install Wireshark (network analysis tool), Live HTTP Headers (Firefox extension) etc, and look at what's happening on the network
Read the relevant RFCs for HTTP - which presumably you should be keeping under your pillow anyway if you're writing a server.
Once you've done those things, it should be obvious how to do it. If you can't do those things, you should not be trying to develop a web server in C++.
The assumption is correct. Make sure your response includes a Location header pointing to the URL with the trailing slash and a legal 301/302 status line. This is not a C++ question; it is more of an HTTP protocol question. Since you are trying to write an HTTP server, as one of the other posts suggests, read the RFC.
You should install Fiddler and observe the HTTP headers sent by other web servers.
Your question is impossible to answer precisely without more details, but you want to send an HTTP 3xx status code with a Location header.