Apache dummy connections, Django, mod_wsgi - django

I'm running a site using Django in a shared environment (DreamHost), but Django 1.4 in a local environment.
Sometimes I get hit by many, many Apache dummy connections (e.g., [10/Jul/2012:00:49:16 -0700] "OPTIONS * HTTP/1.0" 200 136 "-" "Apache (internal dummy connection)"), which makes the site non-responsive (either it gets killed for resource consumption or it hits the maximum number of connections).
This does not happen on other sites on this account (though none of them run Django). I'm trying to figure out a way to prevent this from happening, but I'm not sure what troubleshooting process to use. Guidance on the process, or on common sources of this issue, would be useful.

Try:
<Limit OPTIONS>
Order allow,deny
Deny from all
</Limit>
This would cause Apache to return a 403 Forbidden for those OPTIONS requests, so they would never be handed off to the Django application (if the issue is that they are getting through to the application at the moment).
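Note that the Order/Deny syntax above is for Apache 2.2. If the host is running Apache 2.4, a rough equivalent using the newer access-control directives (placed in the same .htaccess or virtual host context) would be:
<Limit OPTIONS>
Require all denied
</Limit>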

Related

"Not Found: /406.shtml" from django

I'm running django with apache fcgi on a shared host. I've set it up to report 404 errors and keep seeing Not Found: /406.shtml via emails (I'm guessing the s is because it's https only). However I have error documents already set up in .htaccess:
ErrorDocument 406 /error/406.html
I was getting a bunch of similar 404 errors from django before setting up an ErrorDocument for each one, but it's still happening for 406. From a grep 406 through the apache error log I'm seeing an occasional 406 (not 404) error for 406.shtml, such as the following, but not nearly as often as django emails me:
[Fri ...] [error] [client ...]
ModSecurity: Access denied with code 406 (phase 1).
Pattern match "Mozilla ... AhrefsBot ...)" at REQUEST_HEADERS:User-Agent.
[file "/usr/local/apache/conf/mod_sec/mod_sec.hg.conf"] [line "126"]
[id "900165"]
[msg "AhrefsBot BOT Request"]
[hostname "www.myhostname.com"]
[uri "/406.shtml"]
[unique_id "..."]
I'm not even sure if this is apache redirecting internally to 406.shtml and it being forwarded on to django or if some bot is trying to find 406.shtml directly. The former seems to indicate a problem with ErrorDocument. The latter isn't really my problem, but then either I should be seeing a 404 for 406.shtml in the apache logs or nothing at all because django will handle the 404? How can I track it down further?
I haven't been able to reproduce the issue just by visiting my site, but I'd like to know what's going on.
You have ModSecurity installed in your Apache. It is a WAF (web application firewall) which attempts to protect your website from attacks, bots and the like. These, like email spam, are unfortunately part and parcel of running a website nowadays.
ModSecurity is an add-on module for Apache which allows you to define rules; it then runs each request against those rules and decides whether or not to block the request.
In this case a rule (900165, defined in the file "/usr/local/apache/conf/mod_sec/mod_sec.hg.conf") has decided to block this request with a 406 status based on the user agent (AhrefsBot).
Ahrefs is a service which crawls the web trying to build up a database of links. It's used by SEO people to see who links to your websites (backlinks are very important for SEO), as Google (who you would think would be the better provider of this type of information) only gives samples of links rather than full listings.
Is AhrefsBot a danger, and should it be blocked? Well, that's a matter of opinion. Assuming it really is AhrefsBot (some nefarious bots might pretend to be it so as to look legitimate, so check the IP address to see the hostname it came from), then it's probably wasting your resources without doing you much good. On the other hand, this is the price of an open web. Your website is available to the public and so also to those who write bots and tools (good or bad).
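A quick way to do that check, as a sketch (the IP below is a placeholder; substitute one taken from your own access log):
# forward-confirmed reverse DNS: does the PTR record point at an ahrefs.com
# hostname, and does that hostname resolve back to the same IP?
host 203.0.113.45
host <hostname returned by the previous command>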
Why does it return a 406? Well, that's how your ModSecurity and/or your rule is defined. Check your Apache config. 406 is a little unusual, as you would normally expect a 403 (access denied) or 500 (internal server error).
What's the 406.shtml file? That I don't get. A .shtml file is an HTML file which also allows server-side includes to embed other files and code into an HTML page. They are not used much any more, to be honest, as the likes of PHP and other languages are more common. It could be an attack: i.e. someone is attempting to upload a 406.shtml file and then cause it to be called so it "executes" and includes the contents of other files, potentially exposing files Apache can read which are not meant to be available on the webserver. Or the user has requested that file (for some reason), or Apache is configured to show it for 406 errors, or the ModSecurity rule is redirecting to that file.
Hopefully that gives a good bit of background. The best thing I can suggest is to go through your Apache config file, and any other config files it loads (including the mod_sec.hg.conf file, which it must load), to fully understand your setup, and then decide if you need to do anything here.
You could do one of several things:
Leave it as is. ModSecurity is doing what it was told to do and blocking this with a 406.
Turn off this rule and allow AhrefsBot through so you don't get alerted by this (see the sketch after this list).
Alter the ModSecurity config/rule to return an error other than 406 so you can ignore it.
Turn off ModSecurity completely. I think it is a good tool and worthwhile, but it does take some time and effort to get the most out of it.
Set up the 406 error page properly. To do that you need to understand why it's attempting to return 406.shtml at the moment.
Also, I'm not sure which of these options are available to you, as you are on a shared host and might not have full access. If so, speak to your hosting provider for advice.
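If you (or your host) can edit the ModSecurity configuration, disabling a single rule by its ID is the usual approach. A minimal sketch, assuming the rule ID 900165 from the log above (SecRuleRemoveById is generally not allowed in .htaccess, so on a shared host this typically has to be done by the provider):
# in the Apache/ModSecurity server or vhost config, after the rule files are included
SecRuleRemoveById 900165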

OpenGraph Debugger reporting bad HTTP response codes

For a number of sites that are functioning normally, when I run them through the OpenGraph debugger at developers.facebook.com/tools/debug, Facebook reports that the server returned a 502 or 503 response code.
These sites are clearly working fine on servers that are not under heavy load. URLs I've tried include but are not limited to:
http://ac.mediatemple.net
http://freespeechforpeople.org
These are in fact all sites hosted by MediaTemple. After talking to people at MediaTemple, though, they've insisted that it must be a bug in the API and is not an issue on their end. Anyone else getting unexpected 500/502/503 HTTP response codes from the Facebook Debug tool, with sites hosted by MediaTemple or anyone else? Is there a fix?
Note that I've reviewed the Apache logs on one of these and could find no evidence of Apache receiving the request from Facebook, or of a 502 response etc.
Got this response from them:
At this time, it would appear that (mt) Media Temple servers are returning 200 response codes to all requests from Facebook, including the debugger. This can be confirmed by searching your access logs for hits from the debugger. For additional information regarding viewing access logs, please review the following KnowledgeBase article:
Where are the access_log and error_log files for my server?
http://kb.mediatemple.net/questions/732/Where+are+the+access_log+and+error_log+files+for+my+server%3F#gs
You can check your access logs for hits from Facebook by using the following command:
cat <name of access log> | grep 'facebook'
This will return all hits from Facebook. In general, the debugger will specify the user-agent 'facebookplatform/1.0 (+http://developers.facebook.com),' while general hits from Facebook will specify 'facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php).'
Using this information, you can perform even further testing by using 'curl' to emulate a request from Facebook, like so:
curl -Iv -A "facebookplatform/1.0 (+http://developers.facebook.com)" http://domain.com
This should return a 200 or 206 response code.
In summary, all indications are that our servers are returning 200 response codes, so it would seem that the issue is with the way that the debugger is interpreting this response code. Bug reports have been filed with Facebook, and we are still working to obtain more information regarding this issue. We will be sure to update you as more information becomes available.
So the good news is that they are busy solving it. The bad news is that it's out of our control.
There's a forum post on the matter here:
https://forum.mediatemple.net/topic/6759-facebook-503-502-same-html-different-servers-different-results/
With more than 800 views and recent activity, it states that they are working hard on it.
I noticed that https MT sites don't even give a return code:
Error parsing input URL, no data was scraped.
RESOLUTION
MT admitted it was their fault and fixed it:
During our investigation of the Facebook debugger issue, we have found that multiple IPs used by this tool were being filtered by our firewall due to malformed requests. We have whitelisted the range of IP addresses used by the Facebook debugger tool at this time, as listed on their website, which should prevent this from occurring again.
We believe our auto-banning system has been blocking several Facebook IP addresses. This was not immediately clear upon our initial investigation and we apologize this was not caught earlier.
The reason API requests may intermittently fail is because only a handful of the many Facebook IP addresses were blocked. The API is load-balanced across several IP ranges. When our system picks up abusive patterns, like HTTP requests resulting in 404 responses or invalid PUT requests, a global firewall rule is added to mitigate the behavior. More often than not, this system works wonderfully and protects our customers from constant threats.
So, that being said, we've been in the process of whitelisting the Facebook API ranges today and confirming our system is no longer blocking these requests. We'd still like those affected to confirm whether the issue persists. If for any reason you're still having problems, please open a new support request or respond to your existing one.

Restart daemon process for django under apache for each request

I have many Django installations which must all run under a single URL. So I have a structure like
Django Installation 1
Django Installation 2
Django Installation N
under my root directory.
Now, from the URL "www.mysite.com/installation1" I pick up the subpart "installation1", set os.environ['DJANGO_SETTINGS_MODULE'] to "installation.settings", and let the request be handled. For the request "www.mysite.com/installation2" I must do the same. However, since Django caches the Site object, AppCache, etc. internally, I must restart the WSGI process before each request so that Django's internal cache gets cleared. (I know performance would not be good, but I'm not worried about that.) To implement the above scenario I have implemented the solution below:
In httpd.conf: WSGIDaemonProcess django processes=5 threads=1
In django.core.handlers.wsgi I made the following change to the __call__ method:
if environ['mod_wsgi.process_group'] != '':
    import signal, os
    print 'Sending the signal to kill the wsgi process'
    os.kill(os.getpid(), signal.SIGINT)
return response
My assumption is that the daemon process would be killed after each request, once the response has been sent. I want to confirm this assumption: will my process be killed only after my response has been sent?
Also, is there another way I can solve this problem?
Thanks
EDIT: After the suggestion to set MaxRequestsPerChild to 1, I made the following changes to httpd.conf:
KeepAlive Off
Listen 12021
MaxSpareThreads 1
MinSpareThreads 1
MaxRequestsPerChild 1
ServerLimit 1
SetEnvIf X-Forwarded-SSL on HTTPS=1
ThreadsPerChild 1
WSGIDaemonProcess django processes=5 threads=1
But my process is not getting restarted at each request. Once the process is started it keeps on processing requests. Am I missing something?
This should be possible by setting the Apache directive MaxRequestsPerChild to 1, as described in http://httpd.apache.org/docs/2.2/mod/mpm_common.html#maxrequestsperchild
However, it will be global to every daemon process.
Update:
Or you could check the maximum-requests option of WSGIDaemonProcess, as in http://code.google.com/p/modwsgi/wiki/ConfigurationGuidelines
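As a minimal sketch of that option (the process group name and counts simply mirror the question's config):
WSGIDaemonProcess django processes=5 threads=1 maximum-requests=1
With maximum-requests=1, mod_wsgi itself recycles each daemon process after it has handled a single request. Note that MaxRequestsPerChild only applies to Apache's own child processes, not to mod_wsgi daemon processes, which is likely why the edit in the question had no effect.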
Don't do this. There's a reason why WSGI processes last longer than a single request - you say you're not worried about performance, but that's still no reason to do this. I don't know why you think you need to - plenty of people run multiple Django sites on one server, and none of them have ever felt the need to kill the process after each request.
Instead you need to ensure each site runs in its own WSGI process. See the mod_wsgi documentation for how to do this: Application Groups seems like a good place to start.
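A rough sketch of what that might look like in httpd.conf, assuming two of the installations from the question (the paths, process group names and WSGI script locations are placeholders, not taken from the original setup):
WSGIDaemonProcess installation1 processes=2 threads=15
WSGIDaemonProcess installation2 processes=2 threads=15

WSGIScriptAlias /installation1 /path/to/installation1/django.wsgi
<Location /installation1>
    WSGIProcessGroup installation1
    WSGIApplicationGroup %{GLOBAL}
</Location>

WSGIScriptAlias /installation2 /path/to/installation2/django.wsgi
<Location /installation2>
    WSGIProcessGroup installation2
    WSGIApplicationGroup %{GLOBAL}
</Location>
Each installation then runs in its own daemon process group, with DJANGO_SETTINGS_MODULE set in its own WSGI script, so Django's per-process caches never have to be cleared between requests.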

twisted logging with django

My server is out in production, and I am running django on top of twisted.
I have the following for logging:
log.startLogging(sys.stdout)
...
reactor.listenTCP(DJANGO_PORT, server.Site(wsgi_root, logPath=os.path.join('./log', '.django.log')))
However, I am only seeing these in my .django.log.X files:
127.0.0.1 - - [25/Nov/2010:16:48:22 +0000] "GET /statics/css/xxx.css HTTP/1.1" 200 1110 "http://www.xxx.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12"
I know for a fact that registration is throwing a lot of errors, but the log has NOTHING about the exceptions and errors being thrown!
How can I actually output errors generated by the WSGI file?
Note: I think it has something to do with the fact that I have to change log.startLogging(sys.stdout). However, if the solution is indeed to change that, I would like to know how I can output to BOTH sys.stdout as well as to the file.
Django doesn't use Twisted's logging APIs. twisted.python.log.startLogging only configures Twisted's logging system. Django probably uses the stdlib logging module. So you'll have to configure that in order to get Django log output written somewhere useful. You see the request logs in your .django.log.X files because those are logged by the Twisted HTTP server, independently of whatever Django logs.
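A minimal sketch of both halves, assuming you can edit your Django settings module and the Twisted launcher (the log file paths are placeholders): the stdlib logging configuration catches what Django itself logs, and the extra Twisted observer duplicates Twisted's messages to a file alongside stdout.
# in settings.py (or any module imported at startup): route the stdlib logging
# module, which Django uses, to a file
import logging
logging.basicConfig(
    filename='./log/django-app.log',
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
)

# in the Twisted launcher: keep logging to stdout and add a file observer as well
import sys
from twisted.python import log
log.startLogging(sys.stdout)
log.addObserver(log.FileLogObserver(open('./log/twisted.log', 'a')).emit)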

Django - How to meter bandwidth usage on static content?

I realize this is more of a server question (since all media requests bypass Django via NGINX), but I want to know how other Django developers have been doing this, more than I care to understand only the specifics of how to do it in NGINX. I don't care about the bandwidth of HTML page requests via Django; only the bandwidth of static media files. Are those of you out there using Django and its DB to do this, or are you using web-server-specific methods? If the latter is the case, I'll head over to ServerFault.
I want to do this so I can measure the bandwidth usage on a per-subdomain (or similar method) basis.
Sorry about the non-Django approach, but since we're talking about static files, good practice is that they get served directly without ever hitting WSGI or what have you.
Apache access logs have the request size in them, so what you could do is grep out your media files and directories (e.g. cat access_log | grep "/images/\|/media/thumbs/\|jpg") and parse/sum that number with a regexp and/or awk. Here's an example access log entry (45101 being the file size in bytes):
10.0.0.123 - - [09/Sep/2010:13:30:05 -0400] "GET /media/images/mypic.jpg HTTP/1.1" 200 45101 "http://10.0.0.123/myapp" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.11) Gecko/20100701 Firefox/3.5.11"
That should get you going.
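As a rough sketch of the summing step (this assumes the standard combined log format shown above, where the response size is the 10th whitespace-separated field; nginx's default combined format has the same layout, so the same one-liner works on its access log too):
grep '/media/' access_log | awk '{ sum += $10 } END { print sum " bytes" }'
For per-subdomain numbers you would either keep a separate access log per subdomain or add the Host/vhost name to the log format and group on that field instead of a flat sum.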