I realize this is more of a server question (since all media requests bypass Django via NGINX), but I want to know how other Django developers have been doing this, more than I care to understand only the specifics of how to do it in NGINX. I don't care about the bandwidth of HTML page requests served by Django; only the bandwidth of static media files. Are those of you out there using Django and its DB to do this, or are you using web-server-specific methods? If the latter, I'll head over to ServerFault.
I want to do this so I can measure the bandwidth usage on a per-subdomain (or similar method) basis.
Sorry about the non-Django approach, but since we're talking about static files, good practice is to pass them straight through without ever hitting the WSGI layer (or whatever you have).
Apache access logs include the response size, so what you could do is grep out your media files and directories (cat access_log | grep "/images/\|/media/thumbs/\|\.jpg") and parse/sum that number with a regexp and/or awk. Here's an example access log entry (45101 being the response size in bytes):
10.0.0.123 - - [09/Sep/2010:13:30:05 -0400] "GET /media/images/mypic.jpg HTTP/1.1" 200 45101 "http://10.0.0.123/myapp" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.11) Gecko/20100701 Firefox/3.5.11"
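If you'd rather sum those byte counts in Python than with awk, here's a minimal sketch, assuming the log is in the standard combined format (the log path and media prefixes below are placeholders):

import re

LOG_PATH = "access_log"                    # placeholder: path to your access log
MEDIA_PREFIXES = ("/media/", "/images/")   # adjust to your media URL prefixes

# combined log format: ... "METHOD /path HTTP/x.y" STATUS BYTES ...
LINE_RE = re.compile(r'"\w+ (?P<path>\S+) [^"]*" \d{3} (?P<bytes>\d+)')

total = 0
with open(LOG_PATH) as log:
    for line in log:
        match = LINE_RE.search(line)
        if match and match.group("path").startswith(MEDIA_PREFIXES):
            total += int(match.group("bytes"))

print("media bytes served: %d" % total)

To break that down per subdomain you'd also need the host in each log line (e.g. something like Apache's vhost_combined format), since the default combined format doesn't record it.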
That should get you going..
I'm trying to crawl data from LinkedIn as a personal data-crawling exercise, but I cannot crawl the data without logging in. So I tried two ways to simulate logging in. One is to get the cookies via HttpClient, which performs a simulated login to obtain them; the other is just to add the cookies directly. Both failed, and I don't know why.
I used the WebMagic framework for the crawling.
Generally, adding cookies directly should be the easier way, but I don't know whether I added the wrong cookies.
Here's the thing: I want to fetch data from the website https://www.linkedin.com/mynetwork/invite-connect/connections/
And I added all the cookies from that page.
Here are all the cookies:
private Site site = Site.me()
.setRetryTimes(3)
.setSleepTime(100)
.setCharset("utf-8")
.setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36")
.addHeader("accept","text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8")
.addHeader("accept-encoding","gzip, deflate, br")
.addHeader("accept-language:en-US","en;q=0.8")
.addHeader("connection", "keep-alive")
.addHeader("referer","https://www.linkedin.com/")
.addCookie(".linkedin.com","lidc", "b=TB91:g=750:u=38:i=1503815541:t=1503895683:s=AQE5xZLW6mVmRdHBY9qNO-YOiyAnKtgk")
.addCookie(".linkedin.com","lang", "v=2&lang=en-us")
.addCookie(".linkedin.com","_lipt", "CwEAAAFeIo5-jXjgrpSKF4JfxzNbjC6328JPUgtSHQIKtSDyk4Bockuw84uMkCwbKS0TzUOM_w8Al4s9YjFFF-0T43TPtfG_wv-JNVXsPeO8mVxaYwEcTGiyOdyaRZOCIK7qi02EvZUCtjsaTpAos60U4XrFnu1FO-cY1LrzpqDNUmfrqWJPjSoZpOmjeKtTh-nHcdgpruvjf237E78dqMydLLd1A0Uu7Kr7CmNIurXFd9-Z4hwevLRd3SQMEbSRxAwCclgC4tTzEZ5KoFmpI4veKBFGOqF5MCx3hO9iNRdHrJC44hfRx-Bw7p__PYNWF8sc6yYd0deF-C5aJpronFUYp3vXiwt023qm6T9eRqVvtH1BRfLwCZOJmYrGbKzq4plzNKM7DnHKHNV_cjJQtc9aD3JQz8n2GI-cHx2PYubUyIjVWWvntKWC-EUtn4REgL4jmIaWzDUVz3nkEBW7I3Wf6u2TkuAVu9vq_0mW_dTVDCzgASk")
.addCookie(".linkedin.com","_ga", "GA1.2.2091383287.1503630105")
.addCookie(".www.linkedin.com","li_at", "AQEDAReIjksE2n3-AAABXiKOYVQAAAFeRprlVFYAV8gUt-kMEnL2ktiHZG-AOblSny98srz2r2i18IGs9PqmSRstFVL2ZLdYOcHfPyKnBYLQPJeq5SApwmbQiNtsxO938zQrrcjJZxpOFXa4wCMAuIsN")
.addCookie(".www.linkedin.com","JSESSIONID", "ajax:4085733349730512988")
.addCookie(".linkedin.com","liap", "true")
.addCookie(".www.linkedin.com","sl","v=1&f68pf")
.addCookie("www.linkedin.com","visit", "v=1&M")
.addCookie(".www.linkedin.com","bscookie", "v=1&201708250301246c8eaadc-a08f-4e13-8f24-569529ab1ce0AQEk9zZ-nB0gizfSrOSucwXV2Wfc3TBY")
.addCookie(".linkedin.com","bcookie", "v=2&d2115cf0-88a6-415a-8a0b-27e56fef9e39");
Did I miss something?
LinkedIn is very difficult to crawl, not just technically; they also sue people who do.
When they detect an IP address as a possible bot, they serve it the login page. Most IP addresses they know to be used by bots are now being served a login page, and new ranges do not last very long.
They're probably just pretty confident you're a bot and keeping you from logging in.
I have just deployed a Mezzanine instance on Elastic Beanstalk and all is working fine when DEBUG = True.
When DEBUG = False, however, I am bounced to the 500 error page whenever I am in the admin section of the site and "Save" something (a page or a blog post, for example). Other than that, the rest of the site works perfectly: it reads everything from the database, serves up compressed JS/CSS, etc.
Things I have already checked:
- ALLOWED_HOSTS is now set correctly
- There are no console errors for missing JS files
- The log file just shows what is below:
172.31.17.189 (73.222.4.136) - - [08/Jun/2016:04:09:35 +0000] "POST /admin/blog/blogpost/1/change/ HTTP/1.1" 500 6317 "http://tenzo-www.us-west-2.elasticbeanstalk.com/admin/blog/blogpost/1/change/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
I'd welcome thoughts. Perhaps you can even just tell me how to get better logging while DEBUG=False? I don't see anything in access_log or error_log that says anything about this error.
I actually managed to fix my own problem. It's a two-fold answer:
1) To enable better debugging while DEBUG=False, I suggest the following in your settings.py:
SERVER_EMAIL = 'server@xxx.com'
ADMINS = (('Name', 'xxx@xxx.com'),)
SEND_BROKEN_LINK_EMAILS = True
Then you'll get a nice email showing you the error!
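If email isn't convenient, a minimal LOGGING sketch in settings.py that writes unhandled request errors to a file should also do the trick (the file path here is just a placeholder):

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'errors_file': {
            'class': 'logging.FileHandler',
            'filename': '/var/log/app/django-errors.log',  # placeholder path
        },
    },
    'loggers': {
        # unhandled exceptions that end in 500 responses are logged here
        'django.request': {
            'handlers': ['errors_file'],
            'level': 'ERROR',
            'propagate': True,
        },
    },
}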
2) The actual error was caused by django-htmlmin, which doesn't play nicely with Mezzanine: it adds content around the response that breaks it.
Hope it helps someone.
I want to crawl a web page using the urllib2 library and extract some information I need. I am able to navigate the site freely (going from one link to another and so on), but when I try to parse it I get an error:
HTTP Error 503 : Service Temporarily Unavailable
I searched the net and found that this error occurs when the "web site's server is not available at that time".
I'm confused after reading this: if the web site's server is down, then how come it is up and running (since I am able to navigate the web page), and if the server is not down, then why am I getting this 503 error?
Is there a possibility that the server has done something to prevent the web page from being parsed?
Thanks in advance.
Most probably your user agent is banned by the server, precisely to keep out, well, web crawlers. Some websites, including Wikipedia, return a 50x error when an unwanted user agent (such as wget, curl, urllib, …) is used.
However, changing the user agent might be enough. At least, that's the case for Wikipedia, which works just fine with a Firefox user agent. (The "ban" most probably relies only on the user agent.)
Finally, there must be a reason for those websites to ban web crawlers. Depending on what you're working on, you might want to use another solution. For example, Wikipedia provides database dumps, which can be convenient if you intend to make intensive use of it.
PS. Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11 is the user agent I use for Wikipedia in a project of mine.
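For reference, a minimal urllib2 sketch that sends a browser-like User-Agent header (the URL is just a placeholder):

import urllib2

url = "http://example.com/some/page"  # placeholder URL
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; "
                  "rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11",
}

# build the request with the custom header and fetch the page
request = urllib2.Request(url, headers=headers)
html = urllib2.urlopen(request).read()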
I'm running a site using Django in a shared environment (Dreamhost), but 1.4 in a local environment.
Sometimes I get hit by many, many Apache dummy connections (e.g., [10/Jul/2012:00:49:16 -0700] "OPTIONS * HTTP/1.0" 200 136 "-" "Apache (internal dummy connection)"), which makes the site non-responsive (either it gets killed for resource consumption or it hits the max connection limit).
This does not happen on other sites on this account (though none of them are running Django). I'm trying to figure out a way to prevent this from happening, but I'm not sure what troubleshooting process to use. Guidance on process, or on common sources of this issue, would be useful.
Try:
<Limit OPTIONS>
Order allow,deny
Deny from all
</Limit>
This causes Apache to return a 403 Forbidden for OPTIONS requests rather than handing them off to any Django application, if the issue is that they are currently getting through to the application.
My server is out in production, and I am running Django on top of Twisted.
I have the following for logging:
log.startLogging(sys.stdout)
...
reactor.listenTCP(DJANGO_PORT, server.Site(wsgi_root, logPath=os.path.join('./log', '.django.log')))
However, I am only seeing these in my .django.log.X files:
127.0.0.1 - - [25/Nov/2010:16:48:22 +0000] "GET /statics/css/xxx.css HTTP/1.1" 200 1110 "http://www.xxx.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12"
I know for a fact that registration is throwing a lot of errors, but the log has NOTHING about the exceptions and errors being thrown!
How can I actually output errors generated by the WSGI file?
Note: I think it has something to do with the fact that I have to change log.startLogging(sys.stdout). However, if the solution is indeed to change that, I would like to know how I can output to BOTH sys.stdout and the file.
Django doesn't use Twisted's logging APIs. twisted.python.log.startLogging only configures Twisted's logging system. Django probably uses the stdlib logging module. So you'll have to configure that in order to get Django log output written somewhere useful. You see the request logs in your .django.log.X files because those are logged by the Twisted HTTP server, independently of whatever Django logs.
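As an illustration, here's a minimal sketch of configuring the stdlib logging module at startup (e.g. in the script that starts the reactor) so that records go to both stdout and a file; the file path is a placeholder, and this assumes your application errors are actually routed through stdlib logging:

import logging
import sys

# send stdlib logging output to both stdout and a log file
root = logging.getLogger()
root.setLevel(logging.DEBUG)

formatter = logging.Formatter('%(asctime)s %(levelname)s %(name)s: %(message)s')
for handler in (logging.StreamHandler(sys.stdout),
                logging.FileHandler('./log/django-app.log')):  # placeholder path
    handler.setFormatter(formatter)
    root.addHandler(handler)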