I am trying to get a website's source in Python. My code is:
import urllib2
response = urllib2.urlopen("http://kolekcjoner.nbp.pl")  # fails for this URL only
html = response.read()
I have a problem with this one website only; everything else works, e.g. Google. What is strange is that I was fetching data from this website three days ago, but now this code is not working. Why? What has changed?
The problem is that you are trying to get a page that doesn't exist...
As you can see, the error is:
urllib2.HTTPError: HTTP Error 404: Not Found
You can use try and except, or use another module that doesn't raise an exception on every HTTP error code (like the requests module).
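For example, a minimal sketch of the try/except approach with urllib2:

import urllib2

try:
    response = urllib2.urlopen("http://kolekcjoner.nbp.pl")
    html = response.read()
except urllib2.HTTPError as e:
    # urlopen raises HTTPError for non-2xx responses instead of returning them
    print "got HTTP error:", e.code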
Update: after checking a bit, I found that the address you gave works properly in the browser, so only the requests sent by Python get a 404. This means the server is checking the User-Agent of the request, and if the User-Agent isn't allowed/known, the server will return an error code (e.g. 404). So I checked whether that is true by changing the User-Agent field:
>>> requests.get("https://kolekcjoner.nbp.pl/")
<Response [404]>
>>> requests.get("https://kolekcjoner.nbp.pl/",headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103'})
<Response [200]>
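The same trick applied to the original urllib2 code, as a rough sketch (any browser-like User-Agent string should do):

import urllib2

req = urllib2.Request("http://kolekcjoner.nbp.pl",
                      headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                                             'Chrome/51.0.2704.103'})
html = urllib2.urlopen(req).read()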
I hope that this helps you (anyway, you should know that the robots.txt of this site disallows any robot from crawling it...)
I have recently created an API for internal use in my company. Only my colleagues and I have the URL.
A few days ago, I detected that random requests were occurring to a given method of the API (less than once per day), so I logged accesses to that method, and this is what I am getting:
2017-06-18 17:10:00,359 INFO (default task-427) 85.52.215.80 - Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
2017-06-20 07:25:42,273 INFO (default task-614) 85.52.215.80 - Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
The request to the API is performed with the full set of parameters (I mean, it's not just to the root of the web service).
Any idea what could be going on?
I have several hypotheses:
- A team member has a browser tab open with the method's request URL, and it reloads every time he opens the browser. This is my favourite, but everybody claims it's not their fault.
- A team member has the service URL (with all parameters) in their browser history, and the browser randomly queries it to retrieve the favicon.
- A team member has the service URL (with all parameters) in their browser favourites/bookmarks, and the browser randomly queries it to retrieve the favicon.
While the User-Agent (Google Favicon) seems to suggest one of the two latter options, the IP (located near our own city, on the Orange Spain ISP) seems to suggest the first one: after a quick search on the Internet, I've found that everybody else seeing such requests seems to get them from a Google IP in California.
I know I could just block that User-Agent or IP, but I'd really like to get to the bottom of this issue.
Thanks!
Edit:
Now I am getting User-Agents such as:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/41.0.2272.118 Safari/537.36
as well :/
Both of these User-Agents are associated with Google's Fetch and Render tool in Google Search Console. These User-Agents make requests when someone asks Google to Fetch and Render a given page for SEO validation. That doesn't quite make sense here, considering you are asking about an API and not a page; but perhaps a page that was submitted to the Fetch and Render service called the API?
Specifically, I'm trying to scrape this entire page, but am only getting a portion of it. If I use:
r = requests.get('http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120')
it only gets the "visible" part of the page, since more items load as you scroll downwards.
I know there are some solutions in PyQt, such as this one, but is there a way to have Python requests continuously scroll to the bottom of a webpage until all items load?
You could monitor the page's network activity with the browser developer console (F12 → Network in Chrome) to see what request the page makes when you scroll down, then reproduce that request with requests. Alternatively, you can use Selenium to control a browser programmatically, scrolling down until the page ends and then saving its HTML, as in the sketch below.
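A rough sketch of the Selenium approach (it assumes Chrome with a matching chromedriver is installed, and the sleep time may need tuning for slow connections):

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120')

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # scroll to the bottom and give the page time to load more items
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # no new content appeared, we are done
        break
    last_height = new_height

html = driver.page_source
driver.quit()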
I think I found the right request:
Request URL: http://store.nike.com/html-services/gridwallData?country=US&lang_locale=en_US&gridwallPath=mens-shoes/7puZoi3&pn=3
Request Method: GET
Status Code: 200 OK
Remote Address: 87.245.221.98:80
Request Headers
Accept: application/json, text/javascript, */*; q=0.01
Referer: http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
X-NewRelic-ID: VQYGVF5SCBAJVlFaAQIH
X-Requested-With: XMLHttpRequest
It seems that the query parameter pn selects the current "subpage", so you can page through the full list by incrementing it. But you still need to interpret the response correctly.
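A rough sketch of paging through the data that way (I'm assuming the endpoint keeps answering 200 with JSON until the pages run out; the stop condition may need adjusting once you've looked at an actual response):

import requests

url = ('http://store.nike.com/html-services/gridwallData'
       '?country=US&lang_locale=en_US&gridwallPath=mens-shoes/7puZoi3&pn=%d')
headers = {'X-Requested-With': 'XMLHttpRequest',
           'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}

pn = 1
while True:
    r = requests.get(url % pn, headers=headers)
    if r.status_code != 200:
        break
    data = r.json()
    # inspect data here to pull out the products; you may also want to
    # stop as soon as a page comes back with no products in it
    pn += 1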
I have just deployed a Mezzanine instance on Elastic Beanstalk and all is working fine when DEBUG = True.
When DEBUG = False however, I am bounced to the 500 error page whenever I am in the admin section of the site and "Save" something (a page, or blog for example). Other than that, the rest of the site works perfectly - it's reading everything from the database, serving up compressed JS/CSS etc.
Things I have already checked:
- ALLOWED_HOSTS is now set correctly
- There are no console errors for missing JS files
- The log file just shows what is below:
172.31.17.189 (73.222.4.136) - - [08/Jun/2016:04:09:35 +0000] "POST /admin/blog/blogpost/1/change/ HTTP/1.1" 500 6317 "http://tenzo-www.us-west-2.elasticbeanstalk.com/admin/blog/blogpost/1/change/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
I'd welcome thoughts. Perhaps you can even just tell me how to get better logging while DEBUG=False? I don't see anything in access_log or error_log about this error.
I actually managed to fix my own problem; the answer is twofold:
1) To enable better debugging while DEBUG=False, I suggest the following in your settings.py:
SERVER_EMAIL = 'server@xxx.com'
ADMINS = (('Name', 'xxx@xxx.com'),)
SEND_BROKEN_LINK_EMAILS = True
Then you'll get a nice email showing you the error!
2) The actual error was caused by django-htmlmin, which doesn't play nicely with Mezzanine: it adds content around the response that breaks it. Removing it from the middleware stack fixed the 500s; see the sketch below.
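A sketch of the settings change (the middleware names are the ones django-htmlmin's README documents; adjust them to whatever your settings.py actually lists):

# settings.py -- remove (or comment out) the django-htmlmin entries:
MIDDLEWARE_CLASSES = (
    # ...
    # 'htmlmin.middleware.HtmlMinifyMiddleware',   # removed
    # 'htmlmin.middleware.MarkRequestMiddleware',  # removed
    # ...
)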
Hope it helps someone.
I am trying to retrieve a public Facebook photo album for display on a website.
I can access the list of photo albums and the individual albums using the Facebook Graph API without an access token;
see https://graph.facebook.com/vailresorts/albums
I've tried that from 4 different servers on different networks, none of them logged in to Facebook, and it works fine. However, when I run it on my test server, it doesn't work and I get the
OAuthException - An access token is required to request this resource.
error.
I'm wondering why requesting that URL from different places gives different results.
I was under the impression that if the album is public, I don't need an App ID and App Secret.
Is this not true?
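If a token does turn out to be required, this is the fallback I would try: an app access token in the documented app_id|app_secret form (a sketch using the requests module; {app-id} and {app-secret} are placeholders):

import requests

# app access token built from placeholders -- fill in real values
token = '{app-id}|{app-secret}'
r = requests.get('https://graph.facebook.com/vailresorts/albums',
                 params={'access_token': token})
print(r.json())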
Request header:
Request URL: https://graph.facebook.com/vailresorts/albums
Request Method: GET
Status Code: 200 OK
Request Headers
accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
accept-charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
accept-encoding: gzip,deflate,sdch
accept-language: en-US,en;q=0.8
cache-control: no-cache
host: graph.facebook.com
method: GET
pragma: no-cache
scheme: https
url: /vailresorts/albums
user-agent: Mozilla/5.0 (Windows NT 5.2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
version: HTTP/1.1
Just to throw in a 'me too': the failures do not happen with every request. This bug may be related:
https://developers.facebook.com/bugs/410815249017032
I get the error:
{"error":{"message":"An access token is required to request this resource.","type":"OAuthException","code":104}}
Edit: sorry, I didn't mean to post this as an answer; I wanted to provide it as a comment but don't yet know how to do that.
Update: I filed a new bug attempting to get their attention https://developers.facebook.com/bugs/444360005657309
My server is in production, and I am running Django on top of Twisted.
I have the following for logging:
log.startLogging(sys.stdout)
...
reactor.listenTCP(DJANGO_PORT, server.Site(wsgi_root, logPath=os.path.join('./log', '.django.log')))
However, I am only seeing these in my .django.log.X files:
127.0.0.1 - - [25/Nov/2010:16:48:22 +0000] "GET /statics/css/xxx.css HTTP/1.1" 200 1110 "http://www.xxx.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12"
I know for a fact that registration is throwing a lot of errors, but then the log has NOTHING about the exceptions and errors being thrown!
How can I actually output errors generated by the WSGI file?
Note: I think it has something to do with the fact that I have to change log.startLogging(sys.stdout). However, if the solution is indeed to change that, I would like to know how I can output to BOTH sys.stdout and the file.
Django doesn't use Twisted's logging APIs. twisted.python.log.startLogging only configures Twisted's logging system. Django probably uses the stdlib logging module. So you'll have to configure that in order to get Django log output written somewhere useful. You see the request logs in your .django.log.X files because those are logged by the Twisted HTTP server, independently of whatever Django logs.
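A minimal sketch of wiring that up with the stdlib logging module, sending output to both stdout and a file (the path and level are placeholders to adjust):

import logging
import sys

root = logging.getLogger()
root.setLevel(logging.DEBUG)
# one handler for stdout, one for a file, so messages land in both places
root.addHandler(logging.StreamHandler(sys.stdout))
root.addHandler(logging.FileHandler('./log/django-errors.log'))

Run this once at startup, before the reactor starts serving requests; anything Django sends through logging will then show up in both places.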