The Walmart scraper itself runs fine; I only want to get the titles of the items on this page.
In scrapy shell, using the view(response) function reveals a web page that says "Your web browser is not accepting cookies," even when I add USER_AGENT information to the scrapy shell launch.
As a result, the scraper doesn't manage to scrape any information. Things that I have changed:
COOKIES_ENABLED = True
COOKIES_DEBUG = True
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
DOWNLOADER_MIDDLEWARES = {'walmartscraper.middlewares.WalmartscraperDownloaderMiddleware': 543,}
I have a feeling I need to add/change something in the middlewares section (it is still the default code) and/or implement requests somewhere. This is the first time I've worked with cookies while scraping and the information I've found hasn't helped me figure this out.
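For example, I imagine something along these lines, though I don't know if it's right; a rough sketch (the URL and the CSS selector are placeholders, not the real page structure):
import scrapy

class WalmartSpider(scrapy.Spider):
    name = "walmart"

    def start_requests(self):
        # Send browser-like headers on the request itself instead of
        # relying only on the USER_AGENT setting.
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
        }
        yield scrapy.Request(
            "https://www.walmart.com/browse/placeholder-category",  # placeholder URL
            headers=headers,
            callback=self.parse,
        )

    def parse(self, response):
        # Placeholder selector; adjust to the real page markup.
        for title in response.css("a.product-title-link::text").getall():
            yield {"title": title.strip()}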
Any advice is very much appreciated. Thank you.
Problem 1: (resolved - thanks @Ranjith Thangaraju)
I tried to access this website via Postman, but I couldn't because I got an error: https://i.stack.imgur.com/Dmfj8.png
When I try to access it in Chrome, there's no restriction at all - I can access it: https://finance.vietstock.vn/
Could someone please explain or help with this?
I'm sorry if someone else has had the same issue and it's already fixed; if you see a similar question, please point me in its direction.
Problem 2:
When I access this page [https://finance.vietstock.vn/CEO/phan-tich-ky-thuat.htm], there is one API that I've tried to call from Postman but couldn't. Could you please point me to a solution?
Chrome: https://i.stack.imgur.com/RTfsM.png
Postman: https://i.stack.imgur.com/2P2Qe.png
Go to Headers -> Click on Bulk Edit
Add the Following Lines
Host: finance.vietstock.vn
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36
Then Hit Send!! ;)
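If you'd rather reproduce the fix outside Postman, the same two headers can be sent from Python with the requests library; a minimal sketch (the URL is the page from the question - substitute the actual API endpoint you captured in DevTools):
import requests

# Same two headers as the Bulk Edit fix above.
headers = {
    "Host": "finance.vietstock.vn",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36",
}

resp = requests.get("https://finance.vietstock.vn/CEO/phan-tich-ky-thuat.htm", headers=headers)
print(resp.status_code)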
I have recently created an API for internal use in my company. Only my colleagues and I have the URL.
A few days ago, I noticed that random requests were occurring to a given method of the API (less than once per day), so I logged accesses to that method, and this is what I'm getting:
2017-06-18 17:10:00,359 INFO (default task-427) 85.52.215.80 - Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
2017-06-20 07:25:42,273 INFO (default task-614) 85.52.215.80 - Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
The request to the API is performed with the full set of parameters (I mean, it's not just a hit on the root of the web service).
Any idea what could be going on?
I have several hypotheses:
A team member has a browser tab with the method's request URL open that reloads every time they open the browser. --> This is my favourite, but everyone claims it isn't their fault.
A team member has the service URL (with all parameters) in their browser History, and the browser randomly queries it to retrieve the favicon.
A team member has the service URL (with all parameters) in their browser Favourites/Bookmarks, and the browser randomly queries it to retrieve the favicon.
While the User-Agent (Google Favicon) seems to suggest one of the two latter options, the IP (located near our own city, on the Orange Spain ISP) seems to suggest the first one: after a quick search on the Internet, I found that everyone seeing such requests seems to get them from a Google IP in California.
I know I could just block that User-Agent or IP, but I'd really like to get to the bottom of this issue.
Thanks!
Edit:
Now I am getting User-Agents like:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/41.0.2272.118 Safari/537.36
as well :/
Both of these User-Agents are associated with Google's Fetch and Render tool in Google Search Console. These user agents make requests when Google is asked to Fetch and Render a given page for SEO validation. That doesn't quite fit, since you are asking about an API and not a page, but perhaps a page that was submitted to the Fetch and Render service called the API?
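If the goal is just to tell these hits apart in your logs, a plain substring check is enough; a minimal sketch in Python (the marker strings are taken from the log lines in the question):
# Substring markers seen in the logged User-Agents above.
GOOGLE_PREVIEW_MARKERS = ("Google Favicon", "Google Web Preview")

def is_google_preview(user_agent):
    # True if the User-Agent looks like Google's favicon/preview fetcher.
    return any(marker in user_agent for marker in GOOGLE_PREVIEW_MARKERS)

ua = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon"
print(is_google_preview(ua))  # True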
Specifically, I'm trying to scrape this entire page, but am only getting a portion of it. If I use:
import requests

r = requests.get('http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120')
it only gets the "visible" part of the page, since more items load as you scroll downwards.
I know there are some solutions in PyQt such as this, but is there a way to have Python requests continuously scroll to the bottom of a webpage until all items load?
You could monitor the page's network activity with the browser development console (F12 -> Network in Chrome) to see what requests the page makes when you scroll down, then reproduce those requests with requests. As an alternative, you can use selenium to control a browser programmatically, scroll down until the page ends, and then save its HTML.
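The selenium route might look roughly like this; a minimal sketch (driver setup and the sleep interval are assumptions you'd tune for the site):
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120")

# Keep scrolling until the page height stops growing.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give newly loaded items time to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html = driver.page_source  # the fully loaded page, ready to parse
driver.quit()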
I think I found the right request:
Request URL:http://store.nike.com/html-services/gridwallData?country=US&lang_locale=en_US&gridwallPath=mens-shoes/7puZoi3&pn=3
Request Method:GET
Status Code:200 OK
Remote Address:87.245.221.98:80
Request Headers
Accept:application/json, text/javascript, */*; q=0.01
Referer:http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
X-NewRelic-ID:VQYGVF5SCBAJVlFaAQIH
X-Requested-With:XMLHttpRequest
It seems the query parameter pn means the current "subpage". But you still need to work out the structure of the response.
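Assuming the endpoint behaves as captured above, paging through it might look like this; a minimal sketch (the parameters and headers come from the capture, but the JSON layout still has to be inspected by hand):
import requests

BASE = "http://store.nike.com/html-services/gridwallData"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

for pn in range(1, 4):  # first few "subpages"
    params = {
        "country": "US",
        "lang_locale": "en_US",
        "gridwallPath": "mens-shoes/7puZoi3",
        "pn": pn,
    }
    r = requests.get(BASE, params=params, headers=headers)
    if r.status_code != 200:
        break
    data = r.json()
    # Inspect the top-level keys to find where the product data lives.
    print(pn, list(data)[:5])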
I am trying to get a website's source in Python. My code is:
import urllib2

# this call is where it now fails
response = urllib2.urlopen("http://kolekcjoner.nbp.pl")
I have a problem with this one website - everything else works, e.g. Google, except this site. The funny thing is that I was fetching data from this website 3 days ago, but now this code is not working. Why? What has changed?
The problem is that you are trying to get a page that doesn't exist...
As you can see, the error is:
urllib2.HTTPError: HTTP Error 404: Not Found
You can use try/except, or use another module that won't raise an exception every time there is an HTTP error code (like the requests module).
Update: after checking a bit, I found that the address you gave works properly in the browser, so only the requests sent by Python get the 404. This means the server is checking the user agent of the request, and if the user agent isn't allowed/known, the server returns an error code (e.g. 404). I checked whether that is true by changing the User-Agent field:
>>> import requests
>>> requests.get("https://kolekcjoner.nbp.pl/")
<Response [404]>
>>> requests.get("https://kolekcjoner.nbp.pl/",headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103'})
<Response [200]>
I hope this helps you (in any case, you should know that this site's robots.txt disallows all robots from crawling it...)
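For completeness, the same User-Agent fix works with urllib2, which the question used; a minimal sketch:
import urllib2

req = urllib2.Request(
    "https://kolekcjoner.nbp.pl/",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103"},
)
try:
    html = urllib2.urlopen(req).read()
except urllib2.HTTPError as e:
    print(e.code)  # urllib2 raises on 4xx/5xx, unlike requests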
I ran into a Safari problem concerning cookie policy in iframes... I found a working solution for it, but to make it work I need to determine which browser the user is viewing the page in.
The original solution was to search HTTP_USER_AGENT (Django) for the word 'safari'. The problem here is:
Safari Windows XP on WM User Agent - Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.52.7 (KHTML, like Gecko) Version/5.1.2 Safari/534.52.7
Chrome Linux User Agent - Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.
So I'm struggling to find information on what makes up a User-Agent and how to parse it to get precise results. Sure, in this case I can throw in an extra check that the word 'chrome' is absent, but what about Chromium, Konqueror and any other minor browsers...
So I found that a User-Agent can contain any information you want.
There are some loose rules by which you can determine a browser, yet those rules do not apply to all browsers.
During the browser wars, many web servers were configured to only send web pages that required advanced features to clients that were identified as some version of Mozilla.
For this reason, most Web browsers use a User-Agent value as follows: Mozilla/[version] ([system and browser information]) [platform] ([platform details]) [extensions].
More at http://en.wikipedia.org/wiki/User_agent
In my case I've looked at http://www.user-agents.org/ and determined that only Chrome impersonates Safari in the last section.
http://www.quirksmode.org/js/detect.html
Just search for the word Chrome first, then search for Safari.
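In Django terms, that ordering might look like this; a minimal sketch (the function name and return values are placeholders):
def detect_browser(request):
    # Check for Chrome (and Chromium) before Safari, since Chrome's
    # User-Agent string also contains the token "Safari".
    ua = request.META.get("HTTP_USER_AGENT", "")
    if "Chrome" in ua or "Chromium" in ua:
        return "chrome"
    if "Safari" in ua:
        return "safari"
    return "other"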