Scrape entire scrolling-load page with Python Requests

Scrape entire scrolling-load page with Python Requests - python-2.7

Specifically, I'm trying to scrape this entire page, but am only getting a portion of it. If I use:
r = requests.get('http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120')
it only gets the "visible" part of the page, since more items load as you scroll downwards.
I know there are some solutions in PyQT such as this, but is there a way to have python requests continuously scroll to the bottom of a webpage until all items load?

You could monitor page network activity with browser development console (F12 - Network in Chrome) to see what request does the page do when you scroll down, use that data and reproduce the request with requests. As an alternative, you can use selenium to control a browser programmatically to scroll down until page is ended, then save its HTML.
I guess I found the right request
Request URL:http://store.nike.com/html-services/gridwallData?country=US&lang_locale=en_US&gridwallPath=mens-shoes/7puZoi3&pn=3
Request Method:GET
Status Code:200 OK
Remote Address:87.245.221.98:80
Request Headers
Provisional headers are shown
Accept:application/json, text/javascript, */*; q=0.01
Referer:http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
X-NewRelic-ID:VQYGVF5SCBAJVlFaAQIH
X-Requested-With:XMLHttpRequest
Seems that query parameter pn means the current "subpage". But you still need to understand the response correctly.

Related

Web.Contents stuck or times out while trying to download a file

I cannot connect to the site on which there is a small xlsx file. The file is the Rig Count Summary on this site. Right-click > copy link to use in Power Query.
let
Source = Excel.Workbook(Web.Contents("https://rigcount.bakerhughes.com/static-files/3ba17f6e-62be-454c-bbd9-806996a7d991"), null, true)
in
Source

The web server gets the HTTP request and can behave differently based on the HTTP headers present in the request. By trial-and-error you can copy request headers from a working request that you examine in browser debug mode or Fiddler and add them to Web.Contents. The result of doing that here is something like:
let
headers = [
#"User-Agent"="Mozilla/5.0 (iPad; CPU OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/87.0.4280.77 Mobile/15E148 Safari/604.1 Edg/108.0.0.0"
,#"Accept-Encoding"="gzip, deflate"
],
Source = Web.Contents("https://rigcount.bakerhughes.com/static-files/3ba17f6e-62be-454c-bbd9-806996a7d991",[Headers=headers]),
Data = Excel.Workbook(Source),
RigCountSummary_CurrentWeek_Sheet = Data{[Item="RigCountSummary_CurrentWeek",Kind="Sheet"]}[Data]
in
RigCountSummary_CurrentWeek_Sheet
Without the Accept-Encoding and User-Agent headers the request was hanging on the web server. This is probably a bug with that web application, instead of an anti-scraping measure, as you wouldn't intentially cause incoming requests to hang for a long time.

The problem is that your site doesn't provide a direct link to the Excel file.
When you click on the xlsx link, a java script starts the download of Rig Count Summary_121622.xlsx
When you copy the xlsx link, you get https://rigcount.bakerhughes.com/static-files/4ef2cc30-b5a4-4b91-856a-499467858baa, which is not an Excel file.

Sampler fails due to regex value conflict

I am trying to write a test script in Jmeter the script works fine for a single user, but the same script fails for multiple users, the issue is there are 2 regex one is extracting the authentication details and the other is extracting the jobs associated to each user, now in some of the samplers when I am running the script for more than 2 users it picks the job of one user and the authentication details from another due to this the server sends the 500 error message,
Here's the data that authKey is capturing
sitecode: 601584
credential: {"SiteCode":601584,"UserName":"ADMIN","FirstName":"ADMIN","LastName":"ADMIN","RoleId":1,"UserTypeId":1,"SiteId":696,"StaffId":0,"UserId":15240,"AuthKey":"1548dbe78e5d4a71bbe8a70112c66eb82899c6d38dd140be91b7cf5610b140617ed460d6ba674d3089c7199941e1342b","DefaultPagePath":"","UserEntity":2519,"TypeOfEntity":20,"Culture":null,"SuperFranchiseId":0,"SuperFranchiseName":null,"MasterFranchiseId":null,"FranchiseIds":null,"CountryList":[],"MarketId":0,"MarketName":null,"TechnicianId":0,"Actions":"","GroupId":10,"SessionId":103807374,"EntityRef":2519,"IsLocked":false,"EncryptedAuthKey":"tNKvRhDrm4R99OCP45Q+uQnSa+CLfm2iLuTG9lCCWo17CRXPGoCzrzj2nQ0nC68IrqkP6ygRH0hQrrdosqmXoYngBxu04l4zH7rNhMZ1bbcK49QKBVQ9sVp3mTUPjzaBU1MH431lTyGCQMfCJafHHxY+XJNSMeTk/CG6m6D47oZW/v0az17IYcNL586QC6Vsm5BGul5U6+c71fSnTQfIdiWY5Ijye2xjDTHN1LZ8u9UGtrShF7zFCm2hkdFsQ2pk"}
Content-Type: application/json
sec-ch-ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36
Content-Length: 3519
Host: qa-coreservices.wsautovhc.co.uk
The site code goes as 601584
Here's the value of VHC that's been picked it's from a different site user
POST https://qa-coreservices.wsautovhc.co.uk/api/vhc/v2/update
POST data:
{"IsTCO":true,"TCOId":1016590096,"TimeIn":"00:00:00","TimeOut":"00:00:00","CustomerBookedWork":"","CustomerStatus":"","WorkStatus":"","MobilityStatus":"","RepeatRepair":"false","DateIn":"","DateOut":"","SquashedFrogModelId":0,"SquashedFrogurl":null,"TotalPhotoCount":0,"IsMoveVHC":false,"LinkGuid":"00000000-0000-0000-0000-000000000000","CustomerName":null,"TechnicianName":null,"TemplateLogoAssetId":0,"TemplateLogoUrl":null,"SAName":null,"CustomerUUId":null,"SiteName":null,"SAFirstName":null,"SASurName":null,"SADisplayName":null,"CustomerFirstName":null,"CustomerSurName":null,"CustomerSalutation":null,"CustomerEmail":null,"CustomerMobile":null,"DialingCode":null,"RoleID":0,"SAEmail":null,"MitchellIntegration":null,"VhcTotalId":0,"VhcTotalData":{"Notes":null,"OnlineAuthAuthorisationDate":null,"FuelGaugeValue":0,"EmacIntegrationInfo":null,"DmsMetaData":null,"Chat":null,"NotApplicableTemplateItemTypeIds":null,"UberVhcScoreCard":null,"OnlinePayments":null,"IsCreateRepairOrder":false,"IsGetRepairOrder":false,"EngineNumber":"","PartsPartiallyPriced":false,"LabourPartiallyPriced":false,"PartiallyChecked":null,"ParentCode":null,"IsTransmittedToDms":false,"YearOfMake":"","ChatStages":0,"RequestCustomerCallBack":false,"WhatsApp":null},"VQCQuestionIds":null,"VQCByUserId":0,"TotalVideoCount":0,"SquashedfrogPhotoCount":0,"SquashedfrogVideoCount":0,"HasIntegrationRequired":false,"IsAutoSendMessage":false,"PreviousVhcComments":null,"ConvertToVideoCheck":false,"VhcJcbLookup":null,"VhcJcbTemplateLookupId":0,"IsAppRequest":false,"NissanRecallInfo":null,"HondaRecallInfo":null,"VisitReasons":[],"TemplateName":null,"QualityControlName":null,"SAStartTStamp":null,"BoatMetaData":{"BoatName":"","EngineId1":"","EngineId2":"","TransmissionId":"","BoatLength":0},"EventsMetaData":null,"Workshop":"Select","DateAdded":"24/05/2021 11:38:40","DateChecked":"","DateParts":"","DateLabour":"","DateAuthorised":"","Status":"N","VHCId":1012875189,"CustomerId":1004437412,"VHCDate":"20210524","Make":"2019","Model":"124 SPIDER","RegNo":"REG55671","Mileage":1280,"JobCardNo":"JC5461","ItemNo":"","FollowUpDate":"20210524","Technician":0,"AuthTotal":0.0,"IdentTotal":0.0,"InvTotal":0.00,"QualityControl":0,"RepairOrderNo":"","Deleted":false,"SiteCode":601470,"TimeStamp":"24/05/2021 11:38:40","Altered":false,"ServiceAdvisor":0,"DateWorkIssued":"","DatePartsIssued":"","VIN":"","TemplateId":47692,"Populated":false,"FranchiseId":27,"AuthTotalIncVAT":0.0,"IdentTotalIncVAT":0.0,"LastAccessTStamp":null,"LastAccessUser":0,"StartTStamp":null,"TyresRequired":false,"MultiRole":false,"CommentsAvailable":false,"MultiRoleSAId":0,"MultiRoleTStamp":null,"FirstRegistrationDate":"","NextMOTDate":"","NextServiceDate":"","DefaultServiceRate":0.00,"AgreedEstimate":0.0,"WorkRequired":"","AverageMileagePerAnnum":"0","PushedToDMS":false,"TransmittedToFord":false,"OriginalVHCDate":"20210524","MileageUnit":"m","PreviousVHCId":0,"PreviousVHCDate":"","PricedByUserId":0,"Revisit":0,"HasBeenLate":false,"FirstEvent":false,"RevisitRecorded":false,"UnknownReasonCode":false,"DMSAccountCode":"","JobType":0,"PhotoCount":0,"VideoCount":0,"NextCheckTachographDate":"","NextChangeCoolantDate":"","NextChangeBrakeFluidDate":"","LastWorkshopVisitDate":"","MileageLastWorkshopVisit":0,"LastServiceDate":"","MileageLastService":0,"ServiceCode":"","EngineOil":"","TransmissionOil":"","PositionNumber":0,"AxleNumber":0,"AxleLocation":"","FirstTyreLocation":"","SecondTyreLocation":"","VHCIcon":null,"FastLaneVHC":0,"Pin":null,"VhcType":0,"EngineNumber":""};
[no cookies]
The site code here is different than the one that's captured in the AuthKey and this creating the problem.
What am I doing wrong?

it picks the job of one user and the authentication details from another
it cannot happen, as per JMeter Documentation
Properties are not the same as variables. Variables are local to a thread; properties are common to all threads, and need to be referenced using the __P or __property function.
So if one user extracts some value from the response another user cannot access this value (unless you convert it into a JMeter Property using __setProperty() function) so most probably the error is either in your extraction logic or in business logic of your test scenario. See article on Thread-local storage for more information if needed.
So inspect responses and extracted values using Debug Sampler and View Results Tree listener combination, hopefully you will be able to detect the inconsistency yourself.
Also JSON doesn't seem to be a regular language so parsing it using regular expressions is not the best idea, maybe it worth consider using JSON Extractor or even better JSON JMESPath Extractor instead?

Data Crawling From Linkedin

I'm trying to crawl data from Linkedin which use for a personal data crawling practice. But I can not crawl the data without login. So I used two way to simulate log in. One is to get the cookies from HttpClient, which will try to make a simulation login to get the cookies. the other is just add the cookie directly. But I failed both. I don't know the reason.
I used Framework Webmagic for the data crawling.
generally, adding Cookies directly will be an easy way. But I don't know whether I added the wrong cookies.
Here's the thing. I wanna fetch data from the website https://www.linkedin.com/mynetwork/invite-connect/connections/
And I added all the cookies at this page.
Here's all the cookies.
private Site site = Site.me().setRetryTimes(3).setSleepTime(100);
site.setCharset("utf-8")
.setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36")
.addHeader("accept","text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8")
.addHeader("accept-encoding","gzip, deflate, br")
.addHeader("accept-language:en-US","en;q=0.8")
.addHeader("connection", "keep-alive")
.addHeader("referer","https://www.linkedin.com/")
.addCookie(".linkedin.com","lidc", "b=TB91:g=750:u=38:i=1503815541:t=1503895683:s=AQE5xZLW6mVmRdHBY9qNO-YOiyAnKtgk")
.addCookie(".linkedin.com","lang", "v=2&lang=en-us")
.addCookie(".linkedin.com","_lipt", "CwEAAAFeIo5-jXjgrpSKF4JfxzNbjC6328JPUgtSHQIKtSDyk4Bockuw84uMkCwbKS0TzUOM_w8Al4s9YjFFF-0T43TPtfG_wv-JNVXsPeO8mVxaYwEcTGiyOdyaRZOCIK7qi02EvZUCtjsaTpAos60U4XrFnu1FO-cY1LrzpqDNUmfrqWJPjSoZpOmjeKtTh-nHcdgpruvjf237E78dqMydLLd1A0Uu7Kr7CmNIurXFd9-Z4hwevLRd3SQMEbSRxAwCclgC4tTzEZ5KoFmpI4veKBFGOqF5MCx3hO9iNRdHrJC44hfRx-Bw7p__PYNWF8sc6yYd0deF-C5aJpronFUYp3vXiwt023qm6T9eRqVvtH1BRfLwCZOJmYrGbKzq4plzNKM7DnHKHNV_cjJQtc9aD3JQz8n2GI-cHx2PYubUyIjVWWvntKWC-EUtn4REgL4jmIaWzDUVz3nkEBW7I3Wf6u2TkuAVu9vq_0mW_dTVDCzgASk")
.addCookie(".linkedin.com","_ga", "GA1.2.2091383287.1503630105")
.addCookie(".www.linkedin.com","li_at", "AQEDAReIjksE2n3-AAABXiKOYVQAAAFeRprlVFYAV8gUt-kMEnL2ktiHZG-AOblSny98srz2r2i18IGs9PqmSRstFVL2ZLdYOcHfPyKnBYLQPJeq5SApwmbQiNtsxO938zQrrcjJZxpOFXa4wCMAuIsN")
.addCookie(".www.linkedin.com","JSESSIONID", "ajax:4085733349730512988")
.addCookie(".linkedin.com","liap", "true")
.addCookie(".www.linkedin.com","sl","v=1&f68pf")
.addCookie("www.linkedin.com","visit", "v=1&M")
.addCookie(".www.linkedin.com","bscookie", "v=1&201708250301246c8eaadc-a08f-4e13-8f24-569529ab1ce0AQEk9zZ-nB0gizfSrOSucwXV2Wfc3TBY")
.addCookie(".linkedin.com","bcookie", "v=2&d2115cf0-88a6-415a-8a0b-27e56fef9e39");
Did I miss something?

LinkedIn is very difficult to crawl, not just technically but they also sue people who do.
When they detect an IP as a possible bot, they give you the login page. Most IP addresses known for bots by them are now serving a login page. New ranges do not last very long.
They're probably just pretty confident you're a bot and keeping you from logging in.

Strange Google Favicon queries to API

I have recently created an API for internal use in my company. Only my colleagues and I have the URL.
From a few days ago, I detected that random requests where occuring to a given method of the API (less than once per day), so I logged accesses to that method and this is what I am getting:
2017-06-18 17:10:00,359 INFO (default task-427) 85.52.215.80 - Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
2017-06-20 07:25:42,273 INFO (default task-614) 85.52.215.80 - Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
The request to the API is performed with the full set of parameters (I mean, it's not just to the root of the webservice)
Any idea of what could be going on?
I have several thesis:
A team member that has a browser tab with the method request URL open, that reloads every time he opens the browser. --> This is my favourite, but everybody claims not their fault
A team member having the service URL (with all parameters) in their browser History, with the browser randomly querying it to retrieve the favicon
A team member having the service service URL (with all parameters) in their browser Favourites/Bookmarks, with the browser randomly querying it to retrieve the favicon
Since the UserAgent (Google Favicon) seems to suggest one of the two latter options, the IP (located near our own city, with Orange Spain ISP) seem to suggest the first option: After a quick search on the Internet, I've found that everybody that is having such requests seem to have a California's Google IP.
I know I could just block that User Agent or IP, but I'd really would like to get to the bottom of this issue.
Thanks!
Edit:
Now I am getting User Agents as:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/41.0.2272.118 Safari/537.36
as well :/

Both of these User Agents are associated with Google's Fetch and Render tool in Google Search Console. These user agents make request upon asking Google to Fetch and Render a given page for SEO validation. This really does not make sense considering you are asking about an API and not a page. But perhaps it is because a page that was submitted to the Fetch and Render service called the API?

request['HTTP_USER_AGENT'] structure in modern browsers

I ran into a safari problem considering cookie policy in iframes... Also I found a working solution for that, yet to make it work I need to determine in what browser user is viewing.
Original solution as to search in HTTP_USER_AGENT (django) word - safari. Problem here is:
Safari Windows XP on WM User Agent - Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.52.7 (KHTML, like Gecko) Version/5.1.2 Safari/534.52.7
Chrome Linux User Agent - Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.
So I'm struggling to find information what makes User Agent and how to parce it to get precise results. Sure in this case I can trow in extra if there is no word 'chrome', but what about chromium konqueror and any other minor browsers...

So I found that in User agent there can be any information you want.
There is some sort of abstract rules by witch you can determine a browser, yet those rules does not apply to all browsers.
During the browser wars, many web servers were configured to only send web pages that required advanced features to clients that were identified as some version of Mozilla.
For this reason, most Web browsers use a User-Agent value as follows: Mozilla/[version] ([system and browser information]) [platform] ([platform details]) [extensions].
More # http://en.wikipedia.org/wiki/User_agent
In my case I've looked at http://www.user-agents.org/ and determined that only Chrome impersonates Safari in the last section.

http://www.quirksmode.org/js/detect.html
Just search for the word Chrome, first, then search for Safari.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Scrape entire scrolling-load page with Python Requests - python-2.7

Related

Web.Contents stuck or times out while trying to download a file

Sampler fails due to regex value conflict

Data Crawling From Linkedin

Strange Google Favicon queries to API

request['HTTP_USER_AGENT'] structure in modern browsers

Categories

Resources