request['HTTP_USER_AGENT'] structure in modern browsers - django

I ran into a Safari problem with the cookie policy in iframes, and I found a working solution for it. However, to make the solution work I need to determine which browser the user is viewing the page in.
The original approach was to search the HTTP_USER_AGENT (in Django) for the word "safari". The problem is:
Safari on Windows XP (in a VM), User-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.52.7 (KHTML, like Gecko) Version/5.1.2 Safari/534.52.7
Chrome on Linux, User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.
So I'm struggling to find information on what makes up a User-Agent and how to parse it to get precise results. Sure, in this case I could throw in an extra check for the absence of the word 'chrome', but what about Chromium, Konqueror, and other minor browsers?

So I found that a User-Agent can contain any information the client wants.
There are some rough rules by which you can determine the browser, yet those rules do not apply to all browsers.
During the browser wars, many web servers were configured to only send web pages that required advanced features to clients that were identified as some version of Mozilla.
For this reason, most Web browsers use a User-Agent value as follows: Mozilla/[version] ([system and browser information]) [platform] ([platform details]) [extensions].
More: http://en.wikipedia.org/wiki/User_agent
In my case I looked at http://www.user-agents.org/ and determined that only Chrome impersonates Safari in the last section.
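For illustration, here is how a Chrome User-Agent decomposes under that format, and where the misleading "Safari" token lives (the mapping is my reading of the Wikipedia format, applied to the complete Chrome UA quoted later on this page):

# Decomposing a Chrome User-Agent per the format above:
# Mozilla/[version] ([system and browser information]) [platform] ([platform details]) [extensions]
ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36')
# Mozilla/[version]            -> Mozilla/5.0
# ([system and browser info])  -> (Windows NT 10.0; Win64; x64)
# [platform]                   -> AppleWebKit/537.36
# ([platform details])         -> (KHTML, like Gecko)
# [extensions]                 -> Chrome/89.0.4389.114 Safari/537.36  <- "Safari" appears here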

http://www.quirksmode.org/js/detect.html
Just search for the word Chrome first, then search for Safari.
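A minimal sketch of that advice in a Django view (the helper name is mine; it assumes the standard request.META access):

# A sketch: Chrome and Chromium include the "Safari" token in their
# User-Agent, so rule them out first before concluding it is Safari.
def is_safari(request):
    ua = request.META.get('HTTP_USER_AGENT', '')
    if 'Chrome' in ua or 'Chromium' in ua:
        return False
    return 'Safari' in ua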

Related

Sampler fails due to regex value conflict

I am trying to write a test script in JMeter. The script works fine for a single user, but fails for multiple users. There are two regular expression extractors: one extracts the authentication details and the other extracts the jobs associated with each user. In some of the samplers, when I run the script with more than 2 users, it picks the job of one user and the authentication details of another, so the server sends a 500 error message.
Here's the data that authKey is capturing:
sitecode: 601584
credential: {"SiteCode":601584,"UserName":"ADMIN","FirstName":"ADMIN","LastName":"ADMIN","RoleId":1,"UserTypeId":1,"SiteId":696,"StaffId":0,"UserId":15240,"AuthKey":"1548dbe78e5d4a71bbe8a70112c66eb82899c6d38dd140be91b7cf5610b140617ed460d6ba674d3089c7199941e1342b","DefaultPagePath":"","UserEntity":2519,"TypeOfEntity":20,"Culture":null,"SuperFranchiseId":0,"SuperFranchiseName":null,"MasterFranchiseId":null,"FranchiseIds":null,"CountryList":[],"MarketId":0,"MarketName":null,"TechnicianId":0,"Actions":"","GroupId":10,"SessionId":103807374,"EntityRef":2519,"IsLocked":false,"EncryptedAuthKey":"tNKvRhDrm4R99OCP45Q+uQnSa+CLfm2iLuTG9lCCWo17CRXPGoCzrzj2nQ0nC68IrqkP6ygRH0hQrrdosqmXoYngBxu04l4zH7rNhMZ1bbcK49QKBVQ9sVp3mTUPjzaBU1MH431lTyGCQMfCJafHHxY+XJNSMeTk/CG6m6D47oZW/v0az17IYcNL586QC6Vsm5BGul5U6+c71fSnTQfIdiWY5Ijye2xjDTHN1LZ8u9UGtrShF7zFCm2hkdFsQ2pk"}
Content-Type: application/json
sec-ch-ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36
Content-Length: 3519
Host: qa-coreservices.wsautovhc.co.uk
The site code goes in as 601584.
Here's the VHC value that's been picked; it's from a different site's user:
POST https://qa-coreservices.wsautovhc.co.uk/api/vhc/v2/update
POST data:
{"IsTCO":true,"TCOId":1016590096,"TimeIn":"00:00:00","TimeOut":"00:00:00","CustomerBookedWork":"","CustomerStatus":"","WorkStatus":"","MobilityStatus":"","RepeatRepair":"false","DateIn":"","DateOut":"","SquashedFrogModelId":0,"SquashedFrogurl":null,"TotalPhotoCount":0,"IsMoveVHC":false,"LinkGuid":"00000000-0000-0000-0000-000000000000","CustomerName":null,"TechnicianName":null,"TemplateLogoAssetId":0,"TemplateLogoUrl":null,"SAName":null,"CustomerUUId":null,"SiteName":null,"SAFirstName":null,"SASurName":null,"SADisplayName":null,"CustomerFirstName":null,"CustomerSurName":null,"CustomerSalutation":null,"CustomerEmail":null,"CustomerMobile":null,"DialingCode":null,"RoleID":0,"SAEmail":null,"MitchellIntegration":null,"VhcTotalId":0,"VhcTotalData":{"Notes":null,"OnlineAuthAuthorisationDate":null,"FuelGaugeValue":0,"EmacIntegrationInfo":null,"DmsMetaData":null,"Chat":null,"NotApplicableTemplateItemTypeIds":null,"UberVhcScoreCard":null,"OnlinePayments":null,"IsCreateRepairOrder":false,"IsGetRepairOrder":false,"EngineNumber":"","PartsPartiallyPriced":false,"LabourPartiallyPriced":false,"PartiallyChecked":null,"ParentCode":null,"IsTransmittedToDms":false,"YearOfMake":"","ChatStages":0,"RequestCustomerCallBack":false,"WhatsApp":null},"VQCQuestionIds":null,"VQCByUserId":0,"TotalVideoCount":0,"SquashedfrogPhotoCount":0,"SquashedfrogVideoCount":0,"HasIntegrationRequired":false,"IsAutoSendMessage":false,"PreviousVhcComments":null,"ConvertToVideoCheck":false,"VhcJcbLookup":null,"VhcJcbTemplateLookupId":0,"IsAppRequest":false,"NissanRecallInfo":null,"HondaRecallInfo":null,"VisitReasons":[],"TemplateName":null,"QualityControlName":null,"SAStartTStamp":null,"BoatMetaData":{"BoatName":"","EngineId1":"","EngineId2":"","TransmissionId":"","BoatLength":0},"EventsMetaData":null,"Workshop":"Select","DateAdded":"24/05/2021 11:38:40","DateChecked":"","DateParts":"","DateLabour":"","DateAuthorised":"","Status":"N","VHCId":1012875189,"CustomerId":1004437412,"VHCDate":"20210524","Make":"2019","Model":"124 SPIDER","RegNo":"REG55671","Mileage":1280,"JobCardNo":"JC5461","ItemNo":"","FollowUpDate":"20210524","Technician":0,"AuthTotal":0.0,"IdentTotal":0.0,"InvTotal":0.00,"QualityControl":0,"RepairOrderNo":"","Deleted":false,"SiteCode":601470,"TimeStamp":"24/05/2021 11:38:40","Altered":false,"ServiceAdvisor":0,"DateWorkIssued":"","DatePartsIssued":"","VIN":"","TemplateId":47692,"Populated":false,"FranchiseId":27,"AuthTotalIncVAT":0.0,"IdentTotalIncVAT":0.0,"LastAccessTStamp":null,"LastAccessUser":0,"StartTStamp":null,"TyresRequired":false,"MultiRole":false,"CommentsAvailable":false,"MultiRoleSAId":0,"MultiRoleTStamp":null,"FirstRegistrationDate":"","NextMOTDate":"","NextServiceDate":"","DefaultServiceRate":0.00,"AgreedEstimate":0.0,"WorkRequired":"","AverageMileagePerAnnum":"0","PushedToDMS":false,"TransmittedToFord":false,"OriginalVHCDate":"20210524","MileageUnit":"m","PreviousVHCId":0,"PreviousVHCDate":"","PricedByUserId":0,"Revisit":0,"HasBeenLate":false,"FirstEvent":false,"RevisitRecorded":false,"UnknownReasonCode":false,"DMSAccountCode":"","JobType":0,"PhotoCount":0,"VideoCount":0,"NextCheckTachographDate":"","NextChangeCoolantDate":"","NextChangeBrakeFluidDate":"","LastWorkshopVisitDate":"","MileageLastWorkshopVisit":0,"LastServiceDate":"","MileageLastService":0,"ServiceCode":"","EngineOil":"","TransmissionOil":"","PositionNumber":0,"AxleNumber":0,"AxleLocation":"","FirstTyreLocation":"","SecondTyreLocation":"","VHCIcon":null,"FastLaneVHC":0,"Pin":null,"VhcType":0,"EngineNumber":""};
[no cookies]
The site code here is different from the one captured in the AuthKey, and this is creating the problem.
What am I doing wrong?
it picks the job of one user and the authentication details from another
This cannot happen, as per the JMeter documentation:
Properties are not the same as variables. Variables are local to a thread; properties are common to all threads, and need to be referenced using the __P or __property function.
So if one user extracts some value from the response, another user cannot access this value (unless you convert it into a JMeter property using the __setProperty() function), so most probably the error is either in your extraction logic or in the business logic of your test scenario. See the article on thread-local storage for more information if needed.
So inspect the responses and the extracted values using a Debug Sampler and View Results Tree listener combination; hopefully you will be able to detect the inconsistency yourself.
Also, JSON is not a regular language, so parsing it using regular expressions is not the best idea; it may be worth considering the JSON Extractor, or even better the JSON JMESPath Extractor, instead.
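As a rough illustration outside JMeter (Python, with the jmespath package; the payload below is just a trimmed version of the credential captured above), a structured query is far less fragile than a regex:

import json
import jmespath  # pip install jmespath

# A trimmed version of the credential payload shown above.
raw = '{"SiteCode":601584,"UserName":"ADMIN","AuthKey":"1548dbe78e5d4a71..."}'
data = json.loads(raw)

# JMESPath expressions like these are what the JSON JMESPath Extractor evaluates.
print(jmespath.search('SiteCode', data))  # 601584
print(jmespath.search('AuthKey', data))   # the auth key string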

Scrapy: browser not accepting cookies despite settings

This scraper works fine. I only want to get the titles of the items on this page.
walmart scraper
In the scrapy shell, using the view(response) function reveals a web page that says "Your web browser is not accepting cookies", even when I add USER_AGENT information to the scrapy shell launch.
As a result, the scraper doesn't manage to scrape any information. Things that I have changed:
COOKIES_ENABLED = True
COOKIES_DEBUG = True
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
DOWNLOADER_MIDDLEWARES = {'walmartscraper.middlewares.WalmartscraperDownloaderMiddleware': 543,}
I have a feeling I need to add or change something in the middlewares section (it is still the default code) and/or implement the requests somewhere. This is the first time I've worked with cookies while scraping, and the information I've found hasn't helped me figure this out.
Any advice is very much appreciated. Thank you.
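If the hunch above is right that something is missing, one hedged thing to try is Scrapy's DEFAULT_REQUEST_HEADERS setting, so the request looks more like a real browser's; the header values below are illustrative, not a verified fix for Walmart's cookie check:

# settings.py -- a sketch; keep the settings already listed above
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
}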

Data Crawling From LinkedIn

I'm trying to crawl data from LinkedIn for personal data-crawling practice, but I cannot crawl the data without logging in. So I tried two ways to simulate logging in. One is to get the cookies via HttpClient, performing a simulated login to obtain them; the other is to just add the cookies directly. But both failed, and I don't know why.
I used the framework WebMagic for the crawling.
Generally, adding cookies directly would be the easy way, but I don't know whether I added the wrong cookies.
Here's the thing: I want to fetch data from https://www.linkedin.com/mynetwork/invite-connect/connections/
And I added all the cookies from that page.
Here are all the cookies:
private Site site = Site.me().setRetryTimes(3).setSleepTime(100);
site.setCharset("utf-8")
.setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36")
.addHeader("accept","text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8")
.addHeader("accept-encoding","gzip, deflate, br")
.addHeader("accept-language", "en-US,en;q=0.8") // fixed: the original fused the value into the header name ("accept-language:en-US")
.addHeader("connection", "keep-alive")
.addHeader("referer","https://www.linkedin.com/")
.addCookie(".linkedin.com","lidc", "b=TB91:g=750:u=38:i=1503815541:t=1503895683:s=AQE5xZLW6mVmRdHBY9qNO-YOiyAnKtgk")
.addCookie(".linkedin.com","lang", "v=2&lang=en-us")
.addCookie(".linkedin.com","_lipt", "CwEAAAFeIo5-jXjgrpSKF4JfxzNbjC6328JPUgtSHQIKtSDyk4Bockuw84uMkCwbKS0TzUOM_w8Al4s9YjFFF-0T43TPtfG_wv-JNVXsPeO8mVxaYwEcTGiyOdyaRZOCIK7qi02EvZUCtjsaTpAos60U4XrFnu1FO-cY1LrzpqDNUmfrqWJPjSoZpOmjeKtTh-nHcdgpruvjf237E78dqMydLLd1A0Uu7Kr7CmNIurXFd9-Z4hwevLRd3SQMEbSRxAwCclgC4tTzEZ5KoFmpI4veKBFGOqF5MCx3hO9iNRdHrJC44hfRx-Bw7p__PYNWF8sc6yYd0deF-C5aJpronFUYp3vXiwt023qm6T9eRqVvtH1BRfLwCZOJmYrGbKzq4plzNKM7DnHKHNV_cjJQtc9aD3JQz8n2GI-cHx2PYubUyIjVWWvntKWC-EUtn4REgL4jmIaWzDUVz3nkEBW7I3Wf6u2TkuAVu9vq_0mW_dTVDCzgASk")
.addCookie(".linkedin.com","_ga", "GA1.2.2091383287.1503630105")
.addCookie(".www.linkedin.com","li_at", "AQEDAReIjksE2n3-AAABXiKOYVQAAAFeRprlVFYAV8gUt-kMEnL2ktiHZG-AOblSny98srz2r2i18IGs9PqmSRstFVL2ZLdYOcHfPyKnBYLQPJeq5SApwmbQiNtsxO938zQrrcjJZxpOFXa4wCMAuIsN")
.addCookie(".www.linkedin.com","JSESSIONID", "ajax:4085733349730512988")
.addCookie(".linkedin.com","liap", "true")
.addCookie(".www.linkedin.com","sl","v=1&f68pf")
.addCookie("www.linkedin.com","visit", "v=1&M")
.addCookie(".www.linkedin.com","bscookie", "v=1&201708250301246c8eaadc-a08f-4e13-8f24-569529ab1ce0AQEk9zZ-nB0gizfSrOSucwXV2Wfc3TBY")
.addCookie(".linkedin.com","bcookie", "v=2&d2115cf0-88a6-415a-8a0b-27e56fef9e39");
Did I miss something?
LinkedIn is very difficult to crawl, not just technically; they also sue people who do.
When they detect an IP as a possible bot, they serve the login page. Most IP ranges they know to host bots are now served a login page, and new ranges do not last very long.
They're probably just fairly confident you're a bot and are keeping you from logging in.

Strange Google Favicon queries to API

I have recently created an API for internal use in my company. Only my colleagues and I have the URL.
A few days ago, I detected that random requests were occurring to a given method of the API (less than once per day), so I logged accesses to that method and this is what I am getting:
2017-06-18 17:10:00,359 INFO (default task-427) 85.52.215.80 - Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
2017-06-20 07:25:42,273 INFO (default task-614) 85.52.215.80 - Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
The request to the API is performed with the full set of parameters (I mean, it's not just to the root of the web service).
Any idea what could be going on?
I have several theories:
A team member has a browser tab with the method's request URL open, which reloads every time he opens the browser. --> This is my favourite, but everybody claims it's not their fault.
A team member has the service URL (with all parameters) in their browser history, and the browser randomly queries it to retrieve the favicon.
A team member has the service URL (with all parameters) in their browser favourites/bookmarks, and the browser randomly queries it to retrieve the favicon.
While the User-Agent ("Google Favicon") seems to suggest one of the two latter options, the IP (located near our own city, with the Orange Spain ISP) seems to suggest the first: after a quick search on the Internet, I've found that everybody who gets such requests seems to get them from a Google IP in California.
I know I could just block that User-Agent or IP, but I'd really like to get to the bottom of this issue.
Thanks!
Edit:
Now I am also getting User-Agents such as:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/41.0.2272.118 Safari/537.36
Both of these User-Agents are associated with Google's Fetch and Render tool in Google Search Console. These user agents make requests when someone asks Google to fetch and render a given page for SEO validation. This doesn't quite make sense given that you are asking about an API and not a page, but perhaps a page that was submitted to the Fetch and Render service called the API?

Scrape entire scrolling-load page with Python Requests

Specifically, I'm trying to scrape this entire page, but am only getting a portion of it. If I use:
r = requests.get('http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120')
it only gets the "visible" part of the page, since more items load as you scroll down.
I know there are some solutions in PyQt, such as this one, but is there a way to have Python requests continuously scroll to the bottom of a webpage until all the items load?
You could monitor the page's network activity with the browser development console (F12 - Network in Chrome) to see what requests the page makes when you scroll down, then reproduce those requests with requests. As an alternative, you can use selenium to drive a browser programmatically and scroll down until the page ends, then save its HTML, as in the sketch below.
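For the selenium route, a common pattern (a sketch, assuming chromedriver is available) is to keep scrolling until the page height stops growing:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120')

last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    # Scroll to the bottom and give the lazy-loaded items time to arrive.
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break  # nothing new loaded; we've reached the real bottom
    last_height = new_height

html = driver.page_source  # the fully expanded page
driver.quit()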
I guess I found the right request
Request URL:http://store.nike.com/html-services/gridwallData?country=US&lang_locale=en_US&gridwallPath=mens-shoes/7puZoi3&pn=3
Request Method:GET
Status Code:200 OK
Remote Address:87.245.221.98:80
Request Headers
Provisional headers are shown
Accept:application/json, text/javascript, */*; q=0.01
Referer:http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
X-NewRelic-ID:VQYGVF5SCBAJVlFaAQIH
X-Requested-With:XMLHttpRequest
It seems that the query parameter pn selects the current "subpage" (page number), but you still need to interpret the response correctly. A rough sketch of paging through that endpoint follows.
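A minimal sketch with requests, based on the request captured above (the pagination loop and the 'products' key are my assumptions; inspect the actual JSON before relying on them):

import requests

BASE = 'http://store.nike.com/html-services/gridwallData'
PARAMS = {'country': 'US', 'lang_locale': 'en_US',
          'gridwallPath': 'mens-shoes/7puZoi3'}
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

pn = 1
while True:
    resp = requests.get(BASE, params=dict(PARAMS, pn=pn), headers=HEADERS)
    if resp.status_code != 200:
        break
    data = resp.json()
    items = data.get('products') or []  # guessed key -- verify against the payload
    if not items:
        break  # an empty page means we've run out of subpages
    print('page %d: %d items' % (pn, len(items)))
    pn += 1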