Data Crawling From LinkedIn - cookies

I'm trying to crawl data from LinkedIn as a personal data-crawling exercise, but I can't crawl the data without logging in. So I tried two ways to simulate a login. One is to get the cookies via HttpClient, which performs a simulated login to obtain them; the other is simply to add the cookies directly. Both failed, and I don't know why.
I used the WebMagic framework for the crawling.
Generally, adding the cookies directly should be the easy way, but I don't know whether I added the wrong cookies.
Here's the thing: I want to fetch data from https://www.linkedin.com/mynetwork/invite-connect/connections/
I added all the cookies from that page.
Here is the code with all the cookies:
import us.codecraft.webmagic.Site;

// Headers and cookies copied from the browser's developer tools for the target page
private Site site = Site.me()
        .setRetryTimes(3)
        .setSleepTime(100)
        .setCharset("utf-8")
        .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36")
        .addHeader("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8")
        .addHeader("accept-encoding", "gzip, deflate, br")
        .addHeader("accept-language", "en-US,en;q=0.8")
        .addHeader("connection", "keep-alive")
        .addHeader("referer", "https://www.linkedin.com/")
        .addCookie(".linkedin.com", "lidc", "b=TB91:g=750:u=38:i=1503815541:t=1503895683:s=AQE5xZLW6mVmRdHBY9qNO-YOiyAnKtgk")
        .addCookie(".linkedin.com", "lang", "v=2&lang=en-us")
        .addCookie(".linkedin.com", "_lipt", "CwEAAAFeIo5-jXjgrpSKF4JfxzNbjC6328JPUgtSHQIKtSDyk4Bockuw84uMkCwbKS0TzUOM_w8Al4s9YjFFF-0T43TPtfG_wv-JNVXsPeO8mVxaYwEcTGiyOdyaRZOCIK7qi02EvZUCtjsaTpAos60U4XrFnu1FO-cY1LrzpqDNUmfrqWJPjSoZpOmjeKtTh-nHcdgpruvjf237E78dqMydLLd1A0Uu7Kr7CmNIurXFd9-Z4hwevLRd3SQMEbSRxAwCclgC4tTzEZ5KoFmpI4veKBFGOqF5MCx3hO9iNRdHrJC44hfRx-Bw7p__PYNWF8sc6yYd0deF-C5aJpronFUYp3vXiwt023qm6T9eRqVvtH1BRfLwCZOJmYrGbKzq4plzNKM7DnHKHNV_cjJQtc9aD3JQz8n2GI-cHx2PYubUyIjVWWvntKWC-EUtn4REgL4jmIaWzDUVz3nkEBW7I3Wf6u2TkuAVu9vq_0mW_dTVDCzgASk")
        .addCookie(".linkedin.com", "_ga", "GA1.2.2091383287.1503630105")
        .addCookie(".www.linkedin.com", "li_at", "AQEDAReIjksE2n3-AAABXiKOYVQAAAFeRprlVFYAV8gUt-kMEnL2ktiHZG-AOblSny98srz2r2i18IGs9PqmSRstFVL2ZLdYOcHfPyKnBYLQPJeq5SApwmbQiNtsxO938zQrrcjJZxpOFXa4wCMAuIsN")
        .addCookie(".www.linkedin.com", "JSESSIONID", "ajax:4085733349730512988")
        .addCookie(".linkedin.com", "liap", "true")
        .addCookie(".www.linkedin.com", "sl", "v=1&f68pf")
        .addCookie("www.linkedin.com", "visit", "v=1&M")
        .addCookie(".www.linkedin.com", "bscookie", "v=1&201708250301246c8eaadc-a08f-4e13-8f24-569529ab1ce0AQEk9zZ-nB0gizfSrOSucwXV2Wfc3TBY")
        .addCookie(".linkedin.com", "bcookie", "v=2&d2115cf0-88a6-415a-8a0b-27e56fef9e39");
Did I miss something?

LinkedIn is very difficult to crawl, not just technically: they also sue people who do it.
When they detect an IP as a possible bot, they serve the login page instead. Most IP ranges they know to be used by bots now get a login page, and new ranges do not last very long.
They're probably just fairly confident you're a bot and are keeping you from logging in.

Related

File Uploads in Django from urllib

I have a small Django app where you can upload PDF files.
In the past only human beings used the web application.
In the future a script should be able to upload files.
Up to now we have used ModelBackend for authentication (settings.AUTHENTICATION_BACKENDS).
Goal
A script should be able to authenticate and upload files
My current strategy
I add a new user remote-system-foo and give it a password.
The script somehow logs in to the Django web application and then uploads PDF files.
I would like to use the requests library for the HTTP client script.
Question
How do I log in to the Django web application from the script?
Is my current strategy the right one, or are there better strategies?
You can use the requests library to log into any site; you of course need to tailor the POST depending on which parameters your site requires. If things aren't trivial, take a look at the POST data in Chrome's developer tools while you log in to your site. Here is some code I used to log into a site; it could easily be extended to do whatever you need it to do.
from bs4 import BeautifulSoup as bs
import requests

login_url = 'https://www.examplesite.com/users/sign_in'
session = requests.Session()

# Fetch the login page first so the CSRF token can be pulled out of it
resp = session.get(login_url)
soup = bs(resp.text, "lxml")

# Grab the CSRF token from the login form; the field name depends on the site
# (Django forms use 'csrfmiddlewaretoken', Rails uses 'authenticity_token')
token = soup.find('input', {'name': 'csrfmiddlewaretoken'})['value']

# The POST data for authorizing; this may or may not have been a Django
# site, so see what your POST needs
data = {
    'user[login]': 'foo',
    'user[password]': 'foofoo',
}

# Act like a browser, and send the token in the headers, not with the data!
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 '
                  'Safari/537.36',
    'X-CSRF-Token': token,
}

session.post(login_url, data=data, headers=headers)
Now your session is logged in and you should be able to upload your PDF. I've never tried to upload via requests myself, so take a look at the relevant requests documentation on file uploads.
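For what it's worth, a minimal sketch of the upload step with the same session might look like this; the upload URL and form field names are placeholders, so adjust them to your Django view (which will also expect a CSRF token unless it is csrf-exempt):

upload_url = 'https://www.examplesite.com/documents/upload/'  # placeholder endpoint
with open('report.pdf', 'rb') as f:
    resp = session.post(
        upload_url,
        # Django reads the CSRF token from this POST field; it is mirrored in the csrftoken cookie
        data={'csrfmiddlewaretoken': session.cookies.get('csrftoken', '')},
        files={'file': f},  # 'file' must match the form/serializer field name
    )
resp.raise_for_status()  # raise if the server did not accept the upload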
That being said, this feels like a strange solution. You might consider loading the files as fixtures or via RunSQL, or rather storing their location (e.g. an AWS bucket URL) in the database. But this is new territory for me.
Hope it helps.
We use this library now: https://github.com/hirokiky/django-basicauth
This way we use http-basic-auth for API views and session/cookie auth for interactive human beings.
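With basic auth on the API views, the client script reduces to a single authenticated POST. A sketch with requests, assuming a hypothetical upload endpoint protected by django-basicauth (the URL, field name and credentials are placeholders):

import requests

# Placeholder endpoint protected by django-basicauth
with open('a.pdf', 'rb') as f:
    resp = requests.post(
        'https://example.com/api/upload/',
        auth=('remote-system-foo', 'secret-password'),  # the script user from the question
        files={'file': f},
    )
resp.raise_for_status()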
Since I found no matching solution, I wrote and published this:
https://pypi.python.org/pypi/tbzuploader/
It is a generic HTTP upload tool.
If the HTTP upload was successful, files get moved to a "done" subdirectory.
The upload is considered successful by tbzuploader if the server replies with HTTP status 201 Created.
Additional feature: it handles pairs of files.
For example, you have four files: a.pdf, a.xml, b.pdf, b.xml. The first upload should take a.pdf and a.xml, and the second upload b.pdf and b.xml; read the docs for --patterns.

Strange Google Favicon queries to API

I have recently created an API for internal use in my company. Only my colleagues and I have the URL.
A few days ago I noticed that random requests were occurring to a given method of the API (less than once per day), so I logged accesses to that method and this is what I am getting:
2017-06-18 17:10:00,359 INFO (default task-427) 85.52.215.80 - Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
2017-06-20 07:25:42,273 INFO (default task-614) 85.52.215.80 - Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
The request to the API is performed with the full set of parameters (I mean, it's not just to the root of the webservice)
Any idea of what could be going on?
I have several theories:
A team member has a browser tab open with the method's request URL, and it reloads every time they open the browser. --> This is my favourite, but everybody claims it's not their fault
A team member has the service URL (with all parameters) in their browser history, and the browser randomly queries it to retrieve the favicon
A team member has the service URL (with all parameters) in their browser favourites/bookmarks, and the browser randomly queries it to retrieve the favicon
While the User-Agent (Google Favicon) seems to suggest one of the two latter options, the IP (located near our own city, on the Orange Spain ISP) seems to suggest the first one: after a quick search on the Internet, I've found that everybody who gets such requests seems to see a Google IP from California.
I know I could just block that User-Agent or IP, but I'd really like to get to the bottom of this issue.
Thanks!
Edit:
Now I am getting User Agents as:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/41.0.2272.118 Safari/537.36
as well :/
Both of these User-Agents are associated with Google's Fetch and Render tool in Google Search Console. These user agents make requests when someone asks Google to Fetch and Render a given page for SEO validation. That does not quite make sense here, since you are asking about an API and not a page. But perhaps a page that was submitted to the Fetch and Render service called the API?

Website is up and running but parsing it results in HTTP Error 503

I want to crawl a webpage using the urllib2 library and extract some information according to my needs. I am able to freely navigate the site (going from one link to another and so on), but when I try to parse it I get an error
HTTP Error 503 : Service Temporarily Unavailable
I searched about it on the net and found that this error occurs when "the web site's server is not available at that time".
I am confused after reading this: if the website's server is down, then how come it's up and running (since I am able to navigate the webpage)? And if the server is not down, then why am I getting this 503 error?
Is there a possibility that the server has done something to prevent the parsing of the web page?
Thanks in advance.
Most probably your user-agent is banned by the server, precisely to keep out, well, web crawlers. That is why some websites, including Wikipedia, return a 50x error when an unwanted user-agent (such as wget, curl, urllib, ...) is used.
However, changing the user-agent might be enough. At least that's the case for Wikipedia, which works just fine with a Firefox user agent (the "ban" most probably relies only on the user-agent); see the sketch below.
Finally, there must be a reason for those websites to ban web crawlers. Depending on what you're working on, you might want to use another solution. For example, Wikipedia provides database dumps, which can be convenient if you intend to make intensive use of it.
PS. Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11 is the user-agent I use for Wikipedia on a project of mine.
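A minimal Python 2 sketch of sending that user-agent with urllib2 (the library the question uses); the target URL is just a placeholder:

import urllib2

# Override urllib2's default "Python-urllib/x.y" user-agent, which many sites block
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; '
                         'rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11'}
request = urllib2.Request('http://example.com/some-page', headers=headers)
response = urllib2.urlopen(request)
html = response.read()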

Facebook OAuthException from only some servers

I am trying to retrieve a public Facebook photo album for display on a website.
I can access the list of photo albums and the individual albums using the Facebook Graph API without getting an access token;
see https://graph.facebook.com/vailresorts/albums
I've tried that from 4 different servers on different networks, none of them logged in to Facebook, and it works fine. However, when I run it on my test server, it doesn't work and I get the
OAuthException - An access token is required to request this resource.
error.
I'm wondering why that is: why does requesting that URL from different places give different results?
My understanding was that if the album is public, I don't need an App ID and App Secret.
Is this not true?
Request header:
Request URL:https://graph.facebook.com/vailresorts/albums
Request Method:GET
Status Code:200 OK
Request Headers
accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
accept-charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
accept-encoding:gzip,deflate,sdch
accept-language:en-US,en;q=0.8
cache-control:no-cache
host:graph.facebook.com
method:GET
pragma:no-cache
scheme:https
url:/vailresorts/albums
user-agent:Mozilla/5.0 (Windows NT 5.2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
version:HTTP/1.1
Just to throw in a 'me too': the failures do not happen on every request. This bug may be related:
https://developers.facebook.com/bugs/410815249017032
I get the error:
{"error":{"message":"An access token is required to request this resource.","type":"OAuthException","code":104}}
Edit: sorry, I didn't mean to post this as an answer but as a comment; I don't know yet how to do that.
Update: I filed a new bug attempting to get their attention https://developers.facebook.com/bugs/444360005657309
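Since the failures are intermittent, a quick way to see how often the error appears from a given server is to hit the endpoint repeatedly and log the result. A rough sketch with requests (the loop count is arbitrary):

import requests

# Probe the public albums endpoint a few times and log how often the
# OAuthException (code 104) comes back from this network/server.
URL = 'https://graph.facebook.com/vailresorts/albums'

for i in range(10):
    resp = requests.get(URL)
    body = resp.json()
    if 'error' in body:
        err = body['error']
        print(i, resp.status_code, err.get('code'), err.get('message'))
    else:
        print(i, resp.status_code, 'ok,', len(body.get('data', [])), 'albums')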

request['HTTP_USER_AGENT'] structure in modern browsers

I ran into a Safari problem concerning cookie policy in iframes... I found a working solution for that, but to make it work I need to determine which browser the user is viewing the page in.
The original solution was to search the HTTP_USER_AGENT (Django) for the word "safari". The problem here is:
Safari Windows XP on WM User Agent - Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.52.7 (KHTML, like Gecko) Version/5.1.2 Safari/534.52.7
Chrome Linux User Agent - Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.
So I'm struggling to find information on what makes up a User-Agent and how to parse it to get precise results. Sure, in this case I can throw in an extra check that the word 'chrome' is absent, but what about Chromium, Konqueror and any other minor browsers...
So I found that a User-Agent string can contain any information you want.
There are some loose rules by which you can determine the browser, yet those rules do not apply to all browsers.
During the browser wars, many web servers were configured to only send web pages that required advanced features to clients that were identified as some version of Mozilla.
For this reason, most Web browsers use a User-Agent value as follows: Mozilla/[version] ([system and browser information]) [platform] ([platform details]) [extensions].
More # http://en.wikipedia.org/wiki/User_agent
In my case I've looked at http://www.user-agents.org/ and determined that only Chrome impersonates Safari in the last section.
http://www.quirksmode.org/js/detect.html
Just search for the word Chrome, first, then search for Safari.
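In Django terms, a minimal sketch of that "rule out Chrome first, then look for Safari" check might look like this (the helper name is made up for illustration):

def is_safari(request):
    """Rough check: Chrome and Chromium also advertise 'Safari', so rule them out first."""
    ua = request.META.get('HTTP_USER_AGENT', '').lower()
    return 'safari' in ua and 'chrome' not in ua and 'chromium' not in ua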