How to appropriately scrape LinkedIn directory - python-2.7

I am trying to build a basic LinkedIn scraper for a research project and am running into trouble when I try to scrape through the levels of the directory. I am a beginner; every time I run the code below, IDLE returns an error and then shuts down. The code and error are below:
Code:
import requests
from bs4 import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint as pp
PROFILE_URL = "linkedin.com"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
# use this to gather all of the individual links from the second directory page
def get_second_links(pre_section_link):
    response = requests.get(pre_section_link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column = soup.find("ul", attrs={'class':'column dual-column'})
    second_links = [li.a["href"] for li in column.findAll("li")]
    return second_links

# use this to gather all of the individual links from the third directory page
def get_third_links(section_link):
    response = requests.get(section_link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column = soup.find("ul", attrs={'class':'column dual-column'})
    third_links = [li.a["href"] for li in column.findAll("li")]
    return third_links

# use this to build the individual profile links
def get_profile_link(link):
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column2 = soup.find("ul", attrs={'class':'column dual-column'})
    profile_links = [PROFILE_URL + li.a["href"] for li in column2.findAll("li")]
    return profile_links

if __name__=="__main__":
    sub_directory = get_second_links("https://www.linkedin.com/directory/people-a-1/")
    sub_directory = map(get_third_links, sub_directory)
    profiles = get_third_links(sub_directory)
    profiles = map(get_profile_link, profiles)
    profiles = [item for sublist in fourth_links for item in sublist]
    pp(profiles)
Error I keep getting:
Error Page

You need to add https to PROFILE_URL:
PROFILE_URL = "https://linkedin.com"

Related

Need to scrape the data using BeautifulSoup

I need to get the celebrity details from https://www.astrotheme.com/celestar/horoscope_celebrity_search_by_filters.php
Input: time of birth filtered to "as known only", with every profession category except world events, which returns nearly 22,822 celebrities. I am able to get the first page of data using urllib2 and bs4:
import re
import urllib2
from bs4 import BeautifulSoup
url = "https://www.astrotheme.com/celestar/horoscope_celebrity_search_by_filters.php"
data = "sexe=M|F&categorie[0]=0|1|2|3|4|5|6|7|8|9|10|11|12&connue=1&pays=-1&tri=0&x=33&y=13"
fp = urllib2.urlopen(url, data)
soup = BeautifulSoup(fp, 'html.parser')
from_div = soup.find_all('div', attrs={'class': 'titreFiche'})
for major in from_div:
    name = re.findall(r'portrait">(.*?)<br/>', str(major))
    link = re.findall(r'<a href="(.*?)"', str(major))
    print name[0], link[0]
For the next 230 pages I am unable to get the data. I tried changing the URL with a page parameter all the way to the end, but I can't scrape the results. Is there any way to get the remaining data from those pages?
You need session cookies; use requests, which keeps a session easily:
from bs4 import BeautifulSoup
import requests, re
url = "https://www.astrotheme.com/celestar/horoscope_celebrity_search_by_filters.php"
searchData = {
    "sexe": "M|F",
    "categorie[0]": "0|1|2|3|4|5|6|7|8|9|10|11|12",
    "connue": 1, "pays": -1, "tri": 0, "x": 33, "y": 13
}
session = requests.session()

def doSearch(url, data=None):
    if data:
        fp = session.post(url, data=data).text
    else:
        fp = session.get(url).text
    soup = BeautifulSoup(fp, 'html.parser')
    from_div = soup.find_all('div', attrs={'class': 'titreFiche'})
    for major in from_div:
        name = re.findall(r'portrait">(.*?)<br/>', str(major))
        link = re.findall(r'<a href="(.*?)"', str(major))
        print name[0], link[0]

# do POST search in the first request
doSearch(url, searchData)

# we now have the session and can use GET requests for the next pages
for index in range(2, 4):  # get pages 2 to 3
    print('getting page: %s' % index)
    pageurl = '%s?page=%s' % (url, index)
    print(pageurl)
    doSearch(pageurl)
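If you want to go beyond pages 2-3 and fetch all of the roughly 230 pages mentioned in the question, you could extend the same loop and add a small delay between requests. A rough sketch (the page count and the one-second pause are assumptions, not from the original answer):

import time

for index in range(2, 231):          # pages 2 to 230, per the count given in the question
    pageurl = '%s?page=%s' % (url, index)
    doSearch(pageurl)                # reuses the session cookies from the first POST
    time.sleep(1)                    # pause between requests so the server isn't hammered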

Loop through payload Python

There is a site that I connect to, but I need to log in 4 times with different usernames and passwords.
Is there any way I can do this by looping through the usernames and passwords in a payload?
This is the first time I am doing this and I am not really sure how to go about it.
The code works fine if I post just one username and password.
I'm using Python 2.7 with BeautifulSoup and requests.
Here is my code:
import requests
import zipfile, StringIO
from bs4 import BeautifulSoup
# Here we add the login details to be submitted to the login form.
payload = [
    {'USERNAME': 'xxxxxx', 'PASSWORD': 'xxxxxx', 'option': 'login'},
    {'USERNAME': 'xxxxxx', 'PASSWORD': 'xxxxxxx', 'option': 'login'},
    {'USERNAME': 'xxxxx', 'PASSWORD': 'xxxxx', 'option': 'login'},
    {'USERNAME': 'xxxxxx', 'PASSWORD': 'xxxxxx', 'option': 'login'},
]

# Possibly need headers later.
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}

base_url = "https://service.rl360.com/scripts/customer.cgi/SC/servicing/"

with requests.Session() as s:
    p = s.post('https://service.rl360.com/scripts/customer.cgi?option=login', data=payload)
    # Get the download page to scrape.
    r = s.get('https://service.rl360.com/scripts/customer.cgi/SC/servicing/downloads.php?Folder=DataDownloads&SortField=ExpiryDays&SortOrder=Ascending', stream=True)
    content = r.text
    soup = BeautifulSoup(content, 'lxml')
    # Now I get the most recent download URL.
    download_url = soup.find_all("a", {'class': 'tabletd'})[-1]['href']
    # Now we join the base url with the download url.
    download_docs = s.get(base_url + download_url, stream=True)
    print "Checking Content"
    content_type = download_docs.headers['content-type']
    print content_type
    print "Checking Filename"
    content_name = download_docs.headers['content-disposition']
    print content_name
    print "Checking Download Size"
    content_size = download_docs.headers['content-length']
    print content_size
    # This is where we extract and download the specified xml files.
    z = zipfile.ZipFile(StringIO.StringIO(download_docs.content))
    print "---------------------------------"
    print "Downloading........."
    # Now we save the files to the specified location.
    z.extractall('C:\Temp')
    print "Download Complete"
Just use a for loop. You may need to adjust your download directory if files will be overwritten.
payloads = [
    {'USERNAME': 'xxxxxx1', 'PASSWORD': 'xxxxxx', 'option': 'login'},
    {'USERNAME': 'xxxxxx2', 'PASSWORD': 'xxxxxxx', 'option': 'login'},
    {'USERNAME': 'xxxxx3', 'PASSWORD': 'xxxxx', 'option': 'login'},
    {'USERNAME': 'xxxxxx4', 'PASSWORD': 'xxxxxx', 'option': 'login'},
]

....

for payload in payloads:
    with requests.Session() as s:
        p = s.post('https://service.rl360.com/scripts/customer.cgi?option=login', data=payload)
        ...
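Since each login downloads and extracts a zip to the same C:\Temp folder, one way to avoid overwriting files is to extract each account's download into its own subfolder. A rough sketch (the account_N folder naming is my own suggestion, not from the original answer):

import os

for i, payload in enumerate(payloads, start=1):
    with requests.Session() as s:
        p = s.post('https://service.rl360.com/scripts/customer.cgi?option=login', data=payload)
        # ... same scraping and download code as above ...
        target_dir = os.path.join(r'C:\Temp', 'account_%d' % i)   # one folder per login
        if not os.path.isdir(target_dir):
            os.makedirs(target_dir)
        # z.extractall(target_dir)   # extract this account's files into its own folder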

Beautiful Soup - Unable to scrape links from paginated pages

I'm unable to scrape the links to the articles on the paginated pages. Additionally, I sometimes get a blank screen as my output, and I can't find the problem in my loop. The csv file also doesn't get created.
from pprint import pprint
import requests
from bs4 import BeautifulSoup
import lxml
import csv
import urllib2
def get_url_for_search_key(search_key):
    for i in range(1, 100):
        base_url = 'http://www.thedrum.com/'
        response = requests.get(base_url + 'search?page=%s&query=' + search_key + '&sorted=')%i
        soup = BeautifulSoup(response.content, "lxml")
        results = soup.findAll('a')
        return [url['href'] for url in soup.findAll('a')]

pprint(get_url_for_search_key('artificial intelligence'))

with open('StoreUrl.csv', 'w+') as f:
    f.seek(0)
    f.write('\n'.join(get_url_for_search_key('artificial intelligence')))
Are you sure you need only the first 100 pages? Maybe there are more of them...
My take on your task is below; it collects links from all pages and also follows the "Next page" button precisely:
import requests
from bs4 import BeautifulSoup
base_url = 'http://www.thedrum.com/search?sort=date&query=artificial%20intelligence'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")
res = []
while 1:
    results = soup.findAll('a')
    res.append([url['href'] for url in soup.findAll('a')])
    next_button = soup.find('a', text='Next page')
    if not next_button:
        break
    response = requests.get(next_button['href'])
    soup = BeautifulSoup(response.content, "lxml")
EDIT: alternative approach for collecting only article links:
import requests
from bs4 import BeautifulSoup
base_url = 'http://www.thedrum.com/search?sort=date&query=artificial%20intelligence'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")
res = []
while 1:
    search_results = soup.find('div', class_='search-results')  # localizing search window with article links
    article_link_tags = search_results.findAll('a')  # ordinary scheme goes further
    res.append([url['href'] for url in article_link_tags])
    next_button = soup.find('a', text='Next page')
    if not next_button:
        break
    response = requests.get(next_button['href'])
    soup = BeautifulSoup(response.content, "lxml")
To print the links, use:

for i in res:
    for j in i:
        print(j)
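Since the original goal also included saving the links to StoreUrl.csv, you could flatten res and write it out once the loop finishes. A minimal sketch (the one-URL-per-row layout is my choice, not from the original answer):

import csv

with open('StoreUrl.csv', 'wb') as f:        # 'wb' is what the csv module expects on Python 2.7
    writer = csv.writer(f)
    for page_links in res:                   # res is a list of per-page link lists
        for link in page_links:
            writer.writerow([link])          # one URL per row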

Parsing a table using beautifulsoup

I want to fetch the contents of a table every time it gets updated, using BeautifulSoup. Why doesn't this piece of code work? It either returns no output or sometimes throws an exception.
from bs4 import BeautifulSoup
import urllib2
url = "http://tenders.ongc.co.in/wps/portal/!ut/p/b1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOINLc3MPB1NDLwsPJ1MDTzNPcxMDYJCjA0MzIAKIoEKDHAARwNC-sP1o8BK8Jjg55Gfm6pfkBthoOuoqAgArsFI6g!!/pw/Z7_1966IA40J8IB50I7H650RT30D2/ren/m=view/s=normal/p=struts.portlet.action=QCPtenderHomeQCPlatestTenderListAction/p=struts.portlet.mode=view/=/#Z7_1966IA40J8IB50I7H650RT30D2"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
divcontent = soup.find('div', {"id":"latestTrPagging", "class":"content2"})
table = soup.find_all('table')
rows = table.findAll('tr', {"class":"even", "class": "odd"})
for row in rows:
    cols = row.findAll('td', {"class":"tno"})
    for td in cols:
        print td.text(text=True)
The url is https://tenders.ongc.co.in/wps/portal/!ut/p/b1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOINLc3MPB1NDLwsPJ1MDTzNPcxMDYJCjA0MzIAKIoEKDHAARwNC-sP1o8BK8Jjg55Gfm6pfkBthoOuoqAgArsFI6g!!/pw/Z7_1966IA40J8IB50I7H650RT30D2/ren/m=view/s=normal/p=struts.portlet.action=QCPtenderHomeQCPlatestTenderListAction/p=struts.portlet.mode=view/=/#Z7_1966IA40J8IB50I7H650RT30D2
I just want to fetch the table part and get notified when a new tender comes in.
Here is what works for me - using requests instead of urllib2, setting the User-Agent header and adjusting some of the locators:
from bs4 import BeautifulSoup
import requests
url = "https://tenders.ongc.co.in/wps/portal/!ut/p/b1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOINLc3MPB1NDLwsPJ1MDTzNPcxMDYJCjA0MzIAKIoEKDHAARwNC-sP1o8BK8Jjg55Gfm6pfkBthoOuoqAgArsFI6g!!/pw/Z7_1966IA40J8IB50I7H650RT30D2/ren/m=view/s=normal/p=struts.portlet.action=QCPtenderHomeQCPlatestTenderListAction/p=struts.portlet.mode=view/=/#Z7_1966IA40J8IB50I7H650RT30D2"
page = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"})
soup = BeautifulSoup(page.content, "html.parser")
divcontent = soup.find('div', {"id": "latestTrPagging", "class": "content2"})
table = soup.find('table')
rows = table.find_all('tr', {"class": ["even", "odd"]})
for row in rows:
    cols = row.find_all('td', {"class": "tno"})
    for td in cols:
        print(td.get_text())
Prints the first 10 tender numbers:
LC1MC16044[NIT]
LC1MC16043[NIT]
LC1MC16045[NIT]
EY1VC16028[NIT]
RC2SC16050(E -tender)[NIT]
RC2SC16048(E -tender)[NIT]
RC2SC16049(E -tender)[NIT]
UI1MC16002[NIT]
V16RC16015[E-Gas]
K16AC16002[E-Procurement]
Also note how multiple classes ("even" and "odd") should be handled: pass them as a list.
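As for getting notified when a new tender appears, one simple approach is to poll the page periodically and compare the set of tender numbers against the previous run. A rough sketch building on the code above (the 15-minute interval and the print-based "notification" are placeholders, not from the original answer):

import time

def fetch_tender_numbers():
    page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(page.content, "html.parser")
    table = soup.find('table')
    return {td.get_text()
            for row in table.find_all('tr', {"class": ["even", "odd"]})
            for td in row.find_all('td', {"class": "tno"})}

seen = fetch_tender_numbers()
while True:
    time.sleep(15 * 60)                      # poll every 15 minutes (arbitrary interval)
    current = fetch_tender_numbers()
    new_tenders = current - seen
    if new_tenders:
        print("New tenders: %s" % sorted(new_tenders))   # swap in an email or other alert here
        seen = current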

Automating pulling csv files off google Trends

pyGTrends does not seem to work; it gives errors in Python.
pyGoogleTrendsCsvDownloader seems to work and logs in, but after 1-3 requests (per day!) it complains about an exhausted quota, even though a manual download with the same login/IP works flawlessly.
Bottom line: neither works. Searching through Stack Overflow turns up many questions from people trying to pull csv files from Google, but no workable solution that I could find...
Thanks in advance to whoever is able to help. How should the code be changed? Do you know of another solution that works?
Here's the code of pyGoogleTrendsCsvDownloader.py
import httplib
import urllib
import urllib2
import re
import csv
import lxml.etree as etree
import lxml.html as html
import traceback
import gzip
import random
import time
import sys

from cookielib import Cookie, CookieJar
from StringIO import StringIO


class pyGoogleTrendsCsvDownloader(object):
    '''
    Google Trends Downloader

    Recommended usage:

    from pyGoogleTrendsCsvDownloader import pyGoogleTrendsCsvDownloader
    r = pyGoogleTrendsCsvDownloader(username, password)
    r.get_csv(cat='0-958', geo='US-ME-500')
    '''
    def __init__(self, username, password):
        '''
        Provide login and password to be used to connect to Google Trends
        All immutable system variables are also defined here
        '''
        # The amount of time (in secs) that the script should wait before making a request.
        # This can be used to throttle the downloading speed to avoid hitting servers too hard.
        # It is further randomized.
        self.download_delay = 0.25

        self.service = "trendspro"
        self.url_service = "http://www.google.com/trends/"
        self.url_download = self.url_service + "trendsReport?"

        self.login_params = {}
        # These headers are necessary, otherwise Google will flag the request at your account level
        self.headers = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'),
                        ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
                        ("Accept-Language", "en-gb,en;q=0.5"),
                        ("Accept-Encoding", "gzip, deflate"),
                        ("Connection", "keep-alive")]
        self.url_login = 'https://accounts.google.com/ServiceLogin?service='+self.service+'&passive=1209600&continue='+self.url_service+'&followup='+self.url_service
        self.url_authenticate = 'https://accounts.google.com/accounts/ServiceLoginAuth'
        self.header_dictionary = {}

        self._authenticate(username, password)

    def _authenticate(self, username, password):
        '''
        Authenticate to Google:
        1 - make a GET request to the Login webpage so we can get the login form
        2 - make a POST request with email, password and login form input values
        '''
        # Make sure we get CSV results in English
        ck = Cookie(version=0, name='I4SUserLocale', value='en_US', port=None, port_specified=False, domain='www.google.com', domain_specified=False, domain_initial_dot=False, path='/trends', path_specified=True, secure=False, expires=None, discard=False, comment=None, comment_url=None, rest=None)

        self.cj = CookieJar()
        self.cj.set_cookie(ck)
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
        self.opener.addheaders = self.headers

        # Get all of the login form input values
        find_inputs = etree.XPath("//form[@id='gaia_loginform']//input")
        try:
            resp = self.opener.open(self.url_login)

            if resp.info().get('Content-Encoding') == 'gzip':
                buf = StringIO(resp.read())
                f = gzip.GzipFile(fileobj=buf)
                data = f.read()
            else:
                data = resp.read()

            xmlTree = etree.fromstring(data, parser=html.HTMLParser(recover=True, remove_comments=True))

            for input in find_inputs(xmlTree):
                name = input.get('name')
                if name:
                    name = name.encode('utf8')
                    value = input.get('value', '').encode('utf8')
                    self.login_params[name] = value
        except:
            print("Exception while parsing: %s\n" % traceback.format_exc())

        self.login_params["Email"] = username
        self.login_params["Passwd"] = password

        params = urllib.urlencode(self.login_params)
        self.opener.open(self.url_authenticate, params)

    def get_csv(self, throttle=False, **kwargs):
        '''
        Download CSV reports
        '''
        # Randomized download delay
        if throttle:
            r = random.uniform(0.5 * self.download_delay, 1.5 * self.download_delay)
            time.sleep(r)

        params = {
            'export': 1
        }
        params.update(kwargs)
        params = urllib.urlencode(params)

        r = self.opener.open(self.url_download + params)

        # Make sure everything is working ;)
        if not r.info().has_key('Content-Disposition'):
            print "You've exceeded your quota. Continue tomorrow..."
            sys.exit(0)

        if r.info().get('Content-Encoding') == 'gzip':
            buf = StringIO(r.read())
            f = gzip.GzipFile(fileobj=buf)
            data = f.read()
        else:
            data = r.read()

        myFile = open('trends_%s.csv' % '_'.join(['%s-%s' % (key, value) for (key, value) in kwargs.items()]), 'w')
        myFile.write(data)
        myFile.close()
Although I don't know Python, I may have a solution. I am currently doing the same thing in C#, and though I didn't get the .csv file, I created a custom URL through code and then downloaded that HTML and saved it to a text file (also through code). In this HTML (at line 12) is all the information needed to create the graph that is used on Google Trends. However, it contains a lot of unnecessary text that needs to be cut down. Either way, you end up with the same result: the Google Trends data. I posted a more detailed answer to my question here:
Downloading .csv file from Google Trends
There is an alternative module named pytrends (https://pypi.org/project/pytrends/). It is really cool; I would recommend it.
Example usage:
import numpy as np
import pandas as pd
from pytrends.request import TrendReq
pytrend = TrendReq()
# The term that you want to search for
pytrend.build_payload(kw_list=["Eminem is the Rap God"])

# Find which regions have searched for the term
df = pytrend.interest_by_region()
df.to_csv("path\Eminem_InterestbyRegion.csv")
If you have a list of terms to search, you could use a for loop to automate the insights, for example as sketched below.
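A rough sketch of that loop (the keyword list and the output file names are made-up examples, not from the original answer):

from pytrends.request import TrendReq

pytrend = TrendReq()
keywords = ["data science", "machine learning", "web scraping"]   # example terms

for kw in keywords:
    pytrend.build_payload(kw_list=[kw])
    df = pytrend.interest_by_region()
    # one csv per keyword; replace spaces so the term makes a valid file name
    df.to_csv("%s_InterestbyRegion.csv" % kw.replace(" ", "_"))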