I'm scraping god names from the website of a game. The scraped text is stored in a PostgreSQL database through Django models.
When I run my program twice, every god is inserted again and I end up with duplicates of everything.
How do I avoid this?
import requests
import urllib3
from bs4 import BeautifulSoup
import psycopg2
import os
import django

os.environ['DJANGO_SETTINGS_MODULE'] = 'locallibrary.settings'
django.setup()

from scraper.models import GodList

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

session = requests.Session()
session.headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"}

url = 'https://www.smitegame.com/'
content = session.get(url, verify=False).content
soup = BeautifulSoup(content, "html.parser")

allgods = soup.find_all('div', {'class': 'god'})
allitem = []

for god in allgods:
    godName = god.find('p')
    godFoto = god.find('img').get('src')
    allitem.append((godName, godFoto))
    GodList.objects.create(godName=godName.text)
Below is my models file.
class GodList(models.Model):
    godName = models.CharField(max_length=50, unique=True)
    godFoto = models.CharField(max_length=100, unique=True)

    def __str__(self):
        return self.godName
Just use the get_or_create() method on the model manager instead of create() to avoid adding a duplicate.
god, created = GodList.objects.get_or_create(godName=godName.text)
god will be the model instance that was fetched or created, and created will be True if the object had to be created, otherwise False.
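For example, the scraping loop from the question could use it like this (a minimal sketch; the defaults dict is an assumption on my part so that godFoto is only filled in when a new row is inserted):

for god in allgods:
    godName = god.find('p')
    godFoto = god.find('img').get('src')
    allitem.append((godName, godFoto))
    # Looks up an existing row by godName and only inserts when it is missing.
    god_obj, created = GodList.objects.get_or_create(
        godName=godName.text,
        defaults={'godFoto': godFoto},
    )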
I want to fetch the contents of a table every time it gets updated, using BeautifulSoup. Why doesn't this piece of code work? It doesn't return any output, or sometimes throws an exception.
from bs4 import BeautifulSoup
import urllib2

url = "http://tenders.ongc.co.in/wps/portal/!ut/p/b1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOINLc3MPB1NDLwsPJ1MDTzNPcxMDYJCjA0MzIAKIoEKDHAARwNC-sP1o8BK8Jjg55Gfm6pfkBthoOuoqAgArsFI6g!!/pw/Z7_1966IA40J8IB50I7H650RT30D2/ren/m=view/s=normal/p=struts.portlet.action=QCPtenderHomeQCPlatestTenderListAction/p=struts.portlet.mode=view/=/#Z7_1966IA40J8IB50I7H650RT30D2"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

divcontent = soup.find('div', {"id": "latestTrPagging", "class": "content2"})
table = soup.find_all('table')
rows = table.findAll('tr', {"class": "even", "class": "odd"})

for row in rows:
    cols = row.findAll('td', {"class": "tno"})
    for td in cols:
        print td.text(text=True)
The url is https://tenders.ongc.co.in/wps/portal/!ut/p/b1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOINLc3MPB1NDLwsPJ1MDTzNPcxMDYJCjA0MzIAKIoEKDHAARwNC-sP1o8BK8Jjg55Gfm6pfkBthoOuoqAgArsFI6g!!/pw/Z7_1966IA40J8IB50I7H650RT30D2/ren/m=view/s=normal/p=struts.portlet.action=QCPtenderHomeQCPlatestTenderListAction/p=struts.portlet.mode=view/=/#Z7_1966IA40J8IB50I7H650RT30D2
I just want to fetch the table part and get notified when a new tender comes in.
Here is what works for me - using requests instead of urllib2, setting the User-Agent header and adjusting some of the locators:
from bs4 import BeautifulSoup
import requests

url = "https://tenders.ongc.co.in/wps/portal/!ut/p/b1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOINLc3MPB1NDLwsPJ1MDTzNPcxMDYJCjA0MzIAKIoEKDHAARwNC-sP1o8BK8Jjg55Gfm6pfkBthoOuoqAgArsFI6g!!/pw/Z7_1966IA40J8IB50I7H650RT30D2/ren/m=view/s=normal/p=struts.portlet.action=QCPtenderHomeQCPlatestTenderListAction/p=struts.portlet.mode=view/=/#Z7_1966IA40J8IB50I7H650RT30D2"
page = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"})
soup = BeautifulSoup(page.content, "html.parser")

divcontent = soup.find('div', {"id": "latestTrPagging", "class": "content2"})
table = soup.find('table')
rows = table.find_all('tr', {"class": ["even", "odd"]})

for row in rows:
    cols = row.find_all('td', {"class": "tno"})
    for td in cols:
        print(td.get_text())
Prints the first 10 tender numbers:
LC1MC16044[NIT]
LC1MC16043[NIT]
LC1MC16045[NIT]
EY1VC16028[NIT]
RC2SC16050(E -tender)[NIT]
RC2SC16048(E -tender)[NIT]
RC2SC16049(E -tender)[NIT]
UI1MC16002[NIT]
V16RC16015[E-Gas]
K16AC16002[E-Procurement]
Note how multiple classes ("even" and "odd") should be handled: pass a list of class values rather than repeating the "class" key in the dictionary.
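As for getting notified when a new tender comes in, a minimal sketch would be to re-fetch the page periodically and diff the set of tender numbers (an assumption on my part: the 10-minute interval and the print() notification are placeholders to replace with email, Slack, or whatever you use):

import time

import requests
from bs4 import BeautifulSoup

URL = "https://tenders.ongc.co.in/wps/portal/!ut/p/b1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOINLc3MPB1NDLwsPJ1MDTzNPcxMDYJCjA0MzIAKIoEKDHAARwNC-sP1o8BK8Jjg55Gfm6pfkBthoOuoqAgArsFI6g!!/pw/Z7_1966IA40J8IB50I7H650RT30D2/ren/m=view/s=normal/p=struts.portlet.action=QCPtenderHomeQCPlatestTenderListAction/p=struts.portlet.mode=view/=/#Z7_1966IA40J8IB50I7H650RT30D2"
HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"}

def fetch_tender_numbers():
    # Same scrape as above, but returning the tender numbers as a set for easy diffing.
    page = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(page.content, "html.parser")
    return {td.get_text(strip=True)
            for row in soup.find_all('tr', {"class": ["even", "odd"]})
            for td in row.find_all('td', {"class": "tno"})}

seen = fetch_tender_numbers()
while True:
    time.sleep(600)  # poll every 10 minutes
    new = fetch_tender_numbers() - seen
    for tender in new:
        print("New tender:", tender)  # swap in your preferred notification here
    seen |= new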
I'm trying to set up image downloading from web pages using the Scrapy framework and django-item. I think I have done everything as described in the docs,
but after calling scrapy crawl my log looks like this:
Scrapy log
I can't find any information there about what went wrong, but the images field is empty and the directory does not contain any images.
This is my model
class Event(models.Model):
    title = models.CharField(max_length=100, blank=False)
    description = models.TextField(blank=True, null=True)
    event_location = models.CharField(max_length=100, blank=True, null=True)
    image_urls = models.CharField(max_length=200, blank=True, null=True)
    images = models.CharField(max_length=100, blank=True, null=True)
    url = models.URLField(max_length=200)

    def __unicode__(self):
        return self.title
and this is how I go from the spider to the image pipeline:
def parse_from_details_page(self, response):
    "Some code"
    item_event = item_loader.load_item()
    # this is to create the image_urls list (there is always only one image url)
    item_event['image_urls'] = [item_event['image_urls'], ]
    return item_event
and finally this is my settings.py for the Scrapy project:
import sys
import os
import django
DJANGO_PROJECT_PATH = os.path.join(os.path.dirname((os.path.abspath(__file__))), 'MyScrapy')
#sys.path.insert(0, DJANGO_PROJECT_PATH)
#sys.path.append(DJANGO_PROJECT_PATH)
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "MyScrapy.settings")
#os.environ["DJANGO_SETTINGS_MODULE"] = "MyScrapy.settings"
django.setup()
BOT_NAME = 'EventScraper'
SPIDER_MODULES = ['EventScraper.spiders']
NEWSPIDER_MODULE = 'EventScraper.spiders'
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 100,
    'EventScraper.pipelines.EventscraperPipeline': 200,
}
#MEDIA STORAGE URL
IMAGES_STORE = os.path.join(DJANGO_PROJECT_PATH, "IMAGES")
#IMAGES (used to be sure that it takes good fields)
FILES_URLS_FIELD = 'image_urls'
FILES_RESULT_FIELD = 'images'
Thank you in advance for your help
EDIT:
I used a custom image pipeline from the docs, which looks like this:
class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            import ipdb; ipdb.set_trace()
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        import ipdb; ipdb.set_trace()
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
In get_media_requests it creates a request to my URL, but in item_completed the results param contains something like this: [(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: >)]
I still don't know how to fix it.
Could the problem be caused by the image URL using https?
I faced the EXACT same issue with Scrapy.
My solution:
Add headers to the request you're yielding in the get_media_requests function. I added a User-Agent and a Host along with some other headers. Here's my list of headers:
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Proxy-Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Host': 'images.finishline.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
Open the exact image URL in your browser (the URL you're downloading the image from) and check the browser's network tab for the headers it sends. Make sure the headers on the request you yield in get_media_requests match those, as in the sketch below.
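A minimal sketch of that change applied to the pipeline from the question (the header values shown here are examples from my case, not necessarily the ones your image host expects):

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    # Example headers; replace Host and User-Agent with the values your image server expects.
    image_headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Host': 'images.finishline.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    }

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            # Attach the headers to every image download request.
            yield scrapy.Request(image_url, headers=self.image_headers)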
Hope it works.
I am trying to build a basic LinkedIn scraper for a research project and am running into challenges when I try to scrape through the levels of the directory. I am a beginner, and when I run the code below, IDLE returns an error before shutting down. See the code and error below:
Code:
import requests
from bs4 import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint as pp

PROFILE_URL = "linkedin.com"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

# use this to gather all of the individual links from the second directory page
def get_second_links(pre_section_link):
    response = requests.get(pre_section_link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column = soup.find("ul", attrs={'class': 'column dual-column'})
    second_links = [li.a["href"] for li in column.findAll("li")]
    return second_links

# use this to gather all of the individual links from the third directory page
def get_third_links(section_link):
    response = requests.get(section_link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column = soup.find("ul", attrs={'class': 'column dual-column'})
    third_links = [li.a["href"] for li in column.findAll("li")]
    return third_links

# use this to build the individual profile links
def get_profile_link(link):
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column2 = soup.find("ul", attrs={'class': 'column dual-column'})
    profile_links = [PROFILE_URL + li.a["href"] for li in column2.findAll("li")]
    return profile_links

if __name__ == "__main__":
    sub_directory = get_second_links("https://www.linkedin.com/directory/people-a-1/")
    sub_directory = map(get_third_links, sub_directory)
    profiles = get_third_links(sub_directory)
    profiles = map(get_profile_link, profiles)
    profiles = [item for sublist in fourth_links for item in sublist]
    pp(profiles)
Error I keep getting:
Error Page
You need to add https to PROFILE_URL:
PROFILE_URL = "https://linkedin.com"
I am new to Python. I want to get the Facebook likes count using the Facebook Graph API, but I'm having a problem with the Graph API. The following code runs in IPython but doesn't run in Django.
from django.http import HttpResponse, HttpResponseForbidden
import requests  # pip install requests
import json
import facebook
from prettytable import PrettyTable
from collections import Counter

def tweet(request):
    ACCESS_TOKEN = 'CAACEdEose0cBAGqqZAQTaCoCnGn9jhUR42LAuxtZBHBZCPCsverUSIngAqYidbLMgQ8K0gCnOoGFRmYEZCMoTVL0SF R0ZBKCi2TUZC8m8RXk4wAj1UQyu927GYXicFIXXv2zWVeKbPFXaGhqofwClOF7DHdewTL48ZCqy5ZBZBVsM1JopgpmGNldcNV9ZBbWtfZC4FwE7fWlCZAolwZDZD'
    print("ACCESS_TOKEN:", ACCESS_TOKEN)

    base_url = 'https://graph.facebook.com/me'
    fields = 'id,name'
    url = '%s?fields=%s&access_token=%s' % \
        (base_url, fields, ACCESS_TOKEN,)
    print(url)

    content = requests.get(url).json()
    print(json.dumps(content, indent=1))

    g = facebook.GraphAPI(ACCESS_TOKEN)
    friends = g.get_connections("me", "friends")['data']
    likes = {friend['name']: g.get_connections(friend['id'], "likes")['data']
             for friend in friends}
    print(likes)

    friends_likes = Counter([like['name']
                             for friend in likes
                             for like in likes[friend]
                             if like.get('name')])

    return HttpResponse(json.dumps(content, indent=1))
pyGTrends does not seem to work; it gives errors in Python.
pyGoogleTrendsCsvDownloader seems to work and logs in, but after 1-3 requests (per day!) it complains about an exhausted quota, even though a manual download with the same login/IP works flawlessly.
Bottom line: neither works. Searching through Stack Overflow turns up many questions from people trying to pull CSVs from Google, but no workable solution that I could find.
Thank you in advance to whoever is able to help. How should the code be changed? Do you know of another solution that works?
Here's the code of pyGoogleTrendsCsvDownloader.py
import httplib
import urllib
import urllib2
import re
import csv
import lxml.etree as etree
import lxml.html as html
import traceback
import gzip
import random
import time
import sys
from cookielib import Cookie, CookieJar
from StringIO import StringIO
class pyGoogleTrendsCsvDownloader(object):
    '''
    Google Trends Downloader
    Recommended usage:
    from pyGoogleTrendsCsvDownloader import pyGoogleTrendsCsvDownloader
    r = pyGoogleTrendsCsvDownloader(username, password)
    r.get_csv(cat='0-958', geo='US-ME-500')
    '''
    def __init__(self, username, password):
        '''
        Provide login and password to be used to connect to Google Trends
        All immutable system variables are also defined here
        '''
        # The amount of time (in secs) that the script should wait before making a request.
        # This can be used to throttle the downloading speed to avoid hitting servers too hard.
        # It is further randomized.
        self.download_delay = 0.25

        self.service = "trendspro"
        self.url_service = "http://www.google.com/trends/"
        self.url_download = self.url_service + "trendsReport?"

        self.login_params = {}
        # These headers are necessary, otherwise Google will flag the request at your account level
        self.headers = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'),
                        ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
                        ("Accept-Language", "en-gb,en;q=0.5"),
                        ("Accept-Encoding", "gzip, deflate"),
                        ("Connection", "keep-alive")]
        self.url_login = 'https://accounts.google.com/ServiceLogin?service='+self.service+'&passive=1209600&continue='+self.url_service+'&followup='+self.url_service
        self.url_authenticate = 'https://accounts.google.com/accounts/ServiceLoginAuth'
        self.header_dictionary = {}

        self._authenticate(username, password)

    def _authenticate(self, username, password):
        '''
        Authenticate to Google:
        1 - make a GET request to the Login webpage so we can get the login form
        2 - make a POST request with email, password and login form input values
        '''
        # Make sure we get CSV results in English
        ck = Cookie(version=0, name='I4SUserLocale', value='en_US', port=None, port_specified=False, domain='www.google.com', domain_specified=False, domain_initial_dot=False, path='/trends', path_specified=True, secure=False, expires=None, discard=False, comment=None, comment_url=None, rest=None)

        self.cj = CookieJar()
        self.cj.set_cookie(ck)
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
        self.opener.addheaders = self.headers

        # Get all of the login form input values
        find_inputs = etree.XPath("//form[@id='gaia_loginform']//input")
        try:
            resp = self.opener.open(self.url_login)

            if resp.info().get('Content-Encoding') == 'gzip':
                buf = StringIO(resp.read())
                f = gzip.GzipFile(fileobj=buf)
                data = f.read()
            else:
                data = resp.read()

            xmlTree = etree.fromstring(data, parser=html.HTMLParser(recover=True, remove_comments=True))

            for input in find_inputs(xmlTree):
                name = input.get('name')
                if name:
                    name = name.encode('utf8')
                    value = input.get('value', '').encode('utf8')
                    self.login_params[name] = value
        except:
            print("Exception while parsing: %s\n" % traceback.format_exc())

        self.login_params["Email"] = username
        self.login_params["Passwd"] = password

        params = urllib.urlencode(self.login_params)
        self.opener.open(self.url_authenticate, params)

    def get_csv(self, throttle=False, **kwargs):
        '''
        Download CSV reports
        '''
        # Randomized download delay
        if throttle:
            r = random.uniform(0.5 * self.download_delay, 1.5 * self.download_delay)
            time.sleep(r)

        params = {
            'export': 1
        }
        params.update(kwargs)
        params = urllib.urlencode(params)

        r = self.opener.open(self.url_download + params)

        # Make sure everything is working ;)
        if not r.info().has_key('Content-Disposition'):
            print "You've exceeded your quota. Continue tomorrow..."
            sys.exit(0)

        if r.info().get('Content-Encoding') == 'gzip':
            buf = StringIO(r.read())
            f = gzip.GzipFile(fileobj=buf)
            data = f.read()
        else:
            data = r.read()

        myFile = open('trends_%s.csv' % '_'.join(['%s-%s' % (key, value) for (key, value) in kwargs.items()]), 'w')
        myFile.write(data)
        myFile.close()
Although I don't know Python, I may have a solution. I am currently doing the same thing in C#, and though I didn't get the .csv file, I built a custom URL through code, then downloaded that HTML and saved it to a text file (also through code). In this HTML (at line 12) is all the information needed to create the graph that is used on Google Trends. However, it contains a lot of unnecessary text that needs to be cut away. Either way, you end up with the same result: the Google Trends data. I posted a more detailed answer to my question here:
Downloading .csv file from Google Trends
There is an alternative module named pytrends (https://pypi.org/project/pytrends/). It is really cool; I would recommend it.
Example usage:
import numpy as np
import pandas as pd
from pytrends.request import TrendReq
pytrend = TrendReq()
#It is the term that you want to search
pytrend.build_payload(kw_list=["Eminem is the Rap God"])
# Find which region has searched the term
df = pytrend.interest_by_region()
df.to_csv("path\Eminem_InterestbyRegion.csv")
Potentially, if you have a list of terms to search, you could use a for loop to automate the insights as per your wish, as in the sketch below.
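A minimal sketch of such a loop (the search terms and output file names are placeholders):

from pytrends.request import TrendReq

pytrend = TrendReq()

# Placeholder list of search terms; substitute your own.
terms = ["Eminem", "Rap God", "Slim Shady"]

for term in terms:
    # Build the payload for one term at a time and export its regional interest.
    pytrend.build_payload(kw_list=[term])
    df = pytrend.interest_by_region()
    df.to_csv("{}_InterestbyRegion.csv".format(term))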