I'm trying to set up image downloading from web pages using the Scrapy framework and DjangoItem. I think I have done everything as described in the docs, but after calling scrapy crawl I get a log that looks like this:
Scrapy log
I can't find any information there about what went wrong, but the images field is empty and the directory does not contain any images.
This is my model
class Event(models.Model):
    title = models.CharField(max_length=100, blank=False)
    description = models.TextField(blank=True, null=True)
    event_location = models.CharField(max_length=100, blank=True, null=True)
    image_urls = models.CharField(max_length=200, blank=True, null=True)
    images = models.CharField(max_length=100, blank=True, null=True)
    url = models.URLField(max_length=200)

    def __unicode__(self):
        return self.title
and this is how I go from the spider to the image pipeline:
def parse_from_details_page(self, response):
    # some code
    item_event = item_loader.load_item()
    # this is to create the image_urls list (there is always only one image URL)
    item_event['image_urls'] = [item_event['image_urls'], ]
    return item_event
and finally, this is my settings.py for the Scrapy project:
import sys
import os
import django

DJANGO_PROJECT_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'MyScrapy')
# sys.path.insert(0, DJANGO_PROJECT_PATH)
# sys.path.append(DJANGO_PROJECT_PATH)

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "MyScrapy.settings")
# os.environ["DJANGO_SETTINGS_MODULE"] = "MyScrapy.settings"

django.setup()

BOT_NAME = 'EventScraper'

SPIDER_MODULES = ['EventScraper.spiders']
NEWSPIDER_MODULE = 'EventScraper.spiders'

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 100,
    'EventScraper.pipelines.EventscraperPipeline': 200,
}

# MEDIA STORAGE URL
IMAGES_STORE = os.path.join(DJANGO_PROJECT_PATH, "IMAGES")

# IMAGES (used to be sure that it takes good fields)
FILES_URLS_FIELD = 'image_urls'
FILES_RESULT_FIELD = 'images'
Thank you in advance for your help
EDIT:
I used a custom image pipeline from the docs, which looks like this:
class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            import ipdb; ipdb.set_trace()
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        import ipdb; ipdb.set_trace()
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
In get_media_requests it creates a request to my URL, but in item_completed the results param contains something like this: [(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: >)]
I still don't know how to fix it.
Could the problem be caused by the image URL using https?
I faced the EXACT same issue with Scrapy.
My Solution:
Add headers to the request you're yielding in the get_media_requests function. I added a user agent and a host, along with some other headers. Here's my list of headers:
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Proxy-Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Host': 'images.finishline.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
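For illustration, a minimal sketch of how those headers might be attached to the image requests in a custom pipeline (this assumes the headers dict above is defined at module level next to the pipeline class, as in the question's MyImagesPipeline):

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            # send browser-like headers so the image host accepts the download request
            yield scrapy.Request(image_url, headers=headers)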
Open the exact image URL in your browser (the URL you're downloading the image from) and check your browser's network tab for the list of headers. Make sure the headers on the request I mentioned above match those.
Hope it works.
I'm scraping god names from the website of a game. The scraped text is stored in a PostgreSQL database through Django models.
When I run my program twice, everything gets inserted twice.
How do I avoid this?
import requests
import urllib3
from bs4 import BeautifulSoup
import psycopg2
import os
import django

os.environ['DJANGO_SETTINGS_MODULE'] = 'locallibrary.settings'
django.setup()
from scraper.models import GodList

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

session = requests.Session()
session.headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"}

url = 'https://www.smitegame.com/'
content = session.get(url, verify=False).content
soup = BeautifulSoup(content, "html.parser")

allgods = soup.find_all('div', {'class': 'god'})
allitem = []
for god in allgods:
    godName = god.find('p')
    godFoto = god.find('img').get('src')
    allitem.append((godName, godFoto))
    GodList.objects.create(godName=godName.text)
Below is my models file:
class GodList(models.Model):
    godName = models.CharField(max_length=50, unique=True)
    godFoto = models.CharField(max_length=100, unique=True)

    def __str__(self):
        return self.godName
Just use the get_or_create() method on the model manager instead of create() to avoid adding a duplicate.
god, created = GodList.objects.get_or_create(godName=godName.text)
god will be the model instance that was fetched or created, and created will be True if the object had to be created, else False.
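Applied to the scraping loop from the question, that might look like this (storing godFoto through defaults is my assumption about how you want to save the image URL):

for god in allgods:
    godName = god.find('p')
    godFoto = god.find('img').get('src')
    allitem.append((godName, godFoto))
    # look the row up by godName and only insert it if it is missing,
    # so re-running the scraper no longer creates duplicates
    god_obj, created = GodList.objects.get_or_create(
        godName=godName.text,
        defaults={'godFoto': godFoto},
    )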
The title is a bit confusing, but basically I have an S3 path stored as a string:
class S3Stuff(Model):
    s3_path = CharField(max_length=255, blank=True, null=True)
    # rest is not important
There are existing methods to download the content given the URL, so I want to use them:
def download_from_s3(bucket, file_name):
    s3_client = boto3.client(bleh_bleh)
    s3_response = s3_client.get_object(Bucket=s3_bucket, Key=file_name)
    return {'response': 200, 'body': s3_response['Body'].read()}
s3_path can be broken into bucket and file_name. This works very easily when I use my own frontend, because I can do whatever I want with it, but I don't know how to apply this to the admin:
class S3StuffAdmin(admin.StackedInline):
    model = S3Stuff
    fields = ('s3_path',)
Now how do I call that method and make the display a link that says "download"?
I don't think this function will be of much use for generating download links; instead, use boto3's presigned URLs like this:
import boto3
from django.utils.html import format_html

class S3StuffAdmin(admin.StackedInline):
    model = S3Stuff
    fields = ('s3_path',)
    readonly_fields = ('download',)

    def download(self, obj):
        s3_client = boto3.client(bleh_bleh)
        url = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': 'bucket', 'Key': obj.s3_path},
            ExpiresIn=100,
        )
        return format_html('<a href="{}">download</a>', url)
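One detail to watch: since fields is declared explicitly, 'download' also needs to be listed there (or fields left out) for the column to actually render. And because a StackedInline can't be registered on its own, it has to be attached to a parent admin; a hypothetical registration (ParentModel is a placeholder for whatever model S3Stuff relates to) would be:

from django.contrib import admin

class ParentModelAdmin(admin.ModelAdmin):
    inlines = [S3StuffAdmin]

admin.site.register(ParentModel, ParentModelAdmin)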
I'm trying to log into Quora with Scrapy, but I did not succeed: I get a 400 or 500 status code, which I suspect is related to my form data.
I found the form data with Chrome:
General
Request URL:https://www.quora.com/webnode2/server_call_POST?__instart__
Request Method:POST
Status Code:200
Remote Address:103.243.14.60:443
Form Data
json:{"args":[],"kwargs":{"email":"1liusai253#163.com","password":"XXXX","passwordless":1}}
formkey:750febacf08976a47c82f3e10af83305
postkey:dab46d0df2014d1568ead6b2fbad7297
window_id:dep3300-2420196009402604566
referring_controller:index
referring_action:index
_lm_transaction_id:0.2598935768985011
_lm_window_id:dep3300-2420196009402604566
__vcon_json:["Vn03YsuKFZvHV9"]
__vcon_method:do_login
__e2e_action_id:ee1qmp1iit
js_init:{}
Below is my code, a normal Scrapy flow. I think the problem lies in the formdata. Can someone help with this?
import scrapy
import re

class QuestionsSpider(scrapy.Spider):
    name = 'questions'
    domain = 'https://www.quora.com'
    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Language": "zh-Hans-CN,zh-Hans;q=0.8,en-US;q=0.5,en;q=0.3",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/49.0.2623.108 Chrome/49.0.2623.108 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Host": "www.quora.com",
        "Connection": "Keep-Alive",
        "content-type": "application/x-www-form-urlencoded"
    }

    def __init__(self, login_url=None):
        self.login_url = 'https://www.quora.com/webnode2/server_call_POST?__instart__'  # Here is the login URL of Quora

    def start_requests(self):
        body = response.body
        formkey_patt = re.compile(r'.*?"formkey".*?"(.*?)".*?', re.S)
        formkey = re.findall(formkey_patt, body)[0]
        postkey_patt = re.compile('.*?"postkey".*?"(.*?)".*?', re.S)
        postkey = re.findall(postkey_patt, body)[0]
        window_id_patt = re.compile('.*?window_id.*?"(.*?)".*?', re.S)
        window_id = re.findall(window_id_patt, body)[0]
        referring_controller = 'index'
        referring_action = 'index'
        __vcon_method = 'do_login'
        yield scrapy.Request(
            url=self.domain,
            headers=self.headers,
            meta={'cookiejar': 1},
            callback=self.start_login
        )

    def start_login(self, response):
        yield scrapy.FormRequest.from_response(
            response,
            url=self.login_url,
            meta={'cookiejar': response.meta['cookiejar']},
            headers=self.headers,
            formdata={
                "json": {"args": [], "kwargs": {"email": "xxxx", "password": "xxx"}},
                "formkey": formkey,
                "postkey": postkey,
                "window_id": window_id,
                "referring_controller": referring_controller,
                "referring_action": referring_action,
                "__vcon_method": __vcon_method,
                "__e2e_action_id": "ee1qmp1iit"
            },
            callback=self.after_login
        )

    def after_login(self, response):
        print response.body
You are not actually setting or sending formkey, postkey, window_id, etc.: start_requests runs before any response exists, so those regexes never see a page body. You should grab the keys from the response to an initial request, and then you need to use FormRequest.from_response().
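A rough sketch of that flow, carrying over the regexes and field names from the question (whether Quora still accepts this endpoint, and whether the keys are really present in the landing page, I can't verify):

import json
import re
import scrapy

class QuestionsSpider(scrapy.Spider):
    name = 'questions'
    start_urls = ['https://www.quora.com']
    login_url = 'https://www.quora.com/webnode2/server_call_POST?__instart__'

    def parse(self, response):
        # the keys have to be parsed from an actual response, not in start_requests
        body = response.text
        formkey = re.findall(r'"formkey".*?"(.*?)"', body, re.S)[0]
        postkey = re.findall(r'"postkey".*?"(.*?)"', body, re.S)[0]
        window_id = re.findall(r'window_id.*?"(.*?)"', body, re.S)[0]
        yield scrapy.FormRequest.from_response(
            response,
            url=self.login_url,
            # every formdata value must be a string, so the nested dict is JSON-encoded
            formdata={
                'json': json.dumps({'args': [], 'kwargs': {'email': 'xxxx', 'password': 'xxx', 'passwordless': 1}}),
                'formkey': formkey,
                'postkey': postkey,
                'window_id': window_id,
                'referring_controller': 'index',
                'referring_action': 'index',
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info(response.status)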
I am trying to build a basic LinkedIn scraper for a research project and am running into challenges when I try to scrape through levels of the directory. I am a beginner; I keep running the code below and IDLE returns an error before shutting down. See the code and error below:
Code:
import requests
from bs4 import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint as pp

PROFILE_URL = "linkedin.com"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

# use this to gather all of the individual links from the second directory page
def get_second_links(pre_section_link):
    response = requests.get(pre_section_link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column = soup.find("ul", attrs={'class': 'column dual-column'})
    second_links = [li.a["href"] for li in column.findAll("li")]
    return second_links

# use this to gather all of the individual links from the third directory page
def get_third_links(section_link):
    response = requests.get(section_link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column = soup.find("ul", attrs={'class': 'column dual-column'})
    third_links = [li.a["href"] for li in column.findAll("li")]
    return third_links

# use this to build the individual profile links
def get_profile_link(link):
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column2 = soup.find("ul", attrs={'class': 'column dual-column'})
    profile_links = [PROFILE_URL + li.a["href"] for li in column2.findAll("li")]
    return profile_links

if __name__ == "__main__":
    sub_directory = get_second_links("https://www.linkedin.com/directory/people-a-1/")
    sub_directory = map(get_third_links, sub_directory)
    profiles = get_third_links(sub_directory)
    profiles = map(get_profile_link, profiles)
    profiles = [item for sublist in fourth_links for item in sublist]
    pp(profiles)
Error I keep getting:
Error Page
You need to add https to PROFILE_URL:
PROFILE_URL = "https://linkedin.com"
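A slightly more robust variant (my suggestion, not part of the original fix) is to join the pieces with urlparse.urljoin, so missing or doubled slashes in the scraped hrefs don't matter:

from urlparse import urljoin  # Python 2, matching the urllib2 import above

PROFILE_URL = "https://www.linkedin.com"

def get_profile_link(link):
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column2 = soup.find("ul", attrs={'class': 'column dual-column'})
    # urljoin resolves both absolute and relative hrefs against PROFILE_URL
    profile_links = [urljoin(PROFILE_URL, li.a["href"]) for li in column2.findAll("li")]
    return profile_links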
I'm using a simple REST client to test. I'm sending a simple JPEG and have tried the following Content-Type(s):
Content-Type: image/jpeg
Content-Type: multipart/form-data
Also note that CSRF token authentication is turned off to allow outside third-party REST connections.
(The image is attached via the REST client.)
I checked Wireshark and the packet is set up according to the above parameters.
The Django request object has several attributes:
request.body
request.FILES
After the POST is received by the Django server, the request object always stores all of the data/payload in request.body. Shouldn't an image or any attached files go into request.FILES? Is something set up incorrectly with the content type or the POST?
Very simple code; I'm just trying to print to the log. Everything in the POST keeps going to request.body:
def testPost(request):
    print request.body
    print request.FILES
    return HttpResponse()
Wireshark packet:
Hypertext Transfer Protocol
POST /testPost/ HTTP/1.1\r\n
Host: MYURL.com:8000\r\n
Connection: keep-alive\r\n
Content-Length: 8318\r\n
Origin: chrome-extension://aejoelaoggembcahagimdiliamlcdmfm\r\n
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36\r\n
Content-Type: image/jpeg\r\n
Accept: */*\r\n
Accept-Encoding: gzip,deflate,sdch\r\n
Accept-Language: en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4\r\n
Cookie: ******; csrftoken=**********\r\n
\r\n
[Full request URI: http://MYURL.com:8000/testPost/]
[HTTP request 1/1]
JPEG File Interchange Format
Here is how I handle file uploads, which in this case happen to be images. One of the issues I fought with for a while was that request.FILES could come in with multiple keys, and I always wanted the last one.
Note: request.FILES will only contain data if:
the request method is POST, and
the request was submitted as multipart/form-data (i.e. the form has enctype="multipart/form-data").
See the Django file-uploads documentation for more details.
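As a side note (my addition, not part of the original answer): if the client really does send the raw JPEG with Content-Type: image/jpeg, the bytes never reach request.FILES, but they can still be read straight from request.body, for example:

def testPost(request):
    # with a non-multipart upload the raw bytes live in request.body
    image_bytes = request.body
    with open('/tmp/upload.jpg', 'wb') as f:
        f.write(image_bytes)
    return HttpResponse()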
The Model: First there is a model with an ImageField in it (models.py):
photos_dir = settings.MEDIA_ROOT + "/photos" + "/%Y/%m/%d/"

class Photo(models.Model):
    image = models.ImageField(upload_to=photos_dir, null=True, blank=True, default=None)
    filename = models.CharField(max_length=60, blank=True, null=True)
The View: in views.py, handle the POST:
from django.core.files.images import ImageFile

def upload_image(request):
    file_key = None
    for file_key in sorted(request.FILES):
        pass
    wrapped_file = ImageFile(request.FILES[file_key])
    filename = wrapped_file.name

    # new photo table-row
    photo = Photo()
    photo.filename = filename
    photo.image = request.FILES[file_key]
    try:
        photo.save()
    except OSError:
        print "Deal with this situation"
        # do your stuff here.
    return HttpResponse("boo", "text/html")
The Standalone Poster: Some Python code to simulate a POST to your Django view.
Reference: I actually used this lib, poster.encode, to 'simulate data' against my Django view.
from poster.streaminghttp import register_openers
from poster.encode import multipart_encode
import urllib2

server = "http://localhost/"
headers = {}

# Register the streaming http handlers with urllib2
register_openers()

img = "path/to/image/image.png"
data = {'media': open(img),
        'additionalattr': 111,
        }

datagen, headers = multipart_encode(data)
headers['Connection'] = 'keep-alive'

request = urllib2.Request('%s/upload_image/' % (server), datagen, headers)
print urllib2.urlopen(request).read()
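As an alternative to the (Python 2 only) poster library, roughly the same multipart POST can be done with the requests package. This is a sketch, not what the answer above used; the /upload_image/ path and the 'media' field name simply mirror that example:

import requests

server = "http://localhost"
img = "path/to/image/image.png"

with open(img, 'rb') as f:
    # files= makes requests build a multipart/form-data body,
    # which is what Django needs in order to populate request.FILES
    response = requests.post(
        '%s/upload_image/' % server,
        files={'media': f},
        data={'additionalattr': 111},
    )
print(response.text)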