I am making an API that returns a JsonResponse containing the text scraped by Scrapy. When I run the script on its own it works perfectly, but when I try to integrate the Scrapy script with Django I don't get any output.
What I want is to simply return the response to the request (which in my case is a POSTMAN POST request).
Here is the code I am trying:
from django.http import HttpResponse, JsonResponse
from django.views.decorators.csrf import csrf_exempt
import scrapy
from scrapy.crawler import CrawlerProcess


@csrf_exempt
def some_view(request, username):
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'LOG_ENABLED': 'false'
    })
    process_test = process.crawl(QuotesSpider)
    process.start()
    return JsonResponse({'return': process_test})


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/random',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        return response.css('.text::text').extract_first()
I am very new to Python and Django. Any kind of help would be much appreciated.
In your code, process_test is the Deferred returned by process.crawl(), not the output from the crawling.
You need additional configuration to make your spider store its output "somewhere". See this SO Q&A about writing a custom pipeline.
If you just want to synchronously retrieve and parse a single page, you may be better off using requests to retrieve the page, and parsel to parse it.
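For example, a minimal sketch of that synchronous alternative for the view above (same URL and selector as in the question; error handling omitted):

import requests
from parsel import Selector
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt


@csrf_exempt
def some_view(request, username):
    resp = requests.get(
        'http://quotes.toscrape.com/random',
        headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'},
    )
    selector = Selector(text=resp.text)
    quote = selector.css('.text::text').get()  # same selector the spider used
    return JsonResponse({'return': quote})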
I am planning to build a backend using Django + Scrapy. My goal is this:
The frontend (React) sends a GET request via axios to a Django views endpoint.
This triggers Scrapy to start crawling (spiders).
The scraping result is sent back to the Django view.
The frontend receives the JSON result (the scraped data, not a job id or a log file).
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from django.http import HttpResponse
from scrapyApp.items import ScrapyappItem


class MySpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://www.google.com',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        item = ScrapyappItem()
        item['title'] = response.css('title::text').get()
        yield item


def show1(request):
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
    return HttpResponse({"result": d})
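One way to get the actual scraped data back into the HTTP response, sketched here as an illustration (the view name is hypothetical, and it assumes a single blocking crawl per worker process is acceptable, since the Twisted reactor cannot be restarted; for anything production-like use a task queue or scrapyd): collect items through Scrapy's item_scraped signal and return them as JSON once the crawl finishes.

from django.http import JsonResponse
from scrapy import signals
from scrapy.crawler import CrawlerProcess


def show_quotes(request):  # hypothetical view name
    items = []

    def collect_item(item, response, spider):
        items.append(dict(item))

    process = CrawlerProcess({'LOG_ENABLED': False})
    crawler = process.create_crawler(MySpider)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes

    return JsonResponse({'result': items})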
I want to use a Scrapy spider in Django views, and I tried using CrawlerRunner and CrawlerProcess, but there are problems: views are synchronous, and moreover the crawler does not return a response directly.
I tried a few ways:
# Core imports.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Third-party imports.
from rest_framework.views import APIView
from rest_framework.response import Response

# Local imports.
from scrapy_project.spiders.google import GoogleSpider


class ForFunAPIView(APIView):
    def get(self, *args, **kwargs):
        process = CrawlerProcess(get_project_settings())
        process.crawl(GoogleSpider)
        process.start()
        return Response('ok')
Is there any solution to handle that and run the spider directly from other scripts or projects without using the DjangoItem pipeline?
You didn't really specify what the problems are; however, I guess the problem is that you need to return the Response immediately and leave the heavy call, i.e. the crawl, to run in the background. You can alter your code as follows to use the threading module:
from threading import Thread


class ForFunAPIView(APIView):
    def get(self, *args, **kwargs):
        process = CrawlerProcess(get_project_settings())
        process.crawl(GoogleSpider)
        thread = Thread(target=process.start)
        thread.start()
        return Response('ok')
After a while of searching on this topic, I found a good explanation here: Building a RESTful Flask API for Scrapy.
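That article's approach can be adapted to Django as well. A rough sketch, assuming the crochet package (pip install crochet) and the same GoogleSpider project layout as above: crochet runs the Twisted reactor in a background thread, so a synchronous view can wait on the crawl's Deferred without ever restarting the reactor.

import crochet
crochet.setup()  # start the reactor in a background thread, once per process

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from rest_framework.views import APIView
from rest_framework.response import Response

from scrapy_project.spiders.google import GoogleSpider


@crochet.wait_for(timeout=60.0)
def run_google_spider():
    runner = CrawlerRunner(get_project_settings())
    return runner.crawl(GoogleSpider)  # returns a Deferred; crochet waits on it


class ForFunAPIView(APIView):
    def get(self, *args, **kwargs):
        run_google_spider()  # blocks until the crawl finishes or times out
        return Response('ok')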
I have created a web scraper in Python and now I want to insert this file into my views.py and execute it using an HTML button on the HTML page.
My scraper is named maharera.py and it is saved in the same folder as views.py.
My views.py looks like this:
from django.shortcuts import render
from django.conf.urls import url
from django.conf.urls import include
from django.http import HttpResponse

# Create your views here.
def index(request):
    first = {"here": "will enter more details"}
    return render(request, "files/first-page.html", context=first)
    # return HttpResponse("<em>Rera details will be patched here</em>")
After inserting it into views.py, I want to execute that file from the HTML button I created. How can I do that?
Actual answer to the question
Let's say the contents of maharera.py are as follows:
def scraper(*args, **kwargs):
    # the scraper code goes here
Then you'll need to import it as follows in the views.py file:
from django.shortcuts import render
from django.conf.urls import url
from django.conf.urls import include
from django.http import HttpResponse

# Create your views here.
import maharera


def index(request):
    first = {"here": "will enter more details"}
    return render(request, "files/first-page.html", context=first)
    # return HttpResponse("<em>Rera details will be patched here</em>")


def scraper_view(request):
    maharera.scraper()
    return HttpResponse("<em>Scraper started</em>")
It is advisable not to run a web scraper through HTTP requests like this. HTTP requests are expected to return a response within a fraction of a second and should not take long.
When you hit scraper_view, it starts executing the code inside it. The view calls the scraper, and we don't know how long that function will take to finish. Until it finishes, no response will be returned to the user.
For such long-running tasks, you should look into task queues.
Look into Celery.
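For illustration, a minimal Celery sketch (it assumes a configured Celery app and broker, which are not part of the question): the view only enqueues the job and returns immediately, while a worker process runs the scraper.

# tasks.py
from celery import shared_task
import maharera


@shared_task
def run_scraper():
    maharera.scraper()


# views.py
from django.http import HttpResponse
from .tasks import run_scraper


def scraper_view(request):
    run_scraper.delay()  # enqueue the scrape; returns immediately
    return HttpResponse("<em>Scraper started</em>")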
I have a CrawlSpider; the code is below. I use Tor through tsocks.
When I start my spider, everything works fine. Using init_request I can log in on the site and crawl with sticky cookies.
But the problem occurs when I stop and resume the spider: the cookies are no longer sticky.
Here is the output from Scrapy:
=======================INIT_REQUEST================
2013-01-30 03:03:58+0300 [my] INFO: Spider opened
2013-01-30 03:03:58+0300 [my] INFO: Resuming crawl (675 requests scheduled)
............ And here crawling began
So... callback=self.login_url in def init_request is not fired!!!
I thought that the Scrapy engine didn't want to send the request to the login page again. Before resuming Scrapy I changed login_page (I can log in from every page on the site) to a different one that is not included in restrict_xpaths.
The result is: after resuming, I cannot log in and the previous cookies are lost.
Does anyone have any suggestions?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join, Identity
from beles_com_ua.items import Product
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from scrapy.utils.markup import remove_entities
from django.utils.html import strip_tags
from datetime import datetime
from scrapy import log
import re
from scrapy.http import Request, FormRequest


class ProductLoader(XPathItemLoader):
    .... some code is here ...


class MySpider(CrawlSpider):
    name = 'my'
    login_page = 'http://test.com/index.php?section=6&type=12'
    allowed_domains = ['test.com']
    start_urls = [
        'http://test.com/index.php?section=142',
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=('.',), restrict_xpaths=('...my xpath...')),
             callback='parse_item', follow=True),
    )

    def start_requests(self):
        return self.init_request()

    def init_request(self):
        print '=======================INIT_REQUEST================'
        return [Request(self.login_page, callback=self.login_url)]

    def login_url(self, response):
        print '=======================LOGIN======================='
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'login': 'mylogin', 'pswd': 'mypass'},
            callback=self.after_login)

    def after_login(self, response):
        print '=======================AFTER_LOGIN ...======================='
        if "images/info_enter.png" in response.body:
            print "==============Bad times :(==============="
        else:
            print "=========Successfully logged in.========="
            for url in self.start_urls:
                yield self.make_requests_from_url(url)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        entry = hxs.select("//div[@class='price']/text()").extract()
        l = ProductLoader(Product(), hxs)
        if entry:
            name = hxs.select("//div[@class='header_box']/text()").extract()[0]
            l.add_value('name', name)
        ... some code is here ...
        return l.load_item()
init_request(self) is available only when you subclass InitSpider, not CrawlSpider.
You need to subclass your spider from InitSpider like this:
class WorkingSpider(InitSpider):
    login_page = 'http://www.example.org/login.php'

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)
But remember that you can't define Rules in InitSpider, as they are only available in CrawlSpider; you need to extract the links manually, as in the sketch below.
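A rough sketch of that manual link following (the selectors and URLs are placeholders, and it uses a newer Scrapy API than the question's scrapy.contrib imports):

from scrapy.spiders.init import InitSpider
from scrapy.http import Request, FormRequest


class WorkingSpider(InitSpider):
    name = 'working'
    login_page = 'http://www.example.org/login.php'
    start_urls = ['http://www.example.org/index.php?section=142']

    def init_request(self):
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return FormRequest.from_response(
            response,
            formdata={'login': 'mylogin', 'pswd': 'mypass'},
            callback=self.after_login)

    def after_login(self, response):
        # Tell InitSpider that initialization is done; it then issues start_urls.
        return self.initialized()

    def parse(self, response):
        # No Rules available here: extract and follow links by hand.
        for href in response.xpath("//div[@class='listing']//a/@href").extract():
            yield Request(response.urljoin(href), callback=self.parse_item)

    def parse_item(self, response):
        pass  # item extraction as in the original parse_item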
I was just wondering if there's any reasonable way to pass authentication cookies from a webdriver.Firefox() instance to the spider itself. It would be helpful to perform some webdriver stuff and then go about scraping "business as usual". Something to the effect of:
def __init__(self):
    BaseSpider.__init__(self)
    self.selenium = webdriver.Firefox()

def __del__(self):
    self.selenium.quit()
    print self.verificationErrors

def parse(self, response):
    # Initialize the webdriver, get login page
    sel = self.selenium
    sel.get(response.url)
    sleep(3)
    ##### Transfer (sel) cookies to (self) and crawl normally??? #####
    ...
    ...
Transfer Cookies from Selenium to Scrapy Spider
Scraping file (Selenium)
import json
from selenium import webdriver

driver = webdriver.Firefox()
data = driver.get_cookies()

# write to temp file
with open('cookie.json', 'w') as outputfile:
    json.dump(data, outputfile)
driver.close()
outputfile.close()
....
Spider
import os
import json

if os.stat("cookie.json").st_size > 2:
    with open('./cookie.json', 'r') as inputfile:
        self.cookie = json.load(inputfile)
    inputfile.close()
You can try to override the BaseSpider.start_requests method to attach the needed cookies to the starting requests, using scrapy.http.cookies.CookieJar.
See also: Scrapy - how to manage cookies/sessions
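A rough sketch of that idea (names and URLs are illustrative, and it passes the cookie dicts straight to Request's cookies argument instead of touching CookieJar directly):

import scrapy
from selenium import webdriver


class SeleniumCookiesSpider(scrapy.Spider):
    name = 'selenium_cookies'
    start_urls = ['http://www.example.com/members']

    def start_requests(self):
        driver = webdriver.Firefox()
        driver.get('http://www.example.com/login')
        # ... perform the login through the browser here ...
        cookies = driver.get_cookies()  # list of dicts with name/value/domain/path
        driver.quit()
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        self.logger.info('Title after login: %s', response.css('title::text').get())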
This works with ChromeDriver but not Firefox (tested and working).
Refer to https://christopher.su/2015/selenium-chromedriver-ubuntu/ for installation.
import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pickle
import os


class HybridSpider(InitSpider):
    name = 'hybrid'

    def init_request(self):
        driver = webdriver.Chrome()
        driver.get('https://example.com')
        driver.find_element_by_id('js-login').click()
        driver.find_element_by_id('email').send_keys('mymail@example.net')
        driver.find_element_by_id('password').send_keys('mypasssword', Keys.ENTER)
        pickle.dump(driver.get_cookies(), open(os.getenv("HOME") + "/my_cookies", "wb"))
        cookies = pickle.load(open(os.getenv("HOME") + "/my_cookies", "rb"))
        FH = open(os.getenv("HOME") + "/my_urls", 'r')
        for url in FH.readlines():
            yield Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        pass
I haven't tried directly passing the cookies, like
    yield Request(url, cookies=driver.get_cookies(), callback=self.parse)
but it might work too.
driver = webdriver.Chrome()
Then perform the login or interact with the page through the browser. Now, when making the request in Scrapy, set the cookies parameter:
request = Request(URL, cookies=driver.get_cookies(), callback=self.mycallback)