I want to run a Scrapy spider from Django views. I tried using CrawlerRunner and CrawlerProcess, but there are problems: views are synchronous, and on top of that the crawler does not return a response directly.
I tried a few ways:
# Core imports.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Third-party imports.
from rest_framework.views import APIView
from rest_framework.response import Response

# Local imports.
from scrapy_project.spiders.google import GoogleSpider


class ForFunAPIView(APIView):
    def get(self, *args, **kwargs):
        process = CrawlerProcess(get_project_settings())
        process.crawl(GoogleSpider)
        process.start()
        return Response('ok')
Is there any solution to handle this and run the spider directly from other scripts or projects, without using the DjangoItem pipeline?
You didn't really specify what the problems are; however, I guess the problem is that you need to return the Response immediately and leave the heavy call running in the background. You can alter your code as follows to use the threading module:
from threading import Thread


class ForFunAPIView(APIView):
    def get(self, *args, **kwargs):
        process = CrawlerProcess(get_project_settings())
        process.crawl(GoogleSpider)
        # Run the blocking reactor start-up in a background thread so
        # the view can return immediately.
        thread = Thread(target=process.start)
        thread.start()
        return Response('ok')
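One caveat with this approach: the Twisted reactor can only be started once per process, so a second request to this view will raise ReactorNotRestartable. If you need repeated crawls, a rough alternative is to launch each crawl in a separate OS process. A minimal sketch (the spider name 'google' and the project directory are my assumptions, not from the original post):

import subprocess

class ForFunAPIView(APIView):
    def get(self, *args, **kwargs):
        # Each crawl gets its own process, so the once-per-process
        # reactor restriction inside Django no longer applies.
        subprocess.Popen(['scrapy', 'crawl', 'google'],
                         cwd='/path/to/scrapy_project')
        return Response('ok')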
After a while of searching on this topic, I found a good explanation here: Building a RESTful Flask API for Scrapy.
I am attempting to create a Flask middleware in order to make py2neo transactions atomic. First I got a "working outside of application context" error, and I tried to apply a solution I found, as seen in the following code:
from flask import g
from py2neo import Graph


def get_db():
    return Graph(password="secret")


class TransactionMiddleware(object):
    def __init__(self, app):
        self.app = app
        with app.app_context():  # Error raises here.
            g.graph = get_db()
            g.transaction = g.graph.begin()

    def __call__(self, environ, start_response):
        try:
            app_status = self.app(environ, start_response)
            # some code
            return app_status
        except BaseException:
            g.transaction.rollback()
            raise
        else:
            g.transaction.commit()
But I got this error: AttributeError: 'function' object has no attribute 'app_context'.
I don't know whether the solution I'm trying is unsuitable for my case, or what the problem is.
You are in a WSGI middleware, so what gets passed in as app is actually a method called Flask.wsgi_app, the result of which you are later supposed to return from your __call__ handler.
Perhaps you can simply import your actual Flask app and use app_context on that, as in:
from app import app  # or whatever module it's in

# and then

class TransactionMiddleware(object):
    ...

    def __call__(self, environ, start_response):
        with app.app_context():
            ...
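For example, a fuller sketch under that assumption (it presumes the Flask instance lives in app.py as app, and that committing after the response body has started streaming is acceptable for your use case):

from flask import g
from py2neo import Graph

from app import app  # the actual Flask instance


class TransactionMiddleware(object):
    def __init__(self, wsgi_app):
        # Note: this is Flask.wsgi_app, not the Flask object itself.
        self.wsgi_app = wsgi_app

    def __call__(self, environ, start_response):
        with app.app_context():
            g.graph = Graph(password="secret")
            g.transaction = g.graph.begin()
            try:
                result = self.wsgi_app(environ, start_response)
            except BaseException:
                g.transaction.rollback()
                raise
            else:
                g.transaction.commit()
                return result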
I would also consider just instantiating your Graph somewhere at module level, perhaps next to your Flask app instantiation, and not attaching it to g. This way you can use it without an application context:
# app.py
app = Flask(__name__)
db = Graph(password="secret")
You can use it by directly importing:
# middleware.py
from app import db
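And then, for example, run queries on it directly from a view or helper (this snippet is purely illustrative, not from the original post):

# views.py
from app import db

def list_names():
    # Run a Cypher query against the module-level Graph instance.
    return db.run("MATCH (n) RETURN n.name LIMIT 5").data()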
I am making an API which returns a JsonResponse containing text scraped with Scrapy. When I run the scripts individually they run perfectly, but when I try to integrate the Scrapy script with Django I don't get the output.
What I want is only to return the response to the request (which in my case is a POSTMAN POST request).
Here is the code I am trying:
from django.http import HttpResponse, JsonResponse
from django.views.decorators.csrf import csrf_exempt
import scrapy
from scrapy.crawler import CrawlerProcess


@csrf_exempt
def some_view(request, username):
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'LOG_ENABLED': 'false'
    })
    process_test = process.crawl(QuotesSpider)
    process.start()
    return JsonResponse({'return': process_test})


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/random',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        return response.css('.text::text').extract_first()
I am very new to Python and Django. Any kind of help would be much appreciated.
In your code, process_test is the return value of process.crawl(), which is a Twisted Deferred, not the output from the crawling.
You need additional configuration to make your spider store its output "somewhere". See this SO Q&A about writing a custom pipeline.
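For illustration, a minimal in-memory pipeline could look like the following (the class name and its items attribute are assumptions for the sketch, and it still has to be enabled in ITEM_PIPELINES):

class CollectorPipeline:
    """Collects scraped items on a class attribute so the calling
    code can read them after process.start() returns."""
    items = []

    def open_spider(self, spider):
        CollectorPipeline.items = []

    def process_item(self, item, spider):
        CollectorPipeline.items.append(item)
        return item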
If you just want to synchronously retrieve and parse a single page, you may be better off using requests to retrieve the page, and parsel to parse it.
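A rough sketch of that simpler route, keeping the same URL and CSS selector as the spider (the view wiring is illustrative):

import requests
from parsel import Selector
from django.http import JsonResponse


def some_view(request, username):
    resp = requests.get('http://quotes.toscrape.com/random')
    quote = Selector(text=resp.text).css('.text::text').get()
    return JsonResponse({'return': quote})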
I have the following three files.
app.py
from flask import Flask
from flask_restful import Api
from flask_caching import Cache  # Flask-Caching replaces the removed flask.ext.cache import path

from lib import globals
from Test import Test

globals.algos_app = Flask(__name__)

# Cache in the file system.
globals.cache = Cache(globals.algos_app,
                      config={'CACHE_TYPE': 'filesystem', 'CACHE_DIR': '/tmp'})

api = Api(globals.algos_app)
api.add_resource(Test, '/test')

if __name__ == '__main__':
    globals.algos_app.run(host='0.0.0.0', debug=True)
globals.py
# Module-level "global" statements are effectively no-ops; these names
# are actually bound from app.py (globals.algos_app = ...).
global algos_app
global cache
Test.py
import time

from flask_restful import Resource

from lib import globals


class Test(Resource):
    def get(self):
        return self.someMethod()

    def post(self):
        globals.cache.clear()
        return self.someMethod()

    @globals.cache.cached()
    def someMethod(self):
        return str(time.ctime())
I have a GET method which needs to read the value from the cache, and a POST method which updates the cache by first clearing it.
However, no matter whether I call the GET or the POST method, it always gives me the value from the cache.
PS: At the moment I am simply testing on the development server, but I will need to deploy it using WSGI later.
I am not sure if it is the best way, but I solved it by using the cache's explicit get/set API instead of the @cached() decorator:
class Test(Resource):
    def get(self):
        return globals.cache.get('curr_time')

    def post(self):
        result = self.someMethod()
        globals.cache.set('curr_time', result, timeout=3600)

    def someMethod(self):
        return str(time.ctime())
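For completeness, you would probably also want post() to return the freshly stored value (i.e. end it with return result), so the client gets the updated time back immediately; as written, the POST returns nothing useful.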
I have the following management command (website.py)
from __future__ import absolute_import

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    def run_from_argv(self, argv):
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])
I'd like to execute this command via URL: /crawl/update-now/
The view is:
from django.core import management


def update_index(request):
    management.call_command('website', 'crawl spider')
But it's not working:
'Command' object has no attribute '_argv'
I think the problem is that run_from_argv is an internal Django method, called by django.core.management.ManagementUtility, and you should not implement it yourself. When the command is invoked via management.call_command(), run_from_argv is never called, so self._argv is never set. The arguments are already available in handle().
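A sketch of the command without the run_from_argv override (the argument wiring via add_arguments assumes Django 1.8+; also note that scrapy.cmdline.execute may call sys.exit when it finishes, which is hostile inside a web process):

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    def add_arguments(self, parser):
        # Everything after "website" is forwarded to Scrapy.
        parser.add_argument('scrapy_args', nargs='*')

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        # execute() expects a full argv, program name included.
        execute(['scrapy'] + options['scrapy_args'])

You would then call it as management.call_command('website', 'crawl', 'spider'); note that these are two separate arguments, not the single string 'crawl spider' used in the view above.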
Your approach also has some disadvantages.
First of all, due to the synchronous nature of Django, if the URL you crawl is "heavy" it can take a lot of time to fetch and parse, and the HTTP request will block for that whole time. Instead, I strongly recommend you take a look at Celery. It is a more "right" way to execute tasks from views and has no such performance problems.
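A rough sketch of the Celery route (the task and view names are illustrative, and it assumes an already configured Celery app):

from celery import shared_task
from django.core import management
from django.http import HttpResponse


@shared_task
def run_crawl():
    # Runs in a Celery worker, outside the request/response cycle.
    management.call_command('website', 'crawl', 'spider')


def update_index(request):
    run_crawl.delay()  # schedule the task and return immediately
    return HttpResponse('crawl scheduled')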
I'm running Django (1.4.3) with dajaxice (0.5.4). I have a file ajax.py with my functions in the main project folder called prj, which looks like:
from dajax.core import Dajax
from dajaxice.decorators import dajaxice_register
from django.utils import simplejson
from dajaxice.core import dajaxice_functions
from django.core.urlresolvers import reverse, resolve


def getContent(request, *args, **kwargs):
    url = kwargs['url']
    try:
        v = resolve(url)
    except:
        pass
    data = []
    data.append(('some', 'data'))
    return simplejson.dumps(data)


dajaxice_functions.register(getContent)
I ran python manage.py collectstatic, and I got the following output:
Copying '/tmp/tmpm8OlOw'
However, the dajaxice.core.js generated does not have my function getContent at all. Where am I going wrong? I have dajaxice installed properly and everything, I hope.
It seems you forgot to call dajaxice_autodiscover() from urls.py (this is the place recommended by the dajaxice author).
This call loads each app's ajax.py module and makes its methods discoverable by the JS code generator.
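Something along these lines in urls.py (modeled on the dajaxice quickstart for this Django 1.4 era; treat it as a sketch):

from django.conf.urls import patterns, include, url
from dajaxice.core import dajaxice_autodiscover, dajaxice_config

dajaxice_autodiscover()

urlpatterns = patterns('',
    url(dajaxice_config.dajaxice_url, include('dajaxice.urls')),
)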
You need to register your function with dajaxice, either using the @dajaxice_register decorator or the other methods mentioned in the documentation:
http://django-dajaxice.readthedocs.org/en/latest/quickstart.html#create-your-first-ajax-function
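With the decorator, the function in ajax.py would look roughly like this (body trimmed to the essentials):

from django.utils import simplejson
from dajaxice.decorators import dajaxice_register


@dajaxice_register
def getContent(request, *args, **kwargs):
    data = [('some', 'data')]
    return simplejson.dumps(data)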