Get scrapy result inside a Django view - django

I'm scrapping a page successfully that returns me an unique item. I don't want neither to save the scrapped item in the database nor to a file. I need to get it inside a Django view.
My view is as follows:
def start_crawl(process_number, court):
"""
Starts the crawler.
Args:
process_number (str): Process number to be found.
court (str): Court of the process.
"""
runner = CrawlerRunner(get_project_settings())
results = list()
def crawler_results(sender, parse_result, **kwargs):
results.append(parse_result)
dispatcher.connect(crawler_results, signal=signals.item_passed)
process_info = runner.crawl(MySpider, process_number=process_number, court=court)
return results
I followed this solution but results list is always empty.
I read something as creating a custom middleware and getting the results at the process_spider_output method.
How can I get the desired result?
Thanks!

I managed to implement something like that in one of my projects. It is a mini-project and I was looking for a quick solution. You'll might need modify it or support multi-threading etc in case you put it in production environment.
Overview
I created an ItemPipeline that just add the items into a InMemoryItemStore helper. Then, in my __main__ code I wait for the crawler to finish, and pop all the items out of the InMemoryItemStore. Then I can manipulate the items as I wish.
Code
items_store.py
Hacky in-memory store. It is not very elegant but it got the job done for me. Modify and improve if you wish. I've implemented that as a simple class object so I can simply import it anywhere in the project and use it without passing its instance around.
class InMemoryItemStore(object):
__ITEM_STORE = None
#classmethod
def pop_items(cls):
items = cls.__ITEM_STORE or []
cls.__ITEM_STORE = None
return items
#classmethod
def add_item(cls, item):
if not cls.__ITEM_STORE:
cls.__ITEM_STORE = []
cls.__ITEM_STORE.append(item)
pipelines.py
This pipleline will store the objects in the in-memory store from the snippet above. All items are simply returned to keep the regular pipeline flow intact. If you don't want to pass some items down the to the other pipelines simply change process_item to not return all items.
from <your-project>.items_store import InMemoryItemStore
class StoreInMemoryPipeline(object):
"""Add items to the in-memory item store."""
def process_item(self, item, spider):
InMemoryItemStore.add_item(item)
return item
settings.py
Now add the StoreInMemoryPipeline in the scraper settings. If you change the process_item method above, make sure you set the proper priority here (changing the 100 down here).
ITEM_PIPELINES = {
...
'<your-project-name>.pipelines.StoreInMemoryPipeline': 100,
...
}
main.py
This is where I tie all these things together. I clean the in-memory store, run the crawler, and fetch all the items.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from <your-project>.items_store import InMemoryItemStore
from <your-project>.spiders.your_spider import YourSpider
def get_crawler_items(**kwargs):
InMemoryItemStore.pop_items()
process = CrawlerProcess(get_project_settings())
process.crawl(YourSpider, **kwargs)
process.start() # the script will block here until the crawling is finished
process.stop()
return InMemoryItemStore.pop_items()
if __name__ == "__main__":
items = get_crawler_items()

If you really want to collect all data in a "special" object.
Store the data in a separate pipeline like https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter and in close_spider (https://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=close_spider#close_spider) you open your django object.

Related

Wagtail add functions to models.py

i'm trying to make a custom plotly-graphic on a wagtail homepage.
I got this far. I'm overriding the wagtail Page-model by altering the context returned to the template. Am i doing this the right way, is this possible in models.py ?
Thnx in advanced.
from django.db import models
from wagtail.models import Page
from wagtail.fields import RichTextField
from wagtail.admin.panels import FieldPanel
import psycopg2
from psycopg2 import sql
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import plot
class CasPage(Page):
body = RichTextField(blank=True)
content_panels = Page.content_panels + [
FieldPanel('body'),
]
def get_connection(self):
try:
return psycopg2.connect(
database="xxxx",
user="xxxx",
password="xxxx",
host="xxxxxxxxxxxxx",
port=xxxxx,
)
except:
return False
conn = get_connection()
cursor = conn.cursor()
strquery = (f'''SELECT t.datum, t.grwaarde - LAG(t.grwaarde,1) OVER (ORDER BY datum) AS
gebruiktgas
FROM XXX
''')
data = pd.read_sql(strquery, conn)
fig1 = go.Figure(
data = data,
layout=go.Layout(
title="Gas-verbruik",
yaxis_title="aantal M3")
)
output = plotly.plot(fig1, output_type='div', include_plotlyjs=False)
# https://stackoverflow.com/questions/32626815/wagtail-views-extra-context
def get_context(self, request):
context = super(CasPage, self).get_context(request)
context['output'] = output
return context
Kind of the right track. You should move all the plot code into its own method though. At the moment, it runs the plot code when the site initialises then stays stored in memory.
There's three usual ways to get the plot to the rendered page then.
As you've done with context
As a property or method of the page class
As a template tag called from the template
The first two have more or less the same effect, except the 2nd makes the property available anywhere, not just the template. The context method runs before the page starts rendering, the other two happen during that process. I guess the only real difference there is that if you're using template caching, the context will always run each time the page is loaded, the other two only run when the cache is invalid, or if the code is escaped out of the cache (for fragment caching).
To call the plot as a property of your page class, you'd just pull out the code into a def with the #property decorator:
class CasPage(Page):
....
#property
def plot(self):
try:
conn = psycopg2.connect(
database="xxxx",
user="xxxx",
password="xxxx",
host="xxxxxxxxxxxxx",
port=xxxxx,
)
cursor = conn.cursor()
strquery = (f'''SELECT t.datum, t.grwaarde - LAG(t.grwaarde,1) OVER (ORDER BY datum) AS
gebruiktgas FROM XXX''')
data = pd.read_sql(strquery, conn)
fig1 = go.Figure(
data = data,
layout=go.Layout(
title="Gas-verbruik",
yaxis_title="aantal M3")
)
return plotly.plot(fig1, output_type='div', include_plotlyjs=False)
except Exception as e:
print(f"{type(e).__name__} at line {e.__traceback__.tb_lineno} of {__file__}: {e}")
return None
^ I haven't tried this code ... it should work as is, but no guarantees I didn't make a typo ;)
Now you can access your plot with {{ self.plot }} in the template.
If you want to stick with context, then you'd stay with the def above but just amend your output line to
context['output'] = self.plot
Template tags are more useful when they're being used in StructBlocks and not part of a page class like this, or where you have code that you want to re-use in multiple templates.
Then you'd move all that plot code into a template tag file, register it and call it in the template with {% plot %}. Wagtail template tags work the same as Django: https://docs.djangoproject.com/en/4.1/howto/custom-template-tags/
Is the plot data outside of the site database? If not, you could probably get the data via the ORM if it was defined as a model. If so, it's probably worth writing a view (or stored procedure if you want to pass parameters) on the db server and calling that rather than hard coding the SQL into your python.
The other consideration is the page load time - if the dataset is big, this could take a while and prevent the page from loading. You'd probably want a front-end solution in that case.

Django disable low level caching with context manager

One of my methods in a project I'm working on looks like this:
from django.core.cache import cache
from app import models
def _get_active_children(parent_id, timestamp):
children = cache.get(f"active_seasons_{parent_id}")
if children is None:
children = models.Children.objects.filter(parent_id=parent_id).active(
dt=timestamp
)
cache.set(
f"active_children_{parent_id}",
children,
60 * 10,
)
return children
The issue is I don't want caching to occur when this method is being called via the command line (it's inside a task). So I'm wondering if there's a way to disable caching of this form?
Ideally I want to use a context manager so that any cache calls inside the context are ignored (or pushed to a DummyCache/LocalMem cache which wouldn't effect my main Redis cache).
I've considered pasisng skip_cache=True through the methods, but this is pretty brittle and I'm sure there's a more elegant solution. Additionally, I've tried using mock.patch but I'm not sure this works outside of test classes.
My ideal solution would look something like:
def task():
...
_get_active_children(parent_id, timestamp):
with no_cache:
task()
I have a solution (but I think there's a better one out there):
from unittest.mock import patch
from django.core.cache.backends.dummy import DummyCache
from django.utils.module_loading import import_string
def no_cache(module_str, cache_object_str='cache'):
""" example usage: with no_cache('app.tasks', 'cache'): """
module_ = import_string(module_str)
return patch.object(module_, cache_object_str, DummyCache('mock', {}))
Inspired by this.

Defining functions inside of views

I am attempting to create a job board website and upon entering a zip code in a form, that zip code is passed to a search_results view (as zip_code). In this view, I need to:
Get the surrounding zip codes (with a certain mile radius)
Get objects in DB that match those zip codes.
I have step one complete and have not implemented step two yet (actual code not that important to question):
from uszipcode import Zipcode, SearchEngine
def search_results(request, zip_code):
zip_codes = []
search = SearchEngine(simple_zipcode=True) # create SearchEngine object
zip_code = search.by_zipcode(zip_code) #create Zipcode object?
latitude = zip_code.lat
longitude = zip_code.lng
result = search.by_coordinates(latitude, longitude, radius = 5, returns = 5)
for item in result:
zip_codes.append(item.zipcode)
# code that will return matching objects
My question is can you define functions inside of a view in Django, like such:
def search_results(request, zip_code):
zip_codes = getSurroundingZipCodes(zip_code)
results = getJobsInArea(zip_codes)
return render(request, 'results.html', {'results: results})
def getSurroundingZipCodes(zip_code):
# logic for this function
def getJobsInArea(zip_codes):
# logic for this function
This is something I haven't seen in any tutorials so I feel like the answer is no, but I'm not sure why?
Yes you can do it. django view here is a function . You can define functions inside function.
That is how decorators work in python. But
why cant we define functions in seperate modules and import them above? Like in a file do
utils.py
def getSurroundingZipCodes(zip_code):
# logic for this function
def getJobsInArea(zip_codes):
# logic for this function
and simply import
from utils import getSurroundingZipCodes,getJobsInArea
this way they will be resuable

Store numpy array in session (Alternatives)

I have a class in a view that does some calculations.
The instance of this class is passed to a Django template context and the results are displayed in the html.
On the other hand, I need to store these results somewhere for later use in another view, where a pdf document is generated using these results.
The results are large lists of data.
For example:
def view_one(request):
# some code
class_instance = SomeClass(form.cleaned_data)
html = render_to_string('results.html', {'instance': class_instance})
return JsonResponse({"result": html})
def view_two(request):
# some code
class_pdf = GeneratePdf(class_instance) # How can I pass the class_instance data?
# some code
The results are large lists of data.
Should I use Django request session to store the data?
Can I use Celery?
There is another alternative?

How to test views that use post request in django?

here's my view (simplified):
#login_required(login_url='/try_again')
def change_bar(request):
foo_id = request.POST['fid']
bar_id = request.POST['bid']
foo = models.Foo.objects.get(id=foo_id)
if foo.value > 42:
bar = models.Bar.objects.get(id=bar_id)
bar.value = foo.value
bar.save()
return other_view(request)
Now I'd like to check if this view works properly (in this simplified model, if Bar instance changes value when it should). How do I go about it?
I'm going to assume you mean automated testing rather than just checking that the post request seems to work. If you do mean the latter, just check by executing the request and checking the values of the relevant Foo and Bar in a shell or in the admin.
The best way to go about sending POST requests is using a Client. Assuming the name of the view is my_view:
from django.test import Client
from django.urls import reverse
c = Client()
c.post(reverse('my_view'), data={'fid':43, 'bid':20})
But you still need some initial data in the database, and you need to check if the changes you expected to be made got made. This is where you could use a TestCase:
from django.test import TestCase, Client
from django.urls import reverse
FooBarTestCase(TestCase):
def setUp(self):
# create some foo and bar data, using foo.objects.create etc
# this will be run in between each test - the database is rolled back in between tests
def test_bar_not_changed(self):
# write a post request which you expect not to change the value
# of a bar instance, then check that the values didn't change
self.assertEqual(bar.value, old_bar.value)
def test_bar_changes(self):
# write a post request which you expect to change the value of
# a bar instance, then assert that it changed as expected
self.assertEqual(foo.value, bar.value)
A library which I find useful for making setting up some data to execute the tests easier is FactoryBoy. It reduces the boilerplate when it comes to creating new instances of Foo or Bar for testing purposes. Another option is to write fixtures, but I find that less flexible if your models change.
I'd also recommend this book if you want to know more about testing in python. It's django-oriented, but the principles apply to other frameworks and contexts.
edit: added advice about factoryboy and link to book
you can try putting "print" statements in between the code and see if the correct value is saved. Also for update instead of querying with "get" and then saving it (bar.save()) you can use "filter" and "update" method.
#login_required(login_url='/try_again')
def change_bar(request):
foo_id = request.POST['fid']
bar_id = request.POST['bid']
foo = models.Foo.objects.get(id=foo_id)
if foo.value > 42:
models.Bar.objects.filter(id=bar_id).update(value=foo.value)
#bar.value = foo.value
#bar.save()
return other_view(request)