Run Scrapy script, process output, and load into database all at once? - django

I've managed to write a Scrapy project that scrapes data from a web page, and when I call it with the scrapy crawl dmoz -o items.json -t json at the command line, it successfully outputs the scraped data to a JSON file.
I then wrote another script that takes that JSON file, loads it, changes the way the data is organized (I didn't like the default way it was being organized), and spits it out as a second JSON file. I then use Django's manage.py loaddata fixture.json command to load the contents of that second file into a Django database.
Now, I'm sensing that I'm going to get laughed out of the building for doing this in three separate steps, but I'm not quite sure how to put it all together into one script. For starters, it does seem really stupid that I can't just have my Scrapy project output my data in the exact way that I want. But where do I put the code to modify the 'default' way that the Feed exports feature outputs my data? Would it just go in my pipelines.py file?
And secondly, I want to call the scraper from inside a python script that will then also load the resulting JSON fixture into my database. Is that as simple as putting something like:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
at the top of my script, and then following it with something like:
from django.something.manage import loaddata
loaddata('/path/to/fixture.json')
? And finally, is there any specific place this script would have to live relative to both my Django project and the Scrapy project for it to work properly?

Exactly that. Define a custom item pipeline in pipelines.py to output the item data as desired, then add the pipeline class to settings.py. The scrapy documentation has a JSONWriterPipeline example that may be of use.
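For reference, a rough sketch along the lines of the docs' JsonWriterPipeline, reshaping each item before writing it out; the class name, output path, and reshaping step are placeholders for your own project:
myproject/pipelines.py
import json

class CustomJsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        data = dict(item)
        # Reorganize 'data' here however you want the final JSON to look.
        self.file.write(json.dumps(data) + '\n')
        return item

Then register it in settings.py (older Scrapy versions expect a plain list of class paths, newer ones a dict mapping each path to an order value):
ITEM_PIPELINES = {'myproject.pipelines.CustomJsonWriterPipeline': 300}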
Well, on the basis that the script in your example was taken from the scrapy documentation, it should work. Have you tried it?
The location shouldn't matter, so long as all of the necessary imports work. You could test this by firing a Python interpreter in the desired location and then checking all of the imports one by one. If they all run correctly, then the script should be fine.
If anything goes wrong, then post another question in here with the relevant code and what was tried and I'm sure someone will be happy to help out. :)
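For the loaddata half, rather than importing from django.something.manage, you can call Django's management API directly. A minimal sketch for a standalone script, assuming DJANGO_SETTINGS_MODULE points at your project's settings (the settings module and fixture path below are placeholders):
import os
import django
from django.core.management import call_command

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')  # placeholder settings module
django.setup()  # needed on Django 1.7+ before using the ORM or management commands outside manage.py

call_command('loaddata', '/path/to/fixture.json')
You could run this after the crawl finishes, or fold the crawl and the load into the same script.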

Related

How to execute django query from string and get output

I want to run a Django query from a string and put the output into a variable.
In my DRF project, the client sends a Django query:
{'query': 'model.objects.all()'}
and I need to return the result of this query.
I tried using exec('model.objects.all()'), but I can't assign the output to a variable. I also tried subprocess.run([sys.executable, "-c", 'model.objects.all()'], capture_output=True, text=True), but the subprocess doesn't find the model.
There's a huge amount of setting up needed before a process using Django models can work correctly. That's why manage.py shell exists.
If you want to perform Django operations outside the context of a Django server, write a management command. You can then invoke it from the command line, from cron, from other Python scripts ... wherever.
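For illustration, a minimal management command of that kind might look like the sketch below; myapp and MyModel are placeholders for your own app and model:
myapp/management/commands/run_query.py
from django.core.management.base import BaseCommand
from myapp.models import MyModel  # placeholder model

class Command(BaseCommand):
    help = 'Run a query and print the results'

    def handle(self, *args, **options):
        for obj in MyModel.objects.all():
            self.stdout.write(str(obj))
You can then run it with python manage.py run_query, or invoke it from another script with django.core.management.call_command('run_query').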

Is there a way to use CLI to POST data on django2 project?

I have a webapp that I created using Django2. At a high level, it will be used to process .tsv files of data and display them nicely on a screen.
I want to be able to have a command line interface where I can perform a POST request to the already running webapp, and essentially add data to a model, save it, and create a unique webpage to display that data. Something like:
uploadtodjangoapp <myfilename> --user='heidi' --other-options='....'
uploading myfilename to myapp!
done
see data here: www.mysite.com/info/myfilename
In this situation, the webapp will already be running somewhere (either locally or on a VM).
Currently, I know you can create a form in the user interface to perform POST requests and collect user data. And I know you can also use python manage.py shell and do something like:
>>> from myapp.models import mymodel
>>> m = mymodel(data="some data here")
>>> m.save()
.... but is this the only way?
Any help would be greatly appreciated!
You can easily achieve this using curl.
Just for example, in a terminal:
curl --data "field_1=data_1&field_2=data_2&field_3=data_3" <API FOR POST REQUEST>
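If you would rather have the uploadtodjangoapp-style command from the question than raw curl, here is a rough sketch using Python's requests library; the endpoint URL and field names are assumptions, and your Django view would need to accept a multipart file upload:
uploadtodjangoapp.py
import argparse
import requests

def main():
    parser = argparse.ArgumentParser(description='Upload a .tsv file to the running Django app')
    parser.add_argument('filename', help='path to the .tsv file to upload')
    parser.add_argument('--user', required=True, help='user to attribute the upload to')
    args = parser.parse_args()

    # The endpoint and field names are placeholders; match them to your view or serializer.
    with open(args.filename, 'rb') as fh:
        response = requests.post(
            'http://www.mysite.com/info/upload/',
            data={'user': args.user},
            files={'file': fh},
        )
    response.raise_for_status()
    print('done')
    print('see data here: www.mysite.com/info/' + args.filename)

if __name__ == '__main__':
    main()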

Simplest way to periodically run a function from Django app on Elastic Beanstalk

Within my app I have a function which I want to run every hour to collect data and populate a database (I have an RDS database linked to my Elastic Beanstalk app). This is the function I want to run (a static method defined in my Data model):
@staticmethod
def get_data():
    page = requests.get(....)
    soup = BeautifulSoup(page.text, 'lxml')  # parse the response body, not the Response object
    .....
    site_data = Data.objects.create(...)  # create() already saves the new row
    site_data.save()

>>> Data.get_data()
# populates database on my local machine
From reading it seems I want to use either Celery or a cron job. I am unfamiliar with either of these and it seems quite complicated using them with AWS. This post here seems most relevant but I am unsure how I would apply the suggestion to my example. Would I need to create a management command as mentioned and what would this look like with my example?
As this is new to me, it would help a lot if someone could point me down the right path.
How to create a management command is covered in detail in the docs.
The following provides a management command called foobar.
project_root/app_name/management/commands/foobar.py
from django.core.management.base import BaseCommand, CommandError
from yourapp.models import Data

class Command(BaseCommand):
    help = 'Dump data'

    def handle(self, *args, **options):
        Data.get_data()
Please read the linked docs - e.g. there are a few __init__.py files that need to be present for django to discover the command properly.
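The expected layout looks roughly like this:
project_root/
    app_name/
        __init__.py
        management/
            __init__.py
            commands/
                __init__.py
                foobar.py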
When your project is deployed on EBS, it should be connected to the proper database, and the data will be stored there.
To configure the cron job, follow the instructions from your linked question. The question AWS Elastic Beanstalk, running a cronjob also covers the topic in more detail.
The line in the crontab file should look like this:
0 * * * * /path/to/your/environment/bin/python /path/to/your/project_root/manage.py foobar > /path/to/your/cron.log 2>&1
As I've never used EBS, the paths above are placeholders, but the descriptions explain what each one should point to. A few details regarding the cron line:
0 * * * * runs the command when the minute is 0, at every hour (*), on every day of the month (*), in every month (*), and on every day of the week (*).
The rest is the command that should run:
/path/to/your/environment/bin/python uses the Python interpreter from your project's environment
/path/to/your/project_root/manage.py invokes your project's manage.py
foobar runs your management command
> /path/to/your/cron.log 2>&1 redirects the command's output, both STDOUT and STDERR, into the file /path/to/your/cron.log

How to detect and respond to a database change (INSERT) from a django project?

I am setting up our project to integrate with a shipping platform called Endicia which has the ability to insert new rows into our database when a package is shipped.
How can I detect from Python when a new row has been inserted?
My solution for now would be to query the DB every 30 seconds or so for new rows... is there another solution to send a signal from Postgres to Python?
You'd set up a custom command that is run by the manage.py file.
You'd put it in the yourapp/management/commands/ folder. Make sure to add an __init__.py file to both the management and commands folders, or the command won't be discovered. Then you create the code for the custom command.
Then, see this related question about running a shell script when changes are made to a postgres database. The answer there was to use PL/sh. You'll need to figure that part out on your own, but basically however you do it, the end result is that the script should call something like /path/to/app/manage.py command_name
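For illustration, the command the trigger invokes might look like the sketch below; the Shipment model and its processed flag are assumptions about your schema:
yourapp/management/commands/process_shipments.py
from django.core.management.base import BaseCommand
from yourapp.models import Shipment  # placeholder for the table the shipping platform writes to

class Command(BaseCommand):
    help = 'React to rows inserted by the shipping platform'

    def handle(self, *args, **options):
        # 'processed' is an assumed flag used to track which rows have already been handled.
        for shipment in Shipment.objects.filter(processed=False):
            # ... respond to the new row here ...
            shipment.processed = True
            shipment.save()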

don't load 'initial_data.json' fixture when testing

I'm testing a Django app not written by myself, which uses two fixtures: initial_data.json and testing.json. Both fixture files contain conflicting data (throwing an integrity error).
For testing, I've specified TestCase.fixtures = ['testing.json'], but initial_data.json is loaded too.
How can I avoid loading initial_data.json (not renaming it) in the testcase?
Quoting from the Django documentation:
If you create a fixture named initial_data.[xml/yaml/json], that fixture will be loaded every time you run syncdb. This is extremely convenient, but be careful: remember that the data will be refreshed every time you run syncdb. So don't use initial_data for data you'll want to edit.
So I guess there's no way to say "okay, don't load initial data just this once". Perhaps you could write a short bash script that would rename the file. Otherwise you'd have to dig into the Django code.
More info here: http://docs.djangoproject.com/en/dev/howto/initial-data/#automatically-loading-initial-data-fixtures
You might want to think about whether initial_data.json is something your app actually needs. It's not hard to "manually" load your production data with ./manage.py loaddata production.json after running a syncdb (how often do you run syncdb in production, anyway?), and it would make loading your testing fixture much easier.
If you want to have tables with no initial data, this code will help you:
Edit tests.py:
from django.core import management

class FooTest(TestCase):

    @classmethod
    def setUpClass(cls):
        management.call_command('flush', interactive=False, load_initial_data=False)
This will flush your data, returning the database to the state it was in right after syncdb, without loading the initial data fixture.