How to pass custom parameters(such as -o) to scrapy crawler - python-2.7

I'm currently working on python2.7/Scrapy 1.8 project.
I work within a Docker container and using a
launchable.py:
import scrapy
from scrapy.crawler import CrawlerProcess
from spiders import addonsimilartechSpider, similartechSpider
process = CrawlerProcess()
process.crawl(similartechSpider.SimilarTechSpider)
process.crawl(addonsimilartechSpider.AddonSimilarSpider)
process.start()
I used to start my scrapy like this :
scrapy crawl <nameofmyspider> -o output.xlsx
I installed scrapy-xlsx and used it until now, now that I have my launchable.py I dont know how to pass 'custom' arguments through scrappy crawler (not spider).
I understand the difference between scrapy settings and spider settings, so :
process.crawl(similartechSpider.SimilarTechSpider, input='-o', first='test1.xlsx')
will likely not work right?
thanks for any of your time taken to answer this.

Use the corresponding Scrapy settings instead (FEED_*).
You can pass them to CrawlerProcess as a dict.

CrawlerProcess(settings={
'FEED_URI': 'output_file_name.xlsx',
'FEED_EXPORTERS' : {'xlsx': 'scrapy_xlsx.XlsxItemExporter'},
})

Related

Python script on Django shell not seeing import if import not set as global?

I have searched the stackoverflow and wasn't able to find this. I have noticed something I can not wrap my head around. When run as normal python script import works ok, but when run from Django shell it behaves weird, needs to set import as global to be seen.
You can reproduce it like this. Make a file test.py in folder with manage.py. Code you can test with is this.
This doesn't work, code of test.py:
#!/usr/bin/env python3
import chardet
class LoadList():
def __init__(self):
self.email_list_path = '/home/omer/test.csv'
#staticmethod
def check_file_encoding(file_to_check):
encoding = chardet.detect(open(file_to_check, "rb").read())
return encoding
def get_encoding(self):
return self.check_file_encoding(self.email_list_path)['encoding']
print(LoadList().get_encoding())
This works ok when chardet set as global inside test.py file:
#!/usr/bin/env python3
import chardet
class LoadList():
def __init__(self):
self.email_list_path = '/home/omer/test.csv'
#staticmethod
def check_file_encoding(file_to_check):
global chardet
encoding = chardet.detect(open(file_to_check, "rb").read())
return encoding
def get_encoding(self):
return self.check_file_encoding(self.email_list_path)['encoding']
print(LoadList().get_encoding())
First run is without global chardet and you can see the error. Second run is with global chardet set and you can see it works ok.
What is going on and can someone explain this to me? Why it isn't seen until set as global?
Piping a file into shell is the same as piping it into the python command. It's not the same as running the file with python test.py. I suspect it's something to do with the way the the newlines are interpreted as to how the file is really parsed, but don't have time to check.
Instead of this approach I'd recommend you write a custom management command.

SublimeText2 and Django build

Just started using st2, switching from Eclipse pydev.
In Eclipse, I imported my Django project and could hit "Run" from any file and use the django framework.
for instance if I wanted to test something about my model class:
models.py
from django.contrib.auth import logout, login as auth_login
if __name__ == "__main__":
auth_login(username, password)
This would run and validate.
However, in sublime text, I can't get my python virtual_env to build the same way.
I keep getting the following error when I import anything django:
django.core.exceptions.ImproperlyConfigured: Requested settings CACHES, but settings are not configured.
One way to fix it is to ADD to EVERY FILE i want to test:
import os
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myapp.settings")
I was wondering if there is an easier way to include this? My current build file is:
{
"working_dir": "/home/username/virtualenvs/workspace/myproject",
"env" : { "PYTHONPATH" : "/home/username/virtualenvs/workspace/myproject"},
"cmd" : "/home/username/virtualenvs/workspace/bin/python2.7",
"selector" : "source.python"
}
I've been searching around for hours...any help is appreciated!

Add method imports to shell_plus

In shell_plus, is there a way to automatically import selected helper methods, like the models are?
I often open the shell to type:
proj = Project.objects.get(project_id="asdf")
I want to replace that with:
proj = getproj("asdf")
Found it in the docs. Quoted from there:
Additional Imports
In addition to importing the models you can specify other items to
import by default. These are specified in SHELL_PLUS_PRE_IMPORTS and
SHELL_PLUS_POST_IMPORTS. The former is imported before any other
imports (such as the default models import) and the latter is imported
after any other imports. Both have similar syntax. So in your
settings.py file:
SHELL_PLUS_PRE_IMPORTS = (
('module.submodule1', ('class1', 'function2')),
('module.submodule2', 'function3'),
('module.submodule3', '*'),
'module.submodule4'
)
The above example would directly translate to the following python
code which would be executed before the automatic imports:
from module.submodule1 import class1, function2
from module.submodule2 import function3
from module.submodule3 import *
import module.submodule4
These symbols will be available as soon as the shell starts.
ok, two ways:
1) using PYTHONSTARTUP variable (see this Docs)
#in some file. (here, I'll call it "~/path/to/foo.py"
def getproj(p_od):
#I'm importing here because this script run in any python shell session
from some_app.models import Project
return Project.objects.get(project_id="asdf")
#in your .bashrc
export PYTHONSTARTUP="~/path/to/foo.py"
2) using ipython startup (my favourite) (See this Docs,this issue and this Docs ):
$ pip install ipython
$ ipython profile create
# put the foo.py script in your profile_default/startup directory.
# django run ipython if it's installed.
$ django-admin.py shell_plus

Importing CSV to Django and settings not recognised

So i'm getting to grips with Django, or trying to. I have some code that isn't dependent on being called by the webpage - it's designed to populate the database with information. Eventually it will be set up as a cron job to run overnight. This is the first crack at it, which is to do an initial population (once I have that working, I'll move to an add structure, where only new records are pushed.) I'm using Python 2.7, Django 1.5 and Sqlite3. When I run this code, I get
Requested setting DATABASES, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.
That seems fairly obvious, but I've spent a couple of hours now trying to work out how to adjust that setting. How do I call / open a connection / whatever the right terminology is here? I have a number of functions like this that will be scheduled jobs, and this has been frustrating me all afternoon.
import urllib2
import csv
import requests
from django.db import models
from gmbl.models import Match
master_data_file = urllib2.urlopen("http://www.football-data.co.uk/mmz4281/1213/E0.csv", "GET")
data = list(tuple(rec) for rec in csv.reader(master_data_file, delimiter=','))
for row in data:
current_match = Match(matchdate=row[1],
hometeam=row[2],
awayteam = row [3],
homegoals = row [4],
awaygoals = row[5],
homeshots = row[10],
awayshots = row[11],
homeshotsontarget = row[12],
awayshotsontarget = row[13],
homecorners = row[16],
awaycorners = row[17])
current_match.save()
I had originally started out with http://django-csv-importer.readthedocs.org/en/latest/ but I had the same error, and the documentation doesn't make much sense trying to debug it. When I tried calling settings.configure in the function, it said it didn't exist; presumably I had to import it, but couldn't make that work.
Make sure Django, and your project are in PYTHONPATH then you can do:
import urllib2
import csv
import requests
from django.core.management import setup_environ
from django.db import models
from yoursite import settings
setup_environ(settings)
from gmbl.models import Match
master_data_file = urllib2.urlopen("http://www.football-data.co.uk/mmz4281/1213/E0.csv", "GET")
data = list(tuple(rec) for rec in csv.reader(master_data_file, delimiter=','))
# ... your code ...
Reference: http://www.b-list.org/weblog/2007/sep/22/standalone-django-scripts/
Hope it helps!

Django timer thread

I would like to compute some information in my Django application on regular basis.
I need to select and insert data each second and want to use Django ORM.
How can I do this?
In a shell script, set the DJANGO_SETTINGS_MODULE variable and call a python script
export DJANGO_SETTINGS_MODULE=yourapp.settings
python compute_some_info.py
In compute_some_info.py, set up django and import your modules (look at how the manage.py script sets up to run Django)
#!/usr/bin/env python
import sys
try:
import settings # Assumed to be in the same directory.
except ImportError:
sys.stderr.write("Error: Can't find the file 'settings.py'")
sys.exit(1)
sys.path = sys.path + ['/yourapphome']
from yourapp.models import YourModel
YourModel.compute_some_info()
Then call your shell script in a cron job.
Alternatively -- you can just keep running and sleeping (better if it's every second) -- you would still want to be outside of the webserver and in your own process that is set up this way.
One way to do it would be to create a custom command, and invoke python manage.py your_custom_command from cron or windows scheduler.
http://docs.djangoproject.com/en/dev/howto/custom-management-commands/
For example, create myapp/management/commands/myapp_task.py which reads:
from django.core.management.base import NoArgsCommand
class Command(NoArgsCommand):
def handle_noargs(self, **options):
print 'Doing task...'
# invoke the functions you need to run on your project here
print 'Done'
Then you can run it from cron like this:
export DJANGO_SETTINGS_MODULE=myproject.settings; export PYTHONPATH=/path/to/project_parent; python manage.py myapp_task