Python Scrapy - Run Spider - python-2.7

I'm running Python 2.7 on a Windows machine and trying to use Scrapy,
following the basic Scrapy tutorial at http://doc.scrapy.org/en/latest/intro/overview.html
I've created the following spider and saved it as Test2 in C:\Python27\Scrapy
import scrapy
class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']
    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)
    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'votes': response.css('.question .vote-count-post::text').extract_first(),
            'body': response.css('.question .post-text').extract_first(),
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
The next step tells me to run the spider using
scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json
But I have no idea where to run that line of code.
I am used to running a print or a store-to-CSV command at the end of my Python file in order to retrieve results.
I'm sure this is easy to resolve, but I'm not getting it. Thanks in advance.

You need to run the runspider command in whatever command-line shell you are using (e.g. cmd or Cygwin), from the directory that contains the spider file. The file name in the command must match the name you saved the spider under.
That command will create a file called top-stackoverflow-questions.json in the directory in which you run the command.
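If you would rather start the crawl from a Python file, as you are used to, one option is to launch the same command with the standard subprocess module. This is only a sketch: the helper names below are made up for illustration, and it assumes the scrapy executable is on your PATH and that the spider file sits in the current directory.

```python
import subprocess


def build_runspider_command(spider_file, output_file):
    """Return the argument list for `scrapy runspider <file> -o <output>`."""
    return ["scrapy", "runspider", spider_file, "-o", output_file]


def run_spider(spider_file, output_file):
    """Run the spider as a child process and return its exit code.

    Assumes the `scrapy` executable can be found on PATH.
    """
    return subprocess.call(build_runspider_command(spider_file, output_file))


# Hypothetical usage, e.g. at the bottom of your own script:
# run_spider("stackoverflow_spider.py", "top-stackoverflow-questions.json")
```

The JSON file still ends up in the directory the script is started from, exactly as when the command is typed by hand.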

Related

Python script on Django shell not seeing import if import not set as global?

I searched Stack Overflow and wasn't able to find this. I have noticed something I can't wrap my head around. When the file is run as a normal Python script, the import works fine, but when it is run from the Django shell it behaves strangely: the import has to be declared global to be seen.
You can reproduce it like this. Create a file test.py in the folder that contains manage.py, using the code below.
This version of test.py doesn't work:
#!/usr/bin/env python3
import chardet
class LoadList():
    def __init__(self):
        self.email_list_path = '/home/omer/test.csv'
    @staticmethod
    def check_file_encoding(file_to_check):
        encoding = chardet.detect(open(file_to_check, "rb").read())
        return encoding
    def get_encoding(self):
        return self.check_file_encoding(self.email_list_path)['encoding']
print(LoadList().get_encoding())
This version works once chardet is declared global inside test.py:
#!/usr/bin/env python3
import chardet
class LoadList():
    def __init__(self):
        self.email_list_path = '/home/omer/test.csv'
    @staticmethod
    def check_file_encoding(file_to_check):
        global chardet
        encoding = chardet.detect(open(file_to_check, "rb").read())
        return encoding
    def get_encoding(self):
        return self.check_file_encoding(self.email_list_path)['encoding']
print(LoadList().get_encoding())
The first run is without global chardet, and it raises an error. The second run is with global chardet set, and it works.
What is going on? Can someone explain why the import isn't seen until it is declared global?
Piping a file into the shell is the same as piping it into the python command. It's not the same as running the file with python test.py. I suspect it has something to do with the way the newlines are interpreted and how the file is actually parsed, but I don't have time to check.
Instead of this approach I'd recommend you write a custom management command.
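For what it's worth, one plausible explanation (an assumption about how the shell executes piped input, not something confirmed above) is that the piped text is run through exec() with separate global and local namespaces. Top-level imports then land in the local namespace, while functions defined in the same text look names up in the global one. The sketch below reproduces that effect in plain Python, using json as a stand-in for chardet; the helper names are made up for illustration.

```python
SOURCE = """
import json

def dumps_one():
    return json.dumps(1)

result = dumps_one()
"""


def run_with_separate_namespaces(code):
    """exec() with distinct globals and locals, as inside a function."""
    globals_ns, locals_ns = {}, {}
    try:
        exec(code, globals_ns, locals_ns)
    except NameError as exc:
        # The import was stored in locals_ns, but dumps_one() looks
        # `json` up in globals_ns, where it does not exist.
        return "NameError: {}".format(exc)
    return locals_ns["result"]


def run_with_single_namespace(code):
    """exec() with one namespace, which is what `python test.py` resembles."""
    namespace = {}
    exec(code, namespace)
    return namespace["result"]


print(run_with_separate_namespaces(SOURCE))
print(run_with_single_namespace(SOURCE))
```

If that is indeed what happens, a custom management command avoids the problem because its module is imported normally, with a single module namespace.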

Scrapy how to save a State between spider runs (via scrapinghub)?

I have a spider that will run on a schedule. The spider's input is based on a date range: from the date of the last scrape to today's date. So the question is how to save the date of the last scrape within the Scrapy project. There is an option to read data shipped with the project using the pkgutil module, but I did not find any reference in the docs on how to write data to that file. Any ideas? Maybe an alternative?
P.S. My other option is to use some free remote MySQL DB just for this, but that looks like more work if a simple solution is available.
import json
import pkgutil
import scrapy
class CodeSpider(scrapy.Spider):
    name = "code"
    allowed_domains = ["google.com.au"]
    def start_requests(self):
        # Read the saved state that is packaged with the project.
        f = pkgutil.get_data("au_go", "res/state.json")
        ids = json.loads(f)
        id = ids[0]['state']
        yield {'state':id}
        # Write the new state back to the same file.
        ids[0]['state'] = 'New State'
        with open('./au_go/res/state.json', 'w') as f:
            json.dump(ids, f)
The above solution works fine when run locally, but I get "No such file or directory" when running the code on Scrapinghub.
File "/tmp/unpacked-eggs/__main__.egg/au_go/spiders/test_state.py", line 33, in parse
with open(savePath, 'w') as f:
IOError: [Errno 2] No such file or directory: './au_go/res/state.json'
The problem was fixed by using Scrapinghub Collections
and the Scrapinghub API. It works nicely now.
Here is some example code in case somebody finds it useful.
from scrapinghub import ScrapinghubClient
# Replace the placeholders with your own API key and project ID.
client = ScrapinghubClient('YOUR_API_KEY')
project = client.get_project('YOUR_PROJECT_ID')
collections = project.collections
last_accessed = collections.get_store('last_accessed')
last_accessed.set({'_key': 'Date', 'value': '12-54-1235'})
print(last_accessed.get('Date')['value'])
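Once the date of the last scrape can be read back from the store, the spider still has to turn it into the range of dates to crawl. A minimal sketch of that step, assuming the value is stored as an ISO YYYY-MM-DD string (the format and the function name are assumptions, not part of the code above):

```python
from datetime import date, datetime, timedelta


def dates_to_scrape(last_scraped, today=None):
    """Return every date after `last_scraped` up to and including `today`.

    `last_scraped` is an ISO string such as '2015-10-01' (assumed format).
    `today` defaults to the current date and exists mainly for testing.
    """
    start = datetime.strptime(last_scraped, "%Y-%m-%d").date()
    end = today or date.today()
    days = (end - start).days
    # An empty list is returned when nothing new is due.
    return [start + timedelta(days=offset) for offset in range(1, days + 1)]
```

After a successful run, date.today().isoformat() could be written back under the same key so the next scheduled run starts where this one stopped.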

python + wx & uno to fill libreoffice using ubuntu 14.04

I collected user data using a wxPython GUI and then used uno to fill this data into an OpenOffice document under Ubuntu 10.xx
user + my-script ( +empty document ) --> prefilled document
After upgrading to Ubuntu 14.04, uno no longer works with Python 2.7, and Ubuntu now ships LibreOffice instead of OpenOffice. When I try to run my Python 2.7 code, it says:
ImportError: No module named uno
How could I bring it back to work?
what I tried:
installed https://pypi.python.org/pypi/unotools v0.3.3
sudo apt-get install libreoffice-script-provider-python
converted the code to python3 and got uno importable, but wx is not importable in python3 :-/
ImportError: No module named 'wx'
googled and read python3 only works with wx phoenix
so tried to install: http://wxpython.org/Phoenix/snapshot-builds/
but wasn't able to get it to run with python3
is there a way to get the uno bridge to work with py2.7 under ubuntu 14.04?
Or how to get wx to run with py3?
what else could I try?
Create a Python macro in LibreOffice that does the work of inserting the data into the document, and then invoke the macro from your Python 2.7 code.
Because the macro runs from within LibreOffice, it will use Python 3.
Here is an example of how to invoke a LibreOffice macro from the command line:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
##
# a python script to run a libreoffice python macro externally
# NOTE: for this to run start libreoffice in the following manner
# soffice "--accept=socket,host=127.0.0.1,port=2002,tcpNoDelay=1;urp;" --writer --norestore
# OR
# nohup soffice "--accept=socket,host=127.0.0.1,port=2002,tcpNoDelay=1;urp;" --writer --norestore &
#
import uno
from com.sun.star.connection import NoConnectException
from com.sun.star.uno import RuntimeException
from com.sun.star.uno import Exception
from com.sun.star.lang import IllegalArgumentException
def uno_directmacro(*args):
    localContext = uno.getComponentContext()
    localsmgr = localContext.ServiceManager
    resolver = localsmgr.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext )
    try:
        ctx = resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
    except NoConnectException as e:
        print ("LibreOffice is not running or not listening on the port given - ("+e.Message+")")
        return
    msp = ctx.getValueByName("/singletons/com.sun.star.script.provider.theMasterScriptProviderFactory")
    sp = msp.createScriptProvider("")
    scriptx = sp.getScript('vnd.sun.star.script:directmacro.py$directmacro?language=Python&location=user')
    try:
        scriptx.invoke((), (), ())
    except IllegalArgumentException as e:
        print ("The command given is invalid ( "+ e.Message+ ")")
        return
    except RuntimeException as e:
        print("An unknown error occurred: " + e.Message)
        return
    except Exception as e:
        print ("Script error ( "+ e.Message+ ")")
        print(e)
        return
    return(None)
uno_directmacro()
And this is the corresponding macro code within LibreOffice, called "directmacro.py" and stored in the user area for LibreOffice macros (which would normally be $HOME/.config/libreoffice/4/user/Scripts/python):
#!/usr/bin/python
from com.sun.star.awt.MessageBoxButtons import BUTTONS_OK, BUTTONS_OK_CANCEL, BUTTONS_YES_NO, BUTTONS_YES_NO_CANCEL, BUTTONS_RETRY_CANCEL, BUTTONS_ABORT_IGNORE_RETRY
from com.sun.star.awt.MessageBoxButtons import DEFAULT_BUTTON_OK, DEFAULT_BUTTON_CANCEL, DEFAULT_BUTTON_RETRY, DEFAULT_BUTTON_YES, DEFAULT_BUTTON_NO, DEFAULT_BUTTON_IGNORE
from com.sun.star.awt.MessageBoxType import MESSAGEBOX, INFOBOX, WARNINGBOX, ERRORBOX, QUERYBOX
def directmacro(*args):
    import socket, time
    class FontSlant():
        from com.sun.star.awt.FontSlant import (NONE, ITALIC,)
    #get the doc from the scripting context which is made available to all scripts
    desktop = XSCRIPTCONTEXT.getDesktop()
    model = desktop.getCurrentComponent()
    text = model.Text
    tRange = text.End
    cursor = desktop.getCurrentComponent().getCurrentController().getViewCursor()
    doc = XSCRIPTCONTEXT.getDocument()
    parentwindow = doc.CurrentController.Frame.ContainerWindow
    # you cannot insert simple text and text into a table with the same method
    # so we have to know if we are in a table or not.
    # oTable and oCurCell will be null if we are not in a table
    oTable = cursor.TextTable
    oCurCell = cursor.Cell
    insert_text = "This is text inserted into a LibreOffice Document\ndirectly from a macro called externally"
    Text_Italic = FontSlant.ITALIC
    Text_None = FontSlant.NONE
    cursor.CharPosture=Text_Italic
    if oCurCell == None: # Are we inserting into a table or not?
        text.insertString(cursor, insert_text, 0)
    else:
        cell = oTable.getCellByName(oCurCell.CellName)
        cell.insertString(cursor, insert_text, False)
    cursor.CharPosture=Text_None
    return None
You will of course need to adapt the code to either accept data as arguments, read it from a file or whatever.
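One simple way to hand the collected data to the macro is a small JSON file that the Python 2.7 GUI script writes and the macro reads. The sketch below shows only that hand-off; the file location and the function names are assumptions made for illustration, and it uses nothing beyond the standard library.

```python
import json
import os
import tempfile

# Hypothetical location that both sides agree on.
DATA_FILE = os.path.join(tempfile.gettempdir(), "form_data.json")


def write_form_data(data, path=DATA_FILE):
    """Called by the GUI script: save the collected fields as JSON."""
    with open(path, "w") as handle:
        # json.dump escapes non-ASCII characters by default, so the
        # file is plain ASCII regardless of the platform encoding.
        json.dump(data, handle)


def read_form_data(path=DATA_FILE):
    """Called inside the macro: load the fields written by the GUI."""
    with open(path) as handle:
        return json.load(handle)
```

Inside the macro, the returned dictionary could then replace the hard-coded insert_text string.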
Ideally I would say use python 3, because python 2 is becoming outdated. The switch requires quite a bit of new coding changes, but better sooner than later. So I tried:
sudo pip3 install -U --pre \
-f http://wxpython.org/Phoenix/snapshot-builds/ \
wxPython_Phoenix
However this gave me errors, and I didn't want to spend the next couple of days working through them. Probably the pre-release versions are not ready for prime time yet.
So instead, what I recommend is to switch to AOO for now. See https://stackoverflow.com/a/27980255/5100564 for instructions. AOO does not have all the latest features that LO has, but it is a good solid Office product.
Apparently it is also possible to rebuild LibreOffice with python 2 using this script: https://gist.github.com/hbrunn/6f4a007a6ff7f75c0f8b

CRITICAL: Unhandled error in Deferred:

I am developing a spider project and I have moved to a new computer. Now I am installing everything and I have run into a problem with Twisted. I have read about this bug and I have installed pywin32 and then also WinPython, but it doesn't help. I have tried to update Twisted with this command
pip install Twisted --update
as advised in the forum, but pip reports that install has no --update option. I have also run
python python27\scripts\pywin32_postinstall.py -install
but with no success. This is my error:
G:\Job_vacancies\Python\vacancies>scrapy crawl jobs
2015-10-06 09:12:53 [scrapy] INFO: Scrapy 1.0.3 started (bot: vacancies)
2015-10-06 09:12:53 [scrapy] INFO: Optional features available: ssl, http11
2015-10-06 09:12:53 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'va
cancies.spiders', 'SPIDER_MODULES': ['vacancies.spiders'], 'DEPTH_LIMIT': 3, 'BO
T_NAME': 'vacancies'}
2015-10-06 09:12:53 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsol
e, LogStats, CoreStats, SpiderState
Unhandled error in Deferred:
2015-10-06 09:12:53 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 150, in _run_comm
and
cmd.run(args, opts)
File "c:\python27\lib\site-packages\scrapy\commands\crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "c:\python27\lib\site-packages\scrapy\crawler.py", line 153, in crawl
d = crawler.crawl(*args, **kwargs)
File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 1274, in
unwindGenerator
return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 1128, in
_inlineCallbacks
result = g.send(result)
File "c:\python27\lib\site-packages\scrapy\crawler.py", line 71, in crawl
self.engine = self._create_engine()
File "c:\python27\lib\site-packages\scrapy\crawler.py", line 83, in _create_en
gine
return ExecutionEngine(self, lambda _: self.stop())
File "c:\python27\lib\site-packages\scrapy\core\engine.py", line 66, in __init
__
self.downloader = downloader_cls(crawler)
File "c:\python27\lib\site-packages\scrapy\core\downloader\__init__.py", line
65, in __init__
self.handlers = DownloadHandlers(crawler)
File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\__init__.p
y", line 23, in __init__
cls = load_object(clspath)
File "c:\python27\lib\site-packages\scrapy\utils\misc.py", line 44, in load_ob
ject
mod = import_module(module)
File "c:\python27\lib\importlib\__init__.py", line 37, in import_module
__import__(name)
File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\s3.py", li
ne 6, in <module>
from .http import HTTPDownloadHandler
File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\http.py",
line 5, in <module>
from .http11 import HTTP11DownloadHandler as HTTPDownloadHandler
File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\http11.py"
, line 15, in <module>
from scrapy.xlib.tx import Agent, ProxyAgent, ResponseDone, \
File "c:\python27\lib\site-packages\scrapy\xlib\tx\__init__.py", line 3, in <m
odule>
from twisted.web import client
File "c:\python27\lib\site-packages\twisted\web\client.py", line 42, in <modul
e>
from twisted.internet.endpoints import TCP4ClientEndpoint, SSL4ClientEndpoin
t
File "c:\python27\lib\site-packages\twisted\internet\endpoints.py", line 34, i
n <module>
from twisted.internet.stdio import StandardIO, PipeAddress
File "c:\python27\lib\site-packages\twisted\internet\stdio.py", line 30, in <m
odule>
from twisted.internet import _win32stdio
File "c:\python27\lib\site-packages\twisted\internet\_win32stdio.py", line 7,
in <module>
import win32api
exceptions.ImportError: DLL load failed: The specified module could not be found
.
2015-10-06 09:12:53 [twisted] CRITICAL:
And this is my code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8
import scrapy, urlparse
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem
#We need that in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian.
#from scrapy.conf import settings
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,}
class JobSpider(scrapy.Spider):
name = "jobs"
#Test sample of SLO companies
start_urls = [
"http://www.g-gmi.si/gmiweb/",
]
#Result of the programme is this list of job vacancies webpages.
jobs_urls = []
def parse(self, response):
response.selector.remove_namespaces()
#We take all urls, they are marked by "href". These are either webpages on our website either new websites.
urls = response.xpath('//@href').extract()
#Base url.
base_url = get_base_url(response)
#Loop through all urls on the webpage.
for url in urls:
#If url represents a picture, a document, a compression ... we ignore it. We might have to change that because some companies provide job vacancies information in PDF.
if url.endswith((
#images
'.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
'.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
#documents
'.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
'.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
#music and video
'.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
'.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
#compressions and other
'.zip', '.rar', '.css', '.flv', '.php',
'.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
)):
continue
#If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it.
#However in this case we exclude good urls like http://www.mdm.si/company#employment
if any(x in url for x in ['?', '%', '&', '#']):
continue
#Ignore ftp.
if url.startswith("ftp"):
continue
#We need to save original url for xpath, in case we change it later (join it with base_url)
url_xpath = url
#If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
# -- It is true, that we may get some strange urls, but it is fine for now.
if not (url.startswith("http")):
url = urljoin(base_url,url)
#We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.
if (urlparse(url).netloc == urlparse(base_url).netloc):
#The main part. We look for webpages, whose urls include one of the employment words as strings.
# -- Instruction.
# -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
if any(x in url for x in [
'zaposlovanje',
'Zaposlovanje',
'zaposlitev',
'Zaposlitev',
'zaposlitve',
'Zaposlitve',
'zaposlimo',
'Zaposlimo',
'kariera',
'Kariera',
'delovna-mesta',
'delovna_mesta',
'pridruzi-se',
'pridruzi_se',
'prijava-za-delo',
'prijava_za_delo',
'oglas',
'Oglas',
'iscemo',
'Iscemo',
'careers',
'Careers',
'jobs',
'Jobs',
'employment',
'Employment',
]):
#This is additional filter, suggested by Dan Wu, to improve accuracy. We will check the text of the url as well.
texts = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()
#1. Texts are empty.
if texts == []:
print "Ni teksta za url: " + str(url)
#We found url that includes one of the magic words and also the text includes a magic word.
#We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
#item["text"] = text
item["url"] = url
#We return the item.
yield item
# 2. There are texts, one or more.
else:
#For the same partial url several texts are possible.
for text in texts:
if any(x in text for x in [
'zaposlovanje',
'Zaposlovanje',
'zaposlitev',
'Zaposlitev',
'zaposlitve',
'Zaposlitve',
'zaposlimo',
'Zaposlimo',
'ZAPOSLIMO',
'kariera',
'Kariera',
'delovna-mesta',
'delovna_mesta',
'pridruzi-se',
'pridruzi_se',
'oglas',
'Oglas',
'iscemo',
'Iscemo',
'ISCEMO',
'careers',
'Careers',
'jobs',
'Jobs',
'employment',
'Employment',
]):
#We found url that includes one of the magic words and also the text includes a magic word.
#We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["text"] = text
item["url"] = url
#We return the item.
yield item
#We don't put "else" sentence because we want to further explore the employment webpage to find possible new employment webpages.
#We keep looking for employment webpages, until we reach the DEPTH, that we have set in settings.py.
yield Request(url, callback = self.parse)
# We run the programme in the command line with this command:
# scrapy crawl jobs -o jobs.csv -t csv --logfile log.txt
# We get two output files
# 1) jobs.csv
# 2) log.txt
# Then we manually put one of employment urls from jobs.csv into read.py
I would be glad if you could give some advice on how to run this thing. Thank you, Marko
You should always install stuff into a virtualenv. Once you've got a virtualenv and it's active, do:
pip install --upgrade twisted pypiwin32
and you should get the dependency that makes Twisted support stdio on the Windows platform.
To get all the goodies you might try
pip install --upgrade twisted[windows_platform]
but you may run into problems with gmp.h if you try that, and you don't need most of it to do what you're trying to do.
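To check whether the fix took effect before starting a crawl, you can try the imports that failed in the traceback directly. A small diagnostic sketch (the function name is made up; win32api and twisted are the modules named in the traceback above):

```python
import importlib


def check_imports(module_names):
    """Try to import each module and report the error message, if any.

    Returns a dict mapping each module name to None on success, or to
    the ImportError message on failure (this also covers errors such
    as 'DLL load failed').
    """
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except ImportError as exc:
            results[name] = str(exc)
    return results


# Hypothetical usage inside the activated virtualenv:
# for name, error in check_imports(["win32api", "twisted"]).items():
#     print(name, "OK" if error is None else "FAILED: " + error)
```

If win32api still fails here, the problem lies with the pywin32 installation rather than with Scrapy or your spider.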

How can I call a custom Django manage.py command directly from a test driver?

I want to write a unit test for a Django manage.py command that does a backend operation on a database table. How would I invoke the management command directly from code?
I don't want to execute the command on the Operating System's shell from tests.py because I can't use the test environment set up using manage.py test (test database, test dummy email outbox, etc...)
The best way to test such things is to extract the needed functionality from the command itself into a standalone function or class. That separates the logic from the "command execution stuff" and lets you write tests without additional requirements.
But if for some reason you cannot decouple the logic from the command, you can call it from any code using the call_command function, like this:
from django.core.management import call_command
call_command('my_command', 'foo', bar='baz')
Rather than do the call_command trick, you can run your task by doing:
from myapp.management.commands import my_management_task
cmd = my_management_task.Command()
opts = {} # kwargs for your command -- lets you override stuff for testing...
cmd.handle_noargs(**opts)
The following code:
from django.core.management import call_command
call_command('collectstatic', verbosity=3, interactive=False)
call_command('migrate', 'myapp', verbosity=3, interactive=False)
...is equal to the following commands typed in terminal:
$ ./manage.py collectstatic --noinput -v 3
$ ./manage.py migrate myapp --noinput -v 3
See running management commands from django docs.
The Django documentation on call_command fails to mention that sys.stdout must be redirected to out. The example code should read:
from django.core.management import call_command
from django.test import TestCase
from django.utils.six import StringIO
import sys
class ClosepollTest(TestCase):
    def test_command_output(self):
        out = StringIO()
        # Restore the real stdout once the test finishes.
        self.addCleanup(setattr, sys, 'stdout', sys.stdout)
        sys.stdout = out
        call_command('closepoll', stdout=out)
        self.assertIn('Expected output', out.getvalue())
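If the command prints directly instead of writing to self.stdout, replacing sys.stdout by hand risks leaving it replaced after the test. On Python 3, contextlib.redirect_stdout restores it automatically. A minimal sketch, with a stand-in function where a real test would use call_command (the stand-in and the helper name are hypothetical):

```python
import io
import sys
from contextlib import redirect_stdout


def fake_command():
    """Stand-in for a management command that uses print()."""
    print("Expected output")


def capture_stdout(func, *args, **kwargs):
    """Run func and return whatever it printed to sys.stdout."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


original_stdout = sys.stdout
output = capture_stdout(fake_command)
# The real stdout is back in place once the with-block has exited.
assert sys.stdout is original_stdout
```

In a test case, the equivalent call would be roughly capture_stdout(call_command, 'closepoll'), followed by an assertIn on the returned string.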
Building on Nate's answer I have this:
from optparse import OptionParser
def make_test_wrapper_for(command_module):
    def _run_cmd_with(*args):
        """Run the possibly_add_alert command with the supplied arguments"""
        cmd = command_module.Command()
        (opts, args) = OptionParser(option_list=cmd.option_list).parse_args(list(args))
        cmd.handle(*args, **vars(opts))
    return _run_cmd_with
Usage:
from myapp.management import mycommand
cmd_runner = make_test_wrapper_for(mycommand)
cmd_runner("foo", "bar")
The advantage here is that if you've used additional options and optparse, this will sort them out for you. It isn't quite perfect, and it doesn't pipe outputs yet, but it will use the test database. You can then test for database effects.
I am sure that using Michael Foord's mock module and rewiring stdout for the duration of a test would let you get more out of this technique too: testing the output, exit conditions, etc.
A more advanced way to run a management command, with flexible arguments and captured output:
argv = self.build_argv(short_dict=kwargs)
cmd = self.run_manage_command_raw(YourManageCommandClass, argv=argv)
# Output is available via cmd.stdout.getvalue() / cmd.stderr.getvalue()
Add this code to your base test class (it needs io and mock available at module level, e.g. import io and from unittest import mock):
@classmethod
def build_argv(cls, *positional, short_names=None, long_names=None, short_dict=None, **long_dict):
    """
    Build an argv list which can be passed to a manage command's "run_from_argv"
    1) positional will be passed first as is
    2) short_names will be passed next with a one-dash (-) prefix
    3) long_names will be passed next with a two-dash (--) prefix
    4) short_dict will be passed next with a one-dash (-) prefixed key and the next item as value
    5) long_dict will be passed next with a two-dash (--) prefixed key and the next item as value
    """
    argv = [__file__, None] + list(positional)[:]
    for name in short_names or []:
        argv.append(f'-{name}')
    for name in long_names or []:
        argv.append(f'--{name}')
    for name, value in (short_dict or {}).items():
        argv.append(f'-{name}')
        argv.append(str(value))
    for name, value in long_dict.items():
        argv.append(f'--{name}')
        argv.append(str(value))
    return argv
@classmethod
def run_manage_command_raw(cls, cmd_class, argv):
    """Run any manage.py command as a Python object"""
    command = cmd_class(stdout=io.StringIO(), stderr=io.StringIO())
    with mock.patch('django.core.management.base.connections.close_all'):
        # patch to prevent closing the db connection
        command.run_from_argv(argv)
    return command