I am building a "Scrapy spiders control panel" and am testing an existing solution, [Distributed Multi-user Scrapy Spiders Control Panel] https://github.com/aaldaber/Distributed-Multi-User-Scrapy-System-with-a-Web-UI.
I am trying to run it on my local Ubuntu dev machine but am having issues with the scrapyd daemon.
One of the workers, linkgenerator, is working, but the scraper, worker1, is not.
I cannot figure out why scrapyd won't run on the other local instance.
Background Information about the configuration.
The application comes bundled with Django, Scrapy, a pipeline for MongoDB (for saving the scraped items) and a Scrapy scheduler for RabbitMQ (for distributing the links among workers). I have 2 local Ubuntu instances: Django, MongoDB, a scrapyd daemon and the RabbitMQ server run on Instance1.
A second scrapyd daemon runs on Instance2.
RabbitMQ Workers:
linkgenerator
worker1
IP Configurations for Instances:
IP For local Ubuntu Instance1: 192.168.0.101
IP for local Ubuntu Instance2: 192.168.0.106
List of tools used:
MongoDB server
RabbitMQ server
Scrapy Scrapyd API
One RabbitMQ link generator worker (WorkerName: linkgenerator): a server with Scrapy installed, running the scrapyd daemon on local Ubuntu Instance1 (192.168.0.101)
One RabbitMQ scraper worker (WorkerName: worker1): a server with Scrapy installed, running the scrapyd daemon on local Ubuntu Instance2 (192.168.0.106)
Instance1: 192.168.0.101
Django, RabbitMQ and the scrapyd daemon run on Instance1.
Instance2: 192.168.0.106
Scrapy is installed on Instance2, which runs a scrapyd daemon.
Scrapy Control Panel UI Snapshot:
The snapshot shows the control panel with two workers: linkgenerator ran successfully but worker1 did not. The logs are given at the end of the post.
RabbitMQ status info
The linkgenerator worker can successfully push messages to the RabbitMQ queue. The linkgenerator spider generates start_urls for the scraper spider, which are meant to be consumed by the scraper (worker1), but worker1 is not working. Please see the logs for worker1 at the end of the post.
RabbitMQ settings
The below file contains the settings for MongoDB and RabbitMQ:
SCHEDULER = ".rabbitmq.scheduler.Scheduler"
SCHEDULER_PERSIST = True
RABBITMQ_HOST = 'ScrapyDevU79'
RABBITMQ_PORT = 5672
RABBITMQ_USERNAME = 'guest'
RABBITMQ_PASSWORD = 'guest'
MONGODB_PUBLIC_ADDRESS = 'OneScience:27017' # This will be shown on the web interface, but won't be used for connecting to DB
MONGODB_URI = 'localhost:27017' # Actual uri to connect to DB
MONGODB_USER = 'tariq'
MONGODB_PASSWORD = 'toor'
MONGODB_SHARDED = True
MONGODB_BUFFER_DATA = 100
# Set your link generator worker address here
LINK_GENERATOR = 'http://192.168.0.101:6800'
SCRAPERS = ['http://192.168.0.106:6800']
LINUX_USER_CREATION_ENABLED = False # Set this to True if you want a linux user account
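Before digging into the workers, it may help to confirm that every scrapyd address in these settings answers at all. The sketch below only builds the status URLs and, optionally, fetches them. `daemonstatus.json` is one of the services listed in the scrapyd.conf files in this post; the helper names `status_urls` and `fetch_status` are made up for illustration.

```python
import json
from urllib.request import urlopen

# Values copied from the settings shown above.
LINK_GENERATOR = 'http://192.168.0.101:6800'
SCRAPERS = ['http://192.168.0.106:6800']


def status_urls(link_generator, scrapers):
    """Return the daemonstatus.json URL for every configured scrapyd daemon."""
    hosts = [link_generator] + list(scrapers)
    return [host.rstrip('/') + '/daemonstatus.json' for host in hosts]


def fetch_status(url, timeout=5):
    """Fetch one status URL and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


if __name__ == '__main__':
    for url in status_urls(LINK_GENERATOR, SCRAPERS):
        print(url)
        # Uncomment on a machine that can reach the instances:
        # print(fetch_status(url))
```

Running the fetch from Instance1 and again from Instance2 would show whether each daemon is reachable from both sides, which the access logs alone do not prove.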
linkgenerator scrapy.cfg settings:
[settings]
default = tester2_fda_trial20.settings
[deploy:linkgenerator]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
scraper scrapy.cfg settings:
[settings]
default = tester2_fda_trial20.settings
[deploy:worker1]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
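The deploy target in this file can be compared against SCRAPERS programmatically. The sketch below parses the worker1 configuration exactly as pasted above and extracts its deploy URL; `deploy_targets` is a hypothetical helper, not part of Scrapy. It shows that worker1 is set to deploy to 192.168.0.101, whereas SCRAPERS lists 192.168.0.106. Whether that difference is intended, or related to the failure, is not something this post establishes.

```python
from configparser import ConfigParser

# The scraper scrapy.cfg as shown above.
WORKER1_CFG = """
[settings]
default = tester2_fda_trial20.settings

[deploy:worker1]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
"""

# From the project settings file.
SCRAPERS = ['http://192.168.0.106:6800']


def deploy_targets(cfg_text):
    """Map each [deploy:<name>] section to its url."""
    parser = ConfigParser()
    parser.read_string(cfg_text)
    targets = {}
    for section in parser.sections():
        if section.startswith('deploy:'):
            targets[section.split(':', 1)[1]] = parser.get(section, 'url')
    return targets


targets = deploy_targets(WORKER1_CFG)
for name, url in targets.items():
    listed = url in SCRAPERS
    print('%s deploys to %s (listed in SCRAPERS: %s)' % (name, url, listed))
```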
scrapyd.conf file settings for Instance1 (192.168.0.101)
cat /etc/scrapyd/scrapyd.conf
[scrapyd]
eggs_dir = /var/lib/scrapyd/eggs
dbs_dir = /var/lib/scrapyd/dbs
items_dir = /var/lib/scrapyd/items
logs_dir = /var/log/scrapyd
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port = 6800
debug = on
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
scrapyd.conf file settings for Instance2 (192.168.0.106)
cat /etc/scrapyd/scrapyd.conf
[scrapyd]
eggs_dir = /var/lib/scrapyd/eggs
dbs_dir = /var/lib/scrapyd/dbs
items_dir = /var/lib/scrapyd/items
logs_dir = /var/log/scrapyd
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port = 6800
debug = on
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
RabbitMQ Status
sudo service rabbitmq-server status
[sudo] password for mtaziz:
Status of node rabbit@ScrapyDevU79
[{pid,53715},
{running_applications,
[{rabbitmq_shovel_management,
"Management extension for the Shovel plugin","3.6.11"},
{rabbitmq_shovel,"Data Shovel for RabbitMQ","3.6.11"},
{rabbitmq_management,"RabbitMQ Management Console","3.6.11"},
{rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.11"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.11"},
{rabbit,"RabbitMQ","3.6.11"},
{os_mon,"CPO CXC 138 46","2.2.14"},
{cowboy,"Small, fast, modular HTTP server.","1.0.4"},
{ranch,"Socket acceptor pool for TCP protocols.","1.3.0"},
{ssl,"Erlang/OTP SSL application","5.3.2"},
{public_key,"Public key infrastructure","0.21"},
{cowlib,"Support library for manipulating Web protocols.","1.0.2"},
{crypto,"CRYPTO version 2","3.2"},
{amqp_client,"RabbitMQ AMQP Client","3.6.11"},
{rabbit_common,
"Modules shared by rabbitmq-server and rabbitmq-erlang-client",
"3.6.11"},
{inets,"INETS CXC 138 49","5.9.7"},
{mnesia,"MNESIA CXC 138 12","4.11"},
{compiler,"ERTS CXC 138 10","4.9.4"},
{xmerl,"XML parser","1.3.5"},
{syntax_tools,"Syntax tools","1.6.12"},
{asn1,"The Erlang ASN1 compiler version 2.0.4","2.0.4"},
{sasl,"SASL CXC 138 11","2.3.4"},
{stdlib,"ERTS CXC 138 10","1.19.4"},
{kernel,"ERTS CXC 138 10","2.16.4"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R16B03 (erts-5.10.4) [source] [64-bit] [smp:4:4] [async-threads:64] [kernel-poll:true]\n"},
{memory,
[{connection_readers,0},
{connection_writers,0},
{connection_channels,0},
{connection_other,6856},
{queue_procs,145160},
{queue_slave_procs,0},
{plugins,1959248},
{other_proc,22328920},
{metrics,160112},
{mgmt_db,655320},
{mnesia,83952},
{other_ets,2355800},
{binary,96920},
{msg_index,47352},
{code,27101161},
{atom,992409},
{other_system,31074022},
{total,87007232}]},
{alarms,[]},
{listeners,[{clustering,25672,"::"},{amqp,5672,"::"},{http,15672,"::"}]},
{vm_memory_calculation_strategy,rss},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,3343646720},
{disk_free_limit,50000000},
{disk_free,56257699840},
{file_descriptors,
[{total_limit,924},{total_used,2},{sockets_limit,829},{sockets_used,0}]},
{processes,[{limit,1048576},{used,351}]},
{run_queue,0},
{uptime,34537},
{kernel,{net_ticktime,60}}]
scrapyd daemon on Instance1 ( 192.168.0.101 ) running status:
scrapyd
2017-09-11T06:16:07+0600 [-] Loading /home/mtaziz/.virtualenvs/onescience_dist_env/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:16:07+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:16:07+0600 [-] Loaded.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/onescience_dist_env/bin/python 2.7.6) starting up.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:16:07+0600 [-] Site starting on 6800
2017-09-11T06:16:07+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7f5e265c77a0>
2017-09-11T06:16:07+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
scrapyd daemon on instance2 (192.168.0.106) running status:
scrapyd
2017-09-11T06:09:28+0600 [-] Loading /home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:09:28+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:09:28+0600 [-] Loaded.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/scrapydevenv/bin/python 2.7.6) starting up.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:09:28+0600 [-] Site starting on 6800
2017-09-11T06:09:28+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7fbe6eaeac20>
2017-09-11T06:09:28+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
worker1 logs
These logs were taken after updating the RabbitMQ server settings as suggested by @Tarun Lalwani.
The suggestion was to use the RabbitMQ server IP, 192.168.0.101:5672, instead of 127.0.0.1:5672.
After making that change I got the new problem shown below:
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tester2_fda_trial20)
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tester2_fda_trial20.spiders', 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['tester2_fda_trial20.spiders'], 'BOT_NAME': 'tester2_fda_trial20', 'FEED_URI': 'file:///var/lib/scrapyd/items/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.jl', 'SCHEDULER': 'tester2_fda_trial20.rabbitmq.scheduler.Scheduler', 'TELNETCONSOLE_ENABLED': False, 'LOG_FILE': '/var/log/scrapyd/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.log'}
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled item pipelines:
['tester2_fda_trial20.pipelines.FdaTrial20Pipeline',
'tester2_fda_trial20.mongodb.scrapy_mongodb.MongoDBPipeline']
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider opened
2017-09-11 15:49:18 [pika.adapters.base_connection] INFO: Connecting to 192.168.0.101:5672
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Created channel=1
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Closing spider (shutdown)
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [pika.channel] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.close_spider of <scrapy.extensions.feedexport.FeedExporter object at 0x7f94878b8c50>>
Traceback (most recent call last):
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 201, in close_spider
slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_closed of <Tester2Fda_Trial20Spider 'tester2_fda_trial20' at 0x7f9484f897d0>>
Traceback (most recent call last):
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/tmp/user/1000/tester2_fda_trial20-10-d4Req9.egg/tester2_fda_trial20/spiders/tester2_fda_trial20.py", line 28, in spider_closed
AttributeError: 'Tester2Fda_Trial20Spider' object has no attribute 'statstask'
2017-09-11 15:49:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2017, 9, 11, 9, 49, 18, 159896),
'log_count/ERROR': 2,
'log_count/INFO': 10}
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider closed (shutdown)
2017-09-11 15:49:18 [twisted] CRITICAL: Unhandled error in Deferred:
2017-09-11 15:49:18 [twisted] CRITICAL:
Traceback (most recent call last):
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 95, in crawl
six.reraise(*exc_info)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 79, in crawl
yield self.engine.open_spider(self.spider, start_requests)
OperationFailure: command SON([('saslStart', 1), ('mechanism', 'SCRAM-SHA-1'), ('payload', Binary('n,,n=tariq,r=MjY5OTQ0OTYwMjA4', 0)), ('autoAuthorize', 1)]) on namespace admin.$cmd failed: Authentication failed.
MongoDBPipeline
# coding:utf-8
import datetime

from pymongo import errors
from pymongo.mongo_client import MongoClient
from pymongo.mongo_replica_set_client import MongoReplicaSetClient
from pymongo.read_preferences import ReadPreference
from scrapy.exporters import BaseItemExporter

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2


def not_set(string):
    """ Check if a string is None or ''

    :returns: bool - True if the string is empty
    """
    if string is None:
        return True
    elif string == '':
        return True
    return False


class MongoDBPipeline(BaseItemExporter):
    """ MongoDB pipeline class """
    # Default options
    config = {
        'uri': 'mongodb://localhost:27017',
        'fsync': False,
        'write_concern': 0,
        'database': 'scrapy-mongodb',
        'collection': 'items',
        'replica_set': None,
        'buffer': None,
        'append_timestamp': False,
        'sharded': False
    }

    # Needed for sending acknowledgement signals to RabbitMQ for all persisted items
    queue = None
    acked_signals = []

    # Item buffer
    item_buffer = dict()

    def load_spider(self, spider):
        self.crawler = spider.crawler
        self.settings = spider.settings
        self.queue = self.crawler.engine.slot.scheduler.queue

    def open_spider(self, spider):
        self.load_spider(spider)

        # Configure the connection
        self.configure()

        self.spidername = spider.name
        # Credentials and host are joined with '@', as in any MongoDB connection string
        self.config['uri'] = 'mongodb://' + self.config['username'] + ':' + quote(self.config['password']) + '@' + self.config['uri'] + '/admin'
        self.shardedcolls = []

        if self.config['replica_set'] is not None:
            self.connection = MongoReplicaSetClient(
                self.config['uri'],
                replicaSet=self.config['replica_set'],
                w=self.config['write_concern'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY_PREFERRED)
        else:
            # Connecting to a stand alone MongoDB
            self.connection = MongoClient(
                self.config['uri'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY)

        # Set up the collection
        self.database = self.connection[spider.name]

        # Autoshard the DB
        if self.config['sharded']:
            db_statuses = self.connection['config']['databases'].find({})
            partitioned = []
            notpartitioned = []
            for status in db_statuses:
                if status['partitioned']:
                    partitioned.append(status['_id'])
                else:
                    notpartitioned.append(status['_id'])
            if spider.name in notpartitioned or spider.name not in partitioned:
                try:
                    self.connection.admin.command('enableSharding', spider.name)
                except errors.OperationFailure:
                    pass
            else:
                collections = self.connection['config']['collections'].find({})
                for coll in collections:
                    if (spider.name + '.') in coll['_id']:
                        if coll['dropped'] is not True:
                            if coll['_id'].index(spider.name + '.') == 0:
                                self.shardedcolls.append(coll['_id'][coll['_id'].index('.') + 1:])

    def configure(self):
        """ Configure the MongoDB connection """

        # Set all regular options
        options = [
            ('uri', 'MONGODB_URI'),
            ('fsync', 'MONGODB_FSYNC'),
            ('write_concern', 'MONGODB_REPLICA_SET_W'),
            ('database', 'MONGODB_DATABASE'),
            ('collection', 'MONGODB_COLLECTION'),
            ('replica_set', 'MONGODB_REPLICA_SET'),
            ('buffer', 'MONGODB_BUFFER_DATA'),
            ('append_timestamp', 'MONGODB_ADD_TIMESTAMP'),
            ('sharded', 'MONGODB_SHARDED'),
            ('username', 'MONGODB_USER'),
            ('password', 'MONGODB_PASSWORD')
        ]

        for key, setting in options:
            if not not_set(self.settings[setting]):
                self.config[key] = self.settings[setting]

    def process_item(self, item, spider):
        """ Process the item and add it to MongoDB

        :type item: Item object
        :param item: The item to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        item_name = item.__class__.__name__

        # If we are working with a sharded DB, the collection will also be sharded
        if self.config['sharded']:
            if item_name not in self.shardedcolls:
                try:
                    self.connection.admin.command('shardCollection', '%s.%s' % (self.spidername, item_name), key={'_id': "hashed"})
                    self.shardedcolls.append(item_name)
                except errors.OperationFailure:
                    self.shardedcolls.append(item_name)

        itemtoinsert = dict(self._get_serialized_fields(item))

        if self.config['buffer']:
            if item_name not in self.item_buffer:
                self.item_buffer[item_name] = []
                self.item_buffer[item_name].append([])
                self.item_buffer[item_name].append(0)

            self.item_buffer[item_name][1] += 1

            if self.config['append_timestamp']:
                itemtoinsert['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}

            self.item_buffer[item_name][0].append(itemtoinsert)

            if self.item_buffer[item_name][1] == self.config['buffer']:
                self.item_buffer[item_name][1] = 0
                self.insert_item(self.item_buffer[item_name][0], spider, item_name)

            return item

        self.insert_item(itemtoinsert, spider, item_name)
        return item

    def close_spider(self, spider):
        """ Method called when the spider is closed

        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: None
        """
        for key in self.item_buffer:
            if self.item_buffer[key][0]:
                self.insert_item(self.item_buffer[key][0], spider, key)

    def insert_item(self, item, spider, item_name):
        """ Process the item and add it to MongoDB

        :type item: (Item object) or [(Item object)]
        :param item: The item(s) to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        self.collection = self.database[item_name]

        if not isinstance(item, list):
            if self.config['append_timestamp']:
                item['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}

            ack_signal = item['ack_signal']
            item.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            if ack_signal not in self.acked_signals:
                self.queue.acknowledge(ack_signal)
                self.acked_signals.append(ack_signal)
        else:
            signals = []
            for eachitem in item:
                signals.append(eachitem['ack_signal'])
                eachitem.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            del item[:]
            for ack_signal in signals:
                if ack_signal not in self.acked_signals:
                    self.queue.acknowledge(ack_signal)
                    self.acked_signals.append(ack_signal)
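
The connection string that open_spider assembles can be reproduced in isolation to see which host a worker will try to authenticate against. The sketch assumes the separator between credentials and host is '@', as MongoDB connection strings require, and the function name `build_mongo_uri` is made up for illustration. With MONGODB_URI = 'localhost:27017' from the settings, the result points at localhost, which on worker1 would be Instance2 itself. Whether that explains the "Authentication failed" error is an assumption the logs alone do not confirm.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2


def build_mongo_uri(username, password, host):
    """Mirror the string concatenation done in open_spider."""
    return 'mongodb://' + username + ':' + quote(password) + '@' + host + '/admin'


# Values taken from the settings file shown earlier in the post.
print(build_mongo_uri('tariq', 'toor', 'localhost:27017'))
```

Printing the URI on each instance is a cheap way to check which MongoDB server, and which credentials, a given worker is actually using.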
To sum up, I believe the problem lies in the scrapyd daemons: they are running on both instances, but somehow the scraper (worker1) cannot access them. I could not figure it out, and I did not find any similar cases on Stack Overflow.
Any help is highly appreciated in this regard. Thank you in advance!
I am trying to use Apache Airflow to create a workflow. I've installed Airflow manually in my own Anaconda kernel on a server.
Here is the way I run a simple DAG
export AIRFLOW_HOME=~/airflow/airflow_home # my airflow home
export AIRFLOW=~/.conda/.../lib/python2.7/site-packages/airflow/bin
export PATH=~/.conda/.../bin:$AIRFLOW:$PATH # my kernel
When I do the same thing using airflow test, it works for a particular task independently. For example, in dag1: task1 >> task2
airflow test dag1 task2 2017-06-22
I expected it to run task1 first and then task2, but it just runs task2 on its own.
Do you have any idea about this? Thank you very much in advance!
Here is my code:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta


default_args = {
    'owner': 'txuantu',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['tran.xuantu@axa.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG(
    'tutorial', default_args=default_args, schedule_interval=timedelta(1))


def python_op1(ds, **kwargs):
    print(ds)
    return 0


def python_op2(ds, **kwargs):
    print(str(kwargs))
    return 0

# t1, t2 and t3 are examples of tasks created by instantiating operators
# t1 = BashOperator(
#     task_id='bash_operator',
#     bash_command='echo {{ ds }}',
#     dag=dag)
t1 = PythonOperator(
    task_id='python_operator1',
    python_callable=python_op1,
    # provide_context=True,
    dag=dag)


t2 = PythonOperator(
    task_id='python_operator2',
    python_callable=python_op2,
    # provide_context=True,
    dag=dag)

t2.set_upstream(t1)
Airflow: v1.8.0
Using executor SequentialExecutor with SQLite
airflow run tutorial python_operator2 2015-06-01
Here is the error message:
[2017-06-28 22:49:15,336] {models.py:167} INFO - Filling up the DagBag from /home/txuantu/airflow/airflow_home/dags
[2017-06-28 22:49:16,069] {base_executor.py:50} INFO - Adding to queue: airflow run tutorial python_operator2 2015-06-01T00:00:00 --mark_success --local -sd DAGS_FOLDER/tutorial.py
[2017-06-28 22:49:16,072] {sequential_executor.py:40} INFO - Executing command: airflow run tutorial python_operator2 2015-06-01T00:00:00 --mark_success --local -sd DAGS_FOLDER/tutorial.py
[2017-06-28 22:49:16,765] {models.py:167} INFO - Filling up the DagBag from /home/txuantu/airflow/airflow_home/dags/tutorial.py
[2017-06-28 22:49:16,986] {base_task_runner.py:112} INFO - Running: ['bash', '-c', u'airflow run tutorial python_operator2 2015-06-01T00:00:00 --mark_success --job_id 1 --raw -sd DAGS_FOLDER/tutorial.py']
[2017-06-28 22:49:17,373] {base_task_runner.py:95} INFO - Subtask: [2017-06-28 22:49:17,373] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-06-28 22:49:17,694] {base_task_runner.py:95} INFO - Subtask: [2017-06-28 22:49:17,693] {models.py:167} INFO - Filling up the DagBag from /home/txuantu/airflow/airflow_home/dags/tutorial.py
[2017-06-28 22:49:17,899] {base_task_runner.py:95} INFO - Subtask: [2017-06-28 22:49:17,899] {models.py:1120} INFO - Dependencies not met for <TaskInstance: tutorial.python_operator2 2015-06-01 00:00:00 [None]>, dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 1 non-success(es). upstream_tasks_state={'successes': 0, 'failed': 0, 'upstream_failed': 0, 'skipped': 0, 'done': 0}, upstream_task_ids=['python_operator1']
[2017-06-28 22:49:22,011] {jobs.py:2083} INFO - Task exited with return code 0
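The 'Trigger Rule' line in the log can be read as a simple check over the recorded states of the upstream tasks. The following is a toy model of that check, not Airflow's implementation; `all_success_met` and the state strings are illustrative assumptions.

```python
def all_success_met(upstream_states):
    """Return True only if every upstream task is recorded as 'success'.

    upstream_states maps task_id -> state string, or None if the task
    has not run for this execution date.
    """
    return all(state == 'success' for state in upstream_states.values())


# python_operator1 has never run for 2015-06-01, so its state is None.
print(all_success_met({'python_operator1': None}))

# Once python_operator1 is recorded as successful, the check passes.
print(all_success_met({'python_operator1': 'success'}))
```

In this model, running python_operator2 alone fails the check because nothing has recorded a success for python_operator1, which matches the 'successes': 0 count in the log.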
If you only want to run python_operator2, you should execute:
airflow run tutorial python_operator2 2015-06-01 --ignore_dependencies
If you want to execute the entire dag and execute both tasks, use trigger_dag:
airflow trigger_dag tutorial
For reference, airflow test will "run a task without checking for dependencies."
Documentation for all three commands can be found at https://airflow.incubator.apache.org/cli.html
Finally, I found an answer to my problem. I had assumed Airflow loads lazily, but it seems it does not. So instead of:
t2.set_upstream(t1)
It should be:
t1.set_downstream(t2)
I'm building a wiki and have run into a problem where the result during debugging differs from the result during a normal run.
class WikiPage(Handler):
    def get(self, id):
        id, existing_article, article_content = self.get_stripped_id_article_content(id)
        logging.error(("GET! id: {0}; article: {1}; content: {2}").format(id, existing_article, article_content))
        <...>

    def strip_id(self, id):
        id = id.replace("/", "")
        return id

    def get_stripped_id_article_content(self, id):
        id = self.strip_id(id)
        q = Article.all()
        q.filter("id = ", id)
        existing_article = q.get()
        content = existing_article.content if existing_article else ""
        return id, existing_article, content

<...>

class CreateEditPage(WikiPage):
    def post(self, id):
        id, existing_article, article_content = self.get_stripped_id_article_content(id)
        user = self.get_user(self.request)
        input_content = self.request.get("content")
        if existing_article:
            existing_article.content = input_content
            existing_article.put()
        else:
            new_article = Article(id = id, content = input_content)
            new_article.put()
            id, existing_article, article_content = self.get_stripped_id_article_content(id)
            logging.error(("POST! In Else. id: {0}; article: {1}; content: {2}").format(id, existing_article, article_content))
        id, existing_article, article_content = self.get_stripped_id_article_content(id)
        logging.error(("POST! Outside Else. id: {0}; article: {1}; content: {2}").format(id, existing_article, article_content))
        self.redirect("/" + str(id))


PAGE_RE = r'(/(?:[a-zA-Z0-9_-]+/?)*)'
app = webapp2.WSGIApplication([('/_edit' + PAGE_RE, CreateEditPage),
                               (PAGE_RE, WikiPage),
                               ],
                              debug=True)
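
To see which id the handlers receive, the route patterns can be exercised with the re module alone. This sketch assumes the router matches a pattern against the whole path (hence the explicit ^ and $) and copies strip_id as a plain function; `captured_id` is a hypothetical helper.

```python
import re

PAGE_RE = r'(/(?:[a-zA-Z0-9_-]+/?)*)'


def strip_id(id):
    """Same logic as WikiPage.strip_id: drop every slash."""
    return id.replace("/", "")


def captured_id(pattern, path):
    """Return the stripped first capture group, or None if the path does not match."""
    match = re.match('^' + pattern + '$', path)
    return strip_id(match.group(1)) if match else None


print(repr(captured_id(PAGE_RE, '/')))
print(repr(captured_id(PAGE_RE, '/new_article')))
print(repr(captured_id('/_edit' + PAGE_RE, '/_edit/new_article')))
```

Under these assumptions the root path yields an empty id, and both /new_article and /_edit/new_article yield the id new_article. PAGE_RE on its own also matches /_edit/new_article, which is presumably why the /_edit route is listed first.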
First of all, I clear all the content from the database and flush the memcache.
I run my application:
INFO 2015-10-14 21:06:52,744 sdk_update_checker.py:229] Checking for updates to the SDK.
INFO 2015-10-14 21:06:53,135 api_server.py:205] Starting API server at: http://localhost:53588
INFO 2015-10-14 21:06:53,141 dispatcher.py:197] Starting module "default" running at: http://localhost:8080
INFO 2015-10-14 21:06:53,142 admin_server.py:118] Starting admin server at: http://localhost:8000
ERROR 2015-10-14 21:10:10,804 gmv_wiki.py:194] GET! id: ; article: None; content:
INFO 2015-10-14 21:10:10,810 module.py:809] default: "GET / HTTP/1.1" 302 -
INFO 2015-10-14 21:10:10,847 module.py:809] default: "GET /_edit/ HTTP/1.1" 200 348
ERROR 2015-10-14 21:10:12,228 gmv_wiki.py:194] GET! id: newpost; article: None; content:
INFO 2015-10-14 21:10:12,239 module.py:809] default: "GET /newpost HTTP/1.1" 302 -
INFO 2015-10-14 21:10:12,264 module.py:809] default: "GET /_edit/newpost HTTP/1.1" 200 348
ERROR 2015-10-14 21:10:18,945 gmv_wiki.py:194] GET! id: new_article; article: None; content:
INFO 2015-10-14 21:10:18,951 module.py:809] default: "GET /new_article HTTP/1.1" 302 -
INFO 2015-10-14 21:10:18,976 module.py:809] default: "GET /_edit/new_article HTTP/1.1" 200 348
ERROR 2015-10-14 21:10:50,979 gmv_wiki.py:246] POST! In Else. id: new_article; article: None; content:
ERROR 2015-10-14 21:10:50,985 gmv_wiki.py:251] POST! Outside Else. id: new_article; article: None; content:
INFO 2015-10-14 21:10:50,991 module.py:809] default: "POST /_edit/new_article HTTP/1.1" 302 -
ERROR 2015-10-14 21:10:51,014 gmv_wiki.py:194] GET! id: new_article; article: None; content:
INFO 2015-10-14 21:10:51,022 module.py:809] default: "GET /new_article HTTP/1.1" 302 -
INFO 2015-10-14 21:10:51,051 module.py:809] default: "GET /_edit/new_article HTTP/1.1" 200 348
ERROR 2015-10-14 21:11:18,321 gmv_wiki.py:251] POST! Outside Else. id: new_article; article: <gmv_wiki.Article object at 0x7f684d2ba250>; content: New article content.
INFO 2015-10-14 21:11:18,326 module.py:809] default: "POST /_edit/new_article HTTP/1.1" 302 -
ERROR 2015-10-14 21:11:18,351 gmv_wiki.py:194] GET! id: new_article; article: <gmv_wiki.Article object at 0x7f684d2663d0>; content: New article content.
INFO 2015-10-14 21:11:18,358 module.py:809] default: "GET /new_article HTTP/1.1" 200 262
I debug my application:
pydev debugger: starting (pid: 10230)
INFO 2015-10-14 21:12:24,730 sdk_update_checker.py:229] Checking for updates to the SDK.
INFO 2015-10-14 21:12:25,435 api_server.py:205] Starting API server at: http://localhost:44302
INFO 2015-10-14 21:12:25,463 dispatcher.py:197] Starting module "default" running at: http://localhost:8080
INFO 2015-10-14 21:12:25,467 admin_server.py:118] Starting admin server at: http://localhost:8000
pydev debugger: starting (pid: 10250)
ERROR 2015-10-14 21:12:36,523 gmv_wiki.py:194] GET! id: new_article_1; article: None; content:
INFO 2015-10-14 21:12:36,547 module.py:809] default: "GET /new_article_1 HTTP/1.1" 302 -
INFO 2015-10-14 21:12:36,766 module.py:809] default: "GET /_edit/new_article_1 HTTP/1.1" 200 348
ERROR 2015-10-14 21:12:54,394 gmv_wiki.py:246] POST! In Else. id: new_article_1; article: None; content:
ERROR 2015-10-14 21:12:54,430 gmv_wiki.py:251] POST! Outside Else. id: new_article_1; article: None; content:
INFO 2015-10-14 21:12:54,452 module.py:809] default: "POST /_edit/new_article_1 HTTP/1.1" 302 -
ERROR 2015-10-14 21:12:54,587 gmv_wiki.py:194] GET! id: new_article_1; article: <gmv_wiki.Article object at 0x7fad181bef10>; content: New article 1 content.
INFO 2015-10-14 21:12:54,610 module.py:809] default: "GET /new_article_1 HTTP/1.1" 200 266
Please have a look at the logging.error calls in the code and at the results in the log.
The two situations are exactly the same:
http://localhost:8080/new_article # ordinary run
http://localhost:8080/new_article_1 # debug
The behaviour of the program is as follows:
1. During debugging everything works well: the edit form for a new article is created, and when I press "Submit" the content is saved to the database and I'm redirected to the wiki page of that article.
2. When I just run the program (without debugging), the edit form for a new article opens, I enter something and press "Submit". I'm then redirected to the wiki page, but no wiki page is found at that address, so I'm redirected back to the edit page. When I enter the content a second time, it is finally saved to the database and the wiki page opens.
What puzzles me:
1. Why is there a difference between running and debugging?
2. Why, after I have called new_article.put(), is no article object found in the post function?
3. Then I'm redirected to the wiki page, and somehow the article object is found. Strange.
4. Why does existing_article.put() work well whereas new_article.put() doesn't place anything in the database?
Well, I seem to be at my wits' end. Could you please help me with these problems? Thank you in advance.
I suspect you're running into GAE datastore eventual consistency issues. Basically, a new article saved with new_article.put() won't be found by the Article.all() query for a (typically short) while. You can still access it by id, though you'd have to adjust your code for that.
I suspect the difference in behaviour is caused by the lower overall speed of debug-mode execution, which can hide the eventual consistency issues.
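The effect described above can be imitated with a toy in-memory store in which key lookups see a write immediately while queries read from an index that is updated later. This is only a model of eventual consistency, not the App Engine datastore API; `ToyStore` and its methods are invented for the illustration.

```python
class ToyStore(object):
    """Entities are visible by key at once; queries lag behind."""

    def __init__(self):
        self._entities = {}   # strongly consistent: key -> content
        self._index = {}      # eventually consistent copy used by queries
        self._pending = []    # writes not yet applied to the index

    def put(self, key, content):
        self._entities[key] = content
        self._pending.append((key, content))

    def get_by_key(self, key):
        return self._entities.get(key)

    def query(self, key):
        return self._index.get(key)

    def apply_pending(self):
        """Simulate the index catching up after a short delay."""
        for key, content in self._pending:
            self._index[key] = content
        self._pending = []


store = ToyStore()
store.put('new_article', 'New article content.')
print(store.query('new_article'))       # query issued right after put
print(store.get_by_key('new_article'))  # lookup by key right after put
store.apply_pending()
print(store.query('new_article'))       # query once the index has caught up
```

In the old `db` API, the key-based path would mean creating the entity with a key name, for example Article(key_name=id, ...), and reading it back with Article.get_by_key_name(id). Treat that as a direction to check against the documentation rather than a drop-in fix for the code in the question.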