2 RabbitMQ workers and 2 Scrapyd daemons running on 2 local Ubuntu instances, in which one of the RabbitMQ workers is not working - django

I am currently building a "Scrapy spiders control panel", for which I am testing the existing solution available at [Distributed Multi-User Scrapy System with a Web UI] https://github.com/aaldaber/Distributed-Multi-User-Scrapy-System-with-a-Web-UI.
I am trying to run this on my local Ubuntu dev machines but am having issues with the scrapyd daemon.
One of the workers, linkgenerator, is working, but the scraper worker (worker1) is not.
I cannot figure out why scrapyd won't run properly on the other local instance.
Background information about the configuration:
The application comes bundled with Django, Scrapy, a pipeline for MongoDB (for saving the scraped items) and a Scrapy scheduler for RabbitMQ (for distributing the links among workers). I have 2 local Ubuntu instances: Django, MongoDB, the Scrapyd daemon and the RabbitMQ server run on Instance1.
Another Scrapyd daemon is running on Instance2.
RabbitMQ Workers:
linkgenerator
worker1
IP Configurations for Instances:
IP For local Ubuntu Instance1: 192.168.0.101
IP for local Ubuntu Instance2: 192.168.0.106
List of tools used:
MongoDB server
RabbitMQ server
Scrapy Scrapyd API
One RabbitMQ linkgenerator worker (WorkerName: linkgenerator) server with Scrapy installed, running the scrapyd daemon on local Ubuntu Instance1: 192.168.0.101
Another RabbitMQ scraper worker (WorkerName: worker1) server with Scrapy installed, running the scrapyd daemon on local Ubuntu Instance2: 192.168.0.106
Instance1: 192.168.0.101
"Instance1", on which the Django, RabbitMQ and scrapyd daemon servers are running -- IP: 192.168.0.101
Instance2: 192.168.0.106
Scrapy is installed on Instance2 and the scrapyd daemon is running there
Scrapy Control Panel UI Snapshot:
From the snapshot, the control panel overview can be seen: there are two workers; linkgenerator worked successfully, but worker1 did not. The logs are given at the end of the post.
RabbitMQ status info
The linkgenerator worker can successfully push messages to the RabbitMQ queue: the linkgenerator spider generates start_urls for the scraper spider, which are consumed by the scraper (worker1). However, worker1 is not working; please see its logs at the end of the post.
RabbitMQ settings
The below file contains the settings for MongoDB and RabbitMQ:
SCHEDULER = ".rabbitmq.scheduler.Scheduler"
SCHEDULER_PERSIST = True
RABBITMQ_HOST = 'ScrapyDevU79'
RABBITMQ_PORT = 5672
RABBITMQ_USERNAME = 'guest'
RABBITMQ_PASSWORD = 'guest'
MONGODB_PUBLIC_ADDRESS = 'OneScience:27017' # This will be shown on the web interface, but won't be used for connecting to DB
MONGODB_URI = 'localhost:27017' # Actual uri to connect to DB
MONGODB_USER = 'tariq'
MONGODB_PASSWORD = 'toor'
MONGODB_SHARDED = True
MONGODB_BUFFER_DATA = 100
# Set your link generator worker address here
LINK_GENERATOR = 'http://192.168.0.101:6800'
SCRAPERS = ['http://192.168.0.106:6800']
LINUX_USER_CREATION_ENABLED = False # Set this to True if you want a linux user account
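For debugging, the following minimal pika check (a sketch only, not part of the bundled code) can be run from Instance2 to confirm that the broker configured above is reachable with these credentials; the host below assumes the RabbitMQ server IP on Instance1:

# Minimal RabbitMQ connectivity check (sketch, not part of the project code).
# Host/port/credentials mirror the settings above; adjust the host if
# RABBITMQ_HOST ('ScrapyDevU79') resolves to something else on your machines.
import pika

params = pika.ConnectionParameters(
    host='192.168.0.101',  # RabbitMQ server runs on Instance1
    port=5672,
    credentials=pika.PlainCredentials('guest', 'guest'))
connection = pika.BlockingConnection(params)
print('connected:', connection.is_open)
connection.close()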
linkgenerator scrapy.cfg settings:
[settings]
default = tester2_fda_trial20.settings
[deploy:linkgenerator]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
scraper scrapy.cfg settings:
[settings]
default = tester2_fda_trial20.settings
[deploy:worker1]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
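To double-check where the project actually ends up deployed, each daemon can be asked for the spiders it knows via the standard listspiders.json web service (a sketch; the endpoints mirror the addresses used above):

# Sketch: ask each scrapyd daemon which spiders it has for the project.
import requests

for daemon in ('http://192.168.0.101:6800', 'http://192.168.0.106:6800'):
    r = requests.get(daemon + '/listspiders.json',
                     params={'project': 'tester2_fda_trial20'}, timeout=5)
    print(daemon, r.json())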
scrapyd.conf file settings for Instance1 (192.168.0.101)
cat /etc/scrapyd/scrapyd.conf
[scrapyd]
eggs_dir = /var/lib/scrapyd/eggs
dbs_dir = /var/lib/scrapyd/dbs
items_dir = /var/lib/scrapyd/items
logs_dir = /var/log/scrapyd
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port = 6800
debug = on
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
scrapyd.conf file settings for Instance2 (192.168.0.106)
cat /etc/scrapyd/scrapyd.conf
[scrapyd]
eggs_dir = /var/lib/scrapyd/eggs
dbs_dir = /var/lib/scrapyd/dbs
items_dir = /var/lib/scrapyd/items
logs_dir = /var/log/scrapyd
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port = 6800
debug = on
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
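Since both configurations expose daemonstatus.json under [services], a quick reachability check of the two daemons from Instance1 can look like this (sketch only):

# Sketch: confirm both scrapyd daemons answer over HTTP.
import requests

for daemon in ('http://192.168.0.101:6800', 'http://192.168.0.106:6800'):
    resp = requests.get(daemon + '/daemonstatus.json', timeout=5)
    print(daemon, resp.status_code, resp.json())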
RabbitMQ Status
sudo service rabbitmq-server status
[sudo] password for mtaziz:
Status of node rabbit@ScrapyDevU79
[{pid,53715},
{running_applications,
[{rabbitmq_shovel_management,
"Management extension for the Shovel plugin","3.6.11"},
{rabbitmq_shovel,"Data Shovel for RabbitMQ","3.6.11"},
{rabbitmq_management,"RabbitMQ Management Console","3.6.11"},
{rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.11"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.11"},
{rabbit,"RabbitMQ","3.6.11"},
{os_mon,"CPO CXC 138 46","2.2.14"},
{cowboy,"Small, fast, modular HTTP server.","1.0.4"},
{ranch,"Socket acceptor pool for TCP protocols.","1.3.0"},
{ssl,"Erlang/OTP SSL application","5.3.2"},
{public_key,"Public key infrastructure","0.21"},
{cowlib,"Support library for manipulating Web protocols.","1.0.2"},
{crypto,"CRYPTO version 2","3.2"},
{amqp_client,"RabbitMQ AMQP Client","3.6.11"},
{rabbit_common,
"Modules shared by rabbitmq-server and rabbitmq-erlang-client",
"3.6.11"},
{inets,"INETS CXC 138 49","5.9.7"},
{mnesia,"MNESIA CXC 138 12","4.11"},
{compiler,"ERTS CXC 138 10","4.9.4"},
{xmerl,"XML parser","1.3.5"},
{syntax_tools,"Syntax tools","1.6.12"},
{asn1,"The Erlang ASN1 compiler version 2.0.4","2.0.4"},
{sasl,"SASL CXC 138 11","2.3.4"},
{stdlib,"ERTS CXC 138 10","1.19.4"},
{kernel,"ERTS CXC 138 10","2.16.4"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R16B03 (erts-5.10.4) [source] [64-bit] [smp:4:4] [async-threads:64] [kernel-poll:true]\n"},
{memory,
[{connection_readers,0},
{connection_writers,0},
{connection_channels,0},
{connection_other,6856},
{queue_procs,145160},
{queue_slave_procs,0},
{plugins,1959248},
{other_proc,22328920},
{metrics,160112},
{mgmt_db,655320},
{mnesia,83952},
{other_ets,2355800},
{binary,96920},
{msg_index,47352},
{code,27101161},
{atom,992409},
{other_system,31074022},
{total,87007232}]},
{alarms,[]},
{listeners,[{clustering,25672,"::"},{amqp,5672,"::"},{http,15672,"::"}]},
{vm_memory_calculation_strategy,rss},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,3343646720},
{disk_free_limit,50000000},
{disk_free,56257699840},
{file_descriptors,
[{total_limit,924},{total_used,2},{sockets_limit,829},{sockets_used,0}]},
{processes,[{limit,1048576},{used,351}]},
{run_queue,0},
{uptime,34537},
{kernel,{net_ticktime,60}}]
scrapyd daemon running status on Instance1 (192.168.0.101):
scrapyd
2017-09-11T06:16:07+0600 [-] Loading /home/mtaziz/.virtualenvs/onescience_dist_env/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:16:07+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:16:07+0600 [-] Loaded.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/onescience_dist_env/bin/python 2.7.6) starting up.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:16:07+0600 [-] Site starting on 6800
2017-09-11T06:16:07+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7f5e265c77a0>
2017-09-11T06:16:07+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
scrapyd daemon running status on Instance2 (192.168.0.106):
scrapyd
2017-09-11T06:09:28+0600 [-] Loading /home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:09:28+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:09:28+0600 [-] Loaded.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/scrapydevenv/bin/python 2.7.6) starting up.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:09:28+0600 [-] Site starting on 6800
2017-09-11T06:09:28+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7fbe6eaeac20>
2017-09-11T06:09:28+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
worker1 logs
After updating the RabbitMQ server settings following the suggestions made by @Tarun Lalwani (the suggestion was to use the RabbitMQ server IP, 192.168.0.101:5672, instead of 127.0.0.1:5672), I got the new problems below:
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tester2_fda_trial20)
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tester2_fda_trial20.spiders', 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['tester2_fda_trial20.spiders'], 'BOT_NAME': 'tester2_fda_trial20', 'FEED_URI': 'file:///var/lib/scrapyd/items/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.jl', 'SCHEDULER': 'tester2_fda_trial20.rabbitmq.scheduler.Scheduler', 'TELNETCONSOLE_ENABLED': False, 'LOG_FILE': '/var/log/scrapyd/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.log'}
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled item pipelines:
['tester2_fda_trial20.pipelines.FdaTrial20Pipeline',
'tester2_fda_trial20.mongodb.scrapy_mongodb.MongoDBPipeline']
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider opened
2017-09-11 15:49:18 [pika.adapters.base_connection] INFO: Connecting to 192.168.0.101:5672
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Created channel=1
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Closing spider (shutdown)
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [pika.channel] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.close_spider of <scrapy.extensions.feedexport.FeedExporter object at 0x7f94878b8c50>>
Traceback (most recent call last):
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 201, in close_spider
slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_closed of <Tester2Fda_Trial20Spider 'tester2_fda_trial20' at 0x7f9484f897d0>>
Traceback (most recent call last):
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/tmp/user/1000/tester2_fda_trial20-10-d4Req9.egg/tester2_fda_trial20/spiders/tester2_fda_trial20.py", line 28, in spider_closed
AttributeError: 'Tester2Fda_Trial20Spider' object has no attribute 'statstask'
2017-09-11 15:49:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2017, 9, 11, 9, 49, 18, 159896),
'log_count/ERROR': 2,
'log_count/INFO': 10}
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider closed (shutdown)
2017-09-11 15:49:18 [twisted] CRITICAL: Unhandled error in Deferred:
2017-09-11 15:49:18 [twisted] CRITICAL:
Traceback (most recent call last):
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 95, in crawl
six.reraise(*exc_info)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 79, in crawl
yield self.engine.open_spider(self.spider, start_requests)
OperationFailure: command SON([('saslStart', 1), ('mechanism', 'SCRAM-SHA-1'), ('payload', Binary('n,,n=tariq,r=MjY5OTQ0OTYwMjA4', 0)), ('autoAuthorize', 1)]) on namespace admin.$cmd failed: Authentication failed.
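Since the traceback ends in an authentication failure rather than a connection error, a pymongo check outside Scrapy (a sketch, using the credentials from the settings shown earlier) helps isolate whether the credentials themselves are accepted:

# Sketch: test the MongoDB credentials used by the pipeline, outside Scrapy.
# Username/password/host come from the settings file above; run it on the
# machine whose MONGODB_URI actually points at the running mongod.
from urllib import quote  # Python 2, matching the pipeline code below
from pymongo import MongoClient

uri = 'mongodb://%s:%s@%s/admin' % ('tariq', quote('toor'), 'localhost:27017')
client = MongoClient(uri)
print(client.database_names())  # raises OperationFailure if authentication is rejected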
MongoDBPipeline
# coding:utf-8
import datetime

from pymongo import errors
from pymongo.mongo_client import MongoClient
from pymongo.mongo_replica_set_client import MongoReplicaSetClient
from pymongo.read_preferences import ReadPreference

from scrapy.exporters import BaseItemExporter

try:
    from urllib.parse import quote
except:
    from urllib import quote


def not_set(string):
    """ Check if a string is None or ''

    :returns: bool - True if the string is empty
    """
    if string is None:
        return True
    elif string == '':
        return True
    return False


class MongoDBPipeline(BaseItemExporter):
    """ MongoDB pipeline class """
    # Default options
    config = {
        'uri': 'mongodb://localhost:27017',
        'fsync': False,
        'write_concern': 0,
        'database': 'scrapy-mongodb',
        'collection': 'items',
        'replica_set': None,
        'buffer': None,
        'append_timestamp': False,
        'sharded': False
    }

    # Needed for sending acknowledgement signals to RabbitMQ for all persisted items
    queue = None
    acked_signals = []

    # Item buffer
    item_buffer = dict()

    def load_spider(self, spider):
        self.crawler = spider.crawler
        self.settings = spider.settings
        self.queue = self.crawler.engine.slot.scheduler.queue

    def open_spider(self, spider):
        self.load_spider(spider)

        # Configure the connection
        self.configure()
        self.spidername = spider.name
        self.config['uri'] = 'mongodb://' + self.config['username'] + ':' + quote(self.config['password']) + '@' + self.config['uri'] + '/admin'
        self.shardedcolls = []

        if self.config['replica_set'] is not None:
            self.connection = MongoReplicaSetClient(
                self.config['uri'],
                replicaSet=self.config['replica_set'],
                w=self.config['write_concern'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY_PREFERRED)
        else:
            # Connecting to a stand alone MongoDB
            self.connection = MongoClient(
                self.config['uri'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY)

        # Set up the collection
        self.database = self.connection[spider.name]

        # Autoshard the DB
        if self.config['sharded']:
            db_statuses = self.connection['config']['databases'].find({})
            partitioned = []
            notpartitioned = []
            for status in db_statuses:
                if status['partitioned']:
                    partitioned.append(status['_id'])
                else:
                    notpartitioned.append(status['_id'])
            if spider.name in notpartitioned or spider.name not in partitioned:
                try:
                    self.connection.admin.command('enableSharding', spider.name)
                except errors.OperationFailure:
                    pass
            else:
                collections = self.connection['config']['collections'].find({})
                for coll in collections:
                    if (spider.name + '.') in coll['_id']:
                        if coll['dropped'] is not True:
                            if coll['_id'].index(spider.name + '.') == 0:
                                self.shardedcolls.append(coll['_id'][coll['_id'].index('.') + 1:])

    def configure(self):
        """ Configure the MongoDB connection """
        # Set all regular options
        options = [
            ('uri', 'MONGODB_URI'),
            ('fsync', 'MONGODB_FSYNC'),
            ('write_concern', 'MONGODB_REPLICA_SET_W'),
            ('database', 'MONGODB_DATABASE'),
            ('collection', 'MONGODB_COLLECTION'),
            ('replica_set', 'MONGODB_REPLICA_SET'),
            ('buffer', 'MONGODB_BUFFER_DATA'),
            ('append_timestamp', 'MONGODB_ADD_TIMESTAMP'),
            ('sharded', 'MONGODB_SHARDED'),
            ('username', 'MONGODB_USER'),
            ('password', 'MONGODB_PASSWORD')
        ]

        for key, setting in options:
            if not not_set(self.settings[setting]):
                self.config[key] = self.settings[setting]

    def process_item(self, item, spider):
        """ Process the item and add it to MongoDB

        :type item: Item object
        :param item: The item to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        item_name = item.__class__.__name__

        # If we are working with a sharded DB, the collection will also be sharded
        if self.config['sharded']:
            if item_name not in self.shardedcolls:
                try:
                    self.connection.admin.command('shardCollection', '%s.%s' % (self.spidername, item_name), key={'_id': "hashed"})
                    self.shardedcolls.append(item_name)
                except errors.OperationFailure:
                    self.shardedcolls.append(item_name)

        itemtoinsert = dict(self._get_serialized_fields(item))

        if self.config['buffer']:
            if item_name not in self.item_buffer:
                self.item_buffer[item_name] = []
                self.item_buffer[item_name].append([])
                self.item_buffer[item_name].append(0)

            self.item_buffer[item_name][1] += 1

            if self.config['append_timestamp']:
                itemtoinsert['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}

            self.item_buffer[item_name][0].append(itemtoinsert)

            if self.item_buffer[item_name][1] == self.config['buffer']:
                self.item_buffer[item_name][1] = 0
                self.insert_item(self.item_buffer[item_name][0], spider, item_name)

            return item

        self.insert_item(itemtoinsert, spider, item_name)
        return item

    def close_spider(self, spider):
        """ Method called when the spider is closed

        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: None
        """
        for key in self.item_buffer:
            if self.item_buffer[key][0]:
                self.insert_item(self.item_buffer[key][0], spider, key)

    def insert_item(self, item, spider, item_name):
        """ Process the item and add it to MongoDB

        :type item: (Item object) or [(Item object)]
        :param item: The item(s) to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        self.collection = self.database[item_name]

        if not isinstance(item, list):
            if self.config['append_timestamp']:
                item['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}
            ack_signal = item['ack_signal']
            item.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            if ack_signal not in self.acked_signals:
                self.queue.acknowledge(ack_signal)
                self.acked_signals.append(ack_signal)
        else:
            signals = []
            for eachitem in item:
                signals.append(eachitem['ack_signal'])
                eachitem.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            del item[:]
            for ack_signal in signals:
                if ack_signal not in self.acked_signals:
                    self.queue.acknowledge(ack_signal)
                    self.acked_signals.append(ack_signal)
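For reference, given the settings shown earlier (MONGODB_USER = 'tariq', MONGODB_PASSWORD = 'toor', MONGODB_URI = 'localhost:27017'), open_spider() above assembles a connection URI of this shape; the trailing /admin means authentication is attempted against the admin database:

# Illustration only, derived from the open_spider() code above:
# 'mongodb://' + username + ':' + quote(password) + '@' + uri + '/admin'
# -> 'mongodb://tariq:toor@localhost:27017/admin'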
To sum up, I believe the problem lies with the scrapyd daemons running on the two instances: somehow the scraper (worker1) cannot access them, and I could not figure out why. I did not find any comparable use cases on Stack Overflow.
Any help in this regard is highly appreciated. Thank you in advance!

Related

Django Login suddenly stopped working - timing out

My Django project used to work perfectly fine for the last 90 days.
There has been no new code deployment during this time.
I am running supervisor -> gunicorn to serve the application, with nginx in front.
Unfortunately it just stopped serving the login page (standard framework login).
I wrote a small view that checks if the DB connection is working and it comes up within seconds.
def updown(request):
    from django.shortcuts import HttpResponse
    from django.db import connections
    from django.db.utils import OperationalError
    status = True
    # Check database connection
    if status is True:
        db_conn = connections['default']
        try:
            c = db_conn.cursor()
        except OperationalError:
            status = False
            error = 'No connection to database'
        else:
            status = True
    if status is True:
        message = 'OK'
    elif status is False:
        message = 'NOK' + ' \n' + error
    return HttpResponse(message)
This delivers back an OK.
But the second I am trying to reach /admin or anything else requiring the login, it times out.
wget http://127.0.0.1:8000
--2022-07-20 22:54:58-- http://127.0.0.1:8000/
Connecting to 127.0.0.1:8000... connected.
HTTP request sent, awaiting response... 302 Found
Location: /business/dashboard/ [following]
--2022-07-20 22:54:58-- http://127.0.0.1:8000/business/dashboard/
Connecting to 127.0.0.1:8000... connected.
HTTP request sent, awaiting response... 302 Found
Location: /account/login/?next=/business/dashboard/ [following]
--2022-07-20 22:54:58-- http://127.0.0.1:8000/account/login/?next=/business/dashboard/
Connecting to 127.0.0.1:8000... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2022-07-20 22:55:30-- (try: 2) http://127.0.0.1:8000/account/login/?next=/business/dashboard/
Connecting to 127.0.0.1:8000... connected.
HTTP request sent, awaiting response...
Supervisor / Gunicorn Log is not helpful at all:
[2022-07-20 23:06:34 +0200] [980] [INFO] Starting gunicorn 20.1.0
[2022-07-20 23:06:34 +0200] [980] [INFO] Listening at: http://127.0.0.1:8000 (980)
[2022-07-20 23:06:34 +0200] [980] [INFO] Using worker: sync
[2022-07-20 23:06:34 +0200] [986] [INFO] Booting worker with pid: 986
[2022-07-20 23:08:01 +0200] [980] [CRITICAL] WORKER TIMEOUT (pid:986)
[2022-07-20 23:08:02 +0200] [980] [WARNING] Worker with pid 986 was terminated due to signal 9
[2022-07-20 23:08:02 +0200] [1249] [INFO] Booting worker with pid: 1249
[2022-07-20 23:12:26 +0200] [980] [CRITICAL] WORKER TIMEOUT (pid:1249)
[2022-07-20 23:12:27 +0200] [980] [WARNING] Worker with pid 1249 was terminated due to signal 9
[2022-07-20 23:12:27 +0200] [1515] [INFO] Booting worker with pid: 1515
Nginx is just giving:
502 Bad Gateway
I don't see anything in the logs, I don't see any error when running the Django dev server, and Sentry is not showing anything either. I am totally lost.
I am running Django 4.0.x and all libraries are updated.
The check-up script for the database only checks the connection. Due to a misconfiguration of the database replication, the DB was accepting connections and reads, but hung on writes.
The login page tries to write a session to the tables, which is what failed in this case.
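For illustration, a health check that also exercises a write (a sketch; the temporary table and SQL are purely illustrative, not from the original code) would have caught this case where a connect-only check did not:

# Sketch of a health view that also probes the write path.
from django.db import connections
from django.http import HttpResponse

def updown_write(request):
    try:
        with connections['default'].cursor() as c:
            c.execute("CREATE TEMPORARY TABLE IF NOT EXISTS healthcheck_tmp (id INT)")
            c.execute("INSERT INTO healthcheck_tmp (id) VALUES (1)")
        return HttpResponse('OK')
    except Exception as exc:
        return HttpResponse('NOK: %s' % exc)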

_InactiveRpcError while querying Vertex AI Matching Engine Index

I am following the example notebook from the GCP docs to test Vertex AI Matching Engine. I have deployed an index, but while trying to query it I am getting an _InactiveRpcError. The VPC network is in us-west2 with private service access enabled, and the index is deployed in us-central1. My VPC network contains the pre-populated firewall rules.
Index
createTime: '2021-11-23T15:25:53.928606Z'
deployedIndexes:
- deployedIndexId: brute_force_glove_deployed_v3
indexEndpoint: projects/XXXXXXXXXXXX/locations/us-central1/indexEndpoints/XXXXXXXXXXXX
description: testing python script for creating index
displayName: glove_100_brute_force_20211123152551
etag: AMEw9yOVPWBOTpbAvJLllqxWMi2YurEV_sad2n13QvbIlqjOdMyiq_j20gG1ldhdZNTL
metadata:
config:
algorithmConfig:
bruteForceConfig: {}
dimensions: 100
distanceMeasureType: DOT_PRODUCT_DISTANCE
metadataSchemaUri: gs://google-cloud-aiplatform/schema/matchingengine/metadata/nearest_neighbor_search_1.0.0.yaml
name: projects/XXXXXXXXXXXX/locations/us-central1/indexes/XXXXXXXXXXXX
updateTime: '2021-11-23T16:04:17.993730Z'
Index-Endpoint
createTime: '2021-11-24T10:59:51.975949Z'
deployedIndexes:
- automaticResources:
maxReplicaCount: 1
minReplicaCount: 1
createTime: '2021-11-30T15:16:12.323028Z'
deploymentGroup: default
displayName: brute_force_glove_deployed_v3
enableAccessLogging: true
id: brute_force_glove_deployed_v3
index: projects/XXXXXXXXXXXX/locations/us-central1/indexes/XXXXXXXXXXXX
indexSyncTime: '2021-11-30T16:37:35.597200Z'
privateEndpoints:
matchGrpcAddress: 10.242.4.5
displayName: index_endpoint_for_demo
etag: AMEw9yO6cuDfgpBhGVw7-NKnlS1vdFI5nnOtqVgW1ddMP-CMXM7NfGWVpqRpMRPsNCwc
name: projects/XXXXXXXXXXXX/locations/us-central1/indexEndpoints/XXXXXXXXXXXX
network: projects/XXXXXXXXXXXX/global/networks/XXXXXXXXXXXX
updateTime: '2021-11-24T10:59:53.271100Z'
Code
import grpc
# import the generated classes
import match_service_pb2
import match_service_pb2_grpc
DEPLOYED_INDEX_SERVER_IP = '10.242.0.5'
DEPLOYED_INDEX_ID = 'brute_force_glove_deployed_v3'
query = [-0.11333, 0.48402, 0.090771, -0.22439, 0.034206, -0.55831, 0.041849, -0.53573, 0.18809, -0.58722, 0.015313, -0.014555, 0.80842, -0.038519, 0.75348, 0.70502, -0.17863, 0.3222, 0.67575, 0.67198, 0.26044, 0.4187, -0.34122, 0.2286, -0.53529, 1.2582, -0.091543, 0.19716, -0.037454, -0.3336, 0.31399, 0.36488, 0.71263, 0.1307, -0.24654, -0.52445, -0.036091, 0.55068, 0.10017, 0.48095, 0.71104, -0.053462, 0.22325, 0.30917, -0.39926, 0.036634, -0.35431, -0.42795, 0.46444, 0.25586, 0.68257, -0.20821, 0.38433, 0.055773, -0.2539, -0.20804, 0.52522, -0.11399, -0.3253, -0.44104, 0.17528, 0.62255, 0.50237, -0.7607, -0.071786, 0.0080131, -0.13286, 0.50097, 0.18824, -0.54722, -0.42664, 0.4292, 0.14877, -0.0072514, -0.16484, -0.059798, 0.9895, -0.61738, 0.054169, 0.48424, -0.35084, -0.27053, 0.37829, 0.11503, -0.39613, 0.24266, 0.39147, -0.075256, 0.65093, -0.20822, -0.17456, 0.53571, -0.16537, 0.13582, -0.56016, 0.016964, 0.1277, 0.94071, -0.22608, -0.021106]
channel = grpc.insecure_channel("{}:10000".format(DEPLOYED_INDEX_SERVER_IP))
stub = match_service_pb2_grpc.MatchServiceStub(channel)
request = match_service_pb2.MatchRequest()
request.deployed_index_id = DEPLOYED_INDEX_ID
for val in query:
request.float_val.append(val)
response = stub.Match(request)
response
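(As a side check, a basic channel-readiness probe run before the Match call can separate pure connectivity problems from RPC-level failures; this is just a sketch using the standard gRPC helper.)

# Optional readiness probe (sketch): waits until the channel connects or times out.
try:
    grpc.channel_ready_future(channel).result(timeout=10)
    print('channel ready')
except grpc.FutureTimeoutError:
    print('could not reach {}:10000'.format(DEPLOYED_INDEX_SERVER_IP))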
Error
_InactiveRpcError Traceback (most recent call last)
/tmp/ipykernel_3451/467153318.py in <module>
108 request.float_val.append(val)
109
--> 110 response = stub.Match(request)
111 response
/opt/conda/lib/python3.7/site-packages/grpc/_channel.py in __call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
944 state, call, = self._blocking(request, timeout, metadata, credentials,
945 wait_for_ready, compression)
--> 946 return _end_unary_response_blocking(state, call, False, None)
947
948 def with_call(self,
/opt/conda/lib/python3.7/site-packages/grpc/_channel.py in _end_unary_response_blocking(state, call, with_call, deadline)
847 return state.response
848 else:
--> 849 raise _InactiveRpcError(state)
850
851
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1638277076.941429628","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3093,"referenced_errors":[{"created":"@1638277076.941428202","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
Currently, Matching Engine only supports queries from the same region. Can you try running the code from a VM in us-central1?

Flask-sqlalchemy / uwsgi: DB connection problem when more than one process is used

I have a Flask app running on Heroku with uwsgi server in which each user connects to his own database. I have implemented the solution reported here for a very similar situation. In particular, I have implemented the connection registry as follows:
class DBSessionRegistry():
    _registry = {}

    def get(self, URI, **kwargs):
        if URI not in self._registry:
            current_app.logger.info(f'INFO - CREATING A NEW CONNECTION')
            try:
                engine = create_engine(URI,
                                       echo=False,
                                       pool_size=5,
                                       max_overflow=5)
                session_factory = sessionmaker(bind=engine)
                Session = scoped_session(session_factory)
                a_session = Session()
                self._registry[URI] = a_session
            except ArgumentError:
                raise Exception('Error')
        current_app.logger.info(f'SESSION ID: {id(self._registry[URI])}')
        current_app.logger.info(f'REGISTRY ID: {id(self._registry)}')
        current_app.logger.info(f'REGISTRY SIZE: {len(self._registry.keys())}')
        current_app.logger.info(f'APP ID: {id(current_app)}')
        return self._registry[URI]
In my create_app() I assign a registry to the app:
app.DBregistry = DBSessionRegistry()
and whenever I need to talk to the DB I call:
current_app.DBregistry.get(URI)
where the URI is dependent on the user. This works nicely if I use uwsgi with one single process. With more processes,
[uwsgi]
processes = 4
threads = 1
sometimes it gets stuck on some requests, returning a 503 error code. I have found that the problem appears when the requests are handled by different processes in uwsgi. This is an excerpt of the log, which I commented to illustrate the issue:
# ... EVERYTHING OK UP TO HERE.
# ALL PREVIOUS REQUESTS HANDLED BY PROCESS pid = 12
INFO in utils: SESSION ID: 139860361716304
INFO in utils: REGISTRY ID: 139860484608480
INFO in utils: REGISTRY SIZE: 1
INFO in utils: APP ID: 139860526857584
# NOTE THE pid IN THE NEXT LINE...
[pid: 12|app: 0|req: 1/1] POST /manager/_save_task =>
generated 154 bytes in 3457 msecs (HTTP/1.1 200) 4 headers in 601
bytes (1 switches on core 0)
# PREVIOUS REQUEST WAS MANAGED BY PROCESS pid = 12
# THE NEXT REQUEST IS FROM THE SAME USER AND TO THE SAME URL.
# SO THERE IS NO NEED FOR CREATING A NEW CONNECTION, BUT INSTEAD...
INFO - CREATING A NEW CONNECTION
# TO THIS POINT, I DON'T UNDERSTAND WHY IT CREATED A NEW CONNECTION.
# THE SESSION ID CHANGES, AS IT IS A NEW SESSION
INFO in utils: SESSION ID: 139860363793168 # <<--- CHANGED
INFO in utils: REGISTRY ID: 139860484608480
INFO in utils: REGISTRY SIZE: 1
# THE APP AND THE REGISTRY ARE UNIQUE
INFO in utils: APP ID: 139860526857584
# uwsgi GIVES UP...
*** HARAKIRI ON WORKER 4 (pid: 11, try: 1) ***
# THE FAILED REQUEST WAS MANAGED BY PROCESS pid = 11
# I ASSUME THIS IS WHY IT CREATED A NEW CONNECTION
HARAKIRI: -- syscall> 7 0x7fff4290c6d8 0x1 0xffffffff 0x4000 0x0 0x0
0x7fff4290c6b8 0x7f33d6e3cbc4
HARAKIRI: -- wchan> poll_schedule_timeout
HARAKIRI !!! worker 4 status !!!
HARAKIRI [core 0] - POST /manager/_save_task since 1587660997
HARAKIRI !!! end of worker 4 status !!!
heroku[router]: at=error code=H13 desc="Connection closed without
response" method=POST path="/manager/_save_task"
DAMN ! worker 4 (pid: 11) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 4 (new pid: 14)
# FROM HERE ON, NOTHINGS WORKS ANYMORE
This behavior is consistent over several attempts: when the pid changes, the request fails. Even with pool_size = 1 in the create_engine call the issue persists. There is no issue, on the other hand, if uwsgi is used with a single process.
I am pretty sure it is my fault, there is something I don't know or I don't understand about how uwsgi and/or sqlalchemy work. Could you please help me?
Thanks
What is happening is that you are trying to share memory between processes.
There are some explanations in these posts:
(Is it possible to share memory between uwsgi processes running a Flask app?)
(https://stackoverflow.com/a/45383617/11542053)
You can use an extra layer to store your sessions outside of the app.
For that, you can use uWSGI's SharedArea (https://uwsgi-docs.readthedocs.io/en/latest/SharedArea.html), which is very low level, or you can use other approaches like uWSGI's caching (https://uwsgi-docs.readthedocs.io/en/latest/Caching.html).
Hope it helps.
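To make the process boundary visible in the logs, a tiny diagnostic along these lines (a sketch, not part of the original registry) can be added wherever the registry is used; it records which OS process handled the request:

# Diagnostic sketch: each uwsgi worker is a separate OS process with its own
# copy of DBSessionRegistry._registry, so logging the PID shows why a request
# served by a different worker triggers "CREATING A NEW CONNECTION".
import os
from flask import current_app

def log_registry_state(registry):
    current_app.logger.info(
        'PID: %s, REGISTRY ID: %s, REGISTRY SIZE: %s',
        os.getpid(), id(registry._registry), len(registry._registry))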

Configure DDS with scrapy-splash. ERROR: no base objects

LS,
I have installed Django-Dynamic-Scraper, and I would like to render JavaScript via Splash. Therefore I have installed scrapy-splash and the Splash Docker image. The image below shows that the Docker container can be reached.
Splash docker container
Nevertheless, when I test it via DDS it returns the following error:
2016-10-25 17:06:00 [scrapy] INFO: Spider opened
2016-10-25 17:06:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-25 17:06:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-25 17:06:05 [scrapy] DEBUG: Crawled (200) <POST http://192.168.0.150:8050/render.html> (referer: None)
2016-10-25 17:06:06 [root] ERROR: No base objects found!
2016-10-25 17:06:06 [scrapy] INFO: Closing spider (finished)
2016-10-25 17:06:06 [scrapy] INFO: Dumping Scrapy stats:
when executing:
scrapy crawl my_spider -a id=1
I have configured the DDS admin page and checked the checkbox to render the javascript:
Admin configuration
I have followed the configuration from scrapy-splash:
# ----------------------------------------------------------------------
# SPLASH SETTINGS
# https://github.com/scrapy-plugins/scrapy-splash#configuration
# --------------------------------------------------------------------
SPLASH_URL = 'http://192.168.0.150:8050/'
DSCRAPER_SPLASH_ARGS = {'wait': 3}
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# This middleware is needed to support cache_args feature;
# it allows to save disk space by not storing duplicate Splash arguments
# multiple times in a disk request queue.
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# If you use Scrapy HTTP cache then a custom cache storage backend is required.
# scrapy-splash provides a subclass
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
I assume that, with a correct configuration of DDS/scrapy-splash, it will send the required arguments to the Splash Docker container for rendering. Is this the case?
What am I missing? Do I need to adjust the spider with a Splash script?
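For reference, this is how a bare scrapy-splash spider would send the request through Splash directly (a sketch with illustrative names; I am not sure whether DDS needs an equivalent adjustment or builds this request itself):

# Sketch of a plain scrapy-splash spider; spider name and URL are illustrative.
import scrapy
from scrapy_splash import SplashRequest

class ExampleSplashSpider(scrapy.Spider):
    name = 'example_splash'

    def start_requests(self):
        yield SplashRequest('http://example.com', self.parse, args={'wait': 3})

    def parse(self, response):
        self.logger.info('rendered page length: %d', len(response.body))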

Can't get callback method in Scrapy rule to trigger

I am having some problems getting CrawlSpider rules working properly in Scrapy 0.22.2 for Python 2.7.3. It seems to me that no matter what I do, the callback method specified in my rule is never triggered. I have tried setting my rules to allow everything, but at no point is the print method inside my callback method (parse_units) ever called. I've made sure not to try to override the parse method for this purpose, as this seems to be a common mistake.
The starting URL is of the form
http://training.gov.au/Training/Details/CHC40708?tableUnits-page=1&pageSizeKey=Training_Details_tableUnits&pageSize=100
The links to the individual pages I want the crawler to follow through and parse look like this
http://training.gov.au/Training/Details/BSBWOR204A
Here is my python code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
import urllib2
import os
qualification_code = raw_input( "enter the qualification code\n");
class trainingGovSpider(CrawlSpider):
    name = "trainingGov"
    allowed_domains = ["training.gov.au"]
    # refine this to accept user input
    start_urls = ["http://training.gov.au/Training/Details/" + qualification_code + "?tableUnits-page=1&pageSizeKey=Training_Details_tableUnits&pageSize=100"]
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//a[contains(@title, "View details for unit code")]')), callback='parse_units')]

    def parse(self, response):
        sel = Selector(response)
        # first get the unit outline
        qual_outline = sel.xpath('//a[contains(@title, "Download Qualification in PDF format.")]')
        qual_outline_link = qual_outline.xpath('@href').extract()[0]
        url = "http://training.gov.au" + qual_outline_link
        # get url to qualification outline
        print url
        # next, get link elements which point to each individual unit in the qualification
        sites = sel.xpath('//a[contains(@title, "View details for unit code")]')
        for site in sites:
            title = site.xpath('text()').extract()
            # will need to follow each link
            link = site.xpath('@href').extract()[0]
            # will need to combine with http://training.gov.au
            print link

    # this should allow all individual unit links in a qualification to be parsed, but it isn't being called
    def parse_units(self, response):
        print 'parse_unit called'
        hxs = HtmlXPathSelector(response)
        current_status = hxs.xpath('text()').extract()
        # current_status = hxs.xpath('/html/body/div/div/div/div/div/div[@class="outer"]/div[@class="fieldset"]/div[@class="display-row"]/div[@class="display-row"]/div[@class="display-field"]/span[re:test(., "Current", "i")]').extract()
        print current_status
Log file as follows:
scrapy crawl trainingGov
enter the qualification code
CHC52212
2014-04-17 15:30:05+0800 [scrapy] INFO: Scrapy 0.22.2 started (bot: tutorial)
2014-04-17 15:30:05+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-04-17 15:30:05+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2014-04-17 15:30:05+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-04-17 15:30:05+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-04-17 15:30:05+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-04-17 15:30:05+0800 [scrapy] INFO: Enabled item pipelines:
2014-04-17 15:30:05+0800 [trainingGov] INFO: Spider opened
2014-04-17 15:30:05+0800 [trainingGov] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-04-17 15:30:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
2014-04-17 15:30:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2014-04-17 15:30:06+0800 [trainingGov] DEBUG: Crawled (200) <GET http://training.gov.au/Training/Details/CHC52212?tableUnits-page=1&pageSizeKey=Training_Details_tableUnits&pageSize=100> (referer: None)
http://training.gov.au/TrainingComponentFiles/CHC08/CHC52212_R1.pdf
/Training/Details/BSBSUS501A
/Training/Details/BSBWOR403A
/Training/Details/CHCAC317A
/Training/Details/CHCAC318B
/Training/Details/CHCAC416A
/Training/Details/CHCAC417A
/Training/Details/CHCAC507E
/Training/Details/CHCAD402D
/Training/Details/CHCAD504B
/Training/Details/CHCADMIN508B
/Training/Details/CHCAL523D
/Training/Details/CHCAOD402B
/Training/Details/CHCCD401E
/Training/Details/CHCCD412B
/Training/Details/CHCCD516B
/Training/Details/CHCCH301C
/Training/Details/CHCCH427B
/Training/Details/CHCCHILD401B
/Training/Details/CHCCHILD505B
/Training/Details/CHCCOM403A
/Training/Details/CHCCOM504B
/Training/Details/CHCCS426B
/Training/Details/CHCCS427B
/Training/Details/CHCCS502C
/Training/Details/CHCCS503B
/Training/Details/CHCCS505B
/Training/Details/CHCCS512C
/Training/Details/CHCCS513C
/Training/Details/CHCDIS301C
/Training/Details/CHCDIS410A
/Training/Details/CHCDIS507C
/Training/Details/CHCES311B
/Training/Details/CHCES415A
/Training/Details/CHCES502C
/Training/Details/CHCES511B
/Training/Details/CHCHC401C
/Training/Details/CHCICS409A
/Training/Details/CHCINF407D
/Training/Details/CHCINF408C
/Training/Details/CHCINF505D
/Training/Details/CHCMH301C
/Training/Details/CHCMH402B
/Training/Details/CHCMH411A
/Training/Details/CHCNET501C
/Training/Details/CHCNET503D
/Training/Details/CHCORG405E
/Training/Details/CHCORG406C
/Training/Details/CHCORG423C
/Training/Details/CHCORG428A
/Training/Details/CHCORG501B
/Training/Details/CHCORG506E
/Training/Details/CHCORG525D
/Training/Details/CHCORG607D
/Training/Details/CHCORG610B
/Training/Details/CHCORG611C
/Training/Details/CHCPA402B
/Training/Details/CHCPR510B
/Training/Details/CHCSD512C
/Training/Details/CHCSW401A
/Training/Details/CHCSW402B
/Training/Details/CHCYTH401B
/Training/Details/CHCYTH402C
/Training/Details/CHCYTH506B
/Training/Details/HLTFA311A
/Training/Details/HLTFA412A
/Training/Details/HLTHIR403C
/Training/Details/HLTHIR404D
/Training/Details/HLTWHS401A
/Training/Details/PSPGOV517A
/Training/Details/PSPMNGT605B
/Training/Details/SISCCRD302A
/Training/Details/SRXGOV004B
/Training/Details/TAEDEL402A
2014-04-17 15:30:06+0800 [trainingGov] INFO: Closing spider (finished)
2014-04-17 15:30:06+0800 [trainingGov] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 310,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 17347,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 4, 17, 7, 30, 6, 96000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 4, 17, 7, 30, 5, 522000)}
2014-04-17 15:30:06+0800 [trainingGov] INFO: Spider closed (finished)
Thank you!