Configure DDS with scrapy-splash. ERROR: no base objects - django

Hi,
I have installed Django-Dynamic-Scraper (DDS) and would like to render JavaScript via Splash. Therefore I have installed scrapy-splash and pulled the Splash Docker image. The image below shows that the Docker container can be reached.
Splash docker container
Nevertheless, when I test it via DDS it returns the following error:
2016-10-25 17:06:00 [scrapy] INFO: Spider opened
2016-10-25 17:06:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-25 17:06:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-25 17:06:05 [scrapy] DEBUG: Crawled (200) <POST http://192.168.0.150:8050/render.html> (referer: None)
2016-10-25 17:06:06 [root] ERROR: No base objects found!
2016-10-25 17:06:06 [scrapy] INFO: Closing spider (finished)
2016-10-25 17:06:06 [scrapy] INFO: Dumping Scrapy stats:
This happens when executing:
scrapy crawl my_spider -a id=1
I have configured the DDS admin page and checked the checkbox to render JavaScript:
Admin configuration
I have followed the configuration from scrapy-splash:
# ----------------------------------------------------------------------
# SPLASH SETTINGS
# https://github.com/scrapy-plugins/scrapy-splash#configuration
# --------------------------------------------------------------------
SPLASH_URL = 'http://192.168.0.150:8050/'
DSCRAPER_SPLASH_ARGS = {'wait': 3}
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# This middleware is needed to support cache_args feature;
# it allows to save disk space by not storing duplicate Splash arguments
# multiple times in a disk request queue.
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# If you use Scrapy HTTP cache then a custom cache storage backend is required.
# scrapy-splash provides a subclass
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
I assume that with a correct configuration of DDS/scrapy-splash, it will send the required arguments to the Splash Docker container for rendering. Is this the case?
What am I missing? Do I need to adjust the spider with a Splash script?
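For reference, outside of DDS a scrapy-splash spider passes the rendering arguments explicitly via SplashRequest. The snippet below is only a minimal sketch of that plain scrapy-splash wiring (the spider name and URL are placeholders, and DDS should build the equivalent request for you when the render-JavaScript checkbox is set); it is not DDS-specific code:
# Minimal scrapy-splash sketch; spider name and URL are placeholders.
import scrapy
from scrapy_splash import SplashRequest

class RenderDemoSpider(scrapy.Spider):
    name = 'render_demo'

    def start_requests(self):
        # 'wait' mirrors the DSCRAPER_SPLASH_ARGS value above; the endpoint defaults to render.html.
        yield SplashRequest('http://example.com/', callback=self.parse, args={'wait': 3})

    def parse(self, response):
        # response.text now holds the DOM as rendered by Splash.
        self.logger.info('Rendered page length: %d', len(response.text))
Since the POST to render.html in your log returns 200, Splash itself is being reached; in that case "No base objects found!" usually points at the base element XPath of the scraper definition matching nothing in the rendered page, rather than at the Splash setup.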

Related

If a proxy is good, how do I stick to that proxy until it gets banned, and only then move to another one, with scrapy-rotating-proxies?

Yesterday I asked a question, and from the answer I figured out that I need to use proxies to scrape that site. So I implemented scrapy-rotating-proxies in that script.
Here is the changed settings.py
ROTATING_PROXY_LIST_PATH = '/my/path/proxies.txt'
DOWNLOADER_MIDDLEWARES = {
'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
ROBOTSTXT_OBEY = False
After implementing all of that, Scrapy still stops after scraping about 370+ pages. Since I am new to proxy rotation, I want to know how to stick to one proxy/IP (as long as it is good) until it gets banned, and only then rotate to another proxy/IP from the proxies.txt file. I ask because I noticed that when a proxy is good, the required data gets scraped, as shown below:
2019-10-06 12:50:11 [rotating_proxies.expire] DEBUG: Proxy <http://197.254.16.30:8080> is DEAD
2019-10-06 12:50:11 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.gulahmedshop.com/khadi-net-3-pc-outfit-glamour-19-48> with another proxy (failed 3 times, max retries: 5)
2019-10-06 12:50:12 [rotating_proxies.expire] DEBUG: Proxy <http://181.30.95.162:33078> is GOOD
2019-10-06 12:50:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gulahmedshop.com/gls-18-143> (referer: https://www.gulahmedshop.com/women?cat=399&price=-3000)
2019-10-06 12:50:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.gulahmedshop.com/gls-18-143>
{'Image Url': u'https://d224nth7ac0evy.cloudfront.net/catalog/product/cache/1e8ef93b9b4867ab9f3538dde2cb3b8a/g/l/gls-18-143_1_.jpg', 'Price': u'PKR 2,058', 'Category Name': u'and above', 'Product Title': u'GLS-18-143', 'Prouct page': 'https://www.gulahmedshop.com/gls-18-143'}
2019-10-06 12:50:12 [rotating_proxies.expire] DEBUG: Proxy <http://171.239.46.185:8080> is DEAD
2019-10-06 12:50:12 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.gulahmedshop.com/gls-18-230> with another proxy (failed 2 times, max retries: 5)
Unless you need persistent sessions, you don't want to do that.
You want requests to be spread across your different proxies, so that the number of requests per minute on each proxy is as low as possible, making it less likely that they are throttled.
If you are using free proxies, what you need is either more free proxies, switching to paid proxies or, even better, a single smart proxy.
It would also be a good idea to figure out why you are getting banned in the first place, but that can be hard to do.
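As a concrete illustration of spreading the load rather than pinning one proxy, the settings sketch below combines the rotating-proxies options already shown with standard Scrapy throttling settings; the numbers are placeholders to tune for the target site, and ROTATING_PROXY_PAGE_RETRY_TIMES is assumed from the scrapy-rotating-proxies documentation (its default of 5 matches the "max retries: 5" in your log):
# settings.py sketch -- lower the per-proxy request rate instead of pinning one proxy.
ROTATING_PROXY_LIST_PATH = '/my/path/proxies.txt'

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# Fewer parallel requests plus a small delay spread traffic across the proxy pool.
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 1.0

# Let Scrapy adapt the request rate to how the site responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# How many times one page is retried with a different proxy before it is given up.
ROTATING_PROXY_PAGE_RETRY_TIMES = 5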

2 RabbitMQ workers and 2 Scrapyd daemons running on 2 local Ubuntu instances, in which one of the RabbitMQ workers is not working

I am currently working on building a "Scrapy spiders control panel", for which I am testing the existing solution available at [Distributed Multi-user Scrapy Spiders Control Panel] https://github.com/aaldaber/Distributed-Multi-User-Scrapy-System-with-a-Web-UI.
I am trying to run this on my local Ubuntu dev machine but am having issues with the scrapyd daemon.
One of the workers, linkgenerator, is working, but the scraper worker, worker1, is not.
I cannot figure out why scrapyd won't run on the other local instance.
Background information about the configuration:
The application comes bundled with Django, Scrapy, a pipeline for MongoDB (for saving the scraped items) and a Scrapy scheduler for RabbitMQ (for distributing the links among workers). I have 2 local Ubuntu instances: Django, MongoDB, a Scrapyd daemon and the RabbitMQ server run on Instance1.
Another Scrapyd daemon is running on Instance2.
RabbitMQ Workers:
linkgenerator
worker1
IP Configurations for Instances:
IP For local Ubuntu Instance1: 192.168.0.101
IP for local Ubuntu Instance2: 192.168.0.106
List of tools used:
MongoDB server
RabbitMQ server
Scrapy Scrapyd API
One RabbitMQ linkgenerator worker (WorkerName: linkgenerator) server, with Scrapy installed and the scrapyd daemon running, on local Ubuntu Instance1: 192.168.0.101
Another RabbitMQ scraper worker (WorkerName: worker1) server, with Scrapy installed and the scrapyd daemon running, on local Ubuntu Instance2: 192.168.0.106
Instance1: 192.168.0.101
"Instance1", on which the Django, RabbitMQ and scrapyd daemon servers are running -- IP: 192.168.0.101
Instance2: 192.168.0.106
Scrapy is installed on Instance2, with the scrapyd daemon running.
Scrapy Control Panel UI Snapshot:
From the snapshot, the control panel layout can be seen: there are two workers; linkgenerator worked successfully but worker1 did not. The logs are given at the end of the post.
RabbitMQ status info
The linkgenerator worker can successfully push messages to the RabbitMQ queue. The linkgenerator spider generates start_urls for the scraper spider, which are consumed by the scraper (worker1), and that is what is not working. Please see the logs for worker1 at the end of the post.
RabbitMQ settings
The below file contains the settings for MongoDB and RabbitMQ:
SCHEDULER = ".rabbitmq.scheduler.Scheduler"
SCHEDULER_PERSIST = True
RABBITMQ_HOST = 'ScrapyDevU79'
RABBITMQ_PORT = 5672
RABBITMQ_USERNAME = 'guest'
RABBITMQ_PASSWORD = 'guest'
MONGODB_PUBLIC_ADDRESS = 'OneScience:27017' # This will be shown on the web interface, but won't be used for connecting to DB
MONGODB_URI = 'localhost:27017' # Actual uri to connect to DB
MONGODB_USER = 'tariq'
MONGODB_PASSWORD = 'toor'
MONGODB_SHARDED = True
MONGODB_BUFFER_DATA = 100
# Set your link generator worker address here
LINK_GENERATOR = 'http://192.168.0.101:6800'
SCRAPERS = ['http://192.168.0.106:6800']
LINUX_USER_CREATION_ENABLED = False # Set this to True if you want a linux user account
linkgenerator scrapy.cfg settings:
[settings]
default = tester2_fda_trial20.settings
[deploy:linkgenerator]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
scraper scrapy.cfg settings:
[settings]
default = tester2_fda_trial20.settings
[deploy:worker1]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
scrapyd.conf file settings for Instance1 (192.168.0.101)
cat /etc/scrapyd/scrapyd.conf
[scrapyd]
eggs_dir = /var/lib/scrapyd/eggs
dbs_dir = /var/lib/scrapyd/dbs
items_dir = /var/lib/scrapyd/items
logs_dir = /var/log/scrapyd
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port = 6800
debug = on
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
scrapyd.conf file settings for Instance2 (192.168.0.106)
cat /etc/scrapyd/scrapyd.conf
[scrapyd]
eggs_dir = /var/lib/scrapyd/eggs
dbs_dir = /var/lib/scrapyd/dbs
items_dir = /var/lib/scrapyd/items
logs_dir = /var/log/scrapyd
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port = 6800
debug = on
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
RabbitMQ Status
sudo service rabbitmq-server status
[sudo] password for mtaziz:
Status of node rabbit@ScrapyDevU79
[{pid,53715},
{running_applications,
[{rabbitmq_shovel_management,
"Management extension for the Shovel plugin","3.6.11"},
{rabbitmq_shovel,"Data Shovel for RabbitMQ","3.6.11"},
{rabbitmq_management,"RabbitMQ Management Console","3.6.11"},
{rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.11"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.11"},
{rabbit,"RabbitMQ","3.6.11"},
{os_mon,"CPO CXC 138 46","2.2.14"},
{cowboy,"Small, fast, modular HTTP server.","1.0.4"},
{ranch,"Socket acceptor pool for TCP protocols.","1.3.0"},
{ssl,"Erlang/OTP SSL application","5.3.2"},
{public_key,"Public key infrastructure","0.21"},
{cowlib,"Support library for manipulating Web protocols.","1.0.2"},
{crypto,"CRYPTO version 2","3.2"},
{amqp_client,"RabbitMQ AMQP Client","3.6.11"},
{rabbit_common,
"Modules shared by rabbitmq-server and rabbitmq-erlang-client",
"3.6.11"},
{inets,"INETS CXC 138 49","5.9.7"},
{mnesia,"MNESIA CXC 138 12","4.11"},
{compiler,"ERTS CXC 138 10","4.9.4"},
{xmerl,"XML parser","1.3.5"},
{syntax_tools,"Syntax tools","1.6.12"},
{asn1,"The Erlang ASN1 compiler version 2.0.4","2.0.4"},
{sasl,"SASL CXC 138 11","2.3.4"},
{stdlib,"ERTS CXC 138 10","1.19.4"},
{kernel,"ERTS CXC 138 10","2.16.4"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R16B03 (erts-5.10.4) [source] [64-bit] [smp:4:4] [async-threads:64] [kernel-poll:true]\n"},
{memory,
[{connection_readers,0},
{connection_writers,0},
{connection_channels,0},
{connection_other,6856},
{queue_procs,145160},
{queue_slave_procs,0},
{plugins,1959248},
{other_proc,22328920},
{metrics,160112},
{mgmt_db,655320},
{mnesia,83952},
{other_ets,2355800},
{binary,96920},
{msg_index,47352},
{code,27101161},
{atom,992409},
{other_system,31074022},
{total,87007232}]},
{alarms,[]},
{listeners,[{clustering,25672,"::"},{amqp,5672,"::"},{http,15672,"::"}]},
{vm_memory_calculation_strategy,rss},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,3343646720},
{disk_free_limit,50000000},
{disk_free,56257699840},
{file_descriptors,
[{total_limit,924},{total_used,2},{sockets_limit,829},{sockets_used,0}]},
{processes,[{limit,1048576},{used,351}]},
{run_queue,0},
{uptime,34537},
{kernel,{net_ticktime,60}}]
scrapyd daemon running status on Instance1 (192.168.0.101):
scrapyd
2017-09-11T06:16:07+0600 [-] Loading /home/mtaziz/.virtualenvs/onescience_dist_env/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:16:07+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:16:07+0600 [-] Loaded.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/onescience_dist_env/bin/python 2.7.6) starting up.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:16:07+0600 [-] Site starting on 6800
2017-09-11T06:16:07+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7f5e265c77a0>
2017-09-11T06:16:07+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
scrapyd daemon running status on Instance2 (192.168.0.106):
scrapyd
2017-09-11T06:09:28+0600 [-] Loading /home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:09:28+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:09:28+0600 [-] Loaded.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/scrapydevenv/bin/python 2.7.6) starting up.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:09:28+0600 [-] Site starting on 6800
2017-09-11T06:09:28+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7fbe6eaeac20>
2017-09-11T06:09:28+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
worker1 logs
After updating the code for the RabbitMQ server settings following the suggestion made by @Tarun Lalwani:
The suggestion was to use the RabbitMQ server IP, 192.168.0.101:5672, instead of 127.0.0.1:5672. After I updated it as suggested by Tarun Lalwani, I got the new problems shown below.
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tester2_fda_trial20)
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tester2_fda_trial20.spiders', 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['tester2_fda_trial20.spiders'], 'BOT_NAME': 'tester2_fda_trial20', 'FEED_URI': 'file:///var/lib/scrapyd/items/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.jl', 'SCHEDULER': 'tester2_fda_trial20.rabbitmq.scheduler.Scheduler', 'TELNETCONSOLE_ENABLED': False, 'LOG_FILE': '/var/log/scrapyd/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.log'}
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled item pipelines:
['tester2_fda_trial20.pipelines.FdaTrial20Pipeline',
'tester2_fda_trial20.mongodb.scrapy_mongodb.MongoDBPipeline']
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider opened
2017-09-11 15:49:18 [pika.adapters.base_connection] INFO: Connecting to 192.168.0.101:5672
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Created channel=1
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Closing spider (shutdown)
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [pika.channel] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.close_spider of <scrapy.extensions.feedexport.FeedExporter object at 0x7f94878b8c50>>
Traceback (most recent call last):
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 201, in close_spider
slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_closed of <Tester2Fda_Trial20Spider 'tester2_fda_trial20' at 0x7f9484f897d0>>
Traceback (most recent call last):
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/tmp/user/1000/tester2_fda_trial20-10-d4Req9.egg/tester2_fda_trial20/spiders/tester2_fda_trial20.py", line 28, in spider_closed
AttributeError: 'Tester2Fda_Trial20Spider' object has no attribute 'statstask'
2017-09-11 15:49:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2017, 9, 11, 9, 49, 18, 159896),
'log_count/ERROR': 2,
'log_count/INFO': 10}
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider closed (shutdown)
2017-09-11 15:49:18 [twisted] CRITICAL: Unhandled error in Deferred:
2017-09-11 15:49:18 [twisted] CRITICAL:
Traceback (most recent call last):
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 95, in crawl
six.reraise(*exc_info)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 79, in crawl
yield self.engine.open_spider(self.spider, start_requests)
OperationFailure: command SON([('saslStart', 1), ('mechanism', 'SCRAM-SHA-1'), ('payload', Binary('n,,n=tariq,r=MjY5OTQ0OTYwMjA4', 0)), ('autoAuthorize', 1)]) on namespace admin.$cmd failed: Authentication failed.
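Before looking at the pipeline code below, it may be worth confirming from the worker1 machine that the MongoDB credentials in the settings shown earlier (MONGODB_USER, MONGODB_PASSWORD, MONGODB_URI) actually authenticate, since the OperationFailure above is a plain authentication failure against the admin database. The following is only a hedged standalone check with pymongo, using the values from those settings; adjust them to your real setup:
# Standalone MongoDB authentication check (sketch; values copied from the settings above).
from urllib import quote  # Python 2, as in the pipeline; use urllib.parse.quote on Python 3

from pymongo import MongoClient
from pymongo.errors import OperationFailure

user = 'tariq'
password = 'toor'
host = 'localhost:27017'  # MONGODB_URI as configured; on Instance2 this points at Instance2, not at the MongoDB on Instance1

uri = 'mongodb://%s:%s@%s/admin' % (user, quote(password), host)
try:
    MongoClient(uri).admin.command('ping')  # forces connection and authentication
    print('authentication OK')
except OperationFailure as exc:
    print('authentication failed: %s' % exc)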
MongoDBPipeline
# coding:utf-8
import datetime

from pymongo import errors
from pymongo.mongo_client import MongoClient
from pymongo.mongo_replica_set_client import MongoReplicaSetClient
from pymongo.read_preferences import ReadPreference

from scrapy.exporters import BaseItemExporter

try:
    from urllib.parse import quote
except:
    from urllib import quote


def not_set(string):
    """ Check if a string is None or ''
    :returns: bool - True if the string is empty
    """
    if string is None:
        return True
    elif string == '':
        return True
    return False


class MongoDBPipeline(BaseItemExporter):
    """ MongoDB pipeline class """
    # Default options
    config = {
        'uri': 'mongodb://localhost:27017',
        'fsync': False,
        'write_concern': 0,
        'database': 'scrapy-mongodb',
        'collection': 'items',
        'replica_set': None,
        'buffer': None,
        'append_timestamp': False,
        'sharded': False
    }

    # Needed for sending acknowledgement signals to RabbitMQ for all persisted items
    queue = None
    acked_signals = []

    # Item buffer
    item_buffer = dict()

    def load_spider(self, spider):
        self.crawler = spider.crawler
        self.settings = spider.settings
        self.queue = self.crawler.engine.slot.scheduler.queue

    def open_spider(self, spider):
        self.load_spider(spider)

        # Configure the connection
        self.configure()
        self.spidername = spider.name
        self.config['uri'] = 'mongodb://' + self.config['username'] + ':' + quote(self.config['password']) + '@' + self.config['uri'] + '/admin'
        self.shardedcolls = []

        if self.config['replica_set'] is not None:
            self.connection = MongoReplicaSetClient(
                self.config['uri'],
                replicaSet=self.config['replica_set'],
                w=self.config['write_concern'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY_PREFERRED)
        else:
            # Connecting to a stand alone MongoDB
            self.connection = MongoClient(
                self.config['uri'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY)

        # Set up the collection
        self.database = self.connection[spider.name]

        # Autoshard the DB
        if self.config['sharded']:
            db_statuses = self.connection['config']['databases'].find({})
            partitioned = []
            notpartitioned = []
            for status in db_statuses:
                if status['partitioned']:
                    partitioned.append(status['_id'])
                else:
                    notpartitioned.append(status['_id'])
            if spider.name in notpartitioned or spider.name not in partitioned:
                try:
                    self.connection.admin.command('enableSharding', spider.name)
                except errors.OperationFailure:
                    pass
            else:
                collections = self.connection['config']['collections'].find({})
                for coll in collections:
                    if (spider.name + '.') in coll['_id']:
                        if coll['dropped'] is not True:
                            if coll['_id'].index(spider.name + '.') == 0:
                                self.shardedcolls.append(coll['_id'][coll['_id'].index('.') + 1:])

    def configure(self):
        """ Configure the MongoDB connection """
        # Set all regular options
        options = [
            ('uri', 'MONGODB_URI'),
            ('fsync', 'MONGODB_FSYNC'),
            ('write_concern', 'MONGODB_REPLICA_SET_W'),
            ('database', 'MONGODB_DATABASE'),
            ('collection', 'MONGODB_COLLECTION'),
            ('replica_set', 'MONGODB_REPLICA_SET'),
            ('buffer', 'MONGODB_BUFFER_DATA'),
            ('append_timestamp', 'MONGODB_ADD_TIMESTAMP'),
            ('sharded', 'MONGODB_SHARDED'),
            ('username', 'MONGODB_USER'),
            ('password', 'MONGODB_PASSWORD')
        ]

        for key, setting in options:
            if not not_set(self.settings[setting]):
                self.config[key] = self.settings[setting]

    def process_item(self, item, spider):
        """ Process the item and add it to MongoDB
        :type item: Item object
        :param item: The item to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        item_name = item.__class__.__name__

        # If we are working with a sharded DB, the collection will also be sharded
        if self.config['sharded']:
            if item_name not in self.shardedcolls:
                try:
                    self.connection.admin.command('shardCollection', '%s.%s' % (self.spidername, item_name), key={'_id': "hashed"})
                    self.shardedcolls.append(item_name)
                except errors.OperationFailure:
                    self.shardedcolls.append(item_name)

        itemtoinsert = dict(self._get_serialized_fields(item))

        if self.config['buffer']:
            if item_name not in self.item_buffer:
                self.item_buffer[item_name] = []
                self.item_buffer[item_name].append([])
                self.item_buffer[item_name].append(0)

            self.item_buffer[item_name][1] += 1

            if self.config['append_timestamp']:
                itemtoinsert['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}

            self.item_buffer[item_name][0].append(itemtoinsert)

            if self.item_buffer[item_name][1] == self.config['buffer']:
                self.item_buffer[item_name][1] = 0
                self.insert_item(self.item_buffer[item_name][0], spider, item_name)

            return item

        self.insert_item(itemtoinsert, spider, item_name)
        return item

    def close_spider(self, spider):
        """ Method called when the spider is closed
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: None
        """
        for key in self.item_buffer:
            if self.item_buffer[key][0]:
                self.insert_item(self.item_buffer[key][0], spider, key)

    def insert_item(self, item, spider, item_name):
        """ Process the item and add it to MongoDB
        :type item: (Item object) or [(Item object)]
        :param item: The item(s) to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        self.collection = self.database[item_name]

        if not isinstance(item, list):
            if self.config['append_timestamp']:
                item['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}
            ack_signal = item['ack_signal']
            item.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            if ack_signal not in self.acked_signals:
                self.queue.acknowledge(ack_signal)
                self.acked_signals.append(ack_signal)
        else:
            signals = []
            for eachitem in item:
                signals.append(eachitem['ack_signal'])
                eachitem.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            del item[:]
            for ack_signal in signals:
                if ack_signal not in self.acked_signals:
                    self.queue.acknowledge(ack_signal)
                    self.acked_signals.append(ack_signal)
To sum up, I believe the problem lies with the scrapyd daemons running on the two instances: somehow the scraper (worker1) cannot access them. I could not figure it out, and I did not find any related use cases on Stack Overflow.
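As a quick sanity check of that assumption, both daemons expose the listprojects.json and daemonstatus.json endpoints listed in the scrapyd.conf files above, so they can be polled from the Django host. The following is only a sketch using python-requests, with the IPs taken from the configuration above; it is not part of the original setup:
# Reachability check for both Scrapyd daemons (sketch; IPs from the settings above).
import requests

for host in ['http://192.168.0.101:6800', 'http://192.168.0.106:6800']:
    try:
        status = requests.get(host + '/daemonstatus.json', timeout=5).json()
        projects = requests.get(host + '/listprojects.json', timeout=5).json()
        print(host, status, projects.get('projects'))
    except requests.RequestException as exc:
        print(host, 'unreachable:', exc)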
Any help is highly appreciated in this regard. Thank you in advance!

How can I use scrapy shell with a username and password on a URL (a site that requires login)?

I want to scrape a website that requires login and check whether my XPath is right or wrong using scrapy shell in the Python Scrapy framework, like this:
C:\Users\Ranvijay.Sachan>scrapy shell https://www.google.co.in/?gfe_rd=cr&ei=mIl8V6LovC8gegtYHYDg&gws_rd=ssl
:0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
2014-12-01 21:00:04-0700 [scrapy] INFO: Scrapy 0.24.2 started (bot: scrapybot)
2014-12-01 21:00:04-0700 [scrapy] INFO: Optional features available: ssl, http11
2014-12-01 21:00:04-0700 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-12-01 21:00:05-0700 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-01 21:00:05-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-01 21:00:05-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-01 21:00:05-0700 [scrapy] INFO: Enabled item pipelines:
2014-12-01 21:00:05-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-01 21:00:05-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
2014-12-01 21:00:05-0700 [default] INFO: Spider opened
2014-12-01 21:00:06-0700 [default] DEBUG: Crawled (200) <GET https://www.google.co.in/?gfe_rd=cr> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x01B71910>
[s] item {}
[s] request <GET https://www.google.co.in/?gfe_rd=cr>
[s] response <200 https://www.google.co.in/?gfe_rd=cr>
[s] settings <scrapy.settings.Settings object at 0x023CBC90>
[s] spider <Spider 'default' at 0x29402f0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>> response.xpath("//div[@id='_eEe']/text()").extract()
[u'Google.co.in offered in: ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ']
>>>
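There is no login flag on scrapy shell itself, but once the shell is open you can submit the site's login form and then test XPaths against the authenticated response. The snippet below is only a sketch of that workflow; the login URL and the form field names are placeholders you would read from the target site's login form:
# Run `scrapy shell https://example.com/login` first, then inside the shell:
from scrapy.http import FormRequest

# Build a POST request from the login form of the page just fetched, filling in the credentials.
req = FormRequest.from_response(response, formdata={'username': 'my_user', 'password': 'my_pass'})
fetch(req)   # submits the form; the shell keeps the session cookies for later requests

# response is now the page returned after logging in, so XPaths can be tested:
fetch('https://example.com/members/page')
response.xpath("//div[@id='some_id']/text()").extract()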

Can't get callback method in Scrapy rule to trigger

I am having some problems getting CrawlSpider rules working properly in Scrapy 0.22.2 for Python 2.7.3. It seems to me that no matter what I do, the callback method specified in my rule is never triggered. I have tried setting my rules to allow everything, but at no point is the print method inside my callback method (parse_units) ever called. I've made sure not to try to override the parse method for this purpose, as this seems to be a common mistake.
The starting URL is of the form
http://training.gov.au/Training/Details/CHC40708?tableUnits-page=1&pageSizeKey=Training_Details_tableUnits&pageSize=100
The links to the individual pages I want the crawler to follow through and parse look like this
http://training.gov.au/Training/Details/BSBWOR204A
Here is my python code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
import urllib2
import os

qualification_code = raw_input("enter the qualification code\n")


class trainingGovSpider(CrawlSpider):
    name = "trainingGov"
    allowed_domains = ["training.gov.au"]
    # refine this to accept user input
    start_urls = ["http://training.gov.au/Training/Details/" + qualification_code + "?tableUnits-page=1&pageSizeKey=Training_Details_tableUnits&pageSize=100"]
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//a[contains(@title, "View details for unit code")]')), callback='parse_units')]

    def parse(self, response):
        sel = Selector(response)
        # first get the unit outline
        qual_outline = sel.xpath('//a[contains(@title, "Download Qualification in PDF format.")]')
        qual_outline_link = qual_outline.xpath('@href').extract()[0]
        url = "http://training.gov.au" + qual_outline_link
        # get url to qualification outline
        print url
        # next, get link elements which point to each individual unit in the qualification
        sites = sel.xpath('//a[contains(@title, "View details for unit code")]')
        for site in sites:
            title = site.xpath('text()').extract()
            # will need to follow each link
            link = site.xpath('@href').extract()[0]
            # will need to combine with http://training.gov.au
            print link

    # this should allow all individual unit links in a qualification to be parsed, but it isn't being called
    def parse_units(self, response):
        print 'parse_unit called'
        hxs = HtmlXPathSelector(response)
        current_status = hxs.xpath('text()').extract()
        # current_status = hxs.xpath('/html/body/div/div/div/div/div/div[@class="outer"]/div[@class="fieldset"]/div[@class="display-row"]/div[@class="display-row"]/div[@class="display-field"]/span[re:test(., "Current", "i")]').extract()
        print current_status
Log file as follows:
scrapy crawl trainingGov
enter the qualification code
CHC52212
2014-04-17 15:30:05+0800 [scrapy] INFO: Scrapy 0.22.2 started (bot: tutorial)
2014-04-17 15:30:05+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-04-17 15:30:05+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2014-04-17 15:30:05+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-04-17 15:30:05+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-04-17 15:30:05+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-04-17 15:30:05+0800 [scrapy] INFO: Enabled item pipelines:
2014-04-17 15:30:05+0800 [trainingGov] INFO: Spider opened
2014-04-17 15:30:05+0800 [trainingGov] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-04-17 15:30:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
2014-04-17 15:30:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2014-04-17 15:30:06+0800 [trainingGov] DEBUG: Crawled (200) <GET http://training.gov.au/Training/Details/CHC52212?tableUnits-page=1&pageSizeKey=Training_Details_tableUnits&pageSize=100> (referer: None)
http://training.gov.au/TrainingComponentFiles/CHC08/CHC52212_R1.pdf
/Training/Details/BSBSUS501A
/Training/Details/BSBWOR403A
/Training/Details/CHCAC317A
/Training/Details/CHCAC318B
/Training/Details/CHCAC416A
/Training/Details/CHCAC417A
/Training/Details/CHCAC507E
/Training/Details/CHCAD402D
/Training/Details/CHCAD504B
/Training/Details/CHCADMIN508B
/Training/Details/CHCAL523D
/Training/Details/CHCAOD402B
/Training/Details/CHCCD401E
/Training/Details/CHCCD412B
/Training/Details/CHCCD516B
/Training/Details/CHCCH301C
/Training/Details/CHCCH427B
/Training/Details/CHCCHILD401B
/Training/Details/CHCCHILD505B
/Training/Details/CHCCOM403A
/Training/Details/CHCCOM504B
/Training/Details/CHCCS426B
/Training/Details/CHCCS427B
/Training/Details/CHCCS502C
/Training/Details/CHCCS503B
/Training/Details/CHCCS505B
/Training/Details/CHCCS512C
/Training/Details/CHCCS513C
/Training/Details/CHCDIS301C
/Training/Details/CHCDIS410A
/Training/Details/CHCDIS507C
/Training/Details/CHCES311B
/Training/Details/CHCES415A
/Training/Details/CHCES502C
/Training/Details/CHCES511B
/Training/Details/CHCHC401C
/Training/Details/CHCICS409A
/Training/Details/CHCINF407D
/Training/Details/CHCINF408C
/Training/Details/CHCINF505D
/Training/Details/CHCMH301C
/Training/Details/CHCMH402B
/Training/Details/CHCMH411A
/Training/Details/CHCNET501C
/Training/Details/CHCNET503D
/Training/Details/CHCORG405E
/Training/Details/CHCORG406C
/Training/Details/CHCORG423C
/Training/Details/CHCORG428A
/Training/Details/CHCORG501B
/Training/Details/CHCORG506E
/Training/Details/CHCORG525D
/Training/Details/CHCORG607D
/Training/Details/CHCORG610B
/Training/Details/CHCORG611C
/Training/Details/CHCPA402B
/Training/Details/CHCPR510B
/Training/Details/CHCSD512C
/Training/Details/CHCSW401A
/Training/Details/CHCSW402B
/Training/Details/CHCYTH401B
/Training/Details/CHCYTH402C
/Training/Details/CHCYTH506B
/Training/Details/HLTFA311A
/Training/Details/HLTFA412A
/Training/Details/HLTHIR403C
/Training/Details/HLTHIR404D
/Training/Details/HLTWHS401A
/Training/Details/PSPGOV517A
/Training/Details/PSPMNGT605B
/Training/Details/SISCCRD302A
/Training/Details/SRXGOV004B
/Training/Details/TAEDEL402A
2014-04-17 15:30:06+0800 [trainingGov] INFO: Closing spider (finished)
2014-04-17 15:30:06+0800 [trainingGov] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 310,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 17347,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 4, 17, 7, 30, 6, 96000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 4, 17, 7, 30, 5, 522000)}
2014-04-17 15:30:06+0800 [trainingGov] INFO: Spider closed (finished)
Thank you!
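For comparison, a minimal CrawlSpider wiring in the same Scrapy 0.22-era style is sketched below; the class name and XPaths are illustrative placeholders, not a drop-in fix. One detail from the Scrapy documentation worth keeping in mind is that CrawlSpider uses parse internally to apply its rules, so per-page handling of the start URL normally goes into parse_start_url rather than into an overridden parse:
# Minimal CrawlSpider reference sketch (Scrapy 0.22-style imports; names and XPaths are placeholders).
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class UnitsReferenceSpider(CrawlSpider):
    name = 'units_reference'
    allowed_domains = ['training.gov.au']
    start_urls = ['http://training.gov.au/Training/Details/CHC52212?tableUnits-page=1&pageSizeKey=Training_Details_tableUnits&pageSize=100']

    # Every link matched by the extractor is requested and handed to parse_units.
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[contains(@title, "View details for unit code")]',)),
             callback='parse_units'),
    ]

    def parse_start_url(self, response):
        # Runs on the start URL; CrawlSpider's own parse() then applies the rules.
        self.log('start page fetched: %s' % response.url)

    def parse_units(self, response):
        self.log('parse_units called for %s' % response.url)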

rails 4 flash not persisting across redirect "ActionDispatch::Request::Session -- not yet loaded"

I am having a weird issue where a flash message is not persisting across redirects in a Rails 4.0.2 application. I suspect that the problem has something to do with the session not yet being loaded when I try to access the flash hash as I am logging the session and getting the following output:
#<ActionDispatch::Request::Session:0x7ffda10835a8 not yet loaded>
I am trying to update an asset in the 'assets_controller' and then redirect to the 'home' action of the 'dashboard_controller' if the asset was successfully updated. My code is as follows:
AssetsController:
def update
  @asset = Asset.update_asset(asset_parameters, params[:id])
  if @asset.errors.any?
    flash.now[:error] = "There were some errors"
    render 'edit'
  else
    flash[:success] = "Asset updated!"
    logger.debug("flash in assets: #{flash.inspect}")
    redirect_to root_path()
  end
end
DashboardController:
def home
  logger.debug("session #{session.inspect}")
  logger.debug("flash in home #{flash.inspect}")
  res = Bid.bids_query({:generation => 'current'})
  @bids = []
  @staging_jobs = []
  res.each do |bid|
    bid.staging_job_id ? @staging_jobs << bid : @bids << bid
  end
end
Log Output:
Started PATCH "/assets/19" for 127.0.0.1 at 2014-03-16 21:45:53 -0700
Processing by AssetsController#update as HTML
flash in assets: #<ActionDispatch::Flash::FlashHash:0x007ff7abdcfc30 @discard=#<Set: {}>, @flashes={:success=>"Asset updated!"}, @now=nil>
Redirected by /Users/louism2/rails_projects/rosebank-b/app/controllers/assets_controller.rb:31:in `update'
Redirected to http://localhost:3000/
Completed 302 Found in 14ms (ActiveRecord: 2.4ms)
Started GET "/" for 127.0.0.1 at 2014-03-16 21:45:53 -0700
Processing by DashboardController#home as HTML
session #<ActionDispatch::Request::Session:0x7ffda10835a8 not yet loaded>
flash in home #<ActionDispatch::Flash::FlashHash:0x007ff7abe743e8 @discard=#<Set: {}>, @flashes={}, @now=nil>
Completed 200 OK in 81ms (Views: 53.7ms | ActiveRecord: 2.6ms)
The issue had to do with the fact that my model was named "assets", and that was somehow conflicting with the asset pipeline. The trick was to prefix the asset pipeline with an identifier other than "assets", which you can specify in your application.rb file with the following line:
config.assets.prefix = '/new_name_for_asset_pipeline'