Running Scrapy with a task queue - django

I built a web crawler with Scrapy and Django and put the CrawlerRunner code into a task queue. Locally everything works fine, but the tasks fail when run on the server. I suspect multiple threads are causing the problem.
This is the task code; I'm using huey for the tasks:
from huey import crontab
from huey.contrib.djhuey import db_periodic_task, on_startup
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from apps.core.tasks import CRONTAB_PERIODS
from apps.scrapers.crawler1 import Crawler1
from apps.scrapers.crawler2 import Crawler2
from apps.scrapers.crawler3 import Crawler3
@on_startup(name="scrape_all__on_startup")
@db_periodic_task(crontab(**CRONTAB_PERIODS["every_10_minutes"]))
def scrape_all():
    configure_logging()
    settings = get_project_settings()
    runner = CrawlerRunner(settings=settings)
    runner.crawl(Crawler1)
    runner.crawl(Crawler2)
    runner.crawl(Crawler3)
    defer = runner.join()
    defer.addBoth(lambda _: reactor.stop())
    reactor.run()
and this is the first error I get from sentry.io (it's truncated):
Unhandled Error
Traceback (most recent call last):
File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/base.py", line 501, in fireEvent
DeferredList(beforeResults).addCallback(self._continueFiring)
File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/defer.py", line 532, in addCallback
return self.addCallbacks(callback, callbackArgs=args, callbackKeywords=kwargs)
File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/defer.py", line 512, in addCallbacks
self._runCallbacks()
File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
current.result = callback( # type: ignore[misc]
--- <exception caught here> ---
File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/base.py", line 513, in _continueFiring
callable(*args, **kwargs)
File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/base.py", line 1314, in _reallyStartRunning
self._handle...
The task is set to run every 10 minutes; on the second run I get this error from sentry.io:
ReactorNotRestartable: null
File "huey/api.py", line 379, in _execute
task_value = task.execute()
File "huey/api.py", line 772, in execute
return func(*args, **kwargs)
File "huey/contrib/djhuey/__init__.py", line 135, in inner
return fn(*args, **kwargs)
File "apps/series/tasks.py", line 31, in scrape_all
reactor.run()
File "twisted/internet/base.py", line 1317, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "twisted/internet/base.py", line 1299, in startRunning
ReactorBase.startRunning(cast(ReactorBase, self))
File "twisted/internet/base.py", line 843, in startRunning
raise error.ReactorNotRestartable()
My assumption: on the first run the Twisted reactor didn't shut itself down, and after 10 minutes huey tries to start a Twisted reactor again and fails.
I'm not proficient with multithreading, but I assume the task runner and Twisted run on different threads and can't communicate with each other.
Any advice?
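Not an answer from the thread, but a common workaround sketch: since Twisted's reactor can never be restarted inside the same process, each periodic run can launch the crawl in a fresh child process, letting the reactor start and die there. The code below is a minimal, hypothetical illustration; `CRAWL_CODE` is a stand-in for the CrawlerRunner code above (or a `scrapy crawl ...` command):

```python
import subprocess
import sys

# Stand-in for the real crawl: in practice this would be a small script (or a
# `scrapy crawl ...` command) holding the CrawlerRunner/reactor code above.
CRAWL_CODE = "print('crawl finished')"

def scrape_all_in_child(code=CRAWL_CODE, timeout=600):
    """Run the crawl in a fresh Python process.

    The Twisted reactor starts and stops inside the child, so the huey worker
    process never touches a reactor itself and ReactorNotRestartable cannot
    occur on the second run.
    """
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, timeout=timeout)
    return result.returncode  # 0 means the crawl process exited cleanly
```

The huey task body would then just call `scrape_all_in_child()` every 10 minutes.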

Related

Scrapy spider not working on Django after implementing WebSockets with Channels (cannot call it from an async context)

I'm opening a new question as I'm having an issue with Scrapy and Channels in a Django application, and I would appreciate it if someone could guide me in the right direction.
The reason I'm using Channels is that I want to retrieve the crawl statuses from the Scrapyd API in real time, without having to poll with setInterval all the time, as this is supposed to become a SaaS service that could potentially be used by many users.
I've implemented Channels correctly; if I run:
python manage.py runserver
I can correctly see that the system is now using ASGI:
System check identified no issues (0 silenced).
September 01, 2020 - 15:12:33
Django version 3.0.7, using settings 'seotoolkit.settings'
Starting ASGI/Channels version 2.4.0 development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
Also, the client and server connect correctly via the WebSocket:
WebSocket HANDSHAKING /crawler/22/ [127.0.0.1:50264]
connected {'type': 'websocket.connect'}
WebSocket CONNECT /crawler/22/ [127.0.0.1:50264]
So far so good. The problem comes when I run Scrapy via the Scrapyd API:
2020-09-01 15:31:25 [scrapy.core.scraper] ERROR: Error processing {'url': 'https://www.example.com'}
Traceback (most recent call last):
File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/scrapy/utils/defer.py", line 157, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "/private/var/folders/qz/ytk7wml54zd6rssxygt512hc0000gn/T/crawler-1597767314-spxv81dy.egg/webspider/pipelines.py", line 67, in process_item
File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/db/models/manager.py", line 82, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/db/models/query.py", line 411, in get
num = len(clone)
File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/db/models/query.py", line 258, in __len__
self._fetch_all()
File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/db/models/query.py", line 1261, in _fetch_all
self._result_cache = list(self._iterable_class(self))
File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/db/models/query.py", line 57, in __iter__
results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 1150, in execute_sql
cursor = self.connection.cursor()
File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/utils/asyncio.py", line 24, in inner
raise SynchronousOnlyOperation(message)
django.core.exceptions.SynchronousOnlyOperation: You cannot call this from an async context - use a thread or sync_to_async.
I think the error message is quite clear: "You cannot call this from an async context - use a thread or sync_to_async". I guess that enabling ASGI creates a conflict with the Scrapy library that prevents it from working correctly.
Unfortunately I cannot understand the reason behind this, nor where I should use a "thread or sync_to_async" as suggested.
Note that WebSockets are only used to check crawl status and nothing else.
Can anyone explain the reason behind this incompatibility and give me some hints on how to overcome this obstacle? I spent a lot of hours looking for an answer but could not find one.
Thanks a lot.
You can solve this error by going to your pipelines.py file and importing sync_to_async from asgiref.sync:
from asgiref.sync import sync_to_async
After importing it, use sync_to_async as a decorator on the function that stores data to the database.
For instance
from itemadapter import ItemAdapter
from crawler.models import Movie
from asgiref.sync import sync_to_async

class MovieSpiderPipeline:
    @sync_to_async
    def process_item(self, item, spider):
        movie = Movie(**item)
        movie.save()
        return item
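For intuition, the mechanism sync_to_async relies on can be sketched with the standard library alone; this is a simplified stand-in using asyncio.to_thread, not asgiref's actual implementation, and the names here are mine:

```python
import asyncio
import threading

def blocking_save(item):
    # stands in for the synchronous ORM call (movie.save() above); returns
    # the item plus whether it ran outside the main (event-loop) thread
    return item, threading.current_thread() is not threading.main_thread()

async def process_item(item):
    # push the blocking call onto a worker thread so the event loop stays
    # free, which is roughly what the sync_to_async decorator arranges
    return await asyncio.to_thread(blocking_save, item)

result = asyncio.run(process_item({"title": "example"}))
```

The decorated pipeline method above does the same thing in reverse direction: Scrapy awaits it, and asgiref runs the synchronous body on a thread where the Django ORM is allowed.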

Import Python file which contains pySpark functions into Django app

I'm trying to import a Python file, load_model.py, which contains my custom PySpark API, into views.py of my Django app, but I get an error and can't figure out how to solve it.
I import load_model.py with a simple:
import load_model as lm
My load_model.py contains the following code (this is just part of the code):
import findspark
# findspark.init('/home/student/spark-2.1.1-bin-hadoop2.7')
findspark.init('/Users/fabiomagarelli/spark-2.4.3-bin-hadoop2.7')
from pyspark.sql import SparkSession
from pyspark.ml.regression import RandomForestRegressionModel
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import Row
from collections import OrderedDict
spark = SparkSession.builder.appName('RForest_Regression').getOrCreate()
sc = spark.sparkContext
model = RandomForestRegressionModel.load('model/')
def predict(df):
    predictions = model.transform(df)
    return int(predictions.select('prediction').collect()[0].prediction)
# etc... ... ...
When I launch python manage.py runserver on my command line, I get this error log:
19/07/20 07:22:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "manage.py", line 21, in <module>
main()
File "manage.py", line 17, in main
execute_from_command_line(sys.argv)
File "/anaconda3/lib/python3.7/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
utility.execute()
File "/anaconda3/lib/python3.7/site-packages/django/core/management/__init__.py", line 375, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/anaconda3/lib/python3.7/site-packages/django/core/management/base.py", line 323, in run_from_argv
self.execute(*args, **cmd_options)
File "/anaconda3/lib/python3.7/site-packages/django/core/management/commands/runserver.py", line 60, in execute
super().execute(*args, **options)
File "/anaconda3/lib/python3.7/site-packages/django/core/management/base.py", line 364, in execute
output = self.handle(*args, **options)
File "/anaconda3/lib/python3.7/site-packages/django/core/management/commands/runserver.py", line 95, in handle
self.run(**options)
File "/anaconda3/lib/python3.7/site-packages/django/core/management/commands/runserver.py", line 102, in run
autoreload.run_with_reloader(self.inner_run, **options)
File "/anaconda3/lib/python3.7/site-packages/django/utils/autoreload.py", line 585, in run_with_reloader
start_django(reloader, main_func, *args, **kwargs)
File "/anaconda3/lib/python3.7/site-packages/django/utils/autoreload.py", line 570, in start_django
reloader.run(django_main_thread)
File "/anaconda3/lib/python3.7/site-packages/django/utils/autoreload.py", line 288, in run
self.run_loop()
File "/anaconda3/lib/python3.7/site-packages/django/utils/autoreload.py", line 294, in run_loop
next(ticker)
File "/anaconda3/lib/python3.7/site-packages/django/utils/autoreload.py", line 334, in tick
for filepath, mtime in self.snapshot_files():
File "/anaconda3/lib/python3.7/site-packages/django/utils/autoreload.py", line 350, in snapshot_files
for file in self.watched_files():
File "/anaconda3/lib/python3.7/site-packages/django/utils/autoreload.py", line 249, in watched_files
yield from iter_all_python_module_files()
File "/anaconda3/lib/python3.7/site-packages/django/utils/autoreload.py", line 101, in iter_all_python_module_files
modules_view = sorted(list(sys.modules.items()), key=lambda i: i[0])
RuntimeError: dictionary changed size during iteration
Exception ignored in: <function JavaWrapper.__del__ at 0x11d2de6a8>
Traceback (most recent call last):
File "/Users/fabiomagarelli/spark-2.4.3-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 41, in __del__
File "/Users/fabiomagarelli/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 2000, in detach
File "/Users/fabiomagarelli/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1298, in _detach
File "/Users/fabiomagarelli/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 628, in _garbage_collect_object
File "/anaconda3/lib/python3.7/logging/__init__.py", line 1370, in debug
File "/anaconda3/lib/python3.7/logging/__init__.py", line 1626, in isEnabledFor
TypeError: 'NoneType' object is not callable
Exception ignored in: <function GatewayConnection.__init__.<locals>.<lambda> at 0x11da84d90>
Traceback (most recent call last):
File "/Users/fabiomagarelli/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1061, in <lambda>
File "/Users/fabiomagarelli/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 640, in _garbage_collect_connection
File "/Users/fabiomagarelli/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 487, in quiet_shutdown
File "/anaconda3/lib/python3.7/logging/__init__.py", line 1370, in debug
File "/anaconda3/lib/python3.7/logging/__init__.py", line 1626, in isEnabledFor
TypeError: 'NoneType' object is not callable
PySpark is installed on my computer; I was using it in my Jupyter notebook to fit the model, so I don't think the problem is that PySpark is not installed. Any suggestions?
I found some tutorials on how to deploy a PySpark ML model using Databricks, TensorFlow, etc., but they were all too complex for my limited PySpark knowledge and a project deadline in 4 weeks.
However, I found a workaround which consists in "deploying" the ML model in a Flask app and then calling it from my Django app (my project app). I think this may be very useful for someone facing the same problem. Maybe not best practice, but it works! That's why I'm going to explain each step:
1. Create a Flask Application
In the command line (in your virtual env if you have one), type: pip install flask.
Make a new folder (I call it 'static') and place in it the model folder obtained by saving the pySpark model (it contains other folders: data, metadata...)
Create a new folder for your Flask app (it can be in the parent folder of your Django app) and create a file in it named main.py (you can use any name, but this is the one used in the code below).
In main.py, copy-paste this:
from flask import Flask, request
import findspark
findspark.init('/home/student/spark-2.1.1-bin-hadoop2.7')
# various pySpark imports here...

app = Flask(__name__)
spark = SparkSession.builder.appName('RForest_Regression').getOrCreate()
sc = spark.sparkContext
# I'm using a RandomForest ML model, change it as appropriate
model = RandomForestRegressionModel.load('static/model/')

# define here all your functions to make a prediction (eventual argument cleaning...)

@app.route('/predict')
def predict():
    # this view is called when '127.0.0.1:5000/predict' is requested;
    # you can pass arguments by requesting '127.0.0.1:5000/predict?data=...'
    numbers = request.args.get('data')  # numbers = '...'
    return makePredictions(numbers)

def makePredictions(n):
    # your function here
Now, in the Django app, open views.py and add the function that requests the predictions from the Flask app:
from requests import get
from django.http import JsonResponse

# Send a request to the Flask app where the model is hosted
def getPredictions(request):
    try:
        data_to_predict = request.GET['data']
        url = 'http://127.0.0.1:5000/predict?data=%s' % data_to_predict
        response = get(url)
        return JsonResponse(response.text, safe=False)
    except Exception:
        print('ERROR getPredictions: no pySpark module, Flask app not running, or wrong arguments')
Then call the getPredictions function from JavaScript in your Django app (I haven't done that yet so I don't have a snippet, but it is working so far; I tested it by passing custom arguments).
Remember to run the Flask app and the Django app together to make it work:
cd into your Flask app folder (where you have the main.py file), then type: export FLASK_APP=main.py and flask run
then cd into your Django app (where you have the manage.py file) and type: python manage.py runserver
I hope this will be useful to someone and that my explanation is not too messy. I will appreciate any comments, suggestions and requests. :)

ThreadPoolExecutor fails when run with manage.py

# test.py
# python 3.4.5
import time
from concurrent.futures import ThreadPoolExecutor

def a():
    time.sleep(1)
    print("success")

executor = ThreadPoolExecutor(1)
executor.submit(a).result()
The above snippet works when run like
$ python test.py
success
But fails when run like
$ python manage.py shell < test.py
Traceback (most recent call last):
File "manage.py", line 22, in <module>
execute_from_command_line(sys.argv)
File "/var/www/cgi-bin/tracking/lib64/python3.4/site-packages/django/core/management/__init__.py", line 363, in execute_from_command_line
utility.execute()
File "/var/www/cgi-bin/tracking/lib64/python3.4/site-packages/django/core/management/__init__.py", line 355, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/var/www/cgi-bin/tracking/lib64/python3.4/site-packages/django/core/management/base.py", line 283, in run_from_argv
self.execute(*args, **cmd_options)
File "/var/www/cgi-bin/tracking/lib64/python3.4/site-packages/django/core/management/base.py", line 330, in execute
output = self.handle(*args, **options)
File "/var/www/cgi-bin/tracking/lib64/python3.4/site-packages/django/core/management/commands/shell.py", line 101, in handle
exec(sys.stdin.read())
File "<string>", line 11, in <module>
File "/usr/lib64/python3.4/concurrent/futures/_base.py", line 395, in result
return self.__get_result()
File "/usr/lib64/python3.4/concurrent/futures/_base.py", line 354, in __get_result
raise self._exception
File "/usr/lib64/python3.4/concurrent/futures/thread.py", line 54, in run
result = self.fn(*self.args, **self.kwargs)
File "<string>", line 7, in a
NameError: name 'time' is not defined
Which is really strange to me. What is it about running the script with the manage.py shell command that results in the time module being undefined in the function a?
Checking in the Django implementation (django/core/management/commands/shell.py line 83):
# Execute stdin if it has anything to read and exit.
# Not supported on Windows due to select.select() limitations.
if sys.platform != 'win32' and select.select([sys.stdin], [], [], 0)[0]:
    exec(sys.stdin.read())
    return
The developers did not pass a globals() dictionary to the exec() call. That means your import time and ThreadPoolExecutor imports land in the locals() dictionary of the handle() scope (in shell.py). Later, when a() runs, Python looks time up in a()'s locals() and then in the module's globals(), finds it in neither, and raises a NameError. You can see the same behaviour in this snippet:
command = """
import time
def b():
    time.sleep(1)
b()
"""

def a():
    exec(command)

a()
Now try changing exec(command) to exec(command, globals()): the import then lands in the globals() dictionary, which is exactly where b() looks for time.
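To make the difference concrete, here is a self-contained version of that experiment (the function names are mine, for illustration only):

```python
command = """
import time
def b():
    time.sleep(0)
b()
"""

def run_without_globals():
    # 'import time' lands in this function's locals(); when b() later runs,
    # it searches b's own locals and the *module* globals, misses both,
    # and raises NameError -- exactly what manage.py shell does.
    exec(command)

def run_with_globals():
    # passing globals() makes the import land in the module-global dict,
    # which is exactly where b() looks
    exec(command, globals())
```

Calling `run_without_globals()` raises `NameError: name 'time' is not defined`, while `run_with_globals()` completes normally.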
I also thought it might not be working because you did not set the DJANGO_SETTINGS_MODULE environment variable to your settings and call django.setup(), or add the path via sys.path.append('path/') (not sure about this).
But these two options can work like a charm:
Either you import the time module inside the function:
from concurrent.futures import ThreadPoolExecutor

def a():
    import time
    time.sleep(1)
    print("success")

executor = ThreadPoolExecutor(1)
executor.submit(a).result()
Or just import time at the beginning like you did, and use the module as a global one:
from concurrent.futures import ThreadPoolExecutor
import time

def a():
    global time
    time.sleep(1)
    print("success")

executor = ThreadPoolExecutor(1)
executor.submit(a).result()

How to deal with httplib.BadStatusLine: ''

I'm scraping some data off the web using Python, BeautifulSoup and Selenium. I am also using PyVirtualDisplay so that I do not need a display.
It works perfectly from my laptop, but when I run it from a server I get the following error:
httplib.BadStatusLine: ''
I got this the second time it scraped a page; now it happens every time. What is the issue?
EDIT
Code Added:
import requests, bs4
import csv
import re
import datetime
import time
import os
from contextlib import closing
from selenium import webdriver
from selenium.webdriver import Firefox # pip install selenium
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1500, 1200))
display.start()
url_base = "https://www.seek.com.au/jobs?page="
# open web browser and login
binary = FirefoxBinary('/home/firefox/firefox/firefox')
driver = webdriver.Firefox(firefox_binary=binary)
overlap = False
page = 0
while not overlap:
    page += 1
    driver.get(url_base+str(page))
    ...
And here is the traceback:
Traceback (most recent call last):
File "manage.py", line 22, in <module>
execute_from_command_line(sys.argv)
File "/var/www/matt/env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 367, in execute_from_command_line
utility.execute()
File "/var/www/matt/env/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 359, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/var/www/matt/env/local/lib/python2.7/site-packages/django/core/management/base.py", line 294, in run_from_argv
self.execute(*args, **cmd_options)
File "/var/www/matt/env/local/lib/python2.7/site-packages/django/core/management/base.py", line 345, in execute
output = self.handle(*args, **options)
File "/var/www/matt/matt/management/commands/mattv3.py", line 109, in handle
driver.get(url_base+str(page))
File "/var/www/matt/env/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 245, in get
self.execute(Command.GET, {'url': url})
File "/var/www/matt/env/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 231, in execute
response = self.command_executor.execute(driver_command, params)
File "/var/www/matt/env/local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 395, in execute
return self._request(command_info[0], url, body=data)
File "/var/www/matt/env/local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 426, in _request
resp = self._conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1136, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 453, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 417, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
I was running this on a very small server (512 MB RAM, 20 GB SSD). I've resized it upwards and it is now running fine. If someone could explain the issue to me, I would love to understand.
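For what it's worth, the symptoms (works locally, dies on a 512 MB server after a page or two) are consistent with the browser process being killed, e.g. by the kernel's OOM killer, leaving Selenium's HTTP connection to read an empty status line. Besides adding memory, a defensive pattern is to retry the failing call; the helper below is a generic sketch of my own, not code from the post:

```python
import time

def call_with_retries(fn, attempts=3, delay=0.0, retry_on=(Exception,)):
    """Call fn(); on one of the given exceptions, retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise          # give up and re-raise after the final attempt
            time.sleep(delay)  # brief pause before trying again

# hypothetical usage around the flaky page load (recreating the webdriver
# inside the retried callable would make this more robust):
#   call_with_retries(lambda: driver.get(url_base + str(page)),
#                     retry_on=(BadStatusLine,), delay=5.0)
```

This masks the crash rather than fixing it; on a box that small, more memory (or a lighter headless browser) is the real cure.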

ImportError being generated when trying to run django-celery worker process

I'm trying to integrate django-celery into an existing site and I'm coming up against an error that I can't seem to get fixed.
For context, I went through the Django first steps and the test project was successful, ie everything worked as it should.
Now, in my existing project, I can't get the celery worker running from the command line:
manage.py celery worker --loglevel=info --settings=myproject.settings.dev_settings
When I run that, I get the following stack trace and error:
Traceback (most recent call last):
File "C:\sites\corecrm\manage.py", line 10, in <module>
execute_from_command_line(sys.argv)
File "C:\Python27\lib\site-packages\django\core\management\__init__.py", line 453, in execute_from_command_line
utility.execute()
File "C:\Python27\lib\site-packages\django\core\management\__init__.py", line 392, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "C:\Python27\lib\site-packages\djcelery\management\commands\celery.py", line 22, in run_from_argv
['%s %s' % (argv[0], argv[1])] + argv[2:],
File "C:\Python27\lib\site-packages\celery\bin\celery.py", line 901, in execute_from_commandline
super(CeleryCommand, self).execute_from_commandline(argv)))
File "C:\Python27\lib\site-packages\celery\bin\base.py", line 187, in execute_from_commandline
return self.handle_argv(prog_name, argv[1:])
File "C:\Python27\lib\site-packages\celery\bin\celery.py", line 893, in handle_argv
return self.execute(command, argv)
File "C:\Python27\lib\site-packages\celery\bin\celery.py", line 868, in execute
return cls(app=self.app).run_from_argv(self.prog_name, argv)
File "C:\Python27\lib\site-packages\celery\bin\celery.py", line 148, in run_from_argv
return self(*args, **options)
File "C:\Python27\lib\site-packages\celery\bin\celery.py", line 118, in __call__
ret = self.run(*args, **kwargs)
File "C:\Python27\lib\site-packages\celery\bin\celery.py", line 220, in run
return self.target.run(*args, **kwargs)
File "C:\Python27\lib\site-packages\celery\bin\celeryd.py", line 153, in run
return self.app.Worker(**kwargs).run()
File "C:\Python27\lib\site-packages\celery\apps\worker.py", line 162, in run
self.app.loader.init_worker()
File "C:\Python27\lib\site-packages\celery\loaders\base.py", line 130, in init_worker
self.import_default_modules()
File "C:\Python27\lib\site-packages\djcelery\loaders.py", line 138, in import_default_modules
self.autodiscover()
File "C:\Python27\lib\site-packages\djcelery\loaders.py", line 141, in autodiscover
self.task_modules.update(mod.__name__ for mod in autodiscover() or ())
File "C:\Python27\lib\site-packages\djcelery\loaders.py", line 176, in autodiscover
for app in settings.INSTALLED_APPS])
File "C:\Python27\lib\site-packages\djcelery\loaders.py", line 195, in find_related_module
return importlib.import_module('%s.%s' % (app, related_name))
File "C:\Python27\lib\importlib\__init__.py", line 37, in import_module
__import__(name)
File "C:\sites\corecrm\people\tasks.py", line 15, in <module>
from people.models import Customer, CustomerCsvFile, CustomerToTag, get_customer_from_csv_row
File "C:\sites\corecrm\people\models.py", line 163, in <module>
UserProfile._meta.get_field_by_name('username')[0]._max_length = 75
File "C:\Python27\lib\site-packages\django\db\models\options.py", line 351, in get_field_by_name
cache = self.init_name_map()
File "C:\Python27\lib\site-packages\django\db\models\options.py", line 380, in init_name_map
for f, model in self.get_all_related_m2m_objects_with_model():
File "C:\Python27\lib\site-packages\django\db\models\options.py", line 469, in get_all_related_m2m_objects_with_model
cache = self._fill_related_many_to_many_cache()
File "C:\Python27\lib\site-packages\django\db\models\options.py", line 483, in _fill_related_many_to_many_cache
for klass in get_models(only_installed=False):
File "C:\Python27\lib\site-packages\django\db\models\loading.py", line 197, in get_models
self._populate()
File "C:\Python27\lib\site-packages\django\db\models\loading.py", line 75, in _populate
self.load_app(app_name)
File "C:\Python27\lib\site-packages\django\db\models\loading.py", line 96, in load_app
models = import_module('.models', app_name)
File "C:\Python27\lib\site-packages\django\utils\importlib.py", line 35, in import_module
__import__(name)
File "C:\sites\corecrm\booking\models.py", line 17, in <module>
from people.models import Customer, UserProfile
ImportError: cannot import name Customer
To try and work out what the booking/models.py script sees in people, I added the following at the start:
import people
print 'path: %s' % people.__path__
for item in dir(people):
    print item
and that gives me the following output:
path: ['C:\\sites\\corecrm\\people']
__builtins__
__doc__
__file__
__name__
__package__
__path__
path: ['C:\\sites\\corecrm\\people']
__builtins__
__doc__
__file__
__name__
__package__
__path__
however, when I run manage.py shell --settings=myproject.settings.dev_settings I get the following output:
path: ['C:\\sites\\corecrm\\people']
__builtins__
__doc__
__file__
__name__
__package__
__path__
path: ['C:\\sites\\corecrm\\people']
__builtins__
__doc__
__file__
__name__
__package__
__path__
models
As you can see, the models module is available at the end of the second listing for the shell command (and I've confirmed this is also the case for manage.py commands other than celery). How can I make sure that module is available at the same point when I run the celery command?
EDIT: I've now also set up this project on an Ubuntu VM and I'm getting the same error when I try to run the worker manage command. Any ideas? Anyone?
ANOTHER EDIT: I've pasted the code for booking/models.py and people/models.py at http://pastebin.com/fTVVBtB4
I'm pretty sure this line is your problem:
File "C:\sites\corecrm\people\models.py", line 163, in <module>
UserProfile._meta.get_field_by_name('username')[0]._max_length = 75
While you're still busy importing from people.models, this line (in particular get_field_by_name) forces Django to evaluate the model and set up all relationships between that model and its related models. This, in turn, forces an import of Customer from people.models while you're still busy importing that exact module. This is what results in the ImportError.
For a working solution you'll need to post your models.py.
Why does this error only occur with celery? I can't say for sure without more information, but my best guess is that Celery imports everything slightly differently (Django probably doesn't import Customer, CustomerCsvFile, CustomerToTag and get_customer_from_csv_row all at once) and that this exposes the bug in your code.
EDIT/SOLUTION:
I would remove this line:
UserProfile._meta.get_field_by_name('username')[0]._max_length = 75
And move it to the instance level, into the __init__ method:
class UserProfile(AbstractUser):
    def __init__(self, *args, **kwargs):
        self._meta.get_field_by_name('username')[0]._max_length = 75
        super(UserProfile, self).__init__(*args, **kwargs)
If the cause of the issue is indeed what I think it is, this will fix the circular import while providing the same functionality. If the max_length functionality gets broken somehow (most likely because internally a max_length validator is added to the CharField and _max_length is changed too late), I would instead override the complete username field in the __init__ method:
class UserProfile(AbstractUser):
    def __init__(self, *args, **kwargs):
        super(UserProfile, self).__init__(*args, **kwargs)
        self._meta.get_field_by_name('username')[0] = models.CharField(max_length=75, etc.)