How to schedule my crawler function in django periodically using celery? - django

I have a view, CrawlerHomeView, that creates a Task object from a form. Now I want to schedule this task periodically with Celery.
I want to run the crawl from CrawlerHomeView at the interval given by the task's search_frequency, after checking some of the task's fields.
Task Model
class Task(models.Model):
    INITIAL = 0
    STARTED = 1
    COMPLETED = 2
    ERROR = 3  # assumed value: ERROR is used below but was not defined in the original
    task_status = (
        (INITIAL, 'running'),
        (STARTED, 'running'),
        (COMPLETED, 'completed'),
        (ERROR, 'error')
    )
    FREQUENCY = (
        ('1', '1 hrs'),
        ('2', '2 hrs'),
        ('6', '6 hrs'),
        ('8', '8 hrs'),
        ('10', '10 hrs'),
    )
    name = models.CharField(max_length=255)
    scraping_end_date = models.DateField(null=True, blank=True)
    search_frequency = models.CharField(max_length=5, null=True, blank=True, choices=FREQUENCY)
    status = models.IntegerField(choices=task_status)
tasks.py
I want to run the view posted below periodically (period = the task's search_frequency) while the task status is 0 or 1 and the task's scraping end date has not passed. But I got stuck here. How can I do this?
@periodic_task(run_every=crontab(hour="task.search_frequency"))  # how to do with task search_frequency value
def schedule_task(pk):
    task = Task.objects.get(pk=pk)
    if task.status == 0 or task.status == 1 and not datetime.date.today() > task.scraping_end_date:
        # perform the crawl function ---> def crawl() how ??
        if task.scraping_end_date == datetime.date.today():
            task.status = 2
            task.save()  # change the task status as complete.
views.py
I want to run this view periodically. How can I do it?
class CrawlerHomeView(LoginRequiredMixin, View):
login_url = 'users:login'
def get(self, request, *args, **kwargs):
# all_task = Task.objects.all().order_by('-id')
frequency = Task()
categories = Category.objects.all()
targets = TargetSite.objects.all()
keywords = Keyword.objects.all()
form = CreateTaskForm()
context = {
'targets': targets,
'keywords': keywords,
'frequency': frequency,
'form':form,
'categories': categories,
}
return render(request, 'index.html', context)
def post(self, request, *args, **kwargs):
form = CreateTaskForm(request.POST)
if form.is_valid():
# try:
unique_id = str(uuid4()) # create a unique ID.
obj = form.save(commit=False)
# obj.keywords = keywords
obj.created_by = request.user
obj.unique_id = unique_id
obj.status = 0
obj.save()
form.save_m2m()
keywords = ''
# for keys in ast.literal_eval(obj.keywords.all()): #keywords change to csv
for keys in obj.keywords.all():
if keywords:
keywords += ', ' + keys.title
else:
keywords += keys.title
# tasks = request.POST.get('targets')
# targets = ['thehimalayantimes', 'kathmandupost']
# print('$$$$$$$$$$$$$$$ keywords', keywords)
task_ids = [] #one Task/Project contains one or multiple scrapy task
settings = {
'spider_count' : len(obj.targets.all()),
'keywords' : keywords,
'unique_id': unique_id, # unique ID for each record for DB
'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
}
# res = ast.literal_eval(ini_list)
for site_url in obj.targets.all():
domain = urlparse(site_url.address).netloc # parse the url and extract the domain
spider_name = domain.replace('.com', '')
task = scrapyd.schedule('default', spider_name, settings=settings, url=site_url.address, domain=domain, keywords=keywords)
# task = scrapyd.schedule('default', spider_name , settings=settings, url=obj.targets, domain=domain, keywords=obj.keywords)
return redirect('crawler:task-list')
# except:
# return render(request, 'index.html', {'form':form})
return render(request, 'index.html', {'form':form, 'errors':form.errors})
Are there any suggestions or answers for this problem?

After fighting Celery for 5 years in a 15k tasks/second setup I highly recommend you to switch to Dramatiq, which has a sane, reliable, performant code base that isn't split across multiple convoluted packages and works perfectly in two of my newer projects so far.
From the author's motivation
I’ve used Celery professionally for years and my growing frustration with it is one of the reasons why I developed dramatiq. Here are some of the main differences between Dramatiq, Celery and RQ:
There's also a Django helper package: https://github.com/Bogdanp/django_dramatiq
Granted, you won't have a built-in celerybeat, but a cron job calling Python tasks is more robust anyway. We lost a good amount of data because celerybeat decided to stall regularly :)
There are two projects that aim to add periodic task creation: https://gitlab.com/bersace/periodiq and https://apscheduler.readthedocs.io/en/stable/
I haven't used those packages yet. What you could try with periodiq is selecting your database entries, looping through them and defining a periodic task for each (but this requires regular restarts of the periodiq worker to pick up changes):
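# tasks.py
import dramatiq
from dramatiq import get_broker
from periodiq import PeriodiqMiddleware, cron
from .models import Task  # assumed location of the Task model
broker = get_broker()
broker.add_middleware(PeriodiqMiddleware(skip_delay=30))
for obj in Task.objects.all():
    # Each actor needs its own name, otherwise the second registration of "hourly" fails.
    @dramatiq.actor(periodic=cron(obj.frequency), actor_name=f"hourly_{obj.pk}")
    def hourly(obj=obj):
        # import logic based on obj.name
        # Do something each hour…
        pass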
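An alternative that avoids one periodic actor per row, and the worker restarts that come with it, is a single periodic job (a cron entry, a periodiq actor or a Celery beat entry) that runs often and decides per row whether a crawl is due. The sketch below shows only that decision, in plain Python so it runs without Django. TaskRow, is_due and last_run are hypothetical names, and last_run is a field the Task model in the question does not have.

```python
import datetime
from dataclasses import dataclass
from typing import Optional

RUNNABLE_STATUSES = {0, 1}  # INITIAL and STARTED in the question's Task model


@dataclass
class TaskRow:
    """Plain stand-in for a Task row; last_run is a hypothetical extra field."""
    status: int
    search_frequency: str                       # hours as a string, e.g. '6'
    scraping_end_date: Optional[datetime.date]
    last_run: Optional[datetime.datetime]       # None until the first crawl


def is_due(task: TaskRow, now: datetime.datetime) -> bool:
    """Return True if the task should be crawled at time `now`."""
    if task.status not in RUNNABLE_STATUSES:
        return False
    if task.scraping_end_date is not None and now.date() > task.scraping_end_date:
        return False
    if task.last_run is None:
        return True
    interval = datetime.timedelta(hours=int(task.search_frequency))
    return now - task.last_run >= interval


now = datetime.datetime(2020, 6, 8, 12, 0)
end = datetime.date(2020, 6, 30)
fresh = TaskRow(0, '6', end, None)
recent = TaskRow(1, '6', end, now - datetime.timedelta(hours=2))
stale = TaskRow(1, '6', end, now - datetime.timedelta(hours=6))
expired = TaskRow(1, '1', datetime.date(2020, 6, 1), None)
print([is_due(t, now) for t in (fresh, recent, stale, expired)])
# → [True, False, True, False]
```

The dispatcher would then start the crawl for every row where is_due is true and write the current time back to last_run, so changes to search_frequency take effect on the next run without a restart.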

For the error
Exception Type: EncodeError
Exception Value:
Object of type timedelta is not JSON serializable
instead of defining the following variable in your Django settings,
CELERY_BEAT_SCHEDULE = {
    'task-first': {
        'task': 'scheduler.tasks.create_task',
        'schedule': timedelta(minutes=1)
    },
}
can you try the following in your celery file:
app.conf.beat_schedule = {
    'task-first': {
        'task': 'scheduler.tasks.create_task',
        'schedule': crontab(minute='*/1')
    }
}
This works for me, provided Celery is up and running.
Apart from this, why are you redirecting to 'list_tasks' after each task? What exactly does it do? Also, you call the Celery task from the view with add_task_celery.delay(name, date, freq). Is that just another way to add a task, in addition to the periodic task defined using celery beat?
Edit 1:
My structure looks as follows:
settings.py
CELERY_TIMEZONE = 'Asia/Kolkata'
CELERY_BROKER_URL = 'amqp://localhost'
celery.py
app.conf.beat_schedule = {
    'task1': {
        'task': '<app_name>.tasks.random_task',
        'schedule': crontab(minute=0, hour=0)
    },
}
Note that I have a file named tasks in my app folder, where I have written a shared task as follows:
@shared_task
def random_task(total):
    ...
Apart from this, you should start both Celery beat and a Celery worker process as follows:
celery -A <project_name>.celery worker -l error
celery -A <project_name>.celery beat -l error --scheduler django_celery_beat.schedulers:DatabaseScheduler
You can use any scheduler you want; in production I use DatabaseScheduler. For testing you can try the following command:
celery -A <project_name> beat -l info -S django
You should run all these commands from the project folder of the Django project.

I believe the problem is with the 2nd and 3rd parameters of the task call, which are freq and date. From the error you posted, Object of type timedelta is not JSON serializable, it looks like it is complaining about the freq field, which is a DurationField and therefore returns a timedelta object.
Ideally, both fields should be serialized before they are passed to the task.
Some simple ways to do that:
1) Explicitly serialize these fields, pass them to the task, and convert them back to datetime / timedelta objects inside the task.
Alternatively, if there are too many items, you can dump the whole data dict with an encoder that understands these types, such as Django's DjangoJSONEncoder (from django.core.serializers.json):
add_task_celery.delay(json.dumps(form.cleaned_data, cls=DjangoJSONEncoder))
and then call json.loads(...) in the task.
2) Another thing you can try is to pass the serializer explicitly, using apply_async instead of delay:
add_task_celery.apply_async((name, date, freq), serializer='json')
3) You can also set CELERY_TASK_SERIALIZER = 'json', if you haven't already (since Celery 4.0 the default is already 'json'; earlier versions defaulted to 'pickle').
As the error shows, the JSON serializer could not encode the timedelta here, so with 2) and 3) the values still have to be converted as in 1).
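A minimal sketch of the explicit conversion described in 1), using only the standard library. encode_task_args and decode_task_args are hypothetical helper names; the argument names follow the add_task_celery(name, date, freq) call mentioned above, and date is assumed to be a datetime.date.

```python
import datetime
import json


def encode_task_args(name, date, freq):
    """Turn the form values into JSON-safe primitives before calling .delay()."""
    return {
        "name": name,
        "date": date.isoformat(),               # datetime.date -> 'YYYY-MM-DD'
        "freq_seconds": freq.total_seconds(),   # datetime.timedelta -> float
    }


def decode_task_args(payload):
    """Rebuild the original Python objects inside the task."""
    return (
        payload["name"],
        datetime.date.fromisoformat(payload["date"]),
        datetime.timedelta(seconds=payload["freq_seconds"]),
    )


original = ("news crawl", datetime.date(2020, 6, 30), datetime.timedelta(hours=6))
wire = json.dumps(encode_task_args(*original))   # roughly what travels through the broker
restored = decode_task_args(json.loads(wire))
print(restored == original)
# → True
```

With this in place the view would pass the encoded dict to the task, and the task would call the decode helper as its first step.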

Related

Scheduling my crawler with celery not working

Here I want to run my crawler with Celery every minute. I wrote the task as below and called it in the view with delay, but I am not getting any result.
I run Celery (celery -A mysite worker -l info), the RabbitMQ broker, Scrapy and the Django server in different terminals.
CrawlerHomeView creates the task object and redirects to the task list successfully, but Celery is not working.
It throws this error in the Celery console:
ValueError: not enough values to unpack (expected 3, got 0)
[2020-06-08 15:36:06,732: INFO/MainProcess] Received task: crawler.tasks.schedule_task[3b537143-caa8-4445-b3d6-c0bc8d301b89]
[2020-06-08 15:36:06,735: ERROR/MainProcess] Task handler raised error: ValueError('not enough values to unpack (expected 3, got 0)')
Traceback (most recent call last):
  File "....\venv\lib\site-packages\billiard\pool.py", line 362, in workloop
    result = (True, prepare_result(fun(*args, **kwargs)))
  File "....\venv\lib\site-packages\celery\app\trace.py", line 600, in _fast_trace_task
    tasks, accept, hostname = _loc
ValueError: not enough values to unpack (expected 3, got 0)
views
class CrawlerHomeView(LoginRequiredMixin, View):
login_url = 'users:login'
def get(self, request, *args, **kwargs):
frequency = Task()
categories = Category.objects.all()
targets = TargetSite.objects.all()
keywords = Keyword.objects.all()
form = CreateTaskForm()
context = {
'targets': targets,
'keywords': keywords,
'frequency': frequency,
'form':form,
'categories': categories,
}
return render(request, 'index.html', context)
def post(self, request, *args, **kwargs):
form = CreateTaskForm(request.POST)
if form.is_valid():
unique_id = str(uuid4()) # create a unique ID.
obj = form.save(commit=False)
obj.created_by = request.user
obj.unique_id = unique_id
obj.status = 0
obj.save()
form.save_m2m()
schedule_task.delay(obj.pk)
return render(request, 'index.html', {'form':form, 'errors':form.errors})
tasks.py
scrapyd = ScrapydAPI('http://localhost:6800')
@periodic_task(run_every=crontab(minute=1))  # how to do with task search_frequency value ?
def schedule_task(pk):
    task = Task.objects.get(pk=pk)
    if task.status == 0 or task.status == 1 and not datetime.date.today() >= task.scraping_end_date:
        unique_id = str(uuid4())  # create a unique ID.
        keywords = ''
        # for keys in ast.literal_eval(obj.keywords.all()): #keywords change to csv
        for keys in task.keywords.all():
            if keywords:
                keywords += ', ' + keys.title
            else:
                keywords += keys.title
        settings = {
            'spider_count': len(task.targets.all()),
            'keywords': keywords,
            'unique_id': unique_id,  # unique ID for each record for DB
            'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
        }
        # res = ast.literal_eval(ini_list)
        for site_url in task.targets.all():
            domain = urlparse(site_url.address).netloc  # parse the url and extract the domain
            spider_name = domain.replace('.com', '')
            scrapyd.schedule('default', spider_name, settings=settings, url=site_url.address, domain=domain,
                             keywords=keywords)
    elif task.scraping_end_date == datetime.date.today():
        task.status = 2
        task.save()  # change the task status as completed.
settings
CELERY_BROKER_URL = 'amqp://localhost'
EDIT
This answer helped me find the solution: Celery raises ValueError: not enough values to unpack.
Now this error is gone.
Now in the Celery console I am seeing this:
[2020-06-08 16:33:23,123: INFO/MainProcess] Task crawler.tasks.schedule_task[0578558d-0dc6-4db7-b69f-e912b604ff3d] succeeded in 0.016000000000531145s: None
and I am getting no scraped results in my frontend.
Now my question is: how can I check that my task is running periodically every minute?
This is the very first time I am using Celery, so there might be some problems here.
Celery is no longer supported on Windows as a platform (version 4 dropped official support).
I highly suggest that you dockerize your app instead (or use WSL 2). If you don't want to go this route,
you would probably need to use gevent (note that there could be some additional problems if you go this route):
pip install gevent
celery -A <module> worker -l info -P gevent
I found a similar detailed answer here.

Celery not returning any results after success?

Here I am crawling some websites with different keywords. Before, the view did the scraping itself and it worked, but I have now implemented Celery for this. Since switching to Celery I no longer get the scraping results, although no error is shown. I am using RabbitMQ as the message broker.
tasks.py
@shared_task()
def schedule_task(pk):
    task = Task.objects.get(pk=pk)
    keywords = ''
    # for keys in ast.literal_eval(obj.keywords.all()): #keywords change to csv
    for keys in task.keywords.all():
        if keywords:
            keywords += ', ' + keys.title
        else:
            keywords += keys.title
    task_ids = []  # one Task/Project contains one or multiple scrapy task
    settings = {
        'spider_count': len(task.targets.all()),
        'keywords': keywords,
        'unique_id': str(uuid4()),  # unique ID for each record for DB
        'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
    }
    # res = ast.literal_eval(ini_list)
    for site_url in task.targets.all():
        domain = urlparse(site_url.address).netloc  # parse the url and extract the domain
        spider_name = domain.replace('.com', '')
        task = scrapyd.schedule('default', spider_name, settings=settings, url=site_url.address, domain=domain,
                                keywords=keywords)
views
def post(self, request, *args, **kwargs):
form = CreateTaskForm(request.POST)
if form.is_valid():
unique_id = str(uuid4()) # create a unique ID.
obj = form.save(commit=False)
obj.created_by = request.user
obj.unique_id = unique_id
obj.status = 0
obj.save()
form.save_m2m()
print(obj.pk)
schedule_task.delay(pk=obj.pk)
return redirect('crawler:task-list')
views before using Celery (this version returned the scraped results and worked fine). I just moved the scraping part into tasks.py and call it from the view with .delay, but it no longer returns the result (before, it did).
form = CreateTaskForm(request.POST)
if form.is_valid():
unique_id = str(uuid4()) # create a unique ID.
obj = form.save(commit=False)
obj.created_by = request.user
obj.unique_id = unique_id
obj.status = 0
obj.save()
form.save_m2m()
keywords = ''
# for keys in ast.literal_eval(obj.keywords.all()): #keywords change to csv
for keys in obj.keywords.all():
if keywords:
keywords += ', ' + keys.title
else:
keywords += keys.title
task_ids = [] #one Task/Project contains one or multiple scrapy task
settings = {
'spider_count' : len(obj.targets.all()),
'keywords' : keywords,
'unique_id': unique_id, # unique ID for each record for DB
'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
}
# res = ast.literal_eval(ini_list)
for site_url in obj.targets.all():
domain = urlparse(site_url.address).netloc # parse the url and extract the domain
spider_name = domain.replace('.com', '')
task = scrapyd.schedule('default', spider_name, settings=settings, url=site_url.address, domain=domain, keywords=keywords)
return redirect('crawler:task-list')
celery console
[2020-06-10 20:42:55,885: INFO/MainProcess] celery@DESKTOP-ENPLHOS ready.
[2020-06-10 20:42:55,900: INFO/MainProcess] pidbox: Connected to amqp://guest:**@127.0.0.1:5672//.
[2020-06-10 20:43:13,730: INFO/MainProcess] Received task: crawler.tasks.schedule_task[10e7bf06-5e4e-413c-85a3-79d61b9835cf]
[2020-06-10 20:43:17,077: INFO/MainProcess] Task crawler.tasks.schedule_task[10e7bf06-5e4e-413c-85a3-79d61b9835cf] succeeded in 3.3590000000040163s: None
At http://localhost:6800/jobs I can see that the spiders are running, but the results are not appearing in my view.
views before using celery ( which returns the scraped results worked fine)
That is because your code ran synchronously, one step after the other.
Celery, on the other hand, runs the task asynchronously: .delay() returns immediately without waiting for the task, and since schedule_task has no return statement, the worker logs its result as None.
If you chain two or more Celery tasks (all of which run asynchronously), each can use the previous one's return value, but you cannot chain a synchronous view with an asynchronous Celery task that way.
Celery tasks are meant to be dispatched and run in the background, while your view is supposed to return something else (without waiting for your spiders to finish).
To be able to make use of the Celery results:
The collected data needs to be stored somewhere (a file such as CSV or JSON, or a database), and the Django view handles it in 2 steps:
first, you trigger the Celery task
second, you collect the stored results and display them
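The two steps can be sketched in plain Python as follows. This is only an illustration of the pattern under stated assumptions: RESULT_STORE is a hypothetical in-memory stand-in for a database table, crawl_job stands in for the Celery task, and it is called inline here so the sketch runs without a broker.

```python
import uuid

# Hypothetical stand-in for a database table keyed by the crawl's unique_id.
RESULT_STORE = {}


def crawl_job(unique_id, keywords):
    """What the background task would do: crawl, then persist the results."""
    RESULT_STORE[unique_id] = {"status": "running", "items": []}
    items = [f"article about {kw}" for kw in keywords]  # placeholder for real scraping
    RESULT_STORE[unique_id] = {"status": "completed", "items": items}


def start_view(keywords):
    """Step 1: trigger the job and hand the caller an ID to look results up with."""
    unique_id = str(uuid.uuid4())
    crawl_job(unique_id, keywords)  # with Celery this would be crawl_job.delay(unique_id, keywords)
    return unique_id


def results_view(unique_id):
    """Step 2: read whatever has been stored so far."""
    return RESULT_STORE.get(unique_id, {"status": "pending", "items": []})


job_id = start_view(["django", "celery"])
print(results_view(job_id)["status"])        # completed
print(results_view("unknown-id")["status"])  # pending
```

In the real project the unique_id already saved on the Task object could play the role of the lookup key, and the task list page would be the second view.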

How to do something after 2 days in django like deleting an object of a model

This is a model for all assignments taken by any user. How can I delete a particular instance of it when the user hasn't submitted the assignment within two days? After submission, the user's data is saved in the SubAssignment model.
class UserAssignment(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL)
    assignment = models.ForeignKey(Assignment)
    time_taken = models.DateTimeField(auto_now_add=True)
    submitted = models.DateTimeField(null=True, blank=True)
class SubAssignment(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL)
    assignment = models.ForeignKey(Assignment)
    time_submitted = models.DateTimeField(blank=True, null=True)
    score = models.IntegerField(default=0)
pip install django-celery==3.2.2
Add 'djcelery' to INSTALLED_APPS.
Add a task that checks all UserAssignment objects every day at 1 am:
tasks.py:
from celery import task
from .models import *
@task
def check_user_assignment():
    for user_assignment in UserAssignment.objects.all():
        # check all of them every day; delete the ones that need deleting
        pass
Add this task to the schedule in
settings.py
from celery.schedules import crontab
BROKER_URL = 'amqp://root:root@localhost:5672/'
CELERY_TIMEZONE = 'Asia/Shanghai'
CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler'
CELERYBEAT_SCHEDULE = {
    'check_user_assignment_everyday': {
        'task': 'user.tasks.check_user_assignment',
        'schedule': crontab(minute=0, hour=1),
        'args': (),
    },
}
Run Celery by entering the following in a terminal:
python manage.py celery beat -l info
python manage.py celery worker -E -l info
You have to use timedelta:
from datetime import datetime, timedelta
# Some of your code
# Let's create 2 day threshold
threshold = datetime.now() - timedelta(days=2)
# Taken at least two days ago and still not submitted
missed_assignment = UserAssignment.objects.filter(user=user,
                                                  time_taken__lte=threshold,
                                                  submitted__isnull=True)
missed_assignment.delete()
missed_subassignment = SubAssignment.objects.filter(user=user,
                                                    time_submitted__lte=threshold)
missed_subassignment.delete()
Note that time_taken__lte=threshold means "taken at or before the threshold", that is, at least two days have passed. __gte would instead select rows from within the last two days.
If you are not running Celery, you can use an RQ worker (https://github.com/rq/django-rq) or just a cron job that checks periodically.
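A minimal sketch of the two-day check in plain Python, with AssignmentRow as a hypothetical stand-in for UserAssignment so that it runs without Django. A row counts as overdue when it was taken at or before now minus two days and submitted is still empty, which corresponds to time_taken__lte=threshold together with submitted__isnull=True in the ORM.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(days=2)


@dataclass
class AssignmentRow:
    """Plain stand-in for a UserAssignment row."""
    time_taken: datetime
    submitted: Optional[datetime] = None


def overdue(rows, now):
    """Rows taken at least GRACE_PERIOD ago that were never submitted."""
    threshold = now - GRACE_PERIOD
    return [r for r in rows if r.submitted is None and r.time_taken <= threshold]


now = datetime(2020, 6, 10, 12, 0)
rows = [
    AssignmentRow(time_taken=now - timedelta(days=3)),   # overdue
    AssignmentRow(time_taken=now - timedelta(days=1)),   # still within the two days
    AssignmentRow(time_taken=now - timedelta(days=3),
                  submitted=now - timedelta(days=2)),    # already handed in
]
print(len(overdue(rows, now)))
# → 1
```

A daily periodic task or cron job would run the equivalent queryset and call delete() on it.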

Method works fine in iPython but runs endlessly on Gunicorn

I wrote an app in the Falcon framework that I am running with the Gunicorn server. When the server starts, the app first trains a random forest model:
forest = sklearn.ensemble.ExtraTreesClassifier(n_estimators=150, n_jobs=-1)
forest.fit(x, t)
and then returns probabilities for requests posted to it. This works fine on my server when I run the code in iPython (training this model takes 15s, running on 12 cores).
When I was writing the app, I set n_estimators=10 and everything worked. When I finished tweaking the app, I set n_estimators back to 150. However, when I then ran Gunicorn with gunicorn -c ./app.conf app:app, I could see in htop that forest.fit(x, t) runs for a few seconds on all cores, after which the usage of all cores drops to 0. After that, the method keeps running indefinitely until the Gunicorn worker times out after 10 minutes.
This is my first time using Gunicorn and Falcon, or any WSGI technologies for that matter, and I am clueless as to what might be causing the problem or how to troubleshoot it.
Edit:
The settings file for gunicorn:
# app.conf
# run with gunicorn -c ./app.conf app:app
import sys
sys.path.append('/home/user/project/Module')
bind = "127.0.0.1:8123"
timeout = 60*20 # Timeout worker after more than 20 minutes
The falcon code:
class Processor(object):
""" Processor object handles the training of the models,
feature generation of requests and probability predictions.
"""
# data statistics used in feature calculations
data_statistics = {}
# Classification targets
targets = ()
# Select features for the models.
cols1 = [ #...
]
cols2 = [ #...
]
model_1 = ExtraTreesClassifier(n_estimators=150, n_jobs=-1)
model_2 = ExtraTreesClassifier(n_estimators=150, n_jobs=-1)
def __init__(self, features_dataset, tr_prepro):
# Get the datasets
da_1, da_2 = self.prepare_datasets(features_dataset)
# Train models
# ----THIS IS WHERE THE PROGRAM HANGS -----------------------------------
self.model_1.fit(da_1.x, utils.vectors_to_labels(da_1.t))
# -----------------------------------------------------------------------
self.model_2.fit(da_2.x, utils.vectors_to_labels(da_2.t))
# Generate data statistics for feature calculations
self.calculate_data_statistics(tr_prepro)
def prepare_datasets(self, features_dataset):
sel_cols = [ #...
]
# Build dataset
d = features_dataset[sel_cols].dropna()
da, scalers = ft.build_dataset(d, scaling='std', target_feature='outcome')
# Binirize data
da_bin = utils.binirize_dataset(da)
# What are the classification targets
self.targets = da_bin.t_labels
# Prepare the datasets
da_1 = da_bin.select_attributes(self.cols1)
da_2 = da_bin.select_attributes(self.cols2)
return da_1, da_2
def calculate_data_statistics(self, tr_prepro):
logger.info('Getting data and feature statistics...')
#...
logger.info('Done.')
def import_data(self, data):
# convert dictionary generated from json to Pandas DataFrame
return tr
def generate_features(self, tr):
# Preprocessing, Feature calculations, imputation
return tr
def predict_proba(self, data):
# Convert Data
tr = self.import_data(data)
# Generate features
tr = self.generate_features(tr)
# Select model based on missing values - either no. 1 or no. 2
tr_1 = #...
tr_2 = #...
# Get the probabilities from different models
if tr_1.shape[0] > 0:
tr_1.loc[:, 'prob'] = self.model_1.predict_proba(tr_1.loc[:, self.cols1])[:, self.targets.index('POSITIVE')]
if tr_2.shape[0] > 0:
tr_2.loc[:, 'prob'] = self.model_2.predict_proba(tr_2.loc[:, self.cols2])[:, self.targets.index('POSITIVE')]
return pd.concat([tr_1, tr_2], axis=0)
@staticmethod
def export_single_result(tr):
result = {'sample_id': tr.loc[0, 'sample_id'],
'batch_id': tr.loc[0, 'batch_id'],
'prob': tr.loc[0, 'prob']
}
return result
class JSONTranslator(object):
def process_request(self, req, resp):
"""Generic method for extracting json from requests
Throws
------
HTTP 400 (Bad Request)
HTTP 753 ('Syntax Error')
"""
if req.content_length in (None, 0):
# Nothing to do
return
body = req.stream.read()
if not body:
raise falcon.HTTPBadRequest('Empty request body',
'A valid JSON document is required.')
try:
req.context['data'] = json.loads(body.decode('utf-8'))
except (ValueError, UnicodeDecodeError):
raise falcon.HTTPError(falcon.HTTP_753,
'Malformed JSON',
'Could not decode the request body. The '
'JSON was incorrect or not encoded as '
'UTF-8.')
def process_response(self, req, resp, resource):
"""Generic method for putting response to json
Does not do anything if 'result_json' not in req.context.
"""
if 'result_json' not in req.context:
return
resp.body = json.dumps(req.context['result_json'])
class ProbResource(object):
def __init__(self, processor):
self.schema_raw = open(config.__ROOT__ + "app_docs/broadcast_schema.json").read()
self.schema = json.loads(self.schema_raw)
self.processor = processor
def validate_request(self, req):
""" Validate the request json against the schema.
Throws
------
HTTP 753 ('Syntax Error')
"""
data = req.context['data']
# validate the json
try:
v = jsonschema.Draft4Validator(self.schema) # using jsonschema draft 4
err_msg = str()
for error in sorted(v.iter_errors(data), key=str):
err_msg += str(error)
if len(err_msg) > 0:
raise falcon.HTTPError(falcon.HTTP_753,
'JSON failed validation',
err_msg)
except jsonschema.ValidationError as e:
print("Failed to use schema:\n" + str(self.schema_raw))
raise e
def on_get(self, req, resp):
"""Handles GET requests
Throws
------
HTTP 404 (Not Found)
"""
self.validate_request(req)
data = req.context['data']
try:
# get probability
tr = self.processor.predict_proba(data)
# convert pandas dataframe to dictionary
result = self.processor.export_single_result(tr)
# send the dictionary away
req.context['result_json'] = result
except Exception as ex:
raise falcon.HTTPError(falcon.HTTP_404, 'Error', ex.message)
resp.status = falcon.HTTP_200
# Get data
features_data = fastserialize.load(config.__ROOT__ + 'data/samples.s')
prepro_data = fastserialize.load(config.__ROOT__ + 'data/prepro/samples_preprocessed.s')
# Get the models - this is where the code hangs
sp = SampleProcessor(features_data, prepro_data)
app = falcon.API(middleware=[JSONTranslator()])
prob = ProbResource(sp)
app.add_route('/prob', prob)

Django Celerybeat PeriodicTask running far more than expected

I'm struggling with Django, Celery, djcelery & PeriodicTasks.
I've created a task to pull a report for Adsense to generate a live stat report. Here is my task:
import datetime
import httplib2
import logging
from apiclient.discovery import build
from celery.task import PeriodicTask
from django.contrib.auth.models import User
from oauth2client.django_orm import Storage
from .models import Credential, Revenue
logger = logging.getLogger(__name__)
class GetReportTask(PeriodicTask):
run_every = datetime.timedelta(minutes=2)
def run(self, *args, **kwargs):
scraper = Scraper()
scraper.get_report()
class Scraper(object):
TODAY = datetime.date.today()
YESTERDAY = TODAY - datetime.timedelta(days=1)
def get_report(self, start_date=YESTERDAY, end_date=TODAY):
logger.info('Scraping Adsense report from {0} to {1}.'.format(
start_date, end_date))
user = User.objects.get(pk=1)
storage = Storage(Credential, 'id', user, 'credential')
credential = storage.get()
if not credential is None and credential.invalid is False:
http = httplib2.Http()
http = credential.authorize(http)
service = build('adsense', 'v1.2', http=http)
reports = service.reports()
report = reports.generate(
startDate=start_date.strftime('%Y-%m-%d'),
endDate=end_date.strftime('%Y-%m-%d'),
dimension='DATE',
metric='EARNINGS',
)
data = report.execute()
for row in data['rows']:
date = row[0]
revenue = row[1]
try:
record = Revenue.objects.get(date=date)
except Revenue.DoesNotExist:
record = Revenue()
record.date = date
record.revenue = revenue
record.save()
else:
logger.error('Invalid Adsense Credentials')
I'm using Celery & RabbitMQ. Here are my settings:
# Celery/RabbitMQ
BROKER_HOST = "localhost"
BROKER_PORT = 5672
BROKER_USER = "myuser"
BROKER_PASSWORD = "****"
BROKER_VHOST = "myvhost"
CELERYD_CONCURRENCY = 1
CELERYD_NODES = "w1"
CELERY_RESULT_BACKEND = "amqp"
CELERY_TIMEZONE = 'America/Denver'
CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler'
import djcelery
djcelery.setup_loader()
On first glance everything seems to work, but after turning on the logger and watching it run I have found that it is running the task at least four times in a row - sometimes more. It also seems to be running every minute instead of every two minutes. I've tried changing the run_every to use a crontab but I get the same results.
I'm starting celerybeat using supervisor. Here is the command I use:
python manage.py celeryd -B -E -c 1
Any ideas as to why it's not working as expected?
Oh, and one more thing, after the day changes, it continues to use the date range it first ran with. So as days progress it continues to get stats for the day the task started running - unless I run the task manually at some point then it changes to the date I last ran it manually. Can someone tell me why this happens?
Consider creating a separate queue with a single worker process and a fixed rate limit for this type of task, and add the tasks to this new queue instead of running them directly from celerybeat. That should help you figure out what is wrong: whether the problem is with celerybeat or whether your tasks are running longer than expected.
@task(queue='create_report', rate_limit='0.5/m')
def create_report():
    scraper = Scraper()
    scraper.get_report()
class GetReportTask(PeriodicTask):
    run_every = datetime.timedelta(minutes=2)
    def run(self, *args, **kwargs):
        create_report.delay()
In settings.py:
CELERY_ROUTES = {
    'myapp.tasks.create_report': {'queue': 'create_report'},
}
Start an additional Celery worker to handle the tasks in that queue:
celery worker -c 1 -Q create_report -n create_report.local
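Problem 2. Your YESTERDAY and TODAY variables are set at class level, so they are evaluated only once, when the class is defined (that is, when the worker process imports the module). The default arguments of get_report are bound at that moment too, so a long-running worker keeps using the dates from the day it started. Compute the dates inside get_report at call time instead.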
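A small sketch of the difference, using fixed dates so the behaviour is reproducible. FrozenScraper mirrors the pattern in the question; FreshScraper is a hypothetical variant that resolves the dates on each call, and both only return the dates instead of calling the AdSense API.

```python
import datetime


class FrozenScraper:
    TODAY = datetime.date(2020, 1, 1)  # stands in for date.today() evaluated at import time

    def get_report(self, end_date=TODAY):  # the default is bound once, at definition
        return end_date


class FreshScraper:
    def get_report(self, start_date=None, end_date=None):
        # Resolve the dates at call time, so every run sees the current day.
        end_date = end_date or datetime.date.today()
        start_date = start_date or end_date - datetime.timedelta(days=1)
        return start_date, end_date


# Simulate the worker still running a day later.
FrozenScraper.TODAY = datetime.date(2020, 1, 2)
print(FrozenScraper().get_report())
# → 2020-01-01 (still the date captured when the method was defined)
print(FreshScraper().get_report(end_date=datetime.date(2020, 1, 2)))
# → (datetime.date(2020, 1, 1), datetime.date(2020, 1, 2))
```

Applied to the question's Scraper, this means dropping the TODAY and YESTERDAY class attributes and giving get_report None defaults that are filled in inside the method.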