How to use Python threadpool and celery in combination? - flask

Originally I had a small script doing some IO-bound jobs using a native Python thread pool, e.g.
from multiprocessing.dummy import Pool as ThreadPool

def f(arg):
    # some IO-bound action here.
    pass

def pooled_f():
    pool = ThreadPool(4)
    args = [1, 2, 3, 4]
    results = []
    for i in pool.imap_unordered(f, args):
        results.append(i)
    pool.close()
    pool.join()
    return results
Now I'm using Flask + Celery + Redis to allow the user to submit different jobs to the worker and do this efficiently: the user submits a file containing some data, Celery does some work, and the app shows the user the progress. Maybe it's overkill for such a small app, but it should work anyway.
@celery.task(bind=True)
def sillywork(self, filename):
    filesize = os.path.getsize(filename)
    with open(filename, 'rb') as f:
        chunked = 0
        for i, line in enumerate(f):
            message = line.strip()
            chunked += len(message)
            # print(line)
            self.update_state(
                state='PROGRESS',
                meta={
                    'current': chunked,
                    'total': filesize,
                    'status': message
                })
            time.sleep(0.01)
    return {
        'current': filesize,
        'total': filesize,
        'status': 'OK',
        'result': 1
    }
@app.route('/jobs', methods=['POST', 'GET'])
def jobs():
    form = Form()
    if request.method == 'POST':
        filename = form.data.filename
        # file contains the data that user submitted
        file = request.files['file']
        file.save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
        # How should I add the celery job here?
        task = sillywork.apply_async(args=[filename])
        return redirect(url_for('index', task_id=task.id))
I followed the example at https://blog.miguelgrinberg.com/post/using-celery-with-flask, but somehow I couldn't find a way to make Celery run the job I needed.
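One plausible way to combine the two, sketched here as an assumption rather than a confirmed solution: keep the thread pool for the IO-bound calls, but run it inside the Celery task and report progress as results arrive. The helper f is assumed to be the same IO-bound function from the first snippet.

from multiprocessing.dummy import Pool as ThreadPool

@celery.task(bind=True)
def pooled_sillywork(self, filename):
    # Hypothetical sketch: read work items from the uploaded file,
    # fan them out to a thread pool, and report progress as each one finishes.
    with open(filename, 'rb') as fh:
        items = [line.strip() for line in fh]
    pool = ThreadPool(4)
    results = []
    for done, result in enumerate(pool.imap_unordered(f, items), start=1):
        results.append(result)
        self.update_state(state='PROGRESS',
                          meta={'current': done, 'total': len(items)})
    pool.close()
    pool.join()
    return {'current': len(items), 'total': len(items), 'status': 'OK', 'result': results}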

Related

Passing Audio Files To Celery Task

I have a music uploading app and believe that it would be smart to pass the files to a celery task to handle uploading. However, when attempting to pass the files, as I will show in my code below, I get a message stating that they are not JSON serializable. What would be the correct way to handle this operation?
Everything below uploaded_songs in .views.py is my current code that successfully uploads the audio tracks. It doesn't, however, utilize celery yet.
.task.py
import logging

from django.contrib.auth import get_user_model
from django.contrib.auth.models import User

from Beyond_April_Base_Backend.celery import app

@app.task
def upload_songs(songs, user_id):
    try:
        user = User.objects.get(pk=user_id)
        print('user and songs')
        print(user)
        print(songs)
    except User.DoesNotExist:
        logging.warning("Tried to find non-existing user '%s'" % user_id)
.views.py
class ConcertUploadView(APIView):
    permission_classes = [permissions.IsAuthenticated]

    def post(self, request):
        track_files = request.FILES.getlist('files')
        current_user = self.request.user
        upload_songs.delay(track_files, current_user.pk)
        try:
            selected_band = Band.objects.get(name=request.data['band'])
        except ObjectDoesNotExist:
            print('band not received from form')
            selected_band = Band.objects.get(name='Band')
        venue_name = request.data['venue']
        concert_date_str = request.data['concertDate']
        concert_date_split = concert_date_str.split('(')[0]
        concert_date = datetime.strptime(concert_date_split, '%a %b %d %Y %H:%M:%S %Z%z ')
        concert_city = request.data['city']
        concert_state = request.data['state']
        concert_country = request.data['country']
        new_concert = Concert(
            venue=venue_name,
            date=concert_date,
            city=concert_city,
            state=concert_state,
            country=concert_country,
            band=selected_band,
            user=current_user,
        )
        new_concert.save()
        i = 0
        for song in track_files:
            audio_metadata = music_tag.load_file(track_files[i].temporary_file_path())
            temp_path = song.temporary_file_path
            song_title = str(audio_metadata['title'])
            audio_file_instance = Song(
                title=song_title,
                concert=new_concert,
                user=current_user,
                concert_order=i + 1,
                audio_file=track_files[i],
            )
            audio_file_instance.save()
            i += 1
        return Response(status=status.HTTP_201_CREATED)
When you create a celery task, it serializes the arguments so that it can store the message in the queue backend (RabbitMQ, Redis, etc). The default serializer is JSON, and a binary file is not JSON-serializable. See celery's serialization docs for more info.
You could base64 encode the binary file to text, but you shouldn't: it will increase the size of the data, and you'll be passing around potentially very large messages. With lots of large messages, you could run out of memory/space in your backend, and it will make it hard to inspect or log messages.
Instead, you should store the binary file somewhere, and pass a reference (filename, S3 URL, database key, etc) to the task. The task can then load the file, do what it needs to, and delete the original (if appropriate).
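A minimal sketch of that pattern, with hypothetical helper and path names (save_upload, process, and the uploads subfolder are assumptions, not part of the original code):

import os
import uuid

from django.conf import settings

def save_upload(django_file):
    # Hypothetical helper: write the upload under MEDIA_ROOT and return its path.
    path = os.path.join(settings.MEDIA_ROOT, 'uploads', f'{uuid.uuid4()}_{django_file.name}')
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'wb') as out:
        for chunk in django_file.chunks():
            out.write(chunk)
    return path

# In the view: persist each file first, then queue only the paths and the user id.
# paths = [save_upload(f) for f in request.FILES.getlist('files')]
# upload_songs.delay(paths, request.user.pk)

@app.task
def upload_songs(paths, user_id):
    # The task re-opens each file by path instead of receiving raw bytes.
    for path in paths:
        with open(path, 'rb') as fh:
            process(fh)  # hypothetical processing/storage step
        os.remove(path)  # delete the original if appropriate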

Scheduling my crawler with celery not working

Here I want to run my crawler with Celery every minute. I wrote the task as below and called it in the view with delay, but I am not getting the result.
I run celery -A mysite worker -l info, the RabbitMQ broker, Scrapy, and the Django server in different terminals.
The CrawlerHomeView redirects to the task list successfully by creating the task object, but Celery is not working.
It throws this error in the Celery console:
ValueError: not enough values to unpack (expected 3, got 0)

[2020-06-08 15:36:06,732: INFO/MainProcess] Received task: crawler.tasks.schedule_task[3b537143-caa8-4445-b3d6-c0bc8d301b89]
[2020-06-08 15:36:06,735: ERROR/MainProcess] Task handler raised error: ValueError('not enough values to unpack (expected 3, got 0)')
Traceback (most recent call last):
  File "....\venv\lib\site-packages\billiard\pool.py", line 362, in workloop
    result = (True, prepare_result(fun(*args, **kwargs)))
  File "....\venv\lib\site-packages\celery\app\trace.py", line 600, in _fast_trace_task
    tasks, accept, hostname = _loc
ValueError: not enough values to unpack (expected 3, got 0)
views
class CrawlerHomeView(LoginRequiredMixin, View):
    login_url = 'users:login'

    def get(self, request, *args, **kwargs):
        frequency = Task()
        categories = Category.objects.all()
        targets = TargetSite.objects.all()
        keywords = Keyword.objects.all()
        form = CreateTaskForm()
        context = {
            'targets': targets,
            'keywords': keywords,
            'frequency': frequency,
            'form': form,
            'categories': categories,
        }
        return render(request, 'index.html', context)

    def post(self, request, *args, **kwargs):
        form = CreateTaskForm(request.POST)
        if form.is_valid():
            unique_id = str(uuid4())  # create a unique ID.
            obj = form.save(commit=False)
            obj.created_by = request.user
            obj.unique_id = unique_id
            obj.status = 0
            obj.save()
            form.save_m2m()
            schedule_task.delay(obj.pk)
        return render(request, 'index.html', {'form': form, 'errors': form.errors})
tasks.py
scrapyd = ScrapydAPI('http://localhost:6800')

@periodic_task(run_every=crontab(minute=1))  # how to do with task search_frequency value ?
def schedule_task(pk):
    task = Task.objects.get(pk=pk)
    if task.status == 0 or task.status == 1 and not datetime.date.today() >= task.scraping_end_date:
        unique_id = str(uuid4())  # create a unique ID.
        keywords = ''
        # for keys in ast.literal_eval(obj.keywords.all()): #keywords change to csv
        for keys in task.keywords.all():
            if keywords:
                keywords += ', ' + keys.title
            else:
                keywords += keys.title
        settings = {
            'spider_count': len(task.targets.all()),
            'keywords': keywords,
            'unique_id': unique_id,  # unique ID for each record for DB
            'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
        }
        # res = ast.literal_eval(ini_list)
        for site_url in task.targets.all():
            domain = urlparse(site_url.address).netloc  # parse the url and extract the domain
            spider_name = domain.replace('.com', '')
            scrapyd.schedule('default', spider_name, settings=settings, url=site_url.address, domain=domain,
                             keywords=keywords)
    elif task.scraping_end_date == datetime.date.today():
        task.status = 2
        task.save()  # change the task status as completed.
settings
CELERY_BROKER_URL = 'amqp://localhost'
EDIT
This answer helped me find the solution: Celery raises ValueError: not enough values to unpack.
Now that error has gone.
Now in the Celery console I see this:
[2020-06-08 16:33:23,123: INFO/MainProcess] Task crawler.tasks.schedule_task[0578558d-0dc6-4db7-b69f-e912b604ff3d] succeeded in 0.016000000000531145s: None
and I get no scraped results in my frontend.
Now my question is: how can I check that my task is running periodically every minute?
This is the very first time I am using Celery, so there might be some problems.
Celery no longer supports Windows as a platform (version 4 dropped official support).
I highly suggest that you dockerize your app instead (or use WSL2). If you don't want to go that route,
you would probably need to use gevent (note there could be some additional problems with this approach):
pip install gevent
celery -A <module> worker -l info -P gevent
I found a similar, more detailed answer here.
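As for checking the every-minute schedule: periodic tasks only fire if a beat scheduler is running next to the worker. A rough sketch, assuming Celery 4.x with the CELERY_ settings namespace used above (the task path and the pk argument are placeholders, not taken from the original project):

# settings.py
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'schedule-crawl-every-minute': {
        'task': 'crawler.tasks.schedule_task',
        'schedule': crontab(),  # an empty crontab() fires every minute
        'args': (1,),           # placeholder primary key of the Task to run
    },
}

Then start the scheduler in its own terminal with celery -A mysite beat -l info; the worker console should log a received schedule_task entry roughly once a minute.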

Celery not returning any results after success?

Here I am crawling some websites with different keywords. Before, it was only scraping and it worked, but then I implemented Celery. After using Celery I am not able to get the scraping results, but no error is shown. I am using RabbitMQ as the message broker here.
tasks.py
@shared_task()
def schedule_task(pk):
    task = Task.objects.get(pk=pk)
    keywords = ''
    # for keys in ast.literal_eval(obj.keywords.all()): #keywords change to csv
    for keys in task.keywords.all():
        if keywords:
            keywords += ', ' + keys.title
        else:
            keywords += keys.title
    task_ids = []  # one Task/Project contains one or multiple scrapy task
    settings = {
        'spider_count': len(task.targets.all()),
        'keywords': keywords,
        'unique_id': str(uuid4()),  # unique ID for each record for DB
        'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
    }
    # res = ast.literal_eval(ini_list)
    for site_url in task.targets.all():
        domain = urlparse(site_url.address).netloc  # parse the url and extract the domain
        spider_name = domain.replace('.com', '')
        task = scrapyd.schedule('default', spider_name, settings=settings, url=site_url.address, domain=domain,
                                keywords=keywords)
views
def post(self, request, *args, **kwargs):
    form = CreateTaskForm(request.POST)
    if form.is_valid():
        unique_id = str(uuid4())  # create a unique ID.
        obj = form.save(commit=False)
        obj.created_by = request.user
        obj.unique_id = unique_id
        obj.status = 0
        obj.save()
        form.save_m2m()
        print(obj.pk)
        schedule_task.delay(pk=obj.pk)
        return redirect('crawler:task-list')
views before using Celery (which returned the scraped results and worked fine). I just split the scraping part into tasks.py and call it from the view with .delay, but it no longer returns the result (before, it did).
form = CreateTaskForm(request.POST)
if form.is_valid():
    unique_id = str(uuid4())  # create a unique ID.
    obj = form.save(commit=False)
    obj.created_by = request.user
    obj.unique_id = unique_id
    obj.status = 0
    obj.save()
    form.save_m2m()
    keywords = ''
    # for keys in ast.literal_eval(obj.keywords.all()): #keywords change to csv
    for keys in obj.keywords.all():
        if keywords:
            keywords += ', ' + keys.title
        else:
            keywords += keys.title
    task_ids = []  # one Task/Project contains one or multiple scrapy task
    settings = {
        'spider_count': len(obj.targets.all()),
        'keywords': keywords,
        'unique_id': unique_id,  # unique ID for each record for DB
        'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
    }
    # res = ast.literal_eval(ini_list)
    for site_url in obj.targets.all():
        domain = urlparse(site_url.address).netloc  # parse the url and extract the domain
        spider_name = domain.replace('.com', '')
        task = scrapyd.schedule('default', spider_name, settings=settings, url=site_url.address, domain=domain, keywords=keywords)
    return redirect('crawler:task-list')
celery console
[2020-06-10 20:42:55,885: INFO/MainProcess] celery@DESKTOP-ENPLHOS ready.
[2020-06-10 20:42:55,900: INFO/MainProcess] pidbox: Connected to amqp://guest:**@127.0.0.1:5672//.
[2020-06-10 20:43:13,730: INFO/MainProcess] Received task: crawler.tasks.schedule_task[10e7bf06-5e4e-413c-85a3-79d61b9835cf]
[2020-06-10 20:43:17,077: INFO/MainProcess] Task crawler.tasks.schedule_task[10e7bf06-5e4e-413c-85a3-79d61b9835cf] succeeded in 3.3590000000040163s: None
At http://localhost:6800/jobs I can see the spiders running, but the results are not appearing in my view.
views before using celery ( which returns the scraped results worked fine)
That is because your code ran synchronously... one step after the other.
Celery, on the other hand, runs asynchronously, and here you will always get None back as the returned value from it.
If you chain two or more Celery tasks (all of which run async), then you can make use of their returned values, but you cannot chain a synchronous view with an async Celery task.
Celery tasks are meant to be dispatched and run in the background... while your view is supposed to return something else... (without waiting for your spiders to finish).
To be able to make use of the Celery results:
the collected data needs to be stored somewhere (a file such as CSV or JSON, or a database), and the Django view handled in two steps:
first, trigger the Celery task;
second, collect the stored results and display them (see the sketch below).
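A rough sketch of that two-step flow, with hypothetical names (ScrapedItem, start_crawl, task_results, and results.html are assumptions about how the scraped data gets stored and shown):

# views.py
def start_crawl(request, pk):
    # Step 1: only dispatch the task and return immediately.
    schedule_task.delay(pk)
    return redirect('crawler:task-list')

def task_results(request, unique_id):
    # Step 2: read whatever the spiders have stored so far and display it.
    items = ScrapedItem.objects.filter(unique_id=unique_id)
    return render(request, 'results.html', {'items': items})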

unable to use session variables outside view function

I have a login page where the user inputs the username and password. I have made the username a session variable in the /login view function and would like to use this variable outside the view function, in the main body of the code, in an if-else block.
session['username'] = request.form['username'].lower()
How do I do this?
Here is part of the code for this:
import os
import csv
import pymysql
import pymysql.cursors
from datetime import date
import calendar
import ssl
from ldap3 import Connection, Server, ANONYMOUS, SIMPLE, SYNC, ASYNC, ALL
from flask import Flask, make_response, render_template, url_for, redirect, request, session, escape
from validusers import users

app = Flask(__name__)

IT = pymysql.connect(host='xx.xx.xx.xx', user='xxxxx', password='xxxxx',
                     db='xxxx')  # Connect to the IT database
Others = pymysql.connect(host='xxxxx', user='xxxxxx', password='xxxxxx',
                         db='xxxxx')  # Connect to the non IT database
a = IT.cursor()  # Open Cursor for IT database
b = Others.cursor()  # Open Cursor for non-IT database

@app.route('/')
@app.route('/login', methods=['GET', 'POST'])
def login():
    error = None
    if request.method == 'POST':
        #if not request.form['username']:
            #error='You forgot to enter "Username", please try again'
            #return render_template('login.html',error=error)
        if request.form['username'].lower() not in users:
            error = 'You are not authorized to view this page !!'
            return render_template('login.html', error=error)
        #if not request.form['password']:
            #error='You forgot to enter "Password", please try again'
            #return render_template('login.html',error=error)
        #else:
            #s = Server('appauth.corp.domain.com:636', use_ssl=True, get_info=ALL)
            #c = Connection(s,user=request.form['username'],password=request.form['password'],check_names=True, lazy=False,raise_exceptions=False)
            #c.open()
            #c.bind()
            #if (c.bind() != True) is True:
                #error='Invalid credentials. Please try again'
            #else:
                #session['username'] = request.form['username'].lower()
                #return redirect(url_for('index'))
    return render_template('login.html', error=error)

@app.route('/index', methods=['GET', 'POST'])
def index():
    if 'username' in session:
        return render_template('index.html')

Filename = os.getenv("HOMEDRIVE") + os.getenv("HOMEPATH") + "\\Desktop\RosterUnified.csv"  # Create/write a CSV file in the user's desktop
Filename1 = os.getenv("HOMEDRIVE") + os.getenv("HOMEPATH") + "\\Desktop\RosterCurrentMonth.csv"
d = open(Filename, 'w', newline='\n')  # Format for CSV input
c = csv.writer(d)
c.writerow(["Manager NT ID", " Vertical Org", "Employee ID"] + dayssl)  # Write the header list of strings in the first row
for row in result_IT:
    c.writerow(row)  # Write output for IT to csv
d.close()
# result_IT and result_Others part of code is omitted
e = open(Filename, 'a', newline='\n')
f = csv.writer(e)
for row in result_Others:
    f.writerow(row)  # Append to the existing CSV file with non IT data
e.close()

x = session['username']
sql = "select verticalorg from tbl_employeedetails where empntid=(%s)"
args = x
a.execute(sql, args)
b.execute(sql, args)
c = a.fetchall()
d1 = b.fetchall()
s = c + d1
q = [x[0] for x in s]
sql1 = "select role from tbl_employeedetails where empntid=(%s)"
a.execute(sql1, args)
b.execute(sql1, args)
c1 = a.fetchall()
d2 = b.fetchall()
Role = c1 + d2
r = [x[0] for x in Role]
if r == 'O':
    if q == 27:
        f1 = open(Filename, 'r', newline='\n')
        f2 = open(Filename1, 'w', newline='\n')
        reader = csv.DictReader(f1)
        writer = csv.writer(f2)
        writer.writerow(["Manager NT ID", " Vertical Org", "Employee ID"] + dayssl)
        rows = [row for row in reader if row['Vertical Org'] == 'HR']
        writer.writerow[row in rows]
    elif q == 2:
        f1 = open(Filename, 'r', newline='\n')
        f2 = open(Filename1, 'w', newline='\n')
        reader = csv.DictReader(f1)
        writer = csv.writer(f2)
        writer.writerow(["Manager NT ID", " Vertical Org", "Employee ID"] + dayssl)
    f2.close()
    z = open(Filename1)
    with z as f:
        p = f.read()
else:
    z = open(Filename)
    with z as f:
        p = f.read()

@app.route('/csv/')
def download_csv():
    csv = p
    response = make_response(csv)
    cd = 'attachment; filename=RosterCurrentMonth.csv'
    response.headers['Content-Disposition'] = cd
    response.mimetype = 'text/csv'
    return response

z.close()
os.remove(Filename)

@app.route('/logout')
def logout():
    # remove the username from the session if it's there
    session.pop('username', None)
    return redirect(url_for('login'))

app.secret_key = 'secret key generated'

if __name__ == '__main__':
    context = ('RosterWeb.crt', 'RosterWeb.key')
    app.run(ssl_context=context, threaded=True, debug=True)
getting the error:
Traceback (most recent call last):
  File "roster.py", line 175, in <module>
    x=session['username']
  File "C:\Users\dasa17\Envs\r_web\lib\site-packages\werkzeug\local.py", line 373, in <lambda>
    __getitem__ = lambda x, i: x._get_current_object()[i]
  File "C:\Users\dasa17\Envs\r_web\lib\site-packages\werkzeug\local.py", line 302, in _get_current_object
    return self.__local()
  File "C:\Users\dasa17\Envs\r_web\lib\site-packages\flask\globals.py", line 37, in _lookup_req_object
    raise RuntimeError(_request_ctx_err_msg)
RuntimeError: Working outside of request context.

This typically means that you attempted to use functionality that needed
an active HTTP request. Consult the documentation on testing for
information about how to avoid this problem.
You need to place the code you want to run before any routing is done into a function and use the before_first_request decorator on it. It will be executed before your first request is handled, and you can make use of your session variable from there.
@app.route('/')
def index():
    # ... index route here ...
    pass

@app.before_first_request
def init_app():
    # ... do some preparation here ...
    session['username'] = 'me'

Celery: check if a task is completed to send an email to

I'm new to celery and an overall python noob. I must have stumbled upon the right solution during my research but I just don't seem to understand what I need to do for what seems to be a simple case scenario.
I followed the following guide to learn about flask+celery.
What I understand:
It seems there is something obvious I'm missing about how to trigger a task after the first one is finished. I tried using callbacks, using loops, and even tried Celery Flower and Celery Beat, only to realise they have nothing to do with what I'm trying to do...
Goal:
After filling in the form, I want to send an email with attachments (the result of the task), or a failure email otherwise, without having to wonder what my user is doing on the app (no HTTP requests).
My code:
class ClassWithTheTask:
    def __init__(self, filename, proxies):
        # do stuff until a variable results is created
        self.results = 'this contains my result'
@app.route('/', methods=['GET', 'POST'])
@app.route('/index', methods=['GET', 'POST'])
def index():
    form = MyForm()
    if form.validate_on_submit():
        # ...
        # the task
        my_task = task1.delay(file_path, proxies)
        return redirect(url_for('taskstatus', task_id=my_task.id, filename=filename, email=form.email.data))
    return render_template('index.html',
                           form=form)

@celery.task(bind=True)
def task1(self, filepath, proxies):
    task = ClassWithTheTask(filepath, proxies)
    return results

@celery.task
def send_async_email(msg):
    """Background task to send an email with Flask-Mail."""
    with app.app_context():
        mail.send(msg)

@app.route('/status/<task_id>/<filename>/<email>')
def taskstatus(task_id, filename, email):
    task = task1.AsyncResult(task_id)
    if task.state == 'PENDING':
        # job did not start yet
        response = {
            'state': task.state,
            'status': 'Pending...'
        }
    elif task.state != 'FAILURE':
        response = {
            'state': task.state,
            'status': task.info.get('status', '')
        }
        if 'results' in task.info:
            response['results'] = task.info['results']
            response['untranslated'] = task.info['untranslated']
            msg = Message('Task Complete for %s !' % filename,
                          recipients=[email])
            msg.body = 'blabla'
            with app.open_resource(response['results']) as fp:
                msg.attach(response['results'], "text/csv", fp.read())
            with app.open_resource(response['untranslated']) as fp:
                msg.attach(response['untranslated'], "text/csv", fp.read())
            # the big problem here is that it will send the email only if the user
            # refreshes the page and gets the 'SUCCESS' status.
            send_async_email.delay(msg)
            flash('task finished. sent an email.')
            return redirect(url_for('index'))
    else:
        # something went wrong in the background job
        response = {
            'state': task.state,
            'status': str(task.info),  # this is the exception raised
        }
    return jsonify(response)
I don't get the goal of your method for the status check. Anyway, what you are describing can be accomplished this way:
if form.validate_on_submit():
    # ...
    # the task
    my_task = (
        task1.s(file_path, proxies).set(link_error=send_error_email.s(filename, error))
        | send_async_email.s()
    ).delay()
    return redirect(url_for('taskstatus', task_id=my_task.id, filename=filename, email=form.email.data))
Then your error task will look like this. The normal task can stay the way it is.
@celery.task
def send_error_email(task_id, filename, email):
    task = AsyncResult(task_id)
    .....
What happens here is that you are using a chain. You are telling Celery to run your task1; if that completes successfully, then run send_async_email; if it fails, run send_error_email. This should work, but you might need to adapt the parameters; consider it pseudocode.
This does not seem right at all:
def task1(self, filepath, proxies):
    task = ClassWithTheTask(filepath, proxies)
    return results
The line my_task = task1.delay(file_path, proxies) earlier in your code suggests you want to return task, but you return results, which is not defined anywhere (ClassWithTheTask is also undefined in the snippet). This code would crash, and your task would never complete.
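A minimal sketch of the fix, assuming ClassWithTheTask exposes the computed value on self.results as in the snippet above:

@celery.task(bind=True)
def task1(self, filepath, proxies):
    # Return the object's results attribute so the chained
    # send_async_email task receives something usable.
    work = ClassWithTheTask(filepath, proxies)
    return work.results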