I'm using Django 1.5.5.
Say I have an object as such:
class Encounter(models.Model):
    date = models.DateTimeField(blank=True, null=True)
How can I detect when a given Encounter has reached the current time? I don't see how signals can help me.
You can't detect it using just Django.
You need some scheduler that will check every Encounter's date (for example, with a corresponding filter query) and perform the needed actions.
It can be a simple cron script. You can write it as a Django custom management command, and have cron call it every 5 minutes, for example.
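A minimal sketch of such a command (the app name myapp and the action taken are assumptions, since the question doesn't say what should happen when the date is reached):

# myapp/management/commands/check_encounters.py
from django.core.management.base import BaseCommand
from django.utils import timezone

from myapp.models import Encounter


class Command(BaseCommand):
    help = "Process encounters whose date has been reached"

    def handle(self, *args, **options):
        due = Encounter.objects.filter(date__lte=timezone.now())
        for encounter in due:
            # do whatever should happen when the date is reached
            self.stdout.write("Encounter %s is due" % encounter.pk)

A crontab line like */5 * * * * /path/to/python /path/to/manage.py check_encounters would then run it every 5 minutes.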
Or you can use Celery. With it, you can see worker status from the admin and do some other things.
What you could do is use Celery. When you save an Encounter object, a task would be put on the queue and only executed once the current time reaches the encounter's date.
There is one caveat though: it might execute a bit later, depending on how busy the Celery workers are.
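A sketch of that with Celery (the task itself is hypothetical; eta tells Celery not to execute before the given time, and the encounter's date should ideally be a timezone-aware datetime):

# tasks.py
from celery import shared_task


@shared_task
def process_encounter(encounter_id):
    # do whatever should happen once the encounter's date has arrived
    ...

# wherever the Encounter is saved (a view, a post_save handler, ...):
process_encounter.apply_async(args=[encounter.id], eta=encounter.date)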
Related
We have a Django application that uses django-river for workflow management. For a performance improvement, we had to use bulk_create. We need to insert data into a couple of tables, several rows in each.
Initially we were using the normal .save() method and the workflow was working as expected (the post_save signals were sent properly). Once we moved to bulk_create, the performance improved from minutes to seconds, but django-river stopped working because the default post_save signals were no longer sent. We had to implement the signals ourselves based on the available documentation.
class CustomManager(models.Manager):
    def bulk_create(self, items, ....):
        super().bulk_create(...)
        for i in items:
            [......]  # code to send signal
And
class Task(models.Model):
    objects = CustomManager()
    ....
This got the workflow working again, but sending the signals takes time and that destroys the performance improvement gained with bulk_create.
So is there a way to improve the signal creation?
More details
from django.db import models
from django.db.models.signals import post_save


def post_save_fn(obj):
    post_save.send(obj.__class__, instance=obj, created=True)


class CustomManager(models.Manager):
    def bulk_create(self, objs, **kwargs):
        # Your code here
        data_obj = super(CustomManager, self).bulk_create(objs, **kwargs)
        for i in data_obj:
            # t1 = threading.Thread(target=post_save_fn, args=(i,))
            # t1.start()
            post_save.send(i.__class__, instance=i, created=True)
        return data_obj
class Test(Base):
    test_name = models.CharField(max_length=100)
    test_code = models.CharField(max_length=50)
    objects = CustomManager()

    class Meta:
        db_table = "test_db"
What is the problem?
As others have mentioned in the comments, the problem is that the functions called via post_save are taking a long time. (Remember that signals are not async! This is a common misconception.)
I'm not familiar with django-river but taking a quick look at the functions that will get called post-save (see here and here) we can see that they involve additional calls to the database.
Whilst you save a lot of individual db hits by using bulk_create, you are still calling the database multiple times for each post_save signal.
What can be done about it?
In short: not much! For the vast majority of Django requests, the slow part will be calling the database. This is why we try to minimise the number of calls to the db (using things like bulk_create).
Reading through the first few paragraphs of django-river's docs, the whole idea is to move things that would normally live in code into the database. The big advantage is that you don't need to rewrite code and redeploy so often. The disadvantage is that you inevitably have to refer to the database more, which slows things down. This will be fine for some use cases, but not all.
There are two things I can think of which might help:
Does all of this currently happen as part of the request/response cycle? And if it does, does it need to? If the answers to these two questions are 'yes' and 'no' respectively, then you could move this work to a separate task queue. It will still be slow, but at least it won't slow down your site.
Depending on exactly what your workflows are and the nature of the data you are creating, it might be the case that you can do everything that the post_save signals are doing in your own function, and do it more efficiently. But this will definitely depend upon your data, and your app, and will move away from the philosophy of django-river.
Use a separate worker if the "signal" logic can be executed after the bulk save.
You can create an additional queue table and put in it the metadata describing what your future worker should do.
Create a separate worker (a Django management command) with the needed logic, reading its data from the queue table. Writing it as a management command lets you run the worker in the main flow (you can call management commands from regular Django code) or run it on a schedule via crontab.
How to run such a worker?
If you need the work done as soon as the records are created, run the worker in a separate thread using the threading module; your request-response lifecycle will then finish right after you've started the new thread.
Otherwise, if it can be done later, set up a schedule and run it via crontab using the management command framework.
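A rough sketch of such a worker as a management command (SignalQueue is a hypothetical queue table; the replayed post_save matches the manual sends shown earlier):

# myapp/management/commands/process_signal_queue.py
from django.core.management.base import BaseCommand
from django.db.models.signals import post_save

from myapp.models import SignalQueue, Test  # assumed names


class Command(BaseCommand):
    help = "Replay post_save signals for rows created via bulk_create"

    def handle(self, *args, **options):
        for entry in SignalQueue.objects.filter(processed=False):
            instance = Test.objects.get(pk=entry.object_id)
            post_save.send(instance.__class__, instance=instance, created=True)
            entry.processed = True
            entry.save(update_fields=["processed"])

To trigger it right after the bulk save (the threading option above), something like threading.Thread(target=call_command, args=("process_signal_queue",)).start() with call_command from django.core.management would do; otherwise point crontab at manage.py process_signal_queue.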
In a database, I have a field called date. Is there a way to delete a row when the date passes, so that it doesn't show up anymore? I've tried comparing it to today's date in the view, but this wouldn't happen every day, and people would still see it on the first page load. Any ideas?
Removing something from your database is not safe, for many reasons ranging from permissions to on_delete logic. If you are not sure that deletion is really required, just mark the row as active=False instead.
I would not recommend using cron, since it is hard to maintain: you have to set up different tasks on different environments manually, copy these files into your VCS somehow, and work with bash instead of Python.
Also, regarding database events: I would not recommend storing logic like this in your database, since it is not controlled by VCS and is hard to maintain.
If your app is pretty simple, schedule is an option.
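A minimal sketch with the schedule package, assuming the "mark as inactive" advice above and a hypothetical Event model with active and date fields:

# run this in a long-lived process, e.g. a small management command
import time
from datetime import date

import schedule

from myapp.models import Event  # assumed model


def deactivate_past_events():
    Event.objects.filter(active=True, date__lt=date.today()).update(active=False)


schedule.every().day.at("00:05").do(deactivate_past_events)

while True:
    schedule.run_pending()
    time.sleep(60)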
But if you are looking for some extra info like:
What rows were deleted?
Were there any exceptions?
You can move to the more complex Celery with Beat turned on. Extra dependencies (like Redis or RabbitMQ) are the main disadvantage.
Docs:
celery beat
Related:
How do I get a Cron like scheduler in Python?
I believe the best way would be to use a cron job, or to add a condition in the view so that only rows whose date has not yet passed are shown.
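For the view-side condition, a one-line queryset filter is usually enough (a sketch, reusing the Events/event_date names from the answer further down):

from datetime import date

# only rows whose date has not passed yet
upcoming = Events.objects.filter(event_date__gte=date.today())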
I would recommend you use a MySQL event, since this runs constantly, unlike triggers, which are only fired on database operations. You want this to occur outside of anything happening in the application, purely based on time, so a MySQL event works for this scenario. See the full tutorial here: http://www.sitepoint.com/working-with-mysql-events/
I had an easier approach; I guess you could call it "hard-coded". I made a function called deleteevent, which had the following code:
from datetime import date, timedelta

def deleteevent():
    yesterday = date.today() - timedelta(1)
    if Events.objects.filter(event_date=yesterday).count():
        Events.objects.filter(event_date=yesterday).delete()
Then, in every other function I had, I called this at the beginning, so the event would be deleted before the page loaded.
I need to create a standalone script which accesses the database, fetches data from a table, processes it, and stores it into another table. I am also using django-rq to run this script.
Where should I place this script in the Django project structure?
Without using views.py, how should I run this script using django-rq?
If I understand your case correctly, I would use custom management commands https://docs.djangoproject.com/en/1.7/howto/custom-management-commands/
In one of your views, import the script's functions and django-rq, and continue with your processing in the view.
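One way to do it without views.py at all (a sketch; the app name, module, and function names are assumptions):

# myapp/tasks.py -- any importable module inside an installed app works
def process_table():
    # fetch rows from the source table, process them,
    # and write the results into the destination table
    ...

# enqueue it from a management command, the shell, or any other code:
import django_rq

django_rq.enqueue(process_table)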
I just worked through this same issue. One added variable was that I wanted to run a job every hour, so in addition to Django-RQ I am also using RQ-Scheduler.
In this approach, you schedule a function call in the job you create:
from datetime import datetime

from redis import Redis
from rq_scheduler import Scheduler

scheduler = Scheduler(connection=Redis())  # uses the default queue

scheduler.schedule(
    scheduled_time=datetime.utcnow(),  # Time for first execution, in UTC timezone
    func=func,                         # Function to be queued
    args=[arg1, arg2],                 # Arguments passed into function when executed
    kwargs={'foo': 'bar'},             # Keyword arguments passed into function when executed
    interval=60,                       # Time before the function is called again, in seconds
    repeat=10,                         # Repeat this number of times (None means repeat forever)
)
I created a module my_helpers.py in the root of my Django project, with functions that do the work I want and that schedule the task as I need. Then, in a separate shell (python manage.py shell), I import the helpers and run my function to schedule the task.
I hope that helps; it's working for me.
I'm trying to build a Django app for a translation crowdsourcing task.
For each task in the database, I have an is_completed boolean flag that is set when the user completes the task. I also have a 'give me a random task' button, which chooses from the list of uncompleted tasks.
My question is this. How do I prevent two users being given the same task, if one user clicks the button shortly after another?
I was thinking of setting a has_started flag on the row when a task is loaded, and removing started tasks from the list of random available tasks: but what if the user starts a task and then closes the page without finishing it, so it never gets unset? I'll end up with a lot of unfinished tasks.
Could I flag this in a cleverer way with session variables that expire, perhaps? But I know it's hard to capture the 'user closes page' event reliably in JavaScript.
Thanks!
Instead of making has_started a flag, you could make it a timestamp and decide on a reasonable amount of time for task completion (which will allow you to assume that a task has been dropped after X minutes).
There is a risk that this will result in multiple translations of the same thing (i.e. if someone is really really slow and the job is recirculated early), but I think it will cover most cases.
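A sketch of the "give me a random task" query under that approach (the started_at field name and the 30-minute window are assumptions; Task is the questioner's task model):

from datetime import timedelta

from django.db.models import Q
from django.utils import timezone

STALE_AFTER = timedelta(minutes=30)  # assumed completion window


def pick_random_task():
    cutoff = timezone.now() - STALE_AFTER
    candidates = Task.objects.filter(is_completed=False).filter(
        Q(started_at__isnull=True) | Q(started_at__lt=cutoff)
    )
    task = candidates.order_by("?").first()  # random unclaimed or stale task
    if task is not None:
        task.started_at = timezone.now()
        task.save(update_fields=["started_at"])
    return task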
I would use locking: add a field lock_time to your database and set it to the current time as soon as a user starts a task. Then, with a JavaScript event that fires every, let's say, 10 seconds, keep updating lock_time. Now you can check whether lock_time is more than 30 seconds ago, and if so, "break" the lock.
You'll have to use a timeout. There are no javascript events for "user spills coffee on computer" or "user does a hard reset" etc.
I think you'd best set the userid and the startdate on start.
When you update a database like this --
UPDATE task t
SET t.userid = :USERID, t.lastprogress = sysdate()
WHERE t.userid is null and t.taskid = :TASKID
-- you will notice 0 modified records when a task is already assigned to a user. This addresses your first problem.
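The same conditional claim in the Django ORM, which also reports how many rows it changed (a sketch, using the column names from the SQL above):

from django.utils import timezone

claimed = Task.objects.filter(pk=task_id, userid__isnull=True).update(
    userid=user_id, lastprogress=timezone.now()
)
if claimed == 0:
    # the task already belongs to someone else
    ...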
Then, when you save a last modified date, you can run a cron job to clean up abandoned tasks, being tasks that haven't been modified in a certain period of time. But this is a different problem altogether. It's hard to find the right balance of deciding too early or too late whether a task is abandoned or not.
If every modification also updates this date, a user can even work on a task for a longer time, without it being stolen by someone else, as long as they do regular saves.
Also, when saving the modification data (you can write a routine to do that), you can check if the userid still matches. If the userid of the task is NULL (cron decided 'abandoned') or another userid (abandoned task picked up by someone else), you can raise an error to tell the user that the task no longer belongs to them.
I am using mTurk for surveys, and I need a way of making sure that people who have participated in a previous survey / HIT do not participate in certain future surveys / HITs. I am not sure whether I should do this as a qualification or in some other way.
I know there is some way to do this, but I have no idea how. I have very limited programming experience and would greatly, greatly appreciate specific instructions on how I might do this. My understanding is that I might need to use AWS? Many thanks!
Mass rejections as suggested above are a really, really bad idea in terms of your reputation as a requester. You are much better off creating a Qualification for the new HIT, which automatically grants a score of 100 (or whatever) to anyone who takes it, and assigning scores of zero to everyone who has done the previous surveys. This prevents repeats but doesn't annoy any of your workers.
The easiest way to create a Qualification is at https://requester.mturk.com/qualification_types.
If you download the csv of workers from here https://requester.mturk.com/workers, you can assign scores to workers who have done the previous HIT(s).
To make the qualification grant scores to new workers automatically requires the API, though.
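If you do script it, here is a sketch with the boto3 MTurk client (the qualification name, auto-granted value, and worker list are placeholders, not part of the answer above):

import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# New workers who request the qualification automatically get a score of 100
qual = mturk.create_qualification_type(
    Name="Eligible for new survey",  # placeholder name
    Description="Auto-granted; workers from previous surveys are scored 0",
    QualificationTypeStatus="Active",
    AutoGranted=True,
    AutoGrantedValue=100,
)
qual_id = qual["QualificationType"]["QualificationTypeId"]

# Everyone who did the previous HITs gets a score of 0
for worker_id in previous_worker_ids:  # e.g. read from the downloaded CSV
    mturk.associate_qualification_with_worker(
        QualificationTypeId=qual_id,
        WorkerId=worker_id,
        IntegerValue=0,
        SendNotification=False,
    )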
Here's a hacky way to do it:
When you accept HITs for surveys, save every participating worker's ID.
In the writeup, note that "if you've done previous surveys with us, then you can't do this one (i.e. you can, but we won't approve it)".
When you approve HITs, cross-reference the worker IDs with anybody who participated in a previous survey, and reject the HITs of any that match.
If you're doing enough surveys, then yes, you probably want to use the AWS API for at least the approval part. Otherwise, most things appear to be doable from the requester interface.
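A small sketch of the cross-referencing step, assuming you keep each batch of worker IDs in a CSV with a WorkerId column (the filenames are placeholders):

import csv


def load_worker_ids(path):
    with open(path, newline="") as f:
        return {row["WorkerId"] for row in csv.DictReader(f)}


previous = load_worker_ids("previous_survey_workers.csv")
current = load_worker_ids("new_survey_workers.csv")

repeat_workers = previous & current  # the submissions to reject / not approve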
The Amazon Mechanical Turk service gives requesters the option to qualify their workers with a Qualification Type. By connecting your HITs to a qualification type named "A" and then granting exactly that qualification type to chosen workers, only workers who have the qualification can see and work on the HITs.
First, create the desired qualification types through the mturk web UI (it is only a name and a description): requester.mturk.com > Manage > Qualification Types. It will give you a qualification id after generating it (you will need it soon).
Second, in the HIT creation loop, you have to use the QualificationRequirement class. (I am using Java, and it looks like the code below):
QualificationRequirement[] qualReq = new QualificationRequirement[1];
qualReq[0] = new QualificationRequirement();
qualReq[0].setQualificationTypeId(qualID);
qualReq[0].setComparator(Comparator.EqualTo);
qualReq[0].setIntegerValue(100);
qualReq[0].setRequiredToPreview(false);
Then, in the HIT creation loop, I use this:
try {
    hit = this.service.createHIT(null,
            props.getTitle(),
            props.getDescription(),
            props.getKeywords(),
            question.getQuestion(),
            new Double(props.getRewardAmount()),
            new Long(props.getAssignmentDuration()),
            new Long(props.getAutoApprovalDelay()),
            new Long(props.getLifetime()),
            new Integer(props.getMaxAssignments()),
            props.getAnnotation(),
            qualReq,
            null);
} catch (Exception e) {
    // handle or log the failed HIT creation
    System.err.println(e.getLocalizedMessage());
}
Third, assign the qualification type to the workers you want to work on your HITs. It is very straightforward; I usually use the mturk UI to do it: https://requester.mturk.com/ > Manage tab > Workers. You should download the CSV file if you want to assign this qualification to a bunch of workers.
(Workers here means anyone who has worked with you in the past.)
You can notify workers by sending them an email after qualifying them.
Notice: some workers are slow to answer new HITs after being qualified, so keep in mind that you should have a backup plan and extra time in case you do not receive enough responses within a certain amount of time.