How (in code) can I prevent two people starting the same crowdsourcing task at once? - django

I'm trying to build a Django app for a translation crowdsourcing task.
For each task in the database, I have an is_completed boolean flag that is set when the user completes the task. I also have a 'give me a random task' button, which chooses from the list of uncompleted tasks.
My question is this. How do I prevent two users being given the same task, if one user clicks the button shortly after another?
I was thinking of setting a has_started flag on the row when a task is loaded, and removing started tasks from the list of random available tasks: but what if the user starts a task and then closes the page without finishing it, so it never gets unset? I'll end up with a lot of unfinished tasks.
Could I flag this in a cleverer way with session variables that expire, perhaps? But I know it's hard to capture the 'user closes page' event reliably in JavaScript.
Thanks!

Instead of making has_started a flag, you could make it a timestamp and decide on a reasonable amount of time for task completion (which will allow you to assume that a task has been dropped after X minutes).
There is a risk that this will result in multiple translations of the same thing (i.e. if someone is really really slow and the job is recirculated early), but I think it will cover most cases.

I would use locking, you add a field "lock_time" to your database. You update this to the current time as soon as a user starts a task. Then, with an event that's called every, let's say: 10 seconds in javascript, you update the lock_time. Now you can check if the lock_time is more than 30 seconds ago, if so: you "break" the lock.

You'll have to use a timeout. There are no javascript events for "user spills coffee on computer" or "user does a hard reset" etc.

I think you'd best set the userid and the startdate on start.
When you update a database like this --
UPDATE task t
SET t.userid = :USERID, t.lastprogress = sysdate()
WHERE t.userid is null and t.taskid = :TASKID
-- you will notice 0 modified records when a task is already assigned to a user. This addresses your first problem.
Then, when you save a last modified date, you can run a cron job to clean up abandoned tasks, being tasks that haven't been modified in a certain period of time. But this is a different problem altogether. It's hard to find the right balance of deciding too early or too late whether a task is abandoned or not.
If every modification also updates this date, a user can even work on a task for a longer time, without it being stolen by someone else, as long as they do regular saves.
Also, when saving the modification data (you can write a routine to do that), you can check if the userid still matches. If the userid of the task is NULL (cron decided 'abandoned') or another userid (abandoned task picked up by someone else), you can raise an error to tell the user that the task no longer belongs to them.

Related

Event Sourcing: concurrently creating conflicting events

I am trying to implement an Event Sourcing system using Kafka and have run into the following issue. During a new user sign-up I want to check if the username the user provided is already taken. However, consider the case where 2 users are trying to sign-up at the same time providing the same username.
In my understanding of how ES works the controller that processes the sign-up request will check if the request is valid, it will then send a new event (e.g. NewUser) to Kafka, and finally that event will be picked up by another controller which will persist it in a materialized view (e.g. Postgres DB). The problem is that the validation of the request is done against the materialized view but the actual persistence to it happens later. So because the 2 requests are being processed in parallel (by different service instances) they might both pass the validation, resulting in 2 NewUser messages. However, when the second controller tries to persist those 2 NewUser messages in the database saving the second event will fail because of the violation of the uniqueness constraint for the username.
Any ideas on how to address this?
Thanks.
UPDATE:
In particular, I would like to verify whether the following are accepted approaches to the problem:
use the username as the userId (restrictive)
send an event to a topic partitioned by username and when validation
is done send an event to another topic
Initial validation against the materialized view won't be enough in most scenarios where you have constraints. There can always be some relevant events haven't been materialized yet. There are two main concurrency control approaches to ensure that correct results are generated:
1. Pessimistic approach:
If you want to validate constraints before you publish an event, you need to lock relevant resources (entity, aggregate or data set). The locking means your services must not be able to publish events on these resources. After this point, to get the current state of your data:
You can wait until all events published before locking are materialized.
You can read current state from the database and apply events on it in a separate process.
2. Optimistic approach:
In this approach, you perform your validations after publishing events. To achieve this, you need to implement a feedback mechanism. The process which consumes events and performs validations should be able to publish validation results. You can perform the validations in-memory when possible. Otherwise, you can rely on your materialized data store.
Martin Kleppman talks about a two-step solution for exactly the same problem here and in his book. In this solution, there are two topics: "claims" and "registrations". First, you publish a claim to take the username, then try to write it to the database, and finally publish the result to the registrations topic. At conceptual level, it follows the same steps in the second approach you have mentioned. In validation step, it avoids implementing validation logic and keeping secondary indexes in memory by relying on the database.
During a new user sign-up I want to check if the username the user provided is already taken.
You may want to review Greg Young's essay on Set Validation.
In my understanding of how ES works the controller that processes the sign-up request will check if the request is valid, it will then send a new event (e.g. NewUser) to Kafka, and finally that event will be picked up by another controller which will persist it in a materialized view (e.g. Postgres DB).
That's a little bit different from the usual arrangement. (You may also want to review Greg's talk on polyglot data.)
Suppose we begin with two writers; that's fine, but if there is going to be a single point of truth, then you are going to need synchronization somewhere.
The usual arrangement is to use a form of optimistic concurrency; when processing a request, you reserve a copy of your original state, then you do your calculation, and finally you send the book of record a `replace(originalState,newState)'.
So at this point, we have two writes racing toward the book of record
replace(red,green)
replace(red,blue)
At the book of record, the writes are processed in series.
[...,replace(red,blue)...,replace(red,green)]
So when the book of record processes replace(red,blue), it performs a check that yes, the state is currently red, and swaps in blue. Later, when the book of record tries to process replace(red,green), the book of record performs the check, which fails because the state is no longer red.
So one of the writes has succeeded, and the other fails; the latter can propagate the failure outwards, or retry, or..., precisely what depends on the specific mechanics in question. A retry should mean, of course, reload the "original state", at which point the model would discover that some previous edit already claimed the username.
Any ideas on how to address this?
Single writer per stream makes the rest of the problem pretty simple, by eliminating the ambiguity introduced by having multiple in memory copies of the model.
Multiple writers using a synchronous write to the durable store is probably the most common design. It requires an event store that understands the idea of writing to a specific location in a stream -- aka "expected version".
You can perform an asynchronous write, and then start doing other work until you get an acknowledgement that the write succeeded (or not, or until you time out, or)....
There's no magic -- if you want uniqueness (or any other sort of invariant enforcement, for that matter), then everybody needs to agree on a single authority, and anybody else who wants to propose a change won't know if it has been accepted without getting word back from the authority, and needs to be prepared for a rejected proposal.
(Note: this shouldn't be a surprise -- if you were using a traditional design with current state stored in a RDBMS, then your authority would be a user table in the database, with a uniqueness constraint on the username column, and the race would be between the two insert statements trying to finish their transaction first....)

Delete row from database when date passes

In a database, i have a field called date. Is there a way to delete a row when the date passes, so that it doesnt show up anymore? Ive tried comparing it to todays date in the view, but this wouldnt happen everyday, and people would still see it on the first page load. Any ideas?
Removing something from your database is not safe for many reasons. Starting from permissions going to on_delete logic. If you are not sure about that it's totally required to delete something, just mark this row as active=false.
I would not recomend to use cron, since it hard to maintain: you have to set different tasks on different environments manually, copy these files somewhere on your VCS, work with bash instead of python.
Also, when talking about events, I would not recommend to store something like this in your database, since it is not controlled by VCS and hard to maintain.
If your app is pretty simple schedule is an option.
But if you are looking for some extra info like:
What rows were deleted?
Were there any exceptions?
You can move to more complex Celery with Beat turned on. Extra dependencies (like Redis, RabbitMQ) are the main disadvantage.
Docs:
celery beat
Related:
How do I get a Cron like scheduler in Python?
I believe the best way would be to use a Cron Job or to use a additional conditional in the view to show only rows after the said date.
I would recommend you use a mysql event, since this will run constantly, unlike triggers that are only fired on database operations. You want this to occur outside of anything happening in the application, just based on time, so mysql event will work for this scenario. See full tutorial here: http://www.sitepoint.com/working-with-mysql-events/
I had a easier approach, i guess you could call it "hard-coded". I made a function called deleteevent, which had the following code
def deleteevent():
yesterday = date.today() - timedelta(1)
if Events.objects.filter(event_date = yesterday).count():
Events.objects.filter(event_date = yesterday).delete()
Then, in every other function i had, i called this at the beginning, so the event would be deleted before the page loaded

How to detect when my Django object's DataTimeField reach current time

I'm using Django 1.5.5.
Say I have an object as such:
class Encounter(model.Models):
date = models.DateTimeField(blank=True, null=True)
How can I detect when a given Encounter has reached current time ? I don't see how signals can help me.
You can't detect it using just Django.
You need some scheduler, that will check every Encounters date (for example, by using corresponding filter query), and do needed actions.
It can be a simple cron script. You can write it as django custom management command. And cron must call it every 5 minute, for example.
Or, you can use Celery. With it, you can see worker status from admin and do some other things.
What you could do is use Celery. When you save an object of Encounter this would then get into the task queue and execute only once it has reached current time.
There is one caveat though, it might execute a bit later depending on how busy the celery workers are.

In mTurk, how can I use participation in a previous HIT (or series of HITs) as a qualification?

I am using mTurk for surveys, and I need a way of making sure that people who have participated in a previous survey / HIT do not participate in certain future surveys / HITs. I am not sure whether I should do this as a qualification or in some other way.
I know there is some way to do this, but I have no idea how. I have very limited programming experience and would greatly, greatly appreciate specific instructions on how I might do this. My understanding is that I might need to use AWS? Many thanks!
Mass rejections as suggested above are a really, really bad idea in terms of your reputation as a requester. You are much better off creating a Qualification for the new HIT, which automatically grants a score of 100 (or whatever) to anyone who takes it, and assigning scores of zero to everyone who has done the previous surveys. This prevents repeats but doesn't annoy any of your workers.
The easiest way to create a Qualification is at https://requester.mturk.com/qualification_types.
If you download the csv of workers from here https://requester.mturk.com/workers, you can assign scores to workers who have done the previous HIT(s).
To make the qualification grant scores to new workers automatically requires the API, though.
Here's a hacky way to do it:
When you accept HITs for surveys, save every participating worker's ID.
In the writeup, note that "if you've done previous surveys w/ us, then you can't do this one (IE, you can, but we won't approve it)".
When you approve HITs, cross-reference the worker ids with anybody who participated in a previous survey, and reject the hits of any that match.
If you're doing enough surveys, then yes, you probably want to use AWS API for at least the approval part. otherwise, most things appear to be do-able from the requester interface.
Amazon Mechanical Turk service has this option for requesters to grant their workers by Qualification_Type. In this way by connecting your HITs to a qualification_type naming "A", then granting workers exactly the same qualification_type, only workers who have that qualification can see and work on HITs.
First, creating desired qualification types through mturk web UI.(it is only name and description) requester.mturk.com > manage > QualificationTypes. It will give you a qualification id after generating it. (you will need it soon)
Second, in HIT creation loop, you have to use QualificationRequirement class. (I am using java code and it looks like the below-mentioned code):
QualificationRequirement[] qualReq = new QualificationRequirement[1];
qualReq[0] = new QualificationRequirement();
qualReq[0].setQualificationTypeId(qualID);
qualReq[0].setComparator(Comparator.EqualTo);
qualReq[0].setIntegerValue(100);
qualReq[0].setRequiredToPreview(false);
then in HIT creation loop, I will use this:
try {
hit = this.service.createHIT(null,
props.getTitle(),
props.getDescription(),
props.getKeywords(),
question.getQuestion(),
new Double(props.getRewardAmount()),
new Long(props.getAssignmentDuration()),
new Long(props.getAutoApprovalDelay()),
new Long(props.getLifetime()),
new Integer(props.getMaxAssignments()),
props.getAnnotation(),
qualReq,
null);
Third is assigning the qualification type to the workers that you want them to work on your HITs. It is very straightforward, I usually use mturk UI to do it. https://requester.mturk.com/ > manage tab > Workers. You should download the CSV file if you want to assign this qualification to a bunch of workers.
(Workers are who worked with you in the past)
you could notify workers by sending them an email after qualifying them
Notice: Some workers are very slow in answering your new HITs after qualifying them; so keep in mind that you should have some backup plan and time if you will not receive enough response in a certain amount of time.

Django: How can I protect against concurrent modification of database entries

If there a way to protect against concurrent modifications of the same data base entry by two or more users?
It would be acceptable to show an error message to the user performing the second commit/save operation, but data should not be silently overwritten.
I think locking the entry is not an option, as a user might use the "Back" button or simply close his browser, leaving the lock for ever.
This is how I do optimistic locking in Django:
updated = Entry.objects.filter(Q(id=e.id) && Q(version=e.version))\
.update(updated_field=new_value, version=e.version+1)
if not updated:
raise ConcurrentModificationException()
The code listed above can be implemented as a method in Custom Manager.
I am making the following assumptions:
filter().update() will result in a single database query because filter is lazy
a database query is atomic
These assumptions are enough to ensure that no one else has updated the entry before. If multiple rows are updated this way you should use transactions.
WARNING Django Doc:
Be aware that the update() method is
converted directly to an SQL
statement. It is a bulk operation for
direct updates. It doesn't run any
save() methods on your models, or emit
the pre_save or post_save signals
This question is a bit old and my answer a bit late, but after what I understand this has been fixed in Django 1.4 using:
select_for_update(nowait=True)
see the docs
Returns a queryset that will lock rows until the end of the transaction, generating a SELECT ... FOR UPDATE SQL statement on supported databases.
Usually, if another transaction has already acquired a lock on one of the selected rows, the query will block until the lock is released. If this is not the behavior you want, call select_for_update(nowait=True). This will make the call non-blocking. If a conflicting lock is already acquired by another transaction, DatabaseError will be raised when the queryset is evaluated.
Of course this will only work if the back-end support the "select for update" feature, which for example sqlite doesn't. Unfortunately: nowait=True is not supported by MySql, there you have to use: nowait=False, which will only block until the lock is released.
Actually, transactions don't help you much here ... unless you want to have transactions running over multiple HTTP requests (which you most probably don't want).
What we usually use in those cases is "Optimistic Locking". The Django ORM doesn't support that as far as I know. But there has been some discussion about adding this feature.
So you are on your own. Basically, what you should do is add a "version" field to your model and pass it to the user as a hidden field. The normal cycle for an update is :
read the data and show it to the user
user modify data
user post the data
the app saves it back in the database.
To implement optimistic locking, when you save the data, you check if the version that you got back from the user is the same as the one in the database, and then update the database and increment the version. If they are not, it means that there has been a change since the data was loaded.
You can do that with a single SQL call with something like :
UPDATE ... WHERE version = 'version_from_user';
This call will update the database only if the version is still the same.
Django 1.11 has three convenient options to handle this situation depending on your business logic requirements:
Something.objects.select_for_update() will block until the model become free
Something.objects.select_for_update(nowait=True) and catch DatabaseError if the model is currently locked for update
Something.objects.select_for_update(skip_locked=True) will not return the objects that are currently locked
In my application, which has both interactive and batch workflows on various models, I found these three options to solve most of my concurrent processing scenarios.
The "waiting" select_for_update is very convenient in sequential batch processes - I want them all to execute, but let them take their time. The nowait is used when an user wants to modify an object that is currently locked for update - I will just tell them it's being modified at this moment.
The skip_locked is useful for another type of update, when users can trigger a rescan of an object - and I don't care who triggers it, as long as it's triggered, so skip_locked allows me to silently skip the duplicated triggers.
For future reference, check out https://github.com/RobCombs/django-locking. It does locking in a way that doesn't leave everlasting locks, by a mixture of javascript unlocking when the user leaves the page, and lock timeouts (e.g. in case the user's browser crashes). The documentation is pretty complete.
You should probably use the django transaction middleware at least, even regardless of this problem.
As to your actual problem of having multiple users editing the same data... yes, use locking. OR:
Check what version a user is updating against (do this securely, so users can't simply hack the system to say they were updating the latest copy!), and only update if that version is current. Otherwise, send the user back a new page with the original version they were editing, their submitted version, and the new version(s) written by others. Ask them to merge the changes into one, completely up-to-date version. You might try to auto-merge these using a toolset like diff+patch, but you'll need to have the manual merge method working for failure cases anyway, so start with that. Also, you'll need to preserve version history, and allow admins to revert changes, in case someone unintentionally or intentionally messes up the merge. But you should probably have that anyway.
There's very likely a django app/library that does most of this for you.
Another thing to look for is the word "atomic". An atomic operation means that your database change will either happen successfully, or fail obviously. A quick search shows this question asking about atomic operations in Django.
The idea above
updated = Entry.objects.filter(Q(id=e.id) && Q(version=e.version))\
.update(updated_field=new_value, version=e.version+1)
if not updated:
raise ConcurrentModificationException()
looks great and should work fine even without serializable transactions.
The problem is how to augment the deafult .save() behavior as to not have to do manual plumbing to call the .update() method.
I looked at the Custom Manager idea.
My plan is to override the Manager _update method that is called by Model.save_base() to perform the update.
This is the current code in Django 1.3
def _update(self, values, **kwargs):
return self.get_query_set()._update(values, **kwargs)
What needs to be done IMHO is something like:
def _update(self, values, **kwargs):
#TODO Get version field value
v = self.get_version_field_value(values[0])
return self.get_query_set().filter(Q(version=v))._update(values, **kwargs)
Similar thing needs to happen on delete. However delete is a bit more difficult as Django is implementing quite some voodoo in this area through django.db.models.deletion.Collector.
It is weird that modren tool like Django lacks guidance for Optimictic Concurency Control.
I will update this post when I solve the riddle. Hopefully solution will be in a nice pythonic way that does not involve tons of coding, weird views, skipping essential pieces of Django etc.
To be safe the database needs to support transactions.
If the fields is "free-form" e.g. text etc. and you need to allow several users to be able to edit the same fields (you can't have single user ownership to the data), you could store the original data in a variable.
When the user committs, check if the input data has changed from the original data (if not, you don't need to bother the DB by rewriting old data),
if the original data compared to the current data in the db is the same you can save, if it has changed you can show the user the difference and ask the user what to do.
If the fields is numbers e.g. account balance, number of items in a store etc., you can handle it more automatically if you calculate the difference between the original value (stored when the user started filling out the form) and the new value you can start a transaction read the current value and add the difference, then end transaction. If you can't have negative values, you should abort the transaction if the result is negative, and tell the user.
I don't know django, so I can't give you teh cod3s.. ;)
From here:
How to prevent overwriting an object someone else has modified
I'm assuming that the timestamp will be held as a hidden field in the form you're trying to save the details of.
def save(self):
if(self.id):
foo = Foo.objects.get(pk=self.id)
if(foo.timestamp > self.timestamp):
raise Exception, "trying to save outdated Foo"
super(Foo, self).save()