Many of my views fetch external resources. I want to make sure that under heavy load I don't blow up the remote sites (and/or get banned).
I only have one crawler, so a central lock will work fine.
I want to allow at most 3 queries to a host per second, and have the rest block for a maximum of 15 seconds. How could I do this (easily)?
Options I've considered:
Use the Django cache
Seems to only have one-second resolution.
Use a file-based semaphore
Locks for concurrency are easy, but I'm not sure how to ensure only 3 fetches happen per second.
Use some shared-memory state
I'd rather not install more things, but will if I have to.
One approach: create a table like this:

from django.db import models

class Queries(models.Model):
    site = models.CharField(max_length=200, db_index=True)
    start_time = models.DateTimeField(null=True)
    finished = models.BooleanField(default=False)
This records when each query has either taken place, or will take place in the future if the limiting prevents it from happening immediately. start_time is the time the action is to start; this is in the future if the action is currently blocking.
Instead of thinking in terms of queries per second, let's think in terms of seconds per query; in this case, 1/3 second per query.
Whenever an action is to be performed, do the following:
Create a row for the action. q = Queries.objects.create(site=sitename)
On the object you just created (q.id), atomically set start_time to the greatest start_time for this site plus 1/3 second. If the greatest is 10 seconds in the future, then we can start our action at 10 1/3 seconds. If that time is in the past, clamp it to now().
If the start_time that was just set is in the future, sleep until that time. If it's too far in the future (e.g. over 15 seconds), delete the row and error out.
When the query is finished, set finished to True, so the row can be purged later on.
The atomic action is what's important. You can't simply do an aggregate on Queries and then save it, since it'll race. I don't know if Django can do this natively, but it's easy enough in raw SQL:
UPDATE site_queries
SET start_time = GREATEST(now(), COALESCE((
    SELECT MAX(start_time) + interval '1 second' / 3
    FROM site_queries
    WHERE site = site_name
), now()))
WHERE id = object_id
Then, reload the model and sleep if necessary. You'll also need to purge old rows. Something like Queries.objects.filter(site=site, finished=True).exclude(id=id).delete() will probably work: delete all finished queries except the one you just made. (That way, you never delete the latest query, since later queries need that to be scheduled.)
Finally, make sure the UPDATE doesn't take place inside a transaction. Autocommit must be turned on for this to work; otherwise the UPDATE won't be atomic, and two requests could UPDATE at the same time and receive the same result. Django and Python typically have autocommit off, so you need to turn it on and then back off. With Postgres (psycopg2), that's connection.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT) to turn it on, and ISOLATION_LEVEL_READ_COMMITTED to restore the default. I don't know how to do this with MySQL.
(I consider the default of having autocommit turned off in Python's DB-API to be a serious design flaw.)
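To tie the steps together, here is a rough sketch of the whole flow under a few assumptions: Postgres via psycopg2, the table name site_queries from the SQL above, and USE_TZ enabled. The function name and error handling are illustrative rather than part of the original answer, and the autocommit calls may need adjusting for your Django version.

import time
from django.db import connection
from django.utils import timezone
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT, ISOLATION_LEVEL_READ_COMMITTED

MAX_WAIT_SECONDS = 15

def schedule_fetch(site):
    # Step 1: create a row for the action.
    q = Queries.objects.create(site=site)

    # Step 2: atomically claim the next 1/3-second slot for this site.
    # Autocommit must be on here, as discussed above; the exact calls depend
    # on your Django/psycopg2 versions.
    connection.connection.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
    try:
        with connection.cursor() as cur:
            cur.execute("""
                UPDATE site_queries
                SET start_time = GREATEST(now(), COALESCE((
                    SELECT MAX(start_time) + interval '1 second' / 3
                    FROM site_queries WHERE site = %s
                ), now()))
                WHERE id = %s
            """, [site, q.id])
    finally:
        connection.connection.set_isolation_level(ISOLATION_LEVEL_READ_COMMITTED)

    # Steps 3 and 4: sleep until the claimed slot, or bail out if it's too far away.
    q.refresh_from_db()
    delay = (q.start_time - timezone.now()).total_seconds()  # assumes USE_TZ=True
    if delay > MAX_WAIT_SECONDS:
        q.delete()
        raise RuntimeError("rate limited: would have to wait more than 15 seconds")
    if delay > 0:
        time.sleep(delay)
    return q  # the caller performs the fetch, then sets q.finished = True and saves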
The benefit of this approach is that it's quite simple, with straightforward state; you don't need things like event listeners and wakeups, which have their own sets of problems.
A possible issue is that if the user cancels the request during the delay, the delay is still enforced whether or not you perform the action. If you never start the action, other requests won't move down into the unused "timeslot".
If you're not able to get autocommit to work, a workaround would be to add a UNIQUE constraint to (site, start_time). (I don't think Django understands that directly, so you'd need to add the constraint yourself.) Then, if the race happens and two requests to the same site end up at the same time, one of them will throw a constraint exception that you can catch, and you can just retry. You could also use a normal Django aggregate instead of raw SQL. Catching constraint exceptions isn't as robust, though.
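As a rough sketch of that fallback, assuming you have added the UNIQUE (site, start_time) constraint by hand and that USE_TZ is enabled (the function name and retry count are illustrative):

from datetime import timedelta
from django.db import IntegrityError
from django.db.models import Max
from django.utils import timezone

SLOT = timedelta(seconds=1) / 3  # one slot per 1/3 second

def claim_slot(site, max_retries=5):
    for _ in range(max_retries):
        latest = Queries.objects.filter(site=site).aggregate(m=Max('start_time'))['m']
        start = max(timezone.now(), latest + SLOT) if latest else timezone.now()
        try:
            # The UNIQUE (site, start_time) constraint turns the race into an exception.
            return Queries.objects.create(site=site, start_time=start)
        except IntegrityError:
            continue  # another request took this slot; recompute and retry
    raise RuntimeError("could not claim a slot for %s" % site)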
What about using a different process to handle scraping, and a queue for the communication between it and Django?
This way you would be able to easily change the number of concurrent requests, and it would also automatically keep track of the requests, without blocking the caller.
Most of all, I think it would help lower the complexity of the main application (in Django).
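For illustration, a bare-bones version of that idea using only the standard library might look like this (in practice you would more likely reach for Celery or a similar task queue; the names here are made up for the sketch):

import time
from multiprocessing import Process, Queue
from urllib.parse import urlparse

MIN_INTERVAL = 1 / 3  # at most 3 requests per host per second

def scraper(jobs):
    last_fetch = {}  # host -> time of the last request to it
    while True:
        url = jobs.get()                  # blocks until the Django side enqueues a URL
        host = urlparse(url).netloc
        wait = last_fetch.get(host, 0) + MIN_INTERVAL - time.monotonic()
        if wait > 0:
            time.sleep(wait)              # throttle per host
        last_fetch[host] = time.monotonic()
        # ... fetch the URL and store the result where the Django app can read it ...

jobs = Queue()
Process(target=scraper, args=(jobs,), daemon=True).start()
jobs.put("http://example.com/page")       # called from a view; returns immediately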
Let's say I want to implement a "Like/Unlike" system in my app. I need to count each like for sorting purposes later. Can I simply insert the current value + 1? That seems too simple.
What if two users click at the same time? How do I prevent my counter from getting corrupted?
I read that I need to use transactions via the @transaction.atomic decorator, but I wonder whether that addresses my concern.
Transactions are designed to execute a block of operations triggered by one user, whereas in my case I need to be able to handle multiple requests at the same time and safely update the counter.
Any advice?
You can use an F() expression, e.g.:

from django.db.models import F

content.likes_count = F('likes_count') + 1
content.save()

This way the operation is executed in the database, not in Python.
From the Django documentation:
Another useful benefit of F() is that having the database - rather
than Python - update a field’s value avoids a race condition.
If two Python threads execute the code in the first example above, one
thread could retrieve, increment, and save a field’s value after the
other has retrieved it from the database. The value that the second
thread saves will be based on the original value; the work of the
first thread will simply be lost.
If the database is responsible for updating the field, the process is
more robust: it will only ever update the field based on the value of
the field in the database when the save() or update() is executed,
rather than based on its value when the instance was retrieved.
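If you don't need the model instance at all, the same atomic increment can be expressed as a single UPDATE through the queryset; this is standard Django, shown here with an assumed Content model to match the example above:

from django.db.models import F

Content.objects.filter(pk=content_id).update(likes_count=F('likes_count') + 1)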
I'm not sure about the proper design of an approach here.
We use optimistic locking, with a long, incrementing version field on every entity. Each update of such an entity is executed via a compare-and-swap, which succeeds or fails depending on whether some other client updated the entity in the meantime. Classic optimistic locking, as e.g. Hibernate does it.
We also need to adopt a retry approach. We use HTTP-based storage (etcd), and it can happen that an update request simply times out.
And here is the problem: how to combine optimistic locking with retries. Here is the specific issue I'm facing.
Let's say I have an entity with version=1 and I'm trying to update it. The next version is obviously 2. My client then executes a conditional update, which succeeds only if the version in persistence is 1, in which case it is atomically updated to version=2. So far, so good.
Now, let's say the response to the update request never arrives. At this point it's impossible to say whether it succeeded or not. The only thing I can do is retry the update. The in-memory entity still contains version=1, intending to update the value to 2.
The real problem arises now. What if the second update fails because the version in persistence is 2 and not 1?
There are two possible reasons:
the first request did cause the update - the operation was successful, but the response got lost or my client timed out; it never arrived, yet the update went through
some other client performed the update concurrently in the background
Now I can't tell which is true. Did my client update the entity, or did some other client? Did the operation pass or fail?
The approach we currently use just compares the persisted entity with the entity in memory, either via Java equals or JSON content equality. If they are equal, the update is declared successful. I'm not satisfied with this algorithm, as it is neither cheap nor convincing to me.
Another possible approach is to not use a long version but a timestamp instead. Every client generates its own timestamp for the update operation, the idea being that a concurrent client would, with high probability, generate a different one. The problem for me is that it is only a probability, especially when two concurrent updates come from the same machine.
Is there any other solution?
You can fake transactions in etcd by using a two-step protocol.
Algorithm for updating:
First phase: record the update to etcd
add an "update-lock" node with a fairly small TTL. If it exists, wait until it disappears and try again.
add a watchdog to your code. You MUST abort if performing the next steps takes longer than the lock's TTL (or if you fail to refresh it).
add a "update-plan" node with [old,new] values. Its structure is up to you, but you need to ensure that the old values are copied while you hold the lock.
add a "committed-update" node. At this point you have "atomically" updated the data.
Second phase: perform the actual update
read the "planned-update" node and apply the changes it describes.
If a change fails, verify that the new value is present.
If it's not, you have a major problem. Bail out.
delete the committed-update node
delete the update-plan node
delete the update-lock node
If you want to read consistent data:
While there is no committed-update node, your data are OK.
Otherwise, wait for it to get deleted.
Whenever committed-update is present but update-lock is not, initiate recovery.
Transaction recovery, if you find an update-plan node without a lock:
Get the update-lock.
if there is no committed-update node, delete the plan and release the lock.
Otherwise, continue at "Second phase", above.
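A rough sketch of the write path (phases one and two, without the recovery path) might look like this; the EtcdClient wrapper and its create/get/delete/compare_and_set methods are hypothetical stand-ins for whichever etcd client you use, and the watchdog is reduced to a simple deadline check:

import json, time

LOCK_TTL = 5  # seconds; abort if the work takes longer than this

def two_phase_update(etcd, key, new_value):
    # First phase: take the lock and record the plan.
    while not etcd.create("update-lock", "me", ttl=LOCK_TTL):
        time.sleep(0.1)                       # lock node exists; wait and retry
    deadline = time.time() + LOCK_TTL
    old_value = etcd.get(key)                 # copy the old value while holding the lock
    etcd.create("update-plan", json.dumps({"key": key, "old": old_value, "new": new_value}))
    etcd.create("committed-update", "1")      # the update is now logically "committed"

    # Second phase: apply the plan.
    if time.time() > deadline:
        raise RuntimeError("lock TTL exceeded before the plan was applied; abort")
    plan = json.loads(etcd.get("update-plan"))
    ok = etcd.compare_and_set(plan["key"], expected=plan["old"], value=plan["new"])
    if not ok and etcd.get(plan["key"]) != plan["new"]:
        raise RuntimeError("change failed and the new value is not present; bail out")
    etcd.delete("committed-update")
    etcd.delete("update-plan")
    etcd.delete("update-lock")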
IMHO, as etcd is built on HTTP, which is inherently not a reliable protocol for this purpose, it will be very hard to have a bulletproof solution.
Classical SQL databases use connection-oriented protocols, transactions and journaling to guarantee that a transaction as a whole is either fully committed or fully rolled back, even in the worst case of a power outage in the middle of the operation.
So if two operations depend on each other (a money transfer from one bank account to another), you can make sure that either both happen or neither does, and you can simply keep a journal of "operations" with their status in the database, so that you can later check whether a particular one went through, even if you were disconnected in the middle of the commit.
But I simply cannot imagine such a solution for etcd. So unless someone else finds a better way, you are left with two options:
use a classical SQL database in the backend, using etcd (or equivalent) as a simple cache
accept the weaknesses of the protocol
BTW, I do not think that a timestamp in lieu of a long version number will strengthen the system, because under high load the probability that two client transactions use the same timestamp increases. Maybe you could try adding a unique id (a client id or just a technical UUID) to your fields, and when the version is at n+1, compare the UUID that incremented it: if it is yours, the transaction went through; if not, it did not.
But the really nasty problem arises if, by the time you can read the version, it is not at n+1 but already at n+2. If the UUID is yours, you are sure your transaction went through, but if it is not, nobody can say.
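For illustration, the UUID-tagging idea might look roughly like this; the entity layout and the store's get/compare_and_swap methods are assumptions, not part of the answer:

import uuid

def update_with_writer_tag(store, key, entity, changes):
    my_tag = str(uuid.uuid4())
    attempt = {**entity, **changes, "version": entity["version"] + 1, "writer": my_tag}
    if store.compare_and_swap(key, expected_version=entity["version"], value=attempt):
        return True                              # clean success
    current = store.get(key)
    if current["version"] == entity["version"] + 1:
        return current["writer"] == my_tag       # was the lost response actually ours?
    # The version advanced further (n+2 or more): nobody can say; surface the ambiguity.
    raise RuntimeError("cannot determine whether our update was applied")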
I have a system in which a central MSSQL database keeps a queue of jobs to be done in a table.
Because processing requirements are not that high and requests are not particularly frequent (probably one every few seconds at most), we decided to have the applications that use the queue simply query the database whenever a job is needed; there is no message queue service at this time.
A single fetch is performed by having the client application run a stored procedure, which performs the query(ies) involved and returns a job ID. The client application then fetches the job information by querying by ID and sets the job as handled.
Performance is fine; the only snag is that, because the client application has to query for the details and perform a check before the job is marked as handled, on very rare occasions (once every few thousand jobs) two clients pick up the same job.
As a way of solving this, my suggestion was to have the initial stored procedure "tag" the record it pulls with the current date and time. When querying for records, the stored procedure would only pull records whose tag is at least a certain amount of time, say 5 seconds, in the past. That way, if the stored procedure runs twice within 5 seconds, the second run will not pick up the same job.
Can anyone foresee any problems with fixing the problem this way or offer an alternative solution?
Use a UNIQUEIDENTIFIER field as your marker. When the stored procedure runs, lock the row you're reading and update the field with a NEWID(). You can mark your polling statement using something like WITH(READPAST) if you're worried about deadlocking issues.
The reason to use a GUID here is to have a unique identifier that will serve to mark a batch. Your NEWID() call is guaranteed to give you a unique value, which will be used to prevent you from accidentally picking up the same data twice. GETDATE() wouldn't work here because you could end up having two calls that resolve to the same time; BIT wouldn't work because it wouldn't uniquely mark off batches for picking up or reporting.
For example,
declare @ReadID uniqueidentifier
declare @BatchSize int = 20; -- make this a parameter of your procedure

set @ReadID = NEWID();

UPDATE tbl WITH (ROWLOCK)
SET HasBeenRead = @ReadID -- your UNIQUEIDENTIFIER field
FROM (
    SELECT TOP (@BatchSize) Id
    FROM tbl WITH (UPDLOCK, ROWLOCK, READPAST)
    WHERE HasBeenRead IS NULL
    ORDER BY [Id]
) AS t1
WHERE tbl.Id = t1.Id

SELECT Id, OtherCol, OtherCol2
FROM tbl WITH (UPDLOCK, ROWLOCK, READPAST)
WHERE HasBeenRead = @ReadID
And then you can use a polling statement like
SELECT COUNT(*) FROM tbl WITH (READPAST) WHERE HasBeenRead IS NULL
Adapted from here: https://msdn.microsoft.com/en-us/library/cc507804%28v=bts.10%29.aspx
I am working on an API, and I have a question. I was looking into the usage of select_related() in order to save myself some database queries, and indeed it does help in reducing the number of database queries performed, at the expense of bigger and more complex queries.
My question is, does using select_related() cause heavier memory usage? Running some experiments I noticed that indeed this is the case, but I'm wondering why. Regardless of whether I use select_related(), the response will contain the exact same data, so why does the use of select_related() cause more memory to be used?
Is it because of caching? Maybe separate data objects are used to cache the same model instances? I don't know what else to think.
It's a tradeoff. It takes time to send a query to the database, the database to prepare results, and then send those results back. select_related works off the principle that the most expensive part of this process is the request and response cycle, not the actual query, so it allows you to combine what would otherwise have been distinct queries into just one so there's only one request and response instead of multiple.
However, if your database server is under-powered (not enough RAM, processing power, etc.), the larger query could actually end up taking longer than the request and response cycle. If that's the case, you probably need to upgrade the server, though, rather than not use select_related.
The rule of thumb is that if you need related data, you use select_related. If it's not actually faster, then that's a sign that you need to optimize your database.
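As a concrete illustration (the Book/Author models are made up for the example), select_related pulls the related row in the same JOINed query instead of issuing one extra query per object:

# Without select_related: 1 query for the books plus 1 query per book's author (N+1).
for book in Book.objects.all():
    print(book.author.name)

# With select_related: a single JOINed query; book.author is already populated,
# at the cost of a wider result set held in memory.
for book in Book.objects.select_related('author'):
    print(book.author.name)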
UPDATE (adding more explanation)
Querying a database actually involves multiple steps:
Application generates the query (negligible)
Query is sent to the database server (milliseconds to seconds)
Database processes the query (milliseconds to seconds)
Query results are sent back to application (milliseconds to seconds)
In a well-tuned environment (sufficient server resources, fast connections) the entire process finishes in mere milliseconds. However, steps 2 and 4 still usually take more time overall than step 3. This is why it makes more sense to send fewer, more complex queries than many simpler ones: the bottleneck is usually the transport layer, not the processing.
However, a poorly optimized database on an under-powered machine, with large and complex tables, could take a very long time to run the query and become the bottleneck. That would end up negating the time saved by sending one complex query instead of multiple simpler ones, i.e. the database would have responded quicker to the simpler queries and the whole process would have taken less time overall.
Nevertheless, if this is the case, the proper response is to fix the database side: optimize the database and its configuration, add more server resources, and so on, rather than reverting to multiple simple queries.
I am currently developing an application for Azure Table Storage. In that application I have table which will have relatively few inserts (a couple of thousand/day) and the primary key of these entities will be used in another table, which will have billions of rows.
Therefore I am looking for a way to use an auto-incremented integer, instead of a GUID, as the primary key in the small table (since it will save a lot of storage, and scalability of the inserts is not really an issue).
There've been some discussions on the topic, e.g. on http://social.msdn.microsoft.com/Forums/en/windowsazure/thread/6b7d1ece-301b-44f1-85ab-eeb274349797.
However, since concurrency problems can be really hard to debug and spot, I am a bit uncomfortable implementing this on my own. My question is therefore whether there is a well-tested implementation of this.
For everyone who finds this in search: there is a better solution. The minimum lock time is 15 seconds - that's awful. Do not use it if you want to build a truly scalable solution. Use the ETag!
Create one entity in the table for the ID (you can name it ID or whatever).
1) Read it.
2) Increment.
3) InsertOrUpdate WITH ETag specified (from the read query).
If the last operation (InsertOrUpdate) succeeds, you have a new, unique, auto-incremented ID. If it fails (an exception with HttpStatusCode == 412), some other client changed it in the meantime, so repeat steps 1, 2 and 3.
The usual time for Read+InsertOrUpdate is less than 200 ms. My test utility, with source, is on GitHub.
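A rough sketch of that loop, using a hypothetical table client; the entity shape, the method names and the PreconditionFailedError exception are stand-ins for whichever Azure Table SDK you use:

def next_id(table, max_attempts=20):
    # Optimistic-concurrency counter: read, increment, write back with If-Match on the ETag.
    for _ in range(max_attempts):
        entity, etag = table.get("counters", "ID")   # (PartitionKey, RowKey) of the counter entity
        entity["value"] += 1
        try:
            # The write only succeeds if the stored ETag still matches the one we read.
            table.update(entity, if_match=etag)
            return entity["value"]
        except PreconditionFailedError:              # HTTP 412: another client won the race
            continue                                 # re-read and try again
    raise RuntimeError("could not allocate an ID after %d attempts" % max_attempts)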
See UniqueIdGenerator class by Josh Twist.
I haven't implemented this yet but am working on it ...
You could seed a queue with your next ids to use, then just pick them off the queue when you need them.
You need to keep a table containing the value of the biggest number added to the queue. If you know you won't be using a ton of integers, you could have a worker wake up every so often and make sure the queue still has integers in it. You could also have a "used int" queue that the worker checks to keep an eye on usage.
You could also hook that worker up so that if the queue happens to be empty when your code needs an id, it can interrupt the worker's nap and have it create more keys ASAP.
If that call fails, you would need a way to tell the worker you are going to do its job yourself: take the lock, do the worker's work of getting the next id, and then unlock. That is:
lock
get the last key created from the table
increment and save
unlock
then use the new value.
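For illustration, the consumer side might look roughly like this; the queue and table clients and the lock are hypothetical stand-ins for whichever Azure SDK and locking mechanism you use:

def get_id(id_queue, counter_table, lock):
    msg = id_queue.try_dequeue()              # fast path: a pre-seeded id from the queue
    if msg is not None:
        return int(msg)
    # Fallback: the queue ran dry before the worker refilled it, so do its job ourselves.
    with lock:                                # e.g. a blob lease or table-based lock
        last = counter_table.get_last_id()    # the biggest number handed out so far
        counter_table.save_last_id(last + 1)
        return last + 1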
The solution I found that prevents duplicate IDs and lets you auto-increment is to
lock (lease) a blob and let that lease act as a logical gate.
Then read the value.
Write the incremented value
Release the lease
Use the value in your app/table
If your worker role were to crash during that process, you would only end up with a missing ID in your store. IMHO that is better than duplicates.
Here is a code sample and more information on this approach from Steve Marx
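As a rough illustration of the same gate (this is not Steve Marx's sample; the blob wrapper and its methods are hypothetical):

def next_id(blob):
    lease = blob.acquire_lease(duration=15)        # the minimum lease duration is 15 seconds
    try:
        current = int(blob.read())                 # read the stored counter
        blob.write(str(current + 1), lease=lease)  # write back while holding the lease
        return current + 1                         # use this value in your app/table
    finally:
        lease.release()                            # open the gate for the next caller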
If you really need to avoid GUIDs, have you considered using something based on date/time and then leveraging partition keys to minimize the concurrency risk?
Your partition key could be the user, year, month, day, hour, etc., and the row key could be the rest of the datetime, at a small enough resolution to control concurrency.
Of course you have to ask yourself, at the price of data in Azure, whether avoiding a GUID is really worth all of this extra effort (given that a GUID will just work).
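For illustration, such a key scheme might look like this (the exact granularity and key layout are up to you):

from datetime import datetime, timezone

def make_keys(user_id, now=None):
    now = now or datetime.now(timezone.utc)
    partition_key = "%s-%s" % (user_id, now.strftime("%Y%m%d%H"))  # user plus hour bucket
    row_key = now.strftime("%M%S%f")   # minute/second/microsecond within that hour
    return partition_key, row_key

print(make_keys("alice"))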