I am working on doing some simple analytics on a Django website (v1.4.1). Since this data will be gathered on pretty much every server request, I figured the right way to do this would be with a piece of custom middleware.
One important metric for the site is how often given images are accessed. Since each image is its own object, I thought about using django-hitcount, but figured that was unnecessary for what I was trying to do. If it proves easier, I may use it though.
The current conundrum I face is that I don't want to query the database and look for a given object on every HttpRequest that occurs. Instead, I would like to wait until a successful response (indicated by an HttpResponse.status_code of 200 or similar), and then query the database and update a hit field for the corresponding image. The catch is that the only way to access the path of the image is in process_request, while the only way to access the status code is in process_response.
So, what do I do? Is it as simple as creating a class variable that can hold the path and then looking up the object once a 200 response is returned, or should I just use django-hitcount?
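For reference, a minimal sketch of the kind of middleware described, assuming a hypothetical Image model with url and hits fields and an /images/ URL prefix (note that process_response also receives the request, so the path is available there without stashing it anywhere):

    from django.db.models import F
    from myapp.models import Image  # hypothetical app and model

    class ImageHitMiddleware(object):
        def process_response(self, request, response):
            # Count only successful responses for image URLs.
            if response.status_code == 200 and request.path.startswith('/images/'):
                # F() keeps the increment atomic in the database.
                Image.objects.filter(url=request.path).update(hits=F('hits') + 1)
            return response

The class would then be listed in MIDDLEWARE_CLASSES like any other middleware.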
Thanks for your help
Set up a cron task to parse your Apache/Nginx/whatever access logs on a regular basis, perhaps with something like pylogsparser.
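A rough sketch of that idea without pylogsparser, assuming a combined-format access log at a made-up path and an /images/ URL prefix:

    import re
    from collections import Counter

    # Matches the request and status fields of a common/combined log format line.
    LINE_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

    hits = Counter()
    with open('/var/log/nginx/access.log') as log:   # assumed log location
        for line in log:
            match = LINE_RE.search(line)
            if match and match.group('status') == '200' and match.group('path').startswith('/images/'):
                hits[match.group('path')] += 1

    for path, count in hits.most_common():
        print(count, path)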
You could use memcache to store the counters and then periodically persist them to the database. There is a risk that memcache will evict the value before it has been persisted, but this may be acceptable to you.
This article provides more information and highlights a risk arising when using hosted memcache with keys distributed over multiple servers. http://bjk5.com/post/36567537399/dangers-of-using-memcache-counters-for-a-b-tests
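One possible shape of this, using Django's cache framework backed by memcached; the model, key format, and flush job below are assumptions rather than anything from the linked article:

    from django.core.cache import cache
    from django.db.models import F
    from myapp.models import Image  # hypothetical model

    def record_hit(image_id):
        key = 'image-hits-%d' % image_id
        cache.add(key, 0)        # no-op if the key already exists
        cache.incr(key)          # atomic increment in memcached

    def flush_hits(image_ids):
        """Periodically persist and reset the counters, e.g. from a cron job."""
        for image_id in image_ids:
            key = 'image-hits-%d' % image_id
            count = cache.get(key, 0)
            if count:
                Image.objects.filter(pk=image_id).update(hits=F('hits') + count)
                cache.decr(key, count)   # keeps any hits recorded since the get()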
I have a FastAPI project on Cloud Run, and it has some background jobs inside it (nothing heavy).
However, when a new instance is created by Cloud Run due to the number of requests etc., every instance runs the background job concurrently.
For example:
I have a task that creates invoices for customers in the background, and if three instances are created at the same time, three invoices will be created.
I researched "FOR UPDATE" usage in PostgreSQL etc. It seems like I could solve this by modifying my database, but I wonder if it can be solved on the Cloud side.
I don't want to limit the max. number of instances to 1.
What would you do in this situation?
Thank you for your time.
If you can potentially have N instances of a job (because you don't want to set the max limit to 1), you need to implement your jobs in an idempotent way. Broadly speaking, you have a few ways to achieve idempotency:
by enforcing a business constraint.
by storing an idempotency key.
by using the Etag HTTP response header.
For example, Stripe lets you define an idempotency key for all of your API requests. Stripe stores this key on its servers, and when you make a POST request with the same payload as a previous one, Stripe returns you the same result. POST requests are not idempotent, but using this "trick" they become idempotent.
Stripe's idempotency works by saving the resulting status code and body of the first request made for any given idempotency key, regardless of whether it succeeded or failed. Subsequent requests with the same key return the same result, including 500 errors.
https://stripe.com/docs/api/idempotent_requests
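As a concrete illustration of the first two bullets above applied to the invoice example (a sketch of mine, not Stripe's or Google's code), a UNIQUE constraint can double as the idempotency key, so that only one of N concurrent instances actually creates the row. Table and column names here are hypothetical:

    import psycopg2

    def create_invoice_once(conn, customer_id, billing_period):
        """Insert the invoice only if no other instance has done so already.

        (customer_id, billing_period) acts as the idempotency key and is backed
        by a UNIQUE constraint, so concurrent instances cannot all insert it.
        """
        with conn, conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO invoices (customer_id, billing_period)
                VALUES (%s, %s)
                ON CONFLICT (customer_id, billing_period) DO NOTHING
                """,
                (customer_id, billing_period),
            )
            # rowcount is 1 only for the single instance that won the insert.
            return cur.rowcount == 1

    # Usage: every Cloud Run instance may run the job; only one insert wins.
    # conn = psycopg2.connect("dbname=billing")
    # if create_invoice_once(conn, 42, "2023-01"):
    #     send_invoice(42)  # hypothetical follow-up step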
Tip: you could expand your question by clarifying how these background tasks are created, and where they run.
Could not find a definitive answer here.
According to the Amazon S3 docs, the caveat for read-after-write consistency is: if I GET a key and receive a 404, then PUT a new object at that key, a subsequent GET may still return 404.
My question is: after one successful read (GET), will subsequent reads be successful too?
Example:
GET key 404
PUT key 200
GET key 404 # because caveat
GET key 200
From then on, is every subsequent GET of the key guaranteed to succeed?
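For reference, the same sequence expressed with boto3 (the bucket and key names are made up):

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    bucket, key = "example-bucket", "images/photo.jpg"   # hypothetical names

    def get_status(bucket, key):
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return 200
        except ClientError as err:
            return err.response["ResponseMetadata"]["HTTPStatusCode"]   # e.g. 404

    print(get_status(bucket, key))                        # 404, object not there yet
    s3.put_object(Bucket=bucket, Key=key, Body=b"...")    # 200, object created
    print(get_status(bucket, key))                        # the caveat: may still be 404
    print(get_status(bucket, key))                        # 200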
The caveat AWS describes in the S3 documentation suggests that they use a caching layer on top of the database they use to store details of S3 objects, like the key and metadata.
If you do a PUT for an object as the first operation and a GET afterwards, there will be a cache miss for the GET operation, so the caching layer will fetch information about this object from the database.
If you do a GET before the PUT, the caching layer will query the database, receive the information that this object doesn't exist, and cache that information, even though the PUT creates the object shortly afterwards. So the GET after the PUT will receive the information that the object doesn't exist from the cache.
That's probably why this caveat exists. Unfortunately that doesn't answer your question, because we don't know how that caching layer works. If this layer uses shared state, then you should receive a 200 response for all requests once you have received one 200 response. My guess is that they don't use shared state for the caching layer, as that's easier to scale. Without shared state, whether you receive a 200 or a 404 even after the first successful 200 depends on your luck, the time-to-live for items in the cache, and whether they employ some kind of cache invalidation for updated objects.
Because the details of the inner workings of S3 are unknown, I wouldn't rely on subsequent calls succeeding, but my guess is that the probability of receiving a 404 after a successful 200 is rather low. In the end you have to decide, based on your use case, if and how it makes sense to account for this situation.
Updated answer
Snippets from the official AWS blog
S3 is Now Strongly Consistent
After that overly-long introduction, I am ready to share some good news! Effective immediately, all S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent. What you write is what you will read, and the results of a LIST will be an accurate reflection of what’s in the bucket. This applies to all existing and new S3 objects, works in all regions, and is available to you at no extra charge! There’s no impact on performance, you can update an object hundreds of times per second if you’d like, and there are no global dependencies.
We have a service which inserts certain values into DynamoDB. For the sake of this question, let's say it is a key:value pair, i.e. customer_id:customer_email. The inserts don't happen that frequently, and once an insert is done, that specific key doesn't get updated.
What we have done is create a client library which, provided with a customer_id, will fetch the customer_email from DynamoDB.
Given that the customer_id data is static, we were thinking of adding a cache in front of the table, but one thing we are not sure about is what will happen in the following use case:
client_1 uses our library to fetch customer_email for customer_id = 2.
The customer doesn't exist, so API Gateway returns Not Found
API Gateway will cache this response
For any subsequent calls, this cached response will be sent
Now another system inserts customer_id = 2 with its email. This system doesn't know whether that response has been cached previously, or even that any other system has fetched this specific data. How can we invalidate the cache for this specific customer_id when it gets inserted into DynamoDB?
You can send a request to the API endpoint with a Cache-Control: max-age=0 header, which will cause API Gateway to invalidate the cached entry for that request and fetch a fresh response.
This could open your application up to attack as a bad actor can simply flood an expensive endpoint with lots of traffic and buckle your servers/database. In order to safeguard against that it's best to use a signed request.
In case it's useful to people, here's .NET code to create the signed request:
https://gist.github.com/secretorange/905b4811300d7c96c71fa9c6d115ee24
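For readers using Python instead of .NET, a rough equivalent using botocore's SigV4 signing might look like the sketch below; the endpoint URL and region are placeholders, and the caller's IAM identity needs permission to invalidate the cache:

    import requests
    import botocore.session
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest

    url = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/customers/2"  # placeholder
    region = "us-east-1"

    credentials = botocore.session.Session().get_credentials().get_frozen_credentials()

    # Build and sign a request that asks API Gateway to refresh its cache entry.
    aws_request = AWSRequest(method="GET", url=url, headers={"Cache-Control": "max-age=0"})
    SigV4Auth(credentials, "execute-api", region).add_auth(aws_request)

    response = requests.get(url, headers=dict(aws_request.headers))
    print(response.status_code, response.text)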
We've built a Lambda which takes care of re-filling the cache with updated results. It's quite a manual process, with very little reusable code, but it works.
The Lambda is triggered by the application itself according to its needs. For example, in CRUD operations the Lambda is triggered upon successful execution of POST, PATCH and DELETE on a specific resource, in order to clear the general GET request (i.e. clear GET /books whenever POST /book succeeds).
Unfortunately, if you have a view with a server-side paginated table, you are going to face all sorts of issues, because invalidating /books is not enough: you may actually have /books?page=2, /books?page=3 and so on... a nightmare!
I believe APIG should allow for more granular control of cache entries, otherwise many use cases aren't covered. It would be enough if they allowed you to choose a root cache group for each request, so that we could manage cache entries by group rather than by single request (which, imho, is also less common).
Did you look at this https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-caching.html ?
There is a way to invalidate the entire cache or a particular cache entry.
I send a POST request to create an object. The object is created successfully on the server, but I cannot receive the response (it is dropped somewhere), so I try to send the POST request again (and again). The result is that there are many duplicated objects on the server side.
What is the official way to handle this issue? I think it is a very common problem, but I don't know its exact name, so I cannot google it. Thanks.
In REST terminology, which is how interfaces where POST is used to create an object (and PUT to modify, DELETE to delete and GET to retrieve) are usually described, POST is neither 'safe' nor 'idempotent': repeating a POST creates another object, whereas repeating any of the other types of request has no further effect on the collection of objects.
I doubt there is an "official" way to deal with this, but there are probably some design patterns to deal with it. For example, these two alternatives may solve this problem in certain scenarios:
Objects have uniqueness constraints. For example, a record that stores a unique username cannot be duplicated, since the database will reject it.
Issue a one-time-use token to each client before it makes the POST request, usually when the client loads the page with the input form. The first POST creates an object and marks the token as used. The second POST will see that the token has already been used, and you can answer with a "Yes, yes, ok, ok!" error or success message (a minimal sketch of this pattern follows below).
Useful link where you can read more about REST.
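Here is that minimal sketch of the one-time token idea, using sqlite3 purely for illustration; the table and function names are made up:

    import sqlite3
    import uuid

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tokens (token TEXT PRIMARY KEY, used INTEGER DEFAULT 0)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, token TEXT UNIQUE)")

    def issue_token():
        """Called when the client loads the form page."""
        token = str(uuid.uuid4())
        conn.execute("INSERT INTO tokens (token) VALUES (?)", (token,))
        return token

    def handle_post(token):
        """Create the object only if this token has not been used yet."""
        cur = conn.execute("UPDATE tokens SET used = 1 WHERE token = ? AND used = 0", (token,))
        if cur.rowcount == 0:
            return "already processed"   # a retry of the same POST
        conn.execute("INSERT INTO orders (token) VALUES (?)", (token,))
        return "created"

    token = issue_token()
    print(handle_post(token))   # created
    print(handle_post(token))   # already processed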
It is unreliable to fix these issues on the client only.
In my experience, RESTful services with lots of traffic are bound to receive duplicate incoming POST requests unintentionally - e.g. sometimes a user will click 'Signup' and two requests will be sent simultaneously; this should be expected and handled by your backend service.
When this happens, two identical users can be created even if you check for uniqueness on the User model, because model-level uniqueness checks are performed by the application as a separate query before the insert, which is racy under concurrent requests.
Solution: these cases should be handled in the backend using database-level unique constraints, e.g. SQL Server unique indexes.
I am trying to load my data using a separate query to the server after the records become dirty in the store. The updated values are sent to the server and the relevant actions are performed using a custom AJAX call, handled on the server side to update all the related records. But when the data is loaded again I get the above-mentioned error.
The possible reason could be that, since the records are dirty in the store and I am trying to load the data again without committing the store, it is giving me the error. So I tried Application.defaultTransaction.rollback(). It removes those records from the updated bucket, but the "key" in the updated bucket (the object type) still exists and I still get the error. Can anyone help me with this?
In short: is there a way to force-clean the store, or to move all the objects in the created/updated/inflight buckets to the clean bucket?
Application.store.get('defaultTransaction').rollback() will remove any dirty objects in the store and take it to the initial state.
There is also an open pull request for store.rollback(), which might be an alternative once merged to master.
https://github.com/emberjs/data/pull/350#issuecomment-9578563