Race condition for Microservice architecture [CosmosDB] - concurrency

We have a microservice-based architecture in which the frontend and backend are completely isolated. The backend service, microserviceA, exposes a REST endpoint that calls a third-party service and updates a record in Cosmos DB. The microservice is deployed on a Kubernetes cluster and can therefore run as multiple replicas for load balancing. As mentioned, the frontend is isolated and consumes the exposed endpoint.
Problem:
The frontend is written so that if a response is not received within a certain time frame, or if a network failure occurs, it retries the endpoint. In some rare scenarios (the cause doesn't matter) the UI makes multiple calls (mostly 2) one after another, only milliseconds apart. This is where the race condition in the backend logic comes in.
If the first call reaches ThirdParty first and obtains a success response, the second call gets a failure (because the first one already succeeded). We cannot change the behaviour of ThirdParty.
Taking the above scenario as the base: if the second call (the failed one) updates the DB first and reaches the UI first, the UI treats it as a failure (even though the first call already succeeded) and takes failure actions.
If the successful call makes it to the UI first, everything works fine.
Possible solutions I can think of:
1) Put a cache as the source of truth.
apiCall : Status
If (entry not present in cache) {
    Put Entry in cache With Status NULL or Something with specific TTL
    (acquire lock on specific entry) {
        If (status is success) return successResponse.
        MAKE ThirdParty Call
        Update DB
        Update cache
        Release LOCK
    }
} else {
    (acquire lock on specific entry) {
        MAKE ThirdParty Call
        Update DB
        Update cache
        Release LOCK
    }
}
It seems the else block will never be executed.
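A minimal Python sketch of the cache-plus-lock idea above, under some loud assumptions: the cache and locks here are plain in-process objects, so this only serializes calls inside one replica. With several Kubernetes replicas the cache and the lock would have to live in a shared store. The names `handle_request`, `call_third_party` and `update_db` are hypothetical, and the TTL is left out.

```python
import threading

_cache = {}                      # request key -> "SUCCESS" / "FAILURE"
_locks = {}                      # request key -> lock guarding that entry
_locks_guard = threading.Lock()  # protects the _locks dict itself


def _lock_for(key):
    """Return the lock for this key, creating it on first use."""
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())


def handle_request(key, call_third_party, update_db):
    """Serialize calls per key; reuse a cached success instead of calling again."""
    with _lock_for(key):
        if _cache.get(key) == "SUCCESS":
            return "SUCCESS"     # a duplicate call sees the earlier success
        status = call_third_party(key)
        update_db(key, status)
        _cache[key] = status
        return status
```

With this shape the duplicate call waits for the first one to finish and then returns the cached success, so there is a single code path and no separate else branch.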
2) Only in case of failure, instead of updating the DB, call thread.sleep(10000) a couple of times in the hope that another thread will update the DB with a success response.
If it is still not a success, return a failure response and update the DB.
3) Put a poller on the UI side. If the result is a failure, poll a couple more times in the hope that the status changes. If it does not, take the failure actions.
4) Optimistic locking for the Cosmos record.
https://cosmosdb.github.io/labs/dotnet/labs/10-concurrency-control.html
I'm not sure how this can help.
Let's say both API calls read the record when the version was 0.
The second API call then updates the DB record; as the version has not changed,
the update succeeds.
Now the DB holds Failure as the value.
The first API call tries to update the record and finds a version mismatch,
so the update does not go through, and another attempt is made to update the DB because the call was a success.
In case of failure, no further attempts to update the DB are made.
The second API call will still reach the UI first, and the UI will again take the failure action.
The UI requires a poller in such cases.
But if the UI requires a poller, why do we need optimistic locking in the first place? :)
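The walk-through above can be sketched in Python with an in-memory stand-in for a versioned record. This is not the Cosmos DB SDK: `RecordStore`, `replace_if_version` and `write_status` are hypothetical names, and the rule that a stored SUCCESS is never downgraded is an assumption drawn from the scenario described.

```python
import threading


class VersionConflict(Exception):
    """Raised when the stored version differs from the one the caller read."""


class RecordStore:
    """In-memory stand-in for a store that supports conditional (versioned) writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._record = {"status": None, "version": 0}

    def read(self):
        with self._lock:
            return dict(self._record)

    def replace_if_version(self, status, expected_version):
        with self._lock:
            if self._record["version"] != expected_version:
                raise VersionConflict()
            self._record = {"status": status, "version": expected_version + 1}


def write_status(store, status, max_attempts=5):
    """Write a status, re-reading on conflict; a stored SUCCESS is never downgraded."""
    for _ in range(max_attempts):
        current = store.read()
        if current["status"] == "SUCCESS":
            return "SUCCESS"
        try:
            store.replace_if_version(status, current["version"])
            return status
        except VersionConflict:
            continue  # someone else wrote first: re-read and decide again
    raise RuntimeError("gave up after repeated version conflicts")
```

This keeps the stored record correct, but it does not change which HTTP response reaches the UI first, which is the concern raised above.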
I don't know much about Cosmos DB's functionality. If Cosmos provides some feature to handle this, please be kind enough to share.
What would be the best way to handle this kind of scenario?

It seems your application design makes it necessary to wait for each execution to finish before you fire the next one. I am not debating whether this is good or bad, that is a different discussion, but it seems the only option you have in this case is to fire all your DB updates synchronously.
Optimistic locking is very good for ensuring that the document you are updating has not been updated while your code did other things, but it will not help with your UI issue here.
I think you need to abstract the UI in order to make this work properly; otherwise you are stuck running things in synchronous mode.


How to update progress bar while making a Django Rest api request?

My Django REST app accepts requests to scrape multiple pages for prices and compare them (which takes ~5 seconds), then returns a list of the prices from each page as a JSON object.
I want to update the user with the current operation. For example, if I scrape 3 pages I want to update the interface like this:
Searching 1/3
Searching 2/3
Searching 3/3
How can I do this?
I am using Angular 2 for my front end but this shouldn't make a big difference as it's a backend issue.
This isn't the only way, but this is how I do this in Django.
Things you'll need
Asynchronous worker process
This allows you to do work outside the context of the request-response cycle. The most common are either django-rq or Celery. I'd recommend django-rq for its simplicity, especially if all you're implementing is a progress indicator.
Caching layer (optional)
While you can use the database for persistence in this case, temporary key-value cache stores make more sense here, as the progress information is ephemeral. The Memcached backend is built into Django; however, I'd recommend switching to Redis as it's more fully featured, super fast, and, since it's behind Django's caching abstraction, does not add complexity. (It's also a requirement for using the django-rq worker processes above.)
Implementation
Overview
Basically, we're going to send a request to the server to start the async worker, and poll a different progress-indicator endpoint which gives the current status of that worker's progress until it's finished (or failed).
Server side
Refactor the function you'd like to track the progress of into an async task function (using the @job decorator in the case of django-rq).
The initial POST endpoint should first generate a random unique ID to identify the request (possibly with uuid). Then, pass the POST data along with this unique ID to the async function (in django-rq this would look something like function_name.delay(payload, unique_id)). Since this is an async call, the interpreter does not wait for the task to finish and moves on immediately. Return an HttpResponse with a JSON payload that includes the unique ID.
Back in the async function, we need to set the progress using cache. At the very top of the function, we should add a cache.set(unique_id, 0) to show that there is zero progress so far. Using your own math implementation, as the progress approaches 100% completion, change this value to be closer to 1. If for some reason the operation fails, you can set this to -1.
Create a new endpoint to be polled by the browser to check the progress. This looks for a unique_id query parameter and uses this to look up the progress with cache.get(unique_id). Return a JSON object back with the progress amount.
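The server-side steps above can be sketched without any framework. This is only an illustration under stated assumptions: a plain dict stands in for Django's cache, a thread stands in for the django-rq worker, and `start_job`, `get_progress` and `scrape_pages` are hypothetical names for the POST endpoint, the polled endpoint and the task.

```python
import threading
import uuid

progress_cache = {}  # stand-in for cache.set / cache.get


def scrape_pages(pages, unique_id):
    """Worker task: record progress as a fraction between 0 and 1, or -1 on failure."""
    progress_cache[unique_id] = 0
    try:
        for done, page in enumerate(pages, start=1):
            _ = page.upper()  # placeholder for the real scraping work
            progress_cache[unique_id] = done / len(pages)
        progress_cache[unique_id] = 1
    except Exception:
        progress_cache[unique_id] = -1


def start_job(pages):
    """What the initial POST endpoint does: create an ID, start the worker, return the ID."""
    unique_id = str(uuid.uuid4())
    progress_cache[unique_id] = 0
    worker = threading.Thread(target=scrape_pages, args=(pages, unique_id))
    worker.start()
    return {"unique_id": unique_id}, worker  # the worker is returned only so callers can join it


def get_progress(unique_id):
    """What the polled endpoint does: look up the progress for the given ID."""
    return {"progress": progress_cache.get(unique_id)}
```

In a real deployment the dict would be replaced by the shared cache, because the web process and the worker process do not share memory.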
Client side
After sending the POST request for the action and receiving a response, that response should include the unique_id. Immediately start polling the progress endpoint at a regular interval, setting the unique_id as a query parameter. The interval could be something like 1 second using setInterval(), with logic to prevent sending a new request if there is still a pending request.
When the progress received equals 1 (or -1 for failures), you know the process is finished and you can stop polling.
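The polling loop itself is simple. In the browser it would be written with setInterval() as described; the sketch below shows the same logic in Python to keep it short and testable. `fetch_progress` is a hypothetical callable that performs the GET request and returns the progress number.

```python
import time


def poll_until_done(fetch_progress, unique_id, interval=1.0, max_polls=60):
    """Poll until progress is 1 (done) or -1 (failed); return the final value."""
    for _ in range(max_polls):
        # Each request finishes before the next one starts, so requests never overlap.
        progress = fetch_progress(unique_id)
        if progress in (1, -1):
            return progress
        time.sleep(interval)
    raise TimeoutError("progress endpoint never reported completion")
```

The `max_polls` limit is an addition of this sketch, so that a lost job does not keep the client polling forever.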
That's it! It's a bit of work just to get progress indicators, but once you've done it once it's much easier to re-use the pattern in other projects.
Another way to do this, which I have not explored, is via WebSockets / Channels. That way polling is not required, and the server simply sends the messages to the client directly.

Api Gateway Api Key immediate use upon creation giving forbidden

Application creates an api key on a per user basis, meaning the process is as follows:
Lambda function creates api key and adds to a usage plan
Api key value is returned from lambda function
Api key is then immediately used to call an Api Gateway end point
Forbidden message is returned
If I delay execution between API key creation and the HTTP request to the API Gateway endpoint (by around 5 seconds), it works as intended, but with a shorter delay I get an error.
I suspect that the api key takes a few seconds to propagate to the endpoint but I can't find an AWS API method that correctly lets me know when it has done so. Has anyone come across this problem before and how did you solve it?
The best solution I have at the moment is to retry the api call on a sliding timeout until an unreasonable amount of time has passed.
"How long should I wait after applying an AWS IAM policy before it is valid?" is not the same question but seems likely to be similar in its underlying explanation -- it's not so much a case of the API key taking time to exist but rather taking time to propagate and become visible at every possible place where it might need to exist before being valid for any subsequent request.
If those assumptions are correct, there is no mechanism for authoritatively determining whether the key is ready for use or not, because for some period of time after the key creation request succeeds, it's in a situation arguably reminiscent of Schrödinger's cat -- the key both exists and doesn't exist -- you don't know until you try it, and (unlike the cat) even a successful test does not necessarily prove that it is fully ready for use, because of the possibility (however unlikely) of a result such as fail fail fail fail pass fail pass pass pass. Such is the characteristic behavior of many large-scale, distributed systems.
From comments:
If an API call returns the api key value then I would expect it to be able to be used instantly, or at least return only when the key has been propagated fully to the end points.
That makes sense on the surface, but it becomes problematic in implementation. What if one of the endpoints is failed, offline for maintenance, or in the middle of recovering from an outage and lagging... what then? Fail the request? Delay the response waiting for something statistically unlikely to impact you?
The resource cost of observing replication tends to outweigh the benefits in many cases and can destabilize the control plane of a system if a replication issue causes a sufficient backlog, and is often not implemented except in cases where it has a high value, viz. the GetChange action in Route 53 which allows you to verify the propagation of a change through the system -- and note that even in this case, the change request itself succeeds without waiting -- if you need to verify the sync state, you have to ask separately.
A lot of AWS services take time to create. Usually there is a way to detect if the job has been completed. In this case it looks like you get a forbidden response until the key is created.
I think you will have to handle this in your client.
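A client-side retry on a sliding timeout, as mentioned in the question, might look like the following Python sketch. Everything here is illustrative: `make_request` and `is_forbidden` are hypothetical callables supplied by the caller, and the delays are arbitrary defaults.

```python
import time


def call_with_backoff(make_request, is_forbidden, first_delay=0.5,
                      factor=2.0, give_up_after=30.0, sleep=time.sleep):
    """Retry while the response is 'forbidden', waiting longer each time."""
    delay = first_delay
    waited = 0.0
    while True:
        response = make_request()
        if not is_forbidden(response):
            return response
        if waited + delay > give_up_after:
            raise TimeoutError("API key still rejected after %.1f seconds" % waited)
        sleep(delay)
        waited += delay
        delay *= factor
```

As the answer above points out, one successful call does not prove the key has propagated everywhere, so later calls made with the same key may need the same retry wrapper.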

The rate of control plane requests made by this account is too high

I'm using AWS DynamoDB and it keeps giving me the following error when I try to create a DB with https://www.npmjs.org/package/dynamodb:
The rate of control plane requests made by this account is too high
Does anyone know what the reason is?
Thanks
Could you share your code that is calling the create? And does this happen every time, or only sometimes? If you can get insight into whether the CreateTable API call is failing, or a DescribeTable API call is failing, that would be helpful too. If you can log the request ids of all of the requests you're making, and share them on this post, we (the DynamoDB folks) can see if we can get more details on our side.
This error may occur when you create, update, or delete many tables simultaneously (that is, when you call the API with many operations at once). This is easy to do in Node.js because of its non-blocking programming model. The error may also happen if you call CreateTable and then call DescribeTable at the same time or immediately afterwards (though this typically doesn't happen).
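One way to avoid issuing many control-plane calls at once is to run them one after another. The Python sketch below shows the idea in a library-agnostic way: `operations` is a list of hypothetical zero-argument callables (for example, each wrapping one table creation), and `is_throttled` is a hypothetical predicate that recognises the rate error. Retrying after a pause is an assumption of this sketch, not something the answer prescribes.

```python
import time


def run_one_at_a_time(operations, is_throttled, pause=1.0, max_retries=5, sleep=time.sleep):
    """Run control-plane operations sequentially, backing off when throttled."""
    results = []
    for operation in operations:
        for attempt in range(max_retries + 1):
            try:
                results.append(operation())
                break
            except Exception as error:
                # Re-raise anything that is not a throttling error, or when out of retries.
                if not is_throttled(error) or attempt == max_retries:
                    raise
                sleep(pause * (attempt + 1))
    return results
```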

AppFabric Syncing Local Caches

We have a very simple AppFabric setup with two clients -- let's call them Server A and Server B. Server A is also the lead cache host, and both Server A and B have a local cache enabled. We'd like to be able to make an update to an item from Server B and have that change propagate to the local cache of Server A within 30 seconds (for example).
As I understand it, there appear to be two different ways of getting changes propagated to the client:
1. Set a timeout on the client cache to evict items every X seconds. On the next request for the item, the client will get it from the host cache, since the local cache no longer has it.
2. Enable notifications and effectively subscribe to get updates from the cache host.
If my requirement is to get updates to all clients within 30 seconds, then setting a timeout of less than 30 seconds on the local cache appears to be the only choice if going with option #1 above. Given the size of the cache, it would be inefficient to evict the entire cache (99.99% of which probably hasn't changed in the last 30 seconds).
I think what we need to implement is option #2 above, but I'm not sure I understand how this works. I've read all of the msdn documentation (http://msdn.microsoft.com/en-us/library/ee808091.aspx) and have looked at some examples but it is still unclear to me whether it is really necessary to write custom code or if this is only if you want to do extra handling.
So my question is: is it necessary to add code to your existing application if you want to have updates propagated to all local caches via notifications, or is the callback feature just a bonus way of adding extra handling or code if a notification is pushed down? Can I just enable Notifications and set the appropriate polling interval at the client and things will just work?
It seems like the default behavior (when Notifications are enabled) should be to pull down fresh items automatically at each polling interval.
I ran some tests and am happy to say that you do NOT need to write any code to ensure that all clients are kept in sync. If you set the following as a child element of the cluster config:
In the client config you need to set sync="NotificationBased" on the localCache element.
The clientNotification element in the client config tells the client how often it should check for new notifications on the server. In this case, every 15 seconds the client will check for notifications and pull down any items that have changed.
I'm guessing the callback logic that you can add to your app is just in case you want to add your own special logic (like emailing the president every time an item changes in the cache).

Architecture for robust payment processing

Imagine 3 system components:
1. External ecommerce web service to process credit card transactions
2. Local Database to store processing results
3. Local UI (or win service) to perform payment processing of the customer order document
The external web service is obviously not transactional, so how can we guarantee the following:
1. Results received from the web service are eventually persisted to the database, even if the database is not accessible at that moment (network issue, DB timeout).
2. Clients are prevented from processing the customer order while a payment initiated by another client has not yet been successfully persisted to the database (and is waiting in some kind of recovery queue).
The aim is to do the processing with non-transactional system components and guarantee that the transaction won't be repeated by another process in case of failure.
(Please look at this in the context of post-sale payment processing, where multiple operators might attempt manual payment processing; not a web checkout application.)
Ask the payment processor whether they can detect duplicate transactions based on an order ID you supply. Then if you are unable to store the response due to a database failure, you can safely resubmit the request without fear of double-charging (at least one PSP I've used returned the same response/auth code in this scenario, along with a flag to say that this was a duplicate).
Alternatively, just set a flag on your order immediately before attempting payment, and don't attempt payment if the flag was already set. If an error then occurs during payment, you can investigate and fix the data at your leisure.
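The flag approach only works if setting the flag is atomic, so that two operators cannot both see it unset. A minimal Python sketch using sqlite3, assuming a hypothetical `orders` table with a `payment_started` column; `charge` is a hypothetical callable that talks to the payment gateway.

```python
import sqlite3


def try_claim_order(conn, order_id):
    """Atomically set the flag; return True only for the caller that set it."""
    cursor = conn.execute(
        "UPDATE orders SET payment_started = 1 "
        "WHERE id = ? AND payment_started = 0",
        (order_id,),
    )
    conn.commit()
    return cursor.rowcount == 1


def process_payment(conn, order_id, charge):
    """Charge only if this caller claimed the order; the flag stays set on errors."""
    if not try_claim_order(conn, order_id):
        return "skipped: payment already attempted"
    return charge(order_id)
```

Because the flag is never cleared automatically, an order whose payment errored stays blocked until someone investigates, which matches the "fix the data at your leisure" approach described above.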
I'd be reluctant to go down the route of trying to automatically cancel the order and resubmitting, as this just gets confusing (e.g. what if cancelling fails - should you retry or not?). Best to keep the logic simple so when something goes wrong you know exactly where you stand.
In any system like this, you need robust error handling and error reporting. This is doubly true when it comes to dealing with payments, where you absolutely do not want to accidentally take someone's money and not deliver the goods.
Because you're outsourcing your payment handling to a 3rd party, you're ultimately very reliant on the gateway having robust error handling and reporting systems.
In general then, you hand off control to the payment gateway and start a task that waits for a response from the gateway, which is either 'payment accepted' or 'payment declined'. When you get that response you move onto the next step in your process and everything is good.
When you don't get a response at all (time out), or the response is invalid, then how you proceed very much depends on the payment gateway:
If the gateway supports it send a 'cancel payment' style request. If the payment cancels successfully then you probably want to send the user to a 'sorry, please try again' style page.
If the gateway doesn't support canceling, or you have no communications to the gateway then you will need to manually (in person, such as telephone) contact the 3rd party to discover what went wrong and how to proceed. To aid this you need to dump as much detail as you have to error logs, such as date/time, customer id, transaction value, product ids etc.
Once you're back on your site (and payment is accepted) you're much more in control of errors, but in brief, if you can't complete the order, you should either dump the details to disk (such as a CSV file for manual handling) or contact the gateway to cancel the payment.
It's also worth having a system in place to track errors as they occur, and to consider what should happen if an excessive number occur. If it's a high-traffic site, for example, you may want to temporarily prevent further customers from placing orders whilst the issue is investigated.
Distributed messaging.
When your payment gateway returns, submit a message to a durable queue that guarantees a handler will eventually get it and process it. The handler would update the database. Should a failure occur at that point, the handler can leave the message in the queue, repost it to the queue, or post an alternate message.
Should something occur later that invalidates the transaction, another message could be queued to "undo" the change.
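The handler-and-repost pattern can be sketched in Python as follows. Note the big assumption: `queue.Queue` lives in memory and is not durable, so it only stands in for a real durable queue here. `drain`, `handler` and the `(attempts, payload)` message shape are hypothetical.

```python
import queue


def drain(messages, handler, max_attempts=3):
    """Process queued messages; re-post a message when its handler fails."""
    dead_letters = []
    while not messages.empty():
        attempts, payload = messages.get()
        try:
            handler(payload)
        except Exception:
            if attempts + 1 < max_attempts:
                messages.put((attempts + 1, payload))  # repost for a later retry
            else:
                dead_letters.append(payload)           # keep for manual follow-up
    return dead_letters
```

A handler that writes to the database and raises when the database is unavailable gets the same message again on a later pass, up to `max_attempts` times.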
There's a fair amount of buzz lately about eventual consistency and distributed messaging. NServiceBus is the new component hotness. I suggest looking into this; I know we are.