I have an issue with concurrent requests to (any) one API endpoint (ASP.NET Web API 2). When the second request starts to get processed before the first one is completed there can be concurrency issues with database entities.
Example:
1: Request R1 starts to be processed, it reads entities E1 and E2 into memory
2: Request R2 starts to be processed, it reads entities E2 and E1 into memory
3: Request R1 updates object E1
4: Request R1 creates E3, if E1 and E2 have been updated
5: Request R2 updates object E2
6: Request R2 creates E3, if E1 and E2 have been updated
The problem is that both requests read entities E1 and E2 before the other request has updated them. Both therefore see outdated, non-updated versions, and E3 is never created.
All requests are processed in a single database transaction (isolation level "read committed", Entity Framework 6 on SQL Server 2017).
In my current understanding, I don't think the issue can be solved on the data access/data store level (e.g. stricter database isolation level or optimistic locking) but needs to be higher up in the code hierarchy (request level).
My proposal is to pessimistically lock all entities that could be updated when entering an API method. When another API request starts that needs a write lock on any entity that is already locked, it has to wait (in a queue) until all previous locks are released. A request is normally processed in less than 250ms.
This could be implemented with a custom attribute that decorates the API methods: [LockEntity(typeof(ExampleEntity), exampleEntityId)]. A single API method could be decorated with zero to many of these attributes. The locking mechanism would work with async/await if possible, or otherwise just put the thread to sleep. The lock synchronization needs to work across multiple servers, so the easiest option I see is to represent the locks in the application's data store (a single, global one).
This approach is complex and has some performance disadvantages (blocks request processing, performs DB queries during lock processing) so I am open to any suggestions or other ideas.
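A minimal sketch of the proposed per-entity locking, assuming a single process and asyncio rather than ASP.NET. The `EntityLocks` class and `handle_request` function are hypothetical names; a multi-server version would have to keep the locks in a shared store instead of in memory. The one design point it illustrates is acquiring the locks in a fixed (sorted) order, so that R1 and R2 cannot deadlock when they name E1 and E2 in opposite order.

```python
import asyncio
from collections import defaultdict
from contextlib import asynccontextmanager


class EntityLocks:
    """Hypothetical in-process registry: one asyncio.Lock per (entity type, id)."""

    def __init__(self):
        self._locks = defaultdict(asyncio.Lock)

    @asynccontextmanager
    async def hold(self, *keys):
        acquired = []
        try:
            # Always lock in sorted order so two requests cannot deadlock.
            for key in sorted(set(keys)):
                await self._locks[key].acquire()
                acquired.append(self._locks[key])
            yield
        finally:
            for lock in reversed(acquired):
                lock.release()


async def handle_request(locks, state, own, keys):
    async with locks.hold(*keys):
        snapshot = dict(state)         # read E1 and E2 into memory
        await asyncio.sleep(0.01)      # simulated processing time
        snapshot[own] = state[own] = "updated"
        if snapshot["E1"] == snapshot["E2"] == "updated":
            state["E3"] = "created"


async def main():
    locks = EntityLocks()
    state = {"E1": "stale", "E2": "stale"}
    e1, e2 = ("ExampleEntity", 1), ("ExampleEntity", 2)
    await asyncio.gather(
        handle_request(locks, state, "E1", [e1, e2]),   # R1
        handle_request(locks, state, "E2", [e2, e1]),   # R2
    )
    return state


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because the second request only reads after the first has released its locks, it sees the first update and creates E3. The cost is exactly the serialization described above.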
My questions:
1. Are there any terms that describe this problem? It seems like a very common thing but it was not known to me so far.
2. Are there any best practices on how to solve this?
3. If not, does my proposal make sense?
I wouldn't hack around the statelessness of the web and REST, and I wouldn't block the requests. 250ms is a quarter of a second, which is pretty slow IMO. E.g. if 100 concurrent requests block each other, at least one of them will wait for around 25 seconds (100 times 250ms)! This slows down the application a lot.
I've seen some apps (including Atlassian's Confluence) that notify the user that the current data set has been changed in the meantime by another user. (Sometimes the other user is mentioned as well.) I would do it the same way using WebSockets, e.g. via SignalR. Other apps, like service-now, merge the remote changes into the currently open set of data.
If you don't want to use WebSockets, you could also check or try to merge on the server side and notify the user with a specific HTTP status code and a message. Tell the user what changed in the meantime, try to merge, and ask whether the merge is fine or not.
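A rough sketch of the server-side check, under the assumption that each stored object carries a version number and the client sends back the version it based its edit on. `StoredPage` and `save_page` are made-up names, and the choice of 409 Conflict as the "specific HTTP status code" is an assumption, not something the answer prescribes.

```python
from dataclasses import dataclass


@dataclass
class StoredPage:
    body: str
    version: int = 1


def save_page(page, new_body, base_version):
    """Return (HTTP status, payload) for an edit based on base_version."""
    if base_version != page.version:
        # Someone else saved in the meantime: report it instead of blocking.
        return 409, {
            "message": "The page was changed by another user.",
            "current_version": page.version,
            "current_body": page.body,
        }
    page.body = new_body
    page.version += 1
    return 200, {"version": page.version}


page = StoredPage(body="draft")
print(save_page(page, "edit by user A", base_version=1))  # accepted
print(save_page(page, "edit by user B", base_version=1))  # stale, rejected
```

The payload of the rejected save gives the client what it needs to show the other user's change or to offer a merge.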
Related
We have a microservice-based architecture. Let's say the frontend and backend are completely isolated. The backend, microserviceA, exposes a REST endpoint which basically calls a ThirdParty service and updates a record in cosmosDB. This microservice is deployed on a Kubernetes cluster and can therefore run as multiple replicas for load balancing. As mentioned before, the frontend is isolated and consumes the exposed endpoint.
Problem:
The frontend has been written in such a manner that if the response is not obtained within a certain time frame, or if a network failure occurs, it retries the endpoint. It has been observed that in some rare scenarios (it doesn't matter which) the UI makes multiple calls (mostly 2) one after another, milliseconds apart. This is where the race condition in the backend logic comes in.
If the first call reaches ThirdParty first and obtains a success response, the second call will get a failure (because the first one was already a success). We cannot change the behaviour of ThirdParty.
Taking the above scenario as a base: if the second (failed) call updates the DB first and reaches the UI first, the UI treats this as a failure (although the first call was already a success) and takes failure actions.
If the successful call makes it to the UI first, everything works fine.
Possible solutions I can think of:
1) Put a cache as source of truth.
apiCall : Status
If (entry not present in cache) {
    Put entry in cache with status NULL (or something similar) and a specific TTL
    (acquire lock on specific entry) {
        If (status is success) return successResponse
        Make ThirdParty call
        Update DB
        Update cache
        Release lock
    }
} else {
    (acquire lock on specific entry) {
        Make ThirdParty call
        Update DB
        Update cache
        Release lock
    }
}
It seems the else block will never be executed.
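One way the cache idea could be tightened is sketched below, assuming an in-memory cache inside a single process. `StatusCache`, `run_once` and the `order-42` key are hypothetical, the TTL is left out, and with several replicas the cache and its locks would have to live in a shared store. The point is that the entry is created atomically and the status is checked only after the per-entry lock is held, so a retry reuses the first result instead of calling ThirdParty again.

```python
import threading


class StatusCache:
    """Hypothetical in-memory cache: one entry (lock + status) per API call key."""

    def __init__(self):
        self._guard = threading.Lock()
        self._entries = {}

    def run_once(self, key, action):
        # Create the entry atomically so both retries share one lock.
        with self._guard:
            entry = self._entries.setdefault(
                key, {"lock": threading.Lock(), "status": None}
            )
        with entry["lock"]:
            if entry["status"] is None:        # first caller does the real work
                entry["status"] = action()     # ThirdParty call + DB update
            return entry["status"]             # retries reuse the stored status


calls = []


def third_party_call():
    # Stand-in for ThirdParty: succeeds once, fails on any repeat.
    calls.append(1)
    return "SUCCESS" if len(calls) == 1 else "FAILURE"


cache = StatusCache()
results = []
threads = [
    threading.Thread(
        target=lambda: results.append(cache.run_once("order-42", third_party_call))
    )
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results, len(calls))
```

Both callers receive the same status and ThirdParty is hit only once, so the order in which the two responses reach the UI no longer matters.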
2) Only in case of failure, instead of updating the DB immediately, call thread.sleep(10000) a couple of times in the hope that another thread will update the DB with a success response.
If there is still no success, return a failure response and update the DB.
3) Put a poller on the UI side. If the result is a failure, poll a couple more times in the hope that the status changes. If it does not, take the failure actions.
4) Optimistic locking for the Cosmos record.
https://cosmosdb.github.io/labs/dotnet/labs/10-concurrency-control.html
I am not sure how this can help.
Let's say both API calls read the record when the version was 0.
The second API call updates the DB record first; as the version has not changed, the update succeeds, and the DB now holds Failure as the value.
The first API call then tries to update the record and finds a version mismatch, so the update does not go through. Because its result was a success, another attempt is made to update the DB. (In case of a failure result, no further attempt to update the DB would be made.)
However, the second API call's response will still reach the UI first, and the UI will again take the failure action.
The UI requires a poller in such cases.
But if the UI requires a poller, why do we need optimistic locking in the first place? :)
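The walkthrough above can be replayed with a small in-memory stand-in. This is not the Cosmos DB SDK: `VersionedRecord`, `replace_if_version` and `write_result` are hypothetical, and an integer version stands in for whatever version token the store provides.

```python
class VersionedRecord:
    """In-memory stand-in for a stored document with a version counter."""

    def __init__(self):
        self.status = None
        self.version = 0

    def read(self):
        return self.status, self.version

    def replace_if_version(self, status, expected_version):
        """Write only if nobody else has written since expected_version."""
        if expected_version != self.version:
            return False
        self.status = status
        self.version += 1
        return True


def write_result(record, result, version_read):
    if record.replace_if_version(result, version_read):
        return True
    if result != "SUCCESS":
        return False                       # a failure never retries
    _, latest = record.read()              # a success re-reads and retries
    return record.replace_if_version(result, latest)


record = VersionedRecord()
_, v_first = record.read()     # first call reads version 0
_, v_second = record.read()    # second call reads version 0
write_result(record, "FAILURE", v_second)   # second call writes first
write_result(record, "SUCCESS", v_first)    # first call: mismatch, then retry
print(record.read())
```

The record ends up holding the success, which shows what optimistic locking does buy here: the stored state converges to the right value. It does nothing about which response reaches the UI first.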
I don't know much about cosmosDB's functionality. If Cosmos provides some functionality to handle this, please be kind enough to share it.
What would be the best way to handle this kind of scenario?
It seems that your application design makes it necessary to wait for each execution to finish before you fire the next one. I am not debating whether this is good or bad (that's a different discussion), but in this case it seems the only option you have is to fire all your DB updates synchronously.
Optimistic locking is very good for ensuring that the document you are updating has not been updated while your code did other things, but it will not help with your UI issue here.
I think you need to abstract the UI in order to make this work properly; otherwise you are stuck running things in synchronous mode.
Imagine 3 system components:
1. External ecommerce web service to process credit card transactions
2. Local Database to store processing results
3. Local UI (or win service) to perform payment processing of the customer order document
The external web service is obviously not transactional, so how can I:
1. guarantee that results are eventually persisted to the database once received from the web service, even if the database is not accessible at that moment (network issue, DB timeout)
2. prevent clients from processing the customer order while a payment initiated by another client has not yet been successfully persisted to the database (and is waiting in some kind of recovery queue)
The aim is to do the processing with non-transactional system components and guarantee that the transaction won't be repeated by another process in case of failure.
(Please look at this in the context of post-sale payment processing, where multiple operators might attempt manual payment processing, not a web checkout application.)
Ask the payment processor whether they can detect duplicate transactions based on an order ID you supply. Then if you are unable to store the response due to a database failure, you can safely resubmit the request without fear of double-charging (at least one PSP I've used returned the same response/auth code in this scenario, along with a flag to say that this was a duplicate).
Alternatively, just set a flag on your order immediately before attempting payment, and don't attempt payment if the flag was already set. If an error then occurs during payment, you can investigate and fix the data at your leisure.
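A sketch of the flag approach, assuming the flag lives in a SQL table and using an in-memory SQLite database purely for illustration. The `orders` table, the `payment_started` column and `try_start_payment` are made-up names. Setting the flag with a conditional UPDATE makes the check and the set a single statement, so two operators cannot both see "not set".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "payment_started INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO orders (id) VALUES (1001)")
conn.commit()


def try_start_payment(conn, order_id):
    """Set the flag and report whether this caller was the one who set it."""
    cur = conn.execute(
        "UPDATE orders SET payment_started = 1 "
        "WHERE id = ? AND payment_started = 0",
        (order_id,),
    )
    conn.commit()
    return cur.rowcount == 1


print(try_start_payment(conn, 1001))  # first operator may attempt payment
print(try_start_payment(conn, 1001))  # second operator must not
```

If the payment then fails, the flag stays set and the order waits for manual investigation, which matches the "fix the data at your leisure" approach.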
I'd be reluctant to go down the route of trying to automatically cancel the order and resubmitting, as this just gets confusing (e.g. what if cancelling fails - should you retry or not?). Best to keep the logic simple so when something goes wrong you know exactly where you stand.
In any system like this, you need robust error handling and error reporting. This is doubly true when it comes to dealing with payments, where you absolutely do not want to accidentally take someone's money and not deliver the goods.
Because you're outsourcing your payment handling to a 3rd party, you're ultimately very reliant on the gateway having robust error handling and reporting systems.
In general then, you hand off control to the payment gateway and start a task that waits for a response from the gateway, which is either 'payment accepted' or 'payment declined'. When you get that response you move onto the next step in your process and everything is good.
When you don't get a response at all (time out), or the response is invalid, then how you proceed very much depends on the payment gateway:
If the gateway supports it, send a 'cancel payment' style request. If the payment cancels successfully then you probably want to send the user to a 'sorry, please try again' style page.
If the gateway doesn't support canceling, or you have no communications to the gateway, then you will need to contact the 3rd party manually (in person, e.g. by telephone) to discover what went wrong and how to proceed. To aid this you need to dump as much detail as you have to error logs, such as date/time, customer id, transaction value, product ids etc.
Once you're back on your site (and payment is accepted) then you're much more in control of errors, but in brief, if you can't complete the order, then you should either dump the details to disk (such as a csv file for manual handling) or contact the gateway to cancel the payment.
It's also worth having a system in place to track errors as they occur, and if an excessive number occur then consider what should happen. If it's a high traffic site, for example, you may want to temporarily prevent further customers from placing orders whilst the issue is investigated.
Distributed messaging.
When your payment gateway returns, submit a message to a durable queue that guarantees a handler will eventually get it and process it. The handler would update the database. Should failure occur at that point, the handler can leave the message in the queue, repost it to the queue, or post an alternate message.
Should something occur later that invalidates the transaction, another message could be queued to "undo" the change.
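The handler side could look roughly like the sketch below. A plain `deque` stands in for the durable queue, so nothing here is actually durable, and `drain`, `store_result` and the message layout are assumptions made for illustration. It shows the repost-on-failure behaviour, with messages parked after too many attempts so they can be inspected.

```python
from collections import deque


def drain(queue, handler, max_attempts=5):
    """Process messages; re-queue on failure, park after max_attempts."""
    parked = []
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
        except ConnectionError:
            message["attempts"] += 1
            target = queue if message["attempts"] < max_attempts else parked
            target.append(message)
    return parked


database = {}
outages = [True, True]          # the database is unreachable for two attempts


def store_result(body):
    if outages:
        outages.pop()
        raise ConnectionError("database not reachable")
    database[body["order_id"]] = body["status"]


queue = deque(
    [{"body": {"order_id": 7, "status": "payment accepted"}, "attempts": 0}]
)
parked = drain(queue, store_result)
print(database, parked)
```

Because a message can be delivered more than once, the handler should be safe to run again on the same message; writing the status keyed by order id, as here, is one way to get that.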
There's a fair amount of buzz lately about eventual consistency and distributed messaging. NServiceBus is the new component hotness. I suggest looking into this, I know we are.
Are there any best practices that dictate the maximum time between an asynchronous call and its corresponding response?
Basically I have a process that takes a long time to run (e.g. 5 minutes).
Option 1
I could expose the process as an asynchronous call, in which case the user calls my service and, at some later time, I respond with a process status.
Option 2
The other way I could implement it is to set up the system such that there is a one-way operation on my web service that begins the process and immediately returns an id for the process. I could then mandate that the consumer provide a one-way operation that I can call to report back when the process is done.
The first option is easier as I don't have to mandate anything from the caller. The second seems better as I can report back at any time (5 minutes to years later).
As I have complete control over the caller and it's an internally available service, I am leaning towards option 2.
So I am wondering if there are any time limits imposed on async calls (can they span days? if not, what is the best practice?). Is option 2 a standard pattern?
References would be extremely useful.
Option #2 is better as it's more event driven.
However, there exists an Option #3. Client issues request to server. Server queues request and responds with the id. Client checks back every so often, passing the request id, to see if it's completed.
This way you don't have to depend on the client being available when the request is completed.
I'd probably mix options #2 and #3 and let the client choose if they want an event fired on their side or if they just want to check back later.
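Option #3 might be sketched as follows, with a thread pool standing in for the server-side request queue. `JobService`, `start` and `status` are hypothetical names, and a real service would persist the job table so that ids survive a restart.

```python
import threading
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobService:
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs = {}

    def start(self, fn, *args):
        """Queue the work and return an id immediately."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = self._pool.submit(fn, *args)
        return job_id

    def status(self, job_id):
        """What the client gets when it checks back with its id."""
        future = self._jobs[job_id]
        if not future.done():
            return {"state": "running"}
        if future.exception() is not None:
            return {"state": "failed", "error": str(future.exception())}
        return {"state": "done", "result": future.result()}


gate = threading.Event()        # lets the demo control when the job finishes


def long_process(x):
    gate.wait()
    return x * 2


service = JobService()
job_id = service.start(long_process, 21)
print(service.status(job_id))   # the job has not finished yet
gate.set()
while service.status(job_id)["state"] == "running":
    time.sleep(0.01)            # client checks back every so often
print(service.status(job_id))
```

No connection stays open between the two calls, so the time between request and result is limited only by how long the server keeps the job record.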
UPDATE
Rajah has asked about the maximum time between an async request and its response. For a web application, this is typically measured in seconds. Most servers have timeout values that typically default to around 30 seconds. Personally, I think this is too long.
Consider that an Async call requires the communications channel between the client and server to be open for the duration. How many of those channels can a single server handle? More to the point, how many channels will you have to maintain as requests are made? This can become quite outrageous even if you do control both ends.
Whatever is hosting your services is going to determine the maximum amount of time to keep a request open. Again, every server I've seen measures this in seconds.
Company A has an async, polling-based web service for notifications. Company B checks for notifications. Every time B reads new notifications, A deletes them from the system, so subsequent read requests return only new notifications. There is also a requirement for the client B to drop the connection if there is no response within 30 sec.
This causes one potential problem: due to unexpected slowness, it is possible for A to receive the request, delete a notification and send the response back when B has already dropped the connection. In this scenario the notification is lost. One can argue that the core problem lies in the operational realm (the HTTP response must be delivered within 20 sec), but in practice that is not always feasible.
How to design B (the client) to avoid this problem?
One way I can see is for A not to delete the notifications and for B to keep track of its own state, so that it knows from which ID onward it needs to process notifications. But that presumes the IDs are sequential, which is controlled by A. Even if B defines its own sequence, A still has to be altered to return it.
Are there any other approaches?
Thanks!
Web services in general are unreliable enough that it's rarely a good idea to make a "read" request serve double-duty as a "delete" request, especially without the client's knowledge. There is just too much risk of a connection dropping or timing out. There is no way to get around this only by modifying the client, because it's the server that is at fault here - the way it's designed is fundamentally unsuited for a web service.
I think you're on the right track with the incrementing IDs idea. The client knows (or can be modified to know) which notifications it's received, so if it can supply the ID of the last message it's received when it polls for notifications, the server should be able to respond based on that ID.
It really seems like Company A's webservice should be synchronous instead of asynchronous. If that is not possible, it may be a good idea to send an "ACK"-like response to a new Company A webservice that indicates a specific notification was received (by Company B) and can be deleted.
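The incrementing-ID idea and the ACK idea can be combined in one call, as in this sketch. It assumes A hands out strictly increasing ids; `NotificationServer` and `fetch_after` are hypothetical names. The id that B sends with each poll doubles as the acknowledgement, so A only deletes what B has confirmed.

```python
class NotificationServer:
    def __init__(self):
        self._pending = []      # list of (id, text), ids strictly increasing
        self._next_id = 1

    def publish(self, text):
        self._pending.append((self._next_id, text))
        self._next_id += 1

    def fetch_after(self, last_seen_id):
        # The id sent by the client acts as the ACK: only now is it safe
        # to delete everything up to and including last_seen_id.
        self._pending = [n for n in self._pending if n[0] > last_seen_id]
        return list(self._pending)


server = NotificationServer()
server.publish("invoice ready")
server.publish("shipment delayed")

last_seen = 0
server.fetch_after(last_seen)            # response lost: B timed out
batch = server.fetch_after(last_seen)    # retry returns the same notifications
last_seen = batch[-1][0]                 # B records progress after processing
server.publish("refund issued")
print(batch, server.fetch_after(last_seen))
```

A lost response costs only a repeated delivery, so B should be prepared to see the same notification twice.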
In a web service that I am working on, a user's data needs to be updated in the background - for example pulling down and storing their tweets. As there may be multiple servers performing these updates, I want to ensure that only one can update any single user's data at one time. Therefore, (I believe) I need a method of doing an atomic read (is the user already being updated) and write (no? Then I am going to start updating). What I need to avoid is this:
1. Server 1 sends request to see if user is being updated.
2. Server 2 sends request to see if user is being updated.
3. Server 1 receives response back saying the user is not being updated.
4. Server 2 receives response back saying the user is not being updated.
5. Server 1 starts downloading tweets.
6. Server 2 starts downloading the same set of tweets.
Madness!!!
Steps 1 and 3 need to be combined into an atomic read+write operation so that Step 2 would have to wait until Step 3 had completed before a response was given. Is there a simple mechanism for effectively providing a "lock" around access to something across multiple servers, similar to the synchronized keyword in Java (but obviously distributed across all servers)?
Take a look at Dekker's algorithm, it might give you an idea.
http://en.wikipedia.org/wiki/Dekker%27s_algorithm
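Dekker's algorithm assumes memory shared between the competing processes, so across servers the atomic check-and-set is usually delegated to a store that all servers can reach. Below is a minimal sketch of that alternative, assuming a shared SQL database and using in-memory SQLite only for illustration. The `update_locks` table, `try_claim` and `release` are made-up names, and a real version would also need to expire claims left behind by a crashed server.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE update_locks ("
    "user_id INTEGER PRIMARY KEY, "
    "owner TEXT NOT NULL)"
)


def try_claim(db, user_id, owner):
    """Atomically check and claim: the primary key admits one row per user."""
    try:
        db.execute(
            "INSERT INTO update_locks (user_id, owner) VALUES (?, ?)",
            (user_id, owner),
        )
        db.commit()
        return True
    except sqlite3.IntegrityError:
        return False


def release(db, user_id, owner):
    db.execute(
        "DELETE FROM update_locks WHERE user_id = ? AND owner = ?",
        (user_id, owner),
    )
    db.commit()


print(try_claim(db, 42, "server-1"))   # server 1 wins and downloads tweets
print(try_claim(db, 42, "server-2"))   # server 2 is refused
release(db, 42, "server-1")
print(try_claim(db, 42, "server-2"))   # free again after release
```

The read ("is the user being updated?") and the write ("I am updating now") happen in one INSERT, so the interleaving in the question cannot occur.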