How to survive a database outage?

How to survive a database outage? - web-services

I have a web service that is made using spring, hibernate and c3p0. I also have a service wide cache(which has the results of requests ever made to the service) which can be used to return results when the service isn't able to return(due to whatever reason). The cache might return stale results when the database is out but that's ok.
I recently faced a database outage and my service came to a crashing halt.
I want the clients of my service to survive database outages happening ever again in future.
For that, I need my service to:
Handle new incoming requests like this: quickly say that the database is down and throw some exception(fast-fail).
Requests already being processed: Don't last longer than x seconds. How do I make the thread handling the request be interrupted somehow.
Cache the whole database in memory for read-only purposes(Is this insane?).
There are some observations that I made:
If there is one or more connection(s) with status ESTABLISHED, then an attempt to checkout a new connection is not made. Seems like any one connection with status ESTABLISED is handed over to the thread receiving the request. Now, this thread just hangs till the time the database comes back up.
I would want to make this request fast-fail by knowing before handling over a connection to a thread whether db is up or not. If no, the service should throw exception instead of hanging up.
If there's no connection with status ESTABLISHED, then the request fails in 10 secs with the exception that "Could not checkout a new connection". This is due to my checkout timeout being set for 10s.
If the service was processing some request, now the db goes and then the service makes a call to db, the thread making the call to db gets stuck forever. It resumes execution only after the db comes back.
I would like to interrupt the thread after say x seconds whether or not it was able to complete the request.
Are there ways to accomplish what I seek?
Thanks in advance.

Related

Continue request django rest framework

I have a request that lasts more than 3 minutes, I want the request to be sent and immediately give the answer 200 and after the end of the work - give the result

The workflow you've described is called asynchronous task execution.
The main idea is to remove time or resource consuming parts of work from the code that handles HTTP requests and deligate it to some kind of worker. The worker might be a diffrent thread or process or even a separate service that runs on a different server.
This makes your application more responsive, as the users gets the HTTP response much quicker. Also, with this approach you can display such UI-friendly things as progress bars and status marks for the task, create retrial policies if task failes etc.
Example workflow:
user makes HTTP request initiating the task
the server creates the task, adds it to the queue and returns the HTTP response with task_id immediately
the front-end code starts ajax polling to get the results of the task passing task_id
the server handles polling HTTP requests and gets status information for this task_id. It returns the info (whether results or "still waiting") with the HTTP response
the front-end displays spinner if server returns "still waiting" or the results if they are ready
The most popular way to do this in Django is using the celery disctributed task queue.

Suppose a request comes, you will have to verify it. Then send response and use a mechanism to complete the request in the background. You will have to be clear that the request can be completed. You can use pipelining, where you put every task into pipeline, Django-Celery is an option but don't use it unless required. Find easy way to resolve the issue

Django concurrency with celery

I am using django framework and ran into some performance problems.
There is a very heavy (which costs about 2 seconds) in my views.py. And let's call it heavy().
The client uses ajax to send a request, which is routed to heavy(), and waits for a json response.
The bad thing is that, I think heavy() is not concurrent. As shown in the image below, if there are two requests routed to heavy() at the same time, one must wait for another. In another word, heavy() is serial: it cannot take another request before returning from current request. The observation is tested and proven on my local machine.
I am trying to make the functions in views.py concurrent and asynchronous. Ideally, when there are two requests coming to heavy(), heavy() should throw the job to some remote worker with a callback, and return. Then, heavy() can process another request. When the task is done, the callback can send the results back to client. The logic is demonstrated as below:
However, there is a problem: if heavy() wants to process another request, it must return; but if it returns something, the django framework will send a (fake)response to the client, and the client may not wait for another response. Moreover, the fake response doesn't contain the correct data. I have searched throught stackoverflow and find less useful tips. I wonder if anyone have tried this and knows a good way to solve this problem.
Thanks,

First make sure that 'inconcurrency' is actually caused by your heavy task. If you're using only one worker for django, you will be able to process only one request at a time, no matter what it will be. Consider having more workers for some concurrency, because it will affect also short requests.
For returning some information when task is done, you can do it in at least two ways:
sending AJAX requests periodicaly to fetch status of your task
using SSE or websocket to subscribe for actual result
Both of them will require to write some more JavaScript code for handling it. First one is really easy achievable, for second one you can use uWSGI capabilities, as described here. It can be handled asynchronously that way, independently of your django workers (django will just create connection and start task in celery, checking status and sending it to client will be handled by gevent.

To follow up on GwynBliedD's answer:
celery is commonly used to process tasks, it has very simple django integration. #GwynBlieD's first suggestion is very commonly implemented using celery and a celery result backend.
https://www.reddit.com/r/django/comments/1wx587/how_do_i_return_the_result_of_a_celery_task_to/
A common workflow Using celery is:
client hits heavy()
heavy() queues heavy() task asynchronously
heavy() returns future task ID to client (view returns very quickly because little work was actually performed)
client starts polling a status endpoint using the task ID
when task completes status returns result to client

Occasional high latency in qpid application

I'm hoping someone can help me with an issue I'm seeing with a Qpid C++ application I'm using. Essentially, we have one application publishing a status to a last_value_queue at about a 10Hz rate and a couple other applications continuously processing this status. The receivers also use the status as a kind of heartbeat and will complain if the status message isn't updated for a certain amount of time (500ms, to be exact.)
This works fine for about a day, after which we start seeing issues. Every couple hours, a single fetch call by a receiver will block for over 500ms (sometimes for up to 900ms.) This behavior will continue until we restart the broker.
I'm no expert, but I don't think I'm doing anything particularly dumb. I've been able to repeat this behavior with a pair of small applications that connect to the broker. Every 100ms the sender sends a std::chrono::time_point object set to the current time. The receiver fetches the message and calculates the delay to the millisecond. The delay is always 0ms or 1ms, except for the single spikes every hour or so after the initial day of everything being happy. The connection is created like so:
qpid::messaging::Connection c("host1:5672","{ reconnect: true}");
and the sender and receiver are both created with the string
"testQueue; { mode: browse, create: always, node: { type: queue, x-declare:{ arguments:{'qpid.last_value_queue_key':'key','qpid.replicate':'none'}}}}"
High availability replication is enabled on the broker, but I have it explicitly disabled for everything for the purpose of my testing. I see no difference in behavior when the broker and apps are running on the same host or different hosts on the LAN. Using qpid-stat, I can see that the broker replication queue is still transmitting quite a bit of data, but its message count is always at 0 so I don't think it's sending more than it can handle. Can anyone think of anything I might be missing that could cause this behavior? We're using the Qpid 0.26 and the C++ broker.

web service best practice - server timeout longer than http client timeout

I am trying to build a web service on top of hbase, so the code looks roughly like:
#GET
#Path("/blabla")
#Override
public List<String> getEvents($$$params$$$) {
......
//calling hbase query the events
......
}
When Hbase service is down, the hbase Java API keeps retrying to connect to Hbase region server util eventually it times out and throws a RT Exception:
NoServerForRegionException: Unable to find region for event,,99999999999999 after 10 tries.
The logic has no problem, my issue here is that the HttpClient times out way before hbase times out the retries. Then my web service API consumer gets no response, ugly.
Question -
What's the best practice here if you have server's timeout potentially longer than the http connection itself? How to have the web service respond to client gracefully in this case?

set the cashing for you scan object to some reasonable value. another thing, since you are using a web service to show the results to your users, i am assuming that you must be showing only a few rows(or records) at a time. you can use Hbase PageFilter so that you get only a specified no of rows each time and don't have to wait to get all the rows in one shot.

Web Services design

Company A has async pooling based webservice for notifications. Company B checks for notifications. Every time when it reads new notifications A deletes them from the system. Thus subsequent read requests return only new notifications. There is also requirement for the client B to interrupt the connection if there is no response within 30 sec.
This causes one potential problem: Due to unexpected slowness it is possible for A get the request deleted a notification and send the response back while B is already interrupted the connection. Under this scenario notification gets lost. Now one can argue that the core problem lies within operation realm (the HTTP response must be delivered withing 20 sec ) still on practice it is not always feasible.
How to design B (the client) to avoid this problem?
One way I can see is to do not delete the notifications by A and make B be aware of its state, so that it knows starting from what ID it needs to process notifications, but that presumes that ID will be sequential. Which is controlled by A. Even if B defines its own sequence A still has to be altered to return it back.
Are there any other approaches?
Thanks!

Web services in general are unreliable enough that it's rarely a good idea to make a "read" request serve double-duty as a "delete" request, especially without the client's knowledge. There is just too much risk of a connection dropping or timing out. There is no way to get around this only by modifying the client, because it's the server that is at fault here - the way it's designed is fundamentally unsuited for a web service.
I think you're on the right track with the incrementing IDs idea. The client knows (or can be modified to know) which notifications it's received, so if it can supply the ID of the last message it's received when it polls for notifications, the server should be able to respond based on that ID.

It really seems like Company A's webservice should be synchronous instead of asynchronous. If that is not possible, it may be a good idea to send a "ACK"-like response to a new Company A webservice that indicates a specific notification was received (by Company B) and can be deleted.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js