Life of redis connection/pipeline?

Life of redis connection/pipeline? - python-2.7

I am creating Redis pipeline as below in python:
rPipe = redis.Redis(...).pipeline()
Variable rPipe is defined in the __init__ of a class.
The functions in the class execute set and get commands when called by user using rpipe.
rpipe.set(...)
rpipe.execute()
But as I understand, Redis connections are closed by Redis server automatically, so how long my rPipe will be valid once I created the object?

Under normal conditions (e.g. unless you're hitting the limit on max number of clients or max buffer size, or if your client sets a specific timeout) Redis doesn't close client connections automatically.
Pipelines in Redis are a simple way to group commands together and send them to the server all at once, then receiving all the replies in a single step.
Assuming you're using the redis-py library (but the same arguments may reasonably hold for any well thought client), (only) when you call execute() on a pipeline object the commands are packed and sent to Redis. Then the state of the pipeline object is reset and it can be safely reused by the client.
As a side note, if using redis-py, consider that pipelined commands are wrapped in a MULTI/EXEC transaction by default, which is not always desirable.

Related

How to update progress bar while making a Django Rest api request?

My django rest app accepts request to scrape multiple pages for prices & compare them (which takes time ~5 seconds) then returns a list of the prices from each page as a json object.
I want to update the user with the current operation, for example if I scrape 3 pages I want to update the interface like this :
Searching 1/3
Searching 2/3
Searching 3/3
How can I do this?
I am using Angular 2 for my front end but this shouldn't make a big difference as it's a backend issue.

This isn't the only way, but this is how I do this in Django.
Things you'll need
Asynchronous worker procecess
This allows you to do work outside the context of the request-response cycle. The most common are either django-rq or Celery. I'd recommend django-rq for its simplicity, especially if all you're implementing is a progress indicator.
Caching layer (optional)
While you can use the database for persistence in this case, temporary cache key-value stores make more sense here as the progress information is ephemeral. The Memcached backend is built into Django, however I'd recommend switching to Redis as it's more fully featured, super fast, and since it's behind Django's caching abstraction, does not add complexity. (It's also a requirement for using the django-rq worker processes above)
Implementation
Overview
Basically, we're going to send a request to the server to start the async worker, and poll a different progress-indicator endpoint which gives the current status of that worker's progress until it's finished (or failed).
Server side
Refactor the function you'd like to track the progress of into an async task function (using the #job decorator in the case of django-rq)
The initial POST endpoint should first generate a random unique ID to identify the request (possibly with uuid). Then, pass the POST data along with this unique ID to the async function (in django-rq this would look something like function_name.delay(payload, unique_id)). Since this is an async call, the interpreter does not wait for the task to finish and moves on immediately. Return a HttpResponse with a JSON payload that includes the unique ID.
Back in the async function, we need to set the progress using cache. At the very top of the function, we should add a cache.set(unique_id, 0) to show that there is zero progress so far. Using your own math implementation, as the progress approaches 100% completion, change this value to be closer to 1. If for some reason the operation fails, you can set this to -1.
Create a new endpoint to be polled by the browser to check the progress. This looks for a unique_id query parameter and uses this to look up the progress with cache.get(unique_id). Return a JSON object back with the progress amount.
Client side
After sending the POST request for the action and receiving a response, that response should include the unique_id. Immediately start polling the progress endpoint at a regular interval, setting the unique_id as a query parameter. The interval could be something like 1 second using setInterval(), with logic to prevent sending a new request if there is still a pending request.
When the progress received equals to 1 (or -1 for failures), you know the process is finished and you can stop polling
That's it! It's a bit of work just to get progress indicators, but once you've done it once it's much easier to re-use the pattern in other projects.
Another way to do this which I have not explored is via Webhooks / Channels. In this way, polling is not required, and the server simply sends the messages to the client directly.

How can I make sure pushPayload doesn't happen while affected records are inFlight?

I am experimenting with turning a more traditional ember-data based app into a real-time app that uses websockets to keep multiple instances in sync.
My first attempt involves sending any updated record back to all open sessions that have accessed the record so that they all can have the latest copy. This includes the session that initiated the change. This means that after I call record.save() in the client, I get back the updated copy both from the REST API and the websocket. The client-end of the websocket simply calls store.pushPayload(data) to update the store.
This causes problems because the record might be inFlight at the time, and I get the error:
Attempted to handle event `pushedData` on [...] while in state root.deleted.inFlight.
I have several ideas:
Somehow prevent the client from receiving its own records back and only send them to other websocket connections.
Somehow synchronize access to the store so that when I call pushPayload the affected records are not in-flight.
Both of these seem rather complicated and I was hoping there's an established means of keeping multiple Ember apps up-to-date.

How we can use JDBC connection pooling with AWS Lambda?

Can we use JDBC connection pooling with AWS Lambda ? AS AWS lambda function get called on a specific event, so its life time persist even after it finishing one of its call ?

No. Technically, you could create a connection pool outside of the handler function but since you can only make use of any one single connection per invocation so all you would be doing is tying up database connections and allocating a pool of which you could only ever use 1.
After uploading your Lambda function to AWS, the first time it is invoked AWS will create a container and run the setup code (the code outside of your handler function that creates the pool- let's say N connections) before invoking the handler code.
When the next request arrives, AWS may re-use the container again (or may not. It usually does, but that's down to AWS and not under your control).
Assuming it reuses the container, your handler function will be invoked (the setup code will not be run again) and your function would use one of N the connections to your database from the pool (held at the container level). This is most likely the first connection from the pool, number 1 as it is guaranteed to not be in use, since it's impossible for two functions to run at the same time within the same container. Read on for an explanation.
If AWS does not reuse the container, it will create a new container and your code will allocate another pool of N connections. Depending on the turnover of containers, you may exhaust the database pool entirely.
If two requests arrive concurrently, AWS cannot invoke the same handler at the same time. If this were possible, you'd have a shared state problem with the variables defined at the container scope level. Instead, AWS will use two separate containers and these will both allocate a pool of N connections each, i.e. 2N connections to your database.
It's never necessary for a single invocation function to require more than one connection (unless of course you need to communicate to two independent databases within the same context).
The only time a connection pool would be useful is if it were at one level above the container scope, that is, handed down by the AWS environment itself to the container. This is not possible.
The best case you can hope for is to have a single connection per container. Even then you would have to manage this single connection to ensure the database server hasn't disconnect or rebooted. If it does, your container's connection will die and your handler will never be able to connect again (until the container dies), unless you write some code in your function to check for dropped connections. On a busy server, the container might take a long time to die.
Also keep in mind that if your handler function fails, for example half way through a transaction or having locked a table, the next request invocation will get the dirty connection state from the container. The first invocation may have opened a transaction and died. The second invocation may commit and include all the previous queries up to the failure.
I recommend not managing state outside of the handler function at all, unless you have a specific need to optimise. If you do, then use a single connection, not a pool.

Yes, the lambda is mostly persistent, so JDBC connection pooling should work. The first time a lambda function is invoked, the environment will be created and it may or may not get reused. But in practice, subsequent invocations will often reuse the same lambda process along with all program state if your triggering events occur often.
This short lambda function demonstrates this:
package test;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
public class TestLambda implements RequestHandler<String, String> {
private int invocations = 0;
public String handleRequest(String request, Context context) {
invocations++;
System.out.println("invocations = " + invocations);
return request;
}
}
Invoke this from the AWS console with any string as the test event. In the CloudWatch logs, you'll see the invocations number increment each time.

Kudos to the AWS RDS proxy, now you can used pooled MySql and postgrese connections without any extra configs in your Java or other any code specific to AWS Lambda. All you need is to create and Add a Database proxy your AWS Lambda function you want to reuse/pool connections. See how-to here.
Note: AWS RDS proxy is not included in the Free-Tier (more here).

It has caveat
There is no destroy method which ensures closing pool. One may say DB connection idle time would handle.
What if same DB being used for other use cases like pool maintain in regular machine Luke EC2.
As many say, if there is sudden spike in requests, create chaos to DB as there will be always some maximum connection setting at database side per user.

Caching in Djangos object model

I'm running a system with a few workers that's taking jobs from a message queue, all using Djangos ORM.
In one case I'm actually passing a message along from one worker to another in another queue.
It works like this:
Worker1 in queue1 creates an object (MySQL INSERT) and pushes a message to queue2
Worker2 accepts the new message in queue2 and retrieves the object (MySQL SELECT), using Djangos objects.get(pk=object_id)
This works for the first message. But in the second message worker 2 always fails on that it can't find object with id object_id (with Django exception DoesNotExist).
This works seamlessly in my local setup with Django 1.2.3 and MySQL 5.1.66, the problem occurs only in my test environment which runs Django 1.3.1 and MySQL 5.5.29.
If I restart worker2 every time before worker1 pushes a message, it works fine. This makes me believe there's some kind of caching going on.
Is there any caching involved in Django's objects.get() that differs between these versions? If that's the case, can I clear it in some way?

The issue is likely related to the use of MySQL transactions. On the sender's site, the transaction must be committed to the database before notifying the receiver of an item to read. On the receiver's side, the transaction level used for a session must be set such that the new data becomes visible in the session after the sender's commit.
By default, MySQL uses the REPEATABLE READ isolation level. This poses problems where there are more than one process reading/writing to the database. One possible solution is to set the isolation level in the Django settings.py file using a DATABASES option like the following:
'OPTIONS': {'init_command': 'SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED'},
Note however that changing the transaction isolation level may have other side effects, especially when using statement based replication.
The following links provide more useful information:
How do I force Django to ignore any caches and reload data?
Django ticket#13906

Using Django ORM in threads and avoiding "too many clients" exception by using BoundedSemaphore

I work on manage.py command which creates about 200 threads to check remote hosts. My database setup allows me to use 120 connections, so I need to use some kind of pooling. I've tried using separated thread, like this
class Pool(Thread):
def __init__(self):
Thread.__init__(self)
self.semaphore = threading.BoundedSemaphore(10)
def give(self, trackers):
self.semaphore.acquire()
data = ... some ORM (not lazy, query triggered here) ...
self.semaphore.release()
return data
I pass instance of this object to every check-thread but still getting "OperationalError: FATAL: sorry, too many clients already" inside Pool object after init-ing 120 threads .
I've expected that only 10 database connections will be opened and threads will wait for free semaphore slot. I can check that semaphore works by commenting "release()", in that case only 10 threads will work and other will wait till app termination.
As much as I understand, every thread is opening new connection to database even if actual call is inside different thread, but why? Is there any way to perform all database queries inside only one thread?

Django's ORM manages database connections in thread-local variables. So each different thread accessing the ORM will create its own connection. You can see that in the first few lines of django/db/backends/__init__.py.
If you want to limit the number of database connections made, you must limit the number of different threads that actually access the ORM. A solution could be to implement a service that delegates ORM requests to a pool of dedicated ORM threads. To transmit the requests and their results from and to other threads you will have to implement some sort of message passing mechanism. Since this is a typical producer/consumer problem, the Python docs about threading should give some hints how to achieve this.
Edit: I've just googled for "django connection pooling". There are many people who complain that Django does not provide a proper connection pool. Some of them managed to integrate a separate pooling package. For PostgreSQL, I would take a look at the pgpool middleware.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js