ZODB with lightweight web server - web-services

For academic purposes, I need to develop a demonstration application that uses ZODB together with some lightweight web server, say CherryPy. ZODB is used in a read-only manner. Clients query the server-side ZODB. The server is supposed to "have multiple processes accessing a ZODB concurrently".
Zope documentation says "ZEO is the only way to have multiple processes accessing a ZODB concurrently." But how should I understand and implement this ZEO thing?
For the sake of clarity I wrote some code, just a simple TCP server:
import SocketServer
import ZODB
from ZEO import ClientStorage
from ZODB.DB import DB

addr = ('localhost', 9999)
storage = ClientStorage.ClientStorage(addr)
zdb = DB(storage)

# or the following method:
# import ZODB.config
# zdb = ZODB.config.databaseFromURL('zeo.conf')

class MyTCPHandler(SocketServer.BaseRequestHandler):
    def ZODBThing(self, rec_id):
        con = zdb.open()
        records = con.root()['records']
        buff = records[int(rec_id)]
        con.close()
        return buff

    def handle(self):
        self.data = self.request.recv(1024).strip()
        self.request.sendall(str(self.ZODBThing(self.data)))

if __name__ == "__main__":
    HOST, PORT = "localhost", 9998
    server = SocketServer.TCPServer((HOST, PORT), MyTCPHandler)
    server.serve_forever()
    zdb.close()
    storage.close()
The above code represents what I understand about the usage of ZEO. I need to know whether I understand right or wrong.
What are the purposes and uses of zeoctl.exe?

You do not need ZEO for a single process that opens multiple connections and handles concurrency with one connection per thread. With FileStorage (instead of ClientStorage) you can run your example code above as a single process, since you have only one ZODB.DB.DB object per application.
However, once you get more than trivial traffic, Python's threading model typically does not scale well with many threads in one process (because of the Global Interpreter Lock). You usually want several processes on a single host to take advantage of multiple CPU cores, and at that point you need something like ZEO or RelStorage as a back-end so that multiple processes can share the same database.
Note: at any scale, you also likely want some form of connection pooling (reuse connections, with only one in use at a time) rather than creating a new connection on each request.
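As a minimal sketch of the single-process variant described above, assuming FileStorage and a threaded TCP server, with one connection per request checked out of ZODB.DB's built-in pool (the Data.fs path and the 'records' key are carried over from your example and are assumptions):
import SocketServer

from ZODB.DB import DB
from ZODB.FileStorage import FileStorage

# One DB object per process; it keeps an internal connection pool, so
# db.open() / connection.close() reuses connections instead of creating
# a new one for every request.
storage = FileStorage('Data.fs', read_only=True)  # assumed data file
db = DB(storage)

class MyTCPHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        rec_id = self.request.recv(1024).strip()
        connection = db.open()              # checked out of the pool
        try:
            records = connection.root()['records']
            self.request.sendall(str(records[int(rec_id)]))
        finally:
            connection.close()              # returned to the pool, not discarded

if __name__ == "__main__":
    server = SocketServer.ThreadingTCPServer(("localhost", 9998), MyTCPHandler)
    server.serve_forever()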

Related

Reinitialize flask extensions

I am using Flask. I have a few routes defined that are expensive, as they need to access a database and do lengthy computations. Database connectivity relies on the Flask-MongoEngine extension, which relies on PyMongo, which is not thread-safe.
Hence my thoughts are as follows:
from multiprocessing import Pool  # assuming multiprocessing's Pool

@blueprint.route("/refresh/data", methods=['GET'])
def refresh_data():
    cache.clear()
    with Pool(4) as p:
        print(p.map(func=f, iterable=["recently", "mtd", "ytd", "sector"]))
Create a small pool and call the function f. The function f is based on:
def f(name):
    print(current_app.extensions)
    print(current_app.config)
    current_app.extensions["mongoengine"] = MongoEngine(app=current_app)
    print(current_app.extensions)
    get(address="reports/{path}/json".format(path=name))
    get(address="reports/{path}/html".format(path=name))
    return name
The problem here is that one cannot init_app the MongoEngine extension again. In fact, extensions can be initialized only once, but what happens if the extension is needed on multiple threads and is not thread-safe?
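As an illustrative sketch of one way around this constraint (bypassing the Flask extension in the workers entirely and talking to PyMongo directly, since PyMongo warns against sharing a client across a fork): give each worker process its own client via a Pool initializer instead of re-initializing the extension. The host, port and database name below are placeholders:
from multiprocessing import Pool

from pymongo import MongoClient

worker_db = None  # set per worker process in init_worker()

def init_worker():
    # Each forked worker creates its own MongoClient; clients must not be
    # shared across the fork boundary.
    global worker_db
    worker_db = MongoClient('localhost', 27017)['reports']  # assumed host/port/db

def f(name):
    # ... run the expensive per-report work against worker_db here ...
    return name

if __name__ == '__main__':
    with Pool(processes=4, initializer=init_worker) as p:
        print(p.map(f, ["recently", "mtd", "ytd", "sector"]))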

multithreaded SimpleXMLRPCServer

Can someone give a small example of a multi-threaded SimpleXMLRPCServer? I tried googling around, but none of the things I found were what I needed. Most tell you to use some other library. I have a simple SimpleXMLRPCServer set up, but I don't know where to add the multi-threading.
dumpServer just has a pile of functions that I want to call in an RPC manner. But now I need to add multi-threading to it.
dump = dumpServer()
server_sock = ('127.0.0.1', 7777)

# Create XML_server
server = SimpleXMLRPCServer(server_sock,
                            requestHandler=DumpServerRequestHandler,
                            allow_none=True)
server.register_introspection_functions()

# Register all functions of the Mboard instance
server.register_instance(dump)
How can I make it handle multiple clients simultaneously?
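One common pattern, shown here only as a sketch with dumpServer and DumpServerRequestHandler stubbed out, is to mix SocketServer.ThreadingMixIn into SimpleXMLRPCServer so that each incoming request is handled in its own thread:
import SocketServer
from SimpleXMLRPCServer import SimpleXMLRPCServer, SimpleXMLRPCRequestHandler

class DumpServerRequestHandler(SimpleXMLRPCRequestHandler):
    rpc_paths = ('/RPC2',)

class ThreadedXMLRPCServer(SocketServer.ThreadingMixIn, SimpleXMLRPCServer):
    """Handle each XML-RPC request in its own thread."""
    daemon_threads = True  # worker threads won't block interpreter exit

class dumpServer(object):
    # Stand-in for the real class: every public method becomes callable via RPC.
    def ping(self):
        return 'pong'

if __name__ == '__main__':
    server = ThreadedXMLRPCServer(('127.0.0.1', 7777),
                                  requestHandler=DumpServerRequestHandler,
                                  allow_none=True)
    server.register_introspection_functions()
    server.register_instance(dumpServer())
    server.serve_forever()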

PyMongo find query returns empty/partial cursor when running in a Django+uWsgi project

We developed a REST API using Django and MongoDB (the PyMongo driver). The problem is that, on some requests to the API endpoints, the PyMongo cursor returns a partial response which contains fewer documents than it should (but it's a completely valid JSON document).
Let me explain it with an example of one of our views:
def get_data(key):
    return collection.find({'key': key}, limit=24)

def my_view(request):
    key = request.POST.get('key')
    query = get_data(key)
    res = [app for app in query]
    return JsonResponse({'list': res})
We're sure that there are more than 8000 documents matching the query, but in some calls we get fewer than 24 results (even zero). The first problem we investigated was that we had more than one MongoClient definition in our code. Resolving this reduced how often the problem occurred, but we still saw it in a lot of calls.
After all of these investigations, we designed a test in which we made 16 asynchronous requests to the server at the same time. With this approach we could reproduce the problem: in each batch of 16 requests, 6-8 of them had partial results. After running this test we reduced uWSGI's number of processes to 6 and restarted the server. All results were good, but after applying another heavy load on the server the problem began again. At this point we restarted the uWSGI service and everything was OK again. From this last experiment we now have a clue that when the uWSGI service starts running everything works correctly, but after a period of time and heavy load the server begins to return partial or empty results again.
The latest investigation we did was to run the API using python manage.py with DEBUG=False, and we had the problem again after a period of time in this situation.
We can't figure out what the problem is or how to solve it. One possibility we can think of is that Django closes PyMongo's connections before the query completes, since the returned result is still valid JSON.
Our stack is:
nginx (with no cache enabled)
uWsgi
MemCached (disabled during debugging procedure)
Django (v1.8 on python 3)
PyMongo (v3.0.3)
Your help is really appreciated.
Update:
Mongo version:
db version v3.0.7
git version: 6ce7cbe8c6b899552dadd907604559806aa2e9bd
We are running a single mongod instance. No sharding/replication.
We are creating connection using this snippet:
con = MongoClient('localhost', 27017)
Update 2
Subject thread in Pymongo issue tracker.
PyMongo cursors are not thread-safe. So using them the way I did in a multi-threaded environment causes what I've described in the question. On the other hand, Python's list operations are mostly thread-safe, and changing the snippet like this solves the problem:
def get_data(key):
    return list(collection.find({'key': key}, limit=24))

def my_view(request):
    key = request.POST.get('key')
    query = get_data(key)
    res = [app for app in query]
    return JsonResponse({'list': res})
My very speculative guess is that you are reusing a cursor somewhere in your code. Make sure you are initializing your collection within the view stack itself, and not outside of it.
For example, as written, if you are doing something like:
import ...
import con

collection = con.documents

# blah blah code

def my_view(request):
    key = request.POST.get('key')
    query = collection.find({'key': key}, limit=24)
    res = [app for app in query]
    return JsonResponse({'list': res})
You could end up reusing a cursor. Better to do something like:
import ...
import con

# blah blah code

def my_view(request):
    collection = con.documents
    key = request.POST.get('key')
    query = collection.find({'key': key}, limit=24)
    res = [app for app in query]
    return JsonResponse({'list': res})
EDIT at asker's request for clarification:
The reason you need to define the collection within the view stack, and not when the file loads, is that the collection variable holds a cursor, which is basically how the database and your application talk to each other. Cursors do things like keep track of where you are in a long list of data, in addition to a bunch of other stuff, but that's the important part.
When you create the collection cursor outside the view method, it will be re-used on each request if it already exists. So if you make one request, and then another really, really fast right after that (like what happened when you applied high load), the cursor might only be halfway through talking to the database, and so some of your data goes to the first request and some to the second. The reason you would get NO data in a request would be if a cursor finished fetching data but hadn't been closed yet, so the next request tried to fetch data from the cursor and there was none left to fetch for that query.
By moving the collection definition (and by association, the cursor definition) into the view stack, you will ALWAYS get a new cursor when you process a new request. You won't get any cross-talk between your cursors and different requests, as each request cycle will have its own.

Stopping a function in Python using a timeout

I'm writing a healthcheck endpoint for my web service.
The endpoint calls a series of functions, each of which returns True if its component is working correctly:
The system is considered to be working if all the components are working:
def is_health():
    healthy = all(r for r in (database(), cache(), worker(), storage()))
    return healthy
When things aren't working, the functions may take a long time to return. For example if the database is bogged down with slow queries, database() could take more than 30 seconds to return.
The healthcheck endpoint runs in the context of a Django view, running inside a uWSGI container. If the request / response cycle takes longer than 30 seconds, the request is harakiri-ed!
This is a huge bummer, because I lose all contextual information that I could have logged about which component took a long time.
What I'd really like, is for the component functions to run within a timeout or a deadline:
with timeout(seconds=30):
    database_result = database()
    cache_result = cache()
    worker_result = worker()
    storage_result = storage()
In my imagination, as the deadline / harakiri timeout approaches, I can abort the remaining health checks and just report the work I've completed.
What's the right way to handle this sort of thing?
I've looked at threading.Thread and Queue.Queue - the idea being that I create a work queue and a result queue, then use a thread to consume the work queue while placing results in the result queue. Then I could use the thread's Thread.join function with a timeout to stop processing the rest of the components.
The one challenge there is that I'm not sure how to hard-exit the thread - I wouldn't want it hanging around forever if it didn't complete its run.
Here is the code I've got so far. Am I on the right track?
import Queue
import threading
import time


class WorkThread(threading.Thread):
    def __init__(self, work_queue, result_queue):
        super(WorkThread, self).__init__()
        self.work_queue = work_queue
        self.result_queue = result_queue
        self._timeout = threading.Event()

    def timeout(self):
        self._timeout.set()

    def timed_out(self):
        return self._timeout.is_set()

    def run(self):
        while not self.timed_out():
            try:
                # Use a non-blocking get so Queue.Empty is actually raised
                # when the work queue is exhausted.
                work_fn, work_arg = self.work_queue.get_nowait()
                retval = work_fn(work_arg)
                self.result_queue.put(retval)
            except Queue.Empty:
                break


def work(retval, timeout=1):
    time.sleep(timeout)
    return retval


def main():
    # Two work items that will take at least two seconds to complete.
    work_queue = Queue.Queue()
    work_queue.put_nowait([work, 1])
    work_queue.put_nowait([work, 2])
    result_queue = Queue.Queue()

    # Run the `WorkThread`. It should complete one item from the work queue
    # before it times out.
    t = WorkThread(work_queue=work_queue, result_queue=result_queue)
    t.start()
    t.join(timeout=1.1)
    t.timeout()

    results = []
    while True:
        try:
            result = result_queue.get_nowait()
            results.append(result)
        except Queue.Empty:
            break

    print results


if __name__ == "__main__":
    main()
Update
It seems like in Python you've got a few options for timeouts of this nature:
Use SIGALRM, which works great if you have full control of the signals used by the process but is probably a mistake when you're running in a container like uWSGI (a minimal sketch of this option follows the list).
Threads, which give you limited timeout control. Depending on your container environment (like uWSGI) you might need to set options to enable them.
Subprocesses, which give you full timeout control, but you need to be conscious of how they might change how your service consumes resources.
Use existing network timeouts. For example, if part of your healthcheck is to use Celery workers, you could rely on AsyncResult's timeout parameter to bound execution.
Do nothing! Log at regular intervals. Analyze later.
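For reference, a minimal sketch of the signal-based option (the timeout context manager and HealthCheckTimeout exception are names I made up; SIGALRM is Unix-only, only usable from the main thread, and may clash with a container like uWSGI that installs its own handlers):
import signal
from contextlib import contextmanager

class HealthCheckTimeout(Exception):
    pass

@contextmanager
def timeout(seconds):
    def _handler(signum, frame):
        raise HealthCheckTimeout('timed out after %s seconds' % seconds)

    previous = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)                        # schedule SIGALRM
    try:
        yield
    finally:
        signal.alarm(0)                          # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler

# Usage, mirroring the pseudo-code from the question:
# try:
#     with timeout(seconds=25):
#         database_result = database()
#         cache_result = cache()
# except HealthCheckTimeout:
#     pass  # log which checks completed before the deadline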
I'm exploring the benefits of these different options more.
Update #2
I put together a GitHub repo with quite a bit more information on the topic:
https://github.com/johnboxall/pytimeout
I'll type it up into an answer one day, but the TL;DR is here:
https://github.com/johnboxall/pytimeout#recommendations

Should a database connection be opened only once in a django app or once for every user within views.py?

I'm working on my first Django project.
I need to connect to a pre-existing key-value store (in this case it is Kyoto Tycoon) for a one-off task, i.e. I am not talking about the main database used by Django.
Currently, I have something that works, but I don't know if what I'm doing is sensible/optimal.
views.py
from django.http import HttpResponse
from pykt import KyotoTycoon

def get_from_kv(user_input):
    kt = KyotoTycoon()
    kt.open('127.0.0.1', 1978)
    # some code to define the required key
    # my_key = ...
    my_value = kt.get(my_key)
    kt.close()
    return HttpResponse(my_value)
i.e. it opens a new connection to the database every time a user makes a query, then closes the connection again after it has finished.
Or, would something like this be better?
views.py
from django.http import HttpResponse
from pykt import KyotoTycoon

kt = KyotoTycoon()
kt.open('127.0.0.1', 1978)

def get_from_kv(user_input):
    # some code to define the required key
    # my_key = ...
    my_value = kt.get(my_key)
    return HttpResponse(my_value)
In the second approach, will Django only open the connection once when the app is first started? i.e. will all users share the same connection?
Which approach is best?
Opening a connection when it is required is likely to be the better solution. Otherwise, there is the potential that the connection is no longer open. Thus, you would need to test that the connection is still open, and restart it if it isn't before continuing anyway.
This means you could run the queries within a context manager block, which would auto-close the connection for you, even if an unhandled exception occurs.
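For illustration, a sketch of such a context manager built around the open()/close() calls from the question (the key derivation is still a placeholder):
from contextlib import contextmanager

from django.http import HttpResponse
from pykt import KyotoTycoon

@contextmanager
def kyoto_connection(host='127.0.0.1', port=1978):
    # Open a fresh connection per request and guarantee it is closed,
    # even if the view raises.
    kt = KyotoTycoon()
    kt.open(host, port)
    try:
        yield kt
    finally:
        kt.close()

def get_from_kv(user_input):
    my_key = user_input  # placeholder: derive the real key from user_input
    with kyoto_connection() as kt:
        my_value = kt.get(my_key)
    return HttpResponse(my_value)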
Alternatively, you could have a pool of connections, and just grab one that is not currently in use (I don't know if this would be an issue in this case).
It all depends on just how expensive creating connections is, and whether it makes sense to re-use them.