Process memory grows huge - Tornado CurlAsyncHTTPClient - Python 2.7

I am using Tornado's CurlAsyncHTTPClient. My process memory keeps growing, for both blocking and non-blocking requests, when I instantiate the corresponding HTTP clients for each request. This memory growth does not happen if I keep just one instance of the HTTP clients (tornado.httpclient.HTTPClient / tornado.httpclient.AsyncHTTPClient) and reuse them.
Also, if I use SimpleAsyncHTTPClient instead of CurlAsyncHTTPClient, this memory growth does not happen, irrespective of how I instantiate the clients.
Here is sample code that reproduces this:
import tornado.httpclient
import tornado.ioloop
import json
import functools

instantiate_once = False
count = 0

tornado.httpclient.AsyncHTTPClient.configure('tornado.curl_httpclient.CurlAsyncHTTPClient')

hc, io_loop, async_hc = None, None, None
if instantiate_once:
    hc = tornado.httpclient.HTTPClient()
    io_loop = tornado.ioloop.IOLoop()
    async_hc = tornado.httpclient.AsyncHTTPClient(io_loop=io_loop)

def fire_sync_request():
    global count
    if instantiate_once:
        global hc
    if not instantiate_once:
        hc = tornado.httpclient.HTTPClient()
    url = '<Please try with a url>'
    try:
        resp = hc.fetch(url)
    except (Exception, tornado.httpclient.HTTPError) as e:
        print str(e)
    if not instantiate_once:
        hc.close()

def fire_async_requests():
    # generic response callback fn
    def response_callback(response):
        response_callback_info['response_count'] += 1
        if response_callback_info['response_count'] >= request_count:
            io_loop.stop()

    if instantiate_once:
        global io_loop, async_hc
    if not instantiate_once:
        io_loop = tornado.ioloop.IOLoop()
    requests = ['<Please add ur url to try>'] * 5
    response_callback_info = {'response_count': 0}
    request_count = len(requests)
    global count
    count += request_count
    hcs = []
    for url in requests:
        kwargs = {}
        kwargs['method'] = 'GET'
        if not instantiate_once:
            async_hc = tornado.httpclient.AsyncHTTPClient(io_loop=io_loop)
        async_hc.fetch(url, callback=functools.partial(response_callback), **kwargs)
        if not instantiate_once:
            hcs.append(async_hc)
    io_loop.start()
    for hc in hcs:
        hc.close()
    if not instantiate_once:
        io_loop.close()

if __name__ == '__main__':
    import sys
    if sys.argv[1] == 'sync':
        while True:
            output = fire_sync_request()
    elif sys.argv[1] == 'async':
        while True:
            output = fire_async_requests()
Here, set the instantiate_once variable to False and execute python check.py sync or python check.py async. The process memory increases continuously. With instantiate_once=True, this does not happen.
Also, if I use SimpleAsyncHTTPClient instead of CurlAsyncHTTPClient, this memory growth does not happen.
I have Python 2.7, Tornado 2.3.2, and pycurl (libcurl/7.26.0 GnuTLS/2.12.20 zlib/1.2.7 libidn/1.25 libssh2/1.4.2 librtmp/2.3).
I could reproduce the same issue with the latest Tornado 3.2.
Please help me understand this behaviour and figure out the right way to use Tornado as an HTTP library.

HTTPClient and AsyncHTTPClient are designed to be reused, so it will always be more efficient not to recreate them all the time. In fact, AsyncHTTPClient will try to magically detect if there is an existing AsyncHTTPClient on the same IOLoop and use that instead of creating a new one.
But even though it's better to reuse one HTTP client object, creating many of them as you're doing here shouldn't leak (as long as you're closing them). This looks like a bug in pycurl: https://github.com/pycurl/pycurl/issues/182
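For reference, here is a minimal sketch of that reuse pattern (Tornado 2.x/3.x style, with a placeholder URL): configure the client class once, create a single AsyncHTTPClient per IOLoop, and issue every fetch through it.
import tornado.ioloop
import tornado.httpclient

tornado.httpclient.AsyncHTTPClient.configure(
    'tornado.curl_httpclient.CurlAsyncHTTPClient')

io_loop = tornado.ioloop.IOLoop.instance()
client = tornado.httpclient.AsyncHTTPClient(io_loop=io_loop)  # created once, reused for every fetch

def handle_response(response):
    print response.code
    io_loop.stop()

client.fetch('http://example.com/', callback=handle_response)
io_loop.start()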

Use pycurl 7.19.5 and this hack to avoid memory leaks:
Your Tornado main file:
tornado.httpclient.AsyncHTTPClient.configure("curl_httpclient_leaks_patched.CurlAsyncHTTPClientEx")
curl_httpclient_leaks_patched.py
from tornado import curl_httpclient

class CurlAsyncHTTPClientEx(curl_httpclient.CurlAsyncHTTPClient):

    def close(self):
        super(CurlAsyncHTTPClientEx, self).close()
        del self._multi

Related

Multiple Greenlets in a loop and ZMQ. Greenlet blocks in the first _run

I wrote two types of greenlets. MyGreenletPUB will publish messages via ZMQ with message type 1 and message type 2.
MyGreenletSUB instances will subscribe to the ZMQ PUB, based on a parameter ("1" or "2").
The problem is that when I start my greenlets, the _run method in MyGreenletSUB blocks on message = sock.recv() and never returns execution to the other greenlets.
My question is: how can I avoid this and run my greenlets asynchronously with a while True loop, without using gevent.sleep() in the loops to switch execution between greenlets?
from gevent.monkey import patch_all
patch_all()

import zmq
import time
import gevent
from gevent import Greenlet

class MyGreenletPUB(Greenlet):

    def _run(self):
        # ZeroMQ Context
        context = zmq.Context()
        # Define the socket using the "Context"
        sock = context.socket(zmq.PUB)
        sock.bind("tcp://127.0.0.1:5680")
        id = 0
        while True:
            gevent.sleep(1)
            id, now = id + 1, time.ctime()
            # Message [prefix][message]
            message = "1#".format(id=id, time=now)
            sock.send(message)
            # Message [prefix][message]
            message = "2#".format(id=id, time=now)
            sock.send(message)
            id += 1

class MyGreenletSUB(Greenlet):

    def __init__(self, b):
        Greenlet.__init__(self)
        self.b = b

    def _run(self):
        context = zmq.Context()
        # Define the socket using the "Context"
        sock = context.socket(zmq.SUB)
        # Define subscription and messages with prefix to accept.
        sock.setsockopt(zmq.SUBSCRIBE, self.b)
        sock.connect("tcp://127.0.0.1:5680")
        while True:
            message = sock.recv()
            print message

g = MyGreenletPUB.spawn()
g2 = MyGreenletSUB.spawn("1")
g3 = MyGreenletSUB.spawn("2")

try:
    gevent.joinall([g, g2, g3])
except KeyboardInterrupt:
    print "Exiting"
By default, ZeroMQ's .recv() method blocks until something has arrived to hand back to the .recv() caller.
For genuinely non-blocking agents, prefer the .poll() instance method together with .recv(zmq.NOBLOCK).
Beware that ZeroMQ subscriptions match topic filters from the left, and you may run into issues if unicode and non-unicode strings are mixed when publishing and collecting at the same time.
Also, mixing several event loops can get tricky, depending on your control needs. I personally always prefer non-blocking systems, even at the cost of more complex design effort.
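A minimal sketch of that pattern for the subscriber side, assuming the same endpoint and topic prefixes as in the question (how well poll() cooperates with the gevent hub still depends on the monkey-patching done above):
import zmq
import gevent

def sub_loop(topic):
    context = zmq.Context()
    sock = context.socket(zmq.SUB)
    sock.setsockopt(zmq.SUBSCRIBE, topic)
    sock.connect("tcp://127.0.0.1:5680")
    poller = zmq.Poller()
    poller.register(sock, zmq.POLLIN)
    while True:
        # Wait up to 100 ms for a message instead of blocking forever.
        events = dict(poller.poll(timeout=100))
        if sock in events:
            message = sock.recv(zmq.NOBLOCK)
            print message
        gevent.sleep(0)  # explicitly give the other greenlets a chance to run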

How to get multiple objects from S3 using boto3 get_object (Python 2.7)

I've got hundreds of thousands of objects saved in S3. My requirement entails loading a subset of these objects (anywhere between 5 and ~3000) and reading the binary content of every object. From reading through the boto3/AWS CLI docs, it looks like it's not possible to get multiple objects in one request, so currently I have implemented this as a loop that constructs the key of every object, requests the object, then reads its body:
for column_key in outstanding_column_keys:
    try:
        s3_object_key = "%s%s-%s" % (path_prefix, key, column_key)
        data_object = self.s3_client.get_object(Bucket=bucket_key, Key=s3_object_key)
        metadata_dict = data_object["Metadata"]
        metadata_dict["key"] = column_key
        metadata_dict["version"] = float(metadata_dict["version"])
        metadata_dict["data"] = data_object["Body"].read()
        records.append(Record(metadata_dict))
    except Exception as exc:
        logger.info(exc)
if len(records) < len(column_keys):
    raise Exception("Some objects are missing!")
My issue is that when I attempt to get multiple objects (e.g. 5 objects), I get back 3, and some aren't processed by the time I check whether all objects have been loaded. I'm handling that with a custom exception. I came up with a solution that wraps the above code snippet in a while loop, because I know which outstanding keys I need:
while (len(outstanding_column_keys) > 0) and (load_attempts < 10):
    for column_key in outstanding_column_keys:
        try:
            s3_object_key = "%s%s-%s" % (path_prefix, key, column_key)
            data_object = self.s3_client.get_object(Bucket=bucket_key, Key=s3_object_key)
            metadata_dict = data_object["Metadata"]
            metadata_dict["key"] = column_key
            metadata_dict["version"] = float(metadata_dict["version"])
            metadata_dict["data"] = data_object["Body"].read()
            records.append(Record(metadata_dict))
        except Exception as exc:
            logger.info(exc)
    if len(records) < len(column_keys):
        raise Exception("Some objects are missing!")
But I took this out suspecting that S3 is actually still processing the outstanding responses and the while loop would unnecessarily make additional requests for objects that S3 is already in the process of returning.
I did a separate investigation to verify that get_object requests are synchronous and it seems they are:
import boto3
import time
import os

s3_client = boto3.client('s3', aws_access_key_id=os.environ["S3_AWS_ACCESS_KEY_ID"],
                         aws_secret_access_key=os.environ["S3_AWS_SECRET_ACCESS_KEY"])

print "Saving 3000 objects to S3..."
start = time.time()
for x in xrange(3000):
    key = "greeting_{}".format(x)
    s3_client.put_object(Body="HelloWorld!", Bucket='bucket_name', Key=key)
end = time.time()
print "Done saving 3000 objects to S3 in %s" % (end - start)

print "Sleeping for 20 seconds before trying to load the saved objects..."
time.sleep(20)

print "Loading the saved objects..."
arr = []
start_load = time.time()
for x in xrange(3000):
    key = "greeting_{}".format(x)
    try:
        obj = s3_client.get_object(Bucket='bucket_name', Key=key)
        arr.append(obj)
    except Exception as exc:
        print exc
end_load = time.time()
print "Done loading the saved objects. Found %s objects. Time taken - %s" % (len(arr), end_load - start_load)
My questions, and what I need confirmation on, are:
1. Are the get_object requests indeed synchronous? If they are, then I expect that when I check for loaded objects in the first code snippet, all of them should be returned.
2. If the get_object requests are asynchronous, how do I handle the responses in a way that avoids making extra requests to S3 for objects that are still in the process of being returned?
Further clarity on, or refutation of, any of my assumptions about S3 would also be appreciated.
Thank you!
Unlike Javascript, Python processes requests synchronously unless you do some sort of multithreading (which you aren't doing in your snippet above). In your for loop, you issue a request to s3_client.get_object, and that call blocks until the data is returned. Since the records array is smaller than it should be, that must mean that some exception is being thrown, and it should be caught in the except block:
except Exception as exc:
    logger.info(exc)
If that isn't printing anything, it might be because logging is configured to ignore INFO level messages. If you aren't seeing any errors, you might try printing with logger.error.
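As a minimal sketch of making such failures visible (a standalone helper with placeholder names, not the asker's class):
import logging
import boto3

logging.basicConfig(level=logging.INFO)  # ensure INFO and ERROR messages are emitted
logger = logging.getLogger(__name__)

s3_client = boto3.client('s3')

def fetch_object(bucket, key):
    try:
        return s3_client.get_object(Bucket=bucket, Key=key)
    except Exception:
        # logger.exception logs at ERROR level and includes the traceback,
        # so a silently swallowed failure shows up in the output.
        logger.exception("get_object failed for %s/%s", bucket, key)
        raise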

psutil's cpu_percent always returns 0.0

I would like my Flask application to report how much CPU and memory it is currently using as a percentage:
import psutil
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/test", methods=["GET"])
def healthz():
    return jsonify(msg="OK"), 200

@app.route("/stats", methods=["GET"])
def stats():
    p = psutil.Process()
    json_body = {
        "cpu_percent": p.cpu_percent(interval=None),
        "cpu_times": p.cpu_times(),
        "mem_info": p.memory_info(),
        "mem_percent": p.memory_percent()
    }
    return jsonify(json_body), 200

def main():
    app.run(host="0.0.0.0", port=8000, debug=False)

if __name__ == '__main__':
    main()
While sending a lot of requests to /test, /stats always returns 0.0 for cpu_percent:
$ while true; do curl http://127.0.0.1:8000/test &>/dev/null; done &
$ curl http://127.0.0.1:8000/stats
{
  "cpu_percent": 0.0,
  "cpu_times": [
    4.97,
    1.28,
    0.0,
    0.0
  ],
  "mem_info": [
    19652608,
    243068928,
    4292608,
    4096,
    0,
    14675968,
    0
  ],
  "mem_percent": 1.8873787935409003
}
However, if I manually check using ipython:
import psutil
p = psutil.Process(10993)
p.cpu_percent()
This correctly returns a value greater than 0.0.
Simply define "p = psutil.Process()" globally (outside of stat() function). cpu_percent() keeps track of CPU times since last call, and that's how it is able to determine percentage.
The first call will always be 0.0 as calculating percentage is something which requires comparing two values over time, and as such, some time has to pass.
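A minimal sketch of that fix applied to the app from the question, trimmed to the relevant fields:
import psutil
from flask import Flask, jsonify

app = Flask(__name__)
p = psutil.Process()  # created once at module level, so later calls have a previous sample to compare against

@app.route("/stats", methods=["GET"])
def stats():
    return jsonify(cpu_percent=p.cpu_percent(interval=None),
                   mem_percent=p.memory_percent()), 200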
As Giampaolo pointed out, the Process instance needs to live at global scope because it tracks state from the prior call in order to work out the percentage.
Do be aware, though, that the CPU percentage can jump around quite a lot from one moment to the next, and especially when the time period it is calculated over keeps changing, the numbers can be quite confusing. It is perhaps better to use a background thread that works out CPU percentage averages over set time ranges.
Some code I happened to have handy may be of interest:
from __future__ import print_function

import os
import time
import atexit
import threading

try:
    import Queue as queue
except ImportError:
    import queue

import psutil

_running = False

_queue = queue.Queue()
_lock = threading.Lock()

# Rolling buffer of one-second samples covering the last 30 minutes.
_cpu_percentage = 1800 * [0.0]
_processes = {}

def _monitor():
    global _cpu_percentage
    global _processes

    while True:
        marker = time.time()

        total = 0.0
        pids = psutil.pids()
        processes = {}

        for pid in pids:
            process = _processes.get(pid)
            if process is None:
                process = psutil.Process(pid)
            processes[pid] = process
            total += process.cpu_percent()

        _processes = processes

        _cpu_percentage.insert(0, total)
        _cpu_percentage = _cpu_percentage[:1800]

        # Sleep out the remainder of the one-second interval; exit as soon
        # as something is pushed onto the queue at process shutdown.
        duration = max(0.0, 1.0 - (time.time() - marker))
        try:
            return _queue.get(timeout=duration)
        except queue.Empty:
            pass

_thread = threading.Thread(target=_monitor)
_thread.setDaemon(True)

def _exiting():
    try:
        _queue.put(True)
    except Exception:
        pass
    _thread.join()

def start_monitor():
    global _running

    _lock.acquire()

    if not _running:
        prefix = 'monitor (pid=%d):' % os.getpid()
        print('%s Starting CPU monitor.' % prefix)

        _running = True

        _thread.start()
        atexit.register(_exiting)

    _lock.release()

def cpu_averages():
    # Use the full 30-minute window so the longer averages are meaningful.
    values = _cpu_percentage[:1800]

    averages = {}

    def average(secs):
        return min(100.0, sum(values[:secs]) / secs)

    averages['cpu.average.1s'] = average(1)
    averages['cpu.average.5s'] = average(5)
    averages['cpu.average.15s'] = average(15)
    averages['cpu.average.30s'] = average(30)
    averages['cpu.average.1m'] = average(60)
    averages['cpu.average.5m'] = average(300)
    averages['cpu.average.15m'] = average(900)
    averages['cpu.average.30m'] = average(1800)

    return averages
I had other stuff in this which I deleted, so hopefully what remains is still in a usable state.
To use it, save the code to a file monitor.py, then import the module in your main application and start the monitoring loop:
import monitor
monitor.start_monitor()
Then on each request call:
monitor.cpu_averages()
and extract the value for the time period you think makes sense.
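For example, wiring it into the /stats handler from the question might look like this (a sketch, assuming the code above was saved as monitor.py next to the Flask module):
from flask import Flask, jsonify
import monitor

app = Flask(__name__)
monitor.start_monitor()  # start the background sampling thread once at import time

@app.route("/stats", methods=["GET"])
def stats():
    averages = monitor.cpu_averages()
    return jsonify(cpu_average_1m=averages['cpu.average.1m']), 200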
The solution from Graham seems to work, but I found a much simpler solution: tell it the interval. In this example it measures over the last second:
psutil.cpu_percent(interval=1)

How to make Flask pass a generator to a task such as Celery

I have a bunch of code working correctly in Flask, but these requests can take over 30 minutes to finish. I am using chained generators so my existing code can yield results back to the browser.
Since these tasks take 30 minutes or more to complete, I want to offload them, but I am at a loss. I have not successfully gotten Celery/RabbitMQ/Redis or any other combination to work correctly, and I am looking for how I can accomplish this so that my page returns right away and I can check in the background whether the task is complete.
Here is example code that works for now but takes 4 seconds of processing for the page to return.
I am looking for advice on how to get around this problem. Can Celery/Redis or RabbitMQ deal with generators like this? Should I be looking at a different solution?
Thanks!
import time
import flask
from itertools import chain

class TestClass(object):

    def __init__(self):
        self.a = 4

    def first_generator(self):
        b = self.a + 2
        yield str(self.a) + '\n'
        time.sleep(1)
        yield str(b) + '\n'

    def second_generator(self):
        time.sleep(1)
        yield '5\n'

    def third_generator(self):
        time.sleep(1)
        yield '6\n'

    def application(self):
        return chain(tc.first_generator(),
                     tc.second_generator(),
                     tc.third_generator())

tc = TestClass()

app = flask.Flask(__name__)

@app.route('/')
def process():
    return flask.Response(tc.application(), mimetype='text/plain')

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=True)
Firstly, it's not clear what it would even mean to "pass a generator to Celery". The whole point of Celery is that it is not directly linked to your app: it's a completely separate thing, maybe even running on a separate machine, to which you would pass some fixed data. You can of course pass the initial parameters and get Celery itself to call the functions that create the generators for processing, but you can't drip-feed data to Celery.
Secondly, this is not at all an appropriate use for Celery in any case. Celery is for offline processing. You can't get it to return stuff to a waiting request. The only thing you could do would be to get it to save the results somewhere accessible by Flask, and then get your template to fire an Ajax request to get those results when they are available.
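For illustration, a minimal sketch of that pattern, assuming a Celery app backed by Redis at redis://localhost:6379/0 (all names here are placeholders, not part of the original code): the work is consumed inside the task, Flask returns a task id immediately, and a second endpoint is polled via Ajax for the result.
from celery import Celery
import flask

celery_app = Celery('tasks', broker='redis://localhost:6379/0',
                    backend='redis://localhost:6379/0')

@celery_app.task
def long_job(a):
    # Consume the generator-style work inside the task and return plain,
    # serialisable data; only that crosses the Flask/Celery boundary.
    return ''.join(str(a + i) + '\n' for i in range(3))

app = flask.Flask(__name__)

@app.route('/start')
def start():
    task = long_job.delay(4)  # returns immediately with an AsyncResult handle
    return flask.jsonify(task_id=task.id), 202

@app.route('/status/<task_id>')
def status(task_id):
    result = celery_app.AsyncResult(task_id)
    if result.ready():
        return flask.jsonify(state=result.state, output=result.get())
    return flask.jsonify(state=result.state), 202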

Overcoming Google API limitation

Hi all, we are using a Google API, e.g. 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query, via a Python script, but it gets blocked very quickly. Any workaround for this? Thank you.
Below is my current code.
#!/usr/bin/env python

import math, sys
import json
import urllib

def gsearch(searchfor):
    query = urllib.urlencode({'q': searchfor})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    return data

args = sys.argv[1:]
m = 45000000000

if len(args) != 2:
    print "need two words as arguments"
    sys.exit(1)

n0 = int(gsearch(args[0])['cursor']['estimatedResultCount'])
n1 = int(gsearch(args[1])['cursor']['estimatedResultCount'])
n2 = int(gsearch(args[0] + " " + args[1])['cursor']['estimatedResultCount'])
The link doesn't work, and there is no code here, so all I can suggest is finding out from the API what the limits are, and delaying your requests appropriately. Alternatively, you can probably pay for less restricted API usage.
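For example, a minimal sketch of spacing out the calls, reusing the gsearch() helper from the question (the 2-second delay is an arbitrary placeholder, not a documented limit):
import time

terms = [args[0], args[1], args[0] + " " + args[1]]
counts = {}
for term in terms:
    counts[term] = int(gsearch(term)['cursor']['estimatedResultCount'])
    time.sleep(2)  # pause between requests to avoid being rate-limited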
Link is bad.
Usually you can overcome this by paying for use.