How to handle timeouts in a python lambda? - amazon-web-services

I know this has been asked before, but no real solution was proposed, and I was wondering if there are any new ways nowadays.
Is there any way to hook an event, using any AWS service, to check whether a Lambda has timed out? I mean, it logs to CloudWatch Logs that it timed out, so there must be a way.
Specifically in Python, because it's not so simple to keep checking whether it's reaching the 20 minute mark as you can with JavaScript and other naturally concurrent languages.
Ideally I want to execute a Lambda if the Python Lambda times out, with the same payload the original one received.

Here's an example from cloudformation-custom-resources/lambda/python · GitHub showing how an AWS Lambda function written in Python can realise that it is about to time out.
(I've edited out the other stuff, here's the relevant bits):
import signal


def handler(event, context):
    # Set the alarm for the remaining runtime minus a second
    signal.alarm(context.get_remaining_time_in_millis() // 1000 - 1)

    # Do other stuff
    ...


def timeout_handler(_signal, _frame):
    '''Handle SIGALRM'''
    raise Exception('Time exceeded')


signal.signal(signal.SIGALRM, timeout_handler)
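
The original goal was to fire another Lambda with the same payload when the first one runs out of time. One way to do that (not part of the linked example) is to catch the alarm exception and re-invoke the function asynchronously; a minimal sketch, assuming the function's role is allowed lambda:InvokeFunction on itself and that do_work stands in for your long-running logic:

import json
import signal

import boto3

lambda_client = boto3.client('lambda')


class TimeoutApproaching(Exception):
    """Raised by the SIGALRM handler shortly before the Lambda timeout."""


def timeout_handler(_signal, _frame):
    raise TimeoutApproaching('Time exceeded')


signal.signal(signal.SIGALRM, timeout_handler)


def do_work(event):
    # placeholder for the long-running logic
    ...


def handler(event, context):
    signal.alarm(context.get_remaining_time_in_millis() // 1000 - 1)
    try:
        do_work(event)
    except TimeoutApproaching:
        # Fire-and-forget re-invocation with the original payload
        lambda_client.invoke(
            FunctionName=context.function_name,
            InvocationType='Event',
            Payload=json.dumps(event),
        )
    finally:
        signal.alarm(0)  # clear the alarm before the runtime reuses the process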

I want to add to @John Rotenstein's answer, which worked for me yet resulted in the following errors populating the CloudWatch logs:
START RequestId: ********* Version: $LATEST
Traceback (most recent call last):
File "/var/runtime/bootstrap", line 9, in <module>
main()
File "/var/runtime/bootstrap.py", line 350, in main
event_request = lambda_runtime_client.wait_next_invocation()
File "/var/runtime/lambda_runtime_client.py", line 57, in wait_next_invocation
response = self.runtime_connection.getresponse()
File "/var/lang/lib/python3.7/http/client.py", line 1369, in getresponse
response.begin()
File "/var/lang/lib/python3.7/http/client.py", line 310, in begin
version, status, reason = self._read_status()
File "/var/lang/lib/python3.7/http/client.py", line 271, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/var/lang/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/var/task/lambda_function.py", line 6, in timeout_handler
raise Exception('Time limit exceeded')
Exception: Time limit exceeded
END RequestId
So I just had to reset the signal's alarm before returning each response:
import logging
import signal


def timeout_handler(_signal, _frame):
    raise Exception('Time limit exceeded')


signal.signal(signal.SIGALRM, timeout_handler)


def lambda_handler(event, context):
    try:
        signal.alarm(int(context.get_remaining_time_in_millis() / 1000) - 1)
        logging.info('Testing stuff')
        # Do work
    except Exception as e:
        logging.error(f'Exception:\n{e}')

    signal.alarm(0)  # This line fixed the issue above!
    return {'statusCode': 200, 'body': 'Complete'}

Two options I can think of; the first is quick and dirty, but also less ideal:
run it in a Step Function (check out Step Functions in AWS), which has the capability to retry on timeouts/errors
a better way would be to re-architect your code to be idempotent, as sketched below. In this example, the process that triggers the Lambda checks a condition, and as long as this condition is true, triggers the Lambda. That condition needs to remain true until the Lambda has finished executing the logic successfully. This can be achieved by persisting the parameters sent to the Lambda in a table in the DB, for example, with an extra field called "processed" which is set to "true" only once the Lambda has finished running successfully for that event.
Using method #2 will make your code more resilient and easy to re-run on errors, and it is also easy to monitor: basically, all you have to do is check how many such records you have that are not yet processed, and what their create/update timestamps in the DB are.
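
A minimal sketch of option #2, assuming a hypothetical DynamoDB table named lambda_jobs keyed by job_id with a processed flag (all names are illustrative):

import boto3

table = boto3.resource('dynamodb').Table('lambda_jobs')  # hypothetical table


def lambda_handler(event, context):
    job_id = event['job_id']  # assumed unique key carried in the payload
    item = table.get_item(Key={'job_id': job_id}).get('Item')
    if item and item.get('processed'):
        return {'statusCode': 200, 'body': 'Already processed'}

    # ... do the actual work here ...

    # Flip the flag only after the work completed successfully, so a timed-out
    # run leaves the record unprocessed and eligible for a retry.
    table.update_item(
        Key={'job_id': job_id},
        UpdateExpression='SET #p = :done',
        ExpressionAttributeNames={'#p': 'processed'},
        ExpressionAttributeValues={':done': True},
    )
    return {'statusCode': 200, 'body': 'Complete'}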

If you care not only about identifying the timeout, but also about giving your Lambdas an option of a "healthy" shutdown and passing the remaining payload to another execution automatically, you may have a look at the Siblings component of the sosw package.
Here is an example use case where you call the sibling when the time is running out. You pass a pointer to where you left off to the Sibling; for example, you may store the remaining payload in S3, and the cursor will show where you stopped processing.
You will have to grant the Role of this Lambda permission to lambda:InvokeFunction on itself.
import logging
import time

from sosw import Processor as SoswProcessor
from sosw.app import LambdaGlobals, get_lambda_handler
from sosw.components.siblings import SiblingsManager

logger = logging.getLogger()
logger.setLevel(logging.INFO)


class Processor(SoswProcessor):

    DEFAULT_CONFIG = {
        'init_clients': ['Siblings'],  # Automatically initialize Siblings Manager
        'shutdown_period': 10,         # Some time to shut down in a healthy manner.
    }

    siblings_client: SiblingsManager = None

    def __call__(self, event):
        cursor = event.get('cursor', 0)

        while self.sufficient_execution_time_left:
            self.process_data(cursor)
            cursor += 1
            if cursor == 20:
                return "Reached the end of data"
        else:
            # Spawning another sibling to continue the processing
            payload = {'cursor': cursor}
            self.siblings_client.spawn_sibling(global_vars.lambda_context, payload=payload, force=True)
            self.stats['siblings_spawned'] += 1

    def process_data(self, cursor):
        """ Your custom logic respecting current cursor. """
        logger.info(f"Processing data at cursor: {cursor}")
        time.sleep(1)

    @property
    def sufficient_execution_time_left(self) -> bool:
        """ Return whether there is sufficient execution time left for processing ('shutdown_period' is in seconds). """
        return global_vars.lambda_context.get_remaining_time_in_millis() > self.config['shutdown_period'] * 1000


global_vars = LambdaGlobals()
lambda_handler = get_lambda_handler(Processor, global_vars)

Related

How to run BigQuery after Dataflow job completed successfully

I am trying to run a query in BigQuery right after a Dataflow job completes successfully. I have defined 3 different functions in main.py.
The first one runs the Dataflow job. The second one checks the Dataflow job's status. And the last one runs the query in BigQuery.
The trouble is that the second function checks the Dataflow job status multiple times over a period of time, and after the Dataflow job completes successfully, it does not stop checking the status.
And then the function deployment fails due to a 'function load attempt timed out' error.
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
import os
import re
import config
from google.cloud import bigquery
import time

global flag


def trigger_job(gcs_path, body):
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials, cache_discovery=False)
    request = service.projects().templates().launch(projectId=config.project_id, gcsPath=gcs_path, body=body)
    response = request.execute()


def get_job_status(location, flag):
    credentials = GoogleCredentials.get_application_default()
    dataflow = build('dataflow', 'v1b3', credentials=credentials, cache_discovery=False)
    result = dataflow.projects().jobs().list(projectId=config.project_id, location=location).execute()
    for job in result['jobs']:
        if re.findall(r'' + re.escape(config.job_name) + '', job['name']):
            while flag == 0:
                if job['currentState'] != "JOB_STATE_DONE":
                    print('NOT DONE')
                else:
                    flag = 1
                    print('DONE')
                    break


def bq(sql):
    client = bigquery.Client()
    query_job = client.query(sql, location='US')


gcs_path = config.gcs_path
body = config.body
trigger_job(gcs_path, body)
flag = 0
location = 'us-central1'
get_job_status(location, flag)
sql = """CREATE OR REPLACE TABLE 'table' AS SELECT * FROM 'table'"""
bq(sql)
Cloud Function timeout is set to 540 seconds but deployment fails in 3-4 minutes.
Any help is very appreciated.
It appears from the code snippet provided that your HTTP-triggered Cloud Function is not returning an HTTP response.
All HTTP-based Cloud Functions must return an HTTP response for proper termination. From the Google documentation, Ensure HTTP functions send an HTTP response (emphasis mine):
If your function is HTTP-triggered, remember to send an HTTP response,
as shown below. Failing to do so can result in your function executing
until timeout. If this occurs, you will be charged for the entire
timeout time. Timeouts may also cause unpredictable behavior or cold
starts on subsequent invocations, resulting in unpredictable behavior
or additional latency.
Thus, you must have a function in your main.py that returns some sort of value, ideally a value that can be coerced into a Flask HTTP response.
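A minimal sketch of that shape, assuming it lives in the same main.py next to the question's trigger_job and bq helpers; job_is_done is a hypothetical helper that re-queries the Dataflow API on every call, and the loop bound and sleep interval are illustrative:

import time

import config  # the question's own config module


def job_is_done():
    # hypothetical helper: re-query the Dataflow API and return True once
    # the job's currentState is JOB_STATE_DONE
    ...


def main(request):
    """HTTP entry point: always finishes by returning an HTTP response."""
    trigger_job(config.gcs_path, config.body)

    # Poll with an explicit bound so the function can never spin until timeout
    for _ in range(30):
        if job_is_done():
            bq(sql)
            return 'Dataflow done, BigQuery job submitted', 200
        time.sleep(10)

    return 'Dataflow job still running; gave up waiting', 202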

Run tasks in parallel way inside aws lambda function

I'm trying to figure out the best solution to increase the speed of my Lambda function by running my code in parallel. My loop does the same thing over and over again; is there a solution or a better way?
This is a prime example for multithreading.
Taken from the Python standard library docs:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']


# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()


# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
You can easily adapt it for your needs.
One caveat is that you cannot open more than 1000 threads in AWS Lambda, so be mindful of that.
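Adapted to a Lambda handler, it could look like this minimal sketch (the event shape and the body of fetch_item are placeholders for whatever your loop currently does per item):

import concurrent.futures


def fetch_item(item):
    # Replace with the work your loop currently performs for each item
    return item


def lambda_handler(event, context):
    items = event.get('items', [])
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        # executor.map preserves input order and re-raises any worker exception here
        results = list(executor.map(fetch_item, items))
    return {'count': len(results)}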

Testing Motor calls with IOLoop

I'm running unittests in the callbacks for motor database calls, and I'm successfully catching AssertionErrors and having them surface when running nosetests, but the AssertionErrors are being caught in the wrong test. The tracebacks are to different files.
My unittests look generally like this:
def test_create(self):
    @self.callback
    def create_callback(result, error):
        self.assertIs(error, None)
        self.assertIsNot(result, None)

    question_db.create(QUESTION, create_callback)
    self.wait()
And the unittest.TestCase class I'm using looks like this:
import Queue
import traceback
import unittest

from tornado.ioloop import IOLoop


class MotorTest(unittest.TestCase):
    bucket = Queue.Queue()

    # Ensure IOLoop stops to prevent blocking tests
    def callback(self, func):
        def wrapper(*args, **kwargs):
            try:
                func(*args, **kwargs)
            except Exception as e:
                self.bucket.put(traceback.format_exc())
            IOLoop.current().stop()
        return wrapper

    def wait(self):
        IOLoop.current().start()
        try:
            raise AssertionError(self.bucket.get(block=False))
        except Queue.Empty:
            pass
The errors I'm seeing:
======================================================================
FAIL: test_sync_user (app.tests.db.test_user_db.UserDBTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/----/Documents/app/app-Server/app/tests/db/test_user_db.py", line 39, in test_sync_user
self.wait()
File "/Users/----/Documents/app/app-Server/app/tests/testutils/mongo.py", line 25, in wait
raise AssertionError(self.bucket.get(block = False))
AssertionError: Traceback (most recent call last):
File "/Users/----/Documents/app/app-Server/app/tests/testutils/mongo.py", line 16, in wrapper
func(*args, **kwargs)
File "/Users/----/Documents/app/app-Server/app/tests/db/test_question_db.py", line 32, in update_callback
self.assertEqual(result["question"], "updated question?")
TypeError: 'NoneType' object has no attribute '__getitem__'
The error is reported to be in UserDBTest but clearly comes from test_question_db.py (the questions test case).
I'm having issues with nosetests and asynchronous tests in general, so if anyone has any advice on that, it'd be greatly appreciated as well.
I can't fully understand your code without an SSCCE, but I'd say you're taking an unwise approach to async testing in general.
The particular problem you face is that you don't wait for your test to complete (asynchronously) before leaving the test function, so there's work still pending in the IOLoop when you resume the loop in your next test. Use Tornado's own "testing" module -- it provides convenient methods for starting and stopping the loop, and it recreates the loop between tests so you don't experience interference like what you're reporting. Finally, it has extremely convenient means of testing coroutines.
For example:
import unittest

from tornado.testing import AsyncTestCase, gen_test
import motor


# AsyncTestCase creates a new loop for each test, avoiding interference
# between tests.
class Test(AsyncTestCase):
    def callback(self, result, error):
        # Translate from Motor callbacks' (result, error) convention to the
        # single arg expected by "stop".
        self.stop((result, error))

    def test_with_a_callback(self):
        client = motor.MotorClient()
        collection = client.test.collection
        collection.remove(callback=self.callback)

        # AsyncTestCase starts the loop, runs until "remove" calls "stop".
        self.wait()

        collection.insert({'_id': 123}, callback=self.callback)

        # Arguments passed to self.stop appear as return value of "self.wait".
        _id, error = self.wait()
        self.assertIsNone(error)
        self.assertEqual(123, _id)

        collection.count(callback=self.callback)
        cnt, error = self.wait()
        self.assertIsNone(error)
        self.assertEqual(1, cnt)

    @gen_test
    def test_with_a_coroutine(self):
        client = motor.MotorClient()
        collection = client.test.collection
        yield collection.remove()
        _id = yield collection.insert({'_id': 123})
        self.assertEqual(123, _id)
        cnt = yield collection.count()
        self.assertEqual(1, cnt)


if __name__ == '__main__':
    unittest.main()
(In this example I create a new MotorClient for each test, which is a good idea when testing applications that use Motor. Your actual application must not create a new MotorClient for each operation. For decent performance you must create one MotorClient when your application begins, and use that same client throughout the process's lifetime.)
Take a look at the testing module, and particularly the gen_test decorator:
http://tornado.readthedocs.org/en/latest/testing.html
These testing conveniences take care of many details involved in unit-testing Tornado applications.
I gave a talk and wrote an article about testing in Tornado, there's more info here:
http://emptysqua.re/blog/eventually-correct-links/

Python Celery group() - TypeError: [...] argument after ** must be a mapping, not long

I'm trying to run a celery (3.1.17) task that executes further tasks in a group but I always run into errors. This is how I set up the code:
from celery import task, group


@task
def daily_emails():
    [...]
    all_tasks = []
    for chunk in range(0, users.count(), 1000):
        some_users = users[chunk:chunk+1000]
        all_tasks.append(write_email_bunch.subtask(some_users, execnum))

    job = group(all_tasks)
    # result = job.apply_async()
    # job.get()
    result = job.delay()
    print result
    results = result.join()
    print results
    print "done writing email tasks"
    count = sum(results)
    print count


@task
def write_email_bunch(some_users, execnum):
    [...]
    return len(some_users) - skipped_email_count
And this is the output:
<GroupResult: 3d766c85-21af-4ed0-90cb-a1ca2d281db1 [69527252-8468-4358-9328-144f727f372b, 6d03d86e-1b69-4f43-832e-bd27c4dfc092, 1d868d1b-b502-4672-9895-430089e9532e]>
Traceback (most recent call last):
File "send_daily_emails.py", line 8, in <module>
daily_emails()
File "/var/www/virtualenvs/nt_dev/local/lib/python2.7/site-packages/celery/app/task.py", line 420, in __call__
return self.run(*args, **kwargs)
File "/var/www/nt_dev/nt/apps/emails/tasks.py", line 124, in daily_emails
results = result.join()
File "/var/www/virtualenvs/nt_dev/local/lib/python2.7/site-packages/celery/result.py", line 642, in join
interval=interval, no_ack=no_ack,
File "/var/www/virtualenvs/nt_dev/local/lib/python2.7/site-packages/celery/result.py", line 870, in get
raise self.result
TypeError: write_email_bunch() argument after ** must be a mapping, not long
So I get a GroupResult, but somehow I'm unable to join it or process it further.
And when I use write_email_bunch.s(some_users, execnum) I get this exception:
File "/var/www/virtualenvs/nt_dev/local/lib/python2.7/site-packages/celery/result.py", line 870, in get
raise self.result
TypeError: 'tuple' object is not callable
How would I wait for all the Group Tasks to be completed to continue afterwards?
job.get() gives me this exception:
TypeError: get expected at least 1 arguments, got 0
subtask takes a tuple of args, a dict of kwargs, and task options, so it should be called like this:
all_tasks.append(write_email_bunch.subtask((some_users, execnum)))
Note that we are passing it a tuple containing the args.
Also, you shouldn't wait on a task inside a task - this can cause deadlocks. In this case I reckon daily_emails does not need to be a Celery task - it can be a regular function that creates a canvas object and calls apply_async:
def daily_emails():
    all_tasks = []
    for chunk in range(0, users.count(), 1000):
        some_users = users[chunk:chunk+1000]
        all_tasks.append(write_email_bunch.subtask((some_users, execnum)))

    job = group(all_tasks)
    result = job.apply_async()
    return result.id
In addition to the other answer, you could be using chunks here:
http://docs.celeryproject.org/en/latest/userguide/canvas.html#chunks

@app.task
def daily_emails():
    return write_email.chunks(users, 1000).delay()


@task
def write_email(user):
    [...]
It may be beneficial to do it manually if getting several objects at once from the db is important. You should also consider that the model objects will be serialized here; to avoid that, you can send only the pk and refetch the model in the task, or send just the fields that you care about (like the email address or whatever is required to send that email to the user).
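
A minimal sketch of the "send the pk only" idea, assuming a Django-style User model (the model and its import path are illustrative):

from celery import task

from myapp.models import User  # hypothetical model import


@task
def write_email_bunch(user_ids, execnum):
    # Refetch inside the task so only small integer ids cross the broker
    # instead of fully serialized model objects.
    some_users = User.objects.filter(pk__in=user_ids)
    skipped_email_count = 0
    for user in some_users:
        # send the email here; increment skipped_email_count when skipping
        pass
    return len(user_ids) - skipped_email_count


# When building the group, pass ids instead of objects, e.g.:
# write_email_bunch.subtask((list(some_users.values_list('pk', flat=True)), execnum))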

Twisted getPage, exceptions.OSError: [Errno 24] Too many open files

I'm trying to run the following script with about 3000 items. The script takes the link provided by self.book and returns the result using getPage. It loops through each item in self.book until there are no more items in the dictionary.
Here's the script:
from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.web.error import Error
from twisted.internet.defer import DeferredList
import logging

from src.utilitybelt import Utility


class getPages(object):
    """ Return contents from HTTP pages """

    def __init__(self, book, logger=False):
        self.book = book
        self.data = {}
        util = Utility()
        if logger:
            log = util.enable_log("crawler")

    def start(self):
        """ get each page """
        for key in self.book.keys():
            page = self.book[key]
            logging.info(page)

            d1 = getPage(page)
            d1.addCallback(self.pageCallback, key)
            d1.addErrback(self.errorHandler, key)

            dl = DeferredList([d1])
            # This should stop the reactor
            dl.addCallback(self.listCallback)

    def errorHandler(self, result, key):
        # Bad thingy!
        logging.error(result)
        self.data[key] = False
        logging.info("Appended False at %d" % len(self.data))

    def pageCallback(self, result, key):
        ########### I added this, to hold the data:
        self.data[key] = result
        logging.info("Data appended")
        return result

    def listCallback(self, result):
        # print result
        # Added for effect:
        if reactor.running:
            reactor.stop()
            logging.info("Reactor stopped")
About halfway through, I experience this error:
File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 303, in _handleSignals
File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 205, in __init__
File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 138, in __init__
exceptions.OSError: [Errno 24] Too many open files
libgcc_s.so.1 must be installed for pthread_cancel to work
libgcc_s.so.1 must be installed for pthread_cancel to work
As of right now, I'll try to run the script with fewer items to see if that resolves the issue. However, there must be a better way to do it and I'd really like to learn.
Thank you for your time.
It looks like you are hitting the open file descriptor limit (ulimit -n), which is likely to be 1024.
Each new getPage call opens a new file descriptor which maps to the client TCP socket opened for the HTTP request. You might want to limit the number of getPage calls you run concurrently. Another way around it is to raise the file descriptor limit for your process, but then you might still exhaust ports or FDs if self.book grows beyond 32K items.
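
One way to cap the number of requests in flight is twisted.internet.defer.DeferredSemaphore; a minimal sketch (the limit of 100 is arbitrary, and fetch_all is an illustrative helper, not the class from the question):

from twisted.internet.defer import DeferredList, DeferredSemaphore
from twisted.web.client import getPage

sem = DeferredSemaphore(100)  # at most 100 getPage calls (and sockets) at once


def fetch_all(book):
    deferreds = []
    for key, page in book.items():
        # sem.run acquires the semaphore, calls getPage, and releases it when
        # the returned Deferred fires, keeping the number of open sockets bounded.
        d = sem.run(getPage, page)
        d.addCallback(lambda result, key=key: (key, result))
        deferreds.append(d)
    return DeferredList(deferreds, consumeErrors=True)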