Run tasks in parallel inside an AWS Lambda function - amazon-web-services

I'm trying to figure out the best way to speed up my Lambda function by running my code in parallel, since my loop just does the same thing over and over again. Is there a solution or a recommended approach?

This is a prime example for multithreading.
The following is taken from the Python standard library docs:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
You can easily adapt it for your needs.
One caveat is that an AWS Lambda execution environment is capped at 1024 processes/threads, so be mindful of that.
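As a rough sketch of how this might look inside a handler (the do_work function and the 'items' key in the event are stand-ins for whatever your current loop does, so treat them as assumptions, not part of the original question):

import concurrent.futures

def do_work(item):
    # placeholder for one iteration of your current loop
    return item * 2

def lambda_handler(event, context):
    items = event.get('items', [])    # assumed input shape
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(do_work, item): item for item in items}
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())
    return {'results': results}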

Related

AWS Error "Calling the invoke API action failed with this message: Rate Exceeded" when I use s3.get_paginator('list_objects_v2')

A third-party application uploads around 10,000 objects a day to my bucket+prefix. My requirement is to fetch all objects that were uploaded to my bucket+prefix in the last 24 hours.
There are a lot of files in my bucket+prefix, so I assume that when I call

    response = s3_paginator.paginate(Bucket=bucket, Prefix='inside-bucket-level-1/', PaginationConfig={"PageSize": 1000})

it makes multiple calls to the S3 API, and maybe that's why it shows the Rate Exceeded error.
Below is my Python Lambda function.
import json
import boto3
import time
from datetime import datetime, timedelta

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    from_date = datetime.today() - timedelta(days=1)
    string_from_date = from_date.strftime("%Y-%m-%d, %H:%M:%S")
    print("Date :", string_from_date)
    s3_paginator = s3.get_paginator('list_objects_v2')
    list_of_buckets = ['kush-dragon-data']
    bucket_wise_list = {}
    for bucket in list_of_buckets:
        response = s3_paginator.paginate(Bucket=bucket, Prefix='inside-bucket-level-1/', PaginationConfig={"PageSize": 1000})
        filtered_iterator = response.search(
            "Contents[?to_string(LastModified)>='\"" + string_from_date + "\"'].Key")
        keylist = []
        for key_data in filtered_iterator:
            if "/" in key_data:
                splitted_array = key_data.split("/")
                if len(splitted_array) > 1:
                    if splitted_array[-1]:
                        keylist.append(splitted_array[-1])
            else:
                keylist.append(key_data)
        bucket_wise_list.update({bucket: keylist})
    print("Total Number Of Object = ", bucket_wise_list)
    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps(bucket_wise_list)
    }
When we execute the above Lambda function, it shows the error below:
"Calling the invoke API action failed with this message: Rate Exceeded."
Can anyone help me resolve this error and achieve my requirement?
This is probably due to your account's API rate restrictions; you should add a retry with a few seconds of delay between attempts, or increase the page size.
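A minimal sketch of that manual approach, assuming you just want to back off and retry the whole listing on a throttling error (the attempt count, delay, and the error codes checked are assumptions):

import time
from botocore.exceptions import ClientError

def paginate_with_retry(s3_paginator, max_attempts=5, delay_seconds=5, **paginate_kwargs):
    # Retry the whole listing on a throttling error, sleeping between attempts.
    for attempt in range(max_attempts):
        try:
            return list(s3_paginator.paginate(**paginate_kwargs))
        except ClientError as exc:
            throttled = exc.response.get('Error', {}).get('Code') in ('Throttling', 'SlowDown')
            if not throttled or attempt == max_attempts - 1:
                raise
            time.sleep(delay_seconds)

# usage:
# pages = paginate_with_retry(s3_paginator, Bucket=bucket, Prefix='inside-bucket-level-1/', PaginationConfig={"PageSize": 1000})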
This is most likely due to you reaching your quota limit for AWS S3 API calls. The "bigger hammer" solution is to request a quota increase, but if you don't want to do that, there is another way using botocore.Config's built-in retries, for example:
import json
import time
from datetime import datetime, timedelta
from boto3 import client
from botocore.config import Config

config = Config(
    retries={
        'max_attempts': 10,
        'mode': 'standard'
    }
)

def lambda_handler(event, context):
    s3 = client('s3', config=config)
    ### ALL OF YOUR CURRENT PYTHON CODE EXACTLY THE WAY IT IS ###
This config will use an exponentially increasing sleep timer, up to a maximum number of retries. From the docs:
Any retry attempt will include an exponential backoff by a base factor of 2 for a maximum backoff time of 20 seconds.
There is also an adaptive mode, which is still experimental. For more info, see the docs on botocore.Config retries.
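If you want to try the experimental adaptive mode, the only change is the mode string (shown here as a sketch of the config object only):

config = Config(
    retries={
        'max_attempts': 10,
        'mode': 'adaptive'   # experimental client-side rate limiting
    }
)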
Another (much less robust, in my opinion) option would be to write your own paginator with a sleep programmed in, though you'd probably just want to use the built-in backoff in 99.99% of cases (even if you do have to write your own paginator). Note that this code is untested and isn't asynchronous, so the sleep is in addition to the wait time for a page response; to make the "sleep time" exactly sleep_secs, you'd need to use concurrent.futures or asyncio (the AWS built-in paginators mostly use concurrent.futures):
from boto3 import client
from typing import Generator
from time import sleep

def get_pages(bucket: str, prefix: str, page_size: int, sleep_secs: float) -> Generator:
    s3 = client('s3')
    page: dict = s3.list_objects_v2(
        Bucket=bucket,
        MaxKeys=page_size,
        Prefix=prefix
    )
    next_token: str = page.get('NextContinuationToken')
    yield page
    while next_token:
        sleep(sleep_secs)
        page = s3.list_objects_v2(
            Bucket=bucket,
            MaxKeys=page_size,
            Prefix=prefix,
            ContinuationToken=next_token
        )
        next_token = page.get('NextContinuationToken')
        yield page
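A quick usage sketch of that generator, collecting every key it returns (the bucket and prefix names here are just placeholders):

all_keys = []
for page in get_pages('my-bucket', 'inside-bucket-level-1/', 1000, 1.0):
    for obj in page.get('Contents', []):
        all_keys.append(obj['Key'])
print(len(all_keys), "keys listed")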

How to handle timeouts in a Python Lambda?

I know this has been asked before, but no real solution was proposed, and I was wondering if there are any new ways nowadays.
Is there any way to hook an event using any AWS service to check whether a Lambda has timed out? It logs to CloudWatch Logs that it timed out, so there must be a way.
Specifically in Python, because it's not so simple to keep checking whether it's reaching the 20-minute mark as it is with JavaScript and other naturally concurrent languages.
Ideally I want to execute a lambda if the python lambda times out, with the same payload the original one received.
Here's an example from cloudformation-custom-resources/lambda/python · GitHub showing how an AWS Lambda function written in Python can realise that it is about to time out.
(I've edited out the other stuff, here's the relevant bits):
import signal

def handler(event, context):
    # Setup alarm for remaining runtime minus a second
    # (signal.alarm expects an int, so cast the seconds value)
    signal.alarm(int(context.get_remaining_time_in_millis() / 1000) - 1)
    # Do other stuff
    ...

def timeout_handler(_signal, _frame):
    '''Handle SIGALRM'''
    raise Exception('Time exceeded')

signal.signal(signal.SIGALRM, timeout_handler)
I want to update @John Rotenstein's answer, which worked for me yet resulted in the following errors populating the CloudWatch logs:
START RequestId: ********* Version: $LATEST
Traceback (most recent call last):
  File "/var/runtime/bootstrap", line 9, in <module>
    main()
  File "/var/runtime/bootstrap.py", line 350, in main
    event_request = lambda_runtime_client.wait_next_invocation()
  File "/var/runtime/lambda_runtime_client.py", line 57, in wait_next_invocation
    response = self.runtime_connection.getresponse()
  File "/var/lang/lib/python3.7/http/client.py", line 1369, in getresponse
    response.begin()
  File "/var/lang/lib/python3.7/http/client.py", line 310, in begin
    version, status, reason = self._read_status()
  File "/var/lang/lib/python3.7/http/client.py", line 271, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/var/lang/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/var/task/lambda_function.py", line 6, in timeout_handler
    raise Exception('Time limit exceeded')
Exception: Time limit exceeded
END RequestId
So I just had to reset the signal's alarm before returning each response:
import logging
import signal

def timeout_handler(_signal, _frame):
    raise Exception('Time limit exceeded')

signal.signal(signal.SIGALRM, timeout_handler)

def lambda_handler(event, context):
    try:
        signal.alarm(int(context.get_remaining_time_in_millis() / 1000) - 1)
        logging.info('Testing stuff')
        # Do work
    except Exception as e:
        logging.error(f'Exception:\n{e}')
    signal.alarm(0)  # This line fixed the issue above!
    return {'statusCode': 200, 'body': 'Complete'}
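If, as the question asks, you also want to re-run the Lambda with the same payload when the timeout fires, a hedged sketch is to re-invoke from the except block. This assumes the function's role has lambda:InvokeFunction on itself, and note that any exception, not just the alarm, triggers the re-invoke in this simplistic version:

import json
import signal
import boto3

lambda_client = boto3.client('lambda')

def lambda_handler(event, context):
    try:
        signal.alarm(int(context.get_remaining_time_in_millis() / 1000) - 1)
        # Do work
    except Exception:
        # hand the same payload to a fresh execution before this one dies
        lambda_client.invoke(
            FunctionName=context.function_name,   # or the name of another function
            InvocationType='Event',               # asynchronous, fire-and-forget
            Payload=json.dumps(event),
        )
    signal.alarm(0)
    return {'statusCode': 200, 'body': 'Complete'}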
Two options I can think of; the first is quick and dirty, but also less ideal:
Run it in a Step Function (check out AWS Step Functions), which can retry on timeouts/errors.
A better way would be to re-architect your code to be idempotent. In this approach, the process that triggers the Lambda checks a condition and, as long as that condition is true, triggers the Lambda. The condition needs to remain true until the Lambda has finished executing its logic successfully. This can be achieved by persisting the parameters sent to the Lambda in a database table, for example, with an extra field called "processed" that is set to "true" only once the Lambda has finished running successfully for that event (see the sketch below).
Using method #2 will make your code more resilient, easy to re-run on errors, and also easy to monitor: basically all you have to do is check how many records are not yet processed and what their create/update timestamps in the DB are.
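A minimal sketch of that "processed" flag idea, assuming a DynamoDB table named lambda_jobs keyed by job_id and a payload that carries a job_id (all of these names are hypothetical):

import boto3

# hypothetical table holding one record per triggering event
table = boto3.resource('dynamodb').Table('lambda_jobs')

def do_expensive_work(event):
    # placeholder for the Lambda's actual long-running logic
    pass

def lambda_handler(event, context):
    job_id = event['job_id']        # assumed to be part of the payload
    do_expensive_work(event)
    # flip the flag only after the work has finished successfully,
    # so a timed-out run leaves the record unprocessed and gets retried
    table.update_item(
        Key={'job_id': job_id},
        UpdateExpression='SET #p = :t',
        ExpressionAttributeNames={'#p': 'processed'},
        ExpressionAttributeValues={':t': True},
    )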
If you care not only about identifying the timeout, but also about giving your Lambdas an option of a "healthy" shutdown and passing the remaining payload to another execution automatically, you may have a look at the Siblings component of the sosw package.
Here is an example use case where you call the sibling when time is running out. You pass the Sibling a pointer to where you left off in the job. For example, you may store the remaining payload in S3, and the cursor will show where you stopped processing.
You will have to grant the Role of this Lambda permission to lambda:InvokeFunction on itself.
import logging
import time

from sosw import Processor as SoswProcessor
from sosw.app import LambdaGlobals, get_lambda_handler
from sosw.components.siblings import SiblingsManager

logger = logging.getLogger()
logger.setLevel(logging.INFO)


class Processor(SoswProcessor):

    DEFAULT_CONFIG = {
        'init_clients': ['Siblings'],    # Automatically initialize Siblings Manager
        'shutdown_period': 10,           # Some time to shut down in a healthy manner.
    }

    siblings_client: SiblingsManager = None

    def __call__(self, event):
        cursor = event.get('cursor', 0)
        while self.sufficient_execution_time_left:
            self.process_data(cursor)
            cursor += 1
            if cursor == 20:
                return f"Reached the end of data"
        else:
            # Spawning another sibling to continue the processing
            payload = {'cursor': cursor}
            self.siblings_client.spawn_sibling(global_vars.lambda_context, payload=payload, force=True)
            self.stats['siblings_spawned'] += 1

    def process_data(self, cursor):
        """ Your custom logic respecting current cursor. """
        logger.info(f"Processing data at cursor: {cursor}")
        time.sleep(1)

    @property
    def sufficient_execution_time_left(self) -> bool:
        """ Return if there is a sufficient execution time for processing ('shutdown period' is in seconds). """
        return global_vars.lambda_context.get_remaining_time_in_millis() > self.config['shutdown_period'] * 1000


global_vars = LambdaGlobals()
lambda_handler = get_lambda_handler(Processor, global_vars)

Twisted getPage, exceptions.OSError: [Errno 24] Too many open files

I'm trying to run the following script with about 3000 items. The script takes the link provided by self.book and returns the result using getPage. It loops through each item in self.book until there are no more items in the dictionary.
Here's the script:
from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.web.error import Error
from twisted.internet.defer import DeferredList
import logging

from src.utilitybelt import Utility


class getPages(object):
    """ Return contents from HTTP pages """

    def __init__(self, book, logger=False):
        self.book = book
        self.data = {}
        util = Utility()
        if logger:
            log = util.enable_log("crawler")

    def start(self):
        """ get each page """
        for key in self.book.keys():
            page = self.book[key]
            logging.info(page)
            d1 = getPage(page)
            d1.addCallback(self.pageCallback, key)
            d1.addErrback(self.errorHandler, key)
            dl = DeferredList([d1])
            # This should stop the reactor
            dl.addCallback(self.listCallback)

    def errorHandler(self, result, key):
        # Bad thingy!
        logging.error(result)
        self.data[key] = False
        logging.info("Appended False at %d" % len(self.data))

    def pageCallback(self, result, key):
        ########### I added this, to hold the data:
        self.data[key] = result
        logging.info("Data appended")
        return result

    def listCallback(self, result):
        # print result
        # Added for effect:
        if reactor.running:
            reactor.stop()
        logging.info("Reactor stopped")
About halfway through, I experience this error:
  File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 303, in _handleSignals
  File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 205, in __init__
  File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 138, in __init__
exceptions.OSError: [Errno 24] Too many open files
libgcc_s.so.1 must be installed for pthread_cancel to work
libgcc_s.so.1 must be installed for pthread_cancel to work
As of right now, I'll try to run the script with fewer items to see if that resolves the issue. However, there must be a better way to do it, and I'd really like to learn.
Thank you for your time.
It looks like you are hitting the open file descriptor limit (ulimit -n), which is likely to be 1024.
Each new getPage call opens a new file descriptor, which maps to the client TCP socket opened for the HTTP request. You might want to limit the number of getPage calls you run concurrently (see the sketch below). Another way around it is to raise the file descriptor limit for your process, but then you might still exhaust ports or FDs if self.book grows beyond ~32K items.
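A hedged sketch of one way to cap the concurrency, using Twisted's DeferredSemaphore; the limit of 50 is an arbitrary choice, and the rest of the class is assumed to stay as in the question:

from twisted.internet.defer import DeferredSemaphore, DeferredList
from twisted.web.client import getPage

# allow at most 50 requests (and therefore client sockets) in flight at once
semaphore = DeferredSemaphore(50)

class getPages(object):
    # ... __init__, pageCallback, errorHandler, listCallback as in the question ...

    def start(self):
        """ get each page, at most 50 at a time """
        deferreds = []
        for key, page in self.book.items():
            d = semaphore.run(getPage, page)   # acquire, run getPage, release
            d.addCallback(self.pageCallback, key)
            d.addErrback(self.errorHandler, key)
            deferreds.append(d)
        # a single DeferredList over every request, so the reactor is stopped
        # only after all of them have finished
        dl = DeferredList(deferreds)
        dl.addCallback(self.listCallback)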

Uploading video to YouTube and adding it to playlist using YouTube Data API v3 in Python

I wrote a script to upload a video to YouTube using the YouTube Data API v3 in Python, with the help of the example given in Example code.
I wrote another script to add the uploaded video to a playlist using the same YouTube Data API v3; it can be seen here.
After that I wrote a single script to upload a video and add it to a playlist. In it I took care of authentication and scopes, but I am still getting a permission error. Here is my new script:
#!/usr/bin/python

import httplib
import httplib2
import os
import random
import sys
import time

from apiclient.discovery import build
from apiclient.errors import HttpError
from apiclient.http import MediaFileUpload
from oauth2client.file import Storage
from oauth2client.client import flow_from_clientsecrets
from oauth2client.tools import run

# Explicitly tell the underlying HTTP transport library not to retry, since
# we are handling retry logic ourselves.
httplib2.RETRIES = 1

# Maximum number of times to retry before giving up.
MAX_RETRIES = 10

# Always retry when these exceptions are raised.
RETRIABLE_EXCEPTIONS = (httplib2.HttpLib2Error, IOError, httplib.NotConnected,
                        httplib.IncompleteRead, httplib.ImproperConnectionState,
                        httplib.CannotSendRequest, httplib.CannotSendHeader,
                        httplib.ResponseNotReady, httplib.BadStatusLine)

# Always retry when an apiclient.errors.HttpError with one of these status
# codes is raised.
RETRIABLE_STATUS_CODES = [500, 502, 503, 504]

CLIENT_SECRETS_FILE = "client_secrets.json"

# A limited OAuth 2 access scope that allows for uploading files, but not other
# types of account access.
YOUTUBE_UPLOAD_SCOPE = "https://www.googleapis.com/auth/youtube.upload"
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

# Helpful message to display if the CLIENT_SECRETS_FILE is missing.
MISSING_CLIENT_SECRETS_MESSAGE = """
WARNING: Please configure OAuth 2.0
To make this sample run you will need to populate the client_secrets.json file
found at:
%s
with information from the APIs Console
https://code.google.com/apis/console#access
For more information about the client_secrets.json file format, please visit:
https://developers.google.com/api-client-library/python/guide/aaa_client_secrets
""" % os.path.abspath(os.path.join(os.path.dirname(__file__),
                                   CLIENT_SECRETS_FILE))


def get_authenticated_service():
    flow = flow_from_clientsecrets(CLIENT_SECRETS_FILE, scope=YOUTUBE_UPLOAD_SCOPE,
                                   message=MISSING_CLIENT_SECRETS_MESSAGE)
    storage = Storage("%s-oauth2.json" % sys.argv[0])
    credentials = storage.get()
    if credentials is None or credentials.invalid:
        credentials = run(flow, storage)
    return build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,
                 http=credentials.authorize(httplib2.Http()))


def initialize_upload(title, description, keywords, privacyStatus, file):
    youtube = get_authenticated_service()
    tags = None
    if keywords:
        tags = keywords.split(",")
    insert_request = youtube.videos().insert(
        part="snippet,status",
        body=dict(
            snippet=dict(
                title=title,
                description=description,
                tags=tags,
                categoryId='26'
            ),
            status=dict(
                privacyStatus=privacyStatus
            )
        ),
        # chunksize=-1 means that the entire file will be uploaded in a single
        # HTTP request. (If the upload fails, it will still be retried where it
        # left off.) This is usually a best practice, but if you're using Python
        # older than 2.6 or if you're running on App Engine, you should set the
        # chunksize to something like 1024 * 1024 (1 megabyte).
        media_body=MediaFileUpload(file, chunksize=-1, resumable=True)
    )
    vid = resumable_upload(insert_request)

    # Here I added lines to add video to playlist
    # add_video_to_playlist(youtube, vid, "PL2JW1S4IMwYubm06iDKfDsmWVB-J8funQ")
    # youtube = get_authenticated_service()
    add_video_request = youtube.playlistItems().insert(
        part="snippet",
        body={
            'snippet': {
                'playlistId': "PL2JW1S4IMwYubm06iDKfDsmWVB-J8funQ",
                'resourceId': {
                    'kind': 'youtube#video',
                    'videoId': vid
                }
                # 'position': 0
            }
        }
    ).execute()


def resumable_upload(insert_request):
    response = None
    error = None
    retry = 0
    vid = None
    while response is None:
        try:
            print "Uploading file..."
            status, response = insert_request.next_chunk()
            if 'id' in response:
                print "'%s' (video id: %s) was successfully uploaded." % (
                    title, response['id'])
                vid = response['id']
            else:
                exit("The upload failed with an unexpected response: %s" % response)
        except HttpError, e:
            if e.resp.status in RETRIABLE_STATUS_CODES:
                error = "A retriable HTTP error %d occurred:\n%s" % (e.resp.status,
                                                                     e.content)
            else:
                raise
        except RETRIABLE_EXCEPTIONS, e:
            error = "A retriable error occurred: %s" % e
        if error is not None:
            print error
            retry += 1
            if retry > MAX_RETRIES:
                exit("No longer attempting to retry.")
            max_sleep = 2 ** retry
            sleep_seconds = random.random() * max_sleep
            print "Sleeping %f seconds and then retrying..." % sleep_seconds
            time.sleep(sleep_seconds)
    return vid


if __name__ == '__main__':
    title = "sample title"
    description = "sample description"
    keywords = "keyword1,keyword2,keyword3"
    privacyStatus = "public"
    file = "myfile.mp4"
    vid = initialize_upload(title, description, keywords, privacyStatus, file)
    print 'video ID is :', vid
I am not able to figure out what is wrong. I am getting a permission error, yet both scripts work fine independently.
Could anyone help me figure out where I am wrong, or how to upload a video and add it to a playlist?
I found the answer: the scope in the two independent scripts is different.
The scope for uploading is "https://www.googleapis.com/auth/youtube.upload".
The scope for adding to a playlist is "https://www.googleapis.com/auth/youtube".
Since the scopes are different, I had to handle authentication separately.
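As a hedged alternative to handling the two authentications separately: oauth2client's flow_from_clientsecrets also accepts an iterable of scopes, so a single credential covering both operations could look like the sketch below. You would need to delete the cached *-oauth2.json file and re-authorize for the new scopes to take effect.

YOUTUBE_SCOPES = [
    "https://www.googleapis.com/auth/youtube.upload",   # upload the video
    "https://www.googleapis.com/auth/youtube",          # manage playlists
]

def get_authenticated_service():
    flow = flow_from_clientsecrets(CLIENT_SECRETS_FILE, scope=YOUTUBE_SCOPES,
                                   message=MISSING_CLIENT_SECRETS_MESSAGE)
    storage = Storage("%s-oauth2.json" % sys.argv[0])
    credentials = storage.get()
    if credentials is None or credentials.invalid:
        credentials = run(flow, storage)
    return build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,
                 http=credentials.authorize(httplib2.Http()))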

Using a topic exchange to send a message from one method to another

Recently, I have been going through the Celery & Kombu documentation, as I need them integrated into one of my projects. I have a basic understanding of how this should work, but the documentation examples using different brokers have me confused.
Here is the scenario:
Within my application I have two views, ViewA and ViewB. Both of them do some expensive processing, so I wanted them to use Celery tasks for the processing. This is what I did.
views.py
def ViewA(request):
    tasks.do_task_a.apply_async(args=[a, b])

def ViewB(request):
    tasks.do_task_b.apply_async(args=[a, b])
tasks.py
@task()
def do_task_a(a, b):
    # Do something Expensive
    pass

@task()
def do_task_b(a, b):
    # Do something Expensive here too
    pass
Until now, everything has been working fine. The problem is that do_task_a creates a txt file on the system, which I need to use in do_task_b. Right now, in the do_task_b method I can check for the file's existence and call the task's retry method [which is what I am doing right now] if the file does not exist.
Here, I would rather take a different approach (i.e. where messaging comes in): I want do_task_a to send a message to do_task_b once the file has been created, instead of looping on the retry method until the file exists.
I read through the documentation of Celery and Kombu and updated my settings as follows:
BROKER_URL = "django://"
CELERY_RESULT_BACKEND = "database"
CELERY_RESULT_DBURI = "sqlite:///celery"
TASK_RETRY_DELAY = 30 #Define Time in Seconds
DATABASE_ROUTERS = ['portal.db_routers.CeleryRouter']
CELERY_QUEUES = (
    Queue('filecreation', exchange=exchanges.genex, routing_key='file.create'),
)
CELERY_ROUTES = ('celeryconf.routers.CeleryTaskRouter',)
And this is where I am stuck; I don't know where to go from here.
What should I do next to make do_task_a broadcast a message to do_task_b on file creation? And what should I do to make do_task_b receive (consume) the message and continue processing?
Any ideas and suggestions are welcome.
This is a good use case for Celery's callback/linking functionality.
Celery supports linking tasks together so that one task follows another.
You can read more about it here
The apply_async() function has two optional arguments:
link: execute the linked task on success
link_error: execute the linked task on an error
from celery.result import AsyncResult

@task
def add(a, b):
    return a + b

@task
def total(numbers):
    return sum(numbers)

@task
def error_handler(uuid):
    result = AsyncResult(uuid)
    exc = result.get(propagate=False)
    print('Task %r raised exception: %r\n%r' % (uuid, exc, result.traceback))
Now, in your calling function, do something like this:

def main():
    # for error handling
    add.apply_async((2, 2), link_error=error_handler.subtask())

    # for linking 2 tasks
    add.apply_async((2, 2), link=add.subtask((8, )))
    # output: 12

    # what you can do in your case is something like this:
    if user_requires:
        add.apply_async((2, 2), link=add.subtask((8, )))
    else:
        add.apply_async((2, 2))
Hope this is helpful
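Applied to the question's own tasks, a minimal sketch might look like the following (the file path returned by do_task_a is a hypothetical detail, as is passing it straight into do_task_b as its first argument):

@task()
def do_task_a(a, b):
    # ... do the expensive work and create the txt file ...
    file_path = "/tmp/result.txt"      # hypothetical location of the created file
    return file_path

@task()
def do_task_b(file_path):
    # runs only after do_task_a succeeded; Celery passes the parent's return
    # value in as the first argument of the linked task
    with open(file_path) as f:
        data = f.read()
    # ... do something expensive with data ...

def ViewA(request):
    # chain the two tasks instead of retrying do_task_b until the file appears
    tasks.do_task_a.apply_async(args=[a, b], link=tasks.do_task_b.subtask())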