Configuring AWS Lambda to a parallel computations for Dynamodb Streams

Configuring AWS Lambda to a parallel computations for Dynamodb Streams - amazon-web-services

I have a flask on EC2 and python 3.6 AWS Lambda architecture. When response comes to flask the new item is added to dynamoDB, which triggers Lambda that starts some process with new added item. For some strange reason it doesn't process triggers in parallel, starting new lambda function for each trigger, but processes them one by one.
I tried setting concurrency limit to maximum value, but that didn't work.
I need to get a result as fast as possible and don't manage any of scaling processes by myself. So triggers are need to be processed in parallel not one-by-one as it is now.

If you develop a Lambda function with Python, parallelism doesn’t come by default. Lambda supports Python 2.7 and Python 3.6, both of which have multiprocessing and threading modules.
On the other hand, you can use multiprocessing.Pipe instead of multiprocessing.Queue to accomplish what you need without getting any errors during the execution of the Lambda function.
Please refer below link for more details about Source Code for Parallel Execution:
https://aws.amazon.com/blogs/compute/parallel-processing-in-python-with-aws-lambda/
Also, you can refer below code:
import time
import multiprocessing
region_maps = {
"eu-west-1": {
"dynamodb":"dynamodb.eu-west-1.amazonaws.com"
},
"us-east-1": {
"dynamodb":"dynamodb.us-east-1.amazonaws.com"
},
"us-east-2": {
"dynamodb": "dynamodb.us-east-2.amazonaws.com"
}
}
def multiprocessing_func(region):
time.sleep(1)
endpoint = region_maps[region]['dynamodb']
print('endpoint for {} is {}'.format(region, endpoint))
def lambda_handler(event, context):
starttime = time.time()
processes = []
regions = ['us-east-1', 'us-east-2', 'eu-west-1']
for region in regions:
p = multiprocessing.Process(target=multiprocessing_func, args=(region,))
processes.append(p)
p.start()
for process in processes:
process.join()
output = 'That took {} seconds'.format(time.time() - starttime)
print(output)
return output
Hope this helps.

Number of parallel lambdas are controlled by number of shards you are writing to, in dynamodb.
Amazon DynamoDB, AWS Lambda polls your stream and invokes your Lambda function.
When your Lambda function is throttled, Lambda attempts to process the
throttled batch of records until the time the data expires.
This time period can be up to seven days for Amazon Kinesis.
The throttled request is treated as blocking per shard, and
Lambda doesn't read any new records from the shard until the
throttled batch of records either expires or succeeds.
If there is more than one shard in the stream,
Lambda continues invoking on the non-throttled shards until one gets through.
source
This is done to control that the events are processed in order they were done on dynamodb. But number of shards are not directly controlled by you.
Now the best thing you can do is,
set a higher Batch size in the lambda function. By doing this you will receive multiple events in the same lambda. You can have parallelism in the lambda function to process all of them together. but this will have obvious drawbacks like what if you are not able to process all of them before lambda times out. you will have to make sure that code is thread safe.

Probably writing to DynamoDB is blocking parallelism in this case.
Alternative architecture for fast and very scalable processing of items: add items to S3 bucket as files. Then trigger on S3 bucket will start Lambda. New file - new Lambda, this way only Lambda concurrency would limit how many lambdas you have in parallel.

Related

How to handle Lambda function heavy job

I have AWS lambda function that gets details using multiple ids via rest API. The problem is the API only accept 1 id at a time/per call. Per my observation, the job can only cater around 30 ids else the job won’t finish or would max my 10 mins time limit. Currently, my ids can go as high as 200 ids per job process so I’m thinking of a way how I can resolve this issue.
So far I’m thinking of using step function so I can asynchronously run the job and just chunked my ids into multiple payload but I’m not sure how I can pass ids/payload from lambda to step function. Another solution I’m thinking is I can invoke the same lambda with chunked ids but i’m afraid that recursive would happen.
Any other suggestions or AWS services I can use to fix this?

I would have a process that dumps all the IDs into an SQS queue. Then have a Lambda function that uses the SQS queue as an event source. Lambda will then automatically spin up multiple instances of your Lambda function, passing each one a batch of IDs to process.

Boto3 invocations of long-running Lambda runs break with TooManyRequestsException

Experience with "long-running" Lambda's
In my company, we recently ran into this behaviour, when triggering Lambdas, that run for > 60 seconds (boto3's default timeout for connection establishment and reads).
The beauty of the Lambda invocation with boto3 (using the 'InvocationType' 'RequestResponse') is, that the API returns the result state of the respective Lambda run, so we wanted to stick to that.
The issue seems to be, that the client fires to many requests per minute on the standing connection to the API. Therefore, we experimented with the boto3 client configuration, but increasing the read timeout resulted in new (unwanted) invocations after each timeout period and increasing the connection timeout triggered a new invocation, after the Lambda was finished.

Workaround
As various investigations and experimentation with boto3's Lambda client did not result in a working setup using 'RequestResponse' invocations,
we circumvented the problem now by making use of Cloudwatch logs. For this, the Lambda has to be setup up to write to an accessible log group. Then, these logs can the queried for the state. Then you would invoke the Lambda and monitor it like this:
import boto3
lambda_client = boto3.client('lambda')
logs_clients = boto3.client('logs')
invocation = lambda_client.invoke(
FunctionName='your_lambda',
InvocationType='Event'
)
# Identifier of the invoked Lambda run
request_id = invocation['ResponseMetadata']['RequestID']
while True:
# filter the logs for the Lambda end event
events = logs_client.filter_log_events(
logGroupName='your_lambda_loggroup',
filterPattern=f'"END RequestId: {request_id}"'
).get('events', [])
if len(events) > 0:
# the Lambda invocation finished
break
This approach works for us now, but it's honestly ugly. To make this approach slightly better, I recommend to set the time range filtering in the filter_log_events call.
One thing, that was not tested (yet): The above approach only tells, whether the Lambda terminated, but not the state (failed or successful) and the default logs don't hold anything useful in that regards. Therefore, I will investigate, if a Lambda run can know its own request id during runtime. Then the Lambda code can be prepared to also write error messages with the request id, which then can be filtered for again.

Recursive AWS lambda with updating state

I have an AWS Lambda that polls from an external server for new events every 6 hours. On every call, if there are any new events, it publishes the updated total number of events polled to a SNS. So I essentially need to call the lambda on fixed intervals but also pass a counter state across calls.
I'm currently considering the following options:
Store the counter somewhere on a EFS/S3, but it seems an
overkill for a simple number
EventBridge, which would be ok to schedule the execution, but doesn't store state across calls
A step function with a loop + wait on the the lambda would do it, but it doesn't seem to be the most efficient/cost effective way to do it
use a SQS with a delay so that the lambda essentially
triggers itself, passing the updated state. Again I don't think
this is the most effective, and to actually get to the 6 hours delay
I would have to implement some checks/delays within the lambda, as the max delay for SQS is 15 minutes
What would be the best way to do it?

For scheduling Lambda at intervals, you can use CloudWatch Events. Scheduling Lambda using Serverless framework is a breeze. A cronjob type statement can schedule your lambda call. Here's a guide on scheduling: https://www.serverless.com/framework/docs/providers/aws/events/schedule
As for saving data, you can use AWS Systems Manager Parameter Store. It's a simple Key value pair storate for such small amount of data.
https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html
OR you can also save it in DynamoDB. Since the data is small and frequency is less, you wont be charged much and there's no hassle of reading files or parsing.

Can I schedule a lambda function execution with a lambda function?

I'm looking for the ability to programmatically schedule a lambda function to run a single time with another lambda function. For example, I made a request to myFirstFunction with date and time parameters, and then at that date and time, have mySecondFunction execute. Is that possible only with stateless AWS services? I'm trying to avoid an always-on ec2 instance.
Most of the results I'm finding for scheduling a lambda functions have to do with cloudwatch and regularly scheduled events, not ad-hoc events.

This is a perfect use case for aws step functions.
Use Wait state with SecondsPath or TimestampPath to add the required delay before executing the Next State.

What you're tring to do (schedule Lambda from Lambda) it's not possible with the current AWS services.
So, in order to avoid an always-on ec2 instance, there are other options:
1) Use AWS default or custom metrics. You can use, for example, ApproximateNumberOfMessagesVisible or CPUUtilization (if your app fires a big CPU utilization when process a request). You can also create a custom metric and fire it when your instance is idle (depending on the app that's running in your instance).
The problem with this option is that you'll waste already paid minutes (AWS always charge a full-hour, no matter if you used your instance for 15 minutes).
2) A better option, in my opinion, would be to run a Lambda function once per minute to check if your instances are idle and shut them down only if they are close to the full hour.
import boto3
from datetime import datetime
def lambda_handler(event, context):
print('ManageInstances function executed.')
environments = [['instance-id-1', 'SQS-queue-url-1'], ['instance-id-2', 'SQS-queue-url-2'], ...]
ec2_client = boto3.client('ec2')
for environment in environments:
instance_id = environment[0]
queue_url = environment[1]
print 'Instance:', instance_id
print 'Queue:', queue_url
rsp = ec2_client.describe_instances(InstanceIds=[instance_id])
if rsp:
status = rsp['Reservations'][0]['Instances'][0]
if status['State']['Name'] == 'running':
current_time = datetime.now()
diff = current_time - status['LaunchTime'].replace(tzinfo=None)
total_minutes = divmod(diff.total_seconds(), 60)[0]
minutes_to_complete_hour = 60 - divmod(total_minutes, 60)[1]
print 'Started time:', status['LaunchTime']
print 'Current time:', str(current_time)
print 'Minutes passed:', total_minutes
print 'Minutes to reach a full hour:', minutes_to_complete_hour
if(minutes_to_complete_hour <= 2):
sqs_client = boto3.client('sqs')
response = sqs_client.get_queue_attributes(QueueUrl=queue_url, AttributeNames=['All'])
messages_in_flight = int(response['Attributes']['ApproximateNumberOfMessagesNotVisible'])
messages_available = int(response['Attributes']['ApproximateNumberOfMessages'])
print 'Messages in flight:', messages_in_flight
print 'Messages available:', messages_available
if(messages_in_flight + messages_available == 0):
ec2_resource = boto3.resource('ec2')
instance = ec2_resource.Instance(instance_id)
instance.stop()
print('Stopping instance.')
else:
print('Status was not running. Nothing is done.')
else:
print('Problem while describing instance.')

UPDATE - I wouldn't recommend using this approach. Things changed in when TTL deletions happen and they are not close to TTL time. The only guarantee is that the item will be deleted after the TTL. Thanks #Mentor for highlighting this.
2 months ago AWS announced DynamoDB item TTL, which allows you to insert an item and mark when you wish for it to be deleted. It will be deleted automatically when the time comes.
You can use this feature in conjunction with DynamoDB Streams to achieve your goal - your first function inserts an item to a DynamoDB table. The record TTL should be when you want the second lambda triggered. Setup a stream that triggers your second lambda. In this lambda you will identify deletion events and if that's a delete then run your logic.
Bonus point - you can use the table item as a mechanism for the first lambda to pass parameters to the second lambda.
About DynamoDB TTL:
https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/

It does depend on your use case, but the idea that you want to trigger something at a later date is a common pattern. The way I do it serverless is I have a react application that triggers an action to store a date in the future. I take the date format like 24-12-2020 and then convert it using date(), having researched that the date format mentioned is correct, so I might try 12-24-2020 and see what I get(!). When I am happy I convert it to a Unix number in javascript React I use this code:
new Date(action.data).getTime() / 1000
where action.data is the date and maybe the time for the action.
I run React in Amplify (serverless), I store that to dynamodb (serverless). I then run a Lambda function (serverless) to check my dynamodb for any dates (I actually use the Unix time for now) and compare the two Unix dates now and then (stored) which are both numbers, so the comparison is easy. This seems to me to be super easy and very reliable.
I just set the crontab on the Lambda to whatever is needed depending on the approximate frequency required, in most cases running a lambda every five minutes is pretty good, although if I was only operating this in a certain time zone for a business weekday application I would control the Lambda a little more. Lambda is free for the first 1m functions per month and running it every few minutes will cost nothing. Obviously things change, so you will need to look that up in your area.
You will never get perfect timing in this scenario. It will, however, for the vast majority of use cases be close enough according to the timing settings of the Lambda function, you could set it up to check every minute or just once per day, it all depends on your application.
Alternatively, If I wanted an instant reaction to an event I might use SMS, SQS, or Kinesis to instantly stream a message, it all depends on your use case.

I'd opt for enqueuing deferred work to SQS using message timers in myFirstFunction.
Currently, you can't use SQS as a Lambda event source, but you can either periodically schedule mySecondFunction to check the queue via scheduled CloudWatch Events (somewhat of a variant of the other options you've found) or use a CloudWatch alarm on the ApproximateNumberOfMessagesVisible to fire an SNS message to a Lambda and avoid constant polling for queues that are frequently inactive for long periods.

Lambda : Is any Batch processing scheduler available?

Problem : Fetch 2000 items from Dynamo DB and process(Create a POST req from 100 items) it batch by batch (Batch size = 100).
Question : Is there anyway that I can achieve it from any configuration in AWS.
PS : I've configured a cron schedule to run my Lambda function. I'm using Java. I've made multi-threaded application which synchronously does so, but this eventually increases my computation time drastically.

I have the same problem and thinking of solving it in following way. Please let me know if you try it.
Schedule Job to fetch N items from DynamoDB using Lambda function
Lambda function in #1 will submit M messages to SQS to process each
item and trigger lambda functions, in this case it should call
lambda functions M times Each lambda function will process request
given in the message
In order to achieve this you need to schedule an event via CloudWatch, setup SQS and create lambda function triggered by SQS events.
Honestly, I am not sure if this is price effective but it should be working. Assuming your fetch size is so low, this should be reasonable.
Also you can try using SNS in this case you don't need to worry about SQS message polling.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js