Exceeding the Lambda execution timeout after 15 minutes - amazon-web-services

I am exceeding the Lambda execution timeout of 15 minutes.
The reason is that I have long-running operations that insert and update thousands of rows of data in Salesforce.
Below is the code, which executes each call one by one:
sfdc_ops.insert_case_records(records_to_insert_df, sf)
sfdc_ops.update_case_records(records_to_update_df, sf)
sfdc_ops.update_case_records(unprocessed_in_IKM_df, sf)
sfdc_ops.update_case_records(processed_in_IKM_df, sf)
I ultimately do not need to wait for each line. What I really want to do is launch all 4 of these update and insert processes at once.
What is the best solution for avoiding this 15-minute limit - Step Functions?

It is recommended to use AWS Step Functions for long-running jobs such as ETL. You could split the four insert/update operations into separate Lambda functions and orchestrate them with a Step Functions Parallel state so they run concurrently.
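For illustration only (none of these names come from the question), a Parallel state definition could be built as a Python dict and serialized into the state machine definition; each branch invokes one of the four split Lambda functions, and every ARN below is a placeholder:

import json

# Hypothetical Amazon States Language definition with a single Parallel state.
LAMBDA_ARNS = {
    "InsertCases": "arn:aws:lambda:us-east-1:123456789012:function:insert-cases",
    "UpdateCases": "arn:aws:lambda:us-east-1:123456789012:function:update-cases",
    "UpdateUnprocessedIKM": "arn:aws:lambda:us-east-1:123456789012:function:update-unprocessed-ikm",
    "UpdateProcessedIKM": "arn:aws:lambda:us-east-1:123456789012:function:update-processed-ikm",
}

definition = {
    "StartAt": "RunSalesforceOps",
    "States": {
        "RunSalesforceOps": {
            "Type": "Parallel",
            "End": True,
            "Branches": [
                {
                    "StartAt": name,
                    "States": {name: {"Type": "Task", "Resource": arn, "End": True}},
                }
                for name, arn in LAMBDA_ARNS.items()
            ],
        }
    },
}

print(json.dumps(definition, indent=2))

Each branch still runs inside a single Lambda invocation, so each individual Salesforce operation must finish within 15 minutes, but the overall workflow is no longer bound by one function's timeout.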

I'd rethink your approach. If you're exceeding the 15-minute limit, then one Lambda function isn't going to work for you.
Can you break it down into smaller pieces of functionality? One Lambda to do the inserts, one to do the updates, and orchestrate them using Step Functions, perhaps?
Have you looked at AWS Batch for batch processing instead of using Lambda functions?

Related

Dividing tasks into AWS Step Functions and then joining them back when all are completed

We have an AWS Step Function that processes CSV files. These CSV files can contain anywhere from 1 to 4,000 records.
Now, I want to create another, inner AWS Step Function that will process these CSV records. The problem is that for each record I need to hit another API, and I want all of the records to be processed asynchronously.
For example - a CSV is received with 2,500 records.
The outer step function calls another step function 2,500 times (the inner step function takes one CSV record as input), processes it, and then stores the result in DynamoDB or some other place.
I have learnt about the callback pattern in AWS Step Functions, but in my case I would be passing 2,500 tokens, and I want the outer step function to continue only when all 2,500 records are done processing.
So my question is: is this possible using AWS Step Functions?
If you know any article or guide for me to reference then that would be great.
Thanks in advance
It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
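As a rough illustration (the field names and the ARN below are placeholders, not taken from the question), a Map state expressed as a Python dict could look like this:

import json

# Hypothetical Map state: iterates over the "records" array in the state
# input, running one sub-workflow (here a single Lambda task) per record.
definition = {
    "StartAt": "ProcessRecords",
    "States": {
        "ProcessRecords": {
            "Type": "Map",
            "ItemsPath": "$.records",  # the JSON array to iterate over
            "MaxConcurrency": 40,      # cap on parallel iterations
            "Iterator": {
                "StartAt": "ProcessOneRecord",
                "States": {
                    "ProcessOneRecord": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-record",
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

print(json.dumps(definition, indent=2))

Because the Map state collects every iteration's output into a single array, the outer workflow resumes automatically once all records are done, with no callback tokens needed.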
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).

How can I schedule a CloudWatch rule at the second level?

I am trying to set up a CloudWatch rule to trigger a Lambda based on this doc: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html.
What I'd like to do is trigger a Lambda at the 3rd second of every 5th minute. For example, I want to trigger it at:
00:00:03
00:05:03
00:10:03
...
but I can't find a way to configure seconds in the cron expression. Is there any solution to that?
Cron only supports a minimum granularity of one minute, so seconds cannot be configured in the cron expression. You can take a hybrid approach: execute your Lambda function every 5 minutes and handle the 3rd-second logic inside the function with a sleep:
import time

def lambda_handler(event, context):
    # The rule fires at the start of the minute; wait until roughly the 3rd second.
    time.sleep(3)
    # Now execute your logic
I think timing to the second level is nearly impossible, but it can be approximated as follows (a sketch is shown after this list):
Initiate the function every minute via a cron expression.
Defer execution of the processing logic by sleeping for 1-3 seconds if the seconds portion of the current time is under 3.
Skip the entire processing logic if, at initiation time, the seconds portion of the current time is above some high number (say, in the 50s), if that suits your needs; a threshold of 59 means never skipping.
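A minimal sketch of that idea (the skip threshold and return values are illustrative assumptions; the handler works whether the rule fires every minute or every 5 minutes):

import time
from datetime import datetime, timezone

def lambda_handler(event, context):
    second = datetime.now(timezone.utc).second

    if second > 50:   # invoked unusually late; skip this run entirely
        return {"skipped": True}

    if second < 3:    # invoked early; wait until roughly second 3
        time.sleep(3 - second)

    # ... processing logic goes here ...
    return {"skipped": False}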

AWS dynamodb loop putItem vs batchWrite

Hi, I was just wondering if someone could give me some clarification on the benefits of batchWrite.
If we have, let's say, 36 items we want to write to DynamoDB, and I'm using an AWS Lambda function, the way I see it I have two options (pseudocode):
Option one
for item in items:
    putItem(item)
Option two
batch = []
for item in items:
    batch.append(item)
    if len(batch) == 25:
        batchWrite(batch)
        batch = []
if batch:  # flush any remaining items
    batchWrite(batch)
I feel like option one is quick and dirty, but if my items would rarely go over 100, is it that bad (would I time out my Lambda, etc.)?
Anyway, best-practice clarification on this would be great.
For both of your options, you should implement some kind of parallel send for better control and performance, like Promise.allSettled in JavaScript or asyncio-style parallelism in Python.
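For example, since boto3 calls are synchronous, one way to parallelize them from Python is a thread pool (a sketch assuming a hypothetical table named my-table; aioboto3 would be the asyncio equivalent):

from concurrent.futures import ThreadPoolExecutor
import boto3

# The low-level client is documented as thread-safe, unlike resource objects.
client = boto3.client("dynamodb")
TABLE_NAME = "my-table"  # hypothetical table name

def put_one(item):
    # item must already be in DynamoDB attribute-value format,
    # e.g. {"pk": {"S": "abc"}, "count": {"N": "1"}}.
    return client.put_item(TableName=TABLE_NAME, Item=item)

def put_all(items):
    # Send the puts in parallel and collect any per-item exception (or None).
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = [pool.submit(put_one, item) for item in items]
    return [f.exception() for f in futures]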
As mentioned in the comments above, using batchWrite does in fact reduce the number of requests made, but one critical limitation is that it only accepts up to 25 requests and 16 MB of total data per batch request. If you expect to exceed this limit, you should implement a way to divide the requests into multiple batches so that each batch stays under 25 requests and 16 MB.
Using putItem is simpler than using batchWrite, as it doesn't have the limitations mentioned above, but again, you will initiate a lot of API requests.
Neither method affects the cost: according to their pricing description, AWS does not charge you by the number of API calls made but by the actual data written to and read from the DynamoDB table, measured as WCUs and RCUs respectively. By the way, data transfer in is not charged.
In conclusion, for both putItem and batchWrite, what you have to consider is (1) how to implement the requests and handle retries in case of errors, and (2) the number of records to be inserted. A sketch of the chunking approach follows.
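As a concrete sketch (assuming boto3 and a hypothetical table name), the resource-level batch_writer handles both the 25-item chunking and the resending of unprocessed items for you:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")  # hypothetical table name

def write_items(items):
    # batch_writer buffers the puts, flushes them in chunks of up to 25,
    # and automatically resends any UnprocessedItems returned by DynamoDB.
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)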

Callback for Multiple AWS Lambda execution completion

Here is my problem:
I have to compare a candidate object, against some criteria, with millions of other candidates in the database. Since Lambda allows only 5 minutes of execution, it causes a timeout.
My Solution:
I planned to do this comparison in chunks of 10,000 candidates, so I have to call 10 Lambda functions (through SNS) to process 100,000 candidates and then save the results of each Lambda in a DynamoDB table. But how do I get a callback when all the Lambda functions are done processing, so that I can collect the individual results and calculate the final result? How can I achieve this, or is there a better way to achieve my goal? Any help is most appreciated.
I'm not sure if AWS Lambda is truly a good fit for your use case. However, just focusing on the main part of your question, you could use DynamoDB atomic counters to determine when all processing is complete. You would do the following:
Initially, insert a record in DynamoDB with a numberOfLambdaCalls attribute set to the number of concurrent executions you are kicking off and a completedLambdaCalls attribute set to 0.
As each function completes, it atomically increments the completedLambdaCalls attribute as part of updating the DynamoDB record.
Each function can check the returned result of the update to see whether it was the one to complete the processing (i.e. completedLambdaCalls == numberOfLambdaCalls) and, if so, perform whatever action is necessary to trigger your response, as sketched below.
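A minimal sketch of that counter update with boto3 (the table name, key, and helper name are assumptions; the attribute names follow the steps above):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("job-status")  # hypothetical tracking table

def mark_chunk_done(job_id):
    # Atomically increment the completed counter and read back both values.
    result = table.update_item(
        Key={"jobId": job_id},
        UpdateExpression="ADD completedLambdaCalls :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="ALL_NEW",
    )
    attrs = result["Attributes"]
    # True only for the invocation that finished last.
    return attrs["completedLambdaCalls"] == attrs["numberOfLambdaCalls"]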

What is the most efficient way to perform a large and slow batch job on GAE

Say I have retrieved a list of objects from NDB. I have a method that I can call to update the state of these objects, which I have to do every 15 minutes. These updates take ~30 seconds each due to the API calls they have to make.
How would I go ahead and process a list of >1,000 objects?
Example of an approach that would be very slow:
my_objects = [...]  # list of objects to process
for obj in my_objects:
    obj.process_me()  # takes around 30 seconds
    obj.put()
Two options:
You can run a task with a query cursor that processes only N entities each time. When those are processed and there are more entities to go, you fire another task with the next query cursor (see the sketch after this list). Resources: query cursor, tasks.
You can run a MapReduce job that will go over all entities in your query in a parallel manner (might require more resources). Simple tutorial: MapReduce on App Engine made easy.
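A rough sketch of the cursor-chained task approach (MyModel, its process_me method, and the batch size are stand-ins for the question's model, not a tested implementation):

from google.appengine.ext import deferred, ndb

BATCH_SIZE = 10  # keep each task comfortably inside its own deadline

class MyModel(ndb.Model):
    # Placeholder for the question's NDB model.
    state = ndb.StringProperty()

    def process_me(self):
        # Stands in for the ~30-second update from the question.
        pass

def process_batch(cursor=None):
    objects, next_cursor, more = MyModel.query().fetch_page(
        BATCH_SIZE, start_cursor=cursor)

    for obj in objects:
        obj.process_me()
    ndb.put_multi(objects)

    if more and next_cursor:
        # Chain the next task so the work continues past this task's deadline.
        deferred.defer(process_batch, cursor=next_cursor)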
You might consider using MapReduce for your purposes. When I wanted to update all of my >15,000 entities, I used mapreduce:
from mapreduce import operation as op

def process(entity):
    # update the entity here...
    yield op.db.Put(entity)