How to clean up state post lambda time out - amazon-web-services

We use APIG and Lambda to process long-running jobs. These jobs have an id which needs to be unique. To catch duplicate job submissions, the /createJob Lambda checks whether a job already exists (if not, it adds an entry to the db) and then sends a request to schedule that job.
We had an issue where an entry was made in the db, but the Lambda that executes /createJob terminated before the request could be sent. We believe it was due to some network latency.
Though it's a rare event, we wanted to check whether there are mechanisms available for rollback (i.e. deleting that entry from the db if the Lambda fails to finish).

Even though you have not shared the lambda code, I think it is better to commit to the DB after the major steps in lambda are completed rather than before everything else. This way, if there is a failure for some reason, there will be no entry made to the DB and you don't need to rollback anything.
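A minimal sketch of that ordering, under stated assumptions: a plain dict stands in for the DB table, and `schedule_job` is a hypothetical callable that sends the scheduling request. Neither name comes from the question.

```python
class SchedulingError(Exception):
    """Raised when the (hypothetical) scheduling request fails."""


def create_job(job_id, db, schedule_job):
    """Record a job only after the scheduling request has succeeded.

    `db` is a plain dict standing in for the real table and
    `schedule_job` is a hypothetical callable; both are assumptions.
    """
    if job_id in db:
        return "duplicate"

    # Do the step that can fail or time out first. If it raises (or the
    # function is killed here), nothing has been written yet.
    schedule_job(job_id)

    # Commit last, so a failure above leaves no entry to roll back.
    db[job_id] = {"status": "scheduled"}
    return "created"
```

Note the trade-off: two concurrent submissions of the same id could both pass the existence check before either writes, so this ordering alone does not replace a conditional write.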

Related

Is there a way to pass along values between retries in Lambda?

For instance, if try one failed, can we pass a few parameters to the event object of the next retry with something like the below?
event.somevariable = somevalue
If we want to do something like that, what could our options be?
I'm not a fan of Lambda retries. They are run exactly the same as the initial call and if it failed the first time, it will fail on both of the subsequent retries. What changes?
I'm going to assume that you want to pass along a variable to track which retry is being executed and potentially make changes so that the subsequent retries do succeed - this does make sense. However, unfortunately, you need to look outside of lambda to make this happen.
DynamoDB is one commonly used method for tracking the event ID and the number of executions; however, I personally find that to be a faff.
I'd rather use Amazon SNS to ping an HTTP endpoint on failure, then re-execute my lambda function with different parameters. Just be mindful (in all cases) of idempotency. You should be able to re-execute a lambda multiple times without it causing issues or overwriting what was intended to happen.
There's no way to do that directly in AWS.
You could use the request ID as a primary key in a DynamoDB table where you store that value, and always look for those values in DynamoDB at the start of a request.
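A rough sketch of that lookup, assuming (as the answer does) that every retry of one event carries the same request ID. `AttemptStore` is a hypothetical in-memory stand-in for the DynamoDB table; in a real function the counter would have to live in the table, since module state is not shared between containers.

```python
from types import SimpleNamespace


class AttemptStore:
    """In-memory stand-in (an assumption) for a table keyed by request ID."""

    def __init__(self):
        self._attempts = {}

    def record_attempt(self, request_id):
        # With DynamoDB this would be an atomic counter update on the item.
        self._attempts[request_id] = self._attempts.get(request_id, 0) + 1
        return self._attempts[request_id]


store = AttemptStore()


def lambda_handler(event, context):
    # Look up (and bump) the attempt count at the start of every invocation.
    attempt = store.record_attempt(context.aws_request_id)
    if attempt == 1:
        # Simulated failure so the retry path can be seen in this sketch.
        raise RuntimeError("first attempt fails in this sketch")
    return {"attempt": attempt, "payload": event}
```

For local experiments, `SimpleNamespace(aws_request_id="req-1")` can play the role of the Lambda context object.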

How can I track the progress/status of an asynchronous AWS Lambda invocation?

I have an API which I use to trigger AWS Lambda jobs. Upon request, the API invokes an AWS Lambda job with InvocationType='Event'. Hereafter, I want to periodically poll if the AWS Lambda job has finished.
The way that would fit best to my architecture, is to store an identifier of the Lambda job in a database and periodically check if the job is finished and what its output is. However, I was not able to find how I can do this.
How can I periodically poll for the result of an AWS Lambda job, and view the output once it has finished?
I have looked into using InvocationType='RequestResponse', but this requires me to store a future, which I cannot do in a database.
There's no built-in way to check for the status of an asynchronous Lambda invocation.
Asynchronous Lambda invocation, using the event invocation type, is meant to be a fire and forget job. As such, there's no 'progress' or 'status' to get or poll for.
As you don't want to wait for the Lambda to complete, synchronous Lambda invocation is out of the picture. In this case, you need to write your own logic to keep track of the status.
One way you could do this is to store a (job) item in a DynamoDB jobs table with 2 attributes:
jobId UUID (String attribute, set as the partition key)
completed boolean flag (Boolean attribute)
Workflow is then as follows:
1) Within your API, create & store a new job with completed defaulting to 'false'
2) Pass the newly-created jobId to the Lambda being invoked in the payload
3) When the Lambda finishes, look up the job associated with the passed-in jobId within the jobs table & set the completed attribute of the job to true
You can then periodically poll for the result of the job within the DynamoDB table.
Or take a look at using DynamoDB Streams as a way to know when a job finishes in near-real time without polling.
As to viewing the 'output', AWS Lambda just returns a success response without additional information. There is no 'output'. Store any output you might need in persistent storage - maybe an extra output attribute as a String with each job? - & later retrieve it.
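The workflow above can be sketched as follows. This is an illustration only: the `jobs` dict stands in for the DynamoDB jobs table, and the function names and the `output` attribute are assumptions made for the example.

```python
import uuid

jobs = {}  # stand-in for the DynamoDB jobs table, keyed by jobId


def create_job():
    """API side: store a new job with completed defaulting to False."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"completed": False, "output": None}
    return job_id


def worker_handler(event, context=None):
    """Lambda side: do the work, then mark the job as completed."""
    job_id = event["jobId"]
    output = str(event["value"] * 2)  # placeholder for the real work
    jobs[job_id].update(completed=True, output=output)


def poll_job(job_id):
    """Poller side: return (completed, output) for the job."""
    job = jobs[job_id]
    return job["completed"], job["output"]
```

The API would call `create_job()`, pass the returned id in the payload of the asynchronous invocation, and later call `poll_job()` until the first element of the result is true.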
@Ermiya Eskandary's answer is absolutely right.
I am a DynamoDB subject matter expert and have implemented this status tracking pattern (along with error handling, retries, and error logging) for many of my customers.
You could check out the pynamodb_mate library; it has the status tracker pattern implemented, and you can enable it with around 15 lines of code.
In general, when you say you want status tracking, you are talking about the following:
Each task should be handled by only one worker, so you want a concurrency lock mechanism to avoid double consumption. (A lot of people aren't aware of this; it is usually discussed under idempotency.)
For tasks that succeeded, store additional information such as the output of the task, and log the success time.
For tasks that failed, log the error message for debugging, so you can fix the bug and rerun the task.
For tasks that failed, you want to get all of them with one simple query and rerun them with the updated business logic.
For tasks that failed too many times, you don't want to retry them anymore and want to ignore them. (A lot of people run into an endless loop when they deploy to production, and only then realize that this is a necessary feature.)
Run custom queries based on task status for analytics purposes.
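To make the requirements above concrete, here is a plain-Python sketch. It is not pynamodb_mate's API or implementation; the `tasks` dict, `track_task`, `MAX_RETRIES`, and the status names are all assumptions chosen for illustration, and a real lock would need an atomic conditional write in the database.

```python
from contextlib import contextmanager

MAX_RETRIES = 3  # assumed limit before a task is ignored


class TaskNotRunnableError(Exception):
    """The task is locked, already succeeded, or has been ignored."""


tasks = {}  # in-memory stand-in for the status table


@contextmanager
def track_task(task_id):
    task = tasks.setdefault(
        task_id, {"status": "pending", "retries": 0, "error": None}
    )
    if task["status"] in ("in_progress", "succeeded", "ignored"):
        raise TaskNotRunnableError(task_id)
    task["status"] = "in_progress"  # acquire the lock
    try:
        yield task
    except Exception as exc:
        # Log the error and count the failure; give up after MAX_RETRIES.
        task["retries"] += 1
        task["error"] = repr(exc)
        task["status"] = (
            "ignored" if task["retries"] >= MAX_RETRIES else "failed"
        )
        raise
    else:
        task["status"] = "succeeded"


def failed_tasks():
    """One simple query for every task that can still be rerun."""
    return [tid for tid, t in tasks.items() if t["status"] == "failed"]
```

A worker would wrap its work in `with track_task(task_id) as task:` and store any output on `task` before the block exits.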
You can read this Jupyter notebook example.
Basically, with pynamodb_mate your lambda job application code becomes:
# this is your lambda application code
def lambda_handler(...):
    ...

# your new code should be:
with tracker.start_job():
    lambda_handler()
If your application code is not Python, then you have two options:
Create another lambda function that invokes the original one in sync mode. However, you pay more money to run the "caller" lambda function.
Suppose your lambda code is in Node.js: add an additional lambda runtime as a layer and wrap a Python function around your Node.js code. In short, you are using Python to call Node.js.

Race condition for Microservice architecture [CosmosDB]

We have a microservice-based architecture. Let's say the frontend and backend are completely isolated. The backend microserviceA exposes a REST endpoint which basically calls a thirdParty service and updates a record in CosmosDB. This microservice is deployed on a Kubernetes cluster and hence can run as multiple replicas for load balancing. As mentioned before, the frontEnd is isolated and consumes the exposed endpoint.
Problem:
The frontEnd has been written in such a manner that if the response is not obtained within a certain time frame, or if a network failure occurs, it retries the endpoint. It has been observed that in some rare scenarios (it doesn't matter which) the UI makes multiple calls (mostly 2) one after another, milliseconds apart. This is where the race condition in the backend logic comes in.
If the first call reaches ThirdParty first and obtains a success response, the second call will get a failure (because the first one was already a success). We cannot change the behaviour of ThirdParty.
Taking the above scenario as the base, if the second call (the failed one) updates the DB first and reaches the UI, the UI takes this as a failure (whereas the first call was already a success) and takes the failure actions.
If the success call makes it to the UI first, everything works fine.
Possible solution I can think of:
1)
Put a cache as source of truth.
apiCall : Status
If (entry not present in cache) {
    Put Entry in cache With Status NULL or Something with specific TTL
    (acquire lock on specific entry) {
        If (status is success) return successResponse.
        MAKE ThirdParty Call
        Update DB
        Update cache
        Release LOCK
    }
} else {
    (acquire lock on specific entry) {
        MAKE ThirdParty Call
        Update DB
        Update cache
        Release LOCK
    }
}
It seems the else block will never be executed.
2)
Only in case of failure, instead of updating the DB, call thread.sleep(10000) a couple of times in the hope that another thread will update the DB with a success response.
If there is still no success, return a failure response and update the DB.
3)
Put a poller on the UI side. If the result is a failure, poll a couple more times in the hope that the status changes. If not, take the failure actions.
4)
Optimistic locking for the Cosmos record.
https://cosmosdb.github.io/labs/dotnet/labs/10-concurrency-control.html
Not sure how this can help.
Let's say both API calls read the record when the version was 0.
Now the second API call updates the DB record; as the version has not changed,
the update succeeds.
Now the DB holds Failure as its value.
The first API call tries to update the record and finds a version mismatch,
so the update does not go through, and another attempt is made to update the DB because the call was a success.
In case of failure, no further attempts to update the DB are made.
Still, the second API call's response reaches the UI first, and the UI again takes the failure action.
The UI requires a poller in such cases.
But if the UI requires a poller, why do we need optimistic locking in the first place? :)
I don't know CosmosDB's functionality well. If Cosmos provides some functionality to handle this, please be kind enough to share.
What would be the best way to handle such scenarios?
It seems your application design makes it necessary to wait for each execution to finish before you fire the next one. I am not debating whether this is good or bad (that's a different discussion), but it seems the only option you have in this case is to fire all your DB updates synchronously.
Optimistic locking is very good for ensuring that the document you are updating has not been updated while your code did other things, but it will not help with your UI issue here.
I think you need to abstract the UI in order to make this work properly; otherwise you are stuck running things in synchronous mode.
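For reference, the version-check behaviour discussed in this thread can be sketched in a few lines. This is a generic compare-and-swap illustration, not the Cosmos DB SDK: `Record`, `read`, and `update_if_unchanged` are hypothetical names, and the integer version stands in for a document ETag.

```python
class VersionConflict(Exception):
    """The record changed since it was read."""


class Record:
    """In-memory stand-in for a document with a version (like an ETag)."""

    def __init__(self):
        self.status = None
        self.version = 0


def read(record):
    return record.status, record.version


def update_if_unchanged(record, new_status, expected_version):
    # Reject the write if someone else updated the record in the meantime.
    if record.version != expected_version:
        raise VersionConflict(
            f"expected version {expected_version}, found {record.version}"
        )
    record.status = new_status
    record.version += 1


def save_success(record, stale_version):
    """Write SUCCESS, re-reading and retrying once on a version conflict."""
    try:
        update_if_unchanged(record, "SUCCESS", stale_version)
    except VersionConflict:
        _, current = read(record)
        update_if_unchanged(record, "SUCCESS", current)
```

Replaying the scenario from the question (both calls read version 0, the failed call writes first) leaves the record at SUCCESS after the retry, but the failure response may already have reached the UI, which is the point made above.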

Make Lambda function execute now, and/or in an hour

I'm trying to implement an AWS Lambda function that should send an HTTP request. If that request fails (the response is anything but status 200), I should wait another hour before retrying (longer than the Lambda stays hot). What's the best way to implement this?
What comes to mind is to persist my HTTP request in some way and being able to trigger the Lambda function again in a specified amount of time in case of a persisted HTTP request. But I'm not completely sure which AWS service that would provide that functionality for me. Is SQS an option that can help here?
Or, can I dynamically schedule Lambda execution for this? Note that the request to be retried should be identical to the first one.
Any other suggestions? What's the best practice for this?
(Lambda function is my option. No EC2 or such things are possible)
You can't directly trigger Lambda functions from SQS (at the time of writing, anyhow; SQS has since been added as a Lambda event source).
You could potentially handle the non-200 errors by writing the request data (with appropriate timestamp) to a DynamoDB table that's configured for TTL. You can use DynamoDB Streams to detect when DynamoDB deletes a record and that can trigger a Lambda function from the stream.
This is obviously a roundabout way to achieve what you want but it should be simple to test.
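A hedged sketch of the two pieces involved, assuming a table whose TTL attribute is named `expiresAt` and a stream configured to include old images. The attribute names and function names are illustrative, and TTL deletions are not guaranteed to happen exactly at the expiry time, so the one-hour delay is a lower bound.

```python
import json
import time

RETRY_DELAY_SECONDS = 3600  # retry roughly one hour later


def build_retry_item(request_id, request, now=None):
    """Item to write on a non-200 response (low-level attribute format)."""
    now = int(time.time()) if now is None else now
    return {
        "requestId": {"S": request_id},
        "request": {"S": json.dumps(request)},
        # TTL attributes hold an epoch timestamp in seconds, as a Number.
        "expiresAt": {"N": str(now + RETRY_DELAY_SECONDS)},
    }


def is_ttl_delete(record):
    """True for REMOVE records produced by the TTL process itself."""
    identity = record.get("userIdentity") or {}
    return (
        record.get("eventName") == "REMOVE"
        and identity.get("principalId") == "dynamodb.amazonaws.com"
    )


def stream_handler(event, context=None):
    """Return the original requests that are now due for a retry."""
    due = []
    for record in event["Records"]:
        if not is_ttl_delete(record):
            continue
        old_image = record["dynamodb"]["OldImage"]
        due.append(json.loads(old_image["request"]["S"]))
    return due
```

In a real function, each request returned by `stream_handler` would be re-sent instead of collected in a list.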
As jarmod mentioned, you cannot trigger Lambda functions directly from SQS (at the time of writing). But a workaround (one I've used personally) would be to do the following:
If the request fails, push an item to an SQS Delay Queue (docs).
This SQS message will only become visible on the queue after a certain delay. You mentioned an hour; note that SQS caps the delay at 15 minutes, so for a longer wait the message has to carry its retry time and be re-queued until that time is reached.
Then have a second, scheduled Lambda function which is triggered by a cron value of a smaller timeframe (I used a minute).
This second function would then scan the SQS queue and, if an item is on the queue, call your first Lambda function (either by SNS or with the AWS SDK) to retry it.
PS: Note that you can put data in an SQS item. Since you mentioned that the retried request needs to be identical, you can store your first function's input in the message to be reused after an hour.
I suggest that you take a closer look at AWS Step Functions for this. Basically, Step Functions is a state machine that allows you to execute a Lambda function, i.e. a task, in each step.
More information can be found if you log in to your AWS Console and choose "Step Functions" from the "Services" menu. After you press the Get Started button, several example implementations of different Step Functions are presented. First, I would take a closer look at the "Choice state" example (to determine whether or not the HTTP request was successful). If it was not, then proceed with the "Wait state" example.
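As a sketch of how the Choice and Wait states could fit together, the state machine definition below is built as a Python dict and serialized to Amazon States Language JSON. The function ARN is a placeholder, and the sketch assumes the execution input has the shape `{"request": {...}}` and that the Lambda returns an object with a `statusCode` field; none of that is from the answer.

```python
import json

# Placeholder ARN (an assumption); replace with the real function's ARN.
SEND_REQUEST_LAMBDA_ARN = (
    "arn:aws:lambda:us-east-1:123456789012:function:send-request"
)

state_machine = {
    "Comment": "Send an HTTP request; on a non-200 result wait an hour and retry",
    "StartAt": "SendRequest",
    "States": {
        "SendRequest": {
            "Type": "Task",
            "Resource": SEND_REQUEST_LAMBDA_ARN,
            # Pass only the original request so every retry is identical.
            "InputPath": "$.request",
            # Keep the input and attach the Lambda result next to it.
            "ResultPath": "$.result",
            "Next": "CheckStatus",
        },
        "CheckStatus": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.result.statusCode",
                    "NumericEquals": 200,
                    "Next": "Done",
                }
            ],
            "Default": "WaitOneHour",
        },
        "WaitOneHour": {"Type": "Wait", "Seconds": 3600, "Next": "SendRequest"},
        "Done": {"Type": "Succeed"},
    },
}

definition = json.dumps(state_machine, indent=2)
```

As written, the loop retries until a 200 is returned; capping the number of attempts would need an extra counter in the state.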

How we can use JDBC connection pooling with AWS Lambda?

Can we use JDBC connection pooling with AWS Lambda? An AWS Lambda function gets called on a specific event, so does its lifetime persist even after it finishes one of its calls?
No. Technically, you could create a connection pool outside of the handler function, but since you can only make use of a single connection per invocation, all you would be doing is tying up database connections and allocating a pool of which you could only ever use one.
After uploading your Lambda function to AWS, the first time it is invoked AWS will create a container and run the setup code (the code outside of your handler function that creates the pool- let's say N connections) before invoking the handler code.
When the next request arrives, AWS may re-use the container again (or may not. It usually does, but that's down to AWS and not under your control).
Assuming it reuses the container, your handler function will be invoked (the setup code will not be run again) and your function would use one of the N connections to your database from the pool (held at the container level). This is most likely the first connection from the pool, number 1, as it is guaranteed not to be in use, since it's impossible for two functions to run at the same time within the same container. Read on for an explanation.
If AWS does not reuse the container, it will create a new container and your code will allocate another pool of N connections. Depending on the turnover of containers, you may exhaust the database pool entirely.
If two requests arrive concurrently, AWS cannot invoke the same handler at the same time. If this were possible, you'd have a shared state problem with the variables defined at the container scope level. Instead, AWS will use two separate containers and these will both allocate a pool of N connections each, i.e. 2N connections to your database.
A single function invocation never needs more than one connection (unless of course you need to communicate with two independent databases within the same context).
The only time a connection pool would be useful is if it were at one level above the container scope, that is, handed down by the AWS environment itself to the container. This is not possible.
The best case you can hope for is to have a single connection per container. Even then you would have to manage this single connection to ensure the database server hasn't disconnected or rebooted. If it has, your container's connection will die and your handler will never be able to connect again (until the container dies), unless you write some code in your function to check for dropped connections. On a busy server, the container might take a long time to die.
Also keep in mind that if your handler function fails, for example half way through a transaction or having locked a table, the next request invocation will get the dirty connection state from the container. The first invocation may have opened a transaction and died. The second invocation may commit and include all the previous queries up to the failure.
I recommend not managing state outside of the handler function at all, unless you have a specific need to optimise. If you do, then use a single connection, not a pool.
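A minimal sketch of the single-connection approach with a dropped-connection check. It is written in Python with `sqlite3` standing in for the real database driver, so the same idea would need translating to JDBC; the `_state` holder and function names are assumptions made for the example.

```python
import sqlite3

# One connection per container, created lazily and kept between invocations.
_state = {"conn": None}


def _connect():
    # sqlite3 stands in for the real database driver in this sketch.
    return sqlite3.connect(":memory:")


def get_connection():
    conn = _state["conn"]
    if conn is not None:
        try:
            conn.rollback()           # discard any half-finished transaction
            conn.execute("SELECT 1")  # cheap liveness check
            return conn
        except sqlite3.Error:
            _state["conn"] = None     # connection is dead; reconnect below
    _state["conn"] = _connect()
    return _state["conn"]


def lambda_handler(event, context=None):
    conn = get_connection()
    return conn.execute("SELECT ?", (event["value"],)).fetchone()[0]
```

The rollback at the start of each invocation addresses the dirty-state problem described above, and the liveness check covers a server that disconnected while the container was idle.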
Yes, the lambda is mostly persistent, so JDBC connection pooling should work. The first time a lambda function is invoked, the environment will be created and it may or may not get reused. But in practice, subsequent invocations will often reuse the same lambda process along with all program state if your triggering events occur often.
This short lambda function demonstrates this:
package test;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class TestLambda implements RequestHandler<String, String> {
    // Instance state survives between invocations while the environment is reused.
    private int invocations = 0;

    @Override
    public String handleRequest(String request, Context context) {
        invocations++;
        System.out.println("invocations = " + invocations);
        return request;
    }
}
Invoke this from the AWS console with any string as the test event. In the CloudWatch logs, you'll see the invocations number increment each time.
Kudos to AWS RDS Proxy: you can now use pooled MySQL and PostgreSQL connections without any extra configuration in your Java code, or in any other code specific to AWS Lambda. All you need to do is create a database proxy and add it to the AWS Lambda function in which you want to reuse/pool connections. See how-to here.
Note: AWS RDS Proxy is not included in the Free Tier (more here).
Pooling inside Lambda has caveats:
There is no destroy method that ensures the pool gets closed. One may say the DB connection idle timeout would handle this.
What if the same DB is also used for other use cases, such as a pool maintained on a regular machine like EC2?
As many say, a sudden spike in requests can create chaos for the DB, as there is always some maximum connection setting per user on the database side.