AWS - how do you share an access token between lambda processes?

First, I have a question about the way Lambda works:
If it's only triggered by one SQS queue and that queue now contains 100 messages, would it sequentially create and tear down 100 Lambda processes, or would it handle them in parallel?
My second question is the main one:
The job of my lambda is to request an access token (for an external service) that expires every hour and using it, perform some action on that external service.
Now, I want to be able to cache that token and only ask for it every hour, instead of every time I make a request from the Lambda.
Given the nature of how Lambda works, is there a way of doing it through code?
How can I make sure all Lambda processes use the same access token?
(I know I can create a new Redis instance and make them all point to it, but I'm looking for a "simpler" solution.)

You can stuff the token in the SSM Parameter Store. You can encrypt the value. Lambdas can check the last modified date on the value to monitor when expiration is pending and renew it. No Redis instance to maintain, and the value would be encrypted.
https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-paramstore.html
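A minimal sketch of that approach in Python with boto3, assuming a hypothetical parameter name and a placeholder fetch_new_token() standing in for the external service's auth call:

```python
import boto3
from datetime import datetime, timedelta, timezone

ssm = boto3.client("ssm")

PARAM_NAME = "/external-service/access-token"  # hypothetical parameter name
TOKEN_TTL = timedelta(hours=1)  # the external token expires hourly (from the question)

def fetch_new_token():
    raise NotImplementedError  # placeholder for the external service's auth endpoint

def get_token():
    try:
        param = ssm.get_parameter(Name=PARAM_NAME, WithDecryption=True)["Parameter"]
        # GetParameter returns LastModifiedDate, so every Lambda can tell how old the token is.
        if datetime.now(timezone.utc) - param["LastModifiedDate"] < TOKEN_TTL:
            return param["Value"]
    except ssm.exceptions.ParameterNotFound:
        pass  # first run: no cached token yet
    # Stale or missing: renew and store encrypted so other Lambda processes pick it up.
    token = fetch_new_token()
    ssm.put_parameter(Name=PARAM_NAME, Value=token, Type="SecureString", Overwrite=True)
    return token
```

Note that two concurrent Lambdas may both see a stale token and both renew it; that's harmless here, since either fresh token works.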
You could also use DynamoDB for this. It has lower overhead than Redis since it's serverless. If you have a lot of concurrent Lambdas, this may be preferable to SSM because you may run into rate limiting on the SSM API. It's a little more work because you have to set up a DynamoDB table.
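A comparable sketch against DynamoDB, assuming a hypothetical, pre-created token-cache table keyed by pk, with the expiry stored as a Unix timestamp:

```python
import time
import boto3

table = boto3.resource("dynamodb").Table("token-cache")  # hypothetical pre-created table

def fetch_new_token():
    raise NotImplementedError  # placeholder for the external service's auth call

def get_token():
    # Strongly consistent read so every concurrent Lambda sees the latest token.
    item = table.get_item(Key={"pk": "token"}, ConsistentRead=True).get("Item")
    now = int(time.time())
    if item and int(item["expires_at"]) > now:
        return item["token"]
    token = fetch_new_token()
    table.put_item(Item={"pk": "token", "token": token, "expires_at": now + 3600})
    return token
```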
Another option would be to have a “parent” Lambda function that gets the API token and calls the “worker” Lambdas and passes the token as a parameter.
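A minimal sketch of that fan-out, assuming the parent's event carries a list of jobs and a hypothetical worker named worker-function:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def fetch_new_token():
    raise NotImplementedError  # placeholder for the external service's auth call

def handler(event, context):
    token = fetch_new_token()  # fetched once per parent run
    for job in event["jobs"]:  # hypothetical event shape
        lambda_client.invoke(
            FunctionName="worker-function",  # hypothetical worker name
            InvocationType="Event",          # asynchronous fire-and-forget
            Payload=json.dumps({"token": token, "job": job}),
        )
```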

Related

Retrieving AWS SSM Parameter taking a long time

I have a python lambda that forwards requests to an external API. The lambda is part of a target group that an ALB targets. The lambda goes through surges where it has to handle hundreds of invocations per second.
Everything works well for the most part, except when we hit some odd issue where it takes upwards of 20 seconds to retrieve a secure string param from Parameter Store. When that 20-second delay occurs, the system calling our ALB times out and throws an error.
I was thinking that I could do the SSM param retrieval in an init method of the Lambda and then keep the Lambda always warm, but that seems like a waste of resources just to manage the SSM param reading issue.
Are there any suggestions on how this should be done or configured (or if perhaps I'm overlooking something that I should be doing)?
Every AWS API has a request limit - https://aws.amazon.com/premiumsupport/knowledge-center/ssm-parameter-store-rate-exceeded/
So, yes, you should cache your parameters - How do I cache multiple AWS Parameter Store values in an AWS Lambda?
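The usual pattern from that linked answer is a module-level cache, which persists across invocations as long as the execution environment stays warm, so most invocations never hit the SSM API at all. A minimal sketch (the 300-second TTL is an assumption, tune it to your tolerance for stale values):

```python
import time
import boto3

ssm = boto3.client("ssm")
_cache = {}      # module-level: survives across invocations while the container is warm
CACHE_TTL = 300  # seconds; an assumption

def get_param(name):
    entry = _cache.get(name)
    if entry and time.time() - entry["at"] < CACHE_TTL:
        return entry["value"]  # warm hit: no SSM call, no rate-limit exposure
    value = ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]["Value"]
    _cache[name] = {"value": value, "at": time.time()}
    return value
```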

AWS lambda - best practice when reading from long list/s3

I have a scheduled error-handling Lambda; I would like to use serverless technology here as opposed to a Spring Boot service or something similar.
The Lambda will read from an S3 bucket and process accordingly. The problem is that at times the S3 bucket may have a high volume of data to be processed, and long-running operations aren't suited to Lambdas.
One solution I can think of is to have the Lambda read and process one item from the bucket and, on success, trigger another instance of the same Lambda unless the bucket is empty/fully processed. The thing I don't like is that this is synchronous and quite slow. I also need to be conscious of running too many Lambdas at the same time, as we are hitting a REST endpoint as part of the error flow and don't want to overload it with too many requests.
I am thinking it would be nice to have maybe three instances of the Lambda running at the same time until the bucket is empty, but I'm not really sure. I am wondering if anyone has any nice patterns that could be used here or suggestions on best practices?
Thanks
Create an S3 bucket for processing your files.
Enable an S3 -> Lambda trigger: on every new file in the bucket, the Lambda will be invoked to process that file; every file is processed separately. https://docs.aws.amazon.com/AmazonS3/latest/user-guide/enable-event-notifications.html
Once the file is processed you could either delete it or move it somewhere else.
Regarding concurrency, please have a look at Lambda's concurrency controls (reserved and provisioned concurrency): https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
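A minimal sketch of such an event-triggered processor, assuming a hypothetical process() step; the event shape is the standard S3 notification payload:

```python
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

def process(body):
    raise NotImplementedError  # placeholder for the actual error handling

def handler(event, context):
    # S3 invokes this once per event notification; each record is one new object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        process(body)
        # Delete (or copy elsewhere) once processed, so the bucket only holds pending work.
        s3.delete_object(Bucket=bucket, Key=key)
```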
Update:
As you still plan to use a scheduler Lambda and S3:
The scheduler Lambda reads/lists only the filenames and puts messages into SQS, one per file to process.
A new Lambda consumes the SQS messages and processes the files.
Note: I would recommend using SQS initially if the files/messages are not too big; it has built-in recovery mechanics (DLQ, delays, visibility timeouts, etc.) which benefit you more than plain S3 storage. A second way is to just create a message with a file reference and still use SQS.
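A minimal sketch of the SQS-consuming Lambda, assuming the scheduler enqueues JSON messages with hypothetical bucket and key fields:

```python
import json
import boto3

s3 = boto3.client("s3")

def process(body):
    raise NotImplementedError  # placeholder for the actual processing

def handler(event, context):
    # The SQS event source delivers a batch of messages per invocation.
    for record in event["Records"]:
        msg = json.loads(record["body"])  # assumes {"bucket": ..., "key": ...} messages
        obj = s3.get_object(Bucket=msg["bucket"], Key=msg["key"])
        process(obj["Body"].read())
    # Returning without raising lets Lambda delete the batch from the queue.
```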
I'd separate the lambda that is called by the scheduler from the lambda that is doing the actual processing. When the scheduler calls the first lambda, it can look at the contents of the bucket, then spawn the worker lambdas to process the objects. This way you have control over how many objects you want per worker.
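A minimal sketch of such an orchestrator, assuming a hypothetical bucket name, worker function name, and a batch size of 10:

```python
import json
import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

BATCH_SIZE = 10  # objects per worker; pick this to keep load on the REST endpoint manageable

def handler(event, context):
    # List pending objects (a paginator would be needed beyond 1000 keys).
    keys = [o["Key"] for o in s3.list_objects_v2(Bucket="error-bucket").get("Contents", [])]
    # Fan out: each worker gets a slice, which caps how many objects one Lambda handles.
    for i in range(0, len(keys), BATCH_SIZE):
        lambda_client.invoke(
            FunctionName="worker-function",  # hypothetical worker name
            InvocationType="Event",
            Payload=json.dumps({"keys": keys[i:i + BATCH_SIZE]}),
        )
```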
Given your requirements, I would recommend:
Configure an Amazon S3 Event so that a message is pushed to an Amazon SQS queue when an object is created in the S3 bucket
Schedule an AWS Lambda function at regular intervals that will:
Check that the external service is working
Invoke a Lambda function to process one message from the queue, and keep looping
The hard part would be throttling the second Lambda function so that it doesn't try to send all requests at once (which might impact that external service).
You could probably do this by using a Step Function to trigger Lambda and then, if it was successful, trigger another Lambda function. This could even be done in parallel, such as allowing up to three parallel Lambda executions. The benefit of using Step Functions is that there is no cost for "waiting" for each Lambda to finish executing.
So, the Step Function flow would be something like:
Invoke a "check external service" Lambda function
If it fails, then quit the flow
Invoke the "processing" Lambda function
Get one message
Process the message
If successful, remove the message from the queue
Return success/fail
If it was successful, keep looping until the queue is empty
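A minimal sketch of that "processing" Lambda, assuming a hypothetical queue URL and process() step; the returned queue_empty flag is what a Choice state in the Step Function would loop on:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/error-queue"  # hypothetical

def process(body):
    raise NotImplementedError  # placeholder for the call to the external service

def handler(event, context):
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
    messages = resp.get("Messages", [])
    if not messages:
        return {"queue_empty": True}  # the state machine stops looping on this
    message = messages[0]
    process(message["Body"])
    # Only remove the message once processing succeeded.
    sqs.delete_message(QueueUrL=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"]) if False else \
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
    return {"queue_empty": False}
```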

Does AWS Lambda Store Last Run Time?

I am trying to pass a Unix timestamp as a parameter in an API GET request to another system to grab data. The parameter needs to be the last time the AWS Lambda ran. I need to somehow store the last time the AWS Lambda function ran, maybe in an S3 bucket, and also recover that timestamp, so I can pass that value along into the next run.
Anyone have any ideas on how to do something like this?
Lambda does not store any last run time between invocations (especially as it's possible there could be concurrent invocations of your Lambda at the same time).
Depending on the use case, if you want your Lambda to both read and write, DynamoDB will probably be your best choice, although you should be aware of the following:
Reads and writes consume capacity units; if you're read- and write-heavy you will need to consider pricing and enable autoscaling if your load varies.
By default, reads are eventually consistent; if your reads must reflect the latest write, you will want to use strongly consistent reads.
As an alternative, you could store the value as a Parameter Store parameter. It is limited to about 1,000 operations per second, so if you are not making frequent requests this provides a very simple implementation.
If you do not need the information within the Lambda itself, you can get it by filtering the CloudWatch Logs produced by your Lambda. Doing this inside the Lambda itself would not be advisable, as it would take longer than either of the above options.
A quick database you could access from an AWS Lambda function is AWS Systems Manager Parameter Store.
You can store simple information such as configuration settings, URLs to databases and even... the last execution time!
IAM permissions can be used to limit access to specific parameters.
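A minimal sketch of that approach, assuming a hypothetical parameter name and a placeholder call_external_api() for the GET request described above:

```python
import time
import boto3

ssm = boto3.client("ssm")
PARAM_NAME = "/my-function/last-run"  # hypothetical parameter name

def call_external_api(since):
    raise NotImplementedError  # placeholder for the GET request to the other system

def handler(event, context):
    try:
        last_run = ssm.get_parameter(Name=PARAM_NAME)["Parameter"]["Value"]
    except ssm.exceptions.ParameterNotFound:
        last_run = "0"  # first run ever
    call_external_api(since=last_run)
    # Record this run's Unix timestamp for the next invocation to read.
    ssm.put_parameter(Name=PARAM_NAME, Value=str(int(time.time())),
                      Type="String", Overwrite=True)
```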
AWS Lambda stores many metrics on each function, including invocation time. Are you using boto3 or some other SDK?

AWS Real time data fetching

I have an application which needs to read data from an AWS DynamoDB table every 5 seconds.
Currently I fetch the data using a Lambda, which reads it from DynamoDB and returns it to the user.
The problem with querying the table every 5 seconds is that it can have a performance impact, and moreover there is a pricing issue. (Most of the time the data might not even have changed at all, but when it does change I want to be notified immediately.)
An important clarification is that my app sits outside of AWS and only accesses the AWS DynamoDB table to get data (using simple HTTP requests built with C#).
Is there any way I can get a notification to my app when new data is inserted into DynamoDB?
Just to add something on top of @john-rotenstein's answer:
Once you have properly configured a Lambda function to be triggered by an event from a DynamoDB Stream, you could have your Lambda function notify your Web Application via an HTTP Request.
Another option is to use Lambda to put this notification in a queue you may be using outside AWS and then have your C# code be a consumer of this queue. There are several possibilities for notifying your application; you just need to see which one is the best / most cost-effective for your scenario.
A data update in DynamoDB can trigger a DynamoDB Stream, which can trigger an AWS Lambda function.
The Lambda function could notify your application in some way.
See: DynamoDB Streams and AWS Lambda Triggers
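A minimal sketch of such a stream-triggered notifier, assuming the table's stream is configured with the NEW_IMAGE view type and a hypothetical application endpoint:

```python
import json
import urllib.request

APP_ENDPOINT = "https://example.com/notify"  # hypothetical endpoint in your application

def handler(event, context):
    # Invoked by the DynamoDB Stream; each record describes one table change.
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            payload = json.dumps(record["dynamodb"]["NewImage"]).encode()
            req = urllib.request.Request(APP_ENDPOINT, data=payload,
                                         headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)
```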
Streams is the right answer in terms of engineering, but just to say: your concern about the polling option being expensive is unfounded, so if you have a working solution I would be tempted to leave it.
If you queried a table every 5 seconds, it would cost you $0.25 every 2 months.
This assumes your table has on-demand pricing, and the query returns less than 4KB of data.
https://aws.amazon.com/dynamodb/pricing/on-demand/
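As a quick sanity check on that figure: one query every 5 seconds is 17,280 per day, or about 1.04 million over two months; at the on-demand rate of $0.25 per million read request units, that comes to roughly $0.26 (half that if the reads are eventually consistent).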

Delay Lambda execution over specific data

I am trying to come up with a way to have pieces of data processed at specific time intervals by invoking AWS Lambda every N hours.
For example, parse a page at a specific url every 6 hours and store the result in an S3 bucket.
I have many (~100k) urls, each processed that way.
Of course, you can have a VM that hosts some scheduler that would trigger lambdas, as described in this answer, but that breaks the "serverless" approach.
So, is there a way to do this using aws services only?
Things I tried that do not work:
SQS can delay messages, but only for a maximum of 15 minutes (I need hours), and there is no built-in integration between SQS and Lambda, so you need some polling agent (a Lambda?) that would poll the queue all the time and send new messages to a worker Lambda, which again defeats the point of only executing at the scheduled time;
CloudWatch Alarms can send messages to SNS, which triggers Lambda. You can implement periodic Lambda calls like that by using a future metric timestamp; however, the alarm message cannot have custom data (think of the url from the example above) attached to it, so that does not work either;
I could create Lambda CloudWatch scheduled triggers programmatically but they also cannot pass any data to Lambda.
The only way I could think of is to have a DynamoDB table with "url" records, each with the timestamp of its last "processing", and a periodic Lambda that would query the table and send "old" records as jobs to another "worker" Lambda (directly or via SNS).
That would work; however, you still need a "polling" Lambda, which could become a bottleneck as the number of items to process grows.
Any other ideas?
100k jobs every 6 hours doesn't sound like a great use case for serverless, IMO. Personally, I would set up a CloudWatch event with a relevant cron expression that triggered a Lambda to start an EC2 instance, have the instance process all the URLs (stored in DynamoDB), and script it to shut down after processing the last url.
But that's not what you asked.
You could set up a CloudWatch event with a relevant cron expression that spawns a Lambda (orchestrator) which reads the urls from DynamoDB, or even from an S3 file, then invokes a second Lambda (worker) for each url to actually parse the pages.
Using this pattern you will start hitting concurrency issues at 1,000 Lambdas (1 orchestrator & 999 workers), fewer if you have other Lambdas running in the same region. You can ask AWS to increase this limit, but I don't know under what scenarios they will do this, or how high they will raise it.
From here you have three choices.
Split out the payload to each worker Lambda so each instance receives multiple urls to process.
Add another column to your list of urls and group the urls by this column (e.g. the first 500 are marked with a 1, the second 500 with a 2, etc.). Your orchestrator Lambda could then take urls off the list in batches. This would require you to run the CloudWatch event at a greater frequency and manage state so that the orchestrator Lambda, when invoked, knows which batch is next (I've done this at a smaller scale, just storing a variable in an S3 file).
Use some combination of options 1 and 2 (a sketch along these lines follows below).
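A minimal sketch combining options 1 and 2, assuming hypothetical urls and scheduler-state tables, a GSI on the batch attribute, and a worker named worker-function:

```python
import json
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
urls_table = dynamodb.Table("urls")              # hypothetical: url records with a numeric "batch" attribute
state_table = dynamodb.Table("scheduler-state")  # hypothetical: one item tracking the next batch
lambda_client = boto3.client("lambda")

def handler(event, context):
    # Which batch is next? State lives outside the Lambda (here, or in an S3 file).
    cursor = state_table.get_item(Key={"pk": "cursor"}).get("Item", {"batch": 0})
    batch = int(cursor["batch"])
    # Fetch the urls marked with this batch number (hypothetical GSI on "batch").
    urls = urls_table.query(
        IndexName="batch-index",
        KeyConditionExpression=Key("batch").eq(batch),
    )["Items"]
    # Option 1: give each worker a slice of urls rather than a single one.
    for i in range(0, len(urls), 50):
        lambda_client.invoke(
            FunctionName="worker-function",  # hypothetical worker name
            InvocationType="Event",
            Payload=json.dumps({"urls": [u["url"] for u in urls[i:i + 50]]}),
        )
    # Option 2: advance the cursor so the next scheduled run takes the next batch.
    state_table.put_item(Item={"pk": "cursor", "batch": batch + 1})
```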
This looks like a fit for an AWS Batch processing scenario, with an AWS Lambda function as the job. It's serverless but obviously adds a dependency on another AWS service.
At the same time, it has a dashboard, processing status, retries, and all the other perks of a job-scheduling service.