I'm building a Lambda function in AWS that needs to load reference data from a MySQL database. There is no real issue right now, as it is a very limited amount of data. But what is best practice here? Is there a way to keep this data within Lambda (or some other similar functionality) so that I don't need to request it for every invocation of the function? I'm using Node.js, though I don't think that affects this question.
Many thanks,
Marcus
There is no built-in persistent storage for Lambda. Any data that you would like to keep reliably between invocations (not counting the temporary persistence you get from the Lambda execution context) has to be stored outside of Lambda itself.
You already store it in MySQL, but other popular choices are:
SSM Parameter Store
S3
EFS
DynamoDB
ElastiCache if you really need fast access to the data.
Since you already get the data from MySQL, the only advantages of using SSM or DynamoDB would be that you can use the AWS API to access and update the data (or inspect/modify it in the AWS Console), and that you wouldn't need to bundle a MySQL client with your function or establish any connections to the database.
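For example, here is a minimal Node.js sketch (AWS SDK v3) of caching reference data loaded from Parameter Store outside the handler, so warm invocations reuse it instead of fetching it every time; the parameter name is a made-up placeholder:

```js
const { SSMClient, GetParameterCommand } = require("@aws-sdk/client-ssm");

const ssm = new SSMClient({});
let cachedReferenceData = null; // survives between invocations in the same execution environment

exports.handler = async (event) => {
  if (!cachedReferenceData) {
    const result = await ssm.send(
      new GetParameterCommand({ Name: "/myapp/reference-data" }) // placeholder name
    );
    cachedReferenceData = JSON.parse(result.Parameter.Value);
  }
  // ... use cachedReferenceData in your business logic ...
  return { entries: Object.keys(cachedReferenceData).length };
};
```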
Related
I am looking to trigger code every hour in AWS.
The code should: Parse through a list of zip codes, fetch data for each of the zip codes, store that data somewhere in AWS.
Is there a specific AWS service I would use for parsing through the list of zip codes and calling the API for each zip code? Would this be Lambda?
How could I schedule this service to run every X hours? Do I have to use another AWS Service to call my Lambda function (assuming that's the right answer to #1)?
Which AWS service could I use to store this data?
I tried looking up different approaches and services in AWS. I found I could write serverless code in Lambda, which made me think it would be the answer to my first question. Then I tried to look into how that could be run every X hours, but that's where I was struggling to know if I could still use Lambda for that, and where my options were to store the data. I saw that Glue may be an option, but wasn't sure.
Yes, you can use Lambda to run your code (as long as the total run time is less than 15 minutes).
You can use Amazon EventBridge Scheduler to trigger the Lambda every 1 hour.
Which AWS service could I use to store this data?
That depends on the format of the data and how you will subsequently use it. Some options are
Amazon DynamoDB for key-value, NoSQL data
Amazon Aurora for relational data
Amazon S3 for object storage
If you choose S3, you can still do SQL-like queries on the data using Amazon Athena
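To make that concrete, here is a minimal Node.js sketch of the hourly Lambda: it loops over a hard-coded zip code list, calls a hypothetical API endpoint, and writes the results to S3. The bucket name and URL are placeholders, not real resources:

```js
const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");

const s3 = new S3Client({});
const ZIP_CODES = ["10001", "94105", "60601"]; // example list

exports.handler = async () => {
  const results = [];
  for (const zip of ZIP_CODES) {
    // fetch() is available globally on the Node.js 18+ Lambda runtimes
    const response = await fetch(`https://example.com/api/data?zip=${zip}`); // placeholder URL
    results.push({ zip, data: await response.json() });
  }

  // Store one object per run, keyed by timestamp
  await s3.send(new PutObjectCommand({
    Bucket: "my-zip-code-data", // placeholder bucket
    Key: `runs/${new Date().toISOString()}.json`,
    Body: JSON.stringify(results),
    ContentType: "application/json",
  }));
};
```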
I'm currently trying to create a system where an AWS Lambda function accesses an RDS database, queries a column containing dates, and searches for a specific date; if it is found, the function pulls the matching data from the database to use as a variable.
Apologies if this is a bit vague.
I think you'll find tons of tutorials to do that online, but here are the main points:
Your Lambda should be deployed in a VPC, and you attach a VPC security group to it.
The SG attached to your RDS needs to allow your Lambda to access it.
Your Lambda needs to use a MySQL/PostgreSQL/... client library to query your RDS database.
The credentials to access your DB should be passed to your Lambda function using environment variables and/or Systems Manager Parameter Store and/or Secrets Manager.
In any case, your DB_PASSWORD should never be passed in plain text.
Be careful of some Lambda limitations, specifically the 15-minute timeout, in case your query takes a long time to run.
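As a rough illustration of the points above, here is a minimal Node.js sketch that queries a date column in an RDS MySQL database using the mysql2 client, with credentials passed via environment variables; the table and column names are assumptions for the example:

```js
const mysql = require("mysql2/promise");

// Create the pool outside the handler so warm invocations reuse the connection
const pool = mysql.createPool({
  host: process.env.DB_HOST,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD, // better: fetch from Secrets Manager / Parameter Store
  database: process.env.DB_NAME,
});

exports.handler = async (event) => {
  // Assumed table/column names, purely for illustration
  const [rows] = await pool.query(
    "SELECT payload FROM reference_table WHERE event_date = ?",
    [event.date] // e.g. "2024-01-31"
  );
  return rows.length > 0 ? rows[0].payload : null;
};
```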
My AWS Lambda function needs to access data that is updated every hour and is going to be called very often via API. What is the most efficient and least expensive way?
The data is already updated every hour by a batch Lambda, but I don't know where to store it.
How about putting the latest data into a "latest" bucket in Amazon S3 every time? Or, even though there may be a problem with a hot partition, how about storing it in Amazon DynamoDB, since access is simple? I also considered a gateway cache refreshed every hour, but that comes at a cost. Please advise.
As you have mentioned the "least expensive way", I will suggest using Amazon DynamoDB, because 25 GB of storage is always free (not just during the free-tier period). Even if your data size is more than 25 GB, you can still use DynamoDB over other services like RDS or S3, which come at a cost.
The simplest option would be to use AWS Systems Manager Parameter Store. It is secured via IAM and is a great way to share parameters between AWS Lambda functions.
If your data is too big to store in Parameter Store, then consider storing it in Amazon S3. It is easily accessible and low-cost.
If there are problems using these services, then you could look at using databases, but there is insufficient information in your question to make an appropriate recommendation.
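If you go the S3 route, a minimal Node.js sketch (AWS SDK v3) could look like this: the function reads the hourly-refreshed object and caches it in memory, so most API calls never hit S3. Bucket and key names are placeholders:

```js
const { S3Client, GetObjectCommand } = require("@aws-sdk/client-s3");

const s3 = new S3Client({});
let cache = { data: null, loadedAt: 0 };
const MAX_AGE_MS = 60 * 60 * 1000; // refresh at most once per hour

exports.handler = async () => {
  if (!cache.data || Date.now() - cache.loadedAt > MAX_AGE_MS) {
    const obj = await s3.send(new GetObjectCommand({
      Bucket: "my-hourly-data", // placeholder bucket
      Key: "latest.json",       // object overwritten each hour by the batch job
    }));
    cache = {
      data: JSON.parse(await obj.Body.transformToString()),
      loadedAt: Date.now(),
    };
  }
  return cache.data;
};
```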
I have 8 different AWS Lambda functions that need to share some common data. (like common configuration for database, etc)
There is no in-built technique for sharing data between Lambda functions. Each function runs independently and there is no shared datastore.
You will need to use an external datastore -- that is, something outside of Lambda that can persist the data.
Some options include:
Amazon S3: You could store information in an S3 object, that is retrieved by your Lambda functions.
Amazon DynamoDB: A fully-managed NoSQL database that provides fast performance. Ideal if you are storing and retrieving a blob of data, such as a JSON object. Your Lambda function would access DynamoDB via standard API calls. For extreme performance, you could use DynamoDB Accelerator (DAX).
AWS Systems Manager Parameter Store: Provides secure, hierarchical storage for configuration data management and secrets management.
The above options are fully-managed services, so you don't need to run any extra infrastructure.
There are other options, such as Amazon ElastiCache, but they would require additional services to be running.
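For example, a minimal Node.js sketch (AWS SDK v3) that reads shared configuration from a single DynamoDB item, so all 8 functions use the same source; the table and key names are illustrative assumptions:

```js
const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const { DynamoDBDocumentClient, GetCommand } = require("@aws-sdk/lib-dynamodb");

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
let sharedConfig; // cached for warm invocations

async function loadSharedConfig() {
  if (!sharedConfig) {
    const { Item } = await ddb.send(new GetCommand({
      TableName: "app-config",     // placeholder table
      Key: { configId: "common" }, // single item holding the shared JSON config
    }));
    sharedConfig = Item;
  }
  return sharedConfig;
}

exports.handler = async () => {
  const config = await loadSharedConfig();
  // ... use config.databaseHost, config.featureFlags, etc. ...
  return config;
};
```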
Depending on the nature of the configuration, you can decide between different storage options.
If the configuration doesn't change often and is mostly static, you can use the following options:
AWS Systems Manager Parameter Store
Embed the configuration as part of the code (see the sketch after this list)
If it changes often, you can consider:
Amazon DynamoDB
Amazon S3
Note: Depending on your use case, you can also pass these configurations as parameters to the Lambda functions.
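A minimal sketch of the "embed configuration as part of the code" option: a small config module with hard-coded defaults that can be overridden per function via Lambda environment variables. All names and defaults here are illustrative:

```js
// config.js -- hard-coded defaults, overridable per function via environment variables
module.exports = {
  dbHost: process.env.DB_HOST || "db.internal.example.com",
  dbName: process.env.DB_NAME || "reference",
  pageSize: parseInt(process.env.PAGE_SIZE || "100", 10),
};
```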
I will add my 2 cents:
Use a Lambda layer (50 MB) for data shared across function invocations that is fixed at deploy time.
Use EFS (EC2, within a VPC) for persistent data shared with any AWS service.
The Lambda temporary /tmp storage (default 512 MB, recently extended up to 10 GB) is ephemeral: it only lives as long as the current execution environment.
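For the ephemeral storage option, a minimal Node.js sketch could cache a file in /tmp so repeated invocations in the same warm execution environment skip the reload; loadFromSource() is a placeholder for wherever the data really lives:

```js
const fs = require("fs");

const CACHE_FILE = "/tmp/reference-data.json";

exports.handler = async () => {
  let data;
  if (fs.existsSync(CACHE_FILE)) {
    // Reuse the file while this execution environment stays warm
    data = JSON.parse(fs.readFileSync(CACHE_FILE, "utf8"));
  } else {
    data = await loadFromSource(); // e.g. MySQL, S3, SSM, ...
    fs.writeFileSync(CACHE_FILE, JSON.stringify(data));
  }
  return data;
};

// Placeholder for wherever the data actually lives
async function loadFromSource() {
  return { example: true };
}
```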
I am new to AWS and please forgive me if this question is asked previously.
I have a REST API which returns 2 parameters (name, email). I want to load this data into Redshift.
I thought of making a Lambda function which starts every 2 minutes and calls the REST API. The API might return at most 3-4 records within these 2 minutes.
So, in this situation, is it okay to just do an INSERT operation, or do I still have to use COPY (via S3)? I am only worried about performance and error-free (robust) data inserts.
Also, the Lambda function will start asynchronously every 2 minutes, so there might be an overlap of insert operations (but there won't be an overlap in data).
In that situation, if I go with the S3 option, I am worried that the S3 file generated by the previous Lambda invocation will be overwritten and a conflict will occur.
Long story short, what is the best practice for inserting a small number of records into Redshift?
PS: I am okay with using other AWS components as well. I even looked into Firehose, which would be perfect for me, but it can't load data into a Redshift cluster in a private subnet.
Thanks all in advance
Yes, it would be fine to INSERT small amounts of data.
The recommendation to always load via a COPY command is for large amounts of data because COPY loads are parallelized across multiple nodes. However, for just a few lines, you can use INSERT without feeling guilty.
If your SORTKEY is a timestamp and you are loading data in time order, there is also less need to perform a VACUUM, since the data is already sorted. However, it is good practice to still VACUUM the table regularly if rows are being deleted.
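For a handful of rows every couple of minutes, a plain multi-row INSERT over a normal connection is enough. Here is a minimal Node.js sketch using the pg client (Redshift speaks the PostgreSQL wire protocol); the table and column names, and the shape of event.records, are assumptions:

```js
const { Client } = require("pg");

exports.handler = async (event) => {
  const client = new Client({
    host: process.env.REDSHIFT_HOST,
    port: 5439,
    database: process.env.REDSHIFT_DB,
    user: process.env.REDSHIFT_USER,
    password: process.env.REDSHIFT_PASSWORD,
  });
  await client.connect();
  try {
    // event.records is assumed to hold the 3-4 records returned by the REST API
    const placeholders = event.records
      .map((_, i) => `($${i * 2 + 1}, $${i * 2 + 2})`)
      .join(", ");
    const params = event.records.flatMap((r) => [r.name, r.email]);
    await client.query(
      `INSERT INTO contacts (name, email) VALUES ${placeholders}`,
      params
    );
  } finally {
    await client.end();
  }
};
```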
As you don't have much data, you can use either COPY or INSERT. The COPY command is more optimized for bulk loads; it essentially gives you batch-insert capability.
For your volume, both will work equally fine.
FYI, AWS now supports Data API feature.
As described in the official documentation, you can now easily access Redshift data using HTTP requests, without a JDBC connection.
The Data API doesn't require a persistent connection to the cluster. Instead, it provides a secure HTTP endpoint and integration with AWS SDKs. You can use the endpoint to run SQL statements without managing connections. Calls to the Data API are asynchronous.
https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html
Here are the steps you need to follow to use the Redshift Data API:
Determine if you, as the caller of the Data API, are authorized. For more information about authorization, see Authorizing access to the Amazon Redshift Data API.
Determine if you plan to call the Data API with authentication credentials from Secrets Manager or temporary credentials. For more information, see Choosing authentication credentials when calling the Amazon Redshift Data API.
Set up a secret if you use Secrets Manager for authentication credentials. For more information, see Storing database credentials in AWS Secrets Manager.
Review the considerations and limitations when calling the Data API. For more information, see Considerations when calling the Amazon Redshift Data API.
Call the Data API from the AWS Command Line Interface (AWS CLI), from your own code, or using the query editor in the Amazon Redshift console. For examples of calling from the AWS CLI, see Calling the Data API with the AWS CLI.
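To illustrate the last step, here is a minimal Node.js sketch (AWS SDK v3) that runs an INSERT through the Data API, authenticating with a Secrets Manager secret; the cluster, database, and secret identifiers are placeholders:

```js
const {
  RedshiftDataClient,
  ExecuteStatementCommand,
} = require("@aws-sdk/client-redshift-data");

const client = new RedshiftDataClient({});

exports.handler = async (event) => {
  const result = await client.send(new ExecuteStatementCommand({
    ClusterIdentifier: "my-cluster",            // placeholder
    Database: "dev",                            // placeholder
    SecretArn: process.env.REDSHIFT_SECRET_ARN, // credentials stored in Secrets Manager
    Sql: "INSERT INTO contacts (name, email) VALUES (:name, :email)",
    Parameters: [
      { name: "name", value: event.name },
      { name: "email", value: event.email },
    ],
  }));
  // Calls are asynchronous; result.Id can be polled with DescribeStatementCommand
  return { statementId: result.Id };
};
```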