I am new to AWS, so please forgive me if this question has been asked previously.
I have a REST API which returns 2 parameters (name, email). I want to load this data into Redshift.
I thought of making a Lambda function that runs every 2 minutes and calls the REST API. The API might return at most 3-4 records within those 2 minutes.
So, in this situation, is it okay to just do an INSERT operation, or do I still have to use COPY (via S3)? I am only worried about performance and robust, error-free data inserts.
Also, the Lambda function will start asynchronously every 2 minutes, so insert operations might overlap (but there won't be an overlap in the data).
In this situation, if I go with the S3 option, I am worried that the S3 file generated by the previous Lambda invocation will be overwritten and a conflict will occur.
Long story short, what is the best practice for inserting a small number of records into Redshift?
PS: I am okay with using other AWS components as well. I even looked into Firehose, which would be perfect for me, but it can't load data into a Redshift cluster in a private subnet.
Thanks all in advance
Yes, it would be fine to INSERT small amounts of data.
The recommendation to always load via a COPY command is for large amounts of data because COPY loads are parallelized across multiple nodes. However, for just a few lines, you can use INSERT without feeling guilty.
If your SORTKEY is a timestamp and you are loading data in time order, there is also less need to perform a VACUUM, since the data is already sorted. However, it is good practice to still VACUUM the table regularly if rows are being deleted.
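For just a few rows, batching them into one multi-row INSERT is enough. Here is a rough Python sketch; the `users` table, its columns, and the psycopg2-style connection are placeholders, not something from the question:

```python
def build_multi_row_insert(table, columns, row_count):
    """Build a parameterized multi-row INSERT statement.

    Batching the 3-4 records from one poll into a single INSERT
    keeps statement overhead low without needing COPY.
    """
    row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    placeholders = ", ".join([row] * row_count)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {placeholders}"


def insert_records(conn, records):
    """Insert the polled records in one statement (hypothetical schema)."""
    sql = build_multi_row_insert("users", ["name", "email"], len(records))
    params = [value for rec in records for value in (rec["name"], rec["email"])]
    with conn.cursor() as cur:  # e.g. a psycopg2 connection to Redshift
        cur.execute(sql, params)
    conn.commit()
```

Keeping the statement parameterized (rather than interpolating values into the SQL) also covers the "robust" requirement against odd characters in names or emails.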
As you don't have much data, you can use either COPY or INSERT. The COPY command is more optimized for bulk loads; it essentially gives you batch-insert capability.
Both will work equally well here.
FYI, AWS now supports the Data API feature.
As described in the official documentation, you can access Redshift data using HTTP requests, without a JDBC connection.
The Data API doesn't require a persistent connection to the cluster. Instead, it provides a secure HTTP endpoint and integration with AWS SDKs. You can use the endpoint to run SQL statements without managing connections. Calls to the Data API are asynchronous.
https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html
Here are the steps you need to use the Redshift Data API:
Determine if you, as the caller of the Data API, are authorized. For more information about authorization, see Authorizing access to the Amazon Redshift Data API.
Determine if you plan to call the Data API with authentication credentials from Secrets Manager or temporary credentials. For more information, see Choosing authentication credentials when calling the Amazon Redshift Data API.
Set up a secret if you use Secrets Manager for authentication credentials. For more information, see Storing database credentials in AWS Secrets Manager.
Review the considerations and limitations when calling the Data API. For more information, see Considerations when calling the Amazon Redshift Data API.
Call the Data API from the AWS Command Line Interface (AWS CLI), from your own code, or using the query editor in the Amazon Redshift console. For examples of calling from the AWS CLI, see Calling the Data API with the AWS CLI.
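Putting those steps together, a minimal call from Python via boto3 might look like the sketch below. The cluster identifier, database, secret ARN, and table name are all placeholders you would substitute with your own:

```python
def build_parameters(record):
    """Convert a record dict into the Data API's named-parameter format."""
    return [{"name": key, "value": value} for key, value in record.items()]


def insert_via_data_api(cluster_id, database, secret_arn, record):
    """Run an INSERT through the Redshift Data API (no JDBC connection).

    execute_statement is asynchronous: it returns a statement Id that
    you can poll with describe_statement to check completion.
    """
    import boto3  # imported here so the helper above stays importable anywhere

    client = boto3.client("redshift-data")
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,   # placeholder cluster name
        Database=database,
        SecretArn=secret_arn,           # Secrets Manager credentials
        Sql="INSERT INTO users (name, email) VALUES (:name, :email)",
        Parameters=build_parameters(record),
    )
    return response["Id"]
```

Because the call is asynchronous and the service queues statements, this also sidesteps the original worry about overlapping 2-minute Lambda invocations holding connections open.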
Related
I am looking to trigger code every 1 Hour in AWS.
The code should: Parse through a list of zip codes, fetch data for each of the zip codes, store that data somewhere in AWS.
Is there a specific AWS service would I use for parsing through the list of zip codes and call the api for each zip code? Would this be Lambda?
How could I schedule this service to run every X hours? Do I have to use another AWS Service to call my Lambda function (assuming that's the right answer to #1)?
Which AWS service could I use to store this data?
I tried looking up different approaches and services in AWS. I found I could write serverless code in Lambda, which made me think it would be the answer to my first question. Then I tried to look into how that could be run every X hours, but that's where I was struggling to know if I could still use Lambda for that. Then I wanted to know what my options were for storing the data. I saw that Glue may be an option, but wasn't sure.
Yes, you can use Lambda to run your code (as long as the total run time is less than 15 minutes).
You can use Amazon EventBridge Scheduler to trigger the Lambda every 1 hour.
Which AWS service could I use to store this data?
That depends on the format of the data and how you will subsequently use it. Some options are:
Amazon DynamoDB for key-value, NoSQL data
Amazon Aurora for relational data
Amazon S3 for object storage
If you choose S3, you can still do SQL-like queries on the data using Amazon Athena
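Tying the pieces together, a sketch of the hourly Lambda might look like this. The bucket name, the zip code list, and `fetch_zip_data` are placeholders standing in for your own API call:

```python
import json
from datetime import datetime, timezone


def build_s3_key(zip_code, now=None):
    """Time-partitioned key, so hourly runs never overwrite each other."""
    now = now or datetime.now(timezone.utc)
    return f"zip-data/{zip_code}/{now:%Y/%m/%d/%H}.json"


def handler(event, context):
    """Entry point, triggered hourly by an EventBridge Scheduler rule."""
    import boto3  # available in the Lambda runtime by default

    s3 = boto3.client("s3")
    zip_codes = ["10001", "94105"]       # your real list goes here
    for zip_code in zip_codes:
        data = fetch_zip_data(zip_code)  # hypothetical: your API call
        s3.put_object(
            Bucket="my-zip-data-bucket",  # placeholder bucket name
            Key=build_s3_key(zip_code),
            Body=json.dumps(data),
        )
```

The time-partitioned key layout also plays nicely with Athena later, since you can restrict queries to a date prefix.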
We are building a customer facing App. For this app, data is being captured by IoT devices owned by a 3rd party, and is transferred to us from their server via API calls. We store this data in our AWS Documentdb cluster. We have the user App connected to this cluster with real time data feed requirements. Note: The data is time series data.
The thing is, for long-term data storage and for creating analytics dashboards to be shared with stakeholders, our data governance folks are requesting us to replicate/copy the data daily from the AWS DocumentDB cluster to their Google Cloud Platform -> BigQuery. Then we can directly run queries on BigQuery to perform analysis and send the data to, say, Explorer or Tableau to create dashboards.
I couldn't find any straightforward solutions for this. Any ideas, comments or suggestions are welcome. How do I achieve or plan the above replication? And how do I make sure the data is copied efficiently - memory and pricing? Also, don't want to disturb the performance of AWS Documentdb since it supports our user facing App.
This solution would need some custom implementation. You can utilize change streams and process the data changes in intervals to send to BigQuery, so there is a data replication mechanism in place for you to run analytics. One of the documented use cases of change streams is analytics with Redshift, so BigQuery should serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
This document also contains a sample Python code for consuming change streams events.
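As a rough sketch of that approach (the `collection` argument would come from your own pymongo/DocumentDB client, and the row layout is just an illustration, not a prescribed BigQuery schema):

```python
def change_to_row(change):
    """Flatten a change-stream event into a row for the warehouse."""
    return {
        "operation": change["operationType"],
        "document_id": str(change["documentKey"]["_id"]),
        "document": change.get("fullDocument"),
    }


def stream_changes(collection, resume_token=None):
    """Tail a DocumentDB change stream and yield warehouse-ready rows.

    A real pipeline would batch these rows and load them into BigQuery
    on an interval, persisting the resume token so a restart picks up
    where it left off instead of re-reading from the beginning.
    """
    kwargs = {"full_document": "updateLookup"}
    if resume_token is not None:
        kwargs["resume_after"] = resume_token
    with collection.watch(**kwargs) as stream:
        for change in stream:
            yield change_to_row(change)
```

Running the consumer on a schedule (rather than 24/7) and batching the loads keeps both the DocumentDB read load and the BigQuery ingestion cost down, which addresses the performance and pricing concerns.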
I'm building a Lambda function in AWS that needs to load reference data from a MySQL database. There is no real issue right now, as it is a very limited amount of data. But what is best practice here? Is there a way to keep this data within Lambda (or some other similar functionality) so that I don't need to request it on every invocation of the function? I'm using Node.js, though I don't think that affects this question.
Many thanks,
Marcus
There is no built-in persistent storage for Lambda. Any data that you would like to keep reliably between invocations (not counting temporary persistence due to the Lambda execution context) has to be stored outside of Lambda itself.
You already store it in MySQL, but other popular choices are:
SSM Parameter Store
S3
EFS
DynamoDB
ElastiCache if you really need fast access to the data.
Since you already get the data from MySQL, the only advantage of using SSM or DynamoDB would be that you can use the AWS API to access and update the values, or inspect/modify them in the AWS Console. You also wouldn't need to bundle a MySQL client with your function or establish any connections to the database.
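If the reference data changes rarely, you can also exploit the execution-context reuse mentioned above: state kept outside the handler survives between invocations while the container stays warm. The question uses Node.js, but the pattern is identical there; here it is sketched in Python with an illustrative TTL (`loader` stands in for your MySQL query):

```python
import time

_CACHE = {"data": None, "loaded_at": 0.0}
TTL_SECONDS = 300  # illustrative: refresh reference data every 5 minutes


def get_reference_data(loader, now=None):
    """Return cached reference data, reloading after TTL expiry.

    Module-level state survives between invocations while the
    execution context is warm, so most invocations skip the
    database round trip. `loader` is your MySQL query function.
    """
    now = time.time() if now is None else now
    if _CACHE["data"] is None or now - _CACHE["loaded_at"] > TTL_SECONDS:
        _CACHE["data"] = loader()
        _CACHE["loaded_at"] = now
    return _CACHE["data"]
```

This is a best-effort cache, not reliable storage: a cold start always reloads, which is exactly the behavior you want for reference data.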
I have a project where I am building a simple single page app, that needs to pull data from an api only once a day. I have a backend that I am thinking of building with golang, where I need to do 2 things:
1) Have a scheduled job that would once a day update the DB with the new data.
2) Serve that data to the frontend. Since the data would only be updated once a day, I would like to cache it after each update.
Since the number of options AWS offers is a bit overwhelming, I am wondering what would be the ideal solution for this scenario. Should I use a Lambda that connects to the DB and updates it on a schedule? Should I then create a separate REST API Lambda that pulls the data from the DB and is called from the frontend?
I would really appreciate suggestions for this problem.
Here is my suggestion:
Create a Lambda function.
It will fetch the required information from the database.
You may use S3 or DynamoDB to save your content. Both solutions may be free; check the free tier offers, depending on your usage.
It will save the fetched content to S3 or DynamoDB (you may check DAX for DynamoDB caching).
Create an API Gateway and integrate it with your Lambda (an Elastic Load Balancer is another choice).
Create a schedule expression on CloudWatch (EventBridge) to trigger the Lambda daily.
Make a request from your front end to the API Gateway or ELB.
You may use Route 53 for domain naming.
Your Lambda should handle two separate paths: one responds to the schedule expression, the other serves your content by communicating with S3/DynamoDB.
Edit:
If the content is going to be static, you may configure an S3 bucket for static site serving, and your daily Lambda may write to it when it is triggered. Then you no longer need API Gateway or DynamoDB.
See the S3 documentation on hosting a static website.
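The two paths mentioned above can live in one handler by inspecting the incoming event; here is a sketch, where the refresh/read helpers are placeholders for your own S3 or DynamoDB logic:

```python
import json


def refresh_content():
    # Placeholder: fetch from your database and write JSON to S3 here.
    pass


def read_cached_content():
    # Placeholder: read the cached JSON back from S3/DynamoDB here.
    return {"items": []}


def handler(event, context):
    """Dispatch between the daily schedule and API Gateway requests.

    EventBridge/CloudWatch scheduled events carry "source": "aws.events";
    API Gateway proxy events do not, so anything else is a user request.
    """
    if event.get("source") == "aws.events":
        refresh_content()                   # daily refresh path
        return {"status": "refreshed"}
    return {                                # serve path, proxy-style response
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(read_cached_content()),
    }
```

Splitting into two Lambdas instead is equally valid; a single handler just keeps the daily job and the serving code deployed together.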
I'm working on a project to manage documents (eg: create, read, maintain different versions etc...) and my plan is to use the following AWS architecture.
When a document is created or updated, it will be saved onto a version-enabled S3 bucket via an API Gateway S3 proxy. The S3 put event will trigger a Lambda to get the latest version and all version IDs and save them to DynamoDB. Once it is saved in a DynamoDB table, it will be indexed in Elasticsearch via a DynamoDB stream.
My plan is to use Elasticsearch for all search queries, and I will load the latest documents from DynamoDB. Since each record has the S3 version IDs, I can query old versions from S3 as well.
Since my architecture relies heavily on eventual consistency (S3 to DynamoDB, and DynamoDB to Elasticsearch), I'm worried that I would not get the latest document data when I query Elasticsearch or DynamoDB right after creating a document.
Any suggestions for improvements will be much appreciated.
Thanks!
As you said, your application architecture has multiple points where eventual consistency is used.
If your application's business case absolutely requires that when you query data you get the absolute latest version, then your architecture choices are bad and you should, for example, consider using an RDS-based persistence layer instead.
If not, then you just design the rest of your system keeping in mind that a completed PUT does not guarantee that queries immediately return the data. How to do this depends heavily on your application and cannot feasibly be generalized.
Since you use a DynamoDB stream, your DynamoDB insert will reach your Elasticsearch server, but with a delay. In case of a write failure, it's up to the client to issue a retry.
Also keep in mind the time it takes to trigger a DynamoDB stream and the time the Elasticsearch indexing takes (plus the S3 event).
So your problem has more to do with the time it takes for data to reach the Elasticsearch server.
If you want something more consistent that reflects the current state without any delays (since that is the propagation problem you will end up with), you need to change your tools.
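Until then, one pragmatic client-side mitigation is to poll the read path briefly after a write. A sketch, where `fetch` stands in for whatever your Elasticsearch or DynamoDB lookup function is:

```python
import time


def wait_until_indexed(fetch, doc_id, timeout=5.0, interval=0.25):
    """Poll a read path until a freshly written document becomes visible.

    Copes with the S3 -> DynamoDB -> Elasticsearch propagation delay:
    `fetch` is your query function (e.g. an Elasticsearch GET by id)
    returning the document, or None if it is not indexed yet.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        doc = fetch(doc_id)
        if doc is not None:
            return doc
        time.sleep(interval)
    raise TimeoutError(f"document {doc_id} not visible after {timeout}s")
```

This does not make the pipeline consistent; it only hides short propagation delays from the user who just created the document, which is often the only place read-after-write matters.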