We would like to stream data directly from an EC2 web server to Redshift. Do we need to use Kinesis? What is the best practice? We do not plan to do any special analysis on this data before storing it. We would like a cost-effective solution (using DynamoDB as temporary storage before loading might be costly).
If cost is your primary concern, then the exact number of records per second, combined with the record sizes, can be important.
If you are talking about a very low volume of messages, a custom app running on a t2.micro instance to aggregate the data is about as cheap as you can go, but it won't scale. The bigger downside is that you are responsible for monitoring, maintaining, and managing that EC2 instance.
The modern approach would be to use a combination of Kinesis + Lambda + S3 + Redshift to have the data stream in, requiring no EC2 instances to manage!
The approach is described in this blog post: A Zero-Administration Amazon Redshift Database Loader
What that blog post doesn't mention is that, now that API Gateway exists, if you need any custom authentication or data transformation you can do it without an EC2 instance by using Lambda to broker the data into Kinesis.
This would look like:
API Gateway -> Lambda -> Kinesis -> Lambda -> S3 -> Redshift
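As a rough illustration of the first Lambda in that chain, here is a minimal sketch of a handler that takes an API Gateway proxy event and forwards the JSON body to Kinesis with `put_record`. The stream name `web-events`, the `user_id` partition field, and the optional `kinesis` argument (there so the function can be exercised without AWS) are all assumptions for the example, not something from the blog post.

```python
import base64
import json

STREAM_NAME = "web-events"  # hypothetical stream name


def build_record(event, partition_field="user_id"):
    """Convert an API Gateway proxy event into put_record keyword arguments."""
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    payload = json.loads(body)  # raises ValueError on malformed JSON
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    # One compact JSON document per line, so downstream files are easy to COPY.
    line = json.dumps(payload, separators=(",", ":")) + "\n"
    return {
        "StreamName": STREAM_NAME,
        "Data": line.encode("utf-8"),
        "PartitionKey": str(payload.get(partition_field, "anonymous")),
    }


def handler(event, context, kinesis=None):
    """Lambda entry point; `kinesis` is injectable for local testing."""
    try:
        record = build_record(event)
    except ValueError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON body"})}
    if kinesis is None:
        import boto3  # available in the Lambda runtime
        kinesis = boto3.client("kinesis")
    kinesis.put_record(**record)
    return {"statusCode": 202, "body": json.dumps({"accepted": True})}
```

Any custom authentication or reshaping of the payload would go in `build_record` before the record is handed to Kinesis.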
Redshift is best suited for batch loading using the COPY command. A typical pattern is to load data to either DynamoDB, S3, or Kinesis, then aggregate the events before using COPY to Redshift.
See also this useful SO Q&A.
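To make the batch-loading point concrete, here is a small sketch that builds the COPY statement for a batch of newline-delimited JSON files sitting under an S3 prefix. The table name, bucket, and role ARN in the usage line are placeholders, and the `JSON 'auto'` format is an assumption about how the events were written.

```python
import re

# Accept plain or schema-qualified identifiers only, to avoid SQL injection.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_copy_sql(table, s3_prefix, iam_role_arn, gzip=False):
    """Return a Redshift COPY statement that loads every object under s3_prefix."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not s3_prefix.startswith("s3://") or "'" in s3_prefix or "'" in iam_role_arn:
        raise ValueError("unexpected S3 prefix or role ARN")
    sql = (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS JSON 'auto'"
    )
    if gzip:
        sql += " GZIP"
    return sql + ";"


# Placeholder values for illustration only.
print(build_copy_sql("events", "s3://my-bucket/batches/0001/",
                     "arn:aws:iam::123456789012:role/redshift-copy"))
```

One COPY over many files lets Redshift load them in parallel, which is why aggregating first tends to beat row-by-row inserts.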
I implemented such a system last year inside my company using Kinesis and the Kinesis connector. The Kinesis connector is just a standalone app released by AWS that we run on a bunch of Elastic Beanstalk servers as Kinesis consumers. The connector aggregates messages to S3 every so often or after a certain number of messages, then triggers the COPY command to load the data into Redshift periodically. Since it's running on Elastic Beanstalk, you can tune the auto-scaling conditions to make sure the cluster grows and shrinks with the volume of data from the Kinesis stream.
BTW, AWS just announced Kinesis Firehose yesterday. I haven't played with it, but it definitely looks like a managed version of the Kinesis connector.
I have a use case where my Redshift cluster is private and supports only a VPN connection to the VPC. I need to send data from Kinesis Firehose, which is in another VPC. I found out that we need to make Redshift public or attach an internet gateway to make this happen, but I can't use an internet gateway. I need to connect to Redshift from Kinesis Firehose over VPN only. I am not able to figure out any way to do this.
As you are already aware, you cannot use a private Redshift cluster in a VPC as a target for Firehose without Internet access. There is no direct solution for this as detailed here and here.
That said, I can think of at least two workarounds that might suffice.
You can have Firehose target S3. Then set up private link access to S3 from the private VPC and set up an event to copy the data into the Redshift cluster on an acceptable cadence. I think this is probably the best option.
You MIGHT be able to set up Firehose with a Lambda processor that feeds the records into Redshift. The reason I say "might" is that the Lambda will also need to be within the VPC and will need to keep up with the Firehose flow. This could be fraught with performance issues, and potentially expensive. Also, as a data warehouse, Redshift isn't really meant to handle a high rate of write transactions. This is the worst option.
Firehose aggregates data in S3 and then triggers a COPY command in Redshift. As you don't have a network path from Firehose to Redshift this fails. However, Firehose can just stop at placing the data in S3.
Now you just need a way to trigger Redshift to COPY the data. There are a number of ways to do this but the easiest way is to use Lambda (in your Redshift VPC) to issue the COPY commands. You will need to decide on when the Lambda should run - Firehose uses two parameters to determine when a COPY should be issued; time since last COPY and data size not yet copied. You can emulate this behavior if you like but the simplest way is to just issue COPYs on some regular time interval, like every 5 min.
To do this you set up CloudWatch to trigger your Lambda every 5 min. The Lambda:
1) looks in the Firehose location in S3 and lists all the files
2) renames (moves) all these files to put them in a new uniquely named S3 "subfolder"
3) issues the COPY command to Redshift to ingest from this "subfolder"
4) upon successful ingestion, these files can be moved again, left in the above "subfolder" or deleted
The reason to rename/move the files in S3 is to ensure that each run of the Lambda is operating on a unique set of files and that files aren't ingested more than once.
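The list, move and COPY steps might be sketched as below. This is only an outline under assumptions: `s3` is expected to behave like a boto3 S3 client (`list_objects_v2`, `copy_object`, `delete_object`), `run_sql` is a hypothetical callable that executes one statement against Redshift (a database driver or the Redshift Data API could sit behind it), and the files are assumed to be JSON. The staging root must not live under the incoming prefix, or claimed files would be listed again on the next run.

```python
import uuid


def claim_batch(s3, bucket, incoming_prefix, staging_root, batch_id=None):
    """Move every object under incoming_prefix into a uniquely named subfolder."""
    batch_id = batch_id or uuid.uuid4().hex
    batch_prefix = f"{staging_root.rstrip('/')}/{batch_id}/"
    keys, kwargs = [], {"Bucket": bucket, "Prefix": incoming_prefix}
    while True:  # list everything first, following pagination
        page = s3.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
    for key in keys:  # S3 has no rename, so copy then delete
        new_key = batch_prefix + key[len(incoming_prefix):]
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)
    return batch_prefix, len(keys)


def load_once(s3, run_sql, bucket, incoming_prefix, staging_root,
              table, iam_role_arn, batch_id=None):
    """One scheduled run: claim the waiting files, then COPY that subfolder."""
    batch_prefix, count = claim_batch(s3, bucket, incoming_prefix,
                                      staging_root, batch_id)
    if count == 0:
        return None  # nothing arrived since the last run
    run_sql(
        f"COPY {table} FROM 's3://{bucket}/{batch_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS JSON 'auto';"
    )
    return batch_prefix
```

If the COPY fails, the files stay in their batch subfolder, so the same prefix can be retried without mixing in newer data.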
I am new to AWS and trying to find a way to load data from S3 to RDS. In my current approach I am using an EC2 instance (where my app is running) to do that. I was thinking of doing it through Lambda, but my data has around 22 million records and my current approach takes 4 hours, while the Lambda timeout is 15 minutes, so the Lambda approach does not work in this case.
The problem with my current approach is that these data files arrive maybe once a month, and I don't want to keep an EC2 instance running just for this task. Any alternatives in the serverless world would be helpful. Thank you.
Note: The data is loaded from S3 to RDS based on SQS, i.e. my application pulls messages from SQS and then loads the data into RDS.
Please try DMS for this. You need to create a DMS task with your S3 bucket as the source endpoint and your RDS instance as the target endpoint.
We wish to build a data pipeline system which tracks website interactions/events.
The goal is to track user behavior on a website, so we would like to choose the right architecture to implement it, given the following two constraints:
1) the system runs on Amazon (AWS)
2) this is a budget-constrained project, so we cannot use Redshift for this purpose
Based on the above two constraints my plan is to implement the following architecture:
website-javascript --> AWS-S3 -->(AWS-Lambda)--> AWS-RDS
website javascript client -
aws-firehose data delivery system to S3 - tracks user interactions and loads them into aws-firehose, which eventually writes them to aws-S3.
AWS Lambda (Python) - Periodic task which pulls daily events from AWS-S3 and loads them into AWS-RDS.
The reason I have chosen AWS-RDS is its cost-effectiveness for this objective.
I would appreciate any comments on the above implementation, or any other architecture you would recommend instead.
If I understand your question correctly, you are proposing the following solution to perform web analytics for your application:
WebServer --> Firehose --> AWS-S3 --> AWS-Lambda --> AWS-RDS
I see the following pros and cons with the above design:
Pros:
low cost
easy to implement
Cons:
RDS may not be scalable enough to handle analytics on massive amounts of web-streaming data, which tends to grow rapidly
Need to handle load balancing, failure scenarios and other complexities for Lambda
You need to handle data transformation for RDS as it expects structured data to be ingested into relational tables
The proposal to store the data in S3 through Firehose sounds like a good solution. But please keep in mind that the minimum buffering interval for Firehose is one minute, so your application needs to tolerate this minor latency. You may use Kinesis Streams to get millisecond latency, but then you need to manage your own application code and instances to handle Streams.
After ingesting data in Kinesis Firehose or Streams, you may also explore below alternatives:
Use Kinesis Analytics to track web user activity in real time if it's available in your AWS region. It's currently only available in selected AWS regions
Within Firehose, transform your data using Lambda and store it in S3 in an optimized format for further analysis with Amazon Athena
Use Elasticsearch as a destination and perform web analytics with the ELK stack instead of RDS
Though you mentioned that you cannot use Redshift, it may still be the best solution for time series analysis. Exploring Redshift, Redshift Spectrum and formatted data stored in S3 may still be a cost-effective solution with better capabilities
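For the Firehose transformation option above, a transformation Lambda receives a batch of base64-encoded records and must return each `recordId` with a `result` of `Ok`, `Dropped` or `ProcessingFailed`. A minimal sketch follows; the field names `ts`, `user` and `event` are made up for illustration, and it only trims records to newline-delimited JSON rather than producing a columnar format.

```python
import base64
import json

KEEP_FIELDS = ("ts", "user", "event")  # hypothetical event schema


def handler(event, context):
    """Firehose data-transformation Lambda: trim each record to KEEP_FIELDS."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            slim = {name: payload[name] for name in KEEP_FIELDS}
        except (ValueError, KeyError, TypeError):
            # Hand the record back untouched; Firehose treats it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        line = json.dumps(slim) + "\n"  # one JSON document per line
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(line.encode("utf-8")).decode("ascii"),
        })
    return {"records": output}
```

Athena can query newline-delimited JSON directly, so this is enough to get started before moving to something like Parquet.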
Adding a few references from AWS, which you may want to go through before deciding on the solution:
Real-Time Web Analytics with Kinesis Data Analytics Solution
Near Real-time Analytics on Streaming Data with Amazon Kinesis and Amazon Elasticsearch
Schema-On-Read Analytics Pipeline Using Amazon Athena
Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required
Hey folks, this is getting more and more common.
Generally the pattern is click events to Kinesis streams, then you can monitor user interaction with the website in real time using Kinesis Analytics. You can connect the stream to Firehose to offload data into an S3 bucket, as well as incorporate Lambdas to transform the data.
There is some major complexity around handling Lambdas and Kinesis streams in parallel, so this solution might not be as scalable as using AWS Kafka. Or perhaps run a job to move your S3 data into RDS for whatever ad hoc reporting you might need.
Here is a pattern AWS already has: real-time-web-analytics-with-kinesis
Any suggested architecture ?
1. For the first full load, using Kinesis, how do I automate it so that it creates different streams for different tables? (Is this the way to do it?)
2. In case there is a new additional table, how do I create a new stream automatically?
3. How do I load to Kinesis incrementally (whenever the data is populated)?
Any resources/ architectures will be definitely helpful. Using Kinesis because multiple other down stream consumers might access this data in future.
I recommend looking into the AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS). DMS does not necessarily use Kinesis, but it is specifically designed for this use case.
Start with the walkthrough in this blog post: "How to Migrate Your Oracle Data Warehouse to Amazon Redshift Using AWS SCT and AWS DMS"
When I read about AWS Data Pipeline, an idea immediately struck me: produce statistics to Kinesis and create a job in the pipeline that consumes data from Kinesis and COPYs it to Redshift every hour. All in one go.
But it seems there is no node in the pipeline that can consume from Kinesis. So now I have two possible plans of action:
Create instance where Kinesis's data will be consumed and sent to S3 split by hours. Pipeline will copy from there to Redshift.
Consume from Kinesis and produce COPY directly to Redshift on the spot.
What should I do? Is there no way to connect Kinesis to redshift using AWS services only, without custom code?
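For the first plan in the question (consume from Kinesis and write to S3 split by hour), the key layout could look roughly like this. The prefix and file naming are assumptions; the arrival time would come from each Kinesis record's approximate arrival timestamp, and the shard ID plus sequence number keep object names unique.

```python
from datetime import datetime, timezone


def hour_prefix(prefix, arrival_epoch_seconds):
    """S3 'folder' holding every record that arrived within one UTC hour."""
    ts = datetime.fromtimestamp(arrival_epoch_seconds, tz=timezone.utc)
    return f"{prefix.rstrip('/')}/{ts:%Y/%m/%d/%H}/"


def hourly_key(prefix, arrival_epoch_seconds, shard_id, sequence_number):
    """Full object key for one batch file inside its hourly folder."""
    folder = hour_prefix(prefix, arrival_epoch_seconds)
    return f"{folder}{shard_id}-{sequence_number}.json"


# The hourly pipeline job would then COPY from one folder per run.
print(hour_prefix("stats", 3600))
print(hourly_key("stats", 3600, "shardId-000000000000", "42"))
```

Using UTC for the folder names avoids gaps or overlaps around daylight saving changes.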
It is now possible to do so without user code via a new managed service called Kinesis Firehose. It manages the desired buffering intervals, temporary uploads to S3, upload to Redshift, error handling and automatic throughput management.
That is already done for you!
If you use the Kinesis Connector Library, there is a built-in connector to Redshift
https://github.com/awslabs/amazon-kinesis-connectors
Depending on the logic you have to process, the connector can be really easy to implement.
You can create and orchestrate a complete pipeline with InstantStack to read data from Kinesis, transform it and push it into Redshift or S3.