I am trying to consume some data in Redshift using SageMaker to train a model. After some research, I found the best way to do so is to first unload the data from Redshift to an S3 bucket. I assume SageMaker has an API to interact directly with Redshift, so why do we need to first unload the data to an S3 bucket?
UNLOADing is a best practice and generally the method that the docs will promote. This is due to efficiency and performance. Redshift is a cluster with a single leader and multiple compute nodes. S3 is a cluster - a distributed object store. Having multiple compute nodes connect to S3 when moving data is far faster and less of a burden to the database.
Also, tools that you may be using with sagemaker (like EMR) are also clusters and will also benefit from multiple parallel connections to S3.
The larger the amount of data being moved the greater this benefit will be.
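As a rough illustration, the UNLOAD itself is just a SQL statement you run against the cluster; here is a minimal sketch using the Redshift Data API via boto3, with hypothetical cluster, bucket, and IAM role names:

```python
import boto3

# Kick off an UNLOAD so the compute nodes write the result set to S3 in
# parallel; SageMaker can then read the Parquet files straight from the bucket.
client = boto3.client("redshift-data")
client.execute_statement(
    ClusterIdentifier="my-cluster",        # hypothetical cluster name
    Database="dev",
    DbUser="awsuser",                      # or authenticate with SecretArn
    Sql="""
        UNLOAD ('SELECT * FROM training_table')
        TO 's3://my-sagemaker-bucket/training/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-role'
        FORMAT AS PARQUET;
    """,
)
```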
I am completely new to AWS and need your support to point me in the right direction on my requirement.
Requirement:
I need to read multiple CSV files from an S3 bucket, union the data, perform some transformations, and load the result into another S3 bucket.
Issue:
I understand that Lambda is one of the options to do this, but the data is huge, so I believe at some point the 15-minute limitation will be an issue for me.
Also, for Glue ETL, from what I read I understand it does support S3 as the output.
Ask
Could you suggest any other ETL services, and a link to help me get started?
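For reference, Glue ETL does support writing its output to S3. A minimal sketch of such a Glue (PySpark) job, assuming hypothetical bucket names and a placeholder transformation:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Reading a prefix picks up every CSV underneath it and unions them into one DataFrame
df = spark.read.option("header", "true").csv("s3://source-bucket/input/")

# Placeholder transformation: drop rows missing an "id" column
df = df.dropna(subset=["id"])

# Write the combined, transformed data to the target bucket
df.write.mode("overwrite").parquet("s3://target-bucket/output/")
```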
There is a requirement to copy 10 TB of data from Azure Blob to S3, and also 10 TB of data from Synapse to Redshift.
What is the best way to achieve these 2 migrations?
For Redshift - you could export from Azure Synapse Analytics to blob storage in a compatible format, ideally compressed, and then copy the data to S3. It is pretty straightforward to import data from S3 into Redshift.
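A minimal sketch of that final S3-to-Redshift import, assuming the export landed in S3 as Parquet and using hypothetical cluster, bucket, and role names with the Redshift Data API:

```python
import boto3

# COPY loads the exported files from S3 into Redshift in parallel
client = boto3.client("redshift-data")
client.execute_statement(
    ClusterIdentifier="my-cluster",        # hypothetical cluster name
    Database="dev",
    DbUser="awsuser",
    Sql="""
        COPY target_table
        FROM 's3://migration-bucket/synapse-export/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-role'
        FORMAT AS PARQUET;
    """,
)
```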
You may need a VM instance to read from Azure Storage and put the data into AWS S3 (it doesn't matter where it runs). The simplest option seems to be using the default CLIs (Azure and AWS) to read the content onto the migration instance and write it to the target bucket. Personally, though, I'd consider writing an application that records checkpoints, so that if the migration process is interrupted for any reason it wouldn't need to start from scratch.
There are a few options you may "tweak" depending on the files to move (many small files vs. fewer large files), which region you are moving from and to, and so on.
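A sketch of the checkpointing idea, assuming the azure-storage-blob and boto3 SDKs and hypothetical container/bucket names; completed blob names are recorded in a local file so an interrupted run can resume where it left off:

```python
import os
import boto3
from azure.storage.blob import ContainerClient

CHECKPOINT_FILE = "copied_blobs.txt"

container = ContainerClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    "source-container",                                # hypothetical container
)
s3 = boto3.client("s3")

# Load the names of blobs that were already migrated in a previous run
done = set()
if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE) as f:
        done = {line.strip() for line in f}

with open(CHECKPOINT_FILE, "a") as checkpoint:
    for blob in container.list_blobs():
        if blob.name in done:
            continue                                   # already copied
        # readall() is fine for smaller blobs; stream very large ones instead
        data = container.download_blob(blob.name).readall()
        s3.put_object(Bucket="target-bucket", Key=blob.name, Body=data)  # hypothetical bucket
        checkpoint.write(blob.name + "\n")
        checkpoint.flush()                             # persist progress
```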
https://aws.amazon.com/premiumsupport/knowledge-center/s3-upload-large-files/
You may also consider using AWS S3 Transfer Acceleration; it may or may not help.
Please note that every larger cloud provider charges for outbound data egress; for 10 TB that may be a considerable cost.
I am working on an application that will read and analyze the logs of payment transactions. I know I will use Kinesis Analytics as per my requirements, and it takes its input from either Data Streams or Firehose. But I am having trouble deciding which input method I should use for my system. My requirements are:
It can tolerate latency, but it shouldn't lose data.
Must record all the errors in DynamoDB or S3 buckets.
Which input stream is suitable for my use case?
Data Streams vs Firehose
Streams:
Kinesis data streams is highly customizable and best suited for developers building custom applications or streaming data for specialized needs.
Going to write custom code
Real time (200ms latency for classic, 70ms latency for enhanced fan-out)
You must manage scaling (shard splitting/merging)
Data storage for 1 to 7 days, replay capability, multi consumers
Use with Lambda to insert data in real-time to ElasticSearch
Firehose:
Firehose handles loading data streams directly into AWS products for processing.
Fully managed, send to S3, Splunk, Redshift, ElasticSearch
Serverless data transformations with Lambda
Near real time (lowest buffer time is 1 minute)
Automated Scaling
No data storage
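To make the two ingestion paths above concrete, here is a minimal sketch of how a producer would push one record to each service with boto3; the stream and delivery stream names are hypothetical:

```python
import json
import boto3

record = json.dumps({"txn_id": "123", "status": "OK"}).encode("utf-8")

# Kinesis Data Streams: you choose a partition key and manage shards yourself
kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="payment-logs-stream",              # hypothetical stream
    Data=record,
    PartitionKey="123",
)

# Kinesis Data Firehose: no partition key or shards; Firehose buffers the data
# and delivers it to the configured destination (S3, Redshift, OpenSearch, Splunk)
firehose = boto3.client("firehose")
firehose.put_record(
    DeliveryStreamName="payment-logs-delivery",    # hypothetical delivery stream
    Record={"Data": record},
)
```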
Kinesis Data Streams allows consumers to READ streaming data, and it gives you plenty of options to do so. It is best suited for use cases that require custom processing, a choice of stream processing frameworks, and sub-second processing latency.
Data is reliably stored in streams up to 7 days and distributed across 3 Availability Zones.
Kinesis Firehose is used to LOAD streaming data into a target destination (S3, Elasticsearch, Splunk, etc.). You can also transform the streaming data (by using Lambda) before loading it into the destination.
Data from failed attempts will be saved to S3.
So, if your goal is only to load data into the Kinesis Data Analytics service with minimal or no pre-processing, then try Kinesis Firehose first.
Please note that you would also need to consider aspects such as cost, development effort, scaling options, and the volume of the data when choosing the proper service.
Please take a look at the following AWS Solutions Implementation for reference:
https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
https://aws.amazon.com/solutions/implementations/real-time-iot-device-monitoring-with-kinesis/
There are some key differences between Kinesis Stream (KS) and Firehose (FH):
KS is real time, while FH is near-real time.
KS requires manual scaling and setup of its provisioning (shards), while FH is basically serverless.
KS records are immutable (they persist in stream for its retention period - default 24h), while records in FH are gone from FH the moment they are delivered to destination.
From what you wrote, I think FH should be considered first, as you are not concerned about the non-real-time nature of FH, it is much easier to manage and set up, and you can specify S3 as a backup for failed or all messages:
Kinesis Data Firehose uses Amazon S3 to backup all or failed only data that it attempts to deliver to your chosen destination.
The S3 backup ensures you are not losing records if delivery or Lambda processing fails. Subsequently, in my view, Firehose addresses your two points well.
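As a rough sketch (hypothetical names and ARNs), this is what configuring that S3 backup looks like when creating the delivery stream with boto3; S3BackupMode="Enabled" backs up every incoming record to a second bucket:

```python
import boto3

firehose = boto3.client("firehose")
firehose.create_delivery_stream(
    DeliveryStreamName="payment-logs-delivery",    # hypothetical
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",   # hypothetical
        "BucketARN": "arn:aws:s3:::payment-logs-processed",          # hypothetical
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},
        "S3BackupMode": "Enabled",
        "S3BackupConfiguration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
            "BucketARN": "arn:aws:s3:::payment-logs-backup",         # hypothetical
        },
    },
)
```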
You can use Firehose to feed into Analytics, but the question is how Firehose gets its data. You can write your own code to feed data, or use Kinesis Data Streams. Firehose is mainly a delivery system for streaming data that can be written into various destinations such as S3, Redshift, or others, with an optional capability to perform data transformation.
Check this link https://www.slideshare.net/AmazonWebServices/abd217from-batch-to-streaming?from_action=save and see how your use case can benefit from the information.
More info: https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works.html
https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
If you are creating S3 files from the Kinesis stream but you don't require cleaning of those S3 files, then go with the Firehose option. Also, if you don't have any partitioning key requirement that produces many small S3 files, then Firehose is a good solution. If you end up doing more clean-up of the FH files than it would have taken to create those S3 files yourself, then FH isn't a good option.
It also depends on what you do with those S3 files. You need to find out whether you are saving any work/money by using Firehose versus the manual creation of S3 files. Remember you can't reorder the content of the S3 files.
I am trying to write some IoT data to an S3 bucket, and I know of two options so far.
1) Use the AWS CLI and put the data directly into S3.
The downside of this approach is that I would have to parse out the data and figure out how to write it to S3, so there would be some dev work required here. The upside is that there isn't an additional cost associated with this.
2) Use Kinesis firehose
The downside of this approach is that it costs more money. It might be wasteful because the data doesn't have to be transferred in real time, and it's not a huge amount of data. The upside is that I don't have to write any code for this data to be written to the S3 bucket.
Is there another alternative that I can explore?
If you're looking at keeping costs low, can you use some sort of cron functionality on your IoT device to POST data to a Lambda function that writes to S3, possibly?
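A minimal sketch of such a Lambda function, assuming a hypothetical bucket and key layout and that the device POSTs a JSON payload (e.g. via a function URL or API Gateway):

```python
import json
import time
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Write the POSTed readings to S3 as one JSON object per invocation
    key = f"iot/raw/{int(time.time())}.json"      # hypothetical key layout
    s3.put_object(
        Bucket="my-iot-bucket",                   # hypothetical bucket
        Key=key,
        Body=json.dumps(event).encode("utf-8"),
    )
    return {"statusCode": 200}
```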
Option 2 with Kinesis Data Firehose has the least administrative overhead.
You may also want to look into the native IoT services. It may be possible to use IoT Core and put the data directly in S3.
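If the device already publishes over MQTT to IoT Core, a topic rule with an S3 action can land each message in a bucket with no custom code in between. A minimal sketch with hypothetical names and ARNs:

```python
import boto3

iot = boto3.client("iot")
iot.create_topic_rule(
    ruleName="iot_to_s3",                          # hypothetical rule name
    topicRulePayload={
        "sql": "SELECT * FROM 'sensors/+/data'",   # hypothetical topic filter
        "awsIotSqlVersion": "2016-03-23",
        "actions": [
            {
                "s3": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-s3-role",  # hypothetical
                    "bucketName": "my-iot-bucket",                            # hypothetical
                    "key": "iot/${topic()}/${timestamp()}.json",
                }
            }
        ],
    },
)
```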