I am working on a data processing application hosted as a web service on an EC2 instance. Each second a small data file (less than 10 KB) in .csv format is generated.
Problem Statement:
Archive all the data files generated to Amazon Glacier.
My Approach:
As the data files are very small, I store them in AWS Kinesis and flush the data to S3 after a few hours (because I cannot find a direct way to put data from Kinesis into Glacier). Then, using S3 lifecycle management, I archive all the objects to Glacier at the end of the day.
My Questions:
Is there a way to transfer data to Glacier directly from Kinesis?
Is it possible to configure Kinesis to flush data to S3/Glacier at the end of the day? Is there any time or memory limit up to which Kinesis can hold data?
If Kinesis cannot transfer data to Glacier directly, is there a workaround? For example, can I write a Lambda function that fetches data from Kinesis and archives it to Glacier?
Is it possible to merge all the .csv files at the Kinesis, S3, or Glacier level?
Is Kinesis suitable for my use case? Is there anything else I can use?
I would be grateful if someone could take the time to answer my questions and point me to some references. Please let me know if there is a flaw in my approach or if there is a better way to do this.
Thanks.
You can't directly put data from Kinesis into Glacier (unless you want to put the 10 KB files directly into Glacier).
You could consider Kinesis Data Firehose as a way of flushing 15-minute increments of data to S3.
You can definitely do that. Glacier allows direct uploads, so there is no need to upload to S3 first.
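For example, here is a minimal sketch (Python/boto3) of a Lambda handler that concatenates the incoming Kinesis records and uploads them as one Glacier archive; the vault name is a placeholder, not something from your setup:

    import base64
    import boto3

    glacier = boto3.client("glacier")

    def handler(event, context):
        # Kinesis delivers record payloads base64-encoded; decode and
        # concatenate them into a single archive body for this invocation.
        payload = b"".join(
            base64.b64decode(record["kinesis"]["data"])
            for record in event["Records"]
        )

        # Upload the batch as one Glacier archive (vault name is hypothetical).
        response = glacier.upload_archive(
            vaultName="my-archive-vault",
            archiveDescription="csv batch from Kinesis",
            body=payload,
        )
        return {"archiveId": response["archiveId"]}

Keep in mind that many tiny archives in Glacier are awkward to retrieve later, which is why batching before upload is worthwhile.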
You could use Firehose to flush to S3, then transform and aggregate using Athena, then transition that file to Glacier. Or you could use Lambda directly and upload straight to Glacier.
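As a rough sketch of the Athena part, you could run a CTAS query to merge the day's small CSV objects into one consolidated output; the database, table, column, and bucket names below are all hypothetical:

    import boto3

    athena = boto3.client("athena")

    # CTAS query that reads the day's small CSV files (exposed as a table over
    # the raw S3 prefix) and writes one consolidated text output.
    query = """
    CREATE TABLE daily_merged
    WITH (external_location = 's3://my-archive-bucket/merged/2019-01-01/',
          format = 'TEXTFILE', field_delimiter = ',') AS
    SELECT * FROM raw_csv_files
    WHERE ingest_date = DATE '2019-01-01'
    """

    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-archive-bucket/athena-results/"},
    )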
Perhaps streaming the data into Firehose would make more sense. Depending on your exact needs, IoT Analytics might also be interesting.
Reading your question again and seeing that you use .csv files, I would highly recommend the Kinesis > S3 > Athena > transition-to-Glacier approach.
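The final transition-to-Glacier step is just an S3 lifecycle rule. A minimal sketch with boto3, where the bucket name and prefix are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # Lifecycle rule that moves objects under the merged/ prefix to Glacier
    # one day after creation (bucket and prefix are hypothetical).
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-archive-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-merged-csv",
                    "Filter": {"Prefix": "merged/"},
                    "Status": "Enabled",
                    "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
                }
            ]
        },
    )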
Related
I have multiple data sources from which I need to build and implement a DWH in AWS. I have a challenge with one of my unstructured data sources (data coming from different APIs). How can I ingest data from this source into Amazon Redshift? Can we first pull it into an Amazon S3 bucket and then integrate S3 with Amazon Redshift? What is the better approach?
Yes, S3 first. Your APIs can write to S3, and/or you can use a service like Kinesis (with or without Firehose) to populate S3. From there it is just work in Redshift.
Without knowing more about the sources, yes S3 is likely the right approach - whether you require latency in seconds, minutes or hours will be an important consideration.
If latency is not a driving concern, simply:
Set up an S3 bucket to use as a destination for your initial source(s).
Create tables in your Redshift database (loading data from S3 to Redshift requires a pre-existing destination table).
Use the COPY command to load from S3 into Redshift.
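If it helps, here is a hedged sketch of that COPY step using the Redshift Data API from Python; the cluster, database, table, bucket, and role names are all placeholders:

    import boto3

    redshift_data = boto3.client("redshift-data")

    # COPY from an S3 prefix into a pre-existing Redshift table.
    # Cluster, database, table, bucket, and role ARN are hypothetical.
    copy_sql = """
    COPY staging.api_events
    FROM 's3://my-dwh-landing/api-events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS JSON 'auto';
    """

    redshift_data.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="dwh",
        DbUser="etl_user",
        Sql=copy_sql,
    )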
As noted, there may be value in Kinesis, especially if you're working with real-time data streams (the service recently introduced support for skipping S3 and streaming directly to Redshift).
S3 is probably the easier approach, if you're not trying to analyze real-time streams.
I am new to data engineering and am picking up a project for personal growth.
I understand Kafka is used for data ingestion, but does it make sense to use it to ingest data from an API to AWS S3?
Is there another way to do the same, or in what situations would it make sense to use Kafka here?
The best way to ingest streaming data to S3 is Kinesis Data Firehose. Kinesis is a technology similar to Kafka, and Data Firehose is its specialised version for delivering data to S3 (or other destinations). It's fairly cheap and configurable, and much, much less hassle than Kafka if all you want to do is get data onto S3. Kafka is great, but it's not going to deliver data to S3 out of the box.
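For reference, pushing a record from your API code into a Firehose delivery stream that already has an S3 destination configured looks roughly like this; the stream name is a placeholder:

    import json
    import boto3

    firehose = boto3.client("firehose")

    def send_event(event: dict) -> None:
        # Firehose buffers records and delivers them to the configured S3
        # destination in batches. Stream name is hypothetical.
        firehose.put_record(
            DeliveryStreamName="api-to-s3-stream",
            Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
        )

    send_event({"user_id": 42, "action": "signup"})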
We are using Kinesis Data Firehose to write RDS CDC data to S3 buckets as raw JSON files. Our Kinesis Firehose buffering configuration is 128 MB and 60 seconds for creating the S3 files. We have a Glue job that monitors the S3 buckets and picks up these JSON files. Our question is whether we will run into a race condition between a JSON file that is currently being written by Kinesis Firehose and Glue. I looked at the FAQ but could not find any pointers. Please let me know if the race condition is possible and what strategies can mitigate it.
https://aws.amazon.com/kinesis/data-firehose/faqs/
If you are worried that your Glue job will start working with a file partially written by Kinesis, you should know that S3 operations are atomic, as explained in the Amazon S3 data consistency model:
Updates to a single key are atomic. For example, if you PUT to an existing key, a subsequent read might return the old data or the updated data, but it never returns corrupted or partial data.
I am trying to write some IoT data to an S3 bucket, and I know of two options so far.
1) Use the AWS CLI and put the data directly into S3.
The downside of this approach is that I would have to parse out the data and figure out how to write it to S3, so there would be some development required. The upside is that there is no additional cost associated with it.
2) Use Kinesis Data Firehose.
The downside of this approach is that it costs more money. It might be wasteful because the data doesn't have to be transferred in real time, and it's not a huge amount of data. The upside is that I don't have to write any code for the data to be written to the S3 bucket.
Is there another alternative that I can explore?
If you're looking to keep costs low, could you use some sort of cron functionality on your IoT device to POST data to a Lambda function that writes to S3?
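Something like this rough sketch, assuming an API Gateway proxy integration in front of the Lambda; the bucket name and key scheme are made up:

    import time
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-iot-data-bucket"  # hypothetical bucket

    def handler(event, context):
        # API Gateway proxy integration puts the raw POST payload in event["body"].
        body = event.get("body") or ""

        # Key the object by arrival time so each POST lands in its own file.
        key = f"iot/{int(time.time() * 1000)}.csv"
        s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))

        return {"statusCode": 200, "body": "stored " + key}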
Option 2 with Kinesis Data Firehose has the least administrative overhead.
You may also want to look into the native IoT services. It may be possible to use IoT Core and put the data directly in S3.
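For instance, IoT Core topic rules have an S3 action, so messages published on a topic can be written straight to a bucket, one object per message. A rough sketch; the topic, bucket, and role ARN are made up:

    import boto3

    iot = boto3.client("iot")

    # Rule that writes every message published on devices/+/data to S3.
    # Names and ARN are hypothetical.
    iot.create_topic_rule(
        ruleName="iot_data_to_s3",
        topicRulePayload={
            "sql": "SELECT * FROM 'devices/+/data'",
            "ruleDisabled": False,
            "actions": [
                {
                    "s3": {
                        "roleArn": "arn:aws:iam::123456789012:role/IotToS3Role",
                        "bucketName": "my-iot-data-bucket",
                        "key": "${topic()}/${timestamp()}.json",
                    }
                }
            ],
        },
    )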
I want to write streaming data from an S3 bucket into Redshift through Firehose, as the data is streaming in real time (600 files every minute) and I don't want any form of data loss.
How to put data from S3 into Kinesis Firehose?
It appears that your situation is:
Files randomly appear in S3 from an SFTP server
You would like to load the data into Redshift
There are two basic ways you could do this:
Load the data directly from Amazon S3 into Amazon Redshift, or
Send the data through Amazon Kinesis Firehose
Frankly, there's little benefit in sending it via Kinesis Firehose, because Kinesis will simply batch it up, store it in temporary S3 files, and then load it into Redshift. Therefore, this would not be a beneficial approach.
Instead, I would recommend:
Configure an event on the Amazon S3 bucket to send a message to an Amazon SQS queue whenever a file is created
Configure Amazon CloudWatch Events to trigger an AWS Lambda function periodically (e.g. every hour, every 15 minutes, or whatever meets your business need)
The AWS Lambda function reads the messages from SQS and constructs a manifest file, then triggers Redshift to import the files listed in the manifest file
This is a simple, loosely-coupled solution that will be much simpler than the Firehose approach (which would require somehow reading each file and sending the contents to Firehose).
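To make step 3 concrete, here is a minimal sketch of that Lambda, assuming the queue holds standard S3 event notifications; the queue URL, bucket, cluster, table, and role names are all placeholders:

    import json
    import boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    redshift_data = boto3.client("redshift-data")

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-files"  # hypothetical

    def handler(event, context):
        entries = []
        # Drain the queued S3 event notifications and collect object URLs.
        while True:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
            messages = resp.get("Messages", [])
            if not messages:
                break
            for msg in messages:
                for rec in json.loads(msg["Body"]).get("Records", []):
                    bucket = rec["s3"]["bucket"]["name"]
                    key = rec["s3"]["object"]["key"]
                    entries.append({"url": f"s3://{bucket}/{key}", "mandatory": True})
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

        if not entries:
            return "nothing to load"

        # Write the manifest and point a single COPY at it.
        manifest_key = "manifests/latest.manifest"
        s3.put_object(
            Bucket="my-staging-bucket",
            Key=manifest_key,
            Body=json.dumps({"entries": entries}).encode("utf-8"),
        )
        redshift_data.execute_statement(
            ClusterIdentifier="my-redshift-cluster",
            Database="dwh",
            DbUser="etl_user",
            Sql=f"""
                COPY public.events
                FROM 's3://my-staging-bucket/{manifest_key}'
                IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
                MANIFEST
                FORMAT AS CSV;
            """,
        )
        return f"loaded {len(entries)} files"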
It's actually designed to do the opposite: Firehose sends incoming streaming data to Amazon S3, not from Amazon S3. Besides S3, it can also send data to other services such as Redshift and the Elasticsearch Service.
I don't know whether this will solve your problem, but you can use COPY to load from S3 into Redshift.
Hope it will help!