S3 to RDS file management system - amazon-web-services

I'm new to AWS and have a feasibility question for a file management system I'm trying to build. I would like to set up a system where people use the Amazon S3 browser to drop either a CSV or Excel file into their specific bucket. I would then like to automate the process of taking that CSV/Excel file and inserting it into a table within RDS. This assumes the table has already been built, and that the CSV/Excel files will always be formatted the same and land in the exact same place every single time. Is it possible to automate this process, or at least get it to a point where very minimal human interference is needed? I'm new to AWS, so I'm not exactly sure of the limits of S3 to RDS. Thank you in advance.

It's definitely possible. AWS supports notifications from S3 to SNS, which can be forwarded automatically to SQS: http://aws.amazon.com/blogs/aws/s3-event-notification/
S3 can also send notifications to AWS Lambda to run your own code directly.
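To make the S3-to-RDS flow concrete, here is a minimal sketch of what the Lambda side might look like, assuming a MySQL-flavoured RDS instance reachable from the function, the pymysql library bundled with the deployment package, and a hypothetical table named uploads whose columns match the CSV (Excel files would need an extra parsing step, e.g. with openpyxl):

```python
import csv
import io
import os

import boto3
import pymysql  # assumed to be bundled with the deployment package or a layer

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by an S3 ObjectCreated event; loads the uploaded CSV into RDS."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Download the uploaded CSV into memory (fine for small files).
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = list(csv.reader(io.StringIO(body)))[1:]  # skip the header row

    conn = pymysql.connect(
        host=os.environ["DB_HOST"],          # hypothetical environment variables
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
    )
    try:
        with conn.cursor() as cur:
            # 'uploads' and its three columns are placeholders for your schema.
            cur.executemany(
                "INSERT INTO uploads (col_a, col_b, col_c) VALUES (%s, %s, %s)",
                rows,
            )
        conn.commit()
    finally:
        conn.close()
```

The remaining piece is the S3 bucket's event notification (ObjectCreated, optionally filtered by a prefix or a .csv suffix) pointing at this function, which is what removes the human from the loop.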

Related

AWS Breaking a large file in S3 into small chunks of files based on number of rows present in the file

We will be receiving a daily file from a consumer which is 45 MB in size.
We have a requirement to break this 45 MB file into smaller chunks of files based on a configurable number of rows.
Is there any AWS service available on top of S3 which can do this work?
This process should be automated and no manual intervention should be needed.
We need to achieve it using Java (preference); any other language is also fine.
There is no "split my file into small chunks" service in AWS. You would need compute to perform this operation, such as an Amazon EC2 instance, an AWS Lambda function, or an AWS Fargate container.
If "automated one and no manual intervention" means "do this when the file is uploaded" then the most appropriate would be an AWS Lambda function since S3 can trigger the function upon upload.
The Lambda function will be provided with the Bucket and Key of the S3 object that triggered the function. You will need to write the code that downloads the file, splits it into multiple files and uploads them back to S3. You might be able to do it in-memory depending on file size. There are plenty of examples online of how to use S3 from an AWS Lambda function.
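Although the question prefers Java, here is a rough Python sketch of that Lambda just to show the shape of the logic; the chunk size, output prefix and key layout are assumptions:

```python
import csv
import io
import os

import boto3

s3 = boto3.client("s3")
ROWS_PER_CHUNK = int(os.environ.get("ROWS_PER_CHUNK", "10000"))  # configurable

def lambda_handler(event, context):
    """Split the uploaded CSV into chunks of ROWS_PER_CHUNK rows each."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # A 45 MB file fits comfortably in Lambda memory.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.reader(io.StringIO(body))
    header = next(reader)

    chunk, part = [], 0
    for row in reader:
        chunk.append(row)
        if len(chunk) == ROWS_PER_CHUNK:
            part += 1
            _write_chunk(bucket, key, header, chunk, part)
            chunk = []
    if chunk:
        _write_chunk(bucket, key, header, chunk, part + 1)

def _write_chunk(bucket, key, header, rows, part):
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)
    # Write chunks next to the original under a 'chunks/' prefix (an assumption).
    s3.put_object(Bucket=bucket,
                  Key=f"chunks/{key}.part{part:04d}.csv",
                  Body=out.getvalue().encode("utf-8"))
```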

Data migration from S3 to RDS

I am working on a requirement where I am doing a multipart upload of a CSV file from an on-prem server to an S3 bucket.
To achieve this, I create a presigned URL using AWS Lambda and upload the CSV file with that URL. Now, once I have the file in S3, I want it to be moved to an AWS RDS Oracle DB. Initially I was planning to use AWS Lambda for this.
So once I have the file in S3, it triggers a Lambda (S3 event) and the Lambda will push this file to RDS. But with this, the issue is the file size (600 MB).
I am looking for some other way: whenever a file is uploaded to S3, it should trigger an AWS service and that service will push this CSV file to RDS. I have gone through AWS DMS/Data Pipeline, but am not able to find any way to automate this migration.
I need to automate this migration on every S3 upload, and it should also be cost effective.
Set up S3 integration and build SPROCs to help automate the load. Details found here.
UPDATE:
It looks like you don't even need to create a SPROC. You can just use the RDS procedure as outlined here. You would then create an event-driven Lambda function that is triggered on a given S3 event (e.g. on object PUT, POST, COPY, etc.) and receives the S3 metadata needed to access the object. Here is a simple Python example of what that Lambda and config might look like. You would then use the metadata passed on the trigger event, as outlined in the Python example, to dynamically build your procedure call and execute that procedure. You can also add the ensuing workflow logic that meets your requirements (i.e. TASK_ID fetch and operational handling, monitoring, etc.) to the same Lambda function, or separate those concerns by adding additional Lambdas. Hope this helps!
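As a concrete illustration of that flow, here is a hedged sketch of a Lambda that calls the RDS for Oracle S3 integration procedure (rdsadmin.rdsadmin_s3_tasks.download_from_s3) to pull the uploaded file onto the DB instance; the subsequent load into the target table (external table, SQL*Loader, or a stored procedure) would follow. It assumes the cx_Oracle driver and Oracle client libraries are packaged as a Lambda layer and that connection details live in environment variables:

```python
import os

import cx_Oracle  # assumed to be provided via a Lambda layer with the Oracle client

def lambda_handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    dsn = cx_Oracle.makedsn(os.environ["DB_HOST"], 1521,
                            service_name=os.environ["DB_SERVICE"])
    with cx_Oracle.connect(os.environ["DB_USER"],
                           os.environ["DB_PASSWORD"], dsn) as conn:
        with conn.cursor() as cur:
            # Ask RDS to copy the object from S3 into the DATA_PUMP_DIR directory
            # on the DB instance; the call returns a task id you can poll later.
            cur.execute(
                """
                SELECT rdsadmin.rdsadmin_s3_tasks.download_from_s3(
                         p_bucket_name    => :bucket,
                         p_s3_prefix      => :key,
                         p_directory_name => 'DATA_PUMP_DIR')
                FROM dual
                """,
                bucket=bucket, key=key,
            )
            task_id = cur.fetchone()[0]
            return {"task_id": task_id}
```

The TASK_ID returned here is what the "TASK_ID fetch & operational handling" above refers to: a follow-up step (same Lambda or a separate one) can poll it and then run the actual table load.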

What is the most optimal way to automate data (csv file) transfer from s3 to Redshift without AWS Pipeline?

I am trying to take SQL data stored in a CSV file in an S3 bucket, transfer the data to AWS Redshift, and automate that process. Would writing ETL scripts with Lambda/Glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the most optimal way to pipeline data from S3 to Redshift?
I tried using AWS Data Pipeline, but it is not available in my region. I also looked at the AWS documentation for Lambda and Glue, but couldn't find an exact solution to the problem.
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (eg psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
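A minimal sketch of such a Lambda, assuming psycopg2 is bundled with the deployment, the cluster credentials sit in environment variables, and the target table name and COPY IAM role are placeholders:

```python
import os

import psycopg2  # bundled with the deployment package or a Lambda layer

def lambda_handler(event, context):
    """Issue a Redshift COPY for the object that triggered the S3 event."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=int(os.environ.get("REDSHIFT_PORT", "5439")),
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            # 'my_table' and the IAM role ARN are placeholders for your own names.
            cur.execute(
                f"""
                COPY my_table
                FROM 's3://{bucket}/{key}'
                IAM_ROLE '{os.environ["REDSHIFT_COPY_ROLE_ARN"]}'
                FORMAT AS CSV IGNOREHEADER 1
                """
            )
        conn.commit()
    finally:
        conn.close()
```

For the scheduled variant, the same function can be wired to a CloudWatch Events (EventBridge) rule instead of an S3 trigger, with the bucket and key read from the environment rather than the event.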
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simple Python-based christianhxc/aws-lambda-redshift-copy: AWS Lambda function that runs the copy command into Redshift
A more fully-featured node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog

Is there any service on AWS that can help me convert mp4 files to mp3?

I'm new to Amazon Web Services and I'm wondering if the platform offers any solution to convert media files to different formats (e.g. MP4 to MP3), or do I have to use a Lambda function with a third-party library to achieve this?
Thank you !
You can get up and running quickly with Elastic Transcoder. You will need to:
create two S3 buckets, your 'inbox' and 'outbox'
add a transcoder pipeline specifying which buckets are your in/out buckets, and what file types you want to transcode from and to.
you can set up a trigger so that every time something hits the in bucket the process runs, or you can place something in the in bucket and use the SDK or CLI to trigger a job.
Two things to note:
When you fire a job, you have to pass in the name of the file that will be created. If the file already exists in the out bucket, an error will be thrown.
As with all of AWS's managed services, you get a little for free up front, then it gets expensive. Once you get the hang of it, you can save some money by rolling your own in Lambda like this
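To show what "use the SDK to trigger a job" looks like, here is a hedged boto3 sketch. The pipeline ID is whatever the console gave you when you created the pipeline, and the preset ID below is a placeholder: look up the system preset ID for MP3 output in the Elastic Transcoder console.

```python
import boto3

transcoder = boto3.client("elastictranscoder", region_name="us-east-1")

def transcode_to_mp3(input_key: str) -> str:
    """Submit an Elastic Transcoder job that converts an MP4 in the 'inbox'
    bucket to an MP3 in the 'outbox' bucket (buckets are set on the pipeline)."""
    response = transcoder.create_job(
        PipelineId="1111111111111-abcde1",                # placeholder pipeline id
        Input={"Key": input_key},
        Output={
            "Key": input_key.rsplit(".", 1)[0] + ".mp3",  # must not already exist in the out bucket
            "PresetId": "1351620000001-300040",           # placeholder: MP3 system preset id
        },
    )
    return response["Job"]["Id"]
```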

process s3 access logs using AWS datapipeline

My use case is to process S3 access logs (having those 18 fields) periodically and push them to a table in RDS. I'm using AWS Data Pipeline for this task, running every day to process the previous day's logs.
I decided to split the task into two activities
1. Shell Command Activity: to process the S3 access logs and create a CSV file
2. Hive Activity: to read data from the CSV file and insert it into the RDS table.
My input S3 bucket has lots of log files, so the first activity fails due to an out-of-memory error while staging. However, I don't want to stage all the logs; staging the previous day's logs is enough for me. I searched around the internet but didn't find any solution. How do I achieve this? Is my solution the optimal one? Does any better solution exist? Any suggestions will be helpful.
Thanks in advance
You can define your S3 data node to use timestamps. For example, you can say the directory path is
s3://yourbucket/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}
since your log files should have a timestamp in the name (or they could be organized into timestamped directories).
This will only stage the files matching that pattern.
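For reference, an S3DataNode with a parameterised directoryPath can be added to the pipeline definition programmatically with boto3. This is only a sketch under assumptions: the pipeline ID and bucket name are placeholders, and the daily layout of the log directories is assumed.

```python
import boto3

datapipeline = boto3.client("datapipeline")

# An S3DataNode whose directoryPath is resolved per run from the schedule time,
# so each run only stages the previous day's directory. The pipeline id and
# bucket name are placeholders.
s3_input_node = {
    "id": "S3AccessLogsInput",
    "name": "S3AccessLogsInput",
    "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath",
         "stringValue": "s3://yourbucket/#{format(minusDays(@scheduledStartTime, 1), 'YYYY-MM-dd')}/"},
    ],
}

datapipeline.put_pipeline_definition(
    pipelineId="df-00000000AAAAAAAAAAAA",   # placeholder pipeline id
    pipelineObjects=[s3_input_node],        # plus your activities, schedule, etc.
)
```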
You may be recreating a solution that is already done by Logstash (or more precisely the ELK stack).
http://logstash.net/docs/1.4.2/inputs/s3
Logstash can consume S3 files.
Here is a thread on reading access logs from S3
https://groups.google.com/forum/#!topic/logstash-users/HqHWklNfB9A
We use Splunk (not free), which has the same capabilities through its AWS plugin.
May I ask why you are pushing the access logs to RDS?
ELK might be a great solution for you. You can build it on your own or use ELK-as-a-service from Logz.io (I work for Logz.io).
It enables you to easily define an S3 bucket, have all your logs read regularly from the bucket and ingested by ELK, and view them in preconfigured dashboards.