How to automate cleaning of data in AWS using Jupyter Notebook

I have a Jupyter Notebook file that cleans the data file (.csv) in S3. The cleaning process is taken care of...
However, I want to be able to automatically apply this cleaning process to every file that is uploaded to the S3 bucket. Each file will have the exact same data format. I am thinking of maybe using AWS Glue, but I'm not sure where to start. If we can skip the upload to S3 and go straight into Glue, that would be interesting to explore...
The end goal is to load the clean data into QuickSight and also into Amazon SageMaker for ML applications.
Any advice on how to approach this?
Thanks

A simple AWS Lambda function with an AWS EventBridge rule can do this. Check out this link: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html
Use the link above to set up the cron schedule or event (whatever your use case is), write your S3 clean-up code in the Lambda function (Lambda supports multiple programming languages), and then let Lambda take care of it.
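For illustration, here is a minimal sketch of such a Lambda handler in Python, assuming the notebook's cleaning steps can be ported into a function and that the function is wired to an S3 ObjectCreated notification (a scheduled EventBridge rule would instead list the bucket); the destination bucket name and the body of clean_dataframe are placeholders:

```python
import urllib.parse

import boto3
import pandas as pd  # requires a Lambda layer or container image that bundles pandas

s3 = boto3.client("s3")

CLEAN_BUCKET = "my-clean-bucket"  # placeholder: destination bucket for cleaned files


def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the cleaning steps currently in the Jupyter notebook."""
    return df.dropna().drop_duplicates()


def handler(event, context):
    # Wired to an S3 "ObjectCreated" event, so each record describes one uploaded CSV.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        obj = s3.get_object(Bucket=bucket, Key=key)
        df = pd.read_csv(obj["Body"])

        cleaned = clean_dataframe(df)

        # Write the cleaned file to a separate bucket/prefix so the trigger
        # does not fire again on our own output.
        s3.put_object(
            Bucket=CLEAN_BUCKET,
            Key=f"cleaned/{key}",
            Body=cleaned.to_csv(index=False).encode("utf-8"),
        )
```

The cleaned objects in the second bucket can then be picked up by QuickSight or SageMaker.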

Related

AWS: extract tar.gz on S3

As I'm new to AWS and a little confused by all the similar services, I would like some leads and to know whether I'm heading in the right direction.
I have tar.gz archives stored in S3 Glacier Deep Archive. When a restore is requested, I would like the archive to be automatically extracted and the folders and files it contains to be put in S3 (with an expiration date).
These archives are too big to be extracted via Lambda (300 GB or more).
My idea would be to trigger a Lambda function when the restore is complete and use that Lambda function to start another AWS service that does the extraction. I was thinking either AWS Batch or Fargate. Which service do you think is the most suitable? For this kind of simple task, is it preferable to use an ARM architecture?
If someone has already done this before and has code to share, I'm interested (if not, I'll try to put my final solution here for others).
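A minimal sketch of the Lambda-to-Batch handoff described in that idea, assuming an S3 event notification on s3:ObjectRestore:Completed and an existing AWS Batch job queue and job definition (all names below are placeholders; the container behind the job definition would do the actual download, untar, and re-upload):

```python
import boto3

batch = boto3.client("batch")

# Placeholders: an existing Batch job queue and a job definition whose container
# runs the actual "download, extract, re-upload to S3" work.
JOB_QUEUE = "tar-extract-queue"
JOB_DEFINITION = "tar-extract-job"


def handler(event, context):
    # Triggered by an S3 event notification on "s3:ObjectRestore:Completed",
    # i.e. the Glacier Deep Archive restore of the tar.gz has finished.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        batch.submit_job(
            # Batch job names only allow letters, numbers, hyphens and underscores.
            jobName="extract-" + key.replace("/", "-").replace(".", "-"),
            jobQueue=JOB_QUEUE,
            jobDefinition=JOB_DEFINITION,
            containerOverrides={
                "environment": [
                    {"name": "SRC_BUCKET", "value": bucket},
                    {"name": "SRC_KEY", "value": key},
                ]
            },
        )
```

Running the same container as a Fargate task via ECS RunTask would follow the same pattern; Batch mainly adds queueing and retries on top.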

AWS: how to trigger MediaConvert automatically after video upload

I am new to AWS. Most of the examples I have seen need an input file name from the S3 bucket for MediaConvert. I want to automate this process. What is the best way to do it? I want to achieve the following:
An API to upload a video (mp4) to an S3 bucket.
Trigger a MediaConvert job to process the newly uploaded video and convert it to HLS.
I know how to create an API as well as a MediaConvert job. What I need help with is automating this workflow. How can I pass the recently uploaded video to the MediaConvert job dynamically?
I think this should actually cover what you're looking for, and is straight from the source:
https://aws.amazon.com/blogs/media/vod-automation-part-1-create-a-serverless-watchfolder-workflow-using-aws-elemental-mediaconvert/
Essentially, you'll be making use of AWS Lambda, a serverless code-execution product. Lambda works by letting you hook directly into "triggers" or events from within the AWS ecosystem (like uploading a file to S3).
The Lambda function can then execute code in a number of supported languages like JavaScript or Python, which can be used to start a MediaConvert job on the triggering object (the file uploaded to S3).
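As a rough illustration of that flow (not the exact code from the blog post), here is a sketch of such a Lambda in Python, assuming you have already created a MediaConvert job template that outputs HLS and an IAM role for MediaConvert; the template name and role ARN are placeholder environment variables:

```python
import os
import urllib.parse

import boto3

# Placeholders: set these as Lambda environment variables.
MEDIACONVERT_ROLE_ARN = os.environ["MEDIACONVERT_ROLE_ARN"]
JOB_TEMPLATE = os.environ["JOB_TEMPLATE"]  # e.g. a template whose output group produces HLS

# MediaConvert uses account-specific endpoints, so look one up first.
_bootstrap = boto3.client("mediaconvert")
_endpoint = _bootstrap.describe_endpoints(MaxResults=1)["Endpoints"][0]["Url"]
mediaconvert = boto3.client("mediaconvert", endpoint_url=_endpoint)


def handler(event, context):
    # Triggered by the S3 "ObjectCreated" event when the mp4 is uploaded.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        input_uri = f"s3://{bucket}/{key}"

        mediaconvert.create_job(
            Role=MEDIACONVERT_ROLE_ARN,
            JobTemplate=JOB_TEMPLATE,
            Settings={
                # The input is supplied dynamically; everything else (HLS output
                # group, destination bucket, etc.) comes from the job template.
                "Inputs": [{"FileInput": input_uri}]
            },
        )
```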

Data migration from S3 to RDS

I am working on a requirement where I am doing a multipart upload of a CSV file from an on-prem server to an S3 bucket.
To achieve this, I create a presigned URL with AWS Lambda and use this URL to upload the CSV file. Now, once I have the file in S3, I want it to be moved to an AWS RDS Oracle DB. Initially I was planning to use AWS Lambda for this as well.
So once the file is in S3, it triggers a Lambda (S3 event) and the Lambda pushes the file to RDS. But the issue with this is the file size (600 MB).
I am looking for some other way where, whenever a file is uploaded to S3, it triggers an AWS service and that service pushes the CSV file to RDS. I have gone through AWS DMS / Data Pipeline, but I am not able to find any way to automate this migration.
I need to automate this migration on every S3 upload, in a way that is also cost-effective.
Set up S3 integration and build stored procedures (SPROCs) to help automate the load. Details found here.
UPDATE:
Looks like you don't even need to create a SPROC. You can just use the RDS procedure as outlined here. You would then create an event-driven Lambda function that is triggered on a given S3 event (e.g. on object PUT, POST, COPY, etc.) and that receives the S3 metadata needed to access the object. Here is a simple Python example of what that Lambda and its config might look like. You would then use the metadata passed on the trigger event, as outlined in the Python example, to dynamically build your procedure call and execute that procedure. You can also add the ensuing workflow logic that meets your requirements (i.e. TASK_ID fetch and operational handling, monitoring, etc.) to the same Lambda function, or separate those concerns by adding additional Lambdas. Hope this helps!
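In the same spirit (this is a rough sketch rather than the linked example), such a Lambda might look as follows, assuming S3 integration is already enabled on the RDS Oracle instance and that cx_Oracle plus the Oracle client libraries are packaged as a Lambda layer; connection details and the directory name are placeholders, and loading the downloaded file into a table (external table, SQL*Loader, etc.) would be a separate follow-up step:

```python
import os

import cx_Oracle  # must be packaged with the Oracle client libraries as a Lambda layer

# Placeholders: pull real values from environment variables / Secrets Manager.
DSN = cx_Oracle.makedsn(os.environ["DB_HOST"], 1521, service_name=os.environ["DB_SERVICE"])
DB_USER = os.environ["DB_USER"]
DB_PASSWORD = os.environ["DB_PASSWORD"]


def handler(event, context):
    # Triggered by the S3 PUT/POST/COPY event; the record carries the bucket and key.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    conn = cx_Oracle.connect(DB_USER, DB_PASSWORD, DSN)
    try:
        cur = conn.cursor()
        # Ask RDS to pull the file from S3 into a directory on the DB instance.
        # Parameter names follow the documented rdsadmin_s3_tasks API; verify
        # against the current RDS for Oracle documentation.
        cur.execute(
            """
            SELECT rdsadmin.rdsadmin_s3_tasks.download_from_s3(
                     p_bucket_name    => :bucket,
                     p_s3_prefix      => :key,
                     p_directory_name => 'DATA_PUMP_DIR')
              FROM dual
            """,
            bucket=bucket,
            key=key,
        )
        task_id = cur.fetchone()[0]
        # The download runs asynchronously; the TASK_ID can be used to poll its log.
        print(f"Started S3 download task {task_id} for s3://{bucket}/{key}")
    finally:
        conn.close()
```

Because the heavy lifting (moving the 600 MB file) happens on the RDS side, the Lambda itself only issues the call and stays well within its own limits.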

What is the most optimal way to automate data (csv file) transfer from s3 to Redshift without AWS Pipeline?

I am trying to take SQL data stored in a CSV file in an S3 bucket, transfer the data to AWS Redshift, and automate that process. Would writing ETL scripts with Lambda/Glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the most optimal way to pipeline data from S3 to Redshift?
I tried using AWS Data Pipeline, but it is not available in my region. I also tried to use the AWS documentation for Lambda and Glue, but couldn't find the exact solution to the problem.
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (eg psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
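For illustration, here is a bare-bones sketch of such a Lambda (Python, with psycopg2 bundled in the deployment package or a layer); the connection details, target table, and IAM role ARN are placeholders:

```python
import os
import urllib.parse

import psycopg2  # must be bundled with the deployment package or as a Lambda layer

# Placeholders: supply these via environment variables / Secrets Manager.
REDSHIFT_DSN = (
    f"host={os.environ['REDSHIFT_HOST']} port=5439 "
    f"dbname={os.environ['REDSHIFT_DB']} user={os.environ['REDSHIFT_USER']} "
    f"password={os.environ['REDSHIFT_PASSWORD']}"
)
TARGET_TABLE = "public.my_table"             # placeholder table
COPY_ROLE_ARN = os.environ["COPY_ROLE_ARN"]  # IAM role attached to the cluster


def handler(event, context):
    # Triggered by an S3 ObjectCreated event, so each record is one dropped CSV file.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        copy_sql = f"""
            COPY {TARGET_TABLE}
            FROM 's3://{bucket}/{key}'
            IAM_ROLE '{COPY_ROLE_ARN}'
            FORMAT AS CSV
            IGNOREHEADER 1;
        """

        conn = psycopg2.connect(REDSHIFT_DSN)
        try:
            with conn.cursor() as cur:
                cur.execute(copy_sql)
            conn.commit()
        finally:
            conn.close()
```

A scheduled rule works the same way, except the handler would list a known S3 prefix instead of reading the event records.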
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simple, Python-based christianhxc/aws-lambda-redshift-copy: AWS Lambda function that runs the copy command into Redshift
A more fully-featured, Node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog

S3 to RDS file management system

I'm new to AWS and have a feasibility question for a file management system I'm trying to build. I would like to set up a system where people use the Amazon S3 browser and drop either a CSV or Excel file into their specific bucket. Then I would like to automate the process of taking that CSV/Excel file and inserting it into a table within RDS. This is assuming that the table has already been built and that the CSV/Excel files will always be formatted the same and will be in the exact same place every single time. Is it possible to automate this process, or at least get it to a point where very minimal human interference is needed? I'm new to AWS, so I'm not exactly sure of the limits of S3 to RDS. Thank you in advance.
It's definitely possible. AWS supports notifications from S3 to SNS, which can be forwarded automatically to SQS: http://aws.amazon.com/blogs/aws/s3-event-notification/
S3 can also send notifications to AWS Lambda to run your own code directly.
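As a rough sketch of that direct S3-to-Lambda route, assuming a MySQL-flavoured RDS instance, an existing target table, and pymysql bundled with the function (all names below are placeholders):

```python
import csv
import io
import os

import boto3
import pymysql  # must be included in the deployment package

s3 = boto3.client("s3")

# Placeholders: real values would come from environment variables / Secrets Manager.
DB_CONFIG = dict(
    host=os.environ["DB_HOST"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)
INSERT_SQL = "INSERT INTO my_table (col_a, col_b, col_c) VALUES (%s, %s, %s)"  # placeholder


def handler(event, context):
    # Triggered directly by the S3 ObjectCreated notification on the user's bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        reader = csv.reader(io.StringIO(body))
        next(reader)  # skip the header row; assumes every file has one

        rows = [tuple(row) for row in reader]

        conn = pymysql.connect(**DB_CONFIG)
        try:
            with conn.cursor() as cur:
                cur.executemany(INSERT_SQL, rows)
            conn.commit()
        finally:
            conn.close()
```

Excel files would need an extra parsing step (e.g. converting them to CSV first), since the csv module only handles plain text.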