Run ETL python script in AWS triggered by S3

Run ETL python script in AWS triggered by S3 - amazon-web-services

I am new with AWS and don't know how to do the following. When I put an object in S3 I want to launch a python script that does some transformations and returns it to another path in S3. I've tried a lambda function but the process takes more than 300 seconds. I've also tried it with a Glue job but I don't know how to trigger it when I put the file in S3.
Does anyone know how to do it? Maybe I'm using the wrong AWS tools.

The simple solution for your problem is here:
Since you've already mentioned that you have AWS Glue job working to do this operation. And all you don't know is how to trigger glue job when file placed in s3, I am answering to that question.
You can write an AWS lambda using boto3 module which can be triggered based up on the s3 event and have setup glue.start_job_run command in your lambda function.
response = client.start_job_run(
JobName='string')
https://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.start_job_run
Note:: I strongly believe Glue is the right tool rather than lambda for your requirement that you mentioned in question, because AWS lambda have time out limitation. It will get timeout after 300 seconds.

One option would be to use SQS:
Create the SQS queue.
Setup S3 to send notifications to the SQS queue when new objects are added to the source bucket. See Configuring Amazon S3 Event Notifications.
Setup your Python script on an EC2 instance and listen to the SQS queue in your code.
Upload the output of your Python script into the target S3 bucket after script finished.
Can you break up the Python processing into smaller steps? I'd definitely recommend that you use Lambda instead of managing EC2 if you can get your code to run within the Lambda restrictions.

Related

How to set up a time scheduled serverless python job on AWS?

I'd like to peform the following tasks on a regular basis (e.g. every day at 6AM) using AWS:
get new set of data using API. This dataset is updated on a daily basis.
run a python script that would process the obtained dataset by the means of several python libraries like matplotlib, pandas, plotly
automatically send the output of the script, which would be a single pdf file or a html dashboard, via email to a group of specified recipients
I know how to perform all of the above items locally - my goal is to automate this routine. I'm new to AWS and would appreciate some advice on how to perform these tasks in a straightforward way. Based on the reading I did so far, it looks like the serverless approach may be able to do the job and also reduce the complexity, but I'm not sure which functionalities exactly I should use.

For scheduling you can use aws event bridge.
You can schedule AWS lambda or AWS Step Functions both of these are serverless :).
You can have 3 lambdas
To get the data and save it in S3/dynamo (if you want to persist the data)
Processor lambda and save the report to S3.
Another lambda to send email using AWS SES which will read the report from S3 and send it.
If you don't want to use step function you can start your lambda from S3 put event or you can trigger one lambda from another lambda using aws-sdk.

So there are different approaches you can take.
First off, I would create a Lambda. You can schedule the function to run on a cron job.
If the Message you want to send is small:
I would create a SNS Topic with a email fan out.
Inside your lambda you can then transform the data and send out via SNS.
Otherwise:
I would use SES and send a mail via the SES SDK.

Data migration from S3 to RDS

I am working on a requirement, where i am doing multipart upload of the csv file from on prem server to S3 Bucket.
To achieve this using AWS Lambda I create a presigned url and use this url i am uploading the csv file. Now, once i have the file in AWS S3, i want it to be moved to AWS RDS Oracle DB. Initially i was planning to use AWS Lambda for this.
So once i have the file in S3, it triggers lambda(s3 event) and lambda will push this file to RDS. But with this the issue is with the file Size(600 MB).
I am looking for some other way, where whenever there is a file uploaded to S3, it should trigger any AWS service and that service will push this csv file to RDS. I have gone through AWS DMS/Data Pipeline, but not able to find any way to automate this migration
I need to automate this migration on every s3 upload, that is also cost effective.

Setup S3 Integration and build SPROCS to help automate load. Details found here.
UPDATE:
Looks like you don't even need to create a SPROC. You can just use the RDS procedure as outlined here. You would then just create an event-driven lambda function that is triggered on a given S3 event--e.g. on object PUT(), POST(), COPY, etc..--which passes the S3 metadata requisite to access the event object. Here is a simple Python example of what that Lambda and config might look like. You would then use the metadata passed on the trigger event--as outlined in the Python example--to dynamically create your procedure call then execute that procedure. You can also add the ensuing workflow logic that meets your requirements--i.e. TASK_ID fetch & operational handling, monitoring, etc...--to the same lambda function or separate those concerns by adding additional lambdas. Hope this helps!

Edit image file in S3 bucket using AWS Lambda

Some images which is already uploaded on AWS S3 bucket and of course there is a lot of image. I want to edit and replace those images and I want to do it on AWS server, Here I want to use aws lambda.
I already can do my job from my local pc. But it takes a very long time. So I want to do it on server.
Is it possible?

Unfortunately directly editing file in S3 is not supported Check out the thread. To overcome the situation, you need to download the file locally in server/local machine, then edit it and re-upload it again to s3 bucket. Also you can enable versions
For node js you can use Jimp
For java: ImageIO
For python: Pillow
or you can use any technology to edit it and later upload it using aws-sdk.
For lambda function you can use serverless framework - https://serverless.com/
I have made youtube videos long back. This is related to how get started with aws-lambda and serverless
https://www.youtube.com/watch?v=uXZCNnzSMkI

You can trigger a Lambda using the AWS SDK.
Write a Lambda to process a single image and deploy it.
Then locally use the AWS SDK to list the images in the bucket and invoke the Lambda (asynchronously) for each file using invoke. I would also save somewhere which files have been processed so you can continue if something fails.
Note that the default limit for Lambda is 1000 concurrent executions, so to avoid reaching the limit you can send messages to an SQS queue (which then triggers the Lambda) or just retry when invoke throws an error.

Amazon S3 daily data processing

There is a data being stored on a s3 bucket in a daily basis, we are trying to automate parsing and processing that daily data being sent to s3 bucket, we already have the script that will parse the data, we just need to have the approach on the AWS how to automate this,the approach/use-case we thought was AWS batch that is scheduled to do the script on a daily basis or will get the latest data on that day before EOD, but seems like batch is incapable of doing it.
any ideas and approach? we've seen some approach like using Lambda and SQS/SNS
just to summarize:
data (Daily) > stored in S3 > data will be process by our team > stored to elastic search.
Thanks your ideas.

AWS Lambda is exactly what you want in this case. You can trigger lambda executing on S3 file showing up, that will process the file, and can then send it to ElasticSearch or wherever you want it to end up.
Here's an official explanation from AWS: https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html

You can use Lambda + cloud watch events to execute your code on a regular schedule. You can specify a fixed rate ( or you can specify a Cron expression ), for example, in your case, you can execute your lambda every 24 hours, this way, your logic for data processing will run once daily.
Take a look at this article from AWS : Schedule AWS Lambda Functions Using CloudWatch Events

how to copy a file from public network into aws s3 using AWS lambda in nodejs?

I want to copy a file from a public network, e.g:
http://myvideo.com/porn/tca-333.mp4
into my s3 bucket:
let's say:
s3://mypublicstorage.example.com/porn/tca-333.mp4
using AWS Lambda.
problem is, the video could be pretty big in size like 2 ~ 10 GB.
if I were to fetch it using request library like superagent, all of it, will probably be stored in RAM, hence it'll not be enough.
AWS Lambda itself has a limitation of only having 500ish MB disk iirc.
is it impossible to do this task by AWS Lambda after all?
my current code is more or less something like:
request.get(srcUrl).then(resp => S3.putObject({Body: resp.body, ...}))
any suggestion?

For this kind of a job, it is not possible simply using Lambda. You can setup an Architecture as follows.
AWS Lambda Function to push a Job to a Queue in Amazon SQS.
Setup an EC2 instance(s) to execute to pull the job from the Queue and download the file and upload it to S3. (Optionally to stop the EC2 once the job is done)
For low frequent workloads you can optionally Start the EC2 instance from Lambda and program to self shutdown after processing a job and finds the Queue is empty .

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Run ETL python script in AWS triggered by S3 - amazon-web-services

Related

How to set up a time scheduled serverless python job on AWS?

Data migration from S3 to RDS

Edit image file in S3 bucket using AWS Lambda

Amazon S3 daily data processing

how to copy a file from public network into aws s3 using AWS lambda in nodejs?

Categories

Resources