Python Script as a Cron on AWS S3 buckets - amazon-web-services

I have a Python script which copies files from one S3 bucket to another S3 bucket. This script needs to run every Sunday at a specific time. I was reading some articles and answers, so I tried to use AWS Lambda + CloudWatch Events. This job runs for a minimum of 30 minutes. Would Lambda still be a good fit, given that Lambda can only run for a maximum of 15 minutes? Or is there another way? I could create an EC2 box and run it as a cron, but that would be expensive. Is there any other standard way?

The more appropriate way would be to use an AWS Glue Python shell job, as it falls under the serverless umbrella and you are charged as you go.
This way you will only be charged for the time your code runs.
You also don't need to manage any EC2 instances for this. It is like an extended Lambda.
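As a rough illustration (the bucket names below are placeholders, not from the question), a Glue Python shell job can run the same boto3 copy logic you would otherwise put in a script:

import boto3

s3 = boto3.resource("s3")

SOURCE_BUCKET = "source-bucket"            # placeholder
DESTINATION_BUCKET = "destination-bucket"  # placeholder

def copy_all_objects():
    # Copy every object from the source bucket to the destination bucket
    for obj in s3.Bucket(SOURCE_BUCKET).objects.all():
        s3.meta.client.copy(
            {"Bucket": SOURCE_BUCKET, "Key": obj.key},
            DESTINATION_BUCKET,
            obj.key,
        )

if __name__ == "__main__":
    copy_all_objects()

The job itself can then be scheduled with a Glue trigger using a cron expression, or with an EventBridge rule that starts it every Sunday.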

If the two buckets are supposed to stay in sync, i.e. all files from bucket #1 should eventually be synced to bucket #2, then there are various replication options in S3.
Otherwise, look at S3 Batch Operations. You can derive the list of files you need to copy from S3 Inventory, which will also give you additional context on the files, such as date/time uploaded, size, storage class, etc.
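For reference, an S3 Batch Operations copy job can also be created programmatically. This is only a sketch: the account ID, role ARN, and manifest location below are placeholders, and the manifest would typically be a CSV of bucket,key pairs or an S3 Inventory report:

import boto3

s3control = boto3.client("s3control")

response = s3control.create_job(
    AccountId="111122223333",  # placeholder account ID
    Operation={
        # Copy each object listed in the manifest into the destination bucket
        "S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::destination-bucket"}
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::manifest-bucket/manifest.csv",  # placeholder
            "ETag": "example-etag",  # ETag of the manifest object
        },
    },
    Report={"Enabled": False},
    Priority=1,
    RoleArn="arn:aws:iam::111122223333:role/batch-operations-role",  # placeholder
    ConfirmationRequired=False,
)
print(response["JobId"])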

Unfortunately, the Lambda 15-minute execution limit is a hard stop, so Lambda isn't suitable for this use case as a single big-bang run.
You could use multiple Lambda invocations to go through the objects one at a time and move them. However, you would need a DynamoDB table (or something similar) to keep track of what has been moved and what has not.
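A minimal sketch of that pattern, assuming a hypothetical DynamoDB table named copy_checkpoint with a partition key named id; each invocation copies one page of objects and records where it stopped so the next invocation can resume:

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("copy_checkpoint")  # hypothetical table name

SOURCE_BUCKET = "bucket1"       # placeholder
DESTINATION_BUCKET = "bucket2"  # placeholder

def lambda_handler(event, context):
    # Resume from the last key recorded in DynamoDB, if any
    item = table.get_item(Key={"id": "checkpoint"}).get("Item", {})
    last_key = item.get("last_key", "")

    kwargs = {"Bucket": SOURCE_BUCKET, "MaxKeys": 1000}
    if last_key:
        kwargs["StartAfter"] = last_key
    page = s3.list_objects_v2(**kwargs)

    for obj in page.get("Contents", []):
        s3.copy_object(
            CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            Bucket=DESTINATION_BUCKET,
            Key=obj["Key"],
        )
        last_key = obj["Key"]

    # Record progress so the next invocation picks up where this one stopped
    table.put_item(Item={"id": "checkpoint", "last_key": last_key})
    return {"done": not page.get("IsTruncated", False), "last_key": last_key}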
A couple of other options would be:
S3 Replication, which will keep one bucket in sync with the other.
An S3 Batch Operations job.
Or, if these are data files, you can always use AWS Glue.

You can certainly use Amazon EC2 for a long-running batch job.
A t3.micro Linux instance costs $0.0104 per hour, and a t3.nano is half that price, charged per-second.
Just add a command at the end of the User Data script that will shut down the instance:
sudo shutdown -h now
If you launch the instance with Shutdown Behavior = Terminate, then the instance will self-terminate.

Related

Loading files from an AWS S3 bucket to Snowflake tables

I want to copy files from an S3 bucket to Snowflake. To do this I'm using a Lambda function. In the S3 bucket I have folders, and in every folder there are many CSV files. These CSV files can be small or huge. I have created a Lambda function that loads these files into Snowflake. The problem is that the Lambda function can only run for 15 minutes. That's not enough to load all the files into Snowflake. Can you help me with this problem? One solution I have is to execute the Lambda with only one file at a time instead of all files.
As you said, the maximum execution time for a Lambda function is 15 minutes, and it is not a good idea to load the whole file into memory, because you will have high costs from execution time and high memory usage.
But if you really want to use Lambdas and you are dealing with files over 1 GB, perhaps you should consider AWS Athena, or optimize your AWS Lambda function to read the file using a stream instead of loading the whole file into memory.
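As a rough sketch of the streaming approach (the bucket, key, and the process helper are placeholders), boto3 lets you iterate over the object body line by line instead of reading the whole file at once:

import csv

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Stream the object instead of loading the whole file into memory
    response = s3.get_object(Bucket="my-bucket", Key="folder/huge-file.csv")  # placeholders
    lines = (line.decode("utf-8") for line in response["Body"].iter_lines())

    for row in csv.reader(lines):
        process(row)  # hypothetical per-row handler, e.g. batching inserts to Snowflake

def process(row):
    pass  # placeholder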
Another option may be to create an SQS message when the file lands on S3 and have an EC2 instance poll the queue and process it as necessary. For more information check here: Running cost-effective queue workers with Amazon SQS and Amazon EC2 Spot Instances.
The best option would be to automate Snowpipe with AWS Lambda; for this, check the Snowpipe docs: Automating Snowpipe with AWS Lambda.

Best AWS architecture solution for migrating data to cloud

Say I have 4 or 5 data sources that I access through API calls. The data aggregation and mining is all scripted in a Python file. Let's say the output is all structured data. I know there are plenty of considerations, but from a high level, what would some possible solutions look like if I ultimately wanted to run analysis in BI software?
Can I host the Python script in Lambda and set a daily trigger to run the Python file, and then have the output stored in RDS/Aurora? Or, since the applications I'm making API calls to aren't in AWS, would I need the data to be in an AWS instance before running a Lambda function?
Or host the Python script in an EC2 instance and use Lambda to trigger a daily refresh that just stores the data in EC2-EBS or Redshift?
Just starting to learn AWS cloud architecture, so my knowledge is fairly limited. It just seems like there can be multiple solutions to any problem, so I'm not sure if the two ideas above are viable.
You've mentioned two approaches that would work. Ultimately it depends very much on your use case, budget, etc., and you are right: in AWS there are usually several solutions that can solve the same problem. For example, another possible solution could be to Dockerize your Python script and run it on a container service (ECS/EKS). But considering you just started with AWS, I will focus on the two approaches you mentioned, as they are probably the most common ones.
In short, based on your description, I would not suggest going with EC2, because it adds complexity to your use case and, moreover, extra costs. If you imagine the final setup, you will need to configure and manage the instance itself: its instance type, AMI, your script deployment, internet access, subnets, etc. Also, a minor thing to clarify: you would probably set a cron expression on the instance to trigger your script (not a Lambda reaching into the EC2!). As you can see, that's quite a big setup for little benefit (except maybe gaining some experience with AWS ;)), and the instance would be idle most of the time, which is far from optimal.
If you just have to run a daily Python script and need to store the output somewhere, I would suggest using Lambda for the processing. You can simply have a scheduled event (the preferred way is now Amazon EventBridge) that triggers your Lambda function once a day. Then, depending on your output and how you need to process it, you can of course use RDS from Lambda via the Python SDK, but you can also use S3 as blob storage if you don't need to run specific queries (for example, if you can store your output in JSON format).
Note that one limitation of Lambda is that it can only run for 15 minutes straight per execution. The good thing is that by default Lambda has internet access, so you don't need to worry about any gateway setup and can reach your external endpoints.
Also, from a cost perspective, running one Lambda per day combined with S3 should be free or almost free, as Lambda pricing is very cheap. Running an EC2 instance or RDS (which is also an instance) 24/7 will cost you some money.
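A rough sketch of that setup (the bucket name and API endpoint are placeholders): the Lambda pulls from your sources and drops the aggregated output into S3 as JSON, and the daily schedule itself would be an EventBridge rule such as cron(0 3 * * ? *) targeting the function:

import datetime
import json
import urllib.request

import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = "my-analytics-bucket"  # placeholder

def lambda_handler(event, context):
    # Call the external (non-AWS) API; the endpoint is a placeholder
    with urllib.request.urlopen("https://api.example.com/data") as resp:
        records = json.loads(resp.read())

    # Store the aggregated output as a dated JSON object in S3
    key = "daily/{}.json".format(datetime.date.today().isoformat())
    s3.put_object(
        Bucket=OUTPUT_BUCKET,
        Key=key,
        Body=json.dumps(records).encode("utf-8"),
        ContentType="application/json",
    )
    return {"written": key}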
Lambda with storage in S3 is the way to go. EC2 / EBS costs add up over time and EC2 will limit the parallelism you can achieve.
Look into Step Functions as a way to organize and orchestrate your Lambdas. I have Python code that copies 500K+ files to S3 and takes a week to run; if I copy the files in parallel (500 or so at a time) the process takes about 10 hours. The parallelism is limited by the source system, as I can overload it by going wider. The main Lambda launches the file-copy Lambdas at a controlled rate, terminates after a few minutes of run time, and returns the last file updated to the controlling Step Function. The Step Function then restarts the main Lambda where the last one left off.
Since you have multiple sources, you can have multiple top-level Lambdas running in parallel, all from the same Step Function, each launching a controlled number of worker Lambdas. You won't overwhelm S3, but you will want to make sure you don't overload your sources.
The best part of this is that it costs pennies (at the scale I'm using it).
Once the data is in S3 I'm copying it up to Redshift and transforming it. These processes are also part of the Step Function through additional Lambda Functions.
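A stripped-down sketch of that controller pattern (the function names and checkpoint field are assumptions, not the author's actual code): the controller fans out asynchronous worker invocations, stops well before the 15-minute limit, and hands its stopping point back to the Step Function, which decides whether to invoke it again:

import json

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

SOURCE_BUCKET = "source-bucket"           # placeholder
WORKER_FUNCTION = "copy-one-file-worker"  # hypothetical worker Lambda

def lambda_handler(event, context):
    last_key = event.get("last_key", "")

    kwargs = {"Bucket": SOURCE_BUCKET}
    if last_key:
        kwargs["StartAfter"] = last_key

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(**kwargs):
        for obj in page.get("Contents", []):
            # Fire-and-forget: each worker copies a single file
            lambda_client.invoke(
                FunctionName=WORKER_FUNCTION,
                InvocationType="Event",
                Payload=json.dumps({"key": obj["Key"]}),
            )
            last_key = obj["Key"]

        # Stop with a minute to spare; the Step Function re-invokes us with last_key
        if context.get_remaining_time_in_millis() < 60000:
            return {"done": False, "last_key": last_key}

    return {"done": True, "last_key": last_key}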

AWS lambda function for copying data into Redshift

I am new to the AWS world and I am trying to implement a process where data written into S3 by AWS EMR can be loaded into AWS Redshift. I am using Terraform to create S3, Redshift, and the other supporting functionality. For loading the data I am using a Lambda function which gets triggered when the Redshift cluster is up. The Lambda function has the code to copy the data from S3 to Redshift. Currently the process seems to work fine, and the amount of data is currently low.
My question is:
This approach seems to work right now, but I don't know how it will behave once the volume of data increases, and what happens if the Lambda function times out.
Can someone please suggest an alternate way of handling this scenario, even if it can be handled without Lambda? One alternative I came across while searching this topic is AWS Data Pipeline.
Thank you
A serverless approach I've recommended clients move to in this case is the Redshift Data API (and Step Functions if needed). With the Redshift Data API you can launch a SQL command (COPY) and close your Lambda function. The COPY command will run to completion and, if this is all you need to do, then you're done.
If you need to take additional actions after the COPY, then you need a polling Lambda that checks when the COPY completes. This is enabled by the Redshift Data API. Once the COPY completes you can start another Lambda to run the additional actions. All these Lambdas and their interactions are orchestrated by a Step Function that:
launches the first Lambda (which initiates the COPY)
has a wait loop that calls the "status checker" Lambda every 30 seconds (or whatever interval you want) and keeps looping until the checker says that the COPY completed successfully
launches the additional-actions Lambda once the status checker Lambda says the COPY is complete
The Step Function is an action sequencer and the Lambdas are the actions. There are a number of frameworks that can set up the Lambdas and the Step Function as one unit.
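A rough sketch of the two Lambdas involved, using boto3's redshift-data client (the cluster, database, user, table, role, and bucket below are placeholders):

import boto3

client = boto3.client("redshift-data")

def start_copy_handler(event, context):
    # First Lambda: kick off the COPY and return immediately
    response = client.execute_statement(
        ClusterIdentifier="my-cluster",  # placeholder
        Database="dev",                  # placeholder
        DbUser="awsuser",                # placeholder
        Sql=(
            "COPY my_table FROM 's3://my-bucket/prefix/' "
            "IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role' "
            "FORMAT AS PARQUET"
        ),
    )
    return {"statement_id": response["Id"]}

def status_checker_handler(event, context):
    # Polling Lambda: report whether the COPY has finished
    status = client.describe_statement(Id=event["statement_id"])["Status"]
    # Status is one of SUBMITTED, PICKED, STARTED, FINISHED, FAILED, ABORTED
    return {"statement_id": event["statement_id"], "status": status}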
With bigger datasets, as you already know, Lambda may time out. But 15 minutes is still a lot of time, so you can implement an alternative solution in the meantime.
I wouldn't recommend Data Pipeline, as it might be overkill (it will start an EC2 instance to run your commands). Your problem is simply the timeout, so you could use either ECS Fargate or a Glue Python shell job. Either of them can be triggered by a CloudWatch Event rule fired on an S3 event.
a. Using ECS Fargate, you'll have to take care of the Docker image and set up the ECS infrastructure, i.e. a task definition and a cluster (simple for Fargate).
b. Using a Glue Python shell job, you'll simply have to deploy your Python script to S3 (along with the required packages as wheel files) and link those files in the job configuration.
Both of these options are serverless, and you may choose one based on ease of deployment and your comfort level with Docker.
ECS doesn't have any timeout limit, while the timeout limit for Glue is 2 days.
Note: to trigger an AWS Glue job from a CloudWatch Event, you'll have to use a Lambda function, as CloudWatch Events doesn't yet support starting a Glue job as a target.
Reference: https://docs.aws.amazon.com/eventbridge/latest/APIReference/API_PutTargets.html
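A minimal sketch of that Glue-starting Lambda (the job name is a placeholder):

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Triggered by the CloudWatch Event rule; starts the Glue job
    response = glue.start_job_run(JobName="s3-to-redshift-copy")  # placeholder job name
    return {"JobRunId": response["JobRunId"]}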

Moving Across AWS Regions: us-east-1 to us-east-2

I have the following currently created in the AWS us-east-1 region, and per the request of our AWS architect I need to move it all to us-east-2, completely, and continue developing in us-east-2 only. What are the options with the least work and coding (as this is a one-time deal) to move everything?
S3 bucket with a ton of folders and files.
Lambda function.
AWS Glue database with a ton of crawlers.
AWS Athena with a ton of tables.
Thank you so much for taking a look at my little challenge :)
There is no easy answer for your situation. There are no simple ways to migrate resources between regions.
Amazon S3 bucket
You can certainly create another bucket and then copy the content across, either using the AWS Command-Line Interface (CLI) aws s3 sync command or, for a huge number of files, using S3DistCp running under Amazon EMR.
If there are previous Versions of objects in the bucket, it's not easy to replicate them. Hopefully you have Versioning turned off.
Also, it isn't easy to get the same bucket name in the other region. Hopefully you will be allowed to use a different bucket name. Otherwise, you'd need to move the data elsewhere, delete the bucket, wait a day, create the same-named bucket in another region, then copy the data across.
AWS Lambda function
If it's just a small number of functions, you could simply recreate them in the other region. If the code is stored in an Amazon S3 bucket, you'll need to move the code to a bucket in the new region.
AWS Glue
Not sure about this one. If you're moving the data files, you'll need to recreate the database anyway. You'll probably need to create new jobs in the new region (but I'm not that familiar with Glue).
Amazon Athena
If your data is moving, you'll need to recreate the tables anyway. You can use the Athena interface to show the DDL commands required to recreate a table. Then, run those commands in the new region, pointing to the new S3 bucket.
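For example, you can pull the DDL with a SHOW CREATE TABLE query, either from the console or programmatically (the database, table, and output location below are placeholders):

import boto3

athena = boto3.client("athena")

# Ask Athena for the DDL of an existing table; the result lands in the output location
response = athena.start_query_execution(
    QueryString="SHOW CREATE TABLE my_table",                              # placeholder table
    QueryExecutionContext={"Database": "my_database"},                     # placeholder database
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/"},  # placeholder bucket
)
print(response["QueryExecutionId"])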
AWS Support
If this is an important system for your company, it would be prudent to subscribe to AWS Support. They can provide advice and guidance for these types of situations, and might even have some tools that can assist with a migration. The cost of support would be minor compared to the savings in your time and effort.
Is it possible for you to create CloudFormation stacks (from existing resources) using the console, then copy the contents of those stacks and run them in the other region (replacing values where they need to be)?
See this link: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/resource-import-new-stack.html

How to schedule 'aws s3 sync s3://bucket1 s3://bucket2' using AWS resources?

I want to schedule aws s3 sync s3://bucket1 s3://bucket2 command to run everyday at defined time, say 3 AM.
What options do we have to schedule this using AWS resources like Lambda, etc.?
I saw many people using the Windows scheduler, but as this is an S3-to-S3 sync, it isn't a great option to use a server's Windows scheduler to run this command through the CLI.
This sounds like a case of The X-Y Problem. That is, it's likely "scheduling an AWS CLI command to run" is not your underlying problem. I'd urge you to consider whether your problem is actually "getting one S3 bucket to exactly replicate the contents of another".
On this point, you have multiple options. These fall broadly into two categories:
Actively sync objects from bucket A to bucket B. This can be done using any number of methods already mentioned, including your idea of scheduling the AWS CLI command.
Lean on S3's built-in replication, which is probably what you want.
The reason AWS implemented S3 replication is to solve exactly this problem. Unless you have additional considerations (if you do, please update your question, so we can better answer it :) ), replication is likely your best, easiest, and most reliable option.
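For reference, a replication rule can be set up programmatically as well as in the console. This is only a sketch, assuming versioning is already enabled on both buckets and that the role ARN below (a placeholder) is allowed to replicate on your behalf:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="bucket1",  # source bucket (versioning must be enabled on both buckets)
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/s3-replication-role",  # placeholder
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = replicate all new objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::bucket2"},
            }
        ],
    },
)

Keep in mind that replication only applies to objects written after the rule is in place; existing objects need a one-time copy (or S3 Batch Replication).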
There are so many ways to do this; I'll elaborate on the ones I use.
CloudWatch Events to trigger whatever is going to perform your task. You can use it just like a crontab.
Lambda functions:
1 - give the Lambda function an IAM role that allows reading from bucket1 and writing to bucket2, and then call the API directly (see the sketch after the tutorial link below).
2 - since the AWS CLI is a Python tool, you could embed the AWS CLI as a Python dependency and use it within your function.
Here's a link to a tutorial:
https://bezdelev.com/hacking/aws-cli-inside-lambda-layer-aws-s3-sync/
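As a rough sketch of option 1 (the bucket names are placeholders), the function can diff the two buckets with the API and copy whatever is missing, rather than shelling out to the CLI:

import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "bucket1"       # placeholder
DESTINATION_BUCKET = "bucket2"  # placeholder

def list_keys(bucket):
    # Return the set of all object keys in a bucket
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        keys.update(obj["Key"] for obj in page.get("Contents", []))
    return keys

def lambda_handler(event, context):
    # Copy only the objects that don't already exist in the destination
    missing = list_keys(SOURCE_BUCKET) - list_keys(DESTINATION_BUCKET)
    for key in missing:
        s3.copy_object(
            CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
            Bucket=DESTINATION_BUCKET,
            Key=key,
        )
    return {"copied": len(missing)}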
Docker+ECS Fargate:
0 - pick any Docker image with the AWS CLI preinstalled, like this one
1 - create an ECS Fargate cluster (it will cost you nothing)
2 - create an ECS task definition, and inside it use the image you chose in step 0 and set the command to "aws s3 sync s3://bucket1 s3://bucket2"
3 - create a schedule that will use the task definition created in step 2
Additional considerations:
Those are the ones I would use. You could also have CloudWatch trigger a CloudFormation template that creates an EC2 instance and uses the user data field to run the sync, or you could create an AMI of an EC2 instance that runs the sync command and then a halt command from /etc/rc.local, and several other options would work as well. But I'd advise you to go with the Lambda option, unless your sync job takes more than 15 minutes (which is Lambda's timeout), in which case I'd go with the Docker option.