Problem:
We need to transfer all files (CSV format) stored in an AWS S3 bucket to an on-premise LAN folder using Lambda functions. This will be a scheduled task that runs every hour, and on each run the files will be transferred from S3 to the on-premise LAN folder again, replacing the existing ones. The files are not large (preferably under a few MB).
I have not been able to find any AWS managed service that accomplishes this task.
I am a newbie to AWS, so any solution to this problem is most welcome.
Thanks,
Actually, I am looking for a solution by which I can push S3 files to an on-premise folder automatically.
For that you need to make the on-premise network visible to the logic (Lambda or whatever else) "pushing" the content. The default solution is AWS Site-to-Site VPN.
There are multiple options for setting up the VPN; you can choose based on your needs.
Then the on-premise network will look just like another subnet.
However, a VPN has its own complexity and cost. In most cases it is much easier to "pull" the data from the on-premise environment.
To sync data there are multiple options. For a managed service, I could point to the S3 File Gateway (part of AWS Storage Gateway), which, based on your description, sounds like massive overkill.
Maybe you could start with a simple cron job (or a scheduled task if working with Windows) that runs a CLI command to sync the S3 content or just copy specified files.
Check out S3 Sync, I think it will help you accomplish this task: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html#examples
To run any AWS CLI command on your computer, you will need to set up credentials, and the configured account/role must have permissions to do the task (e.g. access S3).
Check out AWS CLI setup here: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
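As a rough illustration of the "pull" approach described above, here is a minimal Python sketch that a cron job or Windows scheduled task could run every hour. The bucket name, prefix, and local folder are placeholders, not values from the question, and it assumes the AWS CLI is installed and already configured with credentials. Note that sync only downloads files that are new or changed, which gives the same end state as replacing everything.

```python
import shutil
import subprocess


def build_sync_command(bucket, prefix, local_dir):
    """Return the argv for `aws s3 sync` from an S3 prefix to a local folder."""
    # Normalise to exactly one trailing slash so "exports" and "exports/" behave the same.
    source = f"s3://{bucket}/{prefix}".rstrip("/") + "/"
    # Exclude everything, then re-include only CSV files.
    return ["aws", "s3", "sync", source, local_dir,
            "--exclude", "*", "--include", "*.csv"]


def pull_csv_files(bucket, prefix, local_dir):
    """Run the sync; raises CalledProcessError if the CLI exits with an error."""
    if shutil.which("aws") is None:
        raise RuntimeError("AWS CLI not found on PATH")
    subprocess.run(build_sync_command(bucket, prefix, local_dir), check=True)


# Example call (hypothetical names), to be placed in the scheduled script:
#   pull_csv_files("example-bucket", "exports", "/mnt/lan-share/exports")
```

Scheduling is left to the operating system, so nothing in AWS needs network access to the LAN.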
Related
Use case:
I have one directory on-premise, and I want to back it up, let's say every midnight, and restore it if something goes wrong.
It doesn't seem like a complicated task, but reading through the AWS documentation, even this can be cumbersome and costly. Setting up Storage Gateway locally seems unnecessarily complex for a simple task like this, and running it on EC2 is costly as well.
What I have done:
Reading through this + some other blog posts:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
https://docs.aws.amazon.com/storagegateway/latest/userguide/WhatIsStorageGateway.html
What I have found:
1. Setting up a file gateway (locally or as an EC2 instance):
It just mounts the files to S3, and that's it. So my on-premise app will constantly write to this S3 bucket. The documentation doesn't mention anything about scheduled backup and recovery.
2. Setting up a volume gateway:
Here I can make a scheduled synchronization/backup to S3, but using a whole volume for it would be a big overhead.
3. Standalone S3:
Just using a bare S3 bucket and copying my backup there via the AWS API/SDK with a manually created scheduled job.
Solutions:
Using point 1 from above, enable versioning and the versions of the files will serve as a recovery point.
Using point 3
I think I am looking for a mix of file and volume gateway: working at the file level and making an asynchronous scheduled snapshot of the files.
How should this be handled? Isn't there a really easy way to just send a backup of a directory to AWS?
The easiest way to backup a directory to Amazon S3 would be:
Install the AWS Command-Line Interface (CLI)
Provide credentials via the aws configure command
When required run the aws s3 sync command
For example
aws s3 sync folder1 s3://bucketname/folder1/
This will copy any files from the source to the destination. It will only copy files that have been added or changed since a previous sync.
Documentation: sync — AWS CLI Command Reference
If you want to be more fancy and keep multiple backups, you could copy to a different target directory, or create a zip file first and upload the zip file, or even use a backup program like Cloudberry Backup that knows how to use S3 and can do traditional-style backups.
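The "create a zip file first and upload the zip file" idea above could look roughly like the following Python sketch. The archive naming scheme, the `backups` prefix, and the function names are illustrative assumptions; the upload step relies on the third-party boto3 package and on credentials being configured.

```python
import datetime
import shutil
import tempfile
from pathlib import Path


def make_backup_archive(source_dir, work_dir=None, today=None):
    """Zip source_dir into work_dir as backup-YYYY-MM-DD.zip and return its path.

    work_dir should live outside source_dir so the archive does not include itself.
    """
    today = today or datetime.date.today()
    work_dir = work_dir or tempfile.gettempdir()
    base = Path(work_dir) / f"backup-{today.isoformat()}"
    # make_archive appends the ".zip" extension and returns the final file name.
    return Path(shutil.make_archive(str(base), "zip", root_dir=source_dir))


def upload_archive(archive_path, bucket, prefix="backups"):
    """Upload the archive to s3://bucket/prefix/<file name> (assumed key layout)."""
    import boto3  # third-party; imported here so the zip step works without it

    key = f"{prefix}/{Path(archive_path).name}"
    boto3.client("s3").upload_file(str(archive_path), bucket, key)
    return key
```

Because each day gets its own object key, older backups stay in the bucket until you delete them or an S3 lifecycle rule expires them.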
Can someone explain to me the best way to transfer data from a hard drive on an EC2 instance (running Windows Server 2012) to an S3 bucket in the same AWS account on a daily basis?
Background:
I'm generating a .csv file for one of our business partners daily at 11:00 am and I want to deliver it to S3 (he has access to our S3 bucket).
After that he can pull it out of S3 manually or automatically whenever he wants.
I hope you can help me. I only found manual solutions with the CLI, but no automated way to do daily transfers.
Best Regards
You can mount S3 buckets directly as drives on your EC2 instances. This way you don't even need triggers or a daily task scheduler paired with a third-party service, as the objects would be directly available in the S3 bucket.
For Linux typically you would use Filesystem in Userspace (FUSE). Take a look at this repo if you need it for Linux: https://github.com/s3fs-fuse/s3fs-fuse.
Regarding Windows, there is this tool:
https://tntdrive.com/mount-amazon-s3-bucket.aspx
If these tools don't suit you, or if you don't want to mount the S3 bucket directly, here is another option: whatever you can do with the CLI you should be able to do with the SDK. Therefore, if you can code in one of the languages AWS Lambda supports (C#, Java, Go, PowerShell, Python, Node.js, Ruby), you could automate that using a Lambda function along with a daily scheduler triggering at 11 a.m.
Hope this helps!
Create a small application that uploads your file to an S3 bucket (there are some examples here). Then use Task Scheduler to execute your application on a regular basis.
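A "small application" of this kind might be as short as the Python sketch below. It is only one possible shape: the function name, the `reports` prefix, and the example paths are made up for illustration, and the S3 client is passed in so the logic can be tried without AWS access. With boto3, the client's `upload_file(Filename, Bucket, Key)` method does the transfer.

```python
from pathlib import Path


def upload_report(s3_client, local_path, bucket, prefix="reports"):
    """Upload one file to s3://bucket/prefix/<file name> and return the key."""
    path = Path(local_path)
    if not path.is_file():
        raise FileNotFoundError(path)
    key = f"{prefix}/{path.name}"
    # boto3 S3 clients expose upload_file(Filename, Bucket, Key).
    s3_client.upload_file(str(path), bucket, key)
    return key


# Typical wiring on the EC2 instance (requires boto3 and credentials,
# for example an instance role with s3:PutObject on the bucket):
#   import boto3
#   upload_report(boto3.client("s3"), r"C:\exports\daily.csv", "example-bucket")
```

Task Scheduler would then run this script shortly after the CSV is generated.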
I'm working on a problem at the moment where I want to export a file from an EC2 instance running a Windows AMI to an S3 bucket at four-hour intervals. Currently, the architecture I'm thinking of is as follows.
1. CloudWatch Events rule using scheduled trigger
2. Rule triggers Lambda function to run
3. Lambda function would use some form of the AWS CLI on the Windows EC2 instance to extract (sync, cp, etc.) the file
4. File is placed in the S3 bucket
Does anyone see a path that's more efficient than this one? I want to ensure that I'm handling this in the most straightforward manner. Thanks in advance for any input!
It is quite difficult to have external code (eg an AWS Lambda function) cause something to execute on a Windows computer. You could use Systems Manager Run Command, but that's a rather complex solution.
It would be much simpler to have the Windows computer push the files to Amazon S3:
Create a scheduled task in Windows
Use aws s3 cp or aws s3 sync to copy the files to Amazon S3
Done!
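For comparison, the Systems Manager Run Command route mentioned above could be sketched as a Lambda handler like the one below. This is an assumption-heavy outline rather than a tested deployment: it presumes the instance is managed by Systems Manager, has the AWS CLI installed, and has an instance role allowed to write to the bucket. The event field names are invented for the example.

```python
def build_run_command_request(instance_id, local_path, bucket, key):
    """Build the keyword arguments for the SSM send_command call."""
    command = f'aws s3 cp "{local_path}" "s3://{bucket}/{key}"'
    return {
        "InstanceIds": [instance_id],
        # AWS-managed document that runs PowerShell on Windows instances.
        "DocumentName": "AWS-RunPowerShellScript",
        "Parameters": {"commands": [command]},
    }


def lambda_handler(event, context, ssm_client=None):
    """Ask the instance to copy the file itself; returns the SSM command ID."""
    if ssm_client is None:
        import boto3  # available by default in the Lambda Python runtime

        ssm_client = boto3.client("ssm")
    request = build_run_command_request(
        event["instance_id"], event["local_path"], event["bucket"], event["key"]
    )
    response = ssm_client.send_command(**request)
    return response["Command"]["CommandId"]
```

The number of moving parts here (scheduled rule, Lambda, SSM permissions, agent on the instance) is what makes the plain scheduled task the simpler option.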
Your solution seems solid. Alternatively, you may want to write a daemon-like service (background process) that runs on each EC2 instance and transfers the data from that instance to S3. What I like about your solution is how easily you can control the scheduling centrally. For my distributed solution you can have the processes read from a central config, but that seems more complicated than the CloudWatch/Lambda solution.
For the EC2 process solution, this may be useful:
How to mount Amazon S3 Bucket as a Windows Drive, but it should be easy (and more scalable) to just use the AWS SDK instead to talk to S3
I am new to this forum and this technology and am looking for your advice. I am working on a POC, and my requirements are below. Could you please guide me on how to achieve the result?
Copy data from NAS to S3.
Use S3 as a source in EMR Job with target to S3/Redshift.
Any link or PDF will also be helpful.
Thanks,
Pardeep
There's a lot here that you're asking and there's not a lot of info on your use case to go by so I'm going to be very general in my answer and hopefully it at least points you in the right direction.
You can use Lambda to copy data from your NAS to S3. Assuming your NAS is on-premise and assuming you have a VPN into your VPC or even Direct Connect configured, then you can use a VPC enabled Lambda function to read from the NAS on-premise and write to S3.
If your NAS is running on EC2 the above will remain the same except there's no need for VPN or Direct Connect.
Are you looking to kick off the EMR job from Lambda? You can use S3 as a source for EMR to then output to S3 either from within Lambda or via other means as well.
If you can provide more info on your use case we could probably give you a better quality answer.
Copy data from NAS to S3.
It really depends on the amount of data and the frequency at which you run the copy job. If the data is in GBs, then you can install the AWS CLI on a machine where the NFS share is attached. AWS CLI commands like cp can be multithreaded and can easily copy your datasets to S3. You might also enable S3 Transfer Acceleration to speed things up. Having AWS Direct Connect to your company network can also speed up any transfers from on-premise to AWS.
http://docs.aws.amazon.com/cli/latest/topic/s3-config.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
https://aws.amazon.com/directconnect/
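If you prefer the SDK over the CLI for the GB-scale case, the same idea (walk the mounted share and upload several files in parallel) can be sketched in Python as follows. The function names, the worker count, and the key layout are assumptions for illustration; the client is expected to behave like a boto3 S3 client, which can be shared across threads.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def iter_upload_jobs(root_dir, prefix):
    """Yield (local path, S3 key) pairs for every file under root_dir."""
    root = Path(root_dir)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = Path(dirpath) / name
            # Keep the folder structure, always with forward slashes in the key.
            rel = path.relative_to(root).as_posix()
            yield path, f"{prefix}/{rel}"


def upload_tree(s3_client, root_dir, bucket, prefix, workers=8):
    """Upload all files concurrently and return the list of keys written."""
    jobs = list(iter_upload_jobs(root_dir, prefix))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(s3_client.upload_file, str(path), bucket, key)
            for path, key in jobs
        ]
        for future in futures:
            future.result()  # re-raise any upload error
    return [key for _path, key in jobs]


# Example (hypothetical names; requires boto3 and credentials):
#   import boto3
#   upload_tree(boto3.client("s3"), "/mnt/nas/data", "example-bucket", "nas")
```

For a recurring job, `aws s3 sync` is usually less work because it skips unchanged files; this sketch re-uploads everything on each run.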
If the data is in TBs (and probably distributed across multiple volumes), then you might have to consider physical transfer options like AWS Snowball, AWS Import/Export, or AWS Snowmobile, depending on the use case.
https://aws.amazon.com/cloud-data-migration/
Use S3 as a source in EMR Job with target to S3/Redshift.
Again, as there are a lot of applications on EMR, there are a lot of choices. Redshift supports COPY/UNLOAD commands to and from S3, which any application can make use of. If you want to use Spark on EMR, then installing the Databricks spark-redshift driver is a viable option for you.
https://github.com/databricks/spark-redshift
https://databricks.com/blog/2015/10/19/introducing-redshift-data-source-for-spark.html
https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/
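To make the COPY route a bit more concrete, here is a small Python helper that assembles a Redshift COPY statement for CSV files that an EMR job has written to S3. The table name, bucket, prefix, and role ARN in the example are placeholders, and the helper does no escaping, so it is only suitable for trusted, hard-coded inputs. You would run the resulting SQL through whatever Redshift client you already use.

```python
def build_copy_statement(table, bucket, prefix, iam_role_arn, ignore_header_rows=1):
    """Return a Redshift COPY statement that loads CSV files from an S3 prefix."""
    source = f"s3://{bucket}/{prefix}"
    return (
        f"COPY {table} "
        f"FROM '{source}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS CSV "
        f"IGNOREHEADER {int(ignore_header_rows)};"
    )


# Example with placeholder values:
#   build_copy_statement(
#       "public.sales", "example-bucket", "emr-output/",
#       "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
#   )
```

The IAM role must be attached to the Redshift cluster and allowed to read the S3 prefix.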
I want to add an export functionality to transfer some data from S3 to another selected cloud storage by the user.
I am already doing it with a simple Node.js server on an Amazon t2.micro instance that gets the file from S3 and pipes it into a POST request to the desired cloud storage.
The problem with this solution is scalability and network saturation of my Amazon infrastructure.
I recently discovered AWS Lambda and thought it would be the perfect solution for my feature, but then I saw that a function can't run for more than 300s, and some of my files may take longer than 300s to transfer.
I know there are some services like mover.io which handle that, but they don't support some of the cloud storage I need to export to.
What do you suggest?
Thank you.