I have millions of files in different folders in an S3 bucket.
The files are very small. I wish to download all the files that are
under a folder named VER1. The folder VER1 contains many subfolders,
and I wish to download all of the files under all of the subfolders of VER1
(e.g. VER1 -> sub1 -> file1.txt, VER1 -> sub1 -> subsub1 -> file2.txt, etc.).
What is the fastest way to download all the files?
Using s3 cp? s3 sync?
Is there a way to download all the files located under the folder in parallel?
Use the AWS Command-Line Interface (CLI):
aws s3 sync s3://bucket/VER1 [name-of-local-directory]
From my experience, it will download in parallel but it won't necessarily use the full bandwidth because there is a lot of overhead for each object. (It is more efficient for large objects, since there is less overhead.)
It is possible that aws s3 sync might have problems with a large number of files. You'd have to try it to see whether it works.
If you really wanted full performance, you could write your own code that downloads with massive parallelism, but the time saved would probably be lost in the time it takes you to write and test such a program.
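For what it's worth, here is a minimal sketch of what such a program could look like, assuming boto3 is installed and using placeholder bucket and folder names (adjust them to your own):

import os
from concurrent.futures import ThreadPoolExecutor
import boto3

BUCKET = "my-bucket"         # placeholder bucket name
PREFIX = "VER1/"             # the folder to download
LOCAL_DIR = "ver1-download"  # local destination directory

s3 = boto3.client("s3")

def download(key):
    # Recreate the sub-folder structure locally, then fetch the object.
    target = os.path.join(LOCAL_DIR, key)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    s3.download_file(BUCKET, key, target)

# List every object under the prefix (paginated, since there are millions).
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []) if not obj["Key"].endswith("/"))

# Small objects spend most of their time on per-request overhead,
# so a large thread pool helps keep the connection busy.
with ThreadPoolExecutor(max_workers=64) as pool:
    list(pool.map(download, keys))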
Another option is to use aws s3 sync to download to an Amazon EC2 instance, then zip the files and simply download the zip file. That would reduce bandwidth requirements.
There are many compressed files in Google Cloud Storage.
I want to unzip each zipped file, rename it, and save it back to another bucket.
I've seen a lot of posts, but I couldn't find any approach other than downloading the file with gsutil and handling it locally.
Is there any other way?
To modify a file, such as unzipping, you must read, modify and then write. This means download, unzip and upload the extracted files.
Use gsutil or another tool, one of the SDKs, or the REST APIs. To unzip a file, use a zip tool or one of the libraries that support zip operations.
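For example, here is a minimal local-machine sketch of that download/unzip/upload flow, assuming the google-cloud-storage Python library and placeholder bucket names:

import io
import zipfile
from google.cloud import storage

SRC_BUCKET = "my-zipped-bucket"      # placeholder source bucket
DEST_BUCKET = "my-unzipped-bucket"   # placeholder destination bucket

client = storage.Client()
dest = client.bucket(DEST_BUCKET)

for blob in client.list_blobs(SRC_BUCKET):
    if not blob.name.endswith(".zip"):
        continue
    data = blob.download_as_bytes()          # read (download)
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for member in zf.namelist():
            if member.endswith("/"):
                continue                     # skip directory entries
            # "Rename" by prefixing each entry with the zip's base name
            # (a placeholder naming scheme; use whatever you need).
            new_name = f"{blob.name[:-4]}/{member}"
            dest.blob(new_name).upload_from_string(zf.read(member))   # write (upload)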
You can use Cloud Functions for this. A Cloud Function can be triggered when an object is created/finalized, and it will do the job automatically.
You can also use the same function to iterate over all of the zip files, do the tasks, and move the files to another bucket. One thing to be aware of: it may not be able to process all of the files in a single run, so for the files that already exist in the bucket you can run the program from a local machine.
Here is the list of client libraries for connecting to Cloud Storage.
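As a rough sketch of that trigger-based approach (again assuming the google-cloud-storage library and a placeholder destination bucket), a background Cloud Function could look something like this:

import io
import zipfile
from google.cloud import storage

DEST_BUCKET = "my-unzipped-bucket"   # placeholder destination bucket

client = storage.Client()

def unzip_gcs_object(event, context):
    # Triggered by google.storage.object.finalize on the source bucket.
    name = event["name"]
    if not name.endswith(".zip"):
        return
    src_blob = client.bucket(event["bucket"]).blob(name)
    data = src_blob.download_as_bytes()
    dest = client.bucket(DEST_BUCKET)
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for member in zf.namelist():
            if member.endswith("/"):
                continue   # skip directory entries
            new_name = f"{name[:-4]}/{member}"   # strip ".zip" and use it as a prefix
            dest.blob(new_name).upload_from_string(zf.read(member))

Note that a very large zip may not fit in the function's memory, which is another reason to handle the existing backlog from a local machine as suggested above.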
I would like to run an aws s3 sync command daily to update my hard drive backup on S3. Most of the time there will be no changes. The problem is that the s3 sync command takes days to check for changes (for a 4 TB HDD). What is the quickest way to update a hard drive backup on S3?
If you want to back up your own computer to Amazon S3, I would recommend using a backup utility that knows how to use S3. These utilities can do smart things like compressing data, tracking files that have changed, and setting an appropriate Storage Class.
For example, I use Cloudberry Backup on a Windows computer. It does regular checking for new/changed files and uploads them to S3. If I delete a file locally, it waits 90 days before deleting it from S3. It can also handle multiple versions of files, rather than always overwriting files.
I would recommend only backing up data folders (e.g. My Documents). There is no benefit to backing up your operating system or temporary files, because you would not restore the OS from a remote backup.
While some backup utilities can compress files individually or in groups, experience has taught me never to do so, since it can make restoration difficult if you do not have the original backup software (and remember -- backups last years!). The great thing about S3 is that it is easy to access from many devices -- I have often grabbed documents from my S3 backup via my phone when I'm away from home.
Bottom line: Use a backup utility that knows how to do backups well. Make sure it knows how to use S3.
I would recommend using a backup tool that can synchronize with Amazon S3. For example, for Windows you can use Uranium Backup. It syncs with several clouds, including Amazon S3.
It can be scheduled to perform daily backups and also incremental backups (in case there are changes).
I think this is the best way, considering the tediousness of daily manual syncing. Plus, it runs in the background and notifies you of any error or success logs.
This is the solution I use; I hope it helps you.
We have a vendor that provides us with data files (4-5 files, ~10 GB each) on a monthly basis. They provide these files on their FTP site, which we connect to using the username and password they provide.
We download the zip files, unzip them, extract some relevant files, gzip them, and upload them to our S3 bucket, and from there we push the data to Redshift.
Currently I have a Python script that runs on an EC2 instance and does all of this, but I am sure there's a better "serverless" solution out there (ideally in the AWS environment) that can do this for me, since this doesn't seem to be a very unique use case.
I am looking for recommendations / alternate solutions for processing these files.
Thank you.
I have an Amazon S3 bucket with tons of images. A subset of these images needs to be synced to a local machine for image analysis (AI) purposes. This has to be done regularly, and ideally with a list of file names as input. Not all images need to be synced.
There are ways to synchronise S3 with either Dropbox/Amazon Drive or other storage services, but none of them appear to have the option to provide a list of files that need to be synced.
How can this be implemented?
The first thing that springs to mind when talking about syncing and S3 is the aws s3 sync CLI command. This will allow you to sync specific origin and destination folders, and it gives you the ability to use --include and --exclude if you want to list specific files. The command also allows the use of wildcards (*) if you have specific naming conventions you can use to identify the files.
You can also repeat the --exclude (or --include) flag for multiple files, so depending on your OS you could either list all the files or create a find script that identifies the files and singles them out.
Additionally, you can use --delete, which will remove any files in the destination path that are not in the origin.
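If the required file names live in a plain text file, one way to apply that idea is to build the sync command from the list. Here is a rough sketch, assuming the AWS CLI is installed and using placeholder bucket, directory, and list-file names:

import subprocess

BUCKET = "s3://my-image-bucket"    # placeholder bucket
LOCAL_DIR = "./images"             # local destination
LIST_FILE = "files-to-sync.txt"    # assumed format: one object key per line

with open(LIST_FILE) as f:
    keys = [line.strip() for line in f if line.strip()]

# Exclude everything, then explicitly include each requested file.
cmd = ["aws", "s3", "sync", BUCKET, LOCAL_DIR, "--exclude", "*"]
for key in keys:
    cmd += ["--include", key]

subprocess.run(cmd, check=True)

With a very long list the command line can get unwieldy, in which case copying each file individually (e.g. with aws s3 cp or an SDK call) may be simpler.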
As much as I would like to answer, I felt it would be good to comment with one's thoughts initially to check that they are in line with the OP's! But I see the comments are being used to provide answers to gain points :)
I would like to submit my official answer!
Ans:
If I get this correctly, I would use the AWS CLI with the --include and --exclude filters.
https://docs.aws.amazon.com/cli/latest/reference/s3/index.html#use-of-exclude-and-include-filters
So I have a use case where I need to put files from on-prem FTP to S3.
The size of each file (XML) is 5 KB max.
The number of files is 100 per minute.
The use case is such that as soon as files arrive at the FTP location, I need to put them into the S3 bucket immediately.
What could be the best way to achieve that?
Here are my options:
Using the AWS CLI at my FTP location (push mechanism).
Using Lambda (pull mechanism).
Writing a Java application to put the files into S3 from FTP.
Or is there anything built in that I can leverage?
Basically, I need to put the files into S3 as soon as possible, because a UI is built on top of S3 and if a file does not arrive immediately I might be in trouble.
The easiest would be to use the AWS Command-Line Interface (CLI), or an API call if you wish to do it from application code.
It doesn't really make sense to do it via Lambda, because Lambda would need to somehow retrieve the file from FTP and then copy it to S3 (so it would be doing double work).
You can certainly write a Java application to do it, or simply call the AWS CLI (which is written in Python), since it will work out-of-the-box.
You could either use aws s3 sync to copy all new/updated files, or copy specific files with aws s3 cp. If you have so many files, it's probably best to specify the files otherwise it will waste time scanning many historical files that don't need to be copied.
The ultimate best case would be for the files to be sent to S3 directly, without involving FTP at all!
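If you go with the push approach using the SDK rather than the CLI, a small script running at the FTP location can upload each file as soon as it lands. Here is a rough polling sketch, assuming boto3 and placeholder directory/bucket names:

import os
import time
import boto3

FTP_DIR = "/data/ftp/incoming"   # placeholder FTP landing directory
BUCKET = "my-ui-bucket"          # placeholder destination bucket

s3 = boto3.client("s3")
uploaded = set()

while True:
    for name in os.listdir(FTP_DIR):
        path = os.path.join(FTP_DIR, name)
        if name in uploaded or not os.path.isfile(path):
            continue
        s3.upload_file(path, BUCKET, name)   # object key = file name; adjust as needed
        uploaded.add(name)
    time.sleep(1)   # at ~100 small files per minute, polling every second is plenty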