What will be faster: copying files separately to S3, or moving them inside the bucket?

I have a few directories with about 1000 files each. Each directory is about 30 GB in size.
I need to upload these directories to an S3 bucket, but each file needs to be in a separate directory.
I'm using AWS SDK for uploading.
What would be the most efficient way to do it?
Copy the directories as they are and then move the files inside the S3 bucket to their final destinations?
Run a separate command for each file?
With the first approach, I hope the AWS libraries will handle parallelism while uploading better than I could with the second.
But the second approach saves me the time spent moving (in fact, copying and deleting) objects inside S3.
Regards
Pawel
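A minimal sketch of the second approach, assuming the SDK in use is boto3 (Python); the question does not say which SDK it is. The helper names plan_uploads and upload_plan and the key layout (one prefix per file, named after the file stem) are hypothetical. The idea is to compute each file's final key up front so every object is uploaded straight to its destination and nothing has to be copied and deleted inside the bucket afterwards.

```python
import os


def plan_uploads(local_dir, dest_prefix=""):
    """Pair every file directly under local_dir with its final S3 key.

    Hypothetical layout: <dest_prefix>/<file stem>/<file name>, so each
    file ends up in its own "directory" in the bucket.
    """
    plan = []
    for name in sorted(os.listdir(local_dir)):
        path = os.path.join(local_dir, name)
        if not os.path.isfile(path):
            continue  # this sketch ignores nested directories
        stem = os.path.splitext(name)[0]
        parts = (dest_prefix.strip("/"), stem, name)
        key = "/".join(p for p in parts if p)
        plan.append((path, key))
    return plan


def upload_plan(plan, bucket):
    """Upload each planned file directly to its final key."""
    import boto3  # imported lazily so plan_uploads works without the SDK

    s3 = boto3.client("s3")
    for path, key in plan:
        s3.upload_file(path, bucket, key)
```

Because the plan is just a list of (path, key) pairs, it can also be handed to a thread pool instead of the sequential loop shown here.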

Related

Is it feasible to maintain directory structure when backing up to AWS S3 Glacier classes?

I am trying to back up 2 TB from a shared drive on a Windows Server to S3 Glacier.
There are maybe 100 folders (some may be nested) and perhaps 5000 files (some small, like spreadsheets and photos, and others larger, like server images). My first question is: what counts as an object here?
Let's say I have Folder 1, which has 10 folders inside it. Each of the 10 folders has 100 files.
Would number of objects be 1 folder + (10 folders * 100 files) = 1001 objects?
I am trying to understand how folder nesting is treated in S3. Do I have to manually create each folder as a prefix and then upload each file inside that using AWS CLI? I am trying to recreate the shared drive experience on the cloud where I can browse the folders and download the files I need.
Amazon S3 does not actually support folders. It might look like it does, but it actually doesn't.
For example, you could upload an object to invoices/january.txt and the invoices directory will just magically 'appear'. Then, if you deleted that object, the invoices folder would magically 'disappear' (because it never actually existed).
So, feel free to upload objects to any location without creating the directories first.
However, if you click the Create folder button in the Amazon S3 management console, it will create a zero-length object with the name of the directory. This will make the directory 'appear' and it would be counted as an object.
The easiest way to copy the files from your Windows computer to an Amazon S3 bucket would be:
aws s3 sync directoryname s3://bucket-name/ --storage-class DEEP_ARCHIVE
It will upload all files, including files in subdirectories. It will not create the folders, since they aren't necessary. However, the folders will still 'appear' in S3.
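To make the counting concrete, here is a small sketch in plain Python (no AWS calls) of how folders can be derived purely from object keys. The function name visible_folders is hypothetical, and the sketch assumes that a folder created with the console's Create folder button is stored as a zero-length object whose key ends in "/".

```python
def visible_folders(keys):
    """Return every folder path implied by a collection of object keys."""
    folders = set()
    for key in keys:
        parts = key.split("/")
        # Every proper prefix of the key shows up as a folder.
        for i in range(1, len(parts)):
            folders.add("/".join(parts[:i]) + "/")
    return folders


# The example from the question: Folder1 holds 10 folders of 100 files each.
keys = [
    f"Folder1/sub{i}/file{j}.txt"
    for i in range(10)
    for j in range(100)
]
object_count = len(keys)                 # only the files are objects
folder_count = len(visible_folders(keys))  # Folder1/ plus its 10 subfolders
```

Under these assumptions the bucket holds 1000 objects, not 1001, while 11 folders still appear when browsing. A folder would only add to the object count if it had been created explicitly in the console.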

AWS: Speed up copy of large number of very small files

I have a single bucket with a large number of very small text files (between 500 bytes and 1.2 KB). This bucket currently contains over 1.7 million files and will keep growing.
The way that I add data to this bucket is by generating batches of files (on the order of 50,000 files) and transferring those files into the bucket.
Now the problem is this: if I transfer the files one by one in a loop, it takes an unbearably long time. So, if all the files are in a directory origin_directory, I would run
aws s3 cp origin_directory/filename_i s3://my_bucket/filename_i
I would run this command 50,000 times.
Right now I'm testing this on a set of about 280K files. Doing this would take approximately 68 hours according to my calculations. However I found out that I can sync:
aws s3 sync origin_directory s3://my_bucket/
Now this works much, much faster (it will take about 5 hours, according to my calculations). However, the sync needs to figure out what to copy (files present in the directory and not present in the bucket). Since the number of files in the bucket will keep growing, I expect this to take longer and longer as time goes on.
However, since I delete the information after every sync, I know that the sync operation needs to transfer all files in that directory.
So my question is, is there a way to start a "batch copy" similar to the sync, without actually doing the sync?
You can use:
aws s3 cp --recursive origin_directory/ s3://my_bucket/
This is the same as a sync, but it will not check whether the files already exist.
Also, see Use of Exclude and Include Filters to learn how to specify wildcards (eg all *.txt files).
When copying a large number of files using aws s3 sync or aws s3 cp --recursive, the AWS CLI will parallelize the copying, making it much faster. You can also play with the AWS CLI S3 Configuration to potentially optimize it for your typical types of files (eg copy more files simultaneously).
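The same idea, many small transfers in flight at once, can be sketched in Python with a thread pool. This is only an illustration of the principle, not how the AWS CLI is implemented. The function copy_batch and its upload_one parameter are hypothetical; upload_one stands for any callable that uploads a single file, for example a small wrapper around boto3's upload_file.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def copy_batch(paths, upload_one, max_workers=10):
    """Call upload_one(path) for every path using a pool of threads.

    Returns (uploaded_count, failures), where failures maps each path
    that raised to the exception it raised.
    """
    uploaded = 0
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_path = {pool.submit(upload_one, p): p for p in paths}
        for future in as_completed(future_to_path):
            path = future_to_path[future]
            try:
                future.result()
                uploaded += 1
            except Exception as exc:  # keep going; report failures at the end
                failures[path] = exc
    return uploaded, failures
```

For files of around 1 KB, most of the time per file is request latency rather than data transfer, so keeping several requests in flight is where the speed-up comes from. The best value for max_workers depends on the network and is something to measure rather than assume.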
Try using https://github.com/mondain/jets3t
It performs the same function but works in parallel, so it should complete the job much faster.

Copy limited number of files from S3?

We are using an S3 bucket to store a growing number of small JSON files (~1KB each) that contain some build-related data. Part of our pipeline involves copying these files from S3 and putting them into memory to do some operations.
That copy operation is done via an AWS CLI command that looks something like this:
aws s3 cp s3://bucket-path ~/some/local/path/ --recursive --profile dev-profile
The problem is that the number of json files on S3 is getting pretty large since more are being made every day. It's nothing even close to the capacity of the S3 bucket since the files are so small. However, in practical terms, there's no need to copy ALL these JSON files. Realistically the system would be safe just copying the most recent 100 or so. But we do want to keep older ones around for other purposes.
So my question boils down to: is there a clean way to copy a specific number of files from S3 (maybe sorted by most recent)? Is there some kind of pruning policy we can set on an S3 bucket to delete files older than X days or something?
The aws s3 sync command in the AWS CLI sounds perfect for your needs.
It will copy only files that are New or Modified since the last sync. However, it means that the destination will need to retain a copy of the 'old' files so that they are not copied again.
Alternatively, you could write a script (eg in Python) that lists the objects in S3 and then only copies objects added since the last time the copy was run.
You can set lifecycle policies on the S3 bucket that will remove objects after a certain period of time.
To copy only objects from the last few days, you will need to write a script.
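Such a script could look roughly like the sketch below, assuming Python with boto3. The function names and the default of 100 are illustrative. Listing uses the list_objects_v2 paginator, whose entries carry Key and LastModified fields; the selection step is plain Python and works on any list of such entries.

```python
def most_recent(objects, n=100):
    """Return the n most recently modified entries, newest first."""
    return sorted(objects, key=lambda o: o["LastModified"], reverse=True)[:n]


def list_objects(bucket, prefix=""):
    """Yield every object entry under prefix, following pagination."""
    import boto3  # imported lazily so most_recent works without the SDK

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        yield from page.get("Contents", [])


def download_recent(bucket, prefix, local_dir, n=100):
    """Download only the n newest objects under prefix into local_dir."""
    import os

    import boto3

    s3 = boto3.client("s3")
    for obj in most_recent(list_objects(bucket, prefix), n):
        target = os.path.join(local_dir, os.path.basename(obj["Key"]))
        s3.download_file(bucket, obj["Key"], target)
```

Note that this still lists every object under the prefix in order to sort them, so only the downloads are reduced, not the listing.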

Merge multiple zip files on s3 into fewer zip files

We have a problem where some of the files in an S3 directory are in the ~500 MiB range, but many other files are only KiB or bytes in size. I want to merge all the small files into fewer, bigger files on the order of ~500 MiB.
What is the most efficient way to rewrite the data in an S3 folder without having to download it, merge it locally, and push it back to S3? Is there some utility or AWS command I can use to achieve this?
S3 is a storage service and has no compute capability. For what you are asking, you need compute (to merge). So you cannot do what you want without downloading, merging and uploading.
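For the merge step itself, one part that can be planned before anything is downloaded is deciding which small files go into which merged archive. The sketch below is a simple greedy grouping in plain Python. The function name group_small_files and the input shape (a mapping of object name to size in bytes, such as could be built from a bucket listing) are assumptions for illustration.

```python
def group_small_files(sizes, target=500 * 1024 * 1024):
    """Group names into batches whose total size stays within target bytes.

    sizes maps each object name to its size in bytes. Names are taken in
    sorted order; a file larger than target gets a batch of its own.
    """
    batches = []
    current = []
    current_size = 0
    for name, size in sorted(sizes.items()):
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then be downloaded, merged into one archive, and uploaded again, on whatever machine provides the compute.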

AWS S3 Sync very slow when copying to large directories

When syncing data to an empty directory in S3 using AWS-CLI, it's almost instant. However, when syncing to a large directory (several million folders), it takes a very long time before even starting to upload / sync the files.
Is there an alternative method? It looks like it's trying to take account of all files in an S3 directory before syncing - I don't need that, and uploading the data without checking beforehand would be fine.
The sync command will need to enumerate all of the files in the bucket to determine whether a local file already exists in the bucket and if it is the same as the local file. The more documents you have in the bucket, the longer it's going to take.
If you don't need this sync behavior just use a recursive copy command like:
aws s3 cp --recursive . s3://mybucket/
and this should copy all of the local files in the current directory to the bucket in S3.
If you use the unofficial s3cmd from S3 Tools, you can use the --no-check-md5 option while using sync to disable the MD5 sums comparison to significantly speed up the process.
--no-check-md5 Do not check MD5 sums when comparing files for [sync].
Only size will be compared. May significantly speed up
transfer but may also miss some changed files.
Source: https://s3tools.org/usage
Example: s3cmd --no-check-md5 sync /directory/to/sync s3://mys3bucket/
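The size-only comparison described in that option can be illustrated with a short Python sketch. It is a simplified model, not s3cmd's actual code; the function name needs_upload and the two size mappings are hypothetical.

```python
def needs_upload(local_sizes, remote_sizes):
    """Return the names to upload when comparing by size only.

    A file is selected when it is missing remotely or its size differs.
    Both arguments map a relative file name to a size in bytes.
    """
    return sorted(
        name
        for name, size in local_sizes.items()
        if remote_sizes.get(name) != size
    )


local = {"a.txt": 10, "b.txt": 20, "c.txt": 30}
remote = {"a.txt": 10, "b.txt": 25}
to_send = needs_upload(local, remote)  # b.txt changed size, c.txt is new
```

Two things follow from this model. A file whose content changed but whose size did not is skipped, which is the "may also miss some changed files" caveat in the help text. And building remote_sizes still requires listing everything under the destination prefix, which is why a sync against a very large directory is slow to start even with checksums disabled.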