Merge multiple zip files on S3 into fewer zip files

We have a problem where some of the files in an S3 directory are in the ~500 MiB range, but many other files are only kilobytes or bytes in size. I want to merge all the small files into fewer, larger files on the order of ~500 MiB.
What is the most efficient way to rewrite data in an S3 folder without having to download it, merge it locally, and push it back to S3? Is there some utility or AWS command I can use to achieve this?

S3 is a storage service and has no compute capability. What you are asking for requires compute (to perform the merge), so you cannot do it without downloading, merging, and uploading.
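For illustration, here is a minimal sketch of that download-merge-upload loop using boto3 and Python's zipfile module. The bucket name, prefix, and size thresholds are placeholders, and the bundling is by approximate original size only; you would run something like this on an EC2 instance, a Lambda function, or any machine with enough memory and network proximity to S3.

import io
import zipfile
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"               # placeholder
PREFIX = "my/prefix/"              # placeholder
TARGET_SIZE = 500 * 1024 * 1024    # aim for ~500 MiB per merged archive
SMALL_LIMIT = 10 * 1024 * 1024     # only bundle objects smaller than this

def small_objects():
    # Page through the prefix and yield the keys and sizes of the small objects.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if obj["Size"] < SMALL_LIMIT:
                yield obj["Key"], obj["Size"]

def upload_archive(buf, index):
    buf.seek(0)
    s3.upload_fileobj(buf, BUCKET, f"{PREFIX}merged/part-{index:05d}.zip")

buf = io.BytesIO()
zf = zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED)
bundled, index = 0, 0
for key, size in small_objects():
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    zf.writestr(key, body)            # store the object under its original key
    bundled += size
    if bundled >= TARGET_SIZE:        # close this archive and start the next one
        zf.close()
        upload_archive(buf, index)
        index += 1
        buf = io.BytesIO()
        zf = zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED)
        bundled = 0
if bundled:                           # flush the final partial archive
    zf.close()
    upload_archive(buf, index)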

Related

What will be faster: copying files separately to S3, or moving them inside the bucket?

I have a few directories with about 1,000 files each; each directory is about 30 GB.
I need to upload these directories to an S3 bucket, but each file needs to end up in a separate directory.
I'm using the AWS SDK for uploading.
What would be the most efficient way to do it?
1. Copy the directories as they are, then move the files inside the S3 bucket to their final destinations?
2. Run a separate command for each file?
With the first option, I hope the AWS libraries will handle parallelism during the upload better than I could with the second.
But the second approach saves me the time of moving (in fact, copying and deleting) objects inside S3 (a sketch of that approach follows below).
Regards
Pawel
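For what it's worth, here is a minimal sketch of the second approach with boto3, parallelized client-side with a thread pool. The bucket name, source directory, and key-mapping rule below are placeholders, not a recommendation.

import os
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"         # placeholder
SOURCE_DIR = "my_directory"  # placeholder

def final_key(local_path):
    # Placeholder mapping: put each file into its own "directory" named after it.
    name = os.path.basename(local_path)
    return f"{name}/{name}"

def upload(local_path):
    s3.upload_file(local_path, BUCKET, final_key(local_path))

local_files = [os.path.join(root, f)
               for root, _, files in os.walk(SOURCE_DIR)
               for f in files]

with ThreadPoolExecutor(max_workers=32) as pool:
    # Uploads go straight to the final keys, so no in-bucket copy + delete is needed.
    list(pool.map(upload, local_files))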

AWS: Speed up copy of large number of very small files

I have a single bucket with a large number of very small text files (between 500 bytes and 1.2 KB). This bucket currently contains over 1.7 million files and will keep growing.
The way I add data to this bucket is by generating batches of files (on the order of 50,000 files) and transferring those files into the bucket.
Now the problem is this: if I transfer the files one by one in a loop, it takes an unbearably long time. So if all the files are in a directory origin_directory, I would do
aws s3 cp origin_directory/filename_i s3://my_bucket/filename_i
I would run this command 50,000 times.
Right now I'm testing this on a set of about 280K files. Doing this would take approximately 68 hours, according to my calculations. However, I found out that I can use sync:
aws s3 sync origin_directory s3://my_bucket/
Now this works much, much faster (it will take about 5 hours, according to my calculations). However, sync needs to figure out what to copy (files present in the directory but not in the bucket). Since the number of files in the bucket will keep increasing, I'm thinking this will take longer and longer as time goes on.
Moreover, since I delete the local files after every sync, I know that the sync operation needs to transfer all the files in that directory anyway.
So my question is, is there a way to start a "batch copy" similar to the sync, without actually doing the sync?
You can use:
aws s3 cp --recursive origin_directory/ s3://my_bucket/
This is the same as a sync, but it will not check whether the files already exist.
Also, see Use of Exclude and Include Filters to learn how to specify wildcards (e.g., all *.txt files).
When copying a large number of files using aws s3 sync or aws s3 cp --recursive, the AWS CLI will parallelize the copying, making it much faster. You can also play with the AWS CLI S3 Configuration to potentially optimize it for your typical types of files (e.g., copy more files simultaneously).
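If you end up uploading through the SDK rather than the CLI, a rough Python analogue of those settings is boto3's TransferConfig. The values below are illustrative only; for lots of tiny files, parallelism across files (as in a thread pool) matters more than the multipart settings.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(
    max_concurrency=20,                   # threads used per transfer
    multipart_threshold=8 * 1024 * 1024,  # switch to multipart uploads above 8 MiB
    multipart_chunksize=8 * 1024 * 1024,
)
s3.upload_file("origin_directory/filename_i", "my_bucket", "filename_i",
               Config=config)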
Try using https://github.com/mondain/jets3t
It performs the same function but works in parallel, so it will complete the job much faster.

Upload list of files to GCS without for loop

I have to filter GCS objects using some search criteria and capture the filtered files in a Python list (the number of files will vary every time). I want to move them to a separate GCS folder in the same bucket. Currently I am looping through the list and calling gsutil cp for every file. As my list is dynamic, is there any way to implement this without a loop, since sometimes my list will have more than a million files?
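One possible approach, sketched below with the google-cloud-storage Python client rather than gsutil: copy each object to the destination prefix and delete the original, parallelized with a thread pool so you are not spawning a gsutil process per file. The bucket name, destination prefix, and worker count are placeholders.

from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")   # placeholder
DEST_PREFIX = "filtered/"             # placeholder destination "folder"

def move(name):
    blob = bucket.blob(name)
    new_name = DEST_PREFIX + name.rsplit("/", 1)[-1]
    bucket.copy_blob(blob, bucket, new_name=new_name)
    bucket.delete_blob(name)          # a "move" in GCS is a copy followed by a delete

filtered_names = ["path/obj1.csv", "path/obj2.csv"]   # placeholder for your dynamic list

with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(move, filtered_names))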

Does AWS S3 sync support calculating checksums of subsets of files?

I want to download a large file incrementally from an S3 bucket. Each new version of the file I download may change slightly. Therefore, if the s3 sync command computes a checksum of the entire new file, it will (likely) differ from the checksum of the old file, requiring the entire new file to be downloaded.
If, instead, s3 sync computes checksums on many small subsets of the file, it may find that only 1% of them do not match, meaning that only 1% of the file would need to be downloaded again.
Does s3 sync support comparing checksums of subsets of files? I read the manual page for s3 sync and could not find a clear answer.

AWS S3 Sync very slow when copying to large directories

When syncing data to an empty directory in S3 using the AWS CLI, it's almost instant. However, when syncing to a large directory (several million folders), it takes a very long time before it even starts to upload/sync the files.
Is there an alternative method? It looks like it's trying to take account of all files in an S3 directory before syncing - I don't need that, and uploading the data without checking beforehand would be fine.
The sync command needs to enumerate all of the objects in the bucket to determine whether a local file already exists in the bucket and whether it is the same as the local file. The more objects you have in the bucket, the longer this will take.
If you don't need this sync behavior, just use a recursive copy command like:
aws s3 cp --recursive . s3://mybucket/
and this should copy all of the local files in the current directory to the bucket in S3.
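The same idea in Python with boto3, in case you are scripting it: walk the local directory and upload every file unconditionally, without ever listing the bucket. The bucket name and local directory are placeholders.

import os
import boto3

s3 = boto3.client("s3")
BUCKET = "mybucket"   # placeholder
LOCAL_DIR = "."       # placeholder

for root, _, files in os.walk(LOCAL_DIR):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, LOCAL_DIR).replace(os.sep, "/")
        s3.upload_file(path, BUCKET, key)   # no existence check, just upload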
If you use the unofficial s3cmd from S3 Tools, you can use the --no-check-md5 option while using sync to disable the MD5 sums comparison to significantly speed up the process.
--no-check-md5 Do not check MD5 sums when comparing files for [sync].
Only size will be compared. May significantly speed up
transfer but may also miss some changed files.
Source: https://s3tools.org/usage
Example: s3cmd --no-check-md5 sync /directory/to/sync s3://mys3bucket/