Does AWS S3 sync support calculating checksums of subsets of files? - amazon-web-services

I want to download a large file incrementally from an S3 bucket. Each new version of the file I download may change slightly. Therefore, if the s3 sync command computes a checksum of the entire new file, it will (likely) differ from the checksum of the old file, requiring the entire new file to be downloaded.
If, instead, s3 sync computes checksums on many small subsets of the file, it may find that only 1% of them do not match, meaning that only 1% of the file would need to be downloaded again.
Does s3 sync support comparing checksums of subsets of files? I read the manual page for s3 sync and could not find a clear answer.

Related

AWS: Speed up copy of large number of very small files

I have a single bucket with a large number of very small text files (between 500 bytes and 1.2 KB). This bucket currently contains over 1.7 million files and will be ever increasing.
The way that I add data to this bucket is by generating batches of files (on the order of 50,000 files) and transferring those files into the bucket.
Now the problem is this. If I transfer the files one by one in a loop, it takes an unbearably long amount of time. So if all the files are in a directory origin_directory, I would do
aws s3 cp origin_directory/filename_i s3://my_bucket/filename_i
I would do this command 50000 times.
Right now I'm testing this on a set of about 280K files. Doing this would take approximately 68 hours according to my calculations. However, I found out that I can sync:
aws s3 sync origin_directory s3://my_bucket/
Now this works much, much faster. (It will take about 5 hours, according to my calculations.) However, the sync needs to figure out what to copy (files present in the directory and not present in the bucket). Since the files in the bucket will be ever increasing, I'm thinking that this will take longer and longer as time moves on.
However, since I delete the information after every sync, I know that the sync operation needs to transfer all files in that directory.
So my question is, is there a way to start a "batch copy" similar to the sync, without actually doing the sync?
You can use:
aws s3 cp --recursive origin_directory/ s3://my_bucket/
This is the same as a sync, but it will not check whether the files already exist.
Also, see Use of Exclude and Include Filters to learn how to specify wildcards (eg all *.txt files).
When copying a large number of files using aws s3 sync or aws s3 cp --recursive, the AWS CLI will parallelize the copying, making it much faster. You can also play with the AWS CLI S3 Configuration to potentially optimize it for your typical types of files (eg copy more files simultaneously).
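For example, here is a minimal sketch of that tuning via aws configure set (the values and the use of the default profile are illustrative assumptions, not recommendations):
aws configure set default.s3.max_concurrent_requests 20   # copy more files simultaneously
aws configure set default.s3.multipart_threshold 64MB     # only split files larger than this
aws configure set default.s3.multipart_chunksize 16MB     # part size used for multipart transfers
aws s3 cp --recursive origin_directory/ s3://my_bucket/    # then run the bulk copy as above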
Try using https://github.com/mondain/jets3t
It does the same function but works in parallel, so it will complete the job much faster.

AWS s3 sync to upload if file does not exist in target

I have uploaded about 1,000,000 files from my local directory to s3 buckets/subfolders and some of them have failed.
I would like to use the 'sync' option to capture those that did not make it the first time. The s3 modified date is the date/time my file was uploaded (which differs from my source file date/times).
As I understand, sync will upload a file to the target if it does not exist, if the file date has changed, or if the size is different.
Can I modify the command line to NOT use the file date as a consideration for syncing? I ONLY want to copy a file if it does not exist.
aws s3 sync \localserver\localshare\folder s3://mybucket/Folder1
aws s3 sync will compare the "last modified time".
For the objects in S3, there is only one timestamp, LastModified, which should be when you uploaded the files.
For your local files (assuming a POSIX Linux file system), there are 3 timestamps: last-access, last-modified, and last-status-change. Only the last-modified time is used for the comparison.
Now suppose you uploaded 1M files and some of them failed. All the files that were uploaded successfully keep their last-modified time, so another sync will not upload them again (the sync will still validate whether those files are identical, which will take considerably long for 1M objects).
In the meantime, you can use the aws s3 sync --size-only argument. It fits what you described, but be sure to check whether it is really what you need: in many cases files can keep the same size even after being modified (intentionally or accidentally), and --size-only will ignore such same-size files.
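As a sketch, here is the command from the question with the flag applied (paths taken verbatim from the question):
aws s3 sync \localserver\localshare\folder s3://mybucket/Folder1 --size-only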

Merge multiple zip files on s3 into fewer zip files

We have a problem wherein some of the files in an S3 directory are in the ~500MiB range, but many other files are in the KiB and byte range. I want to merge all the small files into fewer, bigger files on the order of ~500MiB.
What is the most efficient way to rewrite data in an S3 folder without having to download it, merge it locally, and push it back to S3? Is there some utility/AWS command I can use to achieve this?
S3 is a storage service and has no compute capability. For what you are asking, you need compute (to merge). So you cannot do what you want without downloading, merging and uploading.
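As a rough sketch of that download-merge-upload cycle (the bucket, prefix, and file names here are hypothetical, and this assumes the small objects can simply be concatenated):
aws s3 cp s3://my-bucket/small-files/ ./small-files/ --recursive   # download the small objects
cat ./small-files/* > merged-001.dat                               # merge them locally
aws s3 cp merged-001.dat s3://my-bucket/merged/merged-001.dat      # upload the combined object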

AWS S3 Sync very slow when copying to large directories

When syncing data to an empty directory in S3 using AWS-CLI, it's almost instant. However, when syncing to a large directory (several million folders), it takes a very long time before even starting to upload / sync the files.
Is there an alternative method? It looks like it's trying to take account of all files in an S3 directory before syncing - I don't need that, and uploading the data without checking beforehand would be fine.
The sync command will need to enumerate all of the files in the bucket to determine whether a local file already exists in the bucket and if it is the same as the local file. The more documents you have in the bucket, the longer it's going to take.
If you don't need this sync behavior just use a recursive copy command like:
aws s3 cp --recursive . s3://mybucket/
and this should copy all of the local files in the current directory to the bucket in S3.
If you use the unofficial s3cmd from S3 Tools, you can use the --no-check-md5 option while using sync to disable the MD5 sums comparison to significantly speed up the process.
--no-check-md5    Do not check MD5 sums when comparing files for [sync]. Only size will be compared. May significantly speed up transfer but may also miss some changed files.
Source: https://s3tools.org/usage
Example: s3cmd --no-check-md5 sync /directory/to/sync s3://mys3bucket/

Merging files on AWS S3 (Using Apache Camel)

I have some files that are being uploaded to S3 and processed for some Redshift task. After that task is complete, these files need to be merged. Currently I am deleting these files and uploading the merged files again.
This eats up a lot of bandwidth. Is there any way the files can be merged directly on S3?
I am using Apache Camel for routing.
S3 allows you to use an S3 file URI as the source for a copy operation. Combined with S3's Multi-Part Upload API, you can supply several S3 object URIs as the source keys for a multi-part upload.
However, the devil is in the details. S3's multi-part upload API has a minimum file part size of 5MB. Thus, if any file in the series of files under concatenation is < 5MB, it will fail.
However, you can work around this by exploiting the loophole which allows the final upload piece to be < 5MB (allowed because this happens in the real world when uploading remainder pieces).
My production code does this by:
Interrogating the manifest of files to be uploaded
If the first part is under 5MB, download pieces* and buffer to disk until 5MB is buffered.
Append parts sequentially until file concatenation is complete.
If a non-terminus file is < 5MB, append it, then finish the upload, create a new upload, and continue.
Finally, there is a bug in the S3 API: the ETag (which is really an MD5 file checksum on S3) is not properly recalculated at the completion of a multi-part upload. To fix this, copy the file on completion. If you use a temp location during concatenation, this will be resolved on the final copy operation.
* Note that you can download a byte range of a file. This way, if part 1 is 10K and part 2 is 5GB, you only need to read in 5110K to meet the 5MB size needed to continue.
** You could also have a 5MB block of zeros on S3 and use it as your default starting piece. Then, when the upload is complete, do a file copy using byte range of 5MB+1 to EOF-1
P.S. When I have time to make a Gist of this code I'll post the link here.
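As a small illustration of the byte-range download mentioned in the first footnote (bucket and key names here are hypothetical), this fetches only the first 5 MiB of a large object so it can be buffered locally:
aws s3api get-object --bucket my-bucket --key part2.dat --range bytes=0-5242879 first-5mib.dat   # 5 MiB = 5,242,880 bytes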
You can use Multipart Upload with Copy to merge objects on S3 without downloading and uploading them again.
You can find some examples in Java, .NET or with the REST API here.
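For a rough idea of the flow, here is a sketch using the AWS CLI's low-level s3api commands instead (bucket and key names are hypothetical, and every copied source except the last must be at least 5MB):
# start a multipart upload for the merged object
upload_id=$(aws s3api create-multipart-upload --bucket my-bucket --key merged.dat \
    --query UploadId --output text)
# copy each source object server-side as one part of the new object
aws s3api upload-part-copy --bucket my-bucket --key merged.dat \
    --upload-id "$upload_id" --part-number 1 --copy-source my-bucket/part1.dat
aws s3api upload-part-copy --bucket my-bucket --key merged.dat \
    --upload-id "$upload_id" --part-number 2 --copy-source my-bucket/part2.dat
# gather the part numbers and ETags, then complete the upload
parts=$(aws s3api list-parts --bucket my-bucket --key merged.dat --upload-id "$upload_id" \
    --query '{Parts: Parts[].{PartNumber: PartNumber, ETag: ETag}}' --output json)
aws s3api complete-multipart-upload --bucket my-bucket --key merged.dat \
    --upload-id "$upload_id" --multipart-upload "$parts"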