aws s3 sync command based on file size only? - amazon-web-services

Is it possible to run the s3 sync command but only upload the files based on file size and not just include the modified date time of the file?
I am currently running:
aws s3 sync ./../app/dist s3://mywebsite.me/dist --acl public-read
The issue I have is I run gulp commands prior to this and files are generated even though the contents of the files are not changed.
Then doing the sync causes files to be uploaded that have not been modified in terms of content.

You can use the --size-only sync switch for that.
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.

Related

AWS s3 sync to upload if file does not exist in target

I have uploaded about 1,000,000 files from my local directory to s3 buckets/subfolders and some of them have failed.
I would like to use the 'sync' option to capture those that did not make it the first time. The s3 modified date is the date/time my file was uploaded (which differs from my source file date/times).
As I understand, sync will upload a file to the target if it does not exist, if the file date has changed, or if the size is different.
Can I modify the command line to NOT use the file date as a consideration for syncing? I ONLY want to copy a file if it does not exist.
aws s3 sync \localserver\localshare\folder s3://mybucket/Folder1
aws s3 sync will compare the "last modified time".
For the objects in S3, there is only one timestamp LastModified, which should be when you uploaded the files.
For your local file (assume a posix linux file system). It should have 3 timestamps: last-access, last-modified, last-status-change. Only last-modified time will be used for comparison.
Now support you uploaded 1M files and some of them failed. For all the files had uploaded successfully, they should have identical last-modified time, and then another sync will not upload them again (sync will validate whether those files are identical and it will be considerable long for the validations for 1M objects.)
On the meantime, you can use aws s3 sync --size-only arguments. It fits what you described. But be sure to check whether it is really something you need. I mean, in many cases, many files could be keep the same size even after being modified (intentionally or accidentally), --size-only will ignore such same-size files.

Deflating 7z within Google Cloud Storage Bucket

I am trying to deflate a 7z multipart container within a Google Cloud Storage Bucket. Can I do this without copying the data locally and re-uploading?
I want to make sure that I perform the extraction of the files without generating unnecessary overhead. I am not sure if there is any way this can be done directly within the Bucket.
In an ideal scenario I could decompress the archives directly into the Bucket.
I believe you might be making confusion between the term storage that one would be used to nowadays, as in a persistent disk accessed by a File System abstraction, and what you can do with a Google Cloud Storage Bucket.
You can make several operations on Objects, which are the pieces of data that reside in Buckets, including upload and download.
So, you have a compressed file in a Bucket and you want to decompress it and have the decompressed content in a Bucket too. Then you have to download the compressed file to some machine that is able to decompress it and after that you’d upload the decompressed content.
I'll leave you here a demonstration:
Make sure you have an archive file and nothing else on the current directory.
ARCHIVE=ar0000.7z
Create a Bucket, if you don't got one created already:
gsutil mb gs://sevenzipblobber
Upload the archive file to a Bucket:
gsutil cp -v $ARCHIVE gs://sevenzipblobber/archives/
Download the archive file from a Bucket (this could from any other Bucket at any other time):
gsutil cp -v gs://sevenzipblobber/archives/$ARCHIVE .
Extract and remove the archive:
7z x $ARCHIVE && rm -v $ARCHIVE
Upload to a Bucket the contents of the current directory, which should be the contents of the archive file decompressed (keep in mind that with the -m flag, that speeds up the upload, the output will be jumbled up).
gsutil -m cp -vr . gs://sevenzipblobber/dearchives/$ARCHIVE
List the contents of the Bucket:
gsutil ls -r gs://sevenzipblobber/
You could also use a Client Server pattern, where the Server would be responsable for decompressing the archive and upload the contents to Cloud Storage again.
The Client could be Google Cloud Functions triggered by an event on a Bucket, in this case the Server could be an HTTP Server waiting for the upload.
Or the Client could be Cloud Pub/Sub Notifications for Cloud Storage and therefore Server would have to be subscribed to the respective topic.

How to keep both directory and s3 bucket updated via aws-cli?

I want to synchronize a directory on my machine with a bucket in s3. The problem is that I find that the aws cli option of sync does not seem to do what I expected.
The behavior that I'm looking for is such that when I run the command it evaluates the content of the local directory and the content of the s3 bucket and updates the one with the old content with the changes in the other.
Can't guess the best approach. Thanks in advance.
I think it's a one way sync. If anything is the source is new or updated then it will sync it to the destination. Even if you may have a new file in the destination, it won't sync that down to your source. You would need to do a sync again from destination back to source. It's a one way sync, from source to destination.
You can't do that with awscli. aws sync s3://bucket . will synchronize the data in both sides
If you want a tool that acts according to your needs you have to develop a python script using https://github.com/boto/boto3
Npm package (and cli command) s3-cli can sync a local folder and an S3 bucket, both ways.
s3-cli sync [--delete-removed] /path/to/folder/ s3://bucket/key/on/s3/
s3-cli sync [--delete-removed] s3://bucket/key/on/s3/ /path/to/folder/

AWS S3 Sync very slow when copying to large directories

When syncing data to an empty directory in S3 using AWS-CLI, it's almost instant. However, when syncing to a large directory (several million folders), it takes a very long time before even starting to upload / sync the files.
Is there an alternative method? It looks like it's trying to take account of all files in an S3 directory before syncing - I don't need that, and uploading the data without checking beforehand would be fine.
The sync command will need to enumerate all of the files in the bucket to determine whether a local file already exists in the bucket and if it is the same as the local file. The more documents you have in the bucket, the longer it's going to take.
If you don't need this sync behavior just use a recursive copy command like:
aws s3 cp --recursive . s3://mybucket/
and this should copy all of the local files in the current directory to the bucket in S3.
If you use the unofficial s3cmd from S3 Tools, you can use the --no-check-md5 option while using sync to disable the MD5 sums comparison to significantly speed up the process.
--no-check-md5 Do not check MD5 sums when comparing files for [sync].
Only size will be compared. May significantly speed up
transfer but may also miss some changed files.
Source: https://s3tools.org/usage
Example: s3cmd --no-check-md5 sync /directory/to/sync s3://mys3bucket/

aws s3 mv/sync command

I have about 2 million files nested in subfoldrs in a bucket and want to move all of them to another bucket. Spending much of time on searching ... i found a solution to use AWS CLI mv/sync command. use move command or use sync command and then delete all the files after successfully synced.
aws s3 mv s3://mybucket/ s3://mybucket2/ --recursive
or it can be as
aws s3 sync s3://mybucket/ s3://mybucket2/
But the problem is how would i know that how many files/folders have moved or synced and how much time would it take...
And what if some exception occurs(machine/server stops/ internet disconnection due to any reason )...i have to again execute the command or it will for surely complete and move/sync all files. How can i be sure about the number of files moved/synced and files not moved/synced.
or can i have something like that
I move limited number of files e.g 100 thousand.. and repeat until all files are moved...
or move files on the basis of uploaded time.. e.g files uploaded from starting date to ending date
if yes .. how?
To sync them use:
aws s3 sync s3://mybucket/ s3://mybucket2/
You can repeat the command, after it finish (or fail) without issue. This will check if anything is missing/different to the target s3 bucket and will process it again.
The time depends on what size are the files, how much objects you have. Amazon counts directories as an object, so they matter too.