How to perform a file integrity check between Amazon S3 and Google Cloud Storage

I am migrating my data from Amazon S3 to Google Cloud Storage.
I have copied my data using gsutil:
$ gsutil cp -R s3://my_bucket/* gs://my_bucket
What I want to do next is check whether all the files in S3 actually exist in Google Cloud Storage.
At the moment, all I have done is print each side's file list to a file and run a simple Unix diff, but that doesn't really check file integrity.
What's a good way to check that?

gsutil verifies MD5 checksums on objects copied between cloud providers, so if the recursive copy command completes successfully (shell return code 0), you should have copied everything successfully. Note that gsutil isn't able to compare checksums for S3 objects larger than 5 GiB (which have a non-MD5 checksum that gsutil doesn't support), and will print a warning for cases it encounters.
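If you want an additional, independent cross-check after the copy completes, one rough approach is to compare object names and sizes on both sides. This is only a minimal sketch, assuming gsutil has credentials for both providers and that both buckets are named my_bucket; it compares names and sizes, not content, so for content integrity you are still relying on gsutil's checksum validation during the copy:
# List size and object path on each side, strip the bucket prefix, sort, and diff.
gsutil ls -l "s3://my_bucket/**" | grep -v TOTAL: | awk '{print $1, $3}' | sed 's|s3://my_bucket/||' | sort > s3_listing.txt
gsutil ls -l "gs://my_bucket/**" | grep -v TOTAL: | awk '{print $1, $3}' | sed 's|gs://my_bucket/||' | sort > gcs_listing.txt
diff s3_listing.txt gcs_listing.txt && echo "Object names and sizes match"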

Related

Faster method to download 6.5m objects from GCP bucket

I'm looking for a faster method to download a ton of objects (6.5 million in my case) from a bucket. The average object size is 2kb (it's a JSON file). The method I used was gsutil -m cp -r gs://<bucket>/<folder> . which took 14 hours for 1M objects.
It's not feasible to run this on my laptop for 7 days straight. Any ideas?
PS: I don't need them to be in individual JSON files. I'm thinking of creating a script that pulls a file from the bucket, adds a row to a CSV, then deletes the file.
Try downloading the files to a VM, compressing them into a single tgz (or bz2, or xz), uploading that archive back to the bucket, and then downloading the tgz.
Cloud Shell should work too.
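Roughly, the VM approach might look like this; it is only a sketch, where the bucket and folder names are the placeholders from the question and export/data.tgz is just an example destination path:
gsutil -m cp -r "gs://<bucket>/<folder>" ./data    # fast from inside Google's network
tar czf data.tgz ./data                            # bundle the millions of small JSON files into one object
gsutil cp data.tgz "gs://<bucket>/export/data.tgz"
# Then, on the laptop, a single large download:
gsutil cp "gs://<bucket>/export/data.tgz" . && tar xzf data.tgz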

Deflating 7z within Google Cloud Storage Bucket

I am trying to deflate a 7z multipart container within a Google Cloud Storage Bucket. Can I do this without copying the data locally and re-uploading?
I want to make sure that I perform the extraction of the files without generating unnecessary overhead. I am not sure if there is any way this can be done directly within the Bucket.
In an ideal scenario I could decompress the archives directly into the Bucket.
I believe you might be confusing the kind of storage one is used to nowadays, i.e. a persistent disk accessed through a File System abstraction, with what you can do with a Google Cloud Storage Bucket.
You can perform several operations on Objects, which are the pieces of data that reside in Buckets, including upload and download.
So, you have a compressed file in a Bucket and you want to decompress it and have the decompressed content in a Bucket too. Then you have to download the compressed file to some machine that is able to decompress it, and after that upload the decompressed content.
I'll leave you a demonstration here:
Make sure you have an archive file and nothing else on the current directory.
ARCHIVE=ar0000.7z
Create a Bucket, if you don't have one already:
gsutil mb gs://sevenzipblobber
Upload the archive file to a Bucket:
gsutil cp -v $ARCHIVE gs://sevenzipblobber/archives/
Download the archive file from a Bucket (this could be done from any other Bucket at any other time):
gsutil cp -v gs://sevenzipblobber/archives/$ARCHIVE .
Extract and remove the archive:
7z x $ARCHIVE && rm -v $ARCHIVE
Upload the contents of the current directory, which should now be the decompressed contents of the archive file, to a Bucket (keep in mind that the -m flag speeds up the upload, but its output will be jumbled up).
gsutil -m cp -vr . gs://sevenzipblobber/dearchives/$ARCHIVE
List the contents of the Bucket:
gsutil ls -r gs://sevenzipblobber/
You could also use a Client-Server pattern, where the Server would be responsible for decompressing the archive and uploading the contents to Cloud Storage again.
The Client could be a Google Cloud Function triggered by an event on a Bucket; in this case the Server could be an HTTP server waiting for the upload.
Or the Client could be Cloud Pub/Sub Notifications for Cloud Storage, in which case the Server would have to be subscribed to the respective topic.
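As a rough sketch (not a drop-in implementation) of what that Server side might run once it is notified of a new archive, reusing the Bucket name from the demonstration above and taking the object path as an argument:
#!/bin/bash
# Usage: extract-archive.sh <object-path>, e.g. archives/ar0000.7z
set -euo pipefail
OBJECT="$1"
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
gsutil cp "gs://sevenzipblobber/$OBJECT" .                                   # download the archive
7z x "$(basename "$OBJECT")" && rm -v "$(basename "$OBJECT")"                # extract, then drop the archive
gsutil -m cp -r . "gs://sevenzipblobber/dearchives/$(basename "$OBJECT")/"   # upload the extracted contents
cd / && rm -rf "$WORKDIR"                                                    # clean up the scratch directory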

automating file archival from ec2 to s3 based on last modified date

I want to write an automated job that goes through my files stored on the EC2 instance's storage and checks the last modified date. If a file has not been modified for more than (x) days, it should automatically get archived to my S3.
Also, I don't want to convert the files to zip files for now.
What I don't understand is how to give the path to the EC2 instance's storage, and how to express the condition on the last modified date.
aws s3 sync your-new-dir-name s3://your-s3-bucket-name/folder-name
Please correct me if I understand this wrong: your requirement is to archive the older files.
So you need a script that checks the modified time and, if a file has not been modified for X days, makes space by archiving it to S3 storage; you don't wish to keep the file locally.
Is that correct?
Here is some advice:
1. Please provide OS information; this would help us suggest either a shell script or a PowerShell script.
Here is a PowerShell script:
# Build a list of files with their last-modified times (use Get-ChildItem to enumerate
# the folder, and export once so the CSV is not overwritten on every iteration)
$fileList = Get-ChildItem "C:\pathtofolder" -File
$fileList |
    Select-Object -Property FullName, LastWriteTime |
    Export-Csv 'C:\fileAndDate.csv' -NoTypeInformation
Then use aws s3 cp to copy the results to the S3 bucket.
You would do the same with a shell script; a sketch follows below.
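On a Linux instance, a minimal sketch of that idea could look like the following; the source path, bucket name, and the 30-day threshold are placeholders, and each file is deleted locally only if the copy to S3 succeeds:
# Copy files not modified in the last 30 days to S3, then remove the local copy.
find /home/ec2-user/data -type f -mtime +30 -print0 |
  while IFS= read -r -d '' f; do
    aws s3 cp "$f" "s3://my-bucket/ec2-archive$f" && rm -- "$f"
  done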
Using aws s3 sync is a great way to backup files to S3. You could use a command like:
aws s3 sync /home/ec2-user/ s3://my-bucket/ec2-backup/
The first parameter (/home/ec2-user/) is where you specify the source of the files. I recommend only backing up user-created files, not the whole operating system.
There is no capability for specifying a number of days. I suggest you just copy all files.
You might choose to activate Versioning to keep copies of all versions of files in S3. This way, if a file gets overwritten you can still go back to a prior version. (Storage charges will apply for all versions kept in S3.)
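For example, assuming the same placeholder bucket name, versioning can be enabled with the AWS CLI before running the sync:
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled
aws s3 sync /home/ec2-user/ s3://my-bucket/ec2-backup/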

aws s3 sync command based on file size only?

Is it possible to run the s3 sync command but have it decide what to upload based on file size only, rather than also on the file's modified date/time?
I am currently running:
aws s3 sync ./../app/dist s3://mywebsite.me/dist --acl public-read
The issue I have is that I run gulp commands prior to this, and files are regenerated even though their contents have not changed.
The sync then uploads files that have not actually been modified in terms of content.
You can use the --size-only sync switch for that.
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
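Applied to the command from the question, that would be:
aws s3 sync ./../app/dist s3://mywebsite.me/dist --acl public-read --size-only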

AWS S3 Sync very slow when copying to large directories

When syncing data to an empty directory in S3 using AWS-CLI, it's almost instant. However, when syncing to a large directory (several million folders), it takes a very long time before even starting to upload / sync the files.
Is there an alternative method? It looks like it's trying to take account of all files in an S3 directory before syncing - I don't need that, and uploading the data without checking beforehand would be fine.
The sync command needs to enumerate all of the objects in the bucket to determine whether each local file already exists in the bucket and whether it is the same as the local file. The more objects you have in the bucket, the longer this will take.
If you don't need this sync behavior just use a recursive copy command like:
aws s3 cp --recursive . s3://mybucket/
and this should copy all of the local files in the current directory to the bucket in S3.
If you use the unofficial s3cmd from S3 Tools, you can use the --no-check-md5 option while using sync to disable the MD5 sums comparison to significantly speed up the process.
--no-check-md5 Do not check MD5 sums when comparing files for [sync].
Only size will be compared. May significantly speed up
transfer but may also miss some changed files.
Source: https://s3tools.org/usage
Example: s3cmd --no-check-md5 sync /directory/to/sync s3://mys3bucket/