I am trying to deflate a 7z multipart container within a Google Cloud Storage Bucket. Can I do this without copying the data locally and re-uploading?
I want to make sure that I perform the extraction of the files without generating unnecessary overhead. I am not sure if there is any way this can be done directly within the Bucket.
In an ideal scenario I could decompress the archives directly into the Bucket.
I believe you might be confusing the kind of storage one is used to nowadays, that is, a persistent disk accessed through a file system abstraction, with what you can do with a Google Cloud Storage Bucket.
You can perform several operations on Objects, which are the pieces of data that reside in Buckets, including upload and download.
So, you have a compressed file in a Bucket and you want the decompressed content to end up in a Bucket too. To do that, you have to download the compressed file to some machine that is able to decompress it, and after that upload the decompressed content.
I'll leave a demonstration here:
Make sure you have an archive file and nothing else in the current directory.
ARCHIVE=ar0000.7z
Create a Bucket, if you don't have one already:
gsutil mb gs://sevenzipblobber
Upload the archive file to a Bucket:
gsutil cp -v $ARCHIVE gs://sevenzipblobber/archives/
Download the archive file from a Bucket (this could be from any other Bucket at any other time):
gsutil cp -v gs://sevenzipblobber/archives/$ARCHIVE .
Extract and remove the archive:
7z x $ARCHIVE && rm -v $ARCHIVE
Upload the contents of the current directory, which should now be the decompressed contents of the archive, to a Bucket (keep in mind that the -m flag speeds up the upload but jumbles up the output).
gsutil -m cp -vr . gs://sevenzipblobber/dearchives/$ARCHIVE
List the contents of the Bucket:
gsutil ls -r gs://sevenzipblobber/
You could also use a Client-Server pattern, where the Server would be responsible for decompressing the archive and uploading the contents to Cloud Storage again.
The Client could be Google Cloud Functions triggered by an event on a Bucket; in that case the Server could be an HTTP server waiting for the upload.
Or the Client could be Cloud Pub/Sub Notifications for Cloud Storage, in which case the Server would have to be subscribed to the respective topic.
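As a minimal sketch of the Pub/Sub variant (the topic and subscription names below are placeholders I've made up), the Bucket-to-Pub/Sub wiring could look like this:
# Publish a Pub/Sub message whenever a new object lands in the Bucket
gsutil notification create -t sevenzip-archives -f json -e OBJECT_FINALIZE gs://sevenzipblobber
# Subscription the decompressing Server would pull from
gcloud pubsub subscriptions create sevenzip-worker-sub --topic sevenzip-archives
The Server would then pull messages from that subscription, download the archive named in each message, decompress it and upload the results, just like in the manual steps above.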
Related
I have a setup like this in Google Cloud Platform for my website, a Gatsbyjs project:
Push to repository
Trigger CloudBuild that builds my Gatsby website into public folder
Copy the files within the "public" folder to a Storage Bucket, using rsync
However, when I visit my site (a Load Balancer connected to the Bucket), the .js, .css and .html files are not being served as gzip.
I know there seems to be a flag on the cp command for gzip, but how does this work for rsync?
Thanks
If you want to "compress on the fly" your files with rsync this is not possible.
However, if you just want to apply to gzip transport encoding to certain files you can use the -j <...> option of rsync. This will saves network bandwidth but is going to leave the data uncompressed in the Storage bucket.
However, if you want to take uncompressed files from the public folder and send them to a bucket and keep them compressed you will need to do a gsutil cp -z command. This will compress your file (if they are not compressed in your public folder) and store them compressed in the bucket
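As a sketch of both options, assuming the build output is in ./public and gs://my-site-bucket is a placeholder bucket name:
# Transport-only compression: saves bandwidth, objects are stored uncompressed
gsutil -m rsync -r -j js,css,html ./public gs://my-site-bucket
# Store compressed: objects are gzipped and served with Content-Encoding: gzip
gsutil -m cp -r -z js,css,html ./public/* gs://my-site-bucket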
I'm trying to download GhTorrent dump from http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2020-07-17.tar.gz which is about 127gb
I tried it in the cloud, but after 6 GB it stops; I believe there is a size limit for using curl.
curl http://ghtorrent... | gsutil cp - gs://MY_BUCKET_NAME/mysql-2020-07-17.tar.gz
I cannot use Data Transfer, as I need to specify the URL, the size in bytes (which I have) and the MD5 hash, which I don't have and can only generate by having the file on my disk. I think(?)
Is there any other option to download and upload the file directly to the cloud?
My total disk size is 117 GB, sad beep.
Worked for me with Storage Transfer Service: https://console.cloud.google.com/transfer/
Have a look at the pricing before moving TBs, especially if your target is Nearline/Coldline: https://cloud.google.com/storage-transfer/pricing
A simple example that copies a file from a public URL to my bucket using a Transfer Job:
Create a file theTsv.tsv and specify the complete list of files that must be copied. This example contains just one file:
TsvHttpData-1.0
http://public-url-pointing-to-the-file
Upload the theTsv.tsv file to your bucket or any publicly accessible URL. In this example I am storing my .tsv file in my bucket: https://storage.googleapis.com/<my-bucket-name>/theTsv.tsv
Create a transfer job - List of object URLs
Add the url that points to the theTsv.tsv file in the URL of TSV file field;
Select the target bucket
Run immediately
My file, named MD5SUB, was copied from the source URL into my bucket, under an identical directory structure.
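For reference, a roughly equivalent job can also be created from the command line; this is only a sketch with placeholder names, not the exact steps I used:
# URL-list source: the TSV itself must be publicly reachable
gcloud transfer jobs create \
  https://storage.googleapis.com/my-bucket-name/theTsv.tsv \
  gs://my-bucket-name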
I am migrating my data from Amazon-S3 to Google-Cloud Storage.
I have copied my data using gsutil:
$ gsutil cp -R s3://my_bucket/* gs://my_bucket
What I want to do next is to check whether all the files in S3 properly exist in Google Storage.
At the moment all I did was print the file lists to files and then do a simple Unix diff, but that doesn't really check the file integrity.
What's the good way to check that?
gsutil verifies MD5 checksums on objects copied between cloud providers, so if the recursive copy command completes successfully (shell return code 0), you should have copied everything successfully. Note that gsutil isn't able to compare checksums for S3 objects larger than 5 GiB (which have a non-MD5 checksum that gsutil doesn't support), and will print a warning for cases it encounters.
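If you want to spot-check individual objects on top of that, gsutil can print the stored hashes on both sides; the object path below is a placeholder, and keep in mind that for S3 the ETag equals the MD5 only for non-multipart uploads:
# Copied object in Cloud Storage (prints a "Hash (md5)" line)
gsutil ls -L gs://my_bucket/path/to/object
# Original S3 object metadata (prints the ETag)
gsutil ls -L s3://my_bucket/path/to/object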
Is it possible to run the s3 sync command so that it decides what to upload based only on file size, rather than also taking the file's modified date/time into account?
I am currently running:
aws s3 sync ./../app/dist s3://mywebsite.me/dist --acl public-read
The issue I have is that I run gulp commands prior to this, and files are regenerated even though their contents have not changed.
The sync then causes files to be uploaded that have not been modified in terms of content.
You can use the --size-only sync switch for that.
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
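Applied to the command from the question, that would be:
aws s3 sync ./../app/dist s3://mywebsite.me/dist --acl public-read --size-only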
When syncing data to an empty directory in S3 using AWS-CLI, it's almost instant. However, when syncing to a large directory (several million folders), it takes a very long time before even starting to upload / sync the files.
Is there an alternative method? It looks like it's trying to take stock of all the files in the S3 directory before syncing; I don't need that, and uploading the data without checking beforehand would be fine.
The sync command will need to enumerate all of the files in the bucket to determine whether a local file already exists in the bucket and whether it is the same as the local file. The more objects you have in the bucket, the longer it's going to take.
If you don't need this sync behavior just use a recursive copy command like:
aws s3 cp --recursive . s3://mybucket/
and this should copy all of the local files in the current directory to the bucket in S3.
If you use the unofficial s3cmd from S3 Tools, you can use the --no-check-md5 option with sync to disable the MD5 sum comparison and significantly speed up the process.
--no-check-md5 Do not check MD5 sums when comparing files for [sync].
Only size will be compared. May significantly speed up
transfer but may also miss some changed files.
Source: https://s3tools.org/usage
Example: s3cmd --no-check-md5 sync /directory/to/sync s3://mys3bucket/