Google Cloud - Download large file from web - google-cloud-platform

I'm trying to download the GhTorrent dump from http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2020-07-17.tar.gz, which is about 127 GB.
I tried this in the cloud, but it stops after about 6 GB; I believe there is a size limit on piping through curl:
curl http://ghtorrent... | gsutil cp - gs://MY_BUCKET_NAME/mysql-2020-07-17.tar.gz
I cannot use Data Transfer, as it requires the URL, the size in bytes (which I have) and the MD5 hash, which I don't have and can only generate once the file is on my disk. I think(?)
Is there any other option to download the file and upload it directly to the cloud?
My total disk size is 117 GB, sad beep.

This worked for me with the Storage Transfer Service: https://console.cloud.google.com/transfer/
Have a look at the pricing before moving TBs, especially if your target is Nearline/Coldline: https://cloud.google.com/storage-transfer/pricing
Here is a simple example that copies a file from a public URL to my bucket using a Transfer Job:
Create a file theTsv.tsv and specify the complete list of files that must be copied. This example contains just one file:
TsvHttpData-1.0
http://public-url-pointing-to-the-file
Upload the theTsv.tsv file to your bucket or to any publicly accessible URL. In this example I am storing my .tsv file in my bucket: https://storage.googleapis.com/<my-bucket-name>/theTsv.tsv
Create a Transfer Job, choosing "List of object URLs" as the source;
Add the URL that points to the theTsv.tsv file in the "URL of TSV file" field;
Select the target bucket;
Run immediately.
My file, named MD5SUB, was copied from the source URL into my bucket, under an identical directory structure.
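For reference, here is a minimal command-line sketch of the first two steps above (creating and uploading theTsv.tsv), assuming a placeholder bucket named my-bucket and using the GhTorrent URL from the question as the single entry; making the .tsv publicly readable is just one way to make it reachable by the service:
cat > theTsv.tsv <<'EOF'
TsvHttpData-1.0
http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2020-07-17.tar.gz
EOF
gsutil cp theTsv.tsv gs://my-bucket/theTsv.tsv
# One option so the Storage Transfer Service can fetch the list: make it publicly readable
gsutil acl ch -u AllUsers:R gs://my-bucket/theTsv.tsv
The Transfer Job itself is then created in the console as described in the steps above.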

Google Cloud Transfer Job is creating one extra folder

I have created a Transfer Job to import some of my website's static resources to Google storage.
The job was supposed to import the data into a bucket named www.pretty-story.com.
It is importing from a tsv file located here.
For instance the first url is :
https://www.pretty-story.com/wp-includes/js/jquery/jquery.min.js
so I would have expected the job to create the folder structure starting with wp-includes.
But instead the job created this folder structure: www.pretty-story.com/wp-includes/js/jquery.
Therefore the complete path (including my bucket name) is:
www.pretty-story.com/www.pretty-story.com/wp-includes/js/jquery.
How can I tell the Transfer Job to use the bucket as the first folder, instead of creating a subfolder with the same name?
According to https://cloud.google.com/storage-transfer/docs/create-url-list:
When an object located at http(s)://[HOSTNAME]:[PORT]/[URL_PATH] is transferred to Cloud Storage, the name of the object in Cloud Storage is [HOSTNAME]/[URL_PATH].
You don't have an option to skip the [HOSTNAME]/ part of this, so what you are asking is not possible.
If the amount of data involved is reasonable, I recommend downloading it to a workstation and using gsutil to copy it into a bucket without the hostname prefix.
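If you go that route, here is a minimal sketch of the workaround, assuming the bucket is named www.pretty-story.com as in the question and that your workstation has enough disk space:
mkdir staging && cd staging
# Pull down the wrongly prefixed tree, then push its contents back to the bucket root
gsutil -m cp -r gs://www.pretty-story.com/www.pretty-story.com .
gsutil -m cp -r ./www.pretty-story.com/* gs://www.pretty-story.com/
# Once you have verified the copy, remove the prefixed objects
gsutil -m rm -r gs://www.pretty-story.com/www.pretty-story.com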

dynamically create / append to zip from multiple instances

I have a situation where thousands of files are created for a user by multiple backend instances and then uploaded to AWS S3 / Azure Storage. After all the files are created, the user wants to download them as a zip. I can create the zip and then get a pre-signed URL, but I've tried a few archiving solutions and all of them simply take too much time (hours).
Is there any way of creating the zip dynamically from the multiple backend instances? I want to append to the zip after each file is created, from any backend instance.
Zip itself supports the use case you want. For example, the zip command on Linux:
When given the name of an existing zip archive, zip will replace identically named entries in the zip archive (matching the relative names as stored in the archive) or add entries for new names.
You need to persist the working zip file somewhere in a file system though. The most obvious choice I can think of is EFS, so that multiple instances can mount the file system and access the zip file.
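For illustration, a minimal sketch of the mount step on each instance, assuming the amazon-efs-utils mount helper is installed and fs-12345678 is a placeholder file system ID:
sudo mkdir -p /mnt/efs
sudo mount -t efs fs-12345678:/ /mnt/efs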
If you don't want to modify the existing instances/workloads, you can even mount EFS on Lambda, and set an S3 trigger so the Lambda updates the zip file every time a new file is uploaded.
I don't think you can do this with S3 alone, because S3 objects cannot be updated in place; you would have to download and re-upload the whole archive for every new file, which is really not ideal.
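A minimal sketch of the append step itself, assuming the shared file system is mounted at /mnt/efs and that archive.zip and the file paths are hypothetical:
# hypothetical path of the file this instance just produced
NEW_FILE=/tmp/report-0001.pdf
cd /mnt/efs/user-123
# zip adds a new entry or replaces an identically named one;
# flock serializes concurrent appends from multiple instances
flock archive.lock zip -j archive.zip "$NEW_FILE"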

Deflating 7z within Google Cloud Storage Bucket

I am trying to deflate a 7z multipart container within a Google Cloud Storage Bucket. Can I do this without copying the data locally and re-uploading?
I want to make sure that I perform the extraction of the files without generating unnecessary overhead. I am not sure if there is any way this can be done directly within the Bucket.
In an ideal scenario I could decompress the archives directly into the Bucket.
I believe you might be confusing storage in the everyday sense, i.e. a persistent disk accessed through a file-system abstraction, with what a Google Cloud Storage Bucket offers.
You can perform several operations on Objects, which are the pieces of data that reside in Buckets, including upload and download.
So, you have a compressed file in a Bucket and you want the decompressed content in a Bucket too. You then have to download the compressed file to some machine that is able to decompress it, and after that upload the decompressed content.
Here is a demonstration:
Make sure you have an archive file and nothing else in the current directory:
ARCHIVE=ar0000.7z
Create a Bucket, if you don't have one already:
gsutil mb gs://sevenzipblobber
Upload the archive file to a Bucket:
gsutil cp -v $ARCHIVE gs://sevenzipblobber/archives/
Download the archive file from the Bucket (this could be from any other Bucket at any other time):
gsutil cp -v gs://sevenzipblobber/archives/$ARCHIVE .
Extract and remove the archive:
7z x $ARCHIVE && rm -v $ARCHIVE
Upload the contents of the current directory, which should be the decompressed contents of the archive file, back to the Bucket (keep in mind that the -m flag, which speeds up the upload, will jumble up the output):
gsutil -m cp -vr . gs://sevenzipblobber/dearchives/$ARCHIVE
List the contents of the Bucket:
gsutil ls -r gs://sevenzipblobber/
You could also use a Client-Server pattern, where the Server would be responsible for decompressing the archive and uploading the contents to Cloud Storage again.
The Client could be a Google Cloud Function triggered by an event on a Bucket; in this case the Server could be an HTTP server waiting for the upload.
Or the Client could be Cloud Pub/Sub Notifications for Cloud Storage, in which case the Server would have to be subscribed to the respective topic.
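For the Pub/Sub variant, a minimal sketch of the wiring, reusing the bucket from the demonstration and hypothetical topic/subscription names:
# Publish a message for every object finalized in the bucket (creates the topic if needed)
gsutil notification create -t sevenzip-uploads -f json -e OBJECT_FINALIZE gs://sevenzipblobber
# The decompression Server pulls events from a subscription on that topic
gcloud pubsub subscriptions create sevenzip-worker --topic sevenzip-uploads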

Does the Amazon S3 sync command upload the entire modified file again or just the delta in the file?

My system generates large log files continuously and I want to upload all of them to Amazon S3. I am planning to use the s3 sync command for this. My system appends to the same log file until it reaches about 50 MB, and then it creates a new log file. I understand that the sync command will sync the modified local log file to the S3 bucket, but I don't want to upload the entire log file every time the file changes, since the files are large and sending the same data again and again will consume my bandwidth.
So I am wondering: does the s3 sync command send the entire modified file or just the delta?
The documentation implies that it copies whole updated files:
Recursively copies new and updated files
Plus, there would be no way to do this without first downloading the file from S3, which would effectively double the cost of an upload, since you'd pay both download and upload costs.
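One way around this is to sync only logs that have already been rotated and closed, so each file is uploaded exactly once; a minimal sketch, with hypothetical paths and bucket name:
# The file currently being appended to is excluded; rotated files are uploaded once and never change
aws s3 sync /var/log/myapp/ s3://my-log-bucket/myapp/ --exclude "current.log"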

does the pricing for s3 data transfer out to the internet include reading file contents

I have a web app with download buttons to download objects from S3 buckets. I also have plot buttons that read the contents of CSV files in the S3 bucket using pandas read_csv, to read the columns and build visualizations. I wanted to understand whether the price for S3 data transfer out to the internet applies only to actual file downloads, or whether it also covers just reading the contents, because the bytes are transferred over the internet in that case as well.
S3 does not operate like a file system: there is no notion of opening a file and seeking around in it as you would on a local or remote drive. To read an object's contents, the bytes have to be downloaded (either the whole object, or a portion via a ranged GET), and every byte that leaves S3 for the internet counts as data transfer out. That is why AWS only shows pricing for data transfer; reading the CSVs with pandas is billed the same as downloading them.
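For illustration, a minimal sketch showing that even a partial read transfers (and is billed for) exactly the bytes requested; the bucket and key names are placeholders:
# Fetch only the first 1 MiB of the object via a ranged GET; that 1 MiB counts as data transfer out
aws s3api get-object --bucket my-bucket --key data/file.csv --range bytes=0-1048575 first-mib.csv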