Google Cloud Transfer Job is creating one extra folder

I have created a Transfer Job to import some of my website's static resources to Google storage.
The job was supposed to import the data in a bucket named www.pretty-story.com.
It is importing from a tsv file located here.
For instance, the first URL is:
https://www.pretty-story.com/wp-includes/js/jquery/jquery.min.js
so I would have expected the job to create the folder structure starting with wp-includes.
But instead the job created this folder structure: www.pretty-story.com/wp-includes/js/jquery.
Therefore the complete path (including my bucket name) is:
www.pretty-story.com/www.pretty-story.com/wp-includes/js/jquery.
How can I tell the data transfer job to use the bucket as the first folder, instead of creating a subfolder with the same name?

According to https://cloud.google.com/storage-transfer/docs/create-url-list:
When an object located at http(s)://[HOSTNAME]:[PORT]/[URL_PATH] is transferred to Cloud Storage, the name of the object in Cloud Storage is [HOSTNAME]/[URL_PATH].
You don't have an option to skip the [HOSTNAME]/ part of this, so what you are asking is not possible.
If the amount of data involved is reasonable, I recommend downloading it to a workstation and using gsutil to copy it into a bucket without the hostname prefix.
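If you go that route, the re-upload step can also be done with the Python client library instead of gsutil. Here is a minimal sketch, assuming the files were already downloaded to a local directory named site-files (the directory name is an assumption; the bucket name is taken from the question):

# Sketch only: re-uploads the downloaded files so object names start at
# wp-includes/... with no hostname prefix. "site-files" is a placeholder.
import os
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("www.pretty-story.com")

local_root = "site-files"
for dirpath, _, filenames in os.walk(local_root):
    for filename in filenames:
        local_path = os.path.join(dirpath, filename)
        # Build the object name relative to the local root, using "/" separators.
        object_name = os.path.relpath(local_path, local_root).replace(os.sep, "/")
        bucket.blob(object_name).upload_from_filename(local_path)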

Related

GCP AI Notebook can't access storage bucket

New to GCP. Trying to load a saved model file into an AI Platform notebook. Tried several approaches without success.
Most obvious approach seemed to be to set the value of a variable to the path copied from storage:
model_path = "gs://<my-bucket>/models/3B/export/1600635833/saved_model.pb"
Results: OSError: SavedModel file does not exist at: (the above path)
I know I can connect to the bucket and retrieve contents because I downloaded a csv file from the bucket and printed out the contents.
The OSError sounds to me like you are trying to access the GCS bucket through a regular file-system API, which does not understand GCS paths (example: Python's open() function).
To access files in GCS, I recommend you use the Client Libraries: https://cloud.google.com/storage/docs/reference/libraries
Another option for testing is to connect over SSH and use the gsutil command.
Note: I assume <my-bucket> was edited to replace your real GCS bucket name.
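For example, a minimal sketch with the Python client library that downloads the model file to local disk before loading it (the bucket and object names are copied from the question and may need adjusting):

# Sketch only: copy the SavedModel file out of GCS so it can be opened
# with regular file-system APIs; <my-bucket> is a placeholder.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("<my-bucket>")
blob = bucket.blob("models/3B/export/1600635833/saved_model.pb")
blob.download_to_filename("saved_model.pb")

Note that if you are loading a TensorFlow SavedModel, the loader typically expects the whole export directory (variables/, assets/, etc.), not just saved_model.pb, so you may need to download the other objects under the same prefix as well.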
According to the GCP documentation, you are able to access Cloud Storage from AI Platform; that page will guide you through using Cloud Storage with AI Platform Training.

Google Cloud - Download large file from web

I'm trying to download the GhTorrent dump from http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2020-07-17.tar.gz, which is about 127 GB.
I tried it in the cloud, but after 6 GB it stops; I believe there is a size limit for using curl:
curl http://ghtorrent... | gsutil cp - gs://MY_BUCKET_NAME/mysql-2020-07-17.tar.gz
I cannot use Data Transfer, as I need to specify the URL, the size in bytes (which I have) and the MD5 hash, which I don't have and can only generate by having the file on my disk. I think(?)
Is there any other option to download and upload the file directly to the cloud?
My total disk size is 117 GB, sad beep.
Worked for me with Storage Transfer Service: https://console.cloud.google.com/transfer/
Have a look at the pricing before moving TBs, especially if your target is Nearline/Coldline: https://cloud.google.com/storage-transfer/pricing
Here is a simple example that copies a file from a public URL to my bucket using a Transfer Job:
Create a file theTsv.tsv and specify the complete list of files that must be copied. This example contains just one file:
TsvHttpData-1.0
http://public-url-pointing-to-the-file
Upload the theTsv.tsv file to your bucket or to any publicly accessible URL. In this example I am storing my .tsv file in my bucket at https://storage.googleapis.com/<my-bucket-name>/theTsv.tsv
Create a transfer job - List of object URLs
Add the URL that points to the theTsv.tsv file in the “URL of TSV file” field;
Select the target bucket
Run immediately
My file, named MD5SUB, was copied from the source URL into my bucket under an identical directory structure.
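If you prefer to script the setup, here is a small sketch with the Python client library that builds theTsv.tsv and uploads it to the bucket (the bucket name is a placeholder; the source URL is the one from the question):

# Sketch only: generate the URL list and store it in the bucket.
from google.cloud import storage

urls = ["http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2020-07-17.tar.gz"]
tsv_content = "TsvHttpData-1.0\n" + "\n".join(urls) + "\n"

client = storage.Client()
bucket = client.bucket("<my-bucket-name>")
bucket.blob("theTsv.tsv").upload_from_string(
    tsv_content, content_type="text/tab-separated-values")

The file then becomes reachable at https://storage.googleapis.com/<my-bucket-name>/theTsv.tsv once it is made publicly readable.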

Copying objects from one bucket directory folder to another bucket folder using transfer

I want to use Google Transfer to copy all folders/files in a specific directory of Bucket-1 to the root directory of Bucket-2.
I have tried to use Transfer with the filter option, but it doesn't copy anything across.
Any pointers on getting this to work within transfer or step by step for functions would be really appreciated.
I reproduced your issue and it worked for me using gsutil.
For example:
gsutil cp -r gs://SourceBucketName/example.txt gs://DestinationBucketName
Furthermore, I tried to copy using the Transfer option and it also worked. These are the steps I followed:
1 - Create a new Transfer Job.
Panel “Select Source”:
2 - Select your source, for example a Google Cloud Storage bucket.
3 - Select the bucket with the data you want to copy.
4 - In the field “Transfer files with these prefixes”, add your data (I used “example.txt”).
Panel “Select destination”:
5 - Select your destination bucket.
Panel “Configure transfer”:
6 - Choose “Run now” if you want to complete the transfer now.
7 - Press “Create”.
For more information about copying from one bucket to another, you can check the official documentation.
So, a few things to consider here:
You have to keep in mind that Google Cloud Storage buckets don’t treat subdirectories the way you would expect. To the bucket it is basically all part of the file name. You can find more information about that in the How Subdirectories Work documentation.
This is also the reason why you cannot transfer a file that is inside a “directory” and expect only the file’s name to appear in the root of your target bucket. To give you an example:
If you have a file at gs://my-bucket/my-bucket-subdirectory/myfile.txt, once you transfer it to your second bucket it will still have the subdirectory in its name, so the result will be: gs://my-second-bucket/my-bucket-subdirectory/myfile.txt
This is why, if you are interested in automating this process, you should definitely give the Google Cloud Storage Client Libraries a try.
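As an illustration, here is a hedged sketch with the Python client library that copies everything under a given prefix in Bucket-1 into the root of Bucket-2, dropping the prefix from the object names (the bucket names and prefix are placeholders):

# Sketch only: copy objects under a "directory" prefix into the root of
# another bucket, stripping the prefix from each name.
from google.cloud import storage

client = storage.Client()
source_bucket = client.bucket("bucket-1")
destination_bucket = client.bucket("bucket-2")
prefix = "my-bucket-subdirectory/"

for blob in client.list_blobs(source_bucket, prefix=prefix):
    new_name = blob.name[len(prefix):]
    if new_name:  # skip the zero-byte "folder" marker object, if present
        source_bucket.copy_blob(blob, destination_bucket, new_name)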
Additionally, you could also use the GCS Client with Google Cloud Functions. However, I would just suggest this if you really need the Event Triggers offered by GCF. If you just want the transfer to run regularly, for example on a cron job, you could still use the GCS Client somewhere other than a Cloud Function.
The Cloud Storage Tutorial might give you a good example of how to handle Storage events.
Also, in your future posts, try to provide as much relevant information as possible. For this post, as an example, it would’ve been nice to know what file structure you have in your buckets and what output you have been getting. And if you state your use case straight away, it will also prevent other users from suggesting solutions that don’t apply to your needs.
Try this in Cloud Shell in the project:
gsutil cp -r gs://bucket1/foldername gs://bucket2

Google Cloud Storage bucket has stopped overwriting files by default when uploading with the Python library

I have an App Engine cron job that runs every week, uploading a file called logs.json to a Google Cloud Storage bucket.
For the past few months, this file has been overwritten each time the new version was uploaded.
In the last few weeks, rather than overwriting the file, the existing copy has been retained and the new one uploaded under a different name, e.g. logs_XHjYmP3.json.
This is a simplified snippet from the Django storage class where the upload is performed. I have verified that the filename is correct at the point of upload:
# Prints 'logs.json'
print(file.name)
blob.upload_from_file(file, content_type=content_type)
blob.make_public()
Reading the documentation, it says:
The effect of uploading to an existing blob depends on the “versioning” and “lifecycle” policies defined on the blob’s bucket. In the absence of those policies, upload will overwrite any existing contents.
The versioning for the bucket is set to suspended, and I'm not aware of any other settings or any changes I have made that would affect this.
How can I make the file upload overwrite any existing file with the same name?
After further testing, although print(file.name) looked correct, the incorrect filename was actually coming from Django's get_available_name() storage class method, which was trying to generate a unique filename if the file already existed. I have overridden the method in my custom storage class and, if the file meets the criteria, I just return the existing name so the upload overwrites it. I'm still not sure why it started doing this, however.
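For illustration, here is a minimal sketch of such an override on a custom storage class, assuming the django-storages GoogleCloudStorage backend (the class name is made up, and this simplified version always reuses the requested name rather than checking any criteria):

# Sketch only: always return the requested name so uploads overwrite the
# existing object instead of getting a suffixed name.
from storages.backends.gcloud import GoogleCloudStorage


class OverwritingGoogleCloudStorage(GoogleCloudStorage):
    def get_available_name(self, name, max_length=None):
        return name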

Is there any terraform module to create folders within a Bucket (GCP)

Is there any Terraform module to create folders within a bucket (GCP)?
That is to say, I already know that with the google_storage_bucket resource we can create GCS buckets in GCP.
But is there any way to create folders within a bucket (GCP) by using Terraform?
thanks,
So a quick answer to your question is: yes. You can use Terraform to effectively make an empty "directory" in a bucket. Here's how:
resource "google_storage_bucket" "storage_bucket" {
name = "my-really-awesome-test-bucket"
location = "us-east4"
project = "my-really-awesome-project"
}
resource "google_storage_bucket_object" "content_folder" {
name = "empty_directory/"
content = "Not really a directory, but it's empty."
bucket = "${google_storage_bucket.storage_bucket.name}"
}
Notice that you're creating an object with a trailing / at the end of the name. The content goes nowhere and is just there because the resource requires either content or source to be set. And now when you log into the GCP console, you'll see the empty "directory" in the bucket and you can upload new objects into it.
But there's some other stuff going on here that you should know. Google Cloud Storage uses a flat file-system. This means that when you upload an object to the service, you aren't really creating a directory structure and storing your file inside. Instead you are creating a single file with the entire path (ex: '/bucket_name/directory1/directory2/filename') as the entire file name. It's actually more technical than this, but that's a rough explanation.
Object storage is completely flat. There are no folders. When you see a folder in the UI or output from a command, it's just metadata describing how to show it to you.
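To show the same thing outside Terraform, here is a small sketch with the Python client library (the bucket name is the one from the example above) that creates the same kind of zero-byte "directory" marker and then a file that appears to live inside it:

# Sketch only: both the "folder" and the file are just flat objects whose
# names contain slashes.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-really-awesome-test-bucket")

bucket.blob("empty_directory/").upload_from_string("")
bucket.blob("empty_directory/hello.txt").upload_from_string("hello")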