Rename file name in GCP Cloud Storage and remove square brackets [] - google-cloud-platform

We have many videos uploaded to GCP Cloud Storage.
We need to change the file names and remove the [].
Is there a good solution?
File example:
gs://xxxxxx/xxxxxx/[BlueLobster] Saint Seiya The Lost Canvas - 06 [1080p].mkv

You can't rename a file in Cloud Storage. Renaming a file amounts to copying it under a new name and then deleting the old one.
It will take time if you have a lot of (large) files, but it's not impossible.

Based on the given scenario, you want to bulk-rename all the filenames containing "[]". Based on this documentation, gsutil interprets these characters as wildcards, and gsutil does not currently support this.
There's a way to handle this kind of request by using a custom script to rename all the files containing "[".
You may use any programming language that has a Cloud Storage client library. For these instructions, we'll be using Python for the custom script.
In the Google Cloud Console, click Activate Cloud Shell at the top right, beside the question mark icon. For more information, you may refer here.
In Cloud Shell, install the Python client library with this command:
pip install --upgrade google-cloud-storage
For more information, please refer to this documentation.
After installing the client library, launch the Cloud Shell Editor by clicking Open Editor at the top right of Cloud Shell. You may refer here for more information.
In the Cloud Shell Editor, click the File menu and choose New File. Name it script.py and click OK.
This code assumes that the objects in your bucket are named like the sample you provided:
import re
from google.cloud import storage

storage_client = storage.Client()

bucket_name = "my_bucket"
bucket = storage_client.bucket(bucket_name)
blobs = storage_client.list_blobs(bucket_name)

# Match any bracket, brace, or parenthesis in the object name
pattern = r"[\[\]{}()]"

for blob in blobs:
    out_var = blob.name
    fixed_var = re.sub(pattern, '', blob.name)
    print(out_var + " " + fixed_var)
    new_blob = bucket.rename_blob(blob, fixed_var)
Change "my_bucket" to the name of your bucket.
Click File and then Save, or just press Ctrl + S.
Go back to the terminal by clicking Open Terminal at the top right of the Cloud Shell Editor.
Copy and paste this command into the terminal:
python script.py
Press the Enter key to run the script.
Files that have brackets in their names are now renamed.
The files aren't really renamed on the backend. Under the hood, because objects are immutable, each file is rewritten with a new name: the old object is copied under the new name and the old one is removed afterwards.
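For reference, here is a minimal sketch of the copy-and-delete that rename_blob() performs under the hood; the object names below are illustrative only:
import re
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my_bucket")

# Copy the object under the bracket-free name, then delete the original.
src = bucket.blob("[BlueLobster] Saint Seiya The Lost Canvas - 06 [1080p].mkv")
new_name = re.sub(r"[\[\]{}()]", "", src.name)
bucket.copy_blob(src, bucket, new_name)
bucket.delete_blob(src.name)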

Related

Django open excel.xlsx with openpyxl from Google Cloud Storage

I need to open a .xlsx file from my bucket on Google Cloud Storage; the problem is I get: FileNotFoundError at /api/ficha-excel
[Errno 2] No such file or directory: 'ficha.xlsx'
These are the settings from my bucket.
UPLOAD_ROOT = 'reportes/'
MEDIA_ROOT = 'reportes'
This is the route: bucket/reportes/ficha.xlsx
This is the code of my get function:
directorio = FileSystemStorage("/reportes").base_location
os.makedirs(directorio, exist_ok=True)
# read
print("Directorios: ", directorio)
plantilla_excel = openpyxl.load_workbook(f"{directorio}/ficha.xlsx")
print(plantilla_excel.sheetnames)
currentSheet = plantilla_excel['Hoja1']
print(currentSheet['A5'].value)
What is the problem with the path? I can't figure it out.
The solution below doesn't use Django FileStorage/Storage classes. It opens a .xlsx file from a Google Cloud Storage bucket using openpyxl.
Summary:
I uploaded the Excel file to GCS, read the blob data with openpyxl via BytesIO, and saved the data in the workbook using the .save() method.
Steps to Follow:
Create a Google Cloud Storage bucket. Choose a globally unique name for it. Keep the defaults and finally click Create.
Choose an Excel file from your local system and upload it to the bucket using the "Upload files" option.
Once you have the Excel file in your bucket, follow the steps below:
Go to Google Cloud Platform and create a service account (API). Click Navigation Menu > APIs & Services > Credentials to go to the screen.
Then click Manage Service Accounts.
On the next screen, click Create Service Account.
Enter the details of the service account for each item.
In the next section, you will create a role for Cloud Storage. Choose Storage Admin (full permission).
Click the service account you created, click Add Key in the Keys field, and select Create New Key.
Select JSON as the key type and create it. The JSON file is downloaded to your local storage; you will use it in the next step to operate Cloud Storage from Python.
We will install the libraries required for this project in Cloud Shell. First, install the Google Cloud Storage library with pip to access Cloud Storage:
pip install google-cloud-storage
Install openpyxl using:
pip install openpyxl
Create a folder (for example, excel) with a name of your choice in your Cloud editor.
Create these files within it:
main.py
the JSON key file (the one that was downloaded to local storage; copy that file into this folder)
The folder then looks like this:
excel
  main.py
  ●●●●●●●●●●.json
Write the below lines of code in the main.py file:
from google.cloud import storage
import openpyxl
import io

# Create a client instance for Google Cloud Storage
# (the path to the JSON key file you copied into this folder)
client = storage.Client.from_service_account_json('●●●●●●●●●●.json')

# Get an instance of the bucket (only the bucket name, the full path is not necessary)
bucket = client.bucket('bucket_name')

# Get a blob instance of the file (test.xlsx is the Excel file already uploaded to the bucket)
blob = bucket.blob('test.xlsx')

buffer = io.BytesIO()
blob.download_to_file(buffer)
buffer.seek(0)  # rewind the buffer before handing it to openpyxl

wb = openpyxl.load_workbook(buffer)
wb.save('./retest.xlsx')
You will see a file retest.xlsx created in the same folder in the Cloud Editor.
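If you only need to read values (as in the question) rather than write a copy, the same in-memory workbook can be queried directly; the sheet name 'Hoja1' and cell 'A5' below come from the question and are assumptions about your file:
# Continuing from the snippet above: read cells straight from the in-memory workbook.
sheet = wb['Hoja1']       # sheet name taken from the question
print(sheet['A5'].value)  # no need to write retest.xlsx just to read values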

How can I use GCS Delete in Data Fusion Studio?

Apologies if this is very simple but I am a complete beginner at GCP.
I've created a pipeline that picks up multiple CSVs from a bucket, wrangles them, then writes them into BigQuery. I want it to then delete the contents of the bucket folder the files came from. So let's say I pulled the CSVs using gs://bucket/Data/Country/*.CSV, can I use GCS Delete to get rid of all the CSVs in there?
As a desperate attempt :D, in the Objects to delete, I specified gs://bucket/Data/Country/*.* but this didn't do a thing.
According to the Google Cloud Storage Delete plugin documentation, it's necessary to list each object, separated by commas.
There is a feature request asking to allow suffixes and prefixes in this plugin; you can use the +1 button and provide feedback about how this feature would be useful to you.
On the other hand, I thought of a workaround that could work for you. Using the GCS documentation, I created a script to list all CSV objects in a bucket; you only have to copy and paste the output into the Objects to Delete property of the plugin. It's important to mention that I used this workaround with roughly 100 files; I'm not sure it's feasible with a much larger number of files.
from google.cloud import storage

bucket_name = "MY_BUCKET"
file_format = "csv"

def list_csv(bucket_name):
    storage_client = storage.Client()
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        if file_format in blob.name:
            print("gs://" + bucket_name + "/" + blob.name + ",")
    return None

list_csv(bucket_name)

GCP AI Notebook can't access storage bucket

New to GCP. Trying to load a saved model file into an AI Platform notebook. Tried several approaches without success.
Most obvious approach seemed to be to set the value of a variable to the path copied from storage:
model_path = "gs://<my-bucket>/models/3B/export/1600635833/saved_model.pb"
Results: OSError: SavedModel file does not exist at: (the above path)
I know I can connect to the bucket and retrieve contents because I downloaded a csv file from the bucket and printed out the contents.
The OSError sounds to me like you are trying to access the GCS bucket through the regular file system, which does not support GCS paths (for example, Python's open() function).
To access files in GCS I recommend you use the client libraries: https://cloud.google.com/storage/docs/reference/libraries
Another option for testing is to connect via SSH and use the gsutil command.
Note: I assume <my-bucket> was edited to replace your real GCS bucket name.
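As a minimal sketch of the client-library approach suggested above (the bucket and object names are taken from the question and will need adjusting):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("<my-bucket>")  # replace with your real bucket name
blob = bucket.blob("models/3B/export/1600635833/saved_model.pb")

# Download to the notebook's local disk; note that loading a SavedModel usually
# requires the whole export directory (variables/, assets/), not just the .pb file.
blob.download_to_filename("/tmp/saved_model.pb")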
According to the GCP documentation here, you are able to access Cloud Storage; this page will guide you through using Cloud Storage with AI Platform Training.

Google Cloud Webapp to use a user uploaded file for processing

I recently moved my project from Heroku to Google Cloud. It's written in Flask and basically does some text summarization (nothing fancy) of an uploaded .docx file. I was able to use files locally on Heroku due to its ephemeral file system.
With Google Cloud, I find myself lost trying to use an uploaded file and run Python functions on it.
The error I'm getting is:
with open(self.file, 'rb') as file: FileNotFoundError: [Errno 2] No such file or directory: 'http://storage.googleapis.com/...'
I've edited the specifics out for now, but when I open the link in a browser it brings up the download window. I know the file gets there, since I go to Google Cloud and everything is in the proper bucket.
Also, is there a way to delete from the bucket immediately after Python goes through the document? I currently have the lifecycle set to a day, but I only need the data temporarily while it's processed.
I'm sorry if these are silly questions. Very new to this and trying to learn.
Thanks
Oh and here's the current code
gcs = storage.Client()
bucket = gcs.bucket(CLOUD_STORAGE_BUCKET)
user_file = request.files['file']
local = secure_filename(user_file.filename)
blob = bucket.blob(local)
blob.upload_from_string(user_file.read(), content_type=user_file.content_type)
this_file = f"http://storage.googleapis.com/{CLOUD_STORAGE_BUCKET}/{local}"
Then a function is supposed to open this_file (the public URL returned for the uploaded file) so it can be processed and used:
def open_file(self):
    url = self.file
    file = BytesIO(requests.get(url).content)
    return docx.Document(file)
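A possible sketch (not from the original post): read the object back through the Cloud Storage client instead of the public URL, and delete it as soon as processing is done. The bucket and blob names reuse the variables from the upload snippet above:
import io
import docx
from google.cloud import storage

gcs = storage.Client()
bucket = gcs.bucket(CLOUD_STORAGE_BUCKET)  # same bucket used for the upload
blob = bucket.blob(local)                  # `local` is the secure_filename from above

# Load the .docx straight from memory, no local file or public URL needed.
document = docx.Document(io.BytesIO(blob.download_as_bytes()))
# ... run the summarization on `document` ...

blob.delete()  # remove the object as soon as processing is finished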

Is there any way to get s3 uri from aws web console?

I want to download a directory from my s3.
When I need a file, the S3 Management Console (AWS web console) allows me to download it, but for a directory I have to use the AWS CLI, like:
$ aws s3 cp s3://mybucket/mydirectory/ . --recursive
My question is: Is there any way to get the s3 uri (s3://mybucket/mydirectory/) from s3 management console?
Its URL is available, but it is slightly different from the S3 URI required by the AWS CLI. I could not find any menu to get the URI.
Thank you in advance!
No, it is not displayed in the console. However, it is simply:
s3://<bucket-name>/<key>
Directories are actually part of the key. For example, foo.jpg stored in an images directory will actually have a key (filename) of images/foo.jpg.
(self-answer)
Because it seems there was no such way, I have created one:
pip install aws-s3-url2uri
And command aws_s3_url2uri will be available after installation.
This command internally converts the web console URLs to S3 URIs, so it works with URLs, URIs, and local paths:
aws_s3_url2uri ls "https://console.aws.amazon.com/s3/home?region=<regionname>#&bucket=mybucket&prefix=mydir/mydir2/"
calls
aws s3 ls s3://mybucket/mydir/mydir2/
internally.
To convert an S3 URL displayed in the console such as https://s3.us-east-2.amazonaws.com/my-bucket-name/filename to an S3 URI, remove the https://s3.us-east-2.amazonaws.com/ portion and replace it with s3://, like so:
s3://my-bucket-name/filename
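The same substitution can be scripted; here is a small sketch (assuming a path-style URL like the one above, with the bucket in the path rather than in the hostname):
from urllib.parse import urlparse

def url_to_s3_uri(url):
    # Drop the scheme and the s3.<region>.amazonaws.com host, keep bucket/key.
    return "s3://" + urlparse(url).path.lstrip("/")

print(url_to_s3_uri("https://s3.us-east-2.amazonaws.com/my-bucket-name/filename"))
# -> s3://my-bucket-name/filename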
It looks like this feature is now available in the AWS Web Console.
It is accessible in two ways:
Selecting the checkbox next to the file and clicking the "Copy S3 URI" button.
Clicking on the file, then clicking the "Copy S3 URI" button at the top right.
You can get the value from the console by selecting the file, then choosing Copy path on the Overview tab to copy the s3:// link to the object.
It is possible to get the S3 URI for a proper key/file in the console by selecting the key and clicking the Copy path button; this will place the S3 URI for the file on the clipboard.
However, directories are not keys as such but just key prefixes, so this will not work for them.
You may fail to get the S3 URI if you have just created a new bucket.
You can get the S3 URI after creating a new folder in your bucket: select the checkbox next to the newly created folder, then copy the S3 URI that appears at the top.