List buckets that match a bucket label with gsutil - google-cloud-platform

I have my Google Cloud Storage buckets labeled.
I can't find anything in the docs on how to do a gsutil ls that only lists buckets with a specific label - is this possible?

Just had a use case where I wanted to list all buckets with a specific label. The accepted answer using subprocess was noticeably slow for me. Here is my solution using the Python client library for Cloud Storage:
from google.cloud import storage

def list_buckets_by_label(label_key, label_value):
    # List out buckets in your default project
    client = storage.Client()
    buckets = client.list_buckets()  # Iterator

    # Only return buckets where the label key/value match inputs
    output = list()
    for bucket in buckets:
        if bucket.labels.get(label_key) == label_value:
            output.append(bucket.name)
    return output
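For example, with a hypothetical label key/value pair (assuming default credentials and project are configured):

for name in list_buckets_by_label('environment', 'production'):  # hypothetical key/value
    print(name)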

Currently it is not possible to do what you want in a single step. You can do it in three steps:
Get all the buckets of your GCP project.
Get the labels of every bucket.
Run gsutil ls on every bucket that matches your criteria.
This is the Python 3 code I wrote for you:
import subprocess

out = subprocess.getoutput("gsutil ls")
for line in out.split('\n'):
    label = subprocess.getoutput("gsutil label get " + line)
    if "YOUR_LABEL" in str(label):
        gsout = subprocess.getoutput("gsutil ls " + line)
        print("Files in " + line + ":\n")
        print(gsout)
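A small variation on the same idea, in case a plain substring check matches more than you want: gsutil label get prints the labels as JSON, so you can parse them and compare an exact key/value pair (the key and value below are placeholders):

import json
import subprocess

out = subprocess.getoutput("gsutil ls")
for line in out.split('\n'):
    raw = subprocess.getoutput("gsutil label get " + line)
    try:
        labels = json.loads(raw)  # labels are printed as a JSON object
    except ValueError:
        continue  # no label configuration: gsutil prints a plain message instead of JSON
    if labels.get("YOUR_LABEL_KEY") == "YOUR_LABEL_VALUE":
        print("Files in " + line + ":\n")
        print(subprocess.getoutput("gsutil ls " + line))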

A bash only solution:
function get_labeled_bucket {
    # list all of the buckets for the current project
    for b in $(gsutil ls); do
        # find the one with your label
        if gsutil label get "${b}" | grep -q '"key": "value"'; then
            # and return its name
            echo "${b}"
        fi
    done
}
The section '"key": "value"' is just a string, replace with your key and your value. Call the function with LABELED_BUCKET=$(get_labeled_bucket)
In my opinion, making a bash function return more than one value is more trouble than it is worth. If you need to work with multiple buckets then I would replace the echo with the code that needs to run.

from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs('bucketname', prefix='xjc/folder'):
    print(str(blob))

Related

How to know whether the path "gs://bucket1/folder_x" exists or not in a GCP bucket

Is there a gsutil command that can tell me whether the path gs://bucket1/folder1_x/folder2_y/ exists or not? Is there something like a ping command in gsutil?
I use Jenkins parameters folder_x and folder_y, whose values are input by the user and joined in the pipeline. Currently, if the directory exists, the pipeline shows success. But if the path is wrong, the pipeline is interrupted and shows failure.
I tried gsutil stat and gsutil -q stat; they can test gs://bucket1/folder1_x/folder2_y/file1, but not a directory.
pipeline {
    stages {
        stage('Check existing dirs') {
            steps {
                script {
                    if (params['Action'] == "List_etl-output") {
                        def Output_Data = "${params['Datasource']}".toString().split(",").collect{"\"" + it + "\""}
                        def Output_Stage = "${params['Etl_Output_Stage']}".toString().split(",").collect{"\"" + it + "\""}
                        for (folder1 in Output_Data) {
                            for (folder2 in Output_Stage) {
                                sh(script: """
                                    gsutil ls -r gs://bucket1/*/$Data/$Stage
                                """)
                            }
                        }
                    }
                }
            }
        }
    }
}
I used gsutil to check whether the path gs://bucket1/*/$Data/$Stage is available or not. $Data and $Stage are given by user input, and the Jenkins pipeline is interrupted when the path is not available. I want gsutil to skip a wrong path when it is not available.
Directories don't exist in Cloud Storage; they are a graphical representation. All the blobs are stored at the bucket root, and their names contain the full path (with / characters that you interpret as directories, but they are not). This is also why you can only search on a prefix.
To answer your question, you can search on the prefix: if there is at least 1 element, the "folder" exists BECAUSE there is at least 1 blob with that prefix. Here is an example in Python (I don't know your language; I can adapt it to several languages if you need):
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('bucket1')
if len(list(bucket.list_blobs(prefix='folder_x/'))):
    print('there is a file in the "directory"')
else:
    print('No file with this path, so no "directory"')
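If the prefix can match many objects, you don't have to list them all just to know whether at least one exists; a minimal variation of the same check (reusing the bucket object from above) passes max_results=1:

# One blob is enough to prove the "directory" exists
if list(bucket.list_blobs(prefix='folder_x/', max_results=1)):
    print('there is a file in the "directory"')
else:
    print('No file with this path, so no "directory"')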
Here is the example in Groovy:
import com.google.cloud.storage.Bucket
import com.google.cloud.storage.Storage
import com.google.cloud.storage.StorageOptions
Storage storage = StorageOptions.getDefaultInstance().service
Bucket bucket = storage.get("bucket1")
System.out.println(bucket.list(Storage.BlobListOption.prefix("folder_x/")).iterateAll().size())
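If the check has to stay inside the Jenkins pipeline itself, another option is to call the sh step with returnStatus: true, so that a non-zero exit code from gsutil comes back as a value you can test instead of aborting the stage.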

Generate CSV import file for AutoML Vision from an existing bucket

I already have a GCloud bucket divided by label as follows:
gs://my_bucket/dataset/label1/
gs://my_bucket/dataset/label2/
...
Each label folder has photos inside. I would like to generate the required CSV – as explained here – but I don't know how to do it programmatically, considering that I have hundreds of photos in each folder. The CSV file should look like this:
gs://my_bucket/dataset/label1/photo1.jpg,label1
gs://my_bucket/dataset/label1/photo12.jpg,label1
gs://my_bucket/dataset/label2/photo7.jpg,label2
...
You need to list all the files inside the dataset folder with their complete paths and then parse each path to obtain the name of the folder containing the file, since in your case that folder name is the label you want to use. This can be done in several different ways. I will include two examples on which you can base your code:
gsutil can list the bucket contents; you can then parse the output with a bash script:
# Create csv file and define bucket path
bucket_path="gs://buckbuckbuckbuck/dataset/"
filename="labels_csv_bash.csv"
touch $filename
IFS=$'\n' # Internal field separator variable has to be set to separate on new lines

# List every .jpg file inside the bucket's folder. ** searches for them recursively.
for i in `gsutil ls $bucket_path**.jpg`
do
    # Cut the address using the / delimiter and get the second item starting from the end.
    label=$(echo $i | rev | cut -d'/' -f2 | rev)
    echo "$i, $label" >> $filename
done
IFS=' ' # Reset to original value
gsutil cp $filename $bucket_path
It can also be done using the Google Cloud client libraries provided for different languages. Here is an example using Python:
# Imports the Google Cloud client library
import os
from google.cloud import storage

# Instantiates a client
storage_client = storage.Client()

# The name of the bucket and the folder inside it
bucket_name = 'my_bucket'
path_in_bucket = 'dataset'

blobs = storage_client.list_blobs(bucket_name, prefix=path_in_bucket)

# Reading blobs, parsing information and creating the csv file
filename = 'labels_csv_python.csv'
with open(filename, 'w+') as f:
    for blob in blobs:
        if '.jpg' in blob.name:
            bucket_path = 'gs://' + os.path.join(bucket_name, blob.name)
            label = blob.name.split('/')[-2]
            f.write(', '.join([bucket_path, label]))
            f.write("\n")

# Uploading csv file to the bucket
bucket = storage_client.get_bucket(bucket_name)
destination_blob_name = os.path.join(path_in_bucket, filename)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(filename)
For those, like me, who were looking for a way to create the .csv file for batch processing in Google AutoML but don't need the label column:
# Create csv file and define bucket path
bucket_path="gs://YOUR_BUCKET/FOLDER"
filename="THE_FILENAME_YOU_WANT.csv"
touch $filename
IFS=$'\n' # Internal field separator variable has to be set to separate on new lines

# List every [YOUR_EXTENSION] file inside the bucket's folder - change in the next line, i.e. **.png becomes **.your_extension. ** searches for them recursively.
for i in `gsutil ls $bucket_path**.png`
do
    echo "$i" >> $filename
done
IFS=' ' # Reset to original value
gsutil cp $filename $bucket_path

How to delete images with suffix from folder in S3 bucket

I have stored multiple sizes of each image on S3,
e.g. image100_100, image200_200, image300_150.
I want to delete a specific size of image, such as images with the suffix 200_200, from the folder. There are a lot of images in this folder, so how can I delete them?
Use AWS command-line interface (AWS CLI):
aws s3 rm s3://Path/To/Dir/ --recursive --exclude "*" --include "*200_200"
We first exclude everything, then include what we need to delete. This is a workaround to mimic the behavior of rm -r "*200_200" command in Linux.
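If you want to preview what would be removed first, the same command also accepts a --dryrun flag, which only prints the operations it would perform without deleting anything.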
The easiest method would be to write a Python script, similar to:
import boto3

BUCKET = 'my-bucket'
PREFIX = ''  # eg 'images/'

s3_client = boto3.client('s3', region_name='ap-southeast-2')

# Get a list of objects
list_response = s3_client.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)

while True:
    # Find desired objects to delete
    objects = [{'Key': object['Key']} for object in list_response['Contents'] if object['Key'].endswith('200_200')]
    print('Deleting:', objects)

    # Delete objects
    if len(objects) > 0:
        delete_response = s3_client.delete_objects(
            Bucket=BUCKET,
            Delete={'Objects': objects}
        )

    # Next page
    if list_response['IsTruncated']:
        list_response = s3_client.list_objects_v2(
            Bucket=BUCKET,
            Prefix=PREFIX,
            ContinuationToken=list_response['NextContinuationToken'])
    else:
        break
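Note that both list_objects_v2 and delete_objects work in batches of at most 1000 keys per call, which is why the script pages through the listing and deletes page by page.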

Copy Without Prefix s3

I have directory structures in s3 like
bucket/folder1/*/*.csv
Where the folder wildcard refers to a number of different folders containing csv files.
I want to copy them, without the prefix, to
bucket/folder2/*.csv
Ex:
bucket/folder1/
s3distcp --src=s3://bucket/folder1/ --dest=s3://bucket/folder2/ --srcPattern=.*/csv
Results in the undesired structure of:
bucket/folder2/*/*.csv
I need a solution to copy in bulk that is scalable. Can I do this with s3distcp? Can I do this with aws s3 cp (without having to execute the aws s3 cp per file)?
You could try the following CLI command (aws s3 sync copies recursively by default):
aws s3 sync s3://SOURCE_BUCKET_NAME s3://DESTINATION_BUCKET_NAME
There is no shortcut to do what you wish, because you are manipulating the path to the objects.
You could instead write a little program to do it, such as:
import boto3

BUCKET = 'my-bucket'

s3_client = boto3.client('s3', region_name='ap-southeast-2')

# Get a list of objects in folder1
response = s3_client.list_objects_v2(Bucket=BUCKET, Prefix='folder1')

# Copy files to folder2, keeping a flat hierarchy
for object in response['Contents']:
    key = object['Key']
    print(key)
    s3_client.copy_object(
        CopySource={'Bucket': BUCKET, 'Key': key},
        Bucket=BUCKET,
        Key='folder2' + key[key.rfind('/'):]
    )
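A single list_objects_v2 call returns at most 1000 keys, so if folder1 holds more objects than that, the loop above will miss some. A minimal sketch of the same copy using a paginator (reusing the BUCKET and s3_client names from above):

# Page through all keys under folder1 and copy each one into folder2, flattened
paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET, Prefix='folder1'):
    for object in page.get('Contents', []):
        key = object['Key']
        s3_client.copy_object(
            CopySource={'Bucket': BUCKET, 'Key': key},
            Bucket=BUCKET,
            Key='folder2' + key[key.rfind('/'):]
        )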
Ended up using Apache Nifi to do this, changing the filename attribute of the flowfile (use regex to remove all of the path before the last '/') and writing with a prefix to the desired directory. It scales really well.

Downloading folders from Google Cloud Storage Bucket

I'm new to Google Cloud Platform. I have trained my model on Datalab and saved the model folder to Cloud Storage in my bucket. I'm able to download existing files in the bucket to my local machine by right-clicking on the file --> save link as. But when I try to download a folder by the same procedure, I don't get the folder but only an image of it. Is there any way I can download the whole folder and its contents as they are? Is there any gsutil command to copy folders from Cloud Storage to a local directory?
You can find docs on the gsutil tool here and for your question more specifically here.
The command you want to use is:
gsutil cp -r gs://bucket/folder .
This is how you can download a folder from a Google Cloud Storage bucket via the Google Cloud Console.
Run the following command to download it from bucket storage to your Google Cloud Console local path:
gsutil -m cp -r gs://{bucketname}/{folderPath} {localpath}
Once you run that command, confirm that your folder is on the local path by running the ls command to list its files and directories.
Now zip your folder by running the command below:
zip -r foldername.zip yourfolder/*
Once the zip process is done, click on the "more" dropdown menu at the right side of the Google Cloud Console,
then select the "Download file" option. You will be prompted to enter the name of the file that you want to download; enter the name of the zip file - "foldername.zip".
Prerequisites:
Google Cloud SDK is installed and initialized ($ gcloud init)
Command:
gsutil -m cp -r gs://bucket-name .
This will copy all of the files using multiple parallel threads, which is faster. I found that the "dir" command instructed for use in the official gsutil docs did not work.
If you are downloading data from Google Cloud Storage using Python and want to maintain the same folder structure, follow this code I wrote in Python.
OPTION 1
import os
import logging
from google.cloud import storage

def findOccurrences(s, ch):  # find positions of '/' in the blob path, used to create folders in local storage
    return [i for i, letter in enumerate(s) if letter == ch]

def download_from_bucket(bucket_name, blob_path, local_path):
    # Create this folder locally
    if not os.path.exists(local_path):
        os.makedirs(local_path)

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blobs = list(bucket.list_blobs(prefix=blob_path))

    startloc = 0
    for blob in blobs:
        startloc = 0
        folderloc = findOccurrences(blob.name.replace(blob_path, ''), '/')
        if not blob.name.endswith("/"):
            if blob.name.replace(blob_path, '').find("/") == -1:
                downloadpath = local_path + '/' + blob.name.replace(blob_path, '')
                logging.info(downloadpath)
                blob.download_to_filename(downloadpath)
            else:
                for folder in folderloc:
                    if not os.path.exists(local_path + '/' + blob.name.replace(blob_path, '')[startloc:folder]):
                        create_folder = local_path + '/' + blob.name.replace(blob_path, '')[0:startloc] + '/' + blob.name.replace(blob_path, '')[startloc:folder]
                        startloc = folder + 1
                        os.makedirs(create_folder)
                downloadpath = local_path + '/' + blob.name.replace(blob_path, '')
                blob.download_to_filename(downloadpath)
                logging.info(blob.name.replace(blob_path, '')[0:blob.name.replace(blob_path, '').find("/")])

    logging.info('Blob {} downloaded to {}.'.format(blob_path, local_path))

bucket_name = 'google-cloud-storage-bucket-name'  # do not use gs://
blob_path = 'training/data'  # blob path in bucket where data is stored
local_dir = 'local-folder name'  # trainingData folder in local
download_from_bucket(bucket_name, blob_path, local_dir)
OPTION 2: using the gsutil CLI from Python
One more option is to invoke gsutil from a Python program, as below.
import os

def download_bucket_objects(bucket_name, blob_path, local_path):
    # blob_path is the bucket folder name
    command = "gsutil cp -r gs://{bucketname}/{blobpath} {localpath}".format(bucketname=bucket_name, blobpath=blob_path, localpath=local_path)
    os.system(command)
    return command
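os.system does not tell you whether gsutil actually succeeded. A small variation (not part of the original answer) uses subprocess.run with check=True so a non-zero exit code raises an exception:

import subprocess

def download_bucket_objects(bucket_name, blob_path, local_path):
    # Same gsutil invocation as above, but fail loudly if gsutil returns a non-zero exit code
    subprocess.run(
        ["gsutil", "-m", "cp", "-r",
         "gs://{}/{}".format(bucket_name, blob_path),
         local_path],
        check=True,
    )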
OPTION 3 - No Python, directly using the terminal and the Google Cloud SDK
Prerequisites: Google Cloud SDK is installed and initialized ($ gcloud init)
Refer to below link for commands:
https://cloud.google.com/storage/docs/gsutil/commands/cp
gsutil -m cp -r gs://bucket-name "{path to local existing folder}"
Works for sure.
As of Mar. 2022, the gs:// path needs to be double quoted. You can actually find the proper download command by navigating to the bucket root, checking one of the directories and clicking Download at the top.
Here's the code I wrote.
This will download the complete directory structure to your VM/local storage.
from google.cloud import storage
import os

bucket_name = "ar-data"
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

dirName = 'Data_03_09/'  # ***folder in bucket whose content you want to download
blobs = bucket.list_blobs(prefix=dirName)  # , delimiter='/')
destpath = r'/home/jupyter/DATA_test/'  # ***path on your VM/local where you want to download the bucket directory

for blob in blobs:
    # blob path relative to dirName (note: str.lstrip() strips characters, not a prefix, so slice instead)
    relpath = blob.name[len(dirName):]
    # print(relpath.split('/'))
    currpath = destpath
    if not os.path.exists(os.path.join(destpath, '/'.join(relpath.split('/')[:-1]))):
        for n in relpath.split('/')[:-1]:
            currpath = os.path.join(currpath, n)
            if not os.path.exists(currpath):
                print('creating directory- ', n, 'On path-', currpath)
                os.mkdir(currpath)
    print("downloading ... ", relpath)
    blob.download_to_filename(os.path.join(destpath, relpath))
Or simply use, in a terminal:
gsutil -m cp -r gs://{bucketname}/{folderPath} {localpath}