There is a bucket on GCP in which the folders are named as below.
Tests_111/tests.pdf
Tests_112/tests.pdf
AllTests_111/alltests.pdf
...
These folders need to be copied into another bucket in the following structure.
Tests/111/tests.pdf
Tests/112/tests.pdf
AllTests/111/alltests.pdf
I'm basically just trying to check whether the folder name can be split on '_', which would give me "Tests" and "111". After that, I create a folder named "Tests" and a subfolder named "111" within it.
I'm new to GCP and gsutil. Is there any way to achieve this?
Here is an example code snippet to copy objects from one bucket to another:
from google.cloud import storage

def copy_blob(
    bucket_name, blob_name, destination_bucket_name, destination_blob_name
):
    """Copies a blob from one bucket to another with a new name."""
    # bucket_name = "your-bucket-name"
    # blob_name = "your-object-name"
    # destination_bucket_name = "destination-bucket-name"
    # destination_blob_name = "destination-object-name"

    storage_client = storage.Client()

    source_bucket = storage_client.bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.bucket(destination_bucket_name)

    blob_copy = source_bucket.copy_blob(
        source_blob, destination_bucket, destination_blob_name
    )

    print(
        "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )
Ref: https://cloud.google.com/storage/docs/copying-renaming-moving-objects#storage-copy-object-python
So you only have to write the logic that derives the destination_blob_name from the blob_name. By the way, a blob name is fully qualified, meaning it includes the folder name and the file name, e.g. Tests/111/tests.pdf.
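A minimal sketch of that renaming logic, assuming every source object follows the Prefix_Number/file.pdf pattern from the question (the bucket names here are placeholders):

from google.cloud import storage

def to_destination_name(blob_name):
    # "Tests_111/tests.pdf" -> "Tests/111/tests.pdf"
    folder, _, filename = blob_name.partition('/')
    prefix, _, number = folder.rpartition('_')
    return f"{prefix}/{number}/{filename}" if number else blob_name

storage_client = storage.Client()
source_bucket = storage_client.bucket("source-bucket-name")            # placeholder
destination_bucket = storage_client.bucket("destination-bucket-name")  # placeholder

for blob in storage_client.list_blobs("source-bucket-name"):
    source_bucket.copy_blob(blob, destination_bucket, to_destination_name(blob.name))

Note that Cloud Storage has no real folders, so there is nothing to "create" first; writing an object named Tests/111/tests.pdf is enough.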
Actually I use this command, and it works well:
gsutil cp gs://bucket1/file.xml gs://bucket2/destination_folder
(bucket1 is in project1 in GCP and bucket2 is in another GCP project)
But I would like to run that command every day at 9 AM. How can I do that in my GCP project in an easy way?
Edit: It will copy the file each day from the source bucket to the destination bucket (the two buckets are each in a different project). When the file arrives in the destination bucket, it is consumed and ingested into BigQuery automatically; I just want to trigger my gsutil command instead of doing it manually every morning.
(Except for the Data Transfer method, because I don't have rights on the source project, so I cannot activate the service account for Data Transfer; I only have rights on the destination project.)
Best regards,
Actually I can copy a file from one bucket into a specific folder in another bucket (note: the 2 buckets are in the same GCP project).
I can't get the second method with a gs:// path to work.
EDIT 2:
import base64
import sys
import urllib.parse
# Imports the Google Cloud client library; don't forget the requirement, or it will fail
from google.cloud import storage

def copy_blob(
    bucket_name="prod-data", blob_name="test.csv", destination_bucket_name="prod-data-f", destination_blob_name="channel_p"
):
    """Copies a blob from one bucket to another with a new name."""
    bucket_name = "prod-data"
    blob_name = "test.csv"
    destination_bucket_name = "prod-data-f"
    destination_blob_name = "channel_p/test.csv"

    storage_client = storage.Client()

    source_bucket = storage_client.bucket(bucket_name)
    source_blob = source_bucket.blob("huhu/" + blob_name)
    destination_bucket = storage_client.bucket(destination_bucket_name)

    blob_copy = source_bucket.copy_blob(
        source_blob, destination_bucket, destination_blob_name
    )

    # Second Method (KO)
    #
    # client = storage.Client()
    # with open('gs://prod-data-f/channelp.xml','wb') as file_obj:
    #     client.download_blob_to_file(
    #         'gs://pathsource/somefolder/channelp.xml', file_obj)
    #
    # End of second Method

    print(
        "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )
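For what it is worth, the commented-out second method fails because open() expects a local file path, not a gs:// URI. A hedged sketch of a working variant, reusing the paths from the snippet above (adjust them to your buckets):

from google.cloud import storage

client = storage.Client()

# download_blob_to_file accepts a gs:// URI as the source, but the target must be a local file
with open('/tmp/channelp.xml', 'wb') as file_obj:
    client.download_blob_to_file('gs://pathsource/somefolder/channelp.xml', file_obj)

# then re-upload the local copy into the destination bucket
client.bucket('prod-data-f').blob('channelp.xml').upload_from_filename('/tmp/channelp.xml')

That said, a server-side copy_blob, as in the first method, avoids the local round trip and is usually preferable.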
Data transfer is obviously the right tool for doing this, but since you cannot use it, there are alternative solutions.
One of them is to copy the files with a Cloud Function (you can use this snippet) and trigger that Cloud Function every day at 9 AM with Cloud Scheduler. A Cloud Function can also be triggered by a Pub/Sub message.
The solution I was seeking (it works for me when I test it):
Main.py
import base64
import os
import sys
import json
import uuid
import logging
from time import sleep
from flask import request
from random import uniform
from google.cloud import firestore
from google.cloud.exceptions import Forbidden, NotFound
from google.cloud import storage
# set retry deadline to 60s
DEFAULT_RETRY = storage.retry.DEFAULT_RETRY.with_deadline(60)
def Move2FinalBucket(data, context):
    # if 'data' in event:
    #     name = base64.b64decode(event['data']).decode('utf-8')
    # else:
    #     name = 'NO_DATA'
    # print('Message {}!'.format(name))

    # Get cache source bucket
    cache_bucket = storage.Client().get_bucket('nameofmysourcebucket', timeout=540, retry=DEFAULT_RETRY)

    # Get source file to copy
    blob2transfer = cache_bucket.blob('uu/oo/pp/filename.csv')

    # Get cache destination bucket
    destination_bucket = storage.Client().get_bucket('nameofmydestinationbucket', timeout=540, retry=DEFAULT_RETRY)

    # Get destination file
    new_file = destination_bucket.blob('kk/filename.csv')

    # rewrite into new_file
    new_file.rewrite(blob2transfer, timeout=540, retry=DEFAULT_RETRY)
requirements.txt
# Function dependencies, for example:
# package>=version
#google-cloud-storage==1.22.0
google-cloud-storage
google-cloud-firestore
google-api-core
flask==1.1.4
Don't forget to attach a service account with the right Storage Admin role to this Cloud Function, and it will work.
Best regards,
I have an S3 bucket with a folder, and inside the folder there are large files.
I want to rename the folder with a Python 3 boto3 script.
I read this ("How to Rename Amazon S3 Folder Objects with Python"), and what it does is copy the files under a new prefix and then delete the original folder.
That is a very inefficient way to do it, and because I have large files, it will take a long time.
Is there a simpler/more efficient way to do it?
There is no way to rename s3 objects/folders - you will need to copy them to the new name and delete the old name unfortunately.
There is a mv command in the aws cli, but behind the scenes it is doing a copy then delete for you - so you can make the operation easier, but it is not a true 'rename'.
https://docs.aws.amazon.com/cli/latest/reference/s3/mv.html
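For reference, a minimal boto3 sketch of that copy-then-delete approach (the bucket name and prefixes are placeholders):

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')  # placeholder bucket name
old_prefix, new_prefix = 'old-folder/', 'new-folder/'

for obj in bucket.objects.filter(Prefix=old_prefix):
    new_key = new_prefix + obj.key[len(old_prefix):]
    # server-side copy to the new key, then delete the original
    bucket.copy({'Bucket': bucket.name, 'Key': obj.key}, new_key)
    obj.delete()

The copy is performed server-side, so the data never passes through your machine, but it is still a copy plus a delete rather than a true rename.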
Simple, no. Unfortunately.
There are a lot of 'issues' with folder structures in S3, it seems, as the storage is flat.
I have a Django project where I needed the ability to rename a folder but still keep the directory structure intact, meaning empty folders would need to be copied and stored in the renamed directory as well.
The AWS CLI is great, but neither cp nor sync nor mv copied empty folders (i.e. keys ending in '/') over to the new folder location, so I used a mixture of boto3 and the AWS CLI to accomplish the task.
More or less, I find all the folders in the directory being renamed and use boto3 to put them in the new location, then I cp the data with the AWS CLI and finally remove the original.
import threading
import os
from django.conf import settings
from django.contrib import messages
from django.core.files.storage import default_storage
from django.shortcuts import redirect
from django.urls import reverse
def rename_folder(request, client_url):
    """
    :param request:
    :param client_url:
    :return:
    """
    current_property = request.session.get('property')
    if request.POST:
        # name the change
        new_name = request.POST['name']
        # old full path with www.[].com?
        old_path = request.POST['old_path']
        # remove the query string
        old_path = ''.join(old_path.split('?')[0])
        # remove the .com prefix item so we have the path in the storage
        old_path = ''.join(old_path.split('.com/')[-1])
        # remove empty values, this will happen at end due to these being folders
        old_path_list = [x for x in old_path.split('/') if x != '']
        # remove the last folder element with split()
        base_path = '/'.join(old_path_list[:-1])
        # now build the new path
        new_path = base_path + f'/{new_name}/'
        # remove empty variables
        # print(old_path_list[:-1], old_path.split('/'), old_path, base_path, new_path)
        endpoint = settings.AWS_S3_ENDPOINT_URL
        # recursively add the files
        copy_command = f"aws s3 --endpoint={endpoint} cp s3://{old_path} s3://{new_path} --recursive"
        remove_command = f"aws s3 --endpoint={endpoint} rm s3://{old_path} --recursive"
        # get_creds() is nothing special, it simply returns the elements needed via boto3
        client, resource, bucket, resource_bucket = get_creds()
        path_viewing = f'{"/".join(old_path.split("/")[1:])}'
        directory_content = default_storage.listdir(path_viewing)
        # loop over folders and add them by default; aws cli does not copy empty ones,
        # so this is used to accommodate
        folders, files = directory_content
        for folder in folders:
            new_key = new_path + folder + '/'
            # we must remove the bucket name for this to work
            new_key = new_key.split(f"{bucket}/")[-1]
            # push this to a new thread
            threading.Thread(target=put_object, args=(client, bucket, new_key,)).start()
            print(f'{new_key} added')
        # run the commands, which will copy all data and then remove the original
        os.system(copy_command)
        print('Copy Done...')
        os.system(remove_command)
        print('Remove Done...')
        # print(bucket)
        print(f'Folder renamed.')
        messages.success(request, f'Folder Renamed to: {new_name}')
    return redirect(request.META.get('HTTP_REFERER', f"{reverse('home', args=[client_url])}"))
The link below shows how to download the entire contents of an S3 bucket. However, how does one get a subfolder's content? Suppose my S3 folder has the following emulated structure.
S3Folder/S1/file1.c
S3Folder/S1/file2.h
S3Folder/S1/file1.h
S3Folder/S2/file.exe
S3Folder/S2/resource.data
Suppose I am interested only in the S2 folder. How do I isolate those keys in the bucket list?
Local backup of S3 content:
conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(bucket_name)
# go through the list of files
bucket_list = bucket.list()
for l in bucket_list:
    keyString = str(l.key)
    d = LOCAL_PATH + keyString
    try:
        l.get_contents_to_filename(d)
    except OSError:
        # check if dir exists
        if not os.path.exists(d):
            os.mkdir(d)
You could do the following:
import os
import boto3
s3_resource = boto3.resource("s3", region_name="us-east-1")
def download_objects():
    root_dir = 'D:/'  # local machine location
    s3_bucket_name = 'S3_Bucket_Name'  # s3 bucket name
    s3_root_folder_prefix = 'sample'  # root folder inside the bucket
    s3_folder_list = ['s3_folder_1', 's3_folder_2', 's3_folder_3']  # sub-folders under the root folder
    my_bucket = s3_resource.Bucket(s3_bucket_name)
    for file in my_bucket.objects.filter(Prefix=s3_root_folder_prefix):
        if any(s in file.key for s in s3_folder_list):
            try:
                path, filename = os.path.split(file.key)
                try:
                    os.makedirs(root_dir + path)
                except Exception as err:
                    pass
                my_bucket.download_file(file.key, root_dir + path + '/' + filename)
            except Exception as err:
                print(err)

if __name__ == '__main__':
    download_objects()
You can download S3 objects by filtering on a prefix of the key value.
So, according to your question, you just need to use the prefix 'S3Folder/S2/' when downloading objects.
FYI: s3 download object using boto3
For more, check this.
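A minimal sketch of that prefix filter with boto3, assuming the layout from the question (the bucket name and local directory are placeholders):

import os
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')  # placeholder bucket name

# only keys under S3Folder/S2/ are returned thanks to the Prefix filter
for obj in bucket.objects.filter(Prefix='S3Folder/S2/'):
    local_path = os.path.join('backup', obj.key)  # e.g. backup/S3Folder/S2/file.exe
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    bucket.download_file(obj.key, local_path)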
I am already connected to the instance and I want to upload the files that are generated from my python script directly to S3. I have tried this:
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket('alexandrabucket')
from boto.s3.key import Key
key = bucket.new_key('s0').set_contents_from_string('some content')
but this rather creates a new key s0 with the content "some content", while I want to upload the directory s0 to my bucket.
I also had a look at s3put, but I didn't manage to get what I want.
The following function can be used to upload a directory to S3 via boto3:
import os
import boto3

s3C = boto3.client('s3')

def uploadDirectory(path, bucketname):
    for root, dirs, files in os.walk(path):
        for file in files:
            s3C.upload_file(os.path.join(root, file), bucketname, file)
Provide the path to the directory and the bucket name as inputs. The files are placed directly at the top level of the bucket; alter the last argument of upload_file() to place them in "directories", as in the sketch below.
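A hedged variant of that last argument, using the path relative to the uploaded directory as the object key (this key expression is an assumption, not part of the original answer):

def uploadDirectory(path, bucketname):
    for root, dirs, files in os.walk(path):
        for file in files:
            # keep the relative path as the key, so "directories" are preserved in the bucket
            key = os.path.relpath(os.path.join(root, file), path).replace(os.sep, '/')
            s3C.upload_file(os.path.join(root, file), bucketname, key)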
There is nothing in the boto library itself that would allow you to upload an entire directory. You could write your own code to traverse the directory using os.walk or similar and to upload each individual file using boto.
There is a command-line utility in boto called s3put that could handle this, or you could use the AWS CLI tool, which has a lot of features that allow you to upload entire directories or even sync an S3 bucket with a local directory, or vice versa.
The s3fs package provides nice functionality to handle such cases:
import s3fs

s3_file = s3fs.S3FileSystem()
local_path = "some_dir_path/some_dir_path/"
s3_path = "bucket_name/dir_path"
s3_file.put(local_path, s3_path, recursive=True)
I built this function based on the feedback from @JDPTET; however,
I needed to remove the entire common local path so it would not be uploaded to the bucket!
Not sure how many path separators I encounter - so I had to use os.path.normpath
import os
import boto3

def upload_folder_to_s3(s3bucket, inputDir, s3Path):
    print("Uploading results to s3 initiated...")
    print("Local Source:", inputDir)
    os.system("ls -ltR " + inputDir)
    print("Dest S3path:", s3Path)
    try:
        for path, subdirs, files in os.walk(inputDir):
            for file in files:
                dest_path = path.replace(inputDir, "")
                __s3file = os.path.normpath(s3Path + '/' + dest_path + '/' + file)
                __local_file = os.path.join(path, file)
                print("upload : ", __local_file, " to Target: ", __s3file, end="")
                s3bucket.upload_file(__local_file, __s3file)
                print(" ...Success")
    except Exception as e:
        print(" ... Failed!! Quitting Upload!!")
        print(e)
        raise e
s3 = boto3.resource('s3', region_name='us-east-1')
s3bucket = s3.Bucket("<<s3bucket_name>>")
upload_folder_to_s3(s3bucket, "<<Local Folder>>", "<<s3 Path>>")
You could do the following:
import os
import boto3
s3_resource = boto3.resource("s3", region_name="us-east-1")
def upload_objects():
    try:
        bucket_name = "S3_Bucket_Name"  # s3 bucket name
        root_path = 'D:/sample/'  # local folder for upload
        my_bucket = s3_resource.Bucket(bucket_name)
        for path, subdirs, files in os.walk(root_path):
            path = path.replace("\\", "/")
            directory_name = path.replace(root_path, "")
            for file in files:
                my_bucket.upload_file(os.path.join(path, file), directory_name + '/' + file)
    except Exception as err:
        print(err)

if __name__ == '__main__':
    upload_objects()
This is the code I used, which recursively uploads files from the specified folder to the specified S3 path. Just add the S3 credentials and bucket details in the script:
https://gist.github.com/hari116/4ab5ebd885b63e699c4662cd8382c314/
#!/usr/bin/python
"""Usage: Add bucket name and credentials
script.py <source folder> <s3 destination folder>"""

import os
from sys import argv
import boto3
from botocore.exceptions import NoCredentialsError

ACCESS_KEY = ''
SECRET_KEY = ''
host = ''
bucket_name = ''

local_folder, s3_folder = argv[1:3]
walks = os.walk(local_folder)

# Function to upload to s3
def upload_to_aws(bucket, local_file, s3_file):
    """local_file, s3_file can be paths"""
    s3 = boto3.client('s3', aws_access_key_id=ACCESS_KEY,
                      aws_secret_access_key=SECRET_KEY)
    print(' Uploading ' + local_file + ' as ' + bucket + '/' + s3_file)
    try:
        s3.upload_file(local_file, bucket, s3_file)
        print(' ' + s3_file + ": Upload Successful")
        print(' ---------')
        return True
    except NoCredentialsError:
        print("Credentials not available")
        return False

"""For file names"""
for source, dirs, files in walks:
    print('Directory: ' + source)
    for filename in files:
        # construct the full local path
        local_file = os.path.join(source, filename)
        # construct the full S3 path
        relative_path = os.path.relpath(local_file, local_folder)
        s3_file = os.path.join(s3_folder, relative_path)
        # Invoke the upload function
        upload_to_aws(bucket_name, local_file, s3_file)
For reading files from a folder we can use:
import boto
from boto.s3.key import Key

keyId = 'YOUR_AWS_ACCESS_KEY_ID'
sKeyId = 'YOUR_AWS_SECRET_ACCESS_KEY'
bucketName = 'your_bucket_name'

conn = boto.connect_s3(keyId, sKeyId)
bucket = conn.get_bucket(bucketName)
for key in bucket.list():
    print(">>>>>" + key.name)
    pathV = key.name.split('/')
    if pathV[0] == "data":
        if pathV[1] != "":
            srcFileName = key.name
            filename = key.name
            filename = filename.split('/')[1]
            destFileName = "model/data/" + filename
            k = Key(bucket, srcFileName)
            k.get_contents_to_filename(destFileName)
    elif pathV[0] == "nlu_data":
        if pathV[1] != "":
            srcFileName = key.name
            filename = key.name
            filename = filename.split('/')[1]
            destFileName = "model/nlu_data/" + filename
            k = Key(bucket, srcFileName)
            k.get_contents_to_filename(destFileName)
Updated @user 923227's answer to (1) use the newer boto3 interface, (2) handle the nuances of Windows double backslashes, and (3) add a cleaner tqdm progress bar:
import os
from tqdm import tqdm
def upload_folder_to_s3(s3_client, s3bucket, input_dir, s3_path):
    pbar = tqdm(os.walk(input_dir))
    for path, subdirs, files in pbar:
        for file in files:
            dest_path = path.replace(input_dir, "").replace(os.sep, '/')
            s3_file = f'{s3_path}/{dest_path}/{file}'.replace('//', '/')
            local_file = os.path.join(path, file)
            s3_client.upload_file(local_file, s3bucket, s3_file)
            pbar.set_description(f'Uploaded {local_file} to {s3_file}')
    print(f"Successfully uploaded {input_dir} to S3 {s3_path}")
Usage example:
s3_client = boto3.client('s3', aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
upload_folder_to_s3(s3_client, 'BUCKET-NAME', <local-directory>, <s3-directory>)
Somehow the other snippets did not really work for me; this is a modification of the snippet from user 923227 that does.
This code copies all files in a directory and maintains the directory structure in S3, e.g. 2023/01/file.jpg will be in the bucket as 2023/01/file.jpg.
import os
import sys
import boto3
client = boto3.client('s3')
local_path = "your-path/data"
bucketname = "bucket-name"
for path, dirs, files in os.walk(local_path):
    for file in files:
        file_s3 = os.path.normpath(path + '/' + file)
        file_local = os.path.join(path, file)
        print("Upload:", file_local, "to target:", file_s3, end="")
        client.upload_file(file_local, bucketname, file_s3)
        print(" ...Success")
Another method that did not exist when this question was first asked is to use python-rclone (https://github.com/ddragosd/python-rclone/blob/master/README.md).
This requires a download of rclone and a working rclone config. Commonly used for AWS (https://rclone.org/s3/) but can be used for other providers as well.
# install the package first, e.g. pip install python-rclone
import rclone

cfg_path = r'(path to rclone config file here)'

with open(cfg_path) as f:
    cfg = f.read()

# Implementation

# Local file to cloud server
result = rclone.with_config(cfg).run_cmd(command="sync", extra_args=["/home/demodir/", "AWS test:dummydir/etc/"])

# Cloud server to cloud server
result = rclone.with_config(cfg).run_cmd(command="sync", extra_args=["Gdrive:test/testing/", "AWS test:dummydir/etc/"])
This allows you to run a "sync" command similar to the AWS CLI from within your Python code by reading in the config file and passing the source and destination via kwargs (extra_args).
This solution does not use boto, but I think it could do what the OP wants.
It uses awscli and Python.
import os

class AwsCredentials:
    def __init__(self, access_key: str, secret_key: str):
        self.access_key = access_key
        self.secret_key = secret_key

    def to_command(self):
        credentials = f'AWS_ACCESS_KEY_ID={self.access_key} AWS_SECRET_ACCESS_KEY={self.secret_key}'
        return credentials

def sync_s3_bucket(credentials: AwsCredentials, source_path: str, bucket: str) -> None:
    command = f'{credentials.to_command()} aws s3 sync {source_path} s3://{bucket}'
    result = os.system(command)
    assert result == 0, f'The s3 sync was not successful, error code: {result}'
Please consider getting the AWS credentials from a file or from the environment.
The documentation for the s3 sync command is here.
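For example, a hedged usage sketch that reads the credentials from environment variables, as suggested above (the source path and bucket are placeholders; the variable names are the standard AWS ones):

import os

credentials = AwsCredentials(
    access_key=os.environ['AWS_ACCESS_KEY_ID'],
    secret_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)
sync_s3_bucket(credentials, source_path='./local_dir', bucket='my-bucket')  # placeholder values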
Simply running terminal commands with the os module and an f-string works:
import os
ActualFolderName = "FolderToBeUploadedOnS3"
os.system(f'aws s3 cp D:\<PathToYourFolder>\{ActualFolderName} s3://<BucketName>/{ActualFolderName}/ --recursive')