Copy nested Amazon S3 folders into flattened folder - amazon-web-services

Long story short, we have documents stored something like this /accounts/account-abc/docs/uuid.pdf which is pretty redundant. What we want is basically docs/uuid.pdf. There are lots of other posts about copying, but they are all single dirs. I need something like this (which is obviously wrong):
aws s3 cp s3://accounts/*/docs s3://docs/ --recursive --include "*"
Would I need to write a custom script in order to accomplish the above?

Here's a Python script that will copy files from a given SOURCE_PATH to a TARGET_PATH, removing all sub-folders:
import boto3
SOURCE_BUCKET = 'source-bucket'
SOURCE_PATH = 'accounts/'
TARGET_BUCKET = 'target-bucket'
TARGET_PATH = 'docs/'
s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket(SOURCE_BUCKET)
for obj in bucket.objects.filter(Prefix=SOURCE_PATH):
    # Keep only the part of the key after the last '/'
    target_key = obj.key[obj.key.rfind('/')+1:]
    if not target_key:
        # Skip "folder" placeholder objects whose key ends in '/'
        continue
    print('Copying', target_key)
    s3_resource.Object(TARGET_BUCKET, TARGET_PATH + target_key).copy({'Bucket': SOURCE_BUCKET, 'Key': obj.key})
    # Optional, to delete source object:
    # obj.delete()
You might need to modify it if you only wish to copy from a SOURCE_PATH that also contains a sub-directory of docs (based on your example).
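As a rough sketch of that modification, the key handling could be pulled into a small helper that only accepts objects sitting directly inside a docs sub-folder. The names flattened_docs_key and copy_docs are made up for this example, and the layout accounts/<account>/docs/<file> is assumed from the question:

```python
from typing import Optional


def flattened_docs_key(key: str, target_prefix: str = "docs/") -> Optional[str]:
    """Return the flattened target key, or None if the object should be skipped."""
    parts = key.split("/")
    # Assumed layout: accounts/<account>/docs/<file>. Anything that is not
    # directly inside a 'docs' folder, or that is a folder placeholder
    # (key ending in '/'), is skipped.
    if len(parts) < 4 or parts[-2] != "docs" or not parts[-1]:
        return None
    return target_prefix + parts[-1]


def copy_docs(source_bucket: str, target_bucket: str, source_prefix: str = "accounts/") -> None:
    """Copy every matching object to its flattened key (hypothetical wrapper)."""
    import boto3  # imported here so the helper above can be used without boto3

    s3 = boto3.resource("s3")
    for obj in s3.Bucket(source_bucket).objects.filter(Prefix=source_prefix):
        target_key = flattened_docs_key(obj.key)
        if target_key is None:
            continue
        s3.Object(target_bucket, target_key).copy(
            {"Bucket": source_bucket, "Key": obj.key}
        )
```

Keeping the key logic separate from the S3 calls makes it easy to check the mapping on a few sample keys before copying anything.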

Related

Is there a simple way to rename s3 folder via boto3?

I have an S3 bucket with a folder, and inside the folder there are large files.
I want to rename the folder with a python3/boto3 script.
I read this ("How to Rename Amazon S3 Folder Objects with Python"), and what the author does is copy the files with a new prefix, then delete the original folder.
That is not a very efficient way to do it, and because I have large files, it will take a long time.
Is there a simpler/more efficient way to do it?
There is no way to rename s3 objects/folders - you will need to copy them to the new name and delete the old name unfortunately.
There is a mv command in the aws cli, but behind the scenes it is doing a copy then delete for you - so you can make the operation easier, but it is not a true 'rename'.
https://docs.aws.amazon.com/cli/latest/reference/s3/mv.html
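For illustration, the copy-then-delete approach could look roughly like this in boto3. This is only a sketch: renamed_key and rename_prefix are names invented for the example. The copy itself is carried out by S3, so the large files are not downloaded to the machine running the script.

```python
def renamed_key(key: str, old_prefix: str, new_prefix: str) -> str:
    """Swap the leading 'folder' prefix of a key for a new one."""
    if not key.startswith(old_prefix):
        raise ValueError(f"{key!r} does not start with {old_prefix!r}")
    return new_prefix + key[len(old_prefix):]


def rename_prefix(bucket_name: str, old_prefix: str, new_prefix: str) -> None:
    """Copy every object under old_prefix to new_prefix, then delete the original."""
    import boto3  # imported here so renamed_key can be tried without boto3

    bucket = boto3.resource("s3").Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=old_prefix):
        new_key = renamed_key(obj.key, old_prefix, new_prefix)
        # Managed copy; boto3 uses a multipart copy for large objects
        bucket.Object(new_key).copy({"Bucket": bucket_name, "Key": obj.key})
        obj.delete()
```

Prefixes should end with '/' (for example 'old-folder/' and 'new-folder/') so that a sibling such as 'old-folder-2/' is not picked up by the filter.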
Simple, no. Unfortunately.
There seem to be a lot of 'issues' with folder structures in S3, as the storage is flat.
I have a Django project where I needed the ability to rename a folder but still keep the directory structure intact, meaning empty folders would need to be copied and stored in the renamed directory as well.
The aws cli is great, but none of cp, sync, or mv copied empty folders (i.e. keys ending in '/') over to the new folder location, so I used a mixture of boto3 and the aws cli to accomplish the task.
More or less, I find all folders in the directory being renamed and use boto3 to put them in the new location, then I cp the data with the aws cli and finally remove the old data.
import threading
import os
from django.conf import settings
from django.contrib import messages
from django.core.files.storage import default_storage
from django.shortcuts import redirect
from django.urls import reverse
def rename_folder(request, client_url):
    """
    :param request:
    :param client_url:
    :return:
    """
    current_property = request.session.get('property')
    if request.POST:
        # name the change
        new_name = request.POST['name']
        # old full path with www.[].com?
        old_path = request.POST['old_path']
        # remove the query string
        old_path = ''.join(old_path.split('?')[0])
        # remove the .com prefix item so we have the path in the storage
        old_path = ''.join(old_path.split('.com/')[-1])
        # remove empty values, this will happen at end due to these being folders
        old_path_list = [x for x in old_path.split('/') if x != '']
        # remove the last folder element with split()
        base_path = '/'.join(old_path_list[:-1])
        # # now build the new path
        new_path = base_path + f'/{new_name}/'
        # remove empty variables
        # print(old_path_list[:-1], old_path.split('/'), old_path, base_path, new_path)
        endpoint = settings.AWS_S3_ENDPOINT_URL
        # # recursively add the files
        copy_command = f"aws s3 --endpoint={endpoint} cp s3://{old_path} s3://{new_path} --recursive"
        remove_command = f"aws s3 --endpoint={endpoint} rm s3://{old_path} --recursive"
        # get_creds() is nothing special it simply returns the elements needed via boto3
        client, resource, bucket, resource_bucket = get_creds()
        path_viewing = f'{"/".join(old_path.split("/")[1:])}'
        directory_content = default_storage.listdir(path_viewing)
        # loop over folders and add them by default, aws cli does not copy empty ones
        # so this is used to accommodate
        folders, files = directory_content
        for folder in folders:
            new_key = new_path+folder+'/'
            # we must remove bucket name for this to work
            new_key = new_key.split(f"{bucket}/")[-1]
            # push this to new thread
            threading.Thread(target=put_object, args=(client, bucket, new_key,)).start()
            print(f'{new_key} added')
        # # run command, which will copy all data
        os.system(copy_command)
        print('Copy Done...')
        os.system(remove_command)
        print('Remove Done...')
        # print(bucket)
        print(f'Folder renamed.')
        messages.success(request, f'Folder Renamed to: {new_name}')
    return redirect(request.META.get('HTTP_REFERER', f"{reverse('home', args=[client_url])}"))

How to delete images with suffix from folder in S3 bucket

I have stored multiple sizes of each image on S3,
e.g. image100_100, image200_200, image300_150.
I want to delete images of a specific size, such as those with the suffix 200_200, from the folder. There are a lot of images in this folder, so how can I delete them?
Use AWS command-line interface (AWS CLI):
aws s3 rm s3://Path/To/Dir/ --recursive --exclude "*" --include "*200_200"
We first exclude everything, then include what we need to delete. This is a workaround to mimic the behavior of rm -r "*200_200" command in Linux.
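To see why the order of the two filters matters, here is a simplified model of the filtering in Python. It is not the CLI's actual implementation, only an illustration of the documented rule that everything is included by default and that filters appearing later in the command take precedence. The function name is_selected is made up for the example.

```python
from fnmatch import fnmatchcase


def is_selected(key, filters):
    """filters: list of ("include" | "exclude", pattern) in command-line order."""
    selected = True  # every key is included by default
    for action, pattern in filters:
        if fnmatchcase(key, pattern):
            # A later matching filter overrides an earlier one
            selected = (action == "include")
    return selected


filters = [("exclude", "*"), ("include", "*200_200")]
keys = ["image100_100", "image200_200", "image300_150"]
to_delete = [k for k in keys if is_selected(k, filters)]
print(to_delete)  # ['image200_200']
```

Swapping the two filters would exclude everything, because the trailing exclude "*" would win. Adding --dryrun to the aws s3 rm command shows what would be deleted without deleting anything.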
The easiest method would be to write a Python script, similar to:
import boto3
BUCKET = 'my-bucket'
PREFIX = '' # eg 'images/'
s3_client = boto3.client('s3', region_name='ap-southeast-2')
# Get a list of objects
list_response = s3_client.list_objects_v2(Bucket = BUCKET, Prefix = PREFIX)
while True:
    # Find desired objects to delete ('Contents' is absent when nothing matches)
    objects = [{'Key': obj['Key']} for obj in list_response.get('Contents', []) if obj['Key'].endswith('200_200')]
    print('Deleting:', objects)
    # Delete objects
    if len(objects) > 0:
        delete_response = s3_client.delete_objects(
            Bucket=BUCKET,
            Delete={'Objects': objects}
        )
    # Next page
    if list_response['IsTruncated']:
        list_response = s3_client.list_objects_v2(
            Bucket=BUCKET,
            Prefix=PREFIX,
            ContinuationToken=list_response['NextContinuationToken'])
    else:
        break

Copy Without Prefix s3

I have directory structures in s3 like
bucket/folder1/*/*.csv
Where the folder wildcard refers to a number of different folders containing csv files.
I want to copy them without the prefix to
bucket/folder2/*.csv
Ex:
bucket/folder1/
s3distcp --src=s3://bucket/folder1/ --dest=s3://bucket/folder2/ --srcPattern=.*/csv
Results in the undesired structure of:
bucket/folder2/*/*.csv
I need a solution to copy in bulk that is scalable. Can I do this with s3distcp? Can I do this with aws s3 cp (without having to execute the aws s3 cp per file)?
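You could try the following CLI command. Note that sync is always recursive (it takes no --recursive flag) and that it keeps the key structure below the source, so on its own it does not flatten anything:
aws s3 sync s3://SOURCE_BUCKET_NAME s3://DESTINATION_BUCKET_NAME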
There is no shortcut to do what you wish, because you are manipulating the path to the objects.
You could instead write a little program to do it, such as:
import boto3
BUCKET = 'my-bucket'
s3_client = boto3.client('s3', region_name = 'ap-southeast-2')
# List every object under folder1/ (the paginator handles listings of more than 1000 keys)
paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET, Prefix='folder1/'):
    # Copy files to folder2, keeping a flat hierarchy
    for obj in page.get('Contents', []):
        key = obj['Key']
        print(key)
        s3_client.copy_object(
            CopySource={'Bucket': BUCKET, 'Key': key},
            Bucket=BUCKET,
            Key='folder2' + key[key.rfind('/'):]
        )
Ended up using Apache Nifi to do this, changing the filename attribute of the flowfile (use regex to remove all of the path before the last '/') and writing with a prefix to the desired directory. It scales really well.
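One thing to watch with any flattening approach: if two source folders contain a file with the same name, both map to the same target key and the later copy replaces the earlier one. A small check like the following could be run over the key listing first. It is only a sketch, and find_collisions and the sample keys are invented for the example.

```python
from collections import defaultdict


def find_collisions(keys):
    """Group keys by file name and return the names that occur more than once."""
    by_name = defaultdict(list)
    for key in keys:
        name = key.rsplit("/", 1)[-1]
        if name:  # ignore folder placeholder keys ending in '/'
            by_name[name].append(key)
    return {name: ks for name, ks in by_name.items() if len(ks) > 1}


keys = [
    "folder1/a/report.csv",
    "folder1/b/report.csv",
    "folder1/b/other.csv",
]
print(find_collisions(keys))
# {'report.csv': ['folder1/a/report.csv', 'folder1/b/report.csv']}
```

If collisions exist, one option is to fold part of the old path into the new name (for example a_report.csv) instead of dropping it entirely.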

How to extract files in S3 on the fly with boto3?

I'm trying to find a way to extract .gz files in S3 on the fly, that is, without having to download them locally, extract them, and then push them back to S3.
With boto3 + Lambda, how can I achieve my goal?
I didn't see anything about extraction in the boto3 documentation.
You can use BytesIO to stream the file from S3, run it through gzip, then pipe it back up to S3 using upload_fileobj to write the BytesIO.
# python imports
import boto3
from io import BytesIO
import gzip
# setup constants
bucket = '<bucket_name>'
gzipped_key = '<key_name.gz>'
uncompressed_key = '<key_name>'
# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False) # optional
s3.upload_fileobj(                      # upload a new obj to s3
    Fileobj=gzip.GzipFile(              # read in the output of gzip -d
        None,                           # just return output as BytesIO
        'rb',                           # read binary
        fileobj=BytesIO(s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
    Bucket=bucket,                      # target bucket, writing to
    Key=uncompressed_key)               # target key, writing to
Ensure that your key is reading in correctly:
# read the body of the s3 key object into bytes to ensure download
s = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()
print(len(s)) # check to ensure some data was returned
The above answers are for gzip files. For zip files, you may try:
import boto3
import zipfile
from io import BytesIO
bucket = 'bucket1'
s3 = boto3.client('s3', use_ssl=False)
Key_unzip = 'result_files/'
prefix = "folder_name/"
zipped_keys = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter = "/")
file_list = []
for key in zipped_keys['Contents']:
    file_list.append(key['Key'])
# This will give you list of files in the folder you mentioned as prefix
s3_resource = boto3.resource('s3')
# Now create zip object one by one, this below is for 1st file in file_list
zip_obj = s3_resource.Object(bucket_name=bucket, key=file_list[0])
print(zip_obj)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key='result_files/' + f'{filename}')
This will work for your zip file, and the unzipped data will end up in the result_files folder. Make sure to increase the memory and timeout settings on AWS Lambda to the maximum, since some files are pretty large and need time to write.
Amazon S3 is a storage service. There is no in-built capability to manipulate the content of files.
However, you could use an AWS Lambda function to retrieve an object from S3, decompress it, then upload the content back up again. Please note that Lambda has a default limit of 512 MB of temporary disk space (/tmp), so avoid decompressing too much data at the same time.
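To keep memory use low, the decompression can be done in fixed-size chunks rather than in one read. The sketch below shows the idea with in-memory buffers standing in for S3. The function name gunzip_stream is made up, and the assumption is that in a Lambda the input would be the 'Body' returned by get_object and the output a file opened under /tmp, which has not been tested here.

```python
import gzip
import io
import shutil


def gunzip_stream(compressed, decompressed, chunk_size=1024 * 1024):
    """Decompress from one binary file-like object to another, chunk by chunk."""
    with gzip.GzipFile(fileobj=compressed, mode="rb") as gz:
        shutil.copyfileobj(gz, decompressed, chunk_size)


# Demonstration with in-memory buffers instead of S3 objects
source = io.BytesIO(gzip.compress(b"hello " * 1000))
target = io.BytesIO()
gunzip_stream(source, target)
print(len(target.getvalue()))  # 6000
```

Only one chunk of decompressed data is held at a time, so the peak memory use depends on chunk_size rather than on the size of the file.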
You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:
Use boto3 to download the new file
Use the gzip Python library to extract files
Use boto3 to upload the resulting file(s)
Sample code:
import gzip
import io
import boto3
bucket = '<bucket_name>'
key = '<key_name>'
s3 = boto3.client('s3', use_ssl=False)
compressed_file = io.BytesIO(
    s3.get_object(Bucket=bucket, Key=key)['Body'].read())
uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)
s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])
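When the function is triggered by the bucket, the bucket name and key come from the event rather than from constants. A minimal sketch of reading them could look like this. The helper names and the sample event are invented for the example, and only the parts of the S3 notification structure needed here are shown.

```python
from urllib.parse import unquote_plus


def objects_from_event(event):
    """Yield (bucket, key) pairs from an S3 notification event."""
    for record in event.get("Records", []):
        s3_info = record["s3"]
        # Keys are URL-encoded in the event payload (a space arrives as '+')
        yield s3_info["bucket"]["name"], unquote_plus(s3_info["object"]["key"])


def is_gzip_key(key):
    """Only .gz objects should be processed, not the files we write back."""
    return key.endswith(".gz")


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"},
                "object": {"key": "incoming/my+file.csv.gz"}}}
    ]
}
for bucket, key in objects_from_event(sample_event):
    if is_gzip_key(key):
        print(bucket, key, "->", key[:-3])
```

The suffix check matters when the decompressed file is written to the same bucket that triggers the function: without it, each upload would invoke the function again.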

AWS static site file upload via boto 3 set the right content-types

What are the right content types for the different types of files of a static site hosted on AWS, and how can I set these in a smart way via boto3?
I use the upload_file method:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('allecijfers.nl')
bucket.upload_file('C:/Hugo/Sites/allecijfers/public/test/index.html', 'test/index.html', ExtraArgs={'ACL': 'public-read', 'ContentType': 'text/html'})
This works well for the html files. I initially left out the ExtraArgs which results in a file download (probably because the content type is binary?). I found this page that states several content types but I am not sure how to apply it.
E.g. probably the CSS files should be uploaded with 'ContentType': 'text/css'.
But what about the js files, index.xml, etc? And how to do this in a smart way? FYI this is my current script to upload from Windows to AWS, this requires string.replace("\","/") which probably is not the smartest either?
for root, dirs, files in os.walk(local_root + local_dir):
    for filename in files:
        # construct the full local path
        local_path = os.path.join(root, filename).replace("\\","/")
        # construct the full S3 path
        relative_path = os.path.relpath(local_path, local_root)
        s3_path = os.path.join(relative_path).replace("\\","/")
        bucket.upload_file(local_path, s3_path, ExtraArgs={'ACL': 'public-read', 'ContentType': 'text/html'})
I uploaded my complete Hugo site from the same source to the same S3 bucket using the AWS CLI, and this works perfectly without specifying content types. Is this also possible via boto3?
Many thanks in advance for your help!
There is a python built-in library to guess mimetypes.
So you could just lookup each filename first. It works like this:
import mimetypes
print(mimetypes.guess_type('filename.html'))
Result:
('text/html', None)
Here it is applied to your code. I also slightly improved the portability of your code with respect to Windows paths: it does the same thing, but it will also work on a Unix platform because it looks up the platform-specific separator (os.path.sep) used in any paths.
import os
import boto3
import mimetypes
s3 = boto3.resource('s3')
bucket = s3.Bucket('allecijfers.nl')
# local_root and local_dir are defined as in your script
for root, dirs, files in os.walk(local_root + local_dir):
    for filename in files:
        # construct the full local path (Not sure why you were converting to a
        # unix path when you'd want this correctly as a windows path)
        local_path = os.path.join(root, filename)
        # construct the full S3 path
        relative_path = os.path.relpath(local_path, local_root)
        s3_path = relative_path.replace(os.path.sep,"/")
        # Get content type guess; guess_type returns None for unknown extensions
        content_type = mimetypes.guess_type(filename)[0] or 'application/octet-stream'
        bucket.upload_file(
            Filename=local_path,
            Key=s3_path,
            ExtraArgs={'ACL': 'public-read', 'ContentType': content_type}
        )
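As for the js files, index.xml and so on, the mimetypes lookup covers them as well, although the exact result for some extensions depends on the Python version and platform. For extensions the standard table does not know, a mapping can be registered with mimetypes.add_type. The sketch below wraps this in two helpers. The names content_type_for and upload_args, the .webmanifest mapping and the file list are only examples.

```python
import mimetypes

# Example of registering an extension the standard table may not know
mimetypes.add_type("application/manifest+json", ".webmanifest")


def content_type_for(filename, default="application/octet-stream"):
    """Guess the content type from the file name, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default


def upload_args(filename):
    """Build the ExtraArgs dict for upload_file (ACL as in the question)."""
    return {"ACL": "public-read", "ContentType": content_type_for(filename)}


for name in ["index.html", "css/style.css", "js/app.js", "index.xml",
             "site.webmanifest", "LICENSE"]:
    print(name, "->", content_type_for(name))
```

Printing the guesses for a typical set of site files before uploading is a quick way to spot anything that would fall back to the generic type.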