Issue with uploading files from local directory to aws S3 using python 2.7 and boto 2

Issue with uploading files from local directory to aws S3 using python 2.7 and boto 2 - python-2.7

I’m doing simple operation to of downloading the gzip files from S3 bucket to the local directory. I’m extracting those into another local directory and then uploading them back to S3 bucket again into archive folder path. While doing this operation I want to make sure I am processing same set of files that I initially download from S3 bucket which is (f_name) in below code. Now, below code is not uploading those back to S3 , that’s where I’m stuck. But able to download from S3 and extract it into local directory. Can you please help me understand what is wrong with the _uploadFile function?
from boto.s3.connection import S3Connection
from boto.s3.key import *
import os
import os.path
aws_bucket= "event-logs-dev” ## S3 Bucket name
local_download_directory= "/Users/TargetData/Download/test_queue1/“ ## local directory to download the gzip files from S3.
Target_directory_to_extract = "/Users/TargetData/unzip” ##local directory to gunzip the downloaded files.
Target_s3_path_to_upload= "event-logs-dev/data/clean/xact/logs/archive/“ ## S3 bucket path to upload the files.
def decompressAllFilesFromNetfiler(self,aws_bucket,local_download_directory,Target_d irectory_to_extract,Target_s3_path_to_upload):
zipFiles = [f for f in os.listdir(local_download_directory) if re.match(r'.*\.tar\.gz', f)]
for f_name in zipFiles:
if os.path.exists(Target_directory_to_extract+"/"+f_name[:-len('.tar.gz')]) and os.access(Target_directory_to_extract+"/"+f_name[:-len('.tar.gz')], os.R_OK):
print ('File {} already exists!'.format(f_name))
else:
f_name_with_path = os.path.join(local_download_directory, f_name)
os.system('mkdir -p {} && tar vxzf {} -C {}'.format(Target_directory_to_extract, f_name_with_path, Target_directory_to_extract))
print ('Extracted file {}'.format(f_name))
self._uploadFile(aws_bucket,f_name,Target_s3_path_to_upload,Target_directory_to_extract)
def _uploadFile(self, aws_bucket, f_name,Target_s3_path_to_upload,Target_directory_to_extract):
full_key_name = os.path.expanduser(os.path.join(Target_s3_path_to_upload, f_name))
path = os.path.expanduser(os.path.join(Target_directory_to_extract, f_name))
try:
print "Uploaded extracted file to: %s" % (full_key_name)
key = aws_bucket.new_key(full_key_name)
key.set_contents_from_filename(path)
except:
if full_key_name is None:
print "Error uploading”
Currently, the output prints that Uploaded extracted file to: event-logs-dev/data/clean/xact/logs/archive/1442235602129200000.tar.gz, but nothing is uploaded to S3 bucket. Your help is greatly appreciated!! Thank you in advance!

It appears that you have cut and pasted parts of your code - and maybe formatting was lost as your code above will not work as pasted. I've taken the liberty to make it PEP8 (mostly) however there is still some missing code to create the S3 objects. Since your import the modules, I presume that you have that section of code and just didn't paste it.
here is a cleaned up version of your code formatted correctly. I also added a Exception code to your try: block to print out the error you get. You should update the Exception to be more specific to the Exceptions thrown for make_key or set_contents_... but the general Exception will get you started. If nothing more this is more readable, but you should include your S3 connection code too - and remove anything that is specific to your domain (e.g. keys, trade secrets, etc).
#!/usr/bin/env python
"""
do some download
some extract
and some upload
"""
from boto.s3.connection import S3Connection
from boto.s3.key import *
import os
import os.path
aws_bucket = 'event-logs-dev'
local_download_directory = '/Users/TargetData/Download/test_queue1/'
Target_directory_to_extract = '/Users/TargetData/unzip'
Target_s3_path_to_upload = 'event-logs-dev/data/clean/xact/logs/archive/'
'''
MUST BE SOME MAGIC HERE TO GET AN S3 CONNECTION ???
aws_bucket IS NOT A BUCKET OBJECT ...
'''
def decompressAllFilesFromNetfiler(self,
aws_bucket,
local_download_directory,
Target_directory_to_extract,
Target_s3_path_to_upload):
'''
decompress stuff
'''
zipFiles = [f for f in os.listdir(
local_download_directory) if re.match(r'.*\.tar\.gz', f)]
for f_name in zipFiles:
if os.path.exists(
"{}/{}".format(Target_directory_to_extract,
f_name[:len('.tar.gz')])) and os.access(
"{}/{}".format(Target_directory_to_extract,
f_name[:len('.tar.gz')])) and os.R_OK:
print ('File {} already exists!'.format(f_name))
else:
f_name_with_path = os.path.join(local_download_directory, f_name)
os.system('mkdir -p {} && tar vxzf {} -C {}'.format(
Target_directory_to_extract,
f_name_with_path,
Target_directory_to_extract))
print ('Extracted file {}'.format(f_name))
self._uploadFile(aws_bucket,
f_name,
Target_s3_path_to_upload,
Target_directory_to_extract)
def _uploadFile(self,
aws_bucket,
f_name,
Target_s3_path_to_upload,
Target_directory_to_extract):
full_key_name = os.path.expanduser(os.path.join(Target_s3_path_to_upload,
f_name))
path = os.path.expanduser(os.path.join(Target_directory_to_extract, f_name))
try:
S3CONN = S3Connection()
BUCKET = S3CONN.get_bucket(aws_bucket)
key = BUCKET.new_key(full_key_name)
key.set_contents_from_filename(path)
print "Uploaded extracted file to: {}".format(full_key_name)
except Exception as UploadERR:
if full_key_name is None:
print 'Error uploading'
else:
print "Error : {}".format(UploadERR)

Related

Creating a dmarc parser using parsedmarc in python3 for use in AWS s3

I am very new to programming. I am working on a pipeline to analyze DMARC report files that are sent to my email account, that I am manually placing in an s3 bucket. The goal of this task is to download, extract, and analyze files using parsedmarc: https://github.com/domainaware/parsedmarc The part I'm having difficulty with is setting a conditional statement to extract .gz files if the target file is not a .zip file. I'm assuming the gzip library will be sufficient for this purpose. Here is the code I have so far. I'm using python3 and the boto3 library for AWS. Any help is appreciated!
import parsedmarc
import pprint
import json
import boto3
import zipfile
import gzip
pp = pprint.PrettyPrinter(indent=2)
def main():
#Set default session profile and region for sandbox account. Access keys are pulled from /.aws/config and /.aws/credentials.
#The 'profile_name' value comes from the header for the account in question in /.aws/config and /.aws/credentials
boto3.setup_default_session(region_name="aws-region-goes-here")
boto3.setup_default_session(profile_name="aws-account-profile-name-goes-here")
#Define the s3 resource, the bucket name, and the file to download. It's hardcoded for now...
s3_resource = boto3.resource(s3)
s3_resource.Bucket('dmarc-parsing').download_file('source-dmarc-report-filename.zip' '/home/user/dmarc/parseme.zip')
#Use the zipfile python library to extract the file into its raw state.
with zipfile.ZipFile('/home/user/dmarc/parseme.zip', 'r') as zip_ref:
zip_ref.extractall('/home/user/dmarc')
#Ingest all locations for xml file source
dmarc_report_directory = '/home/user/dmarc/'
dmarc_report_file = 'parseme.xml'
"""I need an if statement here for extracting .gz files if the file type is not .zip. The contents of every archive are .xml files"""
#Set report output variables using functions in parsedmarc. Variable set to equal the output
pd_report_output=parsedmarc.parse_aggregate_report_file(_input=f"{dmarc_report_directory}{dmarc_report_file}")
#use jsonify to make the output in json format
pd_report_jsonified = json.loads(json.dumps(pd_report_output))
dkim_status = pd_report_jsonified['records'][0]['policy_evaluated']['dkim']
spf_status = pd_report_jsonified['records'][0]['policy_evaluated']['spf']
if dkim_status == 'fail' or spf_status == 'fail':
print(f"{dmarc_report_file} reports failure. oh crap. report:")
else:
print(f"{dmarc_report_file} passes. great. report:")
pp.pprint(pd_report_jsonified['records'][0]['auth_results'])
if __name__ == "__main__":
main()

Here is the code using the parsedmarc.parse_aggregate_report_xml method I found. Hope this helps others in parsing these reports:
import parsedmarc
import pprint
import json
import boto3
import zipfile
import gzip
pp = pprint.PrettyPrinter(indent=2)
def main():
#Set default session profile and region for account. Access keys are pulled from ~/.aws/config and ~/.aws/credentials.
#The 'profile_name' value comes from the header for the account in question in ~/.aws/config and ~/.aws/credentials
boto3.setup_default_session(profile_name="aws_profile_name_goes_here", region_name="region_goes_here")
source_file = 'filename_in_s3_bucket.zip'
destination_directory = '/tmp/'
destination_file = 'compressed_report_file'
#Define the s3 resource, the bucket name, and the file to download. It's hardcoded for now...
s3_resource = boto3.resource('s3')
s3_resource.Bucket('bucket-name-for-dmarc-report-files').download_file(source_file, f"{destination_directory}{destination_file}")
#Extract xml
outputxml = parsedmarc.extract_xml(f"{destination_directory}{destination_file}")
#run parse dmarc analysis & convert output to json
pd_report_output = parsedmarc.parse_aggregate_report_xml(outputxml)
pd_report_jsonified = json.loads(json.dumps(pd_report_output))
#loop through results and find relevant status info and pass fail status
dmarc_report_status = ''
for record in pd_report_jsonified['records']:
if False in record['alignment'].values():
dmarc_report_status = 'Failed'
#************ add logic for interpreting results
#if fail, publish to sns
if dmarc_report_status == 'Failed':
message = "Your dmarc report failed a least one check. Review the log for details"
sns_resource = boto3.resource('sns')
sns_topic = sns_resource.Topic('arn:aws:sns:us-west-2:112896196555:TestDMARC')
sns_publish_response = sns_topic.publish(Message=message)
if __name__ == "__main__":
main()

How to extract files in S3 on the fly with boto3?

I'm trying to find a way to extract .gz files in S3 on the fly, that is no need to download it to locally, extract and then push it back to S3.
With boto3 + lambda, how can i achieve my goal?
I didn't see any extract part in boto3 document.

You can use BytesIO to stream the file from S3, run it through gzip, then pipe it back up to S3 using upload_fileobj to write the BytesIO.
# python imports
import boto3
from io import BytesIO
import gzip
# setup constants
bucket = '<bucket_name>'
gzipped_key = '<key_name.gz>'
uncompressed_key = '<key_name>'
# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False) # optional
s3.upload_fileobj( # upload a new obj to s3
Fileobj=gzip.GzipFile( # read in the output of gzip -d
None, # just return output as BytesIO
'rb', # read binary
fileobj=BytesIO(s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
Bucket=bucket, # target bucket, writing to
Key=uncompressed_key) # target key, writing to
Ensure that your key is reading in correctly:
# read the body of the s3 key object into a string to ensure download
s = s3.get_object(Bucket=bucket, Key=gzip_key)['Body'].read()
print(len(s)) # check to ensure some data was returned

The above answers are for gzip files, for zip files, you may try
import boto3
import zipfile
from io import BytesIO
bucket = 'bucket1'
s3 = boto3.client('s3', use_ssl=False)
Key_unzip = 'result_files/'
prefix = "folder_name/"
zipped_keys = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter = "/")
file_list = []
for key in zipped_keys['Contents']:
file_list.append(key['Key'])
#This will give you list of files in the folder you mentioned as prefix
s3_resource = boto3.resource('s3')
#Now create zip object one by one, this below is for 1st file in file_list
zip_obj = s3_resource.Object(bucket_name=bucket, key=file_list[0])
print (zip_obj)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
file_info = z.getinfo(filename)
s3_resource.meta.client.upload_fileobj(
z.open(filename),
Bucket=bucket,
Key='result_files/' + f'{filename}')
This will work for your zip file and your result unzipped data will be in result_files folder. Make sure to increase memory and time on AWS Lambda to maximum since some files are pretty large and needs time to write.

Amazon S3 is a storage service. There is no in-built capability to manipulate the content of files.
However, you could use an AWS Lambda function to retrieve an object from S3, decompress it, then upload content back up again. However, please note that there is default limit of 500MB in temporary disk space for Lambda, so avoid decompressing too much data at the same time.
You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:
Use boto3 to download the new file
Use the gzip Python library to extract files
Use boto3 to upload the resulting file(s)
Sample code:
import gzip
import io
import boto3
bucket = '<bucket_name>'
key = '<key_name>'
s3 = boto3.client('s3', use_ssl=False)
compressed_file = io.BytesIO(
s3.get_object(Bucket=bucket, Key=key)['Body'].read())
uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)
s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])

How to download a snappy.parquet file from s3 using Boto in Python

I'm new to this, and trying to download a snappy.parquet file from Amazon s3 I can later convert to CSV file.
I tried working with the following example I've found online, and I get an empty folder. can anyone please help me?
import boto
import sys, os
from boto.s3.key import Key
from boto.exception import S3ResponseError
DOWNLOAD_LOCATION_PATH =""
BUCKET_NAME = ""
AWS_ACCESS_KEY_ID= ""
AWS_ACCESS_SECRET_KEY = ""
conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_ACCESS_SECRET_KEY)
bucket = conn.get_bucket(BUCKET_NAME)
#goto through the list of files
bucket_list = bucket.list()
for l in bucket_list:
key_string = str(l.key)
s3_path = DOWNLOAD_LOCATION_PATH + key_string
try:
print ("Current File is ", s3_path)
l.get_contents_to_filename(s3_path)
except (OSError, S3ResponseError) as e:
pass
# check if the file has been downloaded locally
if not os.path.exists(s3_path):
try:
os.makedirs(s3_path)
except OSError as exc:
# let guard againts race conditions
import errno
if exc.errno != errno.EEXIST:
raise

The script you are using appears to recursively download the contents of the specified S3 bucket (BUCKET_NAME) to the specified local directory (DOWNLOAD_LOCATION_PATH). FWIW, I notice this script looks like it comes from here.
The "Current File is ..." output line should show you the progress of these files being written. One problem you might be having is due to this line:
s3_path = DOWNLOAD_LOCATION_PATH + key_string
If you had specified DOWNLOAD_LOCATION_PATH at the top as a directory without a trailing '/' character, e.g. like this:
DOWNLOAD_LOCATION_PATH = '/tmp/my_dir'
then the files being downloaded would be written not underneath the /tmp/my_dir directory, but directly in /tmp/ with a my_dir prefix on each filename! You can fix this by changing this line to:
s3_path = os.path.join(DOWNLOAD_LOCATION_PATH, key_string)
Other than that, the script appears to work alright. You may want to add this line at the very top:
from __future__ import print_function
if you are still using Python 2.x, otherwise the print output will look a bit odd (print will think you are printing a 2-Tuple).
Your question also makes it sound like you really only want/need to download a single file from the bucket -- if so, this isn't really a great script to be using, since it's downloading everything.

Django Tweepy can't access Amazon S3 file

I'm using Tweepy, a tweeting python library, django-storages and boto. I have a custom manage.py command that works correctly locally, it gets an image from the filesystem and tweets that image. If I change the storage to Amazon S3, however, I can't access the file. It gives me this error:
raise TweepError('Unable to access file: %s' % e.strerror)
I tried making the images in the bucket "public". Didn't work. This is the code (it works without S3):
filename = model_object.image.file.url
media_ids = api.media_upload(filename=filename) # ERROR
params = {'status': tweet_text, 'media_ids': [media_ids.media_id_string]}
api.update_status(**params)
This line:
model_object.image.file.url
Gives me the complete url of the image I want to tweet, something like this:
https://criptolibertad.s3.amazonaws.com/OrillaLibertaria/195.jpg?Signature=xxxExpires=1467645897&AWSAccessKeyId=yyy
I also tried constructing the url manually, since it is a public image stored in my bucket, like this:
filename = "https://criptolibertad.s3.amazonaws.com/OrillaLibertaria/195.jpg"
But it doesn't work.
¿Why do I get the Unable to access file error?
The source code from tweepy looks like this:
def media_upload(self, filename, *args, **kwargs):
""" :reference: https://dev.twitter.com/rest/reference/post/media/upload
:allowed_param:
"""
f = kwargs.pop('file', None)
headers, post_data = API._pack_image(filename, 3072, form_field='media', f=f) # ERROR
kwargs.update({'headers': headers, 'post_data': post_data})
def _pack_image(filename, max_size, form_field="image", f=None):
"""Pack image from file into multipart-formdata post body"""
# image must be less than 700kb in size
if f is None:
try:
if os.path.getsize(filename) > (max_size * 1024):
raise TweepError('File is too big, must be less than %skb.' % max_size)
except os.error as e:
raise TweepError('Unable to access file: %s' % e.strerror)
Looks like Tweepy can't get the image from the Amazon S3 bucket, but how can I make it work? Any advice will help.

The issue occurs when tweepy attempts to get file size in _pack_image:
if os.path.getsize(filename) > (max_size * 1024):
The function os.path.getsize assumes it is given a file path on disk; however, in your case it is given a URL. Naturally, the file is not found on disk and os.error is raised. For example:
# The following raises OSError on my machine
os.path.getsize('https://criptolibertad.s3.amazonaws.com/OrillaLibertaria/195.jpg')
What you could do is to fetch the file content, temporarily save it locally and then tweet it:
import tempfile
with tempfile.NamedTemporaryFile(delete=True) as f:
name = model_object.image.file.name
f.write(model_object.image.read())
media_ids = api.media_upload(filename=name, f=f)
params = dict(status='test media', media_ids=[media_ids.media_id_string])
api.update_status(**params)
For your convenience, I published a fully working example here: https://github.com/izzysoftware/so38134984

upload a directory to s3 with boto

I am already connected to the instance and I want to upload the files that are generated from my python script directly to S3. I have tried this:
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket('alexandrabucket')
from boto.s3.key import Key
key = bucket.new_key('s0').set_contents_from_string('some content')
but this is rather creating a new file s0 with the context "same content" while I want to upload the directory s0 to mybucket.
I had a look also to s3put but I didn't manage to get what I want.

The following function can be used to upload directory to s3 via boto.
def uploadDirectory(path,bucketname):
for root,dirs,files in os.walk(path):
for file in files:
s3C.upload_file(os.path.join(root,file),bucketname,file)
Provide a path to the directory and bucket name as the inputs. The files are placed directly into the bucket. Alter the last variable of the upload_file() function to place them in "directories".

There is nothing in the boto library itself that would allow you to upload an entire directory. You could write your own code to traverse the directory using os.walk or similar and to upload each individual file using boto.
There is a command line utility in boto called s3put that could handle this or you could use the AWS CLI tool which has a lot of features that allow you to upload entire directories or even sync the S3 bucket with a local directory or vice-versa.

The s3fs package provides nice functionalities to handle such cases
s3_file = s3fs.S3FileSystem()
local_path = "some_dir_path/some_dir_path/"
s3_path = "bucket_name/dir_path"
s3_file.put(local_path, s3_path, recursive=True)

I built the function based on the feedback from #JDPTET, however,
I needed to remove the common entire local path from getting uploaded to the bucket!
Not sure how many path separators I encounter - so I had to use os.path.normpath
def upload_folder_to_s3(s3bucket, inputDir, s3Path):
print("Uploading results to s3 initiated...")
print("Local Source:",inputDir)
os.system("ls -ltR " + inputDir)
print("Dest S3path:",s3Path)
try:
for path, subdirs, files in os.walk(inputDir):
for file in files:
dest_path = path.replace(inputDir,"")
__s3file = os.path.normpath(s3Path + '/' + dest_path + '/' + file)
__local_file = os.path.join(path, file)
print("upload : ", __local_file, " to Target: ", __s3file, end="")
s3bucket.upload_file(__local_file, __s3file)
print(" ...Success")
except Exception as e:
print(" ... Failed!! Quitting Upload!!")
print(e)
raise e
s3 = boto3.resource('s3', region_name='us-east-1')
s3bucket = s3.Bucket("<<s3bucket_name>>")
upload_folder_to_s3(s3bucket, "<<Local Folder>>", "<<s3 Path>>")

You could do the following:
import os
import boto3
s3_resource = boto3.resource("s3", region_name="us-east-1")
def upload_objects():
try:
bucket_name = "S3_Bucket_Name" #s3 bucket name
root_path = 'D:/sample/' # local folder for upload
my_bucket = s3_resource.Bucket(bucket_name)
for path, subdirs, files in os.walk(root_path):
path = path.replace("\\","/")
directory_name = path.replace(root_path,"")
for file in files:
my_bucket.upload_file(os.path.join(path, file), directory_name+'/'+file)
except Exception as err:
print(err)
if __name__ == '__main__':
upload_objects()

This is the code I used which recursively upload files from the specified folder to the specified s3 path. Just add S3 credential and bucket details in the script:
https://gist.github.com/hari116/4ab5ebd885b63e699c4662cd8382c314/
#!/usr/bin/python
"""Usage: Add bucket name and credentials
script.py <source folder> <s3 destination folder >"""
import os
from sys import argv
import boto3
from botocore.exceptions import NoCredentialsError
ACCESS_KEY = ''
SECRET_KEY = ''
host = ''
bucket_name = ''
local_folder, s3_folder = argv[1:3]
walks = os.walk(local_folder)
# Function to upload to s3
def upload_to_aws(bucket, local_file, s3_file):
"""local_file, s3_file can be paths"""
s3 = boto3.client('s3', aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY)
print(' Uploading ' +local_file + ' as ' + bucket + '/' +s3_file)
try:
s3.upload_file(local_file, bucket, s3_file)
print(' '+s3_file + ": Upload Successful")
print(' ---------')
return True
except NoCredentialsError:
print("Credentials not available")
return False
"""For file names"""
for source, dirs, files in walks:
print('Directory: ' + source)
for filename in files:
# construct the full local path
local_file = os.path.join(source, filename)
# construct the full Dropbox path
relative_path = os.path.relpath(local_file, local_folder)
s3_file = os.path.join(s3_folder, relative_path)
# Invoke upload function
upload_to_aws(bucket_name, local_file, s3_file)

For reading file form folder we can use
import boto
from boto.s3.key import Key
keyId = 'YOUR_AWS_ACCESS_KEY_ID'
sKeyId='YOUR_AWS_ACCESS_KEY_ID'
bucketName='your_bucket_name'
conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.get_bucket(bucketName)
for key in bucket.list():
print ">>>>>"+key.name
pathV = key.name.split('/')
if(pathV[0] == "data"):
if(pathV[1] != ""):
srcFileName = key.name
filename = key.name
filename = filename.split('/')[1]
destFileName = "model/data/"+filename
k = Key(bucket,srcFileName)
k.get_contents_to_filename(destFileName)
elif(pathV[0] == "nlu_data"):
if(pathV[1] != ""):
srcFileName = key.name
filename = key.name
filename = filename.split('/')[1]
destFileName = "model/nlu_data/"+filename
k = Key(bucket,srcFileName)
k.get_contents_to_filename(destFileName)

Updated #user 923227's answer to (1) include newer boto3 interface (2) work with nuances of windows double backslash (3) cleaner tqdm progress bar:
import os
from tqdm import tqdm
def upload_folder_to_s3(s3_client, s3bucket, input_dir, s3_path):
pbar = tqdm(os.walk(input_dir))
for path, subdirs, files in pbar:
for file in files:
dest_path = path.replace(input_dir, "").replace(os.sep, '/')
s3_file = f'{s3_path}/{dest_path}/{file}'.replace('//', '/')
local_file = os.path.join(path, file)
s3_client.upload_file(local_file, s3bucket, s3_file)
pbar.set_description(f'Uploaded {local_file} to {s3_file}')
print(f"Successfully uploaded {input_dir} to S3 {s3_path}")
Usage example:
s3_client = boto3.client('s3', aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
upload_folder_to_s3(s3_client, 'BUCKET-NAME', <local-directory>, <s3-directory>)

Somehow the other snippets did not really work for me, this is a modification of the snippet from user 923227 that does.
This code copies all files in a directory and maintains the directory in S3, e.g.2023/01/file.jpg will be in the bucket as 2023/01/file.jpg.
import os
import sys
import boto3
client = boto3.client('s3')
local_path = "your-path/data"
bucketname = "bucket-name"
for path, dirs, files in os.walk(local_path):
for file in files:
file_s3 = os.path.normpath(path + '/' + file)
file_local = os.path.join(path, file)
print("Upload:", file_local, "to target:", file_s3, end="")
client.upload_file(file_local, bucketname, file_s3)
print(" ...Success")

Another method that did not exist when this question was first asked is to use python-rclone (https://github.com/ddragosd/python-rclone/blob/master/README.md).
This requires a download of rclone and a working rclone config. Commonly used for AWS (https://rclone.org/s3/) but can be used for other providers as well.
install('python-rclone')
import rclone
cfg_path = r'(path to rclone config file here)'
with open(cfg_path) as f:
cfg = f.read()
# Implementation
# Local file to cloud server
result = rclone.with_config(cfg).run_cmd(command="sync", extra_args=["/home/demodir/", "AWS test:dummydir/etc/"])
# Cloud server to cloud server
result = rclone.with_config(cfg).run_cmd(command="sync", extra_args=["Gdrive:test/testing/", "AWS test:dummydir/etc/"
This allows you to run a "sync" command similar to the AWS CLI within your python code by reading in the config file and mapping your output via kwargs (extra_args)

This solution does not use boto, but I think it could do what the OP wants.
It uses awscli and Python.
import os
class AwsCredentials:
def __init__(self, access_key: str, secret_key: str):
self.access_key = access_key
self.secret_key = secret_key
def to_command(self):
credentials = f'AWS_ACCESS_KEY_ID={self.access_key} AWS_SECRET_ACCESS_KEY={self.secret_key}'
return credentials
def sync_s3_bucket(credentials: AwsCredentials, source_path: str, bucket: str) -> None:
command = f'{credentials.to_command()} aws s3 sync {source_path} s3://{bucket}'
result = os.system(command)
assert result == 0, f'The s3 sync was not successful, error code: {result}'
Please consider getting the AWS credentials from a file or from the environment.
The documentation for the s3 sync command is here.

Simply running terminal commands using os module with F string works
import os
ActualFolderName = "FolderToBeUploadedOnS3"
os.system(f'aws s3 cp D:\<PathToYourFolder>\{ActualFolderName} s3://<BucketName>/{ActualFolderName}/ --recursive')

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Issue with uploading files from local directory to aws S3 using python 2.7 and boto 2 - python-2.7

Related

Creating a dmarc parser using parsedmarc in python3 for use in AWS s3

How to extract files in S3 on the fly with boto3?

How to download a snappy.parquet file from s3 using Boto in Python

Django Tweepy can't access Amazon S3 file

upload a directory to s3 with boto

Categories

Resources