get file from subfolder in S3 bucket using boto3 - amazon-web-services

To read a file stored directly in an S3 bucket, you use the following code:
import boto3
s3 = boto3.resource('s3')
s3_object = s3.Bucket('mybucket').Object('example.txt').get()
file = s3_object['Body'].read()
However, the 'example.txt' file is stored in a sub-folder. How do I read in the file that is stored in 'mybucket/mysubfolder/example.txt'?

You need to provide the full key, like this:
s3_object = s3.Bucket('mybucket').Object('mysubfolder/example.txt').get()
S3 is a key-value store; the concept of folders is just for human readability.
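To illustrate the point about keys, here is a minimal sketch that splits a 'bucket/path' string into the bucket name and the full object key. The helper name split_s3_path is made up for this example and is not part of boto3.

```python
def split_s3_path(path):
    """Split 'bucket/some/key' into ('bucket', 'some/key').

    Hypothetical helper: everything after the first '/' is the object key,
    "sub-folders" included.
    """
    bucket, _, key = path.partition('/')
    if not bucket or not key:
        raise ValueError(f"expected 'bucket/key', got {path!r}")
    return bucket, key


bucket, key = split_s3_path('mybucket/mysubfolder/example.txt')
print(bucket)  # mybucket
print(key)     # mysubfolder/example.txt
```

The resulting key can be passed unchanged to s3.Bucket(bucket).Object(key).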

Related

Data loss in AWS Lambda function when downloading file from S3 to /tmp

I have written a Lambda function in AWS to download a file from an S3 location to the /tmp directory (local Lambda space).
I am able to download the file; however, the file size changes, and I am not sure why.
import os
import boto3

def data_processor(event, context):
    print("EVENT:: ", event)
    bucket_name = 'asr-collection'
    fileKey = 'cc_continuous/testing/1645136763813.wav'
    path = '/tmp'
    output_path = os.path.join(path, 'mydir')
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    s3 = boto3.client("s3")
    new_file_name = output_path + '/' + os.path.basename(fileKey)
    s3.download_file(
        Bucket=bucket_name, Key=fileKey, Filename=new_file_name
    )
    print('File size is: ' + str(os.path.getsize(new_file_name)))
    return None
Output:
File size is: 337964
Actual size: 230MB
Downloaded file size: 330KB
I tried download_fileobj() as well.
Any idea how I can download the file as it is, without any data loss?
The issue may be that the bucket you are downloading from is in a different region from the one your Lambda is hosted in. Apparently, this makes no difference when running the code locally.
Check your bucket's location relative to your Lambda's region.
Note that setting the region on your client lets you use a Lambda in a different region from your bucket. However, if you intend to pull down larger files, keeping your Lambda in the same region as your bucket gives you lower network latency.
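One way to check the bucket's region from code is sketched below. The helper name bucket_region is made up for this example; it takes the client as a parameter so it can be tried without AWS access. It relies on get_bucket_location, which reports buckets in us-east-1 with a LocationConstraint of None.

```python
def bucket_region(s3_client, bucket_name):
    """Return the region name a bucket lives in (hypothetical helper)."""
    response = s3_client.get_bucket_location(Bucket=bucket_name)
    # Buckets in us-east-1 come back with LocationConstraint set to None.
    return response.get('LocationConstraint') or 'us-east-1'


# Usage sketch (assumes boto3 is installed and credentials are configured):
# import boto3
# region = bucket_region(boto3.client('s3'), 'asr-collection')
# s3 = boto3.client('s3', region_name=region)
```

Inside a Lambda function, the function's own region is available in the AWS_REGION environment variable, so the two values can be compared directly.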
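Working with an S3 resource instance instead of a client fixed it for me:
import os
import boto3
s3 = boto3.resource('s3')
bucket_name = 'asr-collection'  # assumed: the bucket from the question
keys = ['TestFolder1/testing/1651219413148.wav']
for KEY in keys:
    local_file_name = '/tmp/' + KEY
    # download_file does not create missing local directories
    os.makedirs(os.path.dirname(local_file_name), exist_ok=True)
    s3.Bucket(bucket_name).download_file(KEY, local_file_name)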
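Whichever fix applies, it may help to fail loudly when the local file is shorter than the object. The following is a rough sketch, not a definitive implementation: download_and_verify is a made-up helper that compares the downloaded size with the ContentLength reported by head_object. The client is passed in as a parameter.

```python
import os


def download_and_verify(s3_client, bucket, key, local_path):
    """Download an object and check its size (hypothetical helper)."""
    # Make sure the target directory exists before downloading.
    os.makedirs(os.path.dirname(local_path) or '.', exist_ok=True)
    expected = s3_client.head_object(Bucket=bucket, Key=key)['ContentLength']
    s3_client.download_file(Bucket=bucket, Key=key, Filename=local_path)
    actual = os.path.getsize(local_path)
    if actual != expected:
        raise OSError(
            f"size mismatch for {key!r}: expected {expected} bytes, got {actual}"
        )
    return actual


# Usage sketch (assumes boto3 is installed and credentials are configured):
# import boto3
# download_and_verify(boto3.client('s3'), 'asr-collection',
#                     'cc_continuous/testing/1645136763813.wav',
#                     '/tmp/mydir/1645136763813.wav')
```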

Combine all txt files in an S3 bucket into 1 large file

Problem: I am trying to combine a large number of small text files into one large file in an S3 bucket, using Python.
The code I tested to try this locally is below. It works perfectly. (obtained from another post):
import glob
import shutil
with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('UBXEvents*'):
        if filename == outfilename:  # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)
Now, since my files are located in an S3 bucket, I have trouble referencing the bucket. I want to run this code for all files (using the wildcard *) in an S3 bucket, but I am having a hard time connecting the two.
Below is the s3 object I created:
object = client.get_object(
    Bucket='my_bucket_name',
    Key='bucket_path/prefix_of_file_name*'
)
Question: How would I reference the S3 bucket/path in my combining code above?
Obtaining a list of files
You can obtain a list of files in the bucket like this:
import boto3
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='my-bucket', Prefix='folder1/')
for object in response['Contents']:
    # Do stuff here
    print(object['Key'])
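Note that a single list_objects_v2 call returns at most 1,000 keys, which matters when combining a large number of small files. A minimal sketch using boto3's paginator follows; iter_keys is a made-up helper name, and the client is passed in as a parameter.

```python
def iter_keys(s3_client, bucket, prefix=''):
    """Yield every key under a prefix, across all result pages."""
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no 'Contents' entry when nothing matches the prefix.
        for obj in page.get('Contents', []):
            yield obj['Key']


# Usage sketch (assumes boto3 is installed and credentials are configured):
# import boto3
# for key in iter_keys(boto3.client('s3'), 'my-bucket', 'folder1/'):
#     print(key)
```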
Reading & Writing to Amazon S3
Normally, you would need to download each file from Amazon S3 to the local disk (using download_file()) and then read the contents. However, you might instead want to use smart-open (available on PyPI), a library that allows files on S3 to be opened using syntax similar to the normal Python open() command.
Here's a program that uses smart-open to read files from S3 and combine them into an output file in S3:
import boto3
from smart_open import open
BUCKET = 'my-bucket'
PREFIX = 'folder1/'  # Optional
s3_client = boto3.client('s3')
# Open output file with smart-open
with open(f's3://{BUCKET}/out.txt', 'w') as out_file:
    response = s3_client.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for object in response['Contents']:
        print(f"Copying {object['Key']}")
        # Open input file with smart-open
        with open(f"s3://{BUCKET}/{object['Key']}", 'r') as in_file:
            # Read content from input file
            for line in in_file:
                # Write content to output file
                out_file.write(line)
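On the wildcard in the question: get_object expects an exact key, so a trailing * is not expanded. Prefix= on list_objects_v2 covers the simple "starts with" case. For more general patterns, one option is to filter the listed keys client-side, as in this sketch. The helper name filter_keys and the sample keys are made up for illustration.

```python
from fnmatch import fnmatchcase


def filter_keys(keys, pattern):
    """Keep only the keys matching a shell-style pattern (case-sensitive)."""
    return [key for key in keys if fnmatchcase(key, pattern)]


# Hypothetical keys, as they might come back from list_objects_v2.
keys = [
    'bucket_path/UBXEvents_001.txt',
    'bucket_path/UBXEvents_002.txt',
    'bucket_path/other.txt',
]
print(filter_keys(keys, 'bucket_path/UBXEvents*'))
# ['bucket_path/UBXEvents_001.txt', 'bucket_path/UBXEvents_002.txt']
```

Unlike a shell glob, * in fnmatchcase also matches '/', so a pattern can span several "folder" levels.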

Copy Without Prefix s3

I have directory structures in s3 like
bucket/folder1/*/*.csv
Where the folder wildcard refers to a number of different folders containing csv files.
I want to copy them, without the intermediate prefix, to
bucket/folder2/*.csv
Ex:
bucket/folder1/
s3distcp --src=s3://bucket/folder1/ --dest=s3://bucket/folder2/ --srcPattern=.*/csv
Results in the undesired structure of:
bucket/folder2/*/*.csv
I need a solution to copy in bulk that is scalable. Can I do this with s3distcp? Can I do this with aws s3 cp (without having to execute the aws s3 cp per file)?
You could try the following CLI command (sync is recursive by default, so it takes no --recursive flag):
aws s3 sync s3://SOURCE_BUCKET_NAME s3://DESTINATION_BUCKET_NAME
There is no shortcut to do what you wish, because you are manipulating the path to the objects.
You could instead write a little program to do it, such as:
import boto3
BUCKET = 'my-bucket'
s3_client = boto3.client('s3', region_name='ap-southeast-2')
# Get a list of objects in folder1
response = s3_client.list_objects_v2(Bucket=BUCKET, Prefix='folder1')
# Copy files to folder2, keeping a flat hierarchy
for object in response['Contents']:
    key = object['Key']
    print(key)
    s3_client.copy_object(
        CopySource={'Bucket': BUCKET, 'Key': key},
        Bucket=BUCKET,
        Key='folder2' + key[key.rfind('/'):]
    )
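One thing to watch when flattening: two files with the same name in different sub-folders would overwrite each other in folder2. Here is a small sketch that computes the destination keys up front and refuses to continue on a collision. The helper name flatten_keys and the sample keys are made up for this example.

```python
import posixpath


def flatten_keys(keys, dest_prefix):
    """Map each source key to '<dest_prefix>/<file name>' (hypothetical helper)."""
    mapping = {}
    sources_by_dest = {}
    for key in keys:
        name = posixpath.basename(key)
        if not name:
            # Skip "folder" placeholder keys that end in '/'.
            continue
        dest = f"{dest_prefix.rstrip('/')}/{name}"
        if dest in sources_by_dest:
            raise ValueError(
                f"{key!r} and {sources_by_dest[dest]!r} both map to {dest!r}"
            )
        sources_by_dest[dest] = key
        mapping[key] = dest
    return mapping


print(flatten_keys(['folder1/a/x.csv', 'folder1/b/y.csv'], 'folder2'))
# {'folder1/a/x.csv': 'folder2/x.csv', 'folder1/b/y.csv': 'folder2/y.csv'}
```

Each (source, destination) pair can then be passed to copy_object as in the program above.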
I ended up using Apache NiFi to do this, changing the filename attribute of the flowfile (using a regex to remove all of the path before the last '/') and writing with a prefix to the desired directory. It scales really well.

How to extract files in S3 on the fly with boto3?

I'm trying to find a way to extract .gz files in S3 on the fly, that is, without downloading them locally, extracting them, and then pushing them back to S3.
With boto3 + Lambda, how can I achieve this?
I didn't see anything about extraction in the boto3 documentation.
You can use BytesIO to stream the file from S3, run it through gzip, then pipe it back up to S3 using upload_fileobj to write the BytesIO.
# python imports
import boto3
from io import BytesIO
import gzip
# setup constants
bucket = '<bucket_name>'
gzipped_key = '<key_name.gz>'
uncompressed_key = '<key_name>'
# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False)  # optional
s3.upload_fileobj(                      # upload a new obj to s3
    Fileobj=gzip.GzipFile(              # read in the output of gzip -d
        None,                           # just return output as BytesIO
        'rb',                           # read binary
        fileobj=BytesIO(s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
    Bucket=bucket,                      # target bucket, writing to
    Key=uncompressed_key)               # target key, writing to
Ensure that your key is being read correctly:
# read the body of the s3 key object into a string to ensure download
s = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()
print(len(s))  # check to ensure some data was returned
The above answers are for gzip files. For zip files, you may try:
import boto3
import zipfile
from io import BytesIO
bucket = 'bucket1'
s3 = boto3.client('s3', use_ssl=False)
Key_unzip = 'result_files/'
prefix = "folder_name/"
zipped_keys = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
file_list = []
for key in zipped_keys['Contents']:
    file_list.append(key['Key'])
# This gives you the list of files in the folder you passed as prefix
s3_resource = boto3.resource('s3')
# Now create zip objects one by one; the code below handles the first file in file_list
zip_obj = s3_resource.Object(bucket_name=bucket, key=file_list[0])
print(zip_obj)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key=Key_unzip + filename)
This will work for your zip file, and the unzipped data will end up in the result_files folder. Make sure to increase the memory and timeout of your AWS Lambda function, since some files are pretty large and need time to write.
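Archives often contain directory entries as well as files, and those are usually not worth uploading. Below is a minimal in-memory sketch that walks an archive and skips them. The helper name iter_zip_members is made up, and the sample archive is built locally so the example runs without S3.

```python
import zipfile
from io import BytesIO


def iter_zip_members(zip_bytes):
    """Yield (name, content) for every file in a zip archive held in memory."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                # Directory entries carry no data.
                continue
            yield info.filename, archive.read(info)


# Build a small archive in memory to try the helper out.
buffer = BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('folder/', b'')
    archive.writestr('folder/a.txt', b'hello')

for name, content in iter_zip_members(buffer.getvalue()):
    print(name, len(content))
```

With bytes read from S3, each (name, content) pair could be written back with put_object under the result prefix.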
Amazon S3 is a storage service. There is no built-in capability to manipulate the content of files.
However, you could use an AWS Lambda function to retrieve an object from S3, decompress it, then upload the content back up again. Please note that Lambda has a default limit of 512 MB of temporary disk space (/tmp), so avoid decompressing too much data at the same time.
You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:
Use boto3 to download the new file
Use the gzip Python library to extract files
Use boto3 to upload the resulting file(s)
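The trigger step above can be sketched as follows. This is only an illustration of reading the bucket and key out of an S3 event notification, with objects_from_event as a made-up helper name. Keys arrive URL-encoded in the event, so they are decoded before use.

```python
from urllib.parse import unquote_plus


def objects_from_event(event):
    """Return the (bucket, key) pairs named in an S3 event notification."""
    targets = []
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        # Keys are URL-encoded in the event, e.g. spaces arrive as '+'.
        key = unquote_plus(record['s3']['object']['key'])
        targets.append((bucket, key))
    return targets


def lambda_handler(event, context):
    for bucket, key in objects_from_event(event):
        print(f"New object: s3://{bucket}/{key}")
        # Download, decompress and upload would go here.
```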
Sample code:
import gzip
import io
import boto3
bucket = '<bucket_name>'
key = '<key_name>'
s3 = boto3.client('s3', use_ssl=False)
compressed_file = io.BytesIO(
    s3.get_object(Bucket=bucket, Key=key)['Body'].read())
uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)
s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])
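The key[:-3] slice assumes the key really ends in '.gz'. A slightly more defensive variant is sketched below. The helper names are made up, the client is passed in as a parameter, and the whole object is held in memory, so this only suits objects that fit in the function's memory.

```python
import gzip


def uncompressed_key(key):
    """Strip a trailing '.gz' from a key, refusing anything else."""
    if not key.endswith('.gz'):
        raise ValueError(f"{key!r} does not end in '.gz'")
    return key[:-len('.gz')]


def gunzip_object(s3_client, bucket, key):
    """Decompress s3://bucket/key in memory and store it without '.gz'."""
    target = uncompressed_key(key)
    compressed = s3_client.get_object(Bucket=bucket, Key=key)['Body'].read()
    s3_client.put_object(Bucket=bucket, Key=target, Body=gzip.decompress(compressed))
    return target


# Usage sketch (assumes boto3 is installed and credentials are configured):
# import boto3
# gunzip_object(boto3.client('s3'), '<bucket_name>', '<key_name.gz>')
```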

Uploading multiple files to Google Cloud Storage via Python Client Library

The GCP python docs have a script with the following function:
def upload_pyspark_file(project_id, bucket_name, filename, file):
    """Uploads the PySpark file in this directory to the configured
    input bucket."""
    print('Uploading pyspark file to GCS')
    client = storage.Client(project=project_id)
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(filename)
    blob.upload_from_file(file)
I've created an argument parsing function in my script that takes in multiple arguments (file names) to upload to a GCS bucket. I'm trying to adapt the above function to parse those multiple args and upload those files, but am unsure how to proceed. My confusion is with the 'filename' and 'file' variables above. How can I adapt the function for my specific purpose?
I don't suppose you're still looking for something like this?
from google.cloud import storage
import os
files = os.listdir('data-files')
client = storage.Client.from_service_account_json('cred.json')
bucket = client.get_bucket('xxxxxx')
def upload_pyspark_file(filename, file):
    """Uploads the local file at path `file` to the bucket as `filename`."""
    print('Uploading from ', file, 'to', filename)
    blob = bucket.blob(filename)
    # file is a path here, so use upload_from_filename;
    # upload_from_file expects an already-open file object
    blob.upload_from_filename(file)
for f in files:
    upload_pyspark_file(f, os.path.join('data-files', f))
The difference between file and filename is, as you may have guessed, that file is the source (what gets uploaded) and filename is the destination object name in the bucket.
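For the original goal of uploading several files named on the command line, a rough sketch could look like this. upload_files is a made-up helper that takes the bucket object as a parameter. The argparse wiring in the comments is an assumption about how the script receives its file names, and 'my-bucket' is a placeholder.

```python
import os


def upload_files(bucket, paths, prefix=''):
    """Upload each local path to the bucket, named after its file name."""
    uploaded = []
    for path in paths:
        blob_name = prefix + os.path.basename(path)
        # upload_from_filename takes a local path, not an open file object.
        bucket.blob(blob_name).upload_from_filename(path)
        uploaded.append(blob_name)
    return uploaded


# Usage sketch (assumes google-cloud-storage is installed and authenticated):
# import argparse
# from google.cloud import storage
# parser = argparse.ArgumentParser()
# parser.add_argument('files', nargs='+')
# args = parser.parse_args()
# bucket = storage.Client().get_bucket('my-bucket')
# upload_files(bucket, args.files)
```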