How to zip files in memory and send to Amazon S3

I am having a problem: if I write something like the following (where bunch is just a list of file paths):
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zipper:
    for file in bunch:
        zipper.write(file)
with open('0.zip', 'wb') as f:
    f.write(zip_buffer.getvalue())
Then I get a zip file 0.zip with all the files in the "bunch" list. That's great.
However, when I try to upload to Amazon S3 from memory:
s3_client = boto3.client("s3")
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zipper:
    for file in bunch:
        zipper.write(file)
s3_client.put_object(Bucket=S3_BUCKET, Key=ZIP_NAME_IN_S3_BUCKET, Body=zip_buffer.getvalue())
Then the zip file is created in the Amazon S3 bucket, but it's not a valid zip file that can be unzipped. Why is it different when I save it locally versus when I send it to Amazon S3?

I found the solution. I had to do:
s3_client = boto3.client("s3")
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zipper:
    for file in bunch:
        zipper.write(file)
zip_buffer.seek(0)
s3_client.upload_fileobj(zip_buffer, S3_BUCKET, ZIP_NAME_IN_S3_BUCKET)
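Two details seem to matter here, judging from the working version: the archive's central directory is only written when the ZipFile is closed (i.e. when the with block exits), and upload_fileobj reads from the buffer's current position, which is why the seek(0) is needed. Under that reading, a put_object variant should also work as long as it runs after the with block; a minimal sketch, assuming S3_BUCKET, ZIP_NAME_IN_S3_BUCKET and bunch are defined as above:
import io
import zipfile
import boto3

s3_client = boto3.client("s3")

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zipper:
    for file in bunch:
        zipper.write(file)

# The with block has closed the archive, so the buffer now holds a complete zip.
# getvalue() returns the entire buffer regardless of the current position.
s3_client.put_object(Bucket=S3_BUCKET, Key=ZIP_NAME_IN_S3_BUCKET, Body=zip_buffer.getvalue())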

Related

File truncated on upload to GCS

I am uploading a relatively small (<1 MiB) .jsonl file to Google Cloud Storage using the Python API. The function I used is from the GCP documentation:
def upload_blob(key_path, bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The path to your file to upload
    # source_file_name = "local/path/to/file"
    # The ID of your GCS object
    # destination_blob_name = "storage-object-name"
    storage_client = storage.Client.from_service_account_json(key_path)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
    print(
        "File {} uploaded to {}.".format(
            source_file_name, destination_blob_name
        )
    )
The issue I am having is that the .jsonl file is getting truncated at 9500 lines after the upload. In fact, the 9500th line is not complete. I am not sure what the issue is and don't think there would be any limit for this small file. Any help is appreciated.
I had a similar problem some time ago. In my case, the upload to the bucket was called inside a Python with clause, right after the line where I wrote the contents to source_file_name. I just needed to move the upload call outside the with block so that the local file was properly flushed and closed before being uploaded.
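A minimal sketch of that fix, reusing the upload_blob function above; the file name and record contents here are illustrative:
records = ['{"id": 1}\n', '{"id": 2}\n']  # hypothetical JSON lines

with open("data.jsonl", "w") as f:
    for line in records:
        f.write(line)
# Leaving the with block flushes and closes the local file.

# Upload only after the file is closed, so no buffered data is missing.
upload_blob("service-account.json", "your-bucket-name", "data.jsonl", "data.jsonl")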

Combine all txt files in an S3 bucket into 1 large file

Problem: I am trying to combine a large number of small text files into one large file in an S3 bucket, using Python.
The code I tested locally is below (obtained from another post); it works perfectly:
import glob
import shutil

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('UBXEvents*'):
        if filename == outfilename:  # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)
Now, since my files are located in an S3 bucket, I have trouble referencing them. I wanted to run this code for all files (using the wildcard *) in an S3 bucket, but I am having a hard time connecting the two.
Below is the s3 object I created:
object = client.get_object(
    Bucket='my_bucket_name',
    Key='bucket_path/prefix_of_file_name*'
)
Question: How would I reference the S3 bucket/path in my combining code above?
Obtaining a list of files
You can obtain a list of files in the bucket like this:
import boto3

s3_client = boto3.client('s3')

response = s3_client.list_objects_v2(Bucket='my-bucket', Prefix='folder1/')
for object in response['Contents']:
    # Do stuff here
    print(object['Key'])
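Note that list_objects_v2 returns at most 1,000 keys per call. If the bucket might hold more than that, a paginator handles the continuation tokens for you; a short sketch:
import boto3

s3_client = boto3.client('s3')

paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='folder1/'):
    for object in page.get('Contents', []):
        print(object['Key'])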
Reading & Writing to Amazon S3
Normally, you would need to download each file from Amazon S3 to the local disk (using download_file()) and then read the contents. However, you might instead want to use smart-open (on PyPI), a library that allows files on S3 to be opened with syntax similar to the normal Python open() command.
Here's a program that uses smart-open to read files from S3 and combine them into an output file in S3:
import boto3
from smart_open import open

BUCKET = 'my-bucket'
PREFIX = 'folder1/'  # Optional

s3_client = boto3.client('s3')

# Open output file with smart-open
with open(f's3://{BUCKET}/out.txt', 'w') as out_file:
    response = s3_client.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for object in response['Contents']:
        print(f"Copying {object['Key']}")
        # Open input file with smart-open
        with open(f"s3://{BUCKET}/{object['Key']}", 'r') as in_file:
            # Read content from input file
            for line in in_file:
                # Write content to output file
                out_file.write(line)
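A small usage note, assuming a reasonably recent smart-open release: S3 support is installed with the s3 extra (pip install smart_open[s3]), and an existing boto3 client can be passed in through transport_params if you need specific credentials or configuration, roughly like this:
import boto3
from smart_open import open

session = boto3.Session(profile_name='my-profile')  # hypothetical named profile
client = session.client('s3')

with open('s3://my-bucket/out.txt', 'w', transport_params={'client': client}) as out_file:
    out_file.write('hello\n')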

Extract Zip files in s3 bucket

I'm trying to extract zip files from an S3 bucket and write the files back to another S3 bucket. Currently I have a zip file of 150 GB in the S3 bucket.
When I execute the code below, it keeps running for several hours and there is no output. Is there any alternative way to improve performance? I appreciate your response.
s3_resource = boto3.resource('s3')
zip_obj = s3_resource.Object(bucket_name="bucket_name_here", key=zip_key)
buffer = BytesIO(zip_obj.get()["Body"].read())

z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key=f'{filename}'
    )

How to extract files in S3 on the fly with boto3?

I'm trying to find a way to extract .gz files in S3 on the fly, that is, without needing to download them locally, extract them, and then push them back to S3.
With boto3 + Lambda, how can I achieve my goal?
I didn't see any extract functionality in the boto3 documentation.
You can use BytesIO to stream the file from S3, run it through gzip, then pipe it back up to S3 using upload_fileobj to write the BytesIO.
# python imports
import boto3
from io import BytesIO
import gzip

# setup constants
bucket = '<bucket_name>'
gzipped_key = '<key_name.gz>'
uncompressed_key = '<key_name>'

# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False)  # use_ssl=False is optional

s3.upload_fileobj(                      # upload a new obj to s3
    Fileobj=gzip.GzipFile(              # read in the output of gzip -d
        None,                           # just return output as BytesIO
        'rb',                           # read binary
        fileobj=BytesIO(s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
    Bucket=bucket,                      # target bucket, writing to
    Key=uncompressed_key)               # target key, writing to
Ensure that your key is read in correctly:
# read the body of the s3 object to ensure the download works
s = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()
print(len(s))  # check to ensure some data was returned
The above answer is for gzip files; for zip files, you may try the following:
import boto3
import zipfile
from io import BytesIO

bucket = 'bucket1'
s3 = boto3.client('s3', use_ssl=False)
Key_unzip = 'result_files/'
prefix = "folder_name/"

zipped_keys = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
file_list = []
for key in zipped_keys['Contents']:
    file_list.append(key['Key'])
# This will give you the list of files in the folder you mentioned as prefix

s3_resource = boto3.resource('s3')

# Now create a zip object one by one; the code below is for the 1st file in file_list
zip_obj = s3_resource.Object(bucket_name=bucket, key=file_list[0])
print(zip_obj)
buffer = BytesIO(zip_obj.get()["Body"].read())

z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key='result_files/' + f'{filename}')
This will work for your zip file, and the resulting unzipped data will be in the result_files folder. Make sure to increase the memory and timeout on AWS Lambda to the maximum, since some files are pretty large and need time to write.
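The snippet above only processes the first key in file_list. Assuming every object under the prefix is itself a zip archive, a sketch of the same logic applied to each key would be:
for zip_key in file_list:
    zip_obj = s3_resource.Object(bucket_name=bucket, key=zip_key)
    buffer = BytesIO(zip_obj.get()["Body"].read())  # the whole archive is read into memory
    z = zipfile.ZipFile(buffer)
    for filename in z.namelist():
        s3_resource.meta.client.upload_fileobj(
            z.open(filename),
            Bucket=bucket,
            Key='result_files/' + filename)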
Amazon S3 is a storage service. There is no in-built capability to manipulate the content of files.
However, you could use an AWS Lambda function to retrieve an object from S3, decompress it, then upload the content back up again. Please note that Lambda's temporary disk space defaults to 512 MB, so avoid decompressing too much data at the same time.
You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:
Use boto3 to download the new file
Use the gzip Python library to extract files
Use boto3 to upload the resulting file(s)
Sample code:
import gzip
import io
import boto3

bucket = '<bucket_name>'
key = '<key_name>'

s3 = boto3.client('s3', use_ssl=False)

# Read the compressed object into memory
compressed_file = io.BytesIO(
    s3.get_object(Bucket=bucket, Key=key)['Body'].read())

# Wrap it so that reads return the decompressed bytes
uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)

# Upload the decompressed stream, dropping the '.gz' suffix from the key
s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])
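A hedged sketch of how the steps above could be wired into a Lambda handler triggered by S3 object creation; the .gz suffix check is an assumption, added to avoid re-processing the decompressed object that this function writes back to the same bucket:
import gzip
import io
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        if not key.endswith('.gz'):
            continue  # skip objects that are not gzip-compressed
        compressed_file = io.BytesIO(
            s3.get_object(Bucket=bucket, Key=key)['Body'].read())
        uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)
        s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])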

IOError in Boto3 download_file

Background
I am using the following Boto3 code to download a file from S3.
for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    print(key)
    if key.find('/') < 0:
        if len(key) > 4 and key[-5:].lower() == '.json':  # file is uploaded outside any folder
            download_path = '/tmp/{}{}'.format(uuid.uuid4(), key)
    else:
        download_path = '/tmp/{}/{}'.format(uuid.uuid4(), key)  # file is uploaded inside a folder
If a new file is uploaded to the S3 bucket, this code is triggered and the newly uploaded file is downloaded by it.
This code works fine when the file is uploaded outside any folder.
However, when I upload a file inside a directory, an IOError occurs.
Here is a dump of the IO error I am encountering.
[Errno 2] No such file or directory:
/tmp/316bbe85-fa21-463b-b965-9c12b0327f5d/test1/customer1.json.586ea9b8:
IOError
test1 is the directory inside my S3 bucket where customer1.json is uploaded.
Query
Any thoughts on how to resolve this error?
The error is raised because you attempted to download and save the file into a directory which does not exist. Use os.mkdir prior to downloading the file to create the directory.
# ...
else:
    item_uuid = str(uuid.uuid4())
    os.mkdir('/tmp/{}'.format(item_uuid))
    download_path = '/tmp/{}/{}'.format(item_uuid, key)  # File is uploaded inside a folder
Note: It's better to use os.path.join() when operating on system paths. So the code above could be rewritten as:
# ...
else:
    item_uuid = str(uuid.uuid4())
    os.mkdir(os.path.join('/tmp', item_uuid))
    download_path = os.path.join('/tmp', item_uuid, key)  # File is uploaded inside a folder
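If the key itself contains slashes (for example test1/customer1.json), the intermediate directories also have to exist before the download; a short sketch using os.makedirs, which creates every missing directory in the path:
import os
import uuid

item_uuid = str(uuid.uuid4())
download_path = os.path.join('/tmp', item_uuid, key)

# Create every missing parent directory of the target file
os.makedirs(os.path.dirname(download_path), exist_ok=True)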
The error may also be raised because you are including '/tmp/' in the download path for the S3 bucket file; do not include the tmp folder, as it most likely does not exist on S3. Make sure you are on the right track by reading these articles:
Amazon S3 upload and download using Python/Django
Python s3 examples
I faced the same issue, and the error message caused a lot of confusion (the random string appended after the file name). In my case it was caused by a directory in the download path that didn't exist.
Thanks for helping, Andriy Ivaneyko. I found a solution using boto3.
Using the following code I was able to accomplish my task:
for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    fn = '/tmp/xyz'
    fp = open(fn, 'wb')  # binary mode, since response['Body'].read() returns bytes
    response = s3_client.get_object(Bucket=bucket, Key=key)
    contents = response['Body'].read()
    fp.write(contents)
    fp.close()
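As a side note, boto3 can do the same get-and-write in a single call with download_file; a minimal equivalent sketch, assuming bucket and key from the loop above:
# Downloads the object directly to the local path
s3_client.download_file(bucket, key, '/tmp/xyz')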
The problem with your code is that download_path is wrong. Whenever you try to download a file which is under a directory in your S3 bucket, the download path becomes something like:
download_path = /tmp/<uuid><object key name>
where <object key name> = "<directory name>/<object name>"
This makes the download path:
download_path = /tmp/<uuid><directory name>/<object name>
The code will fail because no directory named <uuid><directory name> exists; your code only allows downloading files directly under the /tmp directory.
To fix the issue, consider splitting the key when building the download path; you can then also drop the check for where the file was uploaded in the bucket. This uses only the object's file name in the download path. For example:
for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    print(key)
    download_path = '/tmp/{}{}'.format(uuid.uuid4(), key.split('/')[-1])
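One additional detail worth noting: keys in S3 event notifications arrive URL-encoded (spaces become '+', for example), so it is common to decode them before calling get_object or building local paths; a small sketch:
import urllib.parse

for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    # Keys in S3 event notifications arrive URL-encoded
    key = urllib.parse.unquote_plus(record['s3']['object']['key'])
    download_path = '/tmp/{}{}'.format(uuid.uuid4(), key.split('/')[-1])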