I have a large file, around 6 GB, and I am using an AWS Lambda trigger to unzip it when it is uploaded to an S3 bucket, using Python and Boto3. However, I am getting a MemoryError while unzipping the file in a buffer using BytesIO.
# zip file is in output dir
if '-output' in file.key:
    # get base path other than the zip file name
    save_base_path = file.key.split('//')[0]
    # starting unzip process
    zip_obj = s3_resource.Object(bucket_name=source_bucket, key=file.key)
    buffer = BytesIO(zip_obj.get()["Body"].read())
    z = zipfile.ZipFile(buffer)
    print(f'Unzipping....')
    for filename in z.namelist():
        file_info = z.getinfo(filename)
        try:
            response = s3_resource.meta.client.upload_fileobj(
                z.open(filename),
                Bucket=target_bucket,
                Key=f'{save_base_path}/{filename}'
            )
        except Exception as e:
            print(e)
    print('unzipping process completed')
    # deleting zip file after unzip
    s3_resource.Object(source_bucket, file.key).delete()
    my_bucket.delete_object()
    print("iteration completed")
else:
    print('Zip file invalid position')
    s3_resource.Object(source_bucket, file.key).delete
    print(f'{file.key} deleted...')
Issue 1
When I read the bytes into the buffer, it gives me a MemoryError.
I have set the memory to 10240 MB (10 GB) in the general configuration of the AWS Lambda function.
Issue 2
I want to delete the object from S3. The code runs without any error, but the file is not deleted.
Is there any solution through which I can solve my unzip issue?
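Regarding Issue 2: in the else branch, .delete is referenced without parentheses, so the method is never actually called, and my_bucket.delete_object() does not name a key to delete. A minimal sketch of an explicit delete call, assuming the same source_bucket and file variables as in the code above:

import boto3

s3_resource = boto3.resource('s3')
# Note the parentheses: .delete() actually issues the DeleteObject request
s3_resource.Object(source_bucket, file.key).delete()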
It is possible to wrap the reading of the file in a small wrapper class so that the entire zip file does not need to be downloaded from S3. From there it is straightforward to upload the extracted files back to S3 without keeping the entire contents in RAM:
import boto3
import zipfile

# Download a zip file from S3 and upload its unzipped contents back
# to S3
def s3_zip_to_s3(source_bucket, source_key, dest_bucket, dest_prefix):
    s3 = boto3.client('s3')
    # Use the S3Wrapper class to avoid having to transfer the entire
    # file into RAM
    with zipfile.ZipFile(S3Wrapper(s3, source_bucket, source_key)) as zip:
        for name in zip.namelist():
            print(f"Uploading {name}...")
            # Use upload_fileobj to only stream to S3
            s3.upload_fileobj(zip.open(name, 'r'), dest_bucket, dest_prefix + name)
# Create a file-like object with a bare-bones implementation
# for reading only. Other than caching the file size, data is read
# from S3 for each call
class S3Wrapper:
    def __init__(self, s3, bucket, key):
        self.s3 = s3
        self.bucket = bucket
        self.key = key
        self.pos = 0
        self.length = s3.head_object(Bucket=bucket, Key=key)['ContentLength']

    def seekable(self):
        return True

    def seek(self, offset, whence=0):
        if whence == 0:
            self.pos = offset
        elif whence == 1:
            self.pos += offset
        else:
            self.pos = self.length + offset

    def tell(self):
        return self.pos

    def read(self, count=None):
        if count is None:
            resp = self.s3.get_object(Bucket=self.bucket, Key=self.key, Range=f'bytes={self.pos}-')
        else:
            resp = self.s3.get_object(Bucket=self.bucket, Key=self.key, Range=f'bytes={self.pos}-{self.pos+count-1}')
        data = resp['Body'].read()
        self.pos += len(data)
        return data
While this works, and in my testing it uses much less memory than the size of the zip file, it is not fast by any means.
I'd probably recommend a solution like a worker on EC2 or ECS to do the work for you.
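To tie this back to the original Lambda trigger, here is a minimal handler sketch; the event parsing assumes a standard S3 PUT notification, and the destination bucket and prefix are placeholders:

import urllib.parse
import boto3

def lambda_handler(event, context):
    # An S3 notification carries the bucket name and the URL-encoded key of the new object
    record = event['Records'][0]['s3']
    source_bucket = record['bucket']['name']
    source_key = urllib.parse.unquote_plus(record['object']['key'])

    # Stream-unzip into a placeholder destination bucket/prefix
    s3_zip_to_s3(source_bucket, source_key, 'my-target-bucket', 'unzipped/')

    # Delete the zip only after the extraction loop has finished
    boto3.resource('s3').Object(source_bucket, source_key).delete()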
I am developing a web application with Flask as the backend and Nuxt JS as the frontend. I receive an image file from the frontend and can save it to my Flask directory structure locally. The file is OK, and the image is shown if I open it. Now I want to upload this image to AWS S3 instead of saving it to my disk. I use the boto3 SDK; here is my code:
Here is my save_picture method, which opens the image file and resizes it. I had a save call, but I commented it out to avoid saving the file to disk, as I want it only on S3.
def save_picture(object_id, form_picture, path):
    if form_picture is None:
        return None
    random_hex = token_hex(8)
    filename = form_picture.filename
    if '.' not in filename:
        return None
    extension = filename.rsplit('.', 1)[1].lower()
    if not allowed_file(extension, form_picture):
        return None
    picture_fn = f'{object_id}_{random_hex}.{extension}'
    picture_path = current_app.config['UPLOAD_FOLDER'] / path / picture_fn
    # resizing image and saving the small version
    output_size = (1280, 720)
    i = Image.open(form_picture)
    i.thumbnail(output_size)
    # i.save(picture_path)
    return picture_fn
image_name = save_picture(object_id=new_object.id, form_picture=file, path=f'{object_type}_images')

s3 = boto3.client(
    's3',
    aws_access_key_id=current_app.config['AWS_ACCESS_KEY'],
    aws_secret_access_key=current_app.config['AWS_SECRET_ACCESS_KEY']
)

print(file)  # this prints <FileStorage: 'Capture.JPG' ('image/jpeg')>, so the file is ok

try:
    s3.upload_fileobj(
        file,
        current_app.config['AWS_BUCKET_NAME'],
        image_name,
        ExtraArgs={
            'ContentType': file.content_type
        }
    )
except Exception as e:
    print(e)
    return make_response({'msg': 'Something went wrong.'}, 500)
I can see the uploaded file in my S3, but it shows 0 B in size and if I download it, it says that it cannot be viewed.
I have tried different access policies in S3, as well as many tutorials online; nothing seems to help. Changing the version of S3 to v3 when creating the client breaks the whole system, and the file is not uploaded at all due to an access error.
What could be the reason for this upload failure? Is it the AWS config or something else?
Thank you!
Thanks to @jarmod, I tried skipping the image processing and it worked. I now resize the image, save it to disk, open the saved image (not the initial file), and send that to S3. Then I delete the image on disk since I don't need it.
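For what it's worth, the 0 B upload is consistent with the file pointer being left at the end of the FileStorage stream after Pillow reads it, so upload_fileobj has nothing left to read. A variant that avoids touching the disk would be to write the resized image into a BytesIO buffer, rewind it, and upload that; a minimal sketch with placeholder arguments (an untested illustration, not the exact fix used above):

from io import BytesIO

import boto3
from PIL import Image

def resize_and_upload(file, bucket, key, content_type):
    # Resize in memory
    img = Image.open(file)
    img.thumbnail((1280, 720))

    buffer = BytesIO()
    img.save(buffer, format=img.format or 'JPEG')
    buffer.seek(0)  # rewind so upload_fileobj reads from the start

    s3 = boto3.client('s3')
    s3.upload_fileobj(buffer, bucket, key, ExtraArgs={'ContentType': content_type})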
I'm trying to find a way to extract .gz files in S3 on the fly, that is, without downloading them locally, extracting them, and then pushing them back to S3.
How can I achieve this with boto3 and Lambda?
I didn't see any extract functionality in the boto3 documentation.
You can use BytesIO to read the file from S3, run it through gzip, and then pipe the decompressed stream back up to S3 with upload_fileobj.
# python imports
import boto3
from io import BytesIO
import gzip

# setup constants
bucket = '<bucket_name>'
gzipped_key = '<key_name.gz>'
uncompressed_key = '<key_name>'

# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False)  # optional

s3.upload_fileobj(              # upload a new obj to s3
    Fileobj=gzip.GzipFile(      # read in the output of gzip -d
        None,                   # just return output as BytesIO
        'rb',                   # read binary
        fileobj=BytesIO(s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
    Bucket=bucket,              # target bucket, writing to
    Key=uncompressed_key)       # target key, writing to
Ensure that your key is read in correctly:
# read the body of the s3 key object into a string to ensure download
s = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()
print(len(s))  # check to ensure some data was returned
The above answers are for gzip files; for zip files, you may try:
import boto3
import zipfile
from io import BytesIO

bucket = 'bucket1'
s3 = boto3.client('s3', use_ssl=False)
Key_unzip = 'result_files/'
prefix = "folder_name/"

zipped_keys = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
file_list = []
for key in zipped_keys['Contents']:
    file_list.append(key['Key'])
# This will give you a list of files in the folder you mentioned as prefix

s3_resource = boto3.resource('s3')

# Now create a zip object one by one; this below is for the 1st file in file_list
zip_obj = s3_resource.Object(bucket_name=bucket, key=file_list[0])
print(zip_obj)
buffer = BytesIO(zip_obj.get()["Body"].read())

z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key='result_files/' + f'{filename}')
This will work for your zip file, and the resulting unzipped data will be in the result_files folder. Make sure to increase the memory and timeout on AWS Lambda to the maximum, since some files are pretty large and need time to write.
Amazon S3 is a storage service. There is no in-built capability to manipulate the content of files.
However, you could use an AWS Lambda function to retrieve an object from S3, decompress it, then upload the content back again. Please note that Lambda has a default limit of 512 MB of temporary disk space, so avoid decompressing too much data at the same time.
You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:
Use boto3 to download the new file
Use the gzip Python library to extract files
Use boto3 to upload the resulting file(s)
Sample code:
import gzip
import io
import boto3
bucket = '<bucket_name>'
key = '<key_name>'
s3 = boto3.client('s3', use_ssl=False)
compressed_file = io.BytesIO(
    s3.get_object(Bucket=bucket, Key=key)['Body'].read())
uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)
s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])
I'm trying to download a large number of small files from an S3 bucket. I'm doing this by using the following:
s3 = boto3.client('s3')
kwargs = {'Bucket': bucket}

with open('/Users/hr/Desktop/s3_backup/files.csv', 'w') as file:
    while True:
        # The S3 API response is a large blob of metadata.
        # 'Contents' contains information about the listed objects.
        resp = s3.list_objects_v2(**kwargs)

        try:
            contents = resp['Contents']
        except KeyError:
            return

        for obj in contents:
            key = obj['Key']
            file.write(key)
            file.write('\n')

        # The S3 API is paginated, returning up to 1000 keys at a time.
        # Pass the continuation token into the next response, until we
        # reach the final page (when this field is missing).
        try:
            kwargs['ContinuationToken'] = resp['NextContinuationToken']
        except KeyError:
            break
However, after a certain amount of time I received this error message 'EndpointConnectionError: Could not connect to the endpoint URL'.
I know that there are still considerably more files in the S3 bucket. I have three questions:
Why is this error occurring when I haven't downloaded all files in the bucket?
Is there a way to start my code from the last file I downloaded from the S3 bucket (I don't want to re-download the file names I've already downloaded)?
Is there a default ordering of the S3 bucket? Is it alphabetical?
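For reference, ListObjectsV2 returns keys in ascending UTF-8 binary (lexicographic) order, so one way to resume is to pass the last key already written as StartAfter. A minimal sketch using boto3's paginator (last_key is a placeholder, and bucket is the same variable as above):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

last_key = 'path/to/last-key-written'  # placeholder: last key already in files.csv

with open('/Users/hr/Desktop/s3_backup/files.csv', 'a') as f:
    # The paginator follows continuation tokens automatically;
    # StartAfter makes the listing begin after the given key.
    for page in paginator.paginate(Bucket=bucket, StartAfter=last_key):
        for obj in page.get('Contents', []):
            f.write(obj['Key'] + '\n')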
I'm using Tweepy (a Python library for tweeting), django-storages and boto. I have a custom manage.py command that works correctly locally: it gets an image from the filesystem and tweets that image. If I change the storage to Amazon S3, however, I can't access the file. It gives me this error:
raise TweepError('Unable to access file: %s' % e.strerror)
I tried making the images in the bucket public, but that didn't work. This is the code (it works without S3):
filename = model_object.image.file.url
media_ids = api.media_upload(filename=filename) # ERROR
params = {'status': tweet_text, 'media_ids': [media_ids.media_id_string]}
api.update_status(**params)
This line:
model_object.image.file.url
Gives me the complete url of the image I want to tweet, something like this:
https://criptolibertad.s3.amazonaws.com/OrillaLibertaria/195.jpg?Signature=xxxExpires=1467645897&AWSAccessKeyId=yyy
I also tried constructing the url manually, since it is a public image stored in my bucket, like this:
filename = "https://criptolibertad.s3.amazonaws.com/OrillaLibertaria/195.jpg"
But it doesn't work.
Why do I get the 'Unable to access file' error?
The source code from tweepy looks like this:
def media_upload(self, filename, *args, **kwargs):
    """ :reference: https://dev.twitter.com/rest/reference/post/media/upload
        :allowed_param:
    """
    f = kwargs.pop('file', None)
    headers, post_data = API._pack_image(filename, 3072, form_field='media', f=f)  # ERROR
    kwargs.update({'headers': headers, 'post_data': post_data})

def _pack_image(filename, max_size, form_field="image", f=None):
    """Pack image from file into multipart-formdata post body"""
    # image must be less than 700kb in size
    if f is None:
        try:
            if os.path.getsize(filename) > (max_size * 1024):
                raise TweepError('File is too big, must be less than %skb.' % max_size)
        except os.error as e:
            raise TweepError('Unable to access file: %s' % e.strerror)
Looks like Tweepy can't get the image from the Amazon S3 bucket, but how can I make it work? Any advice will help.
The issue occurs when tweepy attempts to get file size in _pack_image:
if os.path.getsize(filename) > (max_size * 1024):
The function os.path.getsize assumes it is given a file path on disk; however, in your case it is given a URL. Naturally, the file is not found on disk and os.error is raised. For example:
# The following raises OSError on my machine
os.path.getsize('https://criptolibertad.s3.amazonaws.com/OrillaLibertaria/195.jpg')
What you could do is to fetch the file content, temporarily save it locally and then tweet it:
import tempfile

with tempfile.NamedTemporaryFile(delete=True) as f:
    name = model_object.image.file.name
    f.write(model_object.image.read())
    media_ids = api.media_upload(filename=name, f=f)
    params = dict(status='test media', media_ids=[media_ids.media_id_string])
    api.update_status(**params)
For your convenience, I published a fully working example here: https://github.com/izzysoftware/so38134984
Is there a way to concatenate small files (less than 5 MB each) on Amazon S3?
Multipart Upload is not OK because of the small file sizes.
It's not an efficient solution to pull all these files down and do the concatenation locally.
So, can anybody tell me about some APIs to do this?
Amazon S3 does not provide a concatenate function. It is primarily an object storage service.
You will need some process that downloads the objects, combines them, then uploads them again. The most efficient way to do this would be to download the objects in parallel, to take full advantage of available bandwidth. However, that is more complex to code.
I would recommend doing the processing "in the cloud" to avoid having to download the objects across the Internet. Doing it on Amazon EC2 or AWS Lambda would be more efficient and less costly.
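As a rough sketch of that download-combine-upload approach (sequential rather than parallel, with placeholder bucket and key names):

import boto3
from io import BytesIO

s3 = boto3.client('s3')
bucket = 'my-bucket'                 # placeholder bucket name
keys = ['parts/0001', 'parts/0002']  # placeholder keys of the small objects

# Download each object and append its bytes to an in-memory buffer,
# then upload the combined result as a single object.
merged = BytesIO()
for key in keys:
    merged.write(s3.get_object(Bucket=bucket, Key=key)['Body'].read())

merged.seek(0)
s3.upload_fileobj(merged, bucket, 'merged-output')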
Based on @wwadge's comment I wrote a Python script.
It bypasses the 5 MB limit by uploading a dummy object slightly bigger than 5 MB, then appending each small file as if it were the last part. In the end it strips the dummy part out of the merged file.
import boto3
import os

bucket_name = 'multipart-bucket'
merged_key = 'merged.json'
mini_file_0 = 'base_0.json'
mini_file_1 = 'base_1.json'
dummy_file = 'dummy_file'

s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')

# we need to have a garbage/dummy file with size > 5MB
# so we create and upload this
# this key will also be the key of the final merged file
with open(dummy_file, 'wb') as f:
    # slightly > 5MB
    f.seek(1024 * 5200)
    f.write(b'0')

with open(dummy_file, 'rb') as f:
    s3_client.upload_fileobj(f, bucket_name, merged_key)

os.remove(dummy_file)

# get the number of bytes of the garbage/dummy file
# needed to strip out these garbage/dummy bytes from the final merged file
bytes_garbage = s3_resource.Object(bucket_name, merged_key).content_length

# for each small file you want to concat
# when this loop has finished merged.json will contain
# (merged.json + base_0.json + base_1.json)
for key_mini_file in ['base_0.json', 'base_1.json']:  # include more files if you want
    # initiate multipart upload with merged.json object as target
    mpu = s3_client.create_multipart_upload(Bucket=bucket_name, Key=merged_key)

    part_responses = []
    # perform multipart copy where merged.json is the first part
    # and the small file is the second part
    for n, copy_key in enumerate([merged_key, key_mini_file]):
        part_number = n + 1
        copy_response = s3_client.upload_part_copy(
            Bucket=bucket_name,
            CopySource={'Bucket': bucket_name, 'Key': copy_key},
            Key=merged_key,
            PartNumber=part_number,
            UploadId=mpu['UploadId']
        )

        part_responses.append(
            {'ETag': copy_response['CopyPartResult']['ETag'], 'PartNumber': part_number}
        )

    # complete the multipart upload
    # content of merged will now be merged.json + mini file
    response = s3_client.complete_multipart_upload(
        Bucket=bucket_name,
        Key=merged_key,
        MultipartUpload={'Parts': part_responses},
        UploadId=mpu['UploadId']
    )

# get the number of bytes of the final merged file
bytes_merged = s3_resource.Object(bucket_name, merged_key).content_length

# initiate a new multipart upload
mpu = s3_client.create_multipart_upload(Bucket=bucket_name, Key=merged_key)
# do a single copy from the merged file specifying the byte range where the
# dummy/garbage bytes are excluded
response = s3_client.upload_part_copy(
    Bucket=bucket_name,
    CopySource={'Bucket': bucket_name, 'Key': merged_key},
    Key=merged_key,
    PartNumber=1,
    UploadId=mpu['UploadId'],
    CopySourceRange='bytes={}-{}'.format(bytes_garbage, bytes_merged - 1)
)

# complete the multipart upload
# after this step merged.json will contain (base_0.json + base_1.json)
response = s3_client.complete_multipart_upload(
    Bucket=bucket_name,
    Key=merged_key,
    MultipartUpload={'Parts': [
        {'ETag': response['CopyPartResult']['ETag'], 'PartNumber': 1}
    ]},
    UploadId=mpu['UploadId']
)
If you already have a >5 MB object that you want to add smaller parts to, then skip creating the dummy file and the last copy part with the byte ranges. Also, I have no idea how this performs on a large number of very small files; in that case it might be better to download each file, merge them locally and then upload.
Edit: Didn't see the 5MB requirement. This method will not work because of this requirement.
From https://ruby.awsblog.com/post/Tx2JE2CXGQGQ6A4/Efficient-Amazon-S3-Object-Concatenation-Using-the-AWS-SDK-for-Ruby:
While it is possible to download and re-upload the data to S3 through an EC2 instance, a more efficient approach would be to instruct S3 to make an internal copy using the new copy_part API operation that was introduced into the SDK for Ruby in version 1.10.0.
Code:
require 'rubygems'
require 'aws-sdk'

s3 = AWS::S3.new()
mybucket = s3.buckets['my-multipart']

# First, let's start the Multipart Upload
obj_aggregate = mybucket.objects['aggregate'].multipart_upload

# Then we will copy into the Multipart Upload all of the objects in a certain S3 directory.
mybucket.objects.with_prefix('parts/').each do |source_object|
  # Skip the directory object
  unless (source_object.key == 'parts/')
    # Note that this section is thread-safe and could greatly benefit from parallel execution.
    obj_aggregate.copy_part(source_object.bucket.name + '/' + source_object.key)
  end
end

obj_completed = obj_aggregate.complete()

# Generate a signed URL to enable a trusted browser to access the new object without authenticating.
puts obj_completed.url_for(:read)
Limitations (among others)
With the exception of the last part, there is a 5 MB minimum part size.
The completed Multipart Upload object is limited to a 5 TB maximum size.