How to avoid having idle connection timeout while uploading large file? - django

Consider our current architecture:
+---------------+
|    Clients    |
|     (API)     |
+-------+-------+
        ∧
        ∨
+-------+-------+    +-----------------------+
| Load Balancer |    |         Nginx         |
| (AWS - ELB)   +<-->+   (Service Routing)   |
+---------------+    +-----------------------+
                                 ∧
                                 ∨
                     +-----------------------+
                     |         Nginx         |
                     |    (Backend layer)    |
                     +-----------+-----------+
                                 ∧
                                 ∨
 -----------------   +-----------+-----------+
   File Storage      |       Gunicorn        |
   (AWS - S3)    <-->+       (Django)        |
 -----------------   +-----------------------+
When a client, mobile or web, tries to upload large files (more than a GB) to our servers, it often faces idle connection timeouts, either from its client library (on iOS, for example) or from our load balancer.
While the file is actually being uploaded by the client, no timeout occurs because the connection isn't "idle": bytes are being transferred. But I think that once the file has been transferred to the Nginx backend layer and Django starts uploading the file to S3, the connection between the client and our server becomes idle until the upload is completed.
Is there a way to prevent this from happening, and on which layer should I tackle this issue?

I have faced the same issue and fixed it by using django-queued-storage on top of django-storages. What django-queued-storage does is that when a file is received, it creates a Celery task to upload it to the remote storage such as S3, and in the meantime, if the file is accessed by anyone and it is not yet available on S3, it serves it from the local file system. This way you don't have to wait for the file to be uploaded to S3 before sending a response back to the client.
As your application is behind a load balancer, you might want to use a shared file system such as Amazon EFS in order to use the above approach.
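For reference, a minimal sketch of how this is typically wired up, roughly following the django-queued-storage README; the backend dotted paths and the model are illustrative and should match your own django-storages setup:
from django.db import models
from queued_storage.backends import QueuedStorage

# Save locally first, then let the Celery task push the file to S3 in the background.
queued_s3storage = QueuedStorage(
    'django.core.files.storage.FileSystemStorage',
    'storages.backends.s3boto.S3BotoStorage')

class Upload(models.Model):
    file = models.FileField(upload_to='uploads/', storage=queued_s3storage)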

You can create an upload handler to upload the file directly to S3. This way you shouldn't encounter connection timeouts.
https://docs.djangoproject.com/en/1.10/ref/files/uploads/#writing-custom-upload-handlers
I did some tests and it works perfectly in my case.
You have to start a new multipart_upload with boto, for example, and send chunks progressively.
Don't forget to validate the chunk size: 5 MB is the minimum if your file contains more than one part (an S3 limitation).
I think this is the best alternative to django-queued-storage if you really want to upload directly to S3 and avoid connection timeouts.
You'll probably also need to create your own FileField to manage the file correctly and not send it a second time.
The following example is with S3BotoStorage.
import sys
import uuid
from StringIO import StringIO  # Python 2 / boto-era code (S3BotoStorage)

from django.core.files.storage import default_storage
from django.core.files.uploadhandler import FileUploadHandler
from storages.utils import setting  # settings helper shipped with django-storages

S3_MINIMUM_PART_SIZE = 5242880  # 5 MB, the S3 minimum size for non-final parts


class S3FileUploadHandler(FileUploadHandler):
    chunk_size = setting('S3_FILE_UPLOAD_HANDLER_BUFFER_SIZE', S3_MINIMUM_PART_SIZE)

    def __init__(self, request=None):
        super(S3FileUploadHandler, self).__init__(request)
        self.file = None
        self.part_num = 1
        self.last_chunk = None
        self.multipart_upload = None

    def new_file(self, field_name, file_name, content_type, content_length, charset=None, content_type_extra=None):
        super(S3FileUploadHandler, self).new_file(field_name, file_name, content_type, content_length, charset, content_type_extra)
        self.file_name = "{}_{}".format(uuid.uuid4(), file_name)
        default_storage.bucket.new_key(self.file_name)
        self.multipart_upload = default_storage.bucket.initiate_multipart_upload(self.file_name)

    def receive_data_chunk(self, raw_data, start):
        buffer_size = sys.getsizeof(raw_data)
        if self.last_chunk:
            file_part = self.last_chunk
            if buffer_size < S3_MINIMUM_PART_SIZE:
                # The incoming chunk is too small to be its own S3 part:
                # merge it into the buffered one and flush them together.
                file_part += raw_data
                self.last_chunk = None
            else:
                self.last_chunk = raw_data
            self.upload_part(part=file_part)
        else:
            self.last_chunk = raw_data
        # Returning None stops later upload handlers from also receiving the data.

    def upload_part(self, part):
        self.multipart_upload.upload_part_from_file(
            fp=StringIO(part),
            part_num=self.part_num,
            size=sys.getsizeof(part)
        )
        self.part_num += 1

    def file_complete(self, file_size):
        if self.last_chunk:
            self.upload_part(part=self.last_chunk)
        self.multipart_upload.complete_upload()
        self.file = default_storage.open(self.file_name)
        self.file.original_filename = self.original_filename
        return self.file
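As a usage note (a sketch, assuming the handler above lives in a hypothetical myapp/uploadhandlers.py), you can either set FILE_UPLOAD_HANDLERS globally in settings, or swap the handler per view; Django requires the csrf_exempt/csrf_protect split when you replace upload handlers inside a view:
from django.views.decorators.csrf import csrf_exempt, csrf_protect

from myapp.uploadhandlers import S3FileUploadHandler

@csrf_exempt
def upload_view(request):
    # Replace the default handlers before any POST data is accessed.
    request.upload_handlers = [S3FileUploadHandler(request)]
    return _handle_upload(request)

@csrf_protect
def _handle_upload(request):
    ...  # your normal form handling; request.FILES now holds the S3-backed file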

You can try to skip uploading the file to your server and upload it to S3 directly, then only get back a URL for your application.
There is an app for that: django-s3direct; you can give it a try.
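If you'd rather sketch the same idea yourself instead of pulling in django-s3direct, boto3's presigned POST does the job: the Django view only hands the browser a signed form, and the browser uploads straight to S3. The bucket name, key prefix, and expiry below are placeholders:
import boto3
from django.http import JsonResponse

def presign_upload(request):
    s3 = boto3.client('s3')
    # The client POSTs the file directly to S3 using these fields,
    # so the upload never passes through Django.
    post = s3.generate_presigned_post(
        Bucket='my-upload-bucket',
        Key='uploads/${filename}',
        ExpiresIn=3600,
    )
    return JsonResponse(post)  # contains 'url' and 'fields' for the upload form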

Related

Why is the file uploaded to AWS S3 0B in size?

I am developing a web application with Flask as the backend and Nuxt.js as the frontend. I receive an image file from the frontend and can save it to my Flask directory structure locally. The file is OK and the image is displayed if I open it. Now I want to upload this image to AWS S3 instead of saving it to disk. I use the boto3 SDK; here is my code:
Here is my save_picture method, which opens the image file and resizes it. I had a save call, but commented it out to avoid saving the file to disk, as I want it only on S3.
from secrets import token_hex

from PIL import Image
from flask import current_app

# allowed_file() is the application's own validation helper (not shown).

def save_picture(object_id, form_picture, path):
    if form_picture is None:
        return None
    random_hex = token_hex(8)
    filename = form_picture.filename
    if '.' not in filename:
        return None
    extension = filename.rsplit('.', 1)[1].lower()
    if not allowed_file(extension, form_picture):
        return None
    picture_fn = f'{object_id}_{random_hex}.{extension}'
    picture_path = current_app.config['UPLOAD_FOLDER'] / path / picture_fn
    # resizing image and saving the small version
    output_size = (1280, 720)
    i = Image.open(form_picture)
    i.thumbnail(output_size)
    # i.save(picture_path)
    return picture_fn
image_name = save_picture(object_id=new_object.id, form_picture=file, path=f'{object_type}_images')

s3 = boto3.client(
    's3',
    aws_access_key_id=current_app.config['AWS_ACCESS_KEY'],
    aws_secret_access_key=current_app.config['AWS_SECRET_ACCESS_KEY']
)

print(file)  # this prints <FileStorage: 'Capture.JPG' ('image/jpeg')>, so the file is ok

try:
    s3.upload_fileobj(
        file,
        current_app.config['AWS_BUCKET_NAME'],
        image_name,
        ExtraArgs={
            'ContentType': file.content_type
        }
    )
except Exception as e:
    print(e)
    return make_response({'msg': 'Something went wrong.'}, 500)
I can see the uploaded file in my S3, but it shows 0 B in size, and if I download it, it says that it cannot be viewed.
I have tried different access policies in S3, as well as many tutorials online; nothing seems to help. Changing the version of S3 to v3 when creating the client breaks the whole system and the file is not uploaded at all, with an access error.
What could be the reason for this upload failure? Is it the AWS config or something else?
Thank you!
Thanks to #jarmod, I tried skipping the image processing and the upload worked. I am now resizing the image, saving it to disk, opening the saved image (not the initial file object) and sending that to S3. I then delete the image on disk, as I don't need it locally.
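If you want to avoid the disk round-trip entirely, here is a minimal sketch built on the assumption that the root cause is Image.open() consuming the incoming stream (leaving the file pointer at the end, so upload_fileobj uploads 0 bytes): write the resized image into an in-memory buffer and upload that instead. The function and parameter names are hypothetical:
import io

import boto3
from PIL import Image

def upload_resized(file, bucket, key, aws_access_key, aws_secret_key):
    # Resize in memory, then upload the buffer instead of the
    # already-consumed request file object.
    image = Image.open(file)
    image.thumbnail((1280, 720))
    buffer = io.BytesIO()
    image.save(buffer, format=image.format or 'JPEG')
    buffer.seek(0)  # rewind so boto3 reads from the start

    s3 = boto3.client('s3',
                      aws_access_key_id=aws_access_key,
                      aws_secret_access_key=aws_secret_key)
    s3.upload_fileobj(buffer, bucket, key,
                      ExtraArgs={'ContentType': file.content_type})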

Cloud Run: Google Cloud Storage to BigQuery duplicate uploads

I have a Cloud Run instance that receives a list of files from Cloud Storage, checks if each item exists in BigQuery, and uploads it if it doesn't exist yet.
The pipeline structure is: Cloud Function (get list of files from GCS) > PubSub > Cloud Run > BigQuery.
I can't tell where the problem lies, but I suspect it's with Pub/Sub's at-least-once delivery; I've set the acknowledgement deadline to 300 seconds and the retry policy to "Retry after exponential backoff delay".
What I expect is to call the Function, get a list of files from GCS, have the pipeline get triggered, and see a 1:1 mapping of files in GCS to entries in BQ.
My question is: what GCP services should I be using to upload JSON files from GCS to BQ? Dataflow? I ask because this seems to be something between the Cloud Run instances.
Relevant code in Cloud Run instance
Check if file exists
def existInTable(table_id, dataId):
    try:
        client = bigquery.Client(project=PROJECT)
        tableref = PROJECT + "." + table_id
        sql = """
            SELECT version, dataId
            FROM `{}`
            WHERE version LIKE "{}" AND dataId LIKE "{}"
        """.format(tableref, __version__.__version__, dataId)
        query_job = client.query(sql)
        results = query_job.result()
        print("Found {} Entries for dataid {} and version {} in {}".format(results.total_rows, dataId, __version__.__version__, table_id))
        if results.total_rows > 0:
            print("Skip upload")
            return True
        return False
    except Exception as e:
        print("Failed to find entries in table {}".format(table_id))
        print(f"error: {e}")
        return False
upload
def uploadCloudStorageFileToBigquery(table_id, gcsEntity):
    client = bigquery.Client(project=PROJECT)
    job_config = bigquery.LoadJobConfig(
        schema=getSchema(),
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    )
    # Report Path
    uri = "gs://{}/{}/{}file.json".format(gcsEntity.bucketName, gcsEntity.version, gcsEntity.dataId)
    # Start upload process
    try:
        load_job = client.load_table_from_uri(
            uri,
            table_id,
            location="US",  # Must match the destination dataset location.
            job_config=job_config,
        )  # Make an API request.
        load_job.result()  # Waits for the job to complete.
        destination_table = client.get_table(table_id)
        print("Successfully uploaded {},{} to {}, {} rows".format(gcsEntity.dataId, gcsEntity.version, table_id, destination_table.num_rows))
        return True
    except Exception as e:
        print("Failed uploading {},{} to {} \n URI: {} \n".format(gcsEntity.dataId, gcsEntity.version, table_id, uri))
        print(f"error: {e}")
        return False
function that ties the two together
def runBigQueryUpload(gcsEntity):
    # get Name for individual Table
    individualTable = BIGQUERYDATASET + "." + gcsEntity.bucketName.replace("gcsBucket-", "")
    # Only continue if no entry exists so far in Individual BigQuery Table
    noExceptionCaught = True
    if existInTable(individualTable, gcsEntity.dataId):
        print("BQ Instance {} from {} Already exists in {}".format(gcsEntity.dataId, gcsEntity.bucketName, individualTable))
    else:
        try:
            uploadCloudStorageFileToBigquery(individualTable, gcsEntity)
        except Exception as e:
            print("Failed upload to BigQuery table {} for {}".format(individualTable, gcsEntity.dataId))
            print(f"error: {e}")
            noExceptionCaught = False
    # Upload to main table
    if existInTable(BIGQUERYTABLEID, gcsEntity.dataId):
        print("BQ Instance {} from {} Already exists in {}".format(gcsEntity.dataId, gcsEntity.bucketName, individualTable))
    else:
        try:
            uploadCloudStorageFileToBigquery(BIGQUERYTABLEID, gcsEntity)
        except Exception as e:
            print("Failed upload to BigQuery table {} for {}".format(individualTable, gcsEntity.dataId))
            print(f"error: {e}")
            noExceptionCaught = False
    # If any try block failed, return False
    if not noExceptionCaught:
        print("Failed in upload process {} from {}".format(gcsEntity.dataId, gcsEntity.bucketName))
        return 500
    return 200
Hope this might help somebody else...
So I had to rethink the architecture completely. I had three steps that I needed to do:
1. Do some long processing on the files (done with Pub/Sub, on a per-file basis, so that Cloud Run can ack back in time).
2. Batch process all files in GCS to BQ in one go using write_disposition="WRITE_TRUNCATE"; this overwrites the table and ensures no duplicate entries.
3. Pass a wildcard character in the URI to batch process multiple files. I called this once the entire batch was in GCS.
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    schema=getSchema(),
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition="WRITE_TRUNCATE"
)

uri = "gs://{}/subfolder/*file.json".format(bucketName)  # note wildcard

load_job = client.load_table_from_uri(
    uri,
    tableId,
    location="US",  # Must match the destination dataset location.
    job_config=job_config,
)  # Make an API request.
Notes: the old code actually looped through each file in GCS and called load_table_from_uri each time (no batch processing, which wastes quota).
It turned out that two separate Cloud Run instances both checked that a file didn't exist in BQ and both uploaded it at the same time (that's why I needed batch processing).

Django FileResponse - How to speed up file download

I have a setup that lets users download files that are stored in the DB as BYTEA data. Everything works OK, except the download speed is very slow... it seems to download in 33 KB chunks, one chunk per second.
Is there a setting I can specify to speed this up?
views.py
from django.http import FileResponse

def getFileResponse(filedata, filename, filesize, contenttype):
    response = FileResponse(filedata, content_type=contenttype)
    response['Content-Disposition'] = 'attachment; filename=%s' % filename
    response['Content-Length'] = filesize
    return response
return getFileResponse(
    filedata=myfile.filedata,  # Binary data from DB
    filename=myfile.filename + myfile.fileextension,
    filesize=myfile.filesize,
    contenttype=myfile.filetype
)
Previously, I had the binary data returned as an HttpResponse and it downloaded like a normal file, with normal speeds. This worked fine locally, but when I pushed to Heroku, it wouldn't download the file; instead it displayed <Memory at XXX> in the downloaded file.
And another side issue: when I include a text file with non-ASCII data (e.g. á), I get an error as well:
UnicodeEncodeError: 'ascii' codec can't encode characters...: ordinal not in range(128)
How can I handle files with Unicode data?
Update
Anyone know why the download speed gets so slow when changing from HttpResponse to FileResponse? Or alternatively, why returning the file via HttpResponse doesn't work on Heroku?
Update - Google Drive
I re-worked my application and hooked it up with a Google Drive back-end for serving files. It employs the BytesIO() approach suggested by Eric below:
def download_file(self, fileid, mimetype=None):
    # Get binary file data
    request = self.get_file(fileid=fileid, mediaflag=True)
    stream = io.BytesIO()
    downloader = MediaIoBaseDownload(stream, request)
    done = False
    # Retry if we received HTTPError
    for retry in range(0, 5):
        try:
            while done is False:
                status, done = downloader.next_chunk()
                print("Download %d%%." % int(status.progress() * 100))
            return stream.getvalue()
        except (HTTPError) as error:
            return ('API error: {}. Try # {} failed.'.format(error.response, retry))
I think the difference you observe between HttpResponse and FileResponse is caused by the spec: https://www.python.org/dev/peps/pep-3333/#buffering-and-streaming
In your previous code, an HttpResponse was created with one huge byte string containing your whole file, and the first iteration pass returned the complete response body. With a FileResponse, the file is iterated in chunks (of 4 KB, 8 KB or another size, depending on your WSGI app server), which (I think) are streamed immediately upstream (to the reverse proxy, then the client), which may add overhead (more communication over process boundaries?).
It would help to know the app server used (uwsgi, gunicorn, waitress, other) and its relevant config. Also more details about the Heroku error, in case that can be solved!
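If the per-chunk overhead is the bottleneck, one thing to try (a sketch, not tested against your Heroku setup; the 64 KB figure is an arbitrary choice) is raising FileResponse's block size by subclassing it, and wrapping the raw bytes in a BytesIO so they are streamed from memory:
import io

from django.http import FileResponse

class LargeBlockFileResponse(FileResponse):
    # FileResponse reads the underlying file-like object in blocks of
    # block_size bytes (4096 by default); larger blocks mean fewer
    # iterations between Django and the WSGI server.
    block_size = 64 * 1024

def getFileResponse(filedata, filename, filesize, contenttype):
    response = LargeBlockFileResponse(io.BytesIO(filedata), content_type=contenttype)
    response['Content-Disposition'] = 'attachment; filename=%s' % filename
    response['Content-Length'] = filesize
    return response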
Why do you store the whole file in the database?
The best practice is to store the file on disk and store only its path in the database.
Then, depending on your web server, you can let the web server serve the file.
Web servers serve files better than Django.
If the files need no access check, store them under media.
If your files need access control, then depending on your web server you can use certain response headers.
If you use Nginx, you must use X-Accel-Redirect; for alternatives on other web servers, see the tutorial at https://wellfire.co/learn/nginx-django-x-accel-redirects/
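For illustration, a minimal sketch of the X-Accel-Redirect pattern in a Django view; the /protected/ prefix and the permission check are hypothetical and must match an "internal" location block in your Nginx config:
from django.http import HttpResponse, HttpResponseForbidden

def protected_download(request, filename):
    if not request.user.is_authenticated:
        return HttpResponseForbidden()
    response = HttpResponse()
    # Empty content type so Nginx picks it based on the file.
    response['Content-Type'] = ''
    response['Content-Disposition'] = 'attachment; filename=%s' % filename
    # Nginx intercepts this header and serves the file itself
    # from the matching internal location, instead of Django streaming it.
    response['X-Accel-Redirect'] = '/protected/%s' % filename
    return response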

Create AWS sagemaker endpoint and delete the same using AWS lambda

Is there a way to create a SageMaker endpoint using AWS Lambda?
The maximum timeout limit for Lambda is 300 seconds, while my existing model takes 5-6 minutes to host.
One way is to combine Lambda and Step Functions with a wait state to create the SageMaker endpoint.
In the step function, have tasks to:
1. Launch AWS Lambda to CreateEndpoint:
import time
import boto3

client = boto3.client('sagemaker')
endpoint_name = 'DEMO-imageclassification-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
endpoint_config_name = 'DEMO-imageclassification-epc--2018-06-18-17-02-44'
print(endpoint_name)

def lambda_handler(event, context):
    create_endpoint_response = client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name)
    print(create_endpoint_response['EndpointArn'])
    print('EndpointArn = {}'.format(create_endpoint_response['EndpointArn']))
    # get the status of the endpoint
    response = client.describe_endpoint(EndpointName=endpoint_name)
    status = response['EndpointStatus']
    print('EndpointStatus = {}'.format(status))
    return status
2. A Wait task to wait for X minutes.
3. Another task with Lambda to check the EndpointStatus and, depending on the EndpointStatus (OutOfService | Creating | Updating | RollingBack | InService | Deleting | Failed), either stop the job or continue polling:
import time
import boto3

client = boto3.client('sagemaker')
endpoint_name = 'DEMO-imageclassification-2018-07-20-18-52-30'
endpoint_config_name = 'DEMO-imageclassification-epc--2018-06-18-17-02-44'
print(endpoint_name)

def lambda_handler(event, context):
    # print the status of the endpoint
    endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    status = endpoint_response['EndpointStatus']
    print('Endpoint creation ended with EndpointStatus = {}'.format(status))
    if status != 'InService':
        raise Exception('Endpoint creation failed.')
    # wait until the status has changed
    client.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
    # print the status of the endpoint
    endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    status = endpoint_response['EndpointStatus']
    print('Endpoint creation ended with EndpointStatus = {}'.format(status))
    if status != 'InService':
        raise Exception('Endpoint creation failed.')
    status = endpoint_response['EndpointStatus']
    return
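Putting the pieces together, here is a minimal sketch of the state machine that ties the two Lambdas together with a Wait state, expressed in the Amazon States Language and created with boto3. The Lambda ARNs, role ARN, and state machine name are placeholders:
import json
import boto3

sfn = boto3.client('stepfunctions')

definition = {
    "StartAt": "CreateEndpoint",
    "States": {
        "CreateEndpoint": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:create-endpoint",
            "Next": "WaitForEndpoint"
        },
        "WaitForEndpoint": {
            "Type": "Wait",
            "Seconds": 300,   # the X-minute wait between creation and the status check
            "Next": "CheckEndpointStatus"
        },
        "CheckEndpointStatus": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:check-endpoint-status",
            "End": True
        }
    }
}

sfn.create_state_machine(
    name='create-sagemaker-endpoint',
    definition=json.dumps(definition),
    roleArn='arn:aws:iam::123456789012:role/StepFunctionsExecutionRole'
)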
Another approach is a combination of AWS Lambda functions and CloudWatch rules, which I think would be clumsy.
While rajesh's answer is closer to what the question asks for, I'd like to add that SageMaker now has batch transform jobs.
Instead of continuously hosting a machine, such a job can handle predicting on large batches at once without caring about latency. So if the intention behind the question is to deploy the model for a short time to predict on a fixed amount of batches, this might be the better approach.
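For illustration, a minimal sketch of starting a batch transform job with boto3 against an existing SageMaker model; the job name, model name, bucket paths, content type and instance type are placeholders to adapt to your setup:
import boto3

client = boto3.client('sagemaker')

client.create_transform_job(
    TransformJobName='demo-batch-transform',
    ModelName='DEMO-imageclassification-model',  # an existing SageMaker model
    TransformInput={
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://my-bucket/batch-input/',
            }
        },
        'ContentType': 'application/x-image',
    },
    TransformOutput={'S3OutputPath': 's3://my-bucket/batch-output/'},
    TransformResources={
        'InstanceType': 'ml.m4.xlarge',
        'InstanceCount': 1,
    },
)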

Download entire Google Drive folder from the shared link using Google drive API

I have the shared link of a Google Drive folder. I want to download the entire folder using the Drive API. The way I am doing this currently is that I get a list of 'children' and then download all the 'children' instances into the folder. The problem is that for each 'children' instance download, a request is sent to googleapis. So, if there are 1000 files, 1000 requests are sent. I have a folder with more than 10M small files, and Google limits the number of requests to 10M per day.
I wanted to know if there is some way (any function in the Drive API) to download the entire folder at once?
Here is my code:
storage = Storage(CredentialsModel, 'id', task_request.user, 'credential')
credential = storage.get()
http = httplib2.Http()
http = credential.authorize(http)
service = build('drive', 'v2', http=http)
file_list = []

children = service.children().list(folderId=folder_id, q=q, **param).execute()
for child in children.get('items', []):
    file_list.append(child['id'])

for k in range(len(file_list)):
    curr_file = service.files().get(fileId=file_list[k]).execute()
    download_url = curr_file.get('downloadUrl')
    if download_url:
        resp, content = service._http.request(download_url)
        if resp.status == 200:
            title = curr_file.get('title')  # was "drive_file", which is undefined here
            path = title
            file = open(path, 'wb')
            file.write(content)
            file.close()