Django: upload a csv file and keep it in memory

I'm trying to build a web app using Django where the user will upload some csv file, possibly a big one. Then the code will clean the file for bad data and then the user can use it to make queries with clean data.
As it stands, I believe that whenever the user makes a query, the whole code runs again, which means the cleaning starts over every time.
Question:
Is there any way that, once the csv data is clean, it stays in memory so the user can make queries against that clean data?
import numpy as np
import pandas as pd

from django.http import JsonResponse


def converter(num):
    try:
        return float(num)
    except ValueError:
        try:
            num = num.replace("-", '0.0').replace(',', '')
            return float(num)
        except ValueError:
            return np.nan


def get_clean_data(request):
    # Read the data from the csv file:
    df = pd.read_csv("data.csv")
    # Clean the data and send a JSON response
    df['month'] = df['month'].str.split("-", expand=True)[1]
    df[df.columns[8:]] = df[df.columns[8:]].astype(str).applymap(converter)
    selected_year = df[df["Departure Datetime: Year (YYYY)"] == 2015]
    data_for_user = (selected_year.groupby(by="route")
                     .sum()
                     .sort_values(by="revenue")
                     .to_json())
    return JsonResponse(data_for_user, safe=False)

One way to achieve this could be to cache the dataframe in memory after it has been cleaned. Subsequent requests could then use the cleaned version from the cache.
import pandas as pd

from django.core.cache import cache
from django.http import JsonResponse


def get_clean_data(request):
    # Check the cache for cleaned data first
    df = cache.get('cleaned_data')
    if df is None:
        # Read the data from the csv file:
        df = pd.read_csv("data.csv")
        # Clean the data (converter() as defined in the question)
        df['month'] = df['month'].str.split("-", expand=True)[1]
        df[df.columns[8:]] = df[df.columns[8:]].astype(str).applymap(converter)
        # Put the cleaned dataframe in the cache for later requests
        cache.set('cleaned_data', df, timeout=600)
    selected_year = df[df["Departure Datetime: Year (YYYY)"] == 2015]
    data_for_user = (selected_year.groupby(by="route")
                     .sum()
                     .sort_values(by="revenue")
                     .to_json())
    return JsonResponse(data_for_user, safe=False)
You'd need to be a little bit careful, because if the csv file is very large it may consume a large amount of memory when cached.
Django supports a number of different cache backends, from simple local-memory caching to more complex backends such as memcached.
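For example, a minimal local-memory cache configuration in settings.py could look like the sketch below (the LOCATION string is just an arbitrary label, and the 600-second timeout mirrors the cache.set call above). Note that the local-memory backend is per-process, so with several worker processes each one keeps its own copy of the dataframe; a shared backend such as memcached or Redis avoids that, at the cost of pickling the dataframe over the network.
# settings.py -- minimal sketch, assuming the default cache alias is used
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
        'LOCATION': 'cleaned-data',   # arbitrary label for this cache instance
        'TIMEOUT': 600,               # default timeout, matches cache.set() above
    }
}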

Related

Why is the csv file in S3 empty after uploading from Lambda?

import os
import csv

import boto3

client = boto3.client('s3')

# dt, dh, key, value and BUCKET_NAME are defined earlier in the handler (omitted in the question)
fields = ['dt', 'dh', 'key', 'value']
row = [dt, dh, key, value]
print(row)

# name of csv file
filename = "/tmp/sns_file.csv"

# writing to csv file
with open(filename, 'a', newline='') as csvfile:
    # creating a csv writer object
    csvwriter = csv.writer(csvfile)
    # writing the fields
    csvwriter.writerow(fields)
    # writing the data row
    csvwriter.writerow(row)
    # NOTE: the upload happens while the file is still open (inside the with block)
    final_file_name = "final_report_" + dt + ".csv"
    client.upload_file('/tmp/sns_file.csv', BUCKET_NAME, final_file_name)

if os.path.exists('/tmp/sns_file.csv'):
    os.remove('/tmp/sns_file.csv')
else:
    print("The file does not exist")
Python's with block is a context manager which, in simple terms, "cleans up" after all operations inside it are done.
In the context of files, "cleaning up" means closing the file. Changes you write may not be flushed to disk until the file is closed, so you need to move the upload operation outside of, and after, the with block.
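For example, based on the code in the question (dt, row, fields and BUCKET_NAME are assumed to be defined earlier in the handler, as above):
# Write the CSV inside the with block; upload only after it has been closed.
with open(filename, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(fields)
    csvwriter.writerow(row)

# At this point the with block has closed (and flushed) the file,
# so S3 receives the complete content.
final_file_name = "final_report_" + dt + ".csv"
client.upload_file(filename, BUCKET_NAME, final_file_name)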

Django: send an Excel file to a Celery task (InMemoryUploadedFile error)

I have a background process that reads an Excel file and saves its data. The file needs to be read in the background process, but I get an InMemoryUploadedFile error.
My code:
def create(self, validated_data):
    company = ''
    file_type = ''
    email = ''
    file = validated_data['file']
    import_data.delay(file=file,
                      company=company,
                      file_type=file_type,
                      email=email)
and my task looks like this:
@app.task
def import_data(
        file,
        company,
        file_type,
        email):
    # some code
    ...
But I get an InMemoryUploadedFile error.
How can I send a file to Celery without errors?
When you delay a task, Celery tries to serialize its parameters, which in your case include a file.
Files, and especially in-memory files, can't be serialized.
So to fix the problem you have to save the file to disk, pass the file path to your delayed task, and then read the file there and do your calculations.
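As a rough sketch of that approach (the 'imports/' directory and the file_path parameter name are illustrative, and default_storage.path() assumes the default FileSystemStorage):
import os

from django.core.files.storage import default_storage


def create(self, validated_data):
    upload = validated_data['file']
    # Persist the upload to disk so the worker process can read it later.
    saved_name = default_storage.save(os.path.join('imports', upload.name), upload)
    import_data.delay(file_path=default_storage.path(saved_name),
                      company='',
                      file_type='',
                      email='')


@app.task
def import_data(file_path, company, file_type, email):
    with open(file_path, 'rb') as f:
        ...  # read the excel file and save its data
    os.remove(file_path)  # clean up once processed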
Celery does not know how to serialize complex objects such as file objects. However, this can be solved pretty easily. What I do is to encode/decode the file to its Base64 string representation. This allows me to send the file directly through Celery.
The following example shows how (I intentionally placed each conversion separately, though this could be arranged in a more pythonic way):
import base64
import os
import tempfile

# (Django, HTTP server)
file = request.FILES['files'].file
file_bytes = file.read()
file_bytes_base64 = base64.b64encode(file_bytes)
file_bytes_base64_str = file_bytes_base64.decode('utf-8')  # this is a str

# (...send the string through Celery...)

# (Celery worker task)
file_bytes_base64 = file_bytes_base64_str.encode('utf-8')
file_bytes = base64.b64decode(file_bytes_base64)

# Write the file to a temporary location; deletion is guaranteed
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_file = os.path.join(tmp_dir, 'something.zip')
    with open(tmp_file, 'wb') as f:
        f.write(file_bytes)
    # Process the file
This can be inefficient for large files but it becomes pretty handy for small/medium sized temporary files.

How to load csv file data into pandas using request.FILES (Django 1.11) without saving the file on the server

I just want to upload a .csv file via a form directly into a pandas dataframe in Django, without physically saving the file on the server.
def post(self, request, format=None):
    try:
        from io import StringIO, BytesIO
        import io
        print("data===", request.FILES['file'].read().decode("utf-8"))
        # print("file upload FILES data=====", pd.read_csv(request.FILES['file'].read(), sep=','))
        # print(request.FILES)
        print("file upload data df=====11")
        mm = pd.read_csv(BytesIO(request.FILES['file'].read().decode("utf-8")))
        print("dataframe data=====", mm)
        # import io, csv
        # urlData = request.FILES['file']
        # data = [row for row in (csv.reader(urlData))]
        # print("file upload data df=====222", data)
        # mm = pd.read_csv()
        # excel_file = request.FILES['file']
        # movies = pd.read_excel(request.FILES['file'])
    except Exception as e:
        print(e)
        log.debug("Error in CheckThreadStatus api key required " + str(e))
        return Response(responsejson('api key required', status=404))
The answer is straightforward:
pd.read_csv(request.FILES['file'])
works perfectly fine; the mistake I was making was that my csv file was not in the correct format.
Check with:
pd.read_csv('data.csv')
If you are using a POST request you can try:
getFile = request.FILES['file_name']
pd.read_csv(getFile)
You can use StringIO for reading and decoding your csv:
import csv
from io import StringIO
csv_file = request.FILES["csv_file"]
content = StringIO(csv_file.read().decode('utf-8'))
reader = csv.reader(content)
After reading, you can populate your database like this:
csv_rows = [row for row in reader]
field_names = csv_rows[0]  # Get the header row
del csv_rows[0]  # Delete the header after storing its values in field_names
for index, row in enumerate(csv_rows):
    data_dict = dict(zip(field_names, row))
    Model.objects.update_or_create(id=row[0],
                                   defaults=data_dict)
Make sure to validate the data before inserting it, especially if the data is critical.
HINT: use Django forms to do the validation for you.
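For instance, a minimal (hypothetical) form for the CSV upload could look like this; is_valid() then runs the validation before the file is parsed:
from django import forms


class CSVUploadForm(forms.Form):
    csv_file = forms.FileField()

    def clean_csv_file(self):
        f = self.cleaned_data['csv_file']
        # Basic sanity check; add size/content checks as needed.
        if not f.name.lower().endswith('.csv'):
            raise forms.ValidationError('Please upload a .csv file.')
        return f


# In the view:
# form = CSVUploadForm(request.POST, request.FILES)
# if form.is_valid():
#     csv_file = form.cleaned_data['csv_file']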

How to optimize an image (file upload) in Django before it is added to the storage location?

We are updating our backend storage for our Django project from a local disk store to an Amazon S3 bucket. Currently, we add the image, then optimize it and at a later time rsync it to our CDN. We control these steps so I just optimize after the upload and before the rsync.
We are moving to Amazon S3 and I would like to optimize the images before they are uploaded to the S3 bucket, primarily so we don't upload to S3, then download in order to optimize, and finally re-upload. Why make three trips when we can probably do this in one?
My question is this: How can we intercept the upload to optimize the file before it's pushed to the storage backend, in this case, Amazon S3.
If it helps, I am using Amazon's boto library and django-storages-redux.
I had this question in draft form and realized I had never posted it. I did not find the solution on stack overflow so I thought I would add it as a Q&A post.
The solution is to override Django's TemporaryFileUploadHandler class. I also set the maximum in-memory upload size to zero so all uploads happen on disk rather than in memory, though that might not be necessary.
# encoding: utf-8
from image_diet import squeeze
import shutil
import uuid

from django.core.files import File
from django.core.files.uploadhandler import TemporaryFileUploadHandler


class CompressImageUploadHandler(TemporaryFileUploadHandler):
    """
    Run image squeeze on our temporary file before upload to S3.
    """

    def __init__(self, *args, **kwargs):
        self.image_types = ('image/jpeg', 'image/png')
        self.file_limit = 200000
        self.overlay_fields = (
            'attribute_name',
        )
        self.skip_compress_fields = (
            'attribute_name',
        )
        super(CompressImageUploadHandler, self).__init__(*args, **kwargs)

    def compress_image(self):
        """
        For image files we need to compress them, but we need to do some
        trickery along the way. We need to close the file, pass it to
        image_diet.squeeze, then reopen the file with the same file name.
        """
        # if it's an image and small enough, squeeze it.
        if (self.file.size < self.file_limit and
                self.field_name not in self.skip_compress_fields):
            # the beginning is a good place to start.
            self.file.seek(0)
            # let's squeeze this image.
            # first, make a copy.
            file_name = self.file.name
            file_content_type = self.file.content_type
            copy_path = u"{}{}".format(
                self.file.temporary_file_path(),
                str(uuid.uuid4())[:8]
            )
            shutil.copyfile(
                self.file.temporary_file_path(),
                copy_path
            )
            # closed please. image_squeeze updates on an open file
            self.file.close()
            squeeze(copy_path)
            squeezed_file = open(copy_path)
            self.file = File(squeezed_file)
            # now reset some of the original values
            self.file.name = file_name
            self.file.content_type = file_content_type

    def screenshot_overlay(self):
        """
        Apply the guarantee_image_overlay method on screenshots.
        """
        if self.field_name in self.overlay_fields:
            # this is a custom method that adds an overlay to the uploaded
            # image if the field is in the tuple of overlay_fields
            guarantee_image_overlay(self.file.temporary_file_path())
            # we have manipulated the file, back to zero
            self.file.seek(0)

    def file_complete(self, file_size):
        """
        Return the file object, just run image_squeeze against it.
        This happens before the file object is uploaded to Amazon S3,
        while the pre_save hook happens after the Amazon upload.
        """
        self.file.seek(0)
        self.file.size = file_size
        if self.content_type in self.image_types:
            # see if we apply the screenshot overlay.
            self.screenshot_overlay()
            self.compress_image()
        return super(CompressImageUploadHandler, self).file_complete(file_size)
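For completeness, the handler still has to be registered, either globally via settings or per-view. A sketch, assuming the class lives in a hypothetical module myapp.upload_handlers:
# settings.py
# Force all uploads onto disk so the handler always has a temporary file path;
# this is what "set the maximum in-memory upload size to zero" refers to above.
FILE_UPLOAD_MAX_MEMORY_SIZE = 0

# Use the custom handler instead of Django's default handlers.
FILE_UPLOAD_HANDLERS = [
    'myapp.upload_handlers.CompressImageUploadHandler',
]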

How to update table in database while file is being transferred using django celery?

I have a task like this in Django:
from celery import task
import subprocess, celery

@celery.task
def file(password, source12, destination):
    return subprocess.Popen(['sshpass', '-p', password, 'rsync', '-avz', '--info=progress2', source12, destination],
                            stderr=subprocess.PIPE, stdout=subprocess.PIPE).communicate()[0]
This transfers file from one server to another using rsync.
Here's my view:
def sync(request):
    """Sync the files into the server with the progress bar"""
    choices = request.POST.getlist('choice')
    for i in choices:
        new_source = source + "/" + i
        start_date1 = datetime.datetime.utcnow().replace(tzinfo=utc)
        source12 = new_source.replace(' ', '')  # Remove whitespace
        result = file.delay(password, source12, destination)
        result.get()
        a = result.ready()
        start_date = start_date1.strftime("%B %d, %Y, %H:%M%p")
        extension = os.path.splitext(i)[1][1:]  # Get the file extension
        fullname = os.path.join(destination, i)  # Get the full path to calculate size
        st = int(os.path.getsize(fullname))
        f_size = size(st, system=alternative)
I want to update a table in the database and show it to the user, and the table should be updated while the file is being transferred. How can I do that using django-celery?
There is nothing too special about Celery when it comes to Django. You can just update the database like you would normally do. The only thing you may need to think about are the transactions.
Just to be safe, I would recommend using either manual commits or autocommit when updating the database. That said, I would suggest using redis/memcached instead of the database for this kind of status update; they are better suited to the purpose.
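As a rough sketch of the cache-based idea (the cache key and status value are made up for illustration, and this assumes the Django cache is backed by redis or memcached):
from django.core.cache import cache

@celery.task(bind=True)
def file(self, password, source12, destination):
    out = subprocess.Popen(
        ['sshpass', '-p', password, 'rsync', '-avz', '--info=progress2',
         source12, destination],
        stderr=subprocess.PIPE, stdout=subprocess.PIPE).communicate()[0]
    # Record a status the view (or a polling endpoint) can read back.
    cache.set('transfer_status_%s' % self.request.id, 'done', timeout=3600)
    return out

# In the view, after result = file.delay(...):
# status = cache.get('transfer_status_%s' % result.id)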