Read a Django UploadedFile into a pandas DataFrame - django

I am attempting to read a .csv file uploaded to Django into a DataFrame.
I am following the instructions on the Django REST Framework page for uploading files. When I PUT a .csv file to a defined endpoint, I end up with a Django UploadedFile object, in particular a TemporaryUploadedFile.
I am trying to read this object into a pandas DataFrame using read_csv. However, there is additional formatting around the temporary uploaded file, and I am wondering how to read the original .csv file that was uploaded.
According to the DRF docs, I have assigned:
file_obj = request.data['file']
Inside of a Python debugging console, I see:
ipdb> file_obj
<TemporaryUploadedFile: foobar.csv (multipart/form-data; boundary=--------------------------044608164241682586561733)>
Things I've tried so far.
With the original file path, I can read it into pandas like this.
dataframe = pd.read_csv(open("foobar.csv", "rb"))
However, the temporary file created during the upload contains additional metadata around the original data, so the same approach fails.
ipdb> pd.read_csv(open(file_obj.temporary_file_path(), "rb"))
*** pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 5, saw 32
If I try to use the UploadedFile.read() method, I run into the following issue.
ipdb> dataframe = pd.read_csv(file_obj.read())
*** OSError: Expected file path name or file-like object, got <class 'bytes'> type
Thanks!
P.S. The first few lines of the original file look like this.
SPID,SA_ID,UOM,DIR,DATE,RS,NAICS,APCT,1:00,2:00,3:00,4:00,5:00,6:00,7:00,8:00,9:00,10:00,11:00,12:00,13:00,14:00,15:00,16:00,17:00,18:00,19:00,20:00,21:00,22:00,23:00,0:00:00
(Blanked),123456789,KWH,R,5/2/18,H2ETOUAN,,100,0,0,0,0,0,0,0,0.144,1.064,3.07,4.531,4.013,5.205,4.751,4.647,3.142,2.464,1.173,0.023,0,0,0,0,0
(Blanked),123456789,KWH,R,3/10/18,H2ETOUAN,,100,0,0,0,0,0,0,0,0,0.007,0.622,0.179,0.003,0.274,0.167,0.014,0.004,0.028,0.139,0,0,0,0,0,0
When I look at the contents of the temporary file, I see this.
----------------------------789873173211443224653494
Content-Disposition: form-data; name="file"; filename="foobar.csv"
Content-Type: File
SPID,SA_ID,UOM,DIR,DATE,RS,NAICS,APCT,1:00,2:00,3:00,4:00,5:00,6:00,7:00,8:00,9:00,10:00,11:00,12:00,13:00,14:00,15:00,16:00,17:00,18:00,19:00,20:00,21:00,22:00,23:00,0:00:00
(Blanked),123456789,KWH,R,5/2/18,H2ETOUAN,,100,0,0,0,0,0,0,0,0.144,1.064,3.07,4.531,4.013,5.205,4.751,4.647,3.142,2.464,1.173,0.023,0,0,0,0,0
(Blanked),123456789,KWH,R,3/10/18,H2ETOUAN,,100,0,0,0,0,0,0,0,0,0.007,0.622,0.179,0.003,0.274,0.167,0.014,0.004,0.028,0.139,0,0,0,0,0,0

UploadedFile.read() returns the file data as bytes, not a file path or file-like object. To use pandas' read_csv() function, you'll need to turn those bytes into a stream. Since your file is a CSV, the most straightforward way is to use bytes.decode() with io.StringIO() (this needs import io), like:
dataframe = pd.read_csv(io.StringIO(file_obj.read().decode('utf-8')), delimiter=',')
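As a minimal sketch of the bytes-to-stream step described above, the snippet below decodes raw bytes and wraps them in an in-memory text stream. It uses the standard-library csv module so that it runs without pandas installed; the helper name rows_from_bytes and the sample payload are made up for illustration. pd.read_csv accepts file-like objects, so the same StringIO object could be passed to it instead.

```python
import csv
import io


def rows_from_bytes(raw, encoding="utf-8"):
    """Decode uploaded bytes and parse them as CSV rows (hypothetical helper)."""
    # read() on an upload returns bytes; StringIO turns the decoded text
    # into a file-like object that CSV parsers can consume.
    text_stream = io.StringIO(raw.decode(encoding))
    return list(csv.reader(text_stream))


# Made-up payload standing in for file_obj.read()
sample = b"SPID,SA_ID,UOM\n(Blanked),123456789,KWH\n"
rows = rows_from_bytes(sample)
print(rows)
```

If the encoding of the upload is not known in advance, wrapping the bytes in io.BytesIO and letting the parser handle decoding is an alternative worth trying.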

Related

Read Excel file from Memory in Django

I am trying to read an excel file from memory in django but keep getting the following error:
NotImplementedError: formatting_info=True not yet implemented
Here's the code:
from pyexcel_xls import get_data
def processdocument(file):
    print("file", file)
    data = get_data(file)
    return 1
When I read the same file from local storage, it works perfectly:
data = get_data(r"C:\Users\Rahul Sharma\Downloads\Sample PFEP (2).xlsx")
I had a workaround in mind, i.e. to save the uploaded file temporarily on the Django server's filesystem and then pass its path to the function.
Can I do that?
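The workaround described in the question could look roughly like the sketch below. It assumes the uploaded object exposes a chunks() method, as Django's UploadedFile does. The helper save_upload_to_temp and the FakeUpload stand-in are hypothetical and exist only so the example runs without Django. Whether get_data then accepts the resulting path is untested here.

```python
import os
import tempfile


def save_upload_to_temp(upload, suffix=".xlsx"):
    """Write an upload's chunks to a named temp file and return its path."""
    # delete=False keeps the file on disk after the with-block closes it,
    # so the caller is responsible for removing it afterwards.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        for chunk in upload.chunks():
            tmp.write(chunk)
    return tmp.name


class FakeUpload:
    """Stand-in for a Django UploadedFile; only chunks() is imitated."""

    def __init__(self, data, chunk_size=4):
        self._data = data
        self._chunk_size = chunk_size

    def chunks(self):
        for i in range(0, len(self._data), self._chunk_size):
            yield self._data[i:i + self._chunk_size]


path = save_upload_to_temp(FakeUpload(b"example bytes"))
with open(path, "rb") as fh:
    saved = fh.read()
os.remove(path)  # clean up once the file has been processed
```

In a view, the returned path would be handed to the parsing function and the file removed afterwards, ideally in a try/finally block.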

PYPDF watermarking returns error

Hi, I'm trying to watermark a PDF file using PyPDF2, but I get an error and can't figure out what is going wrong.
I get the following error:
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    page.mergePage(watermark.getPage(0))
  File "C:\Python27\site-packages\PyPDF2\pdf.py", line 1594, in mergePage
    self._mergePage(page2)
  File "C:\Python27\site-packages\PyPDF2\pdf.py", line 1651, in _mergePage
    page2Content, rename, self.pdf)
  File "C:\Python27\site-packages\PyPDF2\pdf.py", line 1547, in _contentStreamRename
    op = operands[i]
KeyError: 0
I'm using Python 2.7.6 with PyPDF2 1.19 on 32-bit Windows.
Hopefully someone can tell me what I'm doing wrong.
My Python file:
from PyPDF2 import PdfFileWriter, PdfFileReader
output = PdfFileWriter()
input = PdfFileReader(open("test.pdf", "rb"))
watermark = PdfFileReader(open("watermark.pdf", "rb"))
# print how many pages input1 has:
print("test.pdf has %d pages." % input.getNumPages())
print("watermark.pdf has %d pages." % watermark.getNumPages())
# add page 0 from input, but first add a watermark from another PDF:
page = input.getPage(0)
page.mergePage(watermark.getPage(0))
output.addPage(page)
# finally, write "output" to document-output.pdf
outputStream = file("outputs.pdf", "wb")
output.write(outputStream)
outputStream.close()
Try writing to a StringIO object instead of a disk file. So, replace this:
outputStream = file("outputs.pdf", "wb")
output.write(outputStream)
outputStream.close()
with this:
import StringIO  # Python 2 module; on Python 3, use io.BytesIO() instead
outputStream = StringIO.StringIO()
output.write(outputStream)  # write merged output to the StringIO object
outputStream.close()
If the above code works, then you might be having file-writing permission issues. For reference, look at the PyPDF working example in my article.
I encountered this error when attempting to use PyPDF2 to merge in a page which had been generated by reportlab, which used an inline image canvas.drawInlineImage(...), which stores the image in the object stream of the PDF. Other PDFs that use a similar technique for images might be affected in the same way -- effectively, the content stream of the PDF has a data object thrown into it where PyPDF2 doesn't expect it.
If you're able to, a solution can be to re-generate the source pdf, but to not use inline content-stream-stored images -- e.g. generate with canvas.drawImage(...) in reportlab.
Here's an issue about this on PyPDF2.

django file upload doesn't work: f.read() returns ''

I'm trying to upload and parse json files using django. Everything works great up until the moment I need to parse the json. Then I get this error:
No JSON object could be decoded: line 1 column 0 (char 0)
Here's my code. (I'm following the instructions here, and overwriting the handle_uploaded_file method.)
def handle_uploaded_file(f, collection):
    # assert False, [f.name, f.size, f.read()[:50]]
    t = f.read()
    for j in serializers.deserialize("json", t):
        add_item_to_database(j)
The weird thing is that when I uncomment the "assert" line, I get this:
[u'myfile.json', 59478, '']
So it looks like my file is getting uploaded with the right size (I've verified this on the server), but the read command seems to be failing entirely.
Any ideas?
I've seen this before. Your file has a length, but reading it returns nothing. I'm wondering if it has already been read, which would leave the file position at the end... try this:
f.seek(0)
f.read()
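The behaviour behind this suggestion can be reproduced with any file-like object. The sketch below uses io.BytesIO as a stand-in for the uploaded file, and the JSON payload is made up for illustration.

```python
import io

# Stand-in for an uploaded file whose contents were already consumed once
stream = io.BytesIO(b'{"key": "value"}')

first = stream.read()    # reads everything and leaves the position at the end
second = stream.read()   # nothing left to read, so this is empty
stream.seek(0)           # rewind to the beginning
third = stream.read()    # the full contents are available again

print(first, second, third)
```

If some earlier code (validation, logging, a debugger expression) called read() on the upload, a later read() would come back empty in just this way.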

How can I get the temporary name of an UploadedFile in Django?

I'm doing some file validation and want to load an UploadedFile into an external library while it is in the '/tmp' directory before I save it somewhere that it can be executed. Django does the following:
Django will write the uploaded file to a temporary file stored in your system's temporary directory. On a Unix-like platform this means you can expect Django to generate a file called something like /tmp/tmpzfp6I6.upload.
It is the "tmpzfp6I6.upload" that I want to be able to get my hands on. UploadedFile.name gives me "" while file.name gives me the proper name of the file "example.mp3".
With the library I am using, I need to pass the filepath of the temporary file to the library, rather than the file itself and so, need the string.
Any ideas?
Thanks in advance.
EDIT: Here's my code:
from django.core.files.uploadedfile import UploadedFile
class SongForm(forms.ModelForm):
    def clean_audio_file(self):
        file = self.cleaned_data.get('audio_file',False)
        if file:
            [...]
            if file._size > 2.5*1024*1024:
                try:
                    #The following two lines are where I'm having trouble, MP3 takes the path to file as input.
                    path = UploadedFile.temporary_file_path
                    audio = MP3('%s' %path)
                except HeaderNotFoundError:
                    raise forms.ValidationError("Cannot read file")
        else:
            raise forms.ValidationError("Couldn't read uploaded file")
        return file
Using "UploadedFile" I get an AttributeError "type object 'UploadedFile' has no attribute 'temporary_file_path'". If I instead use file.temporary_file_path (just throwing darts in the dark here) I get an IOError:
[Errno 2] No such file or directory: 'bound method TemporaryUploadedFile.temporary_file_path of >'
I realize temporary_file_path is the solution I'm looking for, I just can't figure out how to use it and neither the docs nor google seem to be much help in this particular instance.
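UploadedFile.temporary_file_path()
Only files uploaded onto disk will have this method; it returns the full path to the temporary uploaded file.
Because it is a method, it has to be called on the uploaded file instance rather than referenced on the class: in the form above, that means file.temporary_file_path(), with the parentheses.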
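The difference between referencing the method and calling it can be seen in the sketch below. FakeTemporaryUpload is a hypothetical stand-in that imitates only temporary_file_path(), so the example runs without Django. The hasattr check reflects the note above that only uploads stored on disk have the method.

```python
import os
import tempfile


class FakeTemporaryUpload:
    """Hypothetical stand-in imitating only temporary_file_path()."""

    def __init__(self, data):
        fd, self._path = tempfile.mkstemp(suffix=".upload")
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)

    def temporary_file_path(self):
        return self._path


upload = FakeTemporaryUpload(b"fake mp3 bytes")

not_a_path = upload.temporary_file_path   # bound method object, not a string
path = upload.temporary_file_path()       # calling it yields the actual path

# In-memory uploads lack the method, so check before relying on it
has_disk_path = hasattr(upload, "temporary_file_path")

size = os.path.getsize(path)
os.remove(path)
```

Formatting the uncalled method with '%s' yields the "bound method ..." text seen in the IOError, which is why the library could not find a file at that "path".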

Getting "new-line character seen in unquoted field" when parsing csv document using django-storages

I am trying to parse csv files that have been uploaded to Amazon S3 using django-storages. I keep getting an "Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?". The normal workaround for this is to open the file with "rU", but that does not seem to work with django-storages. If I drop the file directly on the server and open it from there, it works, but I want to avoid storing the files directly on the server if possible. Here is the code I am using:
import csv
from django.core.files.storage import default_storage as s3_storage
n = 'csvdumps/130331548894.csv'
csvf = s3_storage.open(n, "rU")
csvReader = csv.reader(csvf)
for item in csvReader:
    print item
I can see that this is a reported django-storages bug (http://jgrid.org/david/django-storages/issue/80/trying-to-parse-csv-file-from-django), but perhaps you can try splitting the file contents into lines yourself before handing them to the reader:
csvf = s3_storage.open(n, "r")
csvReader = csv.reader(csvf.read().splitlines())
It would also be great if you could share a link to some of your (sample) S3 csv files, so I can open them and check the line endings.
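On Python 3, where the "rU" mode is no longer the usual route, one possible approach is to wrap the binary stream in io.TextIOWrapper with newline="", which is the newline setting the csv module documentation recommends for files passed to csv.reader. The sketch below uses io.BytesIO with carriage-return-only line endings as a stand-in for the storage file. The helper name read_csv_rows is made up, and whether a django-storages file object can be wrapped this way is an assumption that has not been verified here.

```python
import csv
import io


def read_csv_rows(binary_stream, encoding="utf-8"):
    """Parse CSV rows from a binary file-like object (hypothetical helper)."""
    # newline="" lets the text layer split on \r, \n or \r\n without
    # translating them, leaving line-ending handling to the csv module.
    text = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    return list(csv.reader(text))


# Made-up data using old Mac-style "\r" line endings
mac_style = io.BytesIO(b"id,name\r1,alpha\r2,beta\r")
rows = read_csv_rows(mac_style)
print(rows)
```

Checking the raw bytes of a sample file for "\r" without "\n" would confirm whether line endings are the cause in the first place.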