Django: how to progressively write data into a huge zipfile FileField

I have a Django model like this:
class Todo(models.Model):
    big_file = models.FileField(blank=True)
    status = models.PositiveSmallIntegerField(default=0)
    progress = models.IntegerField(default=0)
I'd like to do two operations:
first make an empty zipfile out of big_file (less important)
and then progressively add files into my zipfile (and save it iteratively)
The overall process would look like this:
from django.core.files.base import File
import io, zipfile

def generate_data(todo):
    io_bytes = io.BytesIO(b'')
    # 1. save an empty Zip archive:
    with zipfile.ZipFile(io_bytes, 'w') as zip_fd:
        todo.big_file.save('heavy_file.zip', File(zip_fd))
    # 2. Progressively fill the Zip archive:
    with zipfile.ZipFile(io_bytes, 'w') as zip_fd:
        for filename, data_bytes in long_iteration(todo):
            with zip_fd.open(filename, 'w') as in_zip:
                in_zip.write(data_bytes)
            if condition(something):
                todo.big_file.save()  # that does not work
                todo.status = 1
                todo.progress = 123
                todo.save()
    todo.status = 2
    todo.save()
But I can't figure out the right combination of file descriptor / file-like object / file path / Django File object ...
And it seems that in Django I always have to call save(filename, content). But my content could be gigabytes, so it does not seem reasonable to load it all into a "content" variable?

OK, I found the following solution myself: first create an empty file, and then use the <my_file_field>.path attribute:
from django.core.files.base import ContentFile
import zipfile

def generate_data(todo):
    # 1. save an empty Zip archive:
    todo.big_file.save('filename.zip', ContentFile(''))
    with zipfile.ZipFile(todo.big_file.path, 'w') as zip_fd:
        pass
    # 2. Progressively fill the Zip archive:
    with zipfile.ZipFile(todo.big_file.path, 'w') as zip_fd:
        ... # do the stuff
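Putting the two steps together, here is a minimal sketch of how the progressive fill could look. It assumes a local FileSystemStorage (so that .path is available) and keeps the hypothetical long_iteration() and condition() helpers from the question; the archive is opened in append mode so entries written earlier are kept if it is reopened:
import zipfile
from django.core.files.base import ContentFile

def generate_data(todo):
    # 1. Create an empty zip file on the storage backend.
    todo.big_file.save('filename.zip', ContentFile(b''))

    # 2. Append entries directly to the file on disk via its path;
    #    mode 'a' preserves entries already written to the archive.
    with zipfile.ZipFile(todo.big_file.path, 'a') as zip_fd:
        for filename, data_bytes in long_iteration(todo):  # hypothetical generator
            zip_fd.writestr(filename, data_bytes)
            if condition(todo):  # hypothetical progress check
                # The bytes are already on disk; only update the metadata fields.
                todo.status = 1
                todo.progress += 1
                todo.save(update_fields=['status', 'progress'])

    todo.status = 2
    todo.save(update_fields=['status'])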

Related

Pandas closing django-like file, gives ValueError: I/O operation on closed file when uploading

In my process I need to upload a file to Django as:
newFile = request.FILES['file']
Then in another big function I open it with pandas:
data = pandas.read_csv(data_file, engine = 'python', header=headers_row, encoding = 'utf-8-sig')
and then I need to upload it:
uploaded_file = Uploaded_file(file = newFile, retailer = ret, date = date)
But randomly (like 50/50) I get a ValueError: I/O operation on closed file.
Any solution to this? Is it possible to open the file again, or maybe make a copy of it and use pandas on one and upload the other?
I tried the latter, but I'm not sure of the implications of going this route:
from io import BytesIO
output = BytesIO(newFile.file.read())
For now it works, but I'd appreciate any input on this.
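A minimal sketch of that copy approach, assuming the uploaded file fits in memory and reusing the names from the question (newFile, headers_row, ret, date and the Uploaded_file model): read the upload once, give pandas its own buffer, and rewind the original before saving it.
import io
import pandas

raw = newFile.read()  # newFile is request.FILES['file'], read once into memory

# Give pandas an independent buffer that it can consume and close freely.
data = pandas.read_csv(io.BytesIO(raw), engine='python',
                       header=headers_row, encoding='utf-8-sig')

# Rewind the original upload before handing it to the model field,
# otherwise Django may save an empty, already-consumed file.
newFile.seek(0)
uploaded_file = Uploaded_file(file=newFile, retailer=ret, date=date)
uploaded_file.save()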

Reading multiple files in a directory with pyyaml

I'm trying to read all YAML files in a directory, but I am having trouble, partly because I am using Python 2.7 (and I cannot change to 3) and all of my files are UTF-8 (and they need to stay that way).
import os
import yaml
import codecs

def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data

def yaml_dump(filepath, data):
    with open(filepath, 'w') as file_descriptor:
        yaml.dump(data, file_descriptor)

if __name__ == "__main__":
    filepath = os.listdir(os.getcwd())
    data = yaml_reader(filepath)
    print data
When I run this code, python gives me the message:
TypeError: coercing to Unicode: need string or buffer, list found.
I want this program to show the content of the files. Can anyone help me?
I guess the issue is with filepath: os.listdir(os.getcwd()) returns the list of all files in the directory, so you are passing a list to codecs.open() instead of a filename.
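A minimal sketch of that fix, assuming every file of interest ends in .yaml or .yml; note that yaml.load_all() is lazy, so the generator is turned into a list while the file is still open:
import os
import codecs
import yaml

if __name__ == "__main__":
    for filename in os.listdir(os.getcwd()):
        if not filename.endswith(('.yaml', '.yml')):
            continue
        with codecs.open(filename, "r", encoding='utf-8') as file_descriptor:
            # consume the lazy load_all() generator while the file is open
            documents = list(yaml.load_all(file_descriptor))
        print documents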
There are multiple problems with your code, apart from the fact that it is invalid Python in the way you formatted it here. Your reader explicitly decodes the files:
def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data
However, it is not necessary to do the decoding; PyYAML is perfectly capable of processing UTF-8:
def yaml_reader(filepath):
    with open(filepath, "rb") as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data
I hope you realise that you are trying to load multiple documents, so you always get a list back in data, even if your file contains only one document.
Then the line:
filepath = os.listdir(os.getcwd())
gives you a list of files, so you need to do:
filepath = os.listdir(os.getcwd())[0]
or decide in some other way which of the files you want to open. If you want to combine all files (assuming they are YAML) into one big YAML file, you need to do:
if __name__ == "__main__":
    data = []
    for filepath in os.listdir(os.getcwd()):
        data.extend(yaml_reader(filepath))
    print data
And your dump routine would need to change to:
def yaml_dump(filepath, data):
    with open(filepath, 'wb') as file_descriptor:
        yaml.dump(data, file_descriptor, allow_unicode=True, encoding='utf-8')
However, this all brings you to the biggest problem: you are using PyYAML, which will mangle your YAML, dropping flow style, comments, anchor names, special ints/floats, quotes around scalars, etc. Apart from that, PyYAML has not been updated to support YAML 1.2 documents (which has been the standard since 2009). I recommend you switch to ruamel.yaml (disclaimer: I am the author of that package), which supports YAML 1.2 and leaves comments etc. in place.
And even if you are bound to Python 2, you should use Python 3 like syntax, e.g. for print, which you can get with from __future__ imports.
So I recommend you do:
pip install pathlib2 ruamel.yaml
and then use:
from __future__ import absolute_import, unicode_literals, print_function
from pathlib import Path
from ruamel.yaml import YAML

if __name__ == "__main__":
    data = []
    yaml = YAML()
    yaml.preserve_quotes = True
    for filepath in Path('.').glob('*.yaml'):
        data.extend(yaml.load_all(filepath))
    print(data)
    yaml.dump(data, Path('your_output.yaml'))

Python 2.7 and PrettyTables

I am trying to get PrettyTable to work with the following script. I can get it almost to look right, but it keeps separating my tables, so it is printing 16 separate tables. I need all the information in one table that I can sort. I appreciate all the help I can get.
import sys
import os
import datetime
import hashlib
import logging

def getScanPath(): #12
    # Prompt User for path to scan
    path = raw_input('Please enter the directory to scan: ')
    # Verify that the path is a directory
    if os.path.isdir(path):
        return path
    else:
        sys.exit('Invalid File Path ... Script Aborted')

def getFileList(filePath):
    # Create an empty list to hold the resulting files
    pathList = []
    # Get a list of files, note these will be just the names of the files
    # NOT the full path
    simpleFileNameList = os.listdir(filePath)
    # Now process each filename in the list
    for eachFile in simpleFileNameList:
        # 1) Get the full path by joining the directory with the filename
        fullPath = os.path.join(filePath, eachFile)
        # 2) Make sure the full path is an absolute path
        absPath = os.path.abspath(fullPath)
        # 3) Make sure the absolute path is a file, i.e. not a folder or directory
        if os.path.isfile(absPath):
            # 4) if all is well, add the absolute path to the list
            pathList.append(absPath)
        else:
            logging.error('A Non-File has been identified')
    # 5) Once all files have been identified, return the list to the caller
    return pathList

def getFileName(theFile):
    return os.path.basename(theFile)

def getFileSize(theFile):
    return os.path.getsize(theFile)

def getFileLastModified(theFile):
    return os.path.getmtime(theFile)

def getFileHash(theFile):
    hash_md5 = hashlib.md5()
    with open(theFile, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

# Main Script Starts Here
if __name__ == '__main__':
    # Welcome Message
    print "\nWelcome to the file scanner\n"
    # prompt user for directory path
    scanPath = getScanPath()
    # Get a list of files with full path
    scanFileList = getFileList(scanPath)
    # Output Filenames
    print "Files found in directory"
    for eachFilePath in scanFileList:
        fileName = getFileName(eachFilePath)
        fileSize = getFileSize(eachFilePath)
        lastModified = getFileLastModified(eachFilePath)
        hashValue = getFileHash(eachFilePath)
        fileModified = (datetime.datetime.fromtimestamp(lastModified))
        from prettytable import PrettyTable
        pTable = PrettyTable()
        pTable.field_names = ["File Name", "File Size", "Last Modified", "Md5 Hash Value"]
        pTable.add_row([fileName, fileSize, fileModified, hashValue])
        print(pTable)
This should show me one big table using all the values from a set directory that the user chooses. That will allow me to sort the table later using PrettyTable.
I have no experience with PrettyTable, but I noticed you have lastModified and fileModified, yet only fileModified is used for a column in your table. Are you sure PrettyTable doesn't have some kind of row limit?
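For what it's worth, the 16 separate tables most likely come from creating and printing the PrettyTable inside the for loop, so each file gets its own table. A minimal sketch of the main block with the table created once, reusing the helper functions defined in the question:
import datetime
from prettytable import PrettyTable

if __name__ == '__main__':
    print "\nWelcome to the file scanner\n"
    scanPath = getScanPath()
    scanFileList = getFileList(scanPath)
    print "Files found in directory"

    # Build the table once, then add one row per file inside the loop.
    pTable = PrettyTable()
    pTable.field_names = ["File Name", "File Size", "Last Modified", "Md5 Hash Value"]
    for eachFilePath in scanFileList:
        fileModified = datetime.datetime.fromtimestamp(getFileLastModified(eachFilePath))
        pTable.add_row([getFileName(eachFilePath), getFileSize(eachFilePath),
                        fileModified, getFileHash(eachFilePath)])
    # Print the single combined table after all files are processed.
    print pTable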

How to get python to read all images in a directory one by one

My experience with Python is very limited, so I don't fully understand what the code does in this instance. This is part of the code from the TensorFlow for Poets lab.
import os, sys
import tensorflow as tf
import numpy as np
from PIL import Image

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

# change this as you see fit
image_path = sys.argv[1]

# Read in the image_data
image_data = tf.gfile.FastGFile(image_path, 'rb').read()
image = Image.open(image_path)
image_array = image.convert('RGB')

# Loads label file, strips off carriage return
label_lines = [line.rstrip() for line
               in tf.gfile.GFile("retrained_labels.txt")]

# Unpersists graph from file
with tf.gfile.FastGFile("retrained_graph.pb", 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    _ = tf.import_graph_def(graph_def, name='')

with tf.Session() as sess:
    # Feed the image_data as input to the graph and get first prediction
    softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
    predictions = sess.run(softmax_tensor, {'DecodeJpeg:0': image_array})
    # Sort to show labels of first prediction in order of confidence
    top_k = predictions[0].argsort()[-len(predictions[0]):][::-1]
    for node_id in top_k:
        human_string = label_lines[node_id]
        score = predictions[0][node_id]
        print('%s (score = %.5f)' % (human_string, score))
    filename = "results.txt"
    with open(filename, 'a+') as f:
        f.write('\n**%s**\n' % (image_path))
        for node_id in top_k:
            human_string = label_lines[node_id]
            score = predictions[0][node_id]
            f.write('%s (score = %.5f)\n' % (human_string, score))
I want the above code to read in a directory instead of a single image and then process them all and output the scores to the results.txt file.
Currently I can call this like so:
python this_file.py /root/images/1.jpg
How would I get this code to take the following input and process it?
python this_file.py /root/images/
Use os.listdir to list all files in the directory. Qualify it with a filter as well. Join the resulting files to their directory. Read them from the list with a for loop.
python this_file.py /root/images/
image_path = sys.argv[1]
image_paths = [os.path.join(image_path,img) for img in os.listdir(image_path) if '.jpg' in img]
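A hedged sketch of that loop, assuming the graph import, label_lines and the results.txt writing stay as in the question's script; the session is created once and only the per-image work moves inside the loop:
image_dir = sys.argv[1]
image_paths = [os.path.join(image_dir, img)
               for img in os.listdir(image_dir) if img.lower().endswith('.jpg')]

with tf.Session() as sess:
    softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
    for image_path in image_paths:
        # per-image work: load, classify, append scores to results.txt
        image_array = Image.open(image_path).convert('RGB')
        predictions = sess.run(softmax_tensor, {'DecodeJpeg:0': image_array})
        top_k = predictions[0].argsort()[-len(predictions[0]):][::-1]
        with open("results.txt", 'a+') as f:
            f.write('\n**%s**\n' % image_path)
            for node_id in top_k:
                f.write('%s (score = %.5f)\n' % (label_lines[node_id],
                                                 predictions[0][node_id]))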
I also recommend re-examining your training function and strategy. It is also good practice to abstract your entire network with tf variable placeholders as far as you can. In addition, it would be much more efficient to implement batching, as well as possibly converting your dataset to TFRecords.

Looping CSV Concat in Python Pandas

I have multiple folders, each containing CSVs. I am trying to concat the CSVs in each subdirectory and then export the result, so at the end I would have the same number of outputs as folders: Folder1.csv, Folder2.csv, ... Folder99.csv, etc. This is what I have so far:
import os
from glob import glob
import pandas as pd
import numpy as np

rootDir = 'D:/Data'
OutDirectory = 'D:/OutPut'
os.chdir(rootDir)

# The directory has folders as follows
# D:/Data/Folder1
# D:/Data/Folder2
# D:/Data/Folder3
# ....
# .....
# D:/Data/Folder99
# Each folder (Folder1, Folder2, ... etc.) has many csvs.

frame = pd.DataFrame()
list_ = []
for (dirname, dirs, files) in os.walk(rootDir):
    for filename in files:
        if filename.endswith('.csv'):
            df = pd.read_csv(filename, index_col=None, na_values=['-999'], delim_whitespace=True, header=0, skiprows=2)
            OutFile = '%s.csv' % OutputFname
            list_.append(df)
            frame = pd.concat(list_)
            df.to_csv(OutDirectory+OutFile, sep=',', header=True)
I am getting the following error:
IOError: File file200150101.csv does not exist
You need to concatenate dirname and filename for a full path to your files. Change this line like so:
df = pd.read_csv(os.path.join(dirname, filename), index_col=None, na_values=['-999'], delim_whitespace=True, header=0, skiprows=2)
Edit:
I don't know how pandas works because I have never used it. But I think your problem is that you defined everything you wanted done to the CSVs in the inner loop, which loops over files only (at least the indentation looks that way, but that could also be a formatting problem that occurred when you pasted your code here on SO).
I rewrote your code and fixed some things that I think might be the problem:
First, I renamed your variables that start with capital letters because, for me, it always looks weird to have variables with capitalized names.
I moved your list variable into the outer loop because it should be reset every time you enter a new directory, as you want the CSVs to be merged per folder.
And finally, I fixed the indentation. In Python, indentation tells the interpreter which commands are in the inner or outer loop.
My code now looks like this. You might have to change some things because I can't test it right now:
import os
from glob import glob
import pandas as pd
import numpy as np

rootDir = 'D:/Data'
outDir = 'D:/OutPut'
os.chdir(rootDir)
dirs = os.listdir(rootDir)
frame = pd.DataFrame()

for dirname in dirs:
    # the outer loop loops over directories! the actual directory is stored in dirname
    list = []  # collect csv data for every directory, not in general
    files = glob('%s/*.csv' % (dirname))
    for filename in files:
        # the inner loop loops over the files in the 'dirname' folder
        df = pd.read_csv(filename, index_col=None, na_values=['-999'], delim_whitespace=True, header=0, skiprows=2)
        # all csv data should be in 'list' now
        outFile = '%s.csv' % dirname  # define the name for the output csv
        list.append(df)  # do that for every file
    # at this point, all files in the actual directory were processed
    frame = pd.concat(list)  # and then merge CSVs
    # ...actually not sure how pd.concat works, but I guess it does merge the data
    frame.to_csv(os.path.join(outDir, outFile), sep=',', header=True)  # save the data