I have a Django/uwsgi/nginx stack running on CentOS. When uploading a large file to Django (1GB+), I expect it to create a temp file in /tmp, and I should be able to watch it grow as the upload progresses. However, I don't see one: ls -lah /tmp doesn't show any new files being created or changing in size. I even set FILE_UPLOAD_TEMP_DIR = '/tmp' explicitly in my settings.py, but still nothing.
I'd appreciate any help in tracking down where the temp files are stored. I need this to determine whether there are any large uploads in progress.
They are stored in your system's temp directory. From https://docs.djangoproject.com/en/dev/topics/http/file-uploads/?from=olddocs:
Where uploaded data is stored
Before you save uploaded files, the data needs to be stored somewhere.
By default, if an uploaded file is smaller than 2.5 megabytes, Django
will hold the entire contents of the upload in memory. This means that
saving the file involves only a read from memory and a write to disk
and thus is very fast.
However, if an uploaded file is too large, Django will write the
uploaded file to a temporary file stored in your system's temporary
directory. On a Unix-like platform this means you can expect Django to
generate a file called something like /tmp/tmpzfp6I6.upload. If an
upload is large enough, you can watch this file grow in size as Django
streams the data onto disk.
These specifics -- 2.5 megabytes; /tmp; etc. -- are simply "reasonable
defaults". Read on for details on how you can customize or completely
replace upload behavior.
Additionally, this only happens once the upload exceeds a given size, which defaults to 2.5 MB:
FILE_UPLOAD_MAX_MEMORY_SIZE The maximum size, in bytes, for files that
will be uploaded into memory. Files larger than
FILE_UPLOAD_MAX_MEMORY_SIZE will be streamed to disk.
Defaults to 2.5 megabytes.
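Putting those two settings together, a minimal settings.py fragment for the setup described in the question might look like this (the values shown are just the documented default and the question's own /tmp choice):
# settings.py
FILE_UPLOAD_TEMP_DIR = '/tmp'            # where disk-backed uploads are written
FILE_UPLOAD_MAX_MEMORY_SIZE = 2621440    # 2.5 MB; anything larger is streamed to disk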
I just tracked this down on my OS X system with Django 1.4.1.
In django/core/files/uploadedfile.py, a temporary file is created using django.core.files.temp, imported as tempfile
from django.core.files import temp as tempfile
This simply returns Python's standard tempfile.NamedTemporaryFile unless it's running on Windows.
To see the location of the tempdir, you can run this command at the shell:
python -c "import tempfile; print tempfile.gettempdir()"
On my system right now it outputs /var/folders/9v/npjlh_kn7s9fv5p4dwh1spdr0000gn/T, which is where I found my temporary uploaded file.
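If you want to confirm the location from inside a view rather than at the shell, a small sketch (view and form field names are made up) can use the temporary_file_path() method that Django's TemporaryUploadedFile exposes for disk-backed uploads:
from django.http import HttpResponse

def upload_view(request):
    f = request.FILES['file']                # hypothetical form field name
    if hasattr(f, 'temporary_file_path'):
        # Large upload: streamed to disk as a TemporaryUploadedFile.
        location = f.temporary_file_path()   # e.g. /tmp/tmpzfp6I6.upload
    else:
        # Small upload: held entirely in memory, no temp file to watch.
        location = 'held in memory'
    return HttpResponse(location)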
I am creating a very big file that cannot fit in memory directly. So I have created a bunch of small files in S3, and I am writing a script that can read these files and merge them. I am using AWS Wrangler to do this.
My code is as follows:
try:
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
    for df in dfs:
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file, and I can't remove chunked=True because otherwise my program fails with an OOM error.
How do I make this write a single file to S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag (or setting it to False) should do the trick, as long as you specify a full path (see the sketch below).
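A minimal sketch of that suggestion (assuming awswrangler is imported as wr; the bucket and key are placeholders): with dataset=False, each call writes exactly one object at the given key.
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3]})    # stand-in for one chunk from the loop above

# With dataset=False (the default), path is the exact object key to write.
wr.s3.to_parquet(df=df, path='s3://my-bucket/merged/output.parquet')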
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset to be true or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for parquet in general. As per this SO post it's not possible in a local folder, let alone S3. To add to this, parquet is compressed, and I don't think it would be easy to add a line to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to one parquet file unless you get a machine with enough RAM (a rough sketch of that idea is below).
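A rough sketch of that batching idea (paths, suffix filter, and batch size are made up; assumes the merged data for one batch fits in memory):
import awswrangler as wr

keys = wr.s3.list_objects('s3://my-bucket/small-files/', suffix='.parquet')
batch_size = 50

for i in range(0, len(keys), batch_size):
    # Read one batch of small files into a single DataFrame...
    batch_df = wr.s3.read_parquet(path=keys[i:i + batch_size])
    # ...and rewrite it as one bigger file. The result is still several
    # files, just far fewer of them.
    wr.s3.to_parquet(df=batch_df,
                     path=f's3://my-bucket/merged/part-{i // batch_size}.parquet')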
I searched but haven't found a satisfying solution.
Minio/S3 does not have directories, only keys (with prefixes). So far so good.
Now I am in the need to change those prefixes. Not for a single file, but for a whole bunch (a lot) of files, which can be really large (there is actually no limit).
Unfortunately, these storage servers do not seem to have a concept of (and do not support):
rename file
move file
What has to be done instead, for each file, is the following (a small sketch follows the list):
copy the file to the new target location
delete the file from the old source location
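A minimal sketch of that per-file copy-then-delete step, using boto3 against an S3-compatible endpoint (the endpoint, credential handling, bucket, and key names are all placeholders; the question itself uses the Minio dotnet library):
import boto3

s3 = boto3.client('s3', endpoint_url='http://minio.local:9000')   # placeholder endpoint

def move_object(bucket, src_key, dst_key):
    # There is no rename: copy the object server-side to the new key,
    # then delete the original key.
    s3.copy_object(Bucket=bucket, Key=dst_key,
                   CopySource={'Bucket': bucket, 'Key': src_key})
    s3.delete_object(Bucket=bucket, Key=src_key)

move_object('bucketname', 'uploads/filename.ext', 'processed/jobid/filename.ext')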
My current design looks like this:
users upload files to bucketname/uploads/filename.ext
a background process takes the uploaded files, generates some more files and uploads them to bucketname/temp/filename.ext
when all processings are done the uploaded file and the processed files are moved to bucketname/processed/jobid/new-filenames...
The path prefix is used when handling the object-created notification to differentiate whether it is an upload (start processing), temp (check whether all files are uploaded), or processed/jobid (hold the files until the user deletes them).
Imagine a task where 1000 files have to be moved to a new location (within the same bucket): copying and deleting them one by one leaves a lot of room for errors, e.g. running out of storage space during the copy operation, or connection errors, with no chance of a rollback. It doesn't get easier if the locations are in different buckets.
So, given this existing design and no way to rename/move a file:
Is there any chance of copying the files without creating new physical objects (i.e. without duplicating the used storage space)?
Could any experienced cloud developer please give me a hint on how to do this bulk copy with rollbacks in error cases?
Has anyone implemented something like that, with a functional rollback mechanism if e.g. file 517 of 1000 fails? Copying everything and deleting it back does not seem to be the way to go.
Currently I am using the Minio server and the Minio dotnet library. But since they are compatible with Amazon S3, this scenario could just as well happen on Amazon S3.
I am trying to download some files via mechanize. Files smaller than 1GB are downloaded without causing any trouble. However, if a file is bigger than 1GB the script runs out of memory:
The mechanize_response.py script throws an out-of-memory error at the following line:
self.__cache.write(self.wrapped.read())
__cache is a cStringIO.StringIO. It seems that it cannot handle more than 1GB.
How to download files larger than 1GB?
Thanks
It sounds like you are trying to download the file into memory, but you don't have enough. Try using the retrieve method with a file name to stream the downloaded file to disk (a sketch is below).
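A sketch of that suggestion (URL and filename are placeholders); if I remember the interface correctly, mechanize's Browser exposes a urlretrieve-style retrieve() that writes the response to a file:
import mechanize

br = mechanize.Browser()
# Passing a filename asks mechanize to write the body to that file rather
# than returning it; this mirrors urllib's urlretrieve interface.
br.retrieve('http://example.com/huge-file.zip', 'huge-file.zip')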
I finally figured out a work around.
Rather than using browser.retrieve or browser.open, I used mechanize.urlopen, which returns a urllib2-style response. This allowed me to download files larger than 1GB (see the sketch below).
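A sketch of that workaround (URL and filename are placeholders): read the response in chunks and write them to disk, so the whole body is never held in memory at once.
import mechanize

response = mechanize.urlopen('http://example.com/huge-file.zip')
with open('huge-file.zip', 'wb') as out:
    while True:
        chunk = response.read(1024 * 1024)   # 1 MB at a time
        if not chunk:
            break
        out.write(chunk)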
I am still interested in figuring out how to make retrieve work for files larger than 1GB.
I understand that Django's file upload handler by default stores files smaller than 2.5MB in memory and those above that size in a temp folder on disk.
In my models, where I have a file field, I have specified the upload_to folder where I expect the files to be written.
However, when I try reading these files from that folder, I get an error implying that the files do not yet exist there.
How do I force Django to write the files to the folder specified in upload_to before another procedure starts reading from it?
I know I can read the files directly from memory via request.FILES['file'].name, but I would rather force the files to be written from memory to the folder before I read them.
Any insights will be highly appreciated.
The FILE_UPLOAD_MAX_MEMORY_SIZE setting tells Django the maximum size, in bytes, of a file to keep in memory. Set it to 0 and uploads will always be written to disk (see the sketch below).
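A sketch of that suggestion in settings.py (the FILE_UPLOAD_HANDLERS alternative is not from the answer above, just another documented Django option with a similar effect of keeping uploads out of memory):
# settings.py
FILE_UPLOAD_MAX_MEMORY_SIZE = 0   # stream every upload to a temp file on disk

# Alternative: drop the in-memory handler entirely.
# FILE_UPLOAD_HANDLERS = ['django.core.files.uploadhandler.TemporaryFileUploadHandler']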
I have a rather large ZIP file, which gets downloaded (I cannot change the file). The task now is to unzip the file while it is downloading, instead of having to wait until the end-of-central-directory record has been received.
Does such a library exist?
I wrote "pinch" a while back. It's in Objective-C but the method to decode files from a zip might be a way to get it in C++? Yeah, some coding will be necessary.
http://forrst.com/posts/Now_in_ObjC_Pinch_Retrieve_a_file_from_inside-I54
https://github.com/epatel/pinch-objc
I'm not sure such a library exists. Unless you are on a very fast line [or have a very slow processor], it's unlikely to save you a huge amount of time. Decompressing several gigabytes only takes a few seconds if all the data is in ram [it may then take a while to write the uncompressed data to the disk, and loading it from the disk may add to the total time].
However, assuming the sending end supports "range" downloading, you could possibly write something that downloads the directory first [by reading the fixed header first, then reading the directory, and then downloading the rest of the file from start to finish]. Presumably that's how the "pinch" project linked in epatel's answer works. A rough sketch of that idea follows.
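A rough Python sketch of that range-based idea (not the pinch code; it assumes the server honours HTTP Range requests and reports Content-Length, that the requests library is available, and that the archive is a plain single-disk zip; the URL is a placeholder): fetch the tail of the file first, locate the end-of-central-directory record, then read the central directory to learn each entry's offset before the rest of the body has arrived.
import struct
import requests

URL = 'https://example.com/big-archive.zip'   # placeholder

total = int(requests.head(URL).headers['Content-Length'])

# The end-of-central-directory record sits in the last 22..65557 bytes.
tail_start = max(0, total - 65557)
tail = requests.get(URL, headers={'Range': 'bytes=%d-%d' % (tail_start, total - 1)}).content

eocd = tail.rfind(b'PK\x05\x06')              # EOCD signature
assert eocd != -1, 'EOCD record not found in the tail'
cd_size, cd_offset = struct.unpack('<II', tail[eocd + 12:eocd + 20])

# Second range request: the central directory itself, which lists the local
# header offset of every entry, so individual files can be located and decoded
# while the bulk of the archive is still downloading.
cd = requests.get(URL, headers={'Range': 'bytes=%d-%d' % (cd_offset, cd_offset + cd_size - 1)}).content
print('central directory: %d bytes at offset %d' % (cd_size, cd_offset))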