Django FileWrapper object: How to hook in a clean-up action

Suppose the django.core.servers.basehttp.FileWrapper class is used to serve back content from a temporary file.
When the client completes the file download, the temporary file needs to be deleted.
How can one hook into the FileWrapper object, to perform such a clean-up action?

If you are running on a Unix system, unlink the temporary file right after opening it. The disk space will be freed once FileWrapper closes the file handle at the end of the download.
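For illustration, here is a minimal sketch of that idea in a Django view (the view name, payload, and filename are hypothetical):

import os
import tempfile

from django.core.servers.basehttp import FileWrapper  # wsgiref.util.FileWrapper on newer Django
from django.http import HttpResponse


def download(request):
    # Write the generated content to a named temporary file.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.write(b"...generated content...")
    tmp.flush()
    tmp.seek(0)

    # Unlink the name right away (Unix only). The open handle keeps the data
    # readable; the disk space is reclaimed when the handle is closed after
    # the download finishes.
    os.unlink(tmp.name)

    response = HttpResponse(FileWrapper(tmp), content_type="application/octet-stream")
    response["Content-Disposition"] = 'attachment; filename="report.bin"'
    return response

On POSIX systems, tempfile.TemporaryFile already behaves this way: it unlinks the file as soon as it is created, so the cleanup happens automatically when the last handle is closed.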

Related

How to get correct file size only on the completion of a detected file change, not at the beginning?

I'm using libuv's uv_fs_event_t to monitor file changes, and once a change is detected, I open the file in the uv_fs_event_cb callback.
However, my program also needs the full file size when opening the file, so it knows how much memory to allocate. I found that whether I use libuv's uv_fs_fstat, POSIX's stat/stat64, or fseek+ftell, I never get the correct file size immediately, because when my program opens the file, the file is still being updated.
My program runs in a tight single thread with callbacks, so delay/sleep isn't a good option here (and wouldn't guarantee correctness either).
Is there any way to handle this, with or without libuv, so that I can hold off opening and reading the file until the write to the file has completed? In other words, instead of immediately reacting to the start of a file change, can I somehow detect the completion of a file change?
One approach is to have the writer create an intermediate file and finish its I/O by renaming it to the target file. This is what most browsers do: the file keeps a "downloading.tmp" name until the download is complete, to discourage you from opening it.
Another approach is to write/touch a "finished" marker file after writing the main target file, and have the reader wait for that marker before starting its job.
The last option I can see, if the file format can be altered slightly, is to have the writer put the file size in the first bytes of the file; the reader can then preallocate correctly even if the file is not fully written, and keep reading until it has all the data.
Overall I'm suggesting that instead of a completion event, you make the writer produce some event that can only be observed after it has completed its task, and have the reader wait/synchronize on that event.
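For illustration, here is a minimal Python sketch of the first approach (write to an intermediate file in the same directory, then rename it into place); the function name and arguments are hypothetical, and os.replace is atomic only when source and target are on the same filesystem:

import os
import tempfile


def write_atomically(path, data):
    # Write to a temporary file in the target directory, then rename it into
    # place. A watcher that triggers on the final name only ever sees a
    # complete file, because the rename is atomic on POSIX.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure the bytes are on disk before the rename
        os.replace(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)
        raise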

When is a newly created HDFS file considered available for read?

Creating an HDFS file involves several things: metadata, allocating blocks, replicating blocks. My question is: when is a file considered available for read? Does it need to wait until all blocks are fully replicated?
In my HDFS log, I noticed HDFS first allocated blocks for my mapreduce staging file:
org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073743864_3041, replicas=10.206.36.220:9866, 10.206.37.92:9866, 10.206.36.186:9866, 10.206.36.246:9866, 10.206.38.104:9866, 10.206.37.119:9866, 10.206.37.255:9866, 10.206.37.129:9866, 10.206.38.97:9866, 10.206.38.5:9866 for /tmp/hadoop-yarn/staging/xxx/.staging/job_12345678_0567/job.split
but later the job failed to find the file:
INFO org.apache.hadoop.ipc.Server: IPC Server handler 80 on 8020, call Call#1 Retry#1 org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 10.206.38.106:46254: java.io.FileNotFoundException: File does not exist: /tmp/hadoop-yarn/staging/xxx/.staging/job_12345678_0567/job.split
Finally, I see:
INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073744008_3185, replicas=10.206.37.253:9866, 10.206.36.167:9866 for /tmp/hadoop-yarn/staging/xxx/.staging/job_12345678_0567/job.split.1234567890._COPYING_
INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/staging/xxx/.staging/job_12345678_0567/job.split.1234567890._COPYING_ is closed by DFSClient_NONMAPREDUCE_-1702431035_1
I'm guessing the file was never fully created.
Data is available for read immediately after the flush operation. If a writer wants to ensure that data it has written does not get lost in the event of a system crash, it has to invoke flush. A successful return from the flush call guarantees that HDFS has persisted the data and relevant metadata. The writer can invoke flush as many times and as frequently as it needs; it can repeatedly write a few bytes to a file and then invoke flush. A reader that already has the file open might not see the changes that this flush effected, but any re-open of the file will allow the reader to access the newly written data. HDFS guarantees that once data is written to a file and flushed, it is either available to new readers or an exception is generated. New readers will encounter an IO error while reading those portions of the file that are currently unavailable; this can occur if all the replicas of a block are unavailable. HDFS guarantees to generate an exception (i.e. no silent data loss) if a new reader tries to read data that was earlier flushed but is currently unavailable.
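For illustration, here is a hedged sketch of that write-flush-read sequence using pyarrow's HadoopFileSystem (the question does not name a client, so this choice is an assumption, and whether NativeFile.flush() translates to an HDFS hflush depends on the underlying libhdfs binding):

from pyarrow import fs

# Assumption: connect via fs.defaultFS from the local Hadoop configuration.
hdfs = fs.HadoopFileSystem("default")

writer = hdfs.open_output_stream("/tmp/example/data.bin")
writer.write(b"first batch of records")
writer.flush()  # after a successful flush, newly opened readers should see these bytes

# A reader that opens the file after the flush can read the flushed data,
# even though the writer has not closed the file yet.
reader = hdfs.open_input_stream("/tmp/example/data.bin")
print(reader.read())

reader.close()
writer.close()  # completes the file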

How do I make a file load into memory as soon as the server starts in Django?

I want to load a tar.gz file into memory as soon as the server starts, so that I can perform operations on the requests that come in rather than loading the file every time a request is made. The file is loaded through a function (load_archive):
# Assumed imports: load_archive, Predictor and check_for_gpu look like AllenNLP APIs.
from allennlp.common.checks import check_for_gpu
from allennlp.models.archival import load_archive
from allennlp.predictors import Predictor

def _get_predictor() -> Predictor:
    check_for_gpu(-1)
    archive = load_archive("{}/model.tar.gz".format(HOME), cuda_device=-1)
    return Predictor.from_archive(archive, "sentence-tagger")
The file mentioned in the code is very large (more than 2 GB) and I want an optimized way to handle it. I am creating an API which ultimately has to perform operations using this file. What I am thinking is: load this file once when the server starts, keep it in memory, and serve all requests quickly without loading it again and again and delaying the response. I have found some solutions such as memcached or Celery, but I am not sure whether and how I can use them.
Please help me out.
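One common pattern is to load the archive once per worker process when Django's app registry becomes ready. A minimal sketch, assuming AllenNLP and a standard Django app config (the app label and paths are hypothetical):

import os

from django.apps import AppConfig

HOME = os.path.expanduser("~")  # stand-in for the HOME constant in the question


class PredictorConfig(AppConfig):
    name = "predictor_app"  # hypothetical app label
    predictor = None

    def ready(self):
        # Imported lazily so this module can be loaded without AllenNLP installed.
        from allennlp.models.archival import load_archive
        from allennlp.predictors import Predictor

        # Runs once per worker process at startup; the model then stays in that
        # process's memory and is reused by every request it serves.
        archive = load_archive("{}/model.tar.gz".format(HOME), cuda_device=-1)
        type(self).predictor = Predictor.from_archive(archive, "sentence-tagger")

Views can then use PredictorConfig.predictor directly; each worker pays the 2 GB load cost once at startup instead of per request.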

How to delete dynamic cfm file only AFTER the code runs?

I'm using the in-memory file system to execute dynamic CFM files. How can I delete the temporary file only AFTER it has finished running? If I delete it right after the cfinclude, it won't get deleted if the dynamic code contains an abort or location tag, etc.
Can I create a thread that sleeps until the main page thread completes and then deletes the file?

Rotating logs without restart, multiple process problem

Here is the deal:
I have a multi-process system (pre-fork model, similar to Apache). All processes write to the same log file (in fact a binary log file recording requests and responses, but that doesn't matter here).
I protect against concurrent access to the log with a shared-memory lock, and when the file reaches a certain size, the process that notices it first rolls the logs by:
closing the file.
renaming log.bin -> log.bin.1, log.bin.1 -> log.bin.2 and so on.
deleting logs that are beyond the max allowed number of logs (say, log.bin.10).
opening a new log.bin file.
The problem is that the other processes are unaware of this, and in fact continue to write to the old log file (which was renamed to log.bin.1).
I can think of several solutions:
some sort of RPC to notify the other processes to reopen the log (maybe even a signal). I don't particularly like it.
have processes check the file length via the open file stream, somehow detect that the file was renamed under them, and reopen the log.bin file.
None of those is very elegant in my opinion.
Thoughts? Recommendations?
Your solution seems fine, but you should store the inode number of the current log file in shared memory (see stat(2) and the st_ino member).
That way, each process keeps a local variable with the inode of the file it has open.
The shared variable is updated only by the process that performs the rotation; every other process notices the rotation by comparing its local inode with the shared one, and a difference should trigger a reopen.
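A minimal Python sketch of the inode check (simplified: it stats the path on every write instead of reading the inode from shared memory, and the class name is hypothetical):

import os


class RotatingLogWriter:
    def __init__(self, path):
        self.path = path
        self._open()

    def _open(self):
        self.fh = open(self.path, "ab")
        self.inode = os.fstat(self.fh.fileno()).st_ino

    def write(self, record):
        # Compare the inode recorded at open time with the inode currently
        # behind the file name; a mismatch means another process rotated the
        # file, so reopen before writing.
        try:
            current_inode = os.stat(self.path).st_ino
        except FileNotFoundError:
            current_inode = None
        if current_inode != self.inode:
            self.fh.close()
            self._open()
        self.fh.write(record)
        self.fh.flush()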
What about opening the file by name each time before writing a log entry?
get shared memory lock
open file by name
write log entry
close file
release lock
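A small sketch of that sequence in Python, with a file lock standing in for the shared-memory lock in the question (names are hypothetical):

import fcntl


def append_log_entry(path, record, lock_path="log.lock"):
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            # Re-opening by name on every write means a rename-based rotation
            # is picked up automatically on the next entry.
            with open(path, "ab") as f:
                f.write(record)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)

The obvious trade-off is one open/close per log entry, which may or may not matter at your request rate.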
Or you could create a logging process, which receives log messages from the other processes and handles all the rotating transparently from them.
You don't say what language you're using, but your processes should all send their log records to a dedicated logging process, and that process abstracts the file writing.
Logging client1 -> |
Logging client2 -> |
Logging client3 -> | Logging queue (with process lock) -> logging writer -> file roller
Logging client4 -> |
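As an illustration of that architecture, here is a sketch using Python's standard library (the question is language-agnostic, so the queue-based handlers here are just one possible realization): worker processes push records onto a queue, and a single listener owns the file and its rotation.

import logging
import logging.handlers
import multiprocessing


def worker(queue):
    logger = logging.getLogger("worker")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.info("request handled")


if __name__ == "__main__":
    log_queue = multiprocessing.Queue()

    # The listener is the only thing that ever touches log.bin, so rotation
    # never races with the writers.
    file_handler = logging.handlers.RotatingFileHandler(
        "log.bin", maxBytes=10_000_000, backupCount=10)
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    procs = [multiprocessing.Process(target=worker, args=(log_queue,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    listener.stop()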
You could copy log.bin to log.bin.1 and then truncate the log.bin file.
So the processes can still write to the old file pointer, which is now empty.
See also man logrotate:
copytruncate
    Truncate the original log file to zero size in place after creating a
    copy, instead of moving the old log file and optionally creating a new
    one. It can be used when some program cannot be told to close its
    logfile and thus might continue writing (appending) to the previous log
    file forever. Note that there is a very small time slice between copying
    the file and truncating it, so some logging data might be lost. When
    this option is used, the create option will have no effect, as the old
    log file stays in place.
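A minimal Python sketch of the copy-then-truncate idea (file names are hypothetical), with the same caveat as logrotate's copytruncate: anything written between the copy and the truncate is lost, and the writers need to be appending (O_APPEND) for the truncation to behave as expected:

import shutil


def copy_truncate(path="log.bin", suffix=".1"):
    # Copy the current contents aside...
    shutil.copy2(path, path + suffix)
    # ...then truncate the original in place, so processes holding the old
    # file descriptor keep appending to the same (now empty) file.
    with open(path, "r+b") as f:
        f.truncate(0)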
Since you're using shared memory, and if you know how many processes are using the log file, you can create an array of flags in shared memory telling each of the processes that the file has been rotated. Each process then resets its own flag so that it doesn't re-open the file continuously.
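A small sketch of the flag array, using multiprocessing.Array as a stand-in for the question's shared memory segment (the process count and helper names are hypothetical):

import multiprocessing

NUM_PROCESSES = 4  # the known number of logging processes

# One "file was rotated" flag per process, kept in shared memory.
rotate_flags = multiprocessing.Array('b', NUM_PROCESSES)


def rotate_logs():
    # ...rename log.bin -> log.bin.1, open a fresh log.bin...
    for i in range(NUM_PROCESSES):
        rotate_flags[i] = 1  # tell every process the file was rotated


def write_entry(process_index, state, record):
    if rotate_flags[process_index]:
        state["fh"].close()                  # drop the renamed file
        state["fh"] = open("log.bin", "ab")  # reopen the new one
        rotate_flags[process_index] = 0      # reset only our own flag
    state["fh"].write(record)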