I'm working on a multithreaded application programmed in C++. It uses some temporary files to pass data between its threads. One thread writes the data to be processed into files in a directory. Another thread scans the directory for work files, reads the files, processes them further, and then deletes those files. I have to use these files because, if my app gets killed, I have to retain the data which has not been processed yet.
But I hate using multiple files. I just want to use a single file: one thread continuously writing to the file, and the other thread reading the data and deleting the data which has been read.
Like a vessel that is filled from the top, while at the bottom I can take out and delete data from the vessel. How can I do this efficiently in C++? First of all, is there a way?
As was suggested in the comments to your question, using a database like SQLite may be a very good solution.
However, if you insist on using a file, then this is of course possible.
I did it myself once - created a persistent queue on disk using a file.
Here are the guidelines on how to achieve this:
The file should contain a header which points to the next unprocessed record (entry) and to the next available place to write to (see the layout sketch after this list).
If the records have variable length, then each record should contain a header which states the record length.
You may want to add to each record a flag that indicates whether the record was processed.
File locking can be used to ensure no one reads from the portion of the file that is being written to.
Use low-level IO - don't use buffered streams of any kind; use direct write semantics.
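To make the layout concrete, here is a minimal sketch of what the two headers could look like (the struct names and fixed-width fields are illustrative assumptions, not a prescribed format):

#include <cstdint>

// Hypothetical on-disk layout for the persistent queue file.
// All positions are byte offsets from the start of the file; 0 is reserved
// to mean "no record", since offset 0 is the file header itself.
struct FileHeader {
    uint64_t next_unprocessed; // offset of the next record to read, 0 = none
    uint64_t next_free;        // offset where the writer appends the next record
};

struct RecordHeader {
    uint32_t length;    // payload length in bytes (records are variable length)
    uint8_t  processed; // 0 = pending, 1 = already consumed by the reader
};
// The payload of 'length' bytes follows each RecordHeader immediately.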
And here are the schemes for reading and writing (probably with some small logical bugs, but you should be able to take it from there):
READER
Lock the file header, read it, and unlock it.
Go to the record position taken from the header (the next unprocessed record).
Read the record header and the record.
Write the record header back with the processed flag turned on.
If you are not at the end of the file, lock the header and write the new location of the next unprocessed record; else write some marker to indicate there are no more records to process.
Make sure that the next record to write points to the correct place. (A sketch of these steps follows.)
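Here is a rough POSIX sketch of those reader steps, using the structs from the layout sketch above (pread/pwrite stand in for "low level IO" and fcntl range locks for "file locking"; note that fcntl locks arbitrate between processes, so threads within one process would additionally guard the header with an ordinary mutex; error handling is omitted):

#include <cstdint>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

// Lock or unlock a byte range of fd; type is F_RDLCK, F_WRLCK or F_UNLCK.
static void range_lock(int fd, short type, off_t start, off_t len) {
    struct flock fl{};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    fcntl(fd, F_SETLKW, &fl); // blocks until the lock is granted
}

// Hypothetical reader step: returns false when there is nothing to process.
bool read_next_record(int fd, std::vector<char>& payload) {
    FileHeader hdr;
    range_lock(fd, F_WRLCK, 0, sizeof hdr);  // lock the header
    pread(fd, &hdr, sizeof hdr, 0);          // read it
    range_lock(fd, F_UNLCK, 0, sizeof hdr);  // unlock it back

    if (hdr.next_unprocessed == 0)           // 0 marks "no pending records"
        return false;

    RecordHeader rh;                         // read the record header + payload
    pread(fd, &rh, sizeof rh, hdr.next_unprocessed);
    payload.resize(rh.length);
    pread(fd, payload.data(), rh.length, hdr.next_unprocessed + sizeof rh);

    rh.processed = 1;                        // write the header back, flag on
    pwrite(fd, &rh, sizeof rh, hdr.next_unprocessed);

    // Advance the header to the following record, or 0 at the end of data.
    uint64_t next = hdr.next_unprocessed + sizeof rh + rh.length;
    hdr.next_unprocessed = (next < hdr.next_free) ? next : 0;
    range_lock(fd, F_WRLCK, 0, sizeof hdr);
    pwrite(fd, &hdr, sizeof hdr, 0);
    range_lock(fd, F_UNLCK, 0, sizeof hdr);
    return true;
}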
You may also want the reader to compact the file for you once in a while:
Lock the entire file
Copy all unprocessed records to the beginning of the file (you may want to add some logic so as not to overwrite your unprocessed records - maybe compact only if the processed space is larger than the unprocessed space)
Update the header
Unlock the file
WRITER
Lock the header of the file to see where the next record is to be written, then unlock it.
Lock the file region from the place to be written, for the length of the record.
Write the record and unlock the region.
Lock the header; if the unprocessed-record mark indicates there are no records to process, point it at the new record; unlock the header. (A sketch follows.)
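And a matching writer sketch, under the same assumptions (range_lock and the structs are from the sketches above):

// Hypothetical writer step: appends one record and publishes it.
void append_record(int fd, const char* data, uint32_t len) {
    FileHeader hdr;
    range_lock(fd, F_WRLCK, 0, sizeof hdr);  // see where the next record goes
    pread(fd, &hdr, sizeof hdr, 0);
    off_t pos = hdr.next_free;
    hdr.next_free += sizeof(RecordHeader) + len;
    pwrite(fd, &hdr, sizeof hdr, 0);         // reserve the space
    range_lock(fd, F_UNLCK, 0, sizeof hdr);

    RecordHeader rh{len, 0};                 // lock only the region we write
    range_lock(fd, F_WRLCK, pos, off_t(sizeof rh + len));
    pwrite(fd, &rh, sizeof rh, pos);
    pwrite(fd, data, len, pos + off_t(sizeof rh));
    range_lock(fd, F_UNLCK, pos, off_t(sizeof rh + len));

    range_lock(fd, F_WRLCK, 0, sizeof hdr);  // if the queue looked empty,
    pread(fd, &hdr, sizeof hdr, 0);          // point the reader at this record
    if (hdr.next_unprocessed == 0) {
        hdr.next_unprocessed = pos;
        pwrite(fd, &hdr, sizeof hdr, 0);
    }
    range_lock(fd, F_UNLCK, 0, sizeof hdr);
}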
Hope this sets you on the right track.
The Win32 API function CreateFileMapping() enables processes to share data: multiple processes can use memory-mapped files backed by the system paging file. (A short sketch follows after the links.)
A few good links:
http://msdn.microsoft.com/en-us/library/aa366551(VS.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366551(v=vs.85).aspx
http://www.codeproject.com/Articles/34073/Inter-Process-Communication-IPC-Introduction-and-S
http://www.codeproject.com/Articles/19531/A-Wrapped-Class-of-Share-Memory
http://www.bogotobogo.com/cplusplus/multithreaded2C.php
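For illustration, a minimal sketch of sharing a buffer backed by the paging file (the mapping name and size are placeholders):

#include <windows.h>

int main() {
    const DWORD size = 4096;
    // Backed by the system paging file - no file on disk is needed.
    HANDLE mapping = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                        PAGE_READWRITE, 0, size,
                                        "Local\\MySharedBuffer");
    if (!mapping) return 1;

    char* view = static_cast<char*>(
        MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, size));
    if (view) {
        // Any process that opens "Local\\MySharedBuffer" sees these bytes.
        lstrcpyA(view, "hello from the writer");
        UnmapViewOfFile(view);
    }
    CloseHandle(mapping);
    return 0;
}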
You can write the data line by line, with a delimiter (flag) at the start of each line indicating whether that record has been processed or not.
I'm using libuv's uv_fs_event_t to monitor file changes, and once a change is detected, I open the file in the uv_fs_event_cb callback.
However, my program also needs the full file size when opening the file, so I know how much memory to allocate based on it. I found that no matter whether I use libuv's uv_fs_fstat, POSIX's stat/stat64, or fseek+ftell, I never get the correct file size immediately. That's because when my program opens the file, the file is still being updated.
My program runs in a tight single thread with callbacks, so delay/sleep isn't the best option here (and offers no guaranteed correctness either).
Is there any way to handle this, with or without leveraging libuv, so that I can, say, hold off opening and reading the file until the write to it has completed? In other words, instead of immediately detecting the start of a change to a file, can I somehow detect the completion of a file change?
One approach is to have the writer create an intermediate file, and finish I/O by renaming it to the target file. E.g., this is what happens in most browsers: the file has a "downloading.tmp" name until the download is complete, to discourage you from opening it. (A sketch of this approach follows below.)
Another approach is to write/touch a "finished" marker file after writing the main target file, and have the reader wait to see that file before starting its job.
The last option I can see, if the file format can be altered slightly: have the writer put the file size in the first bytes of the file; then the reader can preallocate correctly even if the file is not fully written, and it can then insist on reading all the data.
Overall, I'm suggesting that instead of a completion event, you make the writer produce any event that can be monitored after it has completed its task, and have the reader wait/synchronize on that event.
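A minimal sketch of the first approach (the file names are placeholders; std::rename is atomic on POSIX filesystems when source and target are on the same volume):

#include <cstdio>
#include <fstream>
#include <string>

// The reader never sees a partially written file: it only ever finds
// 'target' once the rename has made the complete file visible.
void write_atomically(const std::string& target, const std::string& data) {
    std::string tmp = target + ".tmp";     // writer-only intermediate name
    {
        std::ofstream out(tmp, std::ios::binary);
        out << data;
    } // the destructor flushes and closes before we rename
    std::rename(tmp.c_str(), target.c_str());
}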
I am going to read a file in C++. The reading itself happens in a while loop, reading from one file.
When the function reads information from the file, it pushes this information up to some place in the system. The problem is that this file may change while the loop is running.
How can I catch that new information in the file? I tried std::ifstream reading while changing my file manually on my computer as the endless loop (with a sleep(2) between iterations) was running, but as expected - nothing happened.
EDIT: the file overwrites itself on each new entry of data.
Help?
Running on VirtualBox with Ubuntu Linux 12.04, if that may be useful info. And this is not homework.
The usual solution is something along the lines of what MichaelH proposes: the writing process opens the file in append mode, and always writes to the end. The reading process does what MichaelH suggests.
This works fine for small amounts of data in each run. If the processes are supposed to run a long time, and generate a lot of data, the file will eventually become too big, since it will contain all of the processed data as well. In this case, the solution is to use a directory, generating numbered files in it, one file per data record. The writing process will write each data set to a new file (incrementing the number), and the reading process will try to open the new file, and delete it when it has finished. This is considerably more complex than the first suggestion, but will work even for processes generating large amounts of data per second and running for years. (A sketch of this scheme follows.)
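A rough sketch of the numbered-file scheme (the directory name, naming pattern, and function names are illustrative):

#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Writer side: each data set goes into its own, newly numbered file.
void write_record(unsigned long& seq, const std::string& data) {
    char name[64];
    std::snprintf(name, sizeof name, "queue/%010lu.rec", seq++);
    std::ofstream(name, std::ios::binary) << data;
}

// Reader side: process the next numbered file if it exists, then delete it.
bool read_record(unsigned long& seq, std::string& data) {
    char name[64];
    std::snprintf(name, sizeof name, "queue/%010lu.rec", seq);
    std::ifstream in(name, std::ios::binary);
    if (!in) return false;                   // not written yet
    data.assign(std::istreambuf_iterator<char>(in),
                std::istreambuf_iterator<char>());
    in.close();
    std::remove(name);                       // done with this record
    ++seq;
    return true;
}

In practice the writer should also write each file under a temporary name and rename it into place (as in the earlier answer about intermediate files), so the reader never opens a half-written record.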
EDIT:
Later comments by the OP say that the device is actually a FIFO. In that case:
you can't seek, so MichaelH's suggestion can't be used literally, but
you don't need to seek, since data is automatically removed from the FIFO whenever it has been read, and
depending on the size of the data, and how it is written, the writes may be atomic, so you don't have to worry about partial records, even if you happen to read exactly in the middle of a write.
With regards to the latter: make sure that both the read and write buffers are large enough to contain a complete record, and that the writer flushes after each record. And make sure that the records are smaller than the size needed to guarantee atomicity. (Historically, on the early Unix I know, this was 4096, but I would be surprised if it hasn't increased since then. Although... under POSIX, this is defined by PIPE_BUF, which is only guaranteed to be at least 512, and is only 4096 under modern Linux.) A writer sketch follows.
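For illustration, a writer that stays inside that guarantee by sending each record as one write() call (the length-prefixed record format is an assumption):

#include <climits>   // PIPE_BUF
#include <cstdint>
#include <cstring>
#include <unistd.h>

// Writes of at most PIPE_BUF bytes to a pipe/FIFO are atomic under POSIX,
// so a record sent in a single write() can never be interleaved or split.
bool write_record(int fifo_fd, const void* data, uint32_t len) {
    char buf[PIPE_BUF];
    if (sizeof len + len > sizeof buf)
        return false;                        // too big to be written atomically
    std::memcpy(buf, &len, sizeof len);      // length prefix
    std::memcpy(buf + sizeof len, data, len);
    return write(fifo_fd, buf, sizeof len + len)
           == (ssize_t)(sizeof len + len);
}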
Just read the file - rename the file, open the renamed file, do the processing of the data into your system, and at the end of the loop close the file. After a sleep, rename and re-open the file again at the top of the while loop, and repeat.
That's the simplest way to approach the problem and saves having to write code to process dynamic changes to the file during the processing stage.
To be absolutely sure you don't get any corruption, it's best to rename the file. This guarantees that any changes from another process do not affect the processing. It may not be necessary to do this - it depends on the processing and how the file is updated - but it's the safer approach. A move or rename operation is guaranteed to be atomic, so there should be no concurrency issues when using this approach.
You can use inotify to watch file changes.
If you need a simpler solution, read the file attributes with stat() and check the file's last modification time (st_mtime).
However, you may still miss some file modifications while you are opening and re-reading the file. So if you have control over the application which writes to the file, I'd recommend using something else to communicate between these processes - pipes, for example.
To be more explicit, if you want tail-like behavior you'll want to:
Open the file, read in the data. Save the length. Close the file.
Wait for a bit.
Open the file again, attempt to seek to the last read position, read the remaining data, close.
Rinse and repeat. (A sketch of this loop follows.)
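A minimal sketch of that loop (the file name is a placeholder; the size check also handles the overwrite case from the question's EDIT, where the file may shrink):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
    std::streampos last_pos = 0;
    for (;;) {
        std::ifstream in("watched.txt");            // reopen every iteration
        if (in) {
            in.seekg(0, std::ios::end);
            if (in.tellg() < last_pos) last_pos = 0; // file was truncated
            in.seekg(last_pos);                      // resume where we left off
            std::string line;
            while (std::getline(in, line))
                std::cout << "new: " << line << '\n';
            in.clear();                              // clear eof so tellg works
            last_pos = in.tellg();
        }
        std::this_thread::sleep_for(std::chrono::seconds(2));
    }
}

Note that a real tail would also have to cope with a trailing line that has no newline yet; this sketch treats whatever getline returns as complete.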
Problem: multiple processes want to update a file simultaneously. I do not want to use file locking functionality, because in a highly loaded environment a process may block for a while, which I don't want. I want something like all processes sending data to a queue or some shared place, and one master process keeps taking data from there and writing it to the file, so that no process gets blocked.
One possibility is socket programming: all the processes send data to a single port, and the master keeps listening on this port and stores the data to the file. But what if the master goes down for a few seconds? If that happens, I could write to some file based on a timestamp and sync it later. But I am putting this on hold and looking for some other solution. (No data loss.)
Another possibility may be taking a lock on the particular segment of the file that a process wants to write to. Basically each process will write a line. I am not sure how well this would behave in a highly loaded system.
Please suggest some solution for this problem.
Have a 0MQ (ZeroMQ) instance handle the writes (as you initially proposed for the socket) and have the workers connect to it and add their writes to the queue; there are examples in many languages. A C++ sketch with libzmq follows.
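A minimal C++ sketch with the libzmq C API (the endpoint and line-per-message format are placeholders): workers PUSH their lines and a single master PULLs them and is the only process that touches the file.

#include <zmq.h>
#include <cstdio>
#include <cstring>

// Worker side: connect to the master and push one line; never touches the file.
void send_line(const char* line) {
    void* ctx = zmq_ctx_new();
    void* sock = zmq_socket(ctx, ZMQ_PUSH);
    zmq_connect(sock, "ipc:///tmp/filewriter");
    zmq_send(sock, line, std::strlen(line), 0);
    zmq_close(sock);
    zmq_ctx_term(ctx);
}

// Master side: the only writer; messages queue up inside 0MQ, so workers
// never block on the file itself.
void run_writer(const char* path) {
    void* ctx = zmq_ctx_new();
    void* sock = zmq_socket(ctx, ZMQ_PULL);
    zmq_bind(sock, "ipc:///tmp/filewriter");
    std::FILE* f = std::fopen(path, "a");
    char buf[4096];
    for (;;) {
        int n = zmq_recv(sock, buf, sizeof buf - 1, 0);
        if (n < 0) break;
        if (n > (int)(sizeof buf - 1)) n = sizeof buf - 1; // message truncated
        buf[n] = '\0';
        std::fprintf(f, "%s\n", buf);
        std::fflush(f);          // each line reaches the file before the next
    }
    std::fclose(f);
    zmq_close(sock);
    zmq_ctx_term(ctx);
}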
Each process can write to its own file (pid.temp) and periodically rename it (pid-0.data, pid-1.data, ...) so that the master process can grab all of these files.
You may not need to construct something like this. If you do not want processes to get blocked, just use the non-blocking LOCK_NB flag of Perl's flock. Periodically try to flock; if it does not succeed, continue the processing and store the values in an array. Once the file can be locked, write the buffered data to it from the array.
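The same non-blocking pattern translated to C++ on Linux - a swap from Perl's flock to the flock(2) syscall it wraps; the array buffering is a minimal sketch:

#include <string>
#include <sys/file.h>
#include <unistd.h>
#include <vector>

// Try to append any buffered lines. With LOCK_NB, flock fails immediately
// instead of blocking, so the caller can carry on and retry later.
void try_flush(int fd, std::vector<std::string>& pending) {
    if (flock(fd, LOCK_EX | LOCK_NB) != 0)
        return;                  // someone else holds the lock - keep buffering
    for (const std::string& line : pending)
        write(fd, line.c_str(), line.size());
    pending.clear();
    flock(fd, LOCK_UN);
}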
Here is the deal:
I have a multi-process system (pre-fork model, similar to Apache). All processes are writing to the same log file (in fact a binary log file recording requests and responses, but no matter).
I protect against concurrent access to the log via a shared-memory lock, and when the file reaches a certain size, the process that notices it first rolls the logs by:
closing the file.
renaming log.bin -> log.bin.1, log.bin.1 -> log.bin.2 and so on.
deleting logs that are beyond the max allowed number of logs. (say, log.bin.10)
opening a new log.bin file
The problem is that the other processes are unaware of the roll, and in fact continue to write to the old log file (which was renamed to log.bin.1).
I can think of several solutions:
some sort of RPC to notify the other processes to reopen the log (maybe even a signal). I don't particularly like it.
have processes check the file length via the opened file stream, somehow detect that the file was renamed under them, and reopen the log.bin file.
None of those is very elegant in my opinion.
thoughts? recommendations?
Your solution seems fine, but you should store an integer with the inode of the current logging file in shared memory (see stat(2) and its stat.st_ino member).
This way, each process keeps a local variable with the inode of the file it has open.
The shared variable must be updated by the one process that rotates, and all the other processes become aware of the rotation by checking for a difference between their local inode and the shared inode; a difference should induce a reopen. (Sketched below.)
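A sketch of that check (shared_ino stands for the integer kept in shared memory; how the segment is mapped is up to you):

#include <cstdio>
#include <sys/stat.h>

// Run before each write. If the shared inode no longer matches the inode we
// opened, someone rotated the logs, so we reopen log.bin by name.
std::FILE* reopen_if_rotated(std::FILE* f, ino_t& local_ino,
                             const volatile ino_t& shared_ino,
                             const char* path) {
    if (local_ino != shared_ino) {
        std::fclose(f);
        f = std::fopen(path, "a");
        struct stat st;
        stat(path, &st);         // remember the new file's inode
        local_ino = st.st_ino;
    }
    return f;
}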
What about opening the file by name each time before writing a log entry? (A sketch follows the steps below.)
get shared memory lock
open file by name
write log entry
close file
release lock
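A minimal sketch of those steps (acquire_shared_lock/release_shared_lock stand for whatever shared-memory lock is already in place):

#include <fstream>
#include <string>

void acquire_shared_lock();  // the existing shared-memory lock (hypothetical)
void release_shared_lock();

// Reopening by name for every entry means a rotated file is picked up
// automatically; the cost is one open/close per log entry.
void write_entry(const char* path, const std::string& entry) {
    acquire_shared_lock();
    std::ofstream log(path, std::ios::app);  // resolves the current log.bin
    log << entry << '\n';
    log.close();
    release_shared_lock();
}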
Or you could create a logging process which receives log messages from the other processes and handles all the rotating transparently to them.
You don't say what language you're using, but your processes should all log to a logging process, and that logging process abstracts the file writing.
Logging client1 -> |
Logging client2 -> |
Logging client3 -> | Logging queue (with process lock) -> logging writer -> file roller
Logging client4 -> |
You could copy log.bin to log.bin.1 and then truncate the log.bin file.
So the processes can still write through the old file pointer, which now refers to a file that has been emptied. (Sketched below, after the man page excerpt.)
See also man logrotate:
copytruncate
Truncate the original log file to zero size in place after creating a copy, instead of moving the old log file and optionally creating a new one. It can be used when some program cannot be told to close its logfile and thus might continue writing (appending) to the previous log file forever. Note that there is a very small time slice between copying the file and truncating it, so some logging data might be lost. When this option is used, the create option will have no effect, as the old log file stays in place.
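A sketch of the same idea in C++ (file names are placeholders; note the same small race the man page mentions):

#include <fstream>
#include <unistd.h>

// Copy the current contents aside, then empty the original in place, so
// the other processes' file descriptors stay valid.
void copy_truncate(const char* live, const char* archive) {
    {
        std::ifstream in(live, std::ios::binary);
        std::ofstream out(archive, std::ios::binary);
        out << in.rdbuf();       // copy everything written so far
    }
    truncate(live, 0);           // entries logged in between are lost
}

This works cleanly only if the writers opened the file in append mode (O_APPEND or ios::app), so the kernel repositions each write to the new end of file; a writer that tracks its own offset would leave a gap of NUL bytes after the truncation.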
Since you're using shared memory and you know how many processes are using the log file, you can create an array of flags in shared memory telling each of the processes that the file has been rotated. Each process then resets its own flag so that it doesn't re-open the file continuously.
I have this tool in which a single log-like file is written to by several processes.
What I want to achieve is to have the file truncated when it is first opened, and then have all writes done at the end by the several processes that have it open.
All writes are systematically flushed and mutex-protected so that I don't get jumbled output.
First, a master process creates the file, then starts a sequence of other processes, one at a time, that open the file and write to it (the master sometimes chimes in with additional content; a slave process may or may not be open and writing something at any given time).
I'd like, as much as possible, not to use more IPC than what already exists (all I'm doing now is writing to a popen-created pipe). I have no access to external libraries other than the CRT and Win32 API, and I would like not to start writing serialization code.
Here is some code that shows where I've gone:
// open the file: truncate it if we're the 'master', append to it if we're a 'slave'
std::ofstream blah(filename, std::ios::out | (isClient ? std::ios::app : std::ios::openmode(0)));

// do stuff...

// write stuff
myMutex.acquire();
blah << "stuff to write" << std::flush;
myMutex.release();
Well, this does not work: although the output of the slave processes is ordered as expected, what the master writes is either bunched together or in the wrong place, when it exists at all.
I have two questions: is the flag combination given to the ofstream's constructor the right one, and am I going about this the right way anyway?
If you'll be writing a lot of data to the log from multiple threads, you'll need to rethink the design, since all threads will block trying to acquire the mutex, and in general you don't want your threads blocked from doing work just so they can log. In that case, you'd want your worker threads to log entries to a queue (which just requires moving stuff around in memory), and have a dedicated thread pull entries off the queue and write them to the output. That way your worker threads are blocked for as short a time as possible. (See the sketch below.)
You can do even better than this by using async I/O, but that gets a bit more tricky.
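A sketch of that design for the in-process case (class and member names are my own; the same shape works if the consumer is a separate logging process instead of a thread):

#include <condition_variable>
#include <deque>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

// Workers only pay for a short mutex hold and a string move; the dedicated
// thread does the slow file I/O off the workers' critical path.
class AsyncLog {
public:
    explicit AsyncLog(const char* path)
        : out_(path, std::ios::app), worker_(&AsyncLog::run, this) {}
    ~AsyncLog() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();               // drains the queue before destruction
    }
    void log(std::string line) {
        { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(line)); }
        cv_.notify_one();
    }
private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        while (!done_ || !q_.empty()) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                std::string line = std::move(q_.front());
                q_.pop_front();
                lk.unlock();          // do the I/O without holding the lock
                out_ << line << '\n' << std::flush;
                lk.lock();
            }
        }
    }
    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::string> q_;
    bool done_ = false;
    std::thread worker_;              // declared last: starts after the rest
};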
As suggested by reinier, the problem was not in the way I use the files but in the way the programs behave.
The fstreams do just fine.
What I missed out is the synchronization between the master and the slave (the former was assuming a particular operation was synchronous where it was not).
edit: Oh well, there still was a problem with the open flags. The process that opened the file with ios::out did not move the file pointer as needed (erasing text other processes were writing), and using seekp() completely screwed up the output when writing to cout, as another part of the code uses cerr.
My final solution is to keep the mutex and the flush, and, for the master process, open the file in ios::out mode (to create or truncate the file), close it and reopen it using ios::app.
I made a little log system that has its own process to handle the writing. The idea is quite simple: the processes that use the logs just send them to a pending queue, which the log process then writes out to a file. It's like batch processing in any real-time rendering app. This way you get rid of too many open/close file operations. If I can, I'll add some sample code.
How do you create that mutex?
For this to work, it needs to be a named mutex so that both processes actually lock on the same thing.
You can check that your mutex is actually working correctly with a small piece of code that locks it in one process while another process tries to acquire it. (A sketch follows.)
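A minimal Win32 sketch (the mutex name is a placeholder; master and slaves must pass the exact same name):

#include <windows.h>

int main() {
    // A named mutex is shared across processes; an unnamed one is per-process,
    // which would make the locking in the question silently ineffective.
    HANDLE mtx = CreateMutexA(NULL, FALSE, "Local\\MyToolLogMutex");
    if (!mtx) return 1;

    WaitForSingleObject(mtx, INFINITE);  // acquire across process boundaries
    // ... write to the shared file and flush it here ...
    ReleaseMutex(mtx);

    CloseHandle(mtx);
    return 0;
}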
I suggest blocking such that the text is completely written to the file before releasing the mutex. I've had instances where the text from one task was interrupted by text from a higher-priority thread; it doesn't look very pretty.
Also, put the output into comma-separated format, or some format that can be easily loaded into a spreadsheet. Include the thread ID and a timestamp. The interlacing of the text lines shows how the threads are interacting, the ID column allows you to sort by thread, and timestamps can be used to show sequential access as well as duration. Writing in a spreadsheet-friendly format will allow you to analyze the log file with an external tool without writing any conversion utilities. This has helped me greatly.
One option is to use the logging facility in ACE (ACE_Log_Msg / ACE_DEBUG). It has an efficient implementation of concurrent logging.