I have a fortran code which needs to read a series of ascii data files (which all together are about 25 Gb). Basically the code opens a given ascii file, reads the information and use it to do some operations, and then close it. Then opens another file, reads the information, do some operations, and close it again. And so on with the rest of ascii files.
Overall each complete run takes about 10h. I usually need to run several independent calculations with different parameters, and the way I do is to run each independent calculation sequentially, so that at the end if I have 10 independent calculations, the total CPU time is 100h.
A more rapid way would be to run the 10 independent calculations at the same time using different processors on a cluster machine, but the problem is that if a given calculation needs to open and read data from a given ascii file which has been already opened and it's being used by another calculation, then the code gives obviously an error.
I wonder whether there is a way to verify if a given ascii file is already being used by another calculation, and if so to ask the code to wait until the ascii file is finally closed.
Any help would be of great help.
Many thanks in advance.
Obamakoak.
Two processes should be able to read the same file. Perhaps action="read" on the open statement might help. Must the files be human readable? The I/O would very likely be much faster with unformatted (sometimes call binary) files.
P.S. If your OS doesn't support multiple-read access, you might have to create your own lock system. Create a master file that a process opens to check which files are in use or not, and to update said list. Immediately closing after a check or update. To handle collisions on this read/write file, use iostat on the open statement and retry after a delay if there is an error.
I know this is an old thread but I've been struggling with the same issue for my own code.
My first attempt was creating a variable on a certain process (e.g. the master) and accessing this variable exclusively using one-sided passive MPI. This is fancy and works well, but only with newer versions of MPI.
Also, my code seemed happy to open (with READWRITE status) files that were also open in other processes.
Therefore, the easiest workaround, if your program has file access, is to make use of an external lock file, as described here. In your case, the code might look something like this:
A process checks whether the lock file exists using the NEW statement, which fails if a file already exists. It will look something like:
file_exists = .true.
do while (file_exists)
open(STATUS='NEW',unit=11,file=lock_file_name,iostat=open_stat)
if (open_stat.eq.0) then
file_exists = .false.
open(STATUS='OLD',ACTION=READWRITE',unit=12,file=data_file_name,iostat=ierr)
if (ierr.ne.0) stop
else
call sleep(1)
end if
end do
The file is now opened exclusively by the current process. Do the operations you need to do, such as reading, writing.
When you are done, close the data file and finally the lock file
close(12,iostat=ierr)
if (ierr.ne.0) stop
close(11,status='DELETE',iostat=ierr)
if (ierr.ne.0) stop
The data file is now again unlocked for the other processes.
I hope this may be useful for other people who have the same problem.
Related
I am learning to use TWINCAT 3 with C++ and as my first work, I decided open a .txt file and get a number inside, and put on an string or an integer.
I have read all the documentation and have many questions. I discovered that I can't use the C++ libraries, just the TWINCAT functions. Then I got lost.
First: What are the exacts steps to Open a file in TWINCAT 3 with C++?
Second: How can I read the data in the file and put in an string or integer?
I would like to do that in a CycleUpdate.
I am sorry if it is a noob question.
As a first step, you have to understand that TwinCAT is providing you a PLC with real-time capabilities. Which means each task you program will need to be executed at every cycle: your task must NOT exceed a certain duration.
Many accesses to the operating system require a lot of wait times, that you would not keep in a real-time system. For that, most of function blocks you will find are equipped with an "Execute" boolean input (or similar) and outputs like "Busy", "Done" and "Error" (even "ErrorID"). These are here in order to start a process and check periodically (i.e. in each cycle) whether the process is complete.
You cannot manage a file opening, reading, writing or closing (OS functions) within a single CycleUpdate. This is the cost for assuring real-time capabilities besides.
I am going to read a file in C++. The reading itself is happening in a while-loop, and is reading from one file.
When the function reads information from the file, it is going to push this information up some place in the system. The problem is that this file may change while the loop is ongoing.
How may I catch that new information in the file? I tried out std::ifstream reading and changing my file manually on my computer as the endless-loop (with a sleep(2) between each loop) was ongoing, but as expected -- nothing happend.
EDIT: the file will overwrite itself at each new entry of data to the file.
Help?
Running on virtual box Ubuntu Linux 12.04, if that may be useful info. And this is not homework.
The usual solution is something along the lines of what MichaelH
proposes: the writing process opens the file in append mode, and
always writes to the end. The reading process does what
MichaelH suggests.
This works fine for small amounts of data in each run. If the
processes are supposed to run a long time, and generate a lot of
data, the file will eventually become too big, since it will
contain all of the processed data as well. In this case, the
solution is to use a directory, generating numbered files in it,
one file per data record. The writing process will write each
data set to a new file (incrementing the number), and the
reading process will try to open the new file, and delete it
when it has finished. This is considerably more complex than
the first suggestion, but will work even for processes
generating large amounts of data per second and running for
years.
EDIT:
Later comments by the OP say that the device is actually a FIFO.
In that case:
you can't seek, so MichaelH's suggestion can't be used
literally, but
you don't need to seek, since data is automatically removed
from the FIFO whenever it has been read, and
depending on the size of the data, and how it is written, the
writes may be atomic, so you don't have to worry about partial
records, even if you happen to read exactly in the middle of
a write.
With regards to the latter: make sure that both the read and
write buffers are large enough to contain a complete record, and
that the writer flushes after each record. And make sure that
the records are smaller than the size needed to guarantee
atomicity. (Historically, on the early Unix I know, this was
4096, but I would be surprised if it hasn't increased since
then. Although... Under Posix, this is defined by PIPE_BUF,
which is only guaranteed to be at least 512, and is only 4096
under modern Linux.)
Just read the file, rename the file, open the renamed file. do the processing of data to your system, and at the end of the loop close the file. After a sleep, re-open the file at the top of the white loop, rename it and repeat.
That's the simplest way to approach the problem and saves having to write code to process dynamic changes to the file during the processing stage.
To be absolutely sure you don't get any corruption it's best to rename the file. This guarantees that any changes from another process do not affect the processing. It may not be necessary to do this - it depends on the processing and how the file is updated. But it's a safer approach. A move or rename operation is guaranteed to be atomic - so there should be no concurreny issues if using this approach.
You can use inotify to watch file changes.
If you need simpler solution - read file attributes ( with stat(), and check last_write_time of a file ).
However you still may miss some file modification, while you'll be opening and rereading file. So if you have control over the application which writes to a file, i'd recommend you using something else to communicate between these processes, pipes for example.
To be more explicit, if you want tail-like behavior you'll want to:
Open the file, read in the data. Save the length. Close the file.
Wait for a bit.
Open the file again, attempt to seek to the last read position, read the remaining data, close.
rinse and repeat
I have a program that creates pipes between two processes. One process constantly monitors the output of the other and when specific output is encountered it gives input through the other pipe with the write() function. The problem I am having, though is that the contents of the pipe don't go through to the other process's stdin stream until I close() the pipe. I want this program to infinitely loop and react every time it encounters the output it is looking for. Is there any way to send the input to the other process without closing the pipe?
I have searched a bit and found that named pipes can be reopened after closing them, but I wanted to find out if there was another option since I have already written the code to use unnamed pipes and I haven't yet learned to use named pipes.
Take a look at using fflush.
How are you reading the other end? Are you expecting complete strings? You aren't sending terminating NULs in the snippet you posted. Perhaps sending strlen(string)+1 bytes will fix it. Without seeing the code it's hard to tell.
Use fsync. http://pubs.opengroup.org/onlinepubs/007908799/xsh/fsync.html
From http://www.delorie.com/gnu/docs/glibc/libc_239.html:
Once write returns, the data is enqueued to be written and can be read back right away, but it is not necessarily written out to permanent storage immediately. You can use fsync when you need to be sure your data has been permanently stored before continuing. (It is more efficient for the system to batch up consecutive writes and do them all at once when convenient. Normally they will always be written to disk within a minute or less.) Modern systems provide another function fdatasync which guarantees integrity only for the file data and is therefore faster. You can use the O_FSYNC open mode to make write always store the data to disk before returning.
I have a directory (DIR_A) to dump from Server A to Server B which is
expected to take a few weeks. DIR_A has the normal tree
structure i.e. a directory could have subfolders or files, etc
Aim:
As DIR_A is being dumped to server B, I will have to
go through DIR_A and search for certain files within it (do not know the
exact name of each file because server A changes the names of all the files
being sent). I cannot wait for weeks to process some files within DIR_A. So, I want to
start manipulating some of the files once I receive them at server B.
Brief:
Server A sends DIR_A to Server B. Expected to take weeks.
I have to start processing the files at B before the upload is
complete.
Attempt Idea:
I decided to write a program that will list the contents of DIR_A.
I went on finding out whether files exist within folders and subfolders of DIR_A.
I thought that I might look for the EOF of a file within DIR_A. If it is not present
then the file has not yet been completely uploaded. I should wait till the EOF
is found. So, I keep looping, calculating the size of the file and verifying whether EOF is present. If this is the case, then I start processing that file.
To simulate the above, I decided to write and execute a program writing to
a text file and then stopped it in the middle without waiting for completion.
I tried to use the program below to determine whether the EOF could be found. I assumed that since I abrubtly ended the program writing to the text file the eof will not be present and hence the output "EOF FOUND" should not be reached. I am wrong since this was reached. I also tried with feof(), and fseek().
std::ifstream file(name_of_file.c_str, std::ios::binary);
//go to the end of the file to determine eof
char character;
file.seekg(0, ios::end);
while(!file.eof()){
file.read(character, sizeof(char));
}
file.close();
std::cout << "EOF FOUND" << std::endl
Could anyone provide with an idea of determining whether a file has been completely written or not?
EOF is simply C++'s way of telling you there is no more data. There's no EOF "character" that you can use to check if the file is completely written.
The way this is typically accomplished is to transfer the file over with one name, i.e. myfile.txt.transferring, and once the transfer is complete, move the file on the target host (back to something like myfile.txt). You could do the same by using separate directories.
Neither C nor C++ have a standard way to determine if the file is still open for writing by another process. We have a similar situation: a server that sends us files and we have to pick them up and handle as soon as possible. For that we use Linux's inotify subsystem, with a watch configured for IN_CLOSE_WRITE events (file was closed after having been opened for writing), which is wrapped in boost::asio::posix::stream_descriptor for convenient asynchronicity.
Depending on the OS, you may have a similar facility. Or just lsof as already suggested.
All finite files have an end. If a file is being written by one process, and (assuming the OS allows it) simultaneously read (faster than it is being written) by another process,then the reading process will see an EOF when it has read all the characters that have been written.
What would probably work better is, if you can determine a length of time during which you can guarantee that you'll receive a significant number of bytes and write them to a file (beware OS buffering), then you can walk the directory once per period, and any file that has changed its file size can be considered to be unfinished.
Another approach would require OS support: check what files are open by the receiving process, with a tool like lsof. Any file open by the receiver is unfinished.
In C, and I think it's the same in C++, EOF is not a character; it is a condition a file is (or is not) in. Just like media removed or network down is not a character.
I have this tool in which a single log-like file is written to by several processes.
What I want to achieve is to have the file truncated when it is first opened, and then have all writes done at the end by the several processes that have it open.
All writes are systematically flushed and mutex-protected so that I don't get jumbled output.
First, a process creates the file, then starts a sequence of other processes, one at a time, that then open the file and write to it (the master sometimes chimes in with additional content; the slave process may or may not be open and writing something).
I'd like, as much as possible, not to use more IPC that what already exists (all I'm doing now is writing to a popen-created pipe). I have no access to external libraries other that the CRT and Win32 API, and I would like not to start writing serialization code.
Here is some code that shows where I've gone:
// open the file. Truncate it if we're the 'master', append to it if we're a 'slave'
std::ofstream blah(filename, ios::out | (isClient ? ios:app : 0));
// do stuff...
// write stuff
myMutex.acquire();
blah << "stuff to write" << std::flush;
myMutex.release();
Well, this does not work: although the output of the slave process is ordered as expected, what the master writes is either bunched together or at the wrong place, when it exists at all.
I have two questions: is the flag combination given to the ofstream's constructor the right one ? Am I going the right way anyway ?
If you'll be writing a lot of data to the log from multiple threads, you'll need to rethink the design, since all threads will block on trying to acquire the mutex, and in general you don't want your threads blocked from doing work so they can log. In that case, you'd want to write your worker thread to log entries to queue (which just requires moving stuff around in memory), and have a dedicated thread to pull entries off the queue and write them to the output. That way your worker threads are blocked for as short a time as possible.
You can do even better than this by using async I/O, but that gets a bit more tricky.
As suggested by reinier, the problem was not in the way I use the files but in the way the programs behave.
The fstreams do just fine.
What I missed out is the synchronization between the master and the slave (the former was assuming a particular operation was synchronous where it was not).
edit: Oh well, there still was a problem with the open flags. The process that opened the file with ios::out did not move the file pointer as needed (erasing text other processes were writing), and using seekp() completely screwed the output when writing to cout as another part of the code uses cerr.
My final solution is to keep the mutex and the flush, and, for the master process, open the file in ios::out mode (to create or truncate the file), close it and reopen it using ios::app.
I made a 'lil log system that has it's own process and will handle the writing process, the idea is quite simeple. The proccesses that uses the logs just send them to a pending queue which the log process will try to write to a file. It's like batch procesing in any realtime rendering app. This way you'll grt rid of too much open/close file operations. If I can I'll add the sample code.
How do you create that mutex?
For this to work this needs to be a named mutex so that both processes actually lock on the same thing.
You can check that your mutex is actually working correctly with a small piece of code that lock it in one process and another process which tries to acquire it.
I suggest blocking such that the text is completely written to the file before releasing the mutex. I've had instances where the text from one task is interrupted by text from a higher priority thread; doesn't look very pretty.
Also, put the format into Comma Separated format, or some format that can be easily loaded into a spreadsheet. Include thread ID and timestamp. The interlacing of the text lines shows how the threads are interacting. The ID parameter allows you to sort by thread. Timestamps can be used to show sequential access as well as duration. Writing in a spreadsheet friendly format will allow you to analyze the log file with an external tool without writing any conversion utilities. This has helped me greatly.
One option is to use ACE::logging. It has an efficient implementation of concurrent logging.