UNIX File Descriptor Reuse - C++

Though I'm reasonably used to UNIX and have programmed on it for a long time, I'm not used to file manipulation.
I know that 0/1/2 file descriptors are standard in, out, and error. I'm aware that whenever a process opens a file, it is given a descriptor with the smallest value that isn't yet used - and I understand some things about using dup/dup2.
I get confused about file descriptors between processes, though. Does each process have its own 0/1/2 descriptors for in/out/error, or are those 3 descriptors shared between all processes? How come you can run 3 programs in 3 different shells and they each get only their own program's output if those descriptors are shared?
If two programs open myfile.txt after start-up, will they both use file descriptor #3, or would the second program use #4 since 3 was taken?
I know I asked the same question in a couple of ways there, but I just wanted to be clear. The more detail the better :) I've never run into problems with these things while programming, but I'm reading through a UNIX book to understand more, and I suddenly realized this confused me a lot and I'd never thought about it in detail before.

Each file descriptor is local to the process. However, some file descriptors can refer to the same file - for example, if you create a child process using fork(), it shares the files opened by the parent. It has its own set of file descriptors, initially identical to the parent's, but they can diverge afterwards through close(), dup(), etc.
If two programs open the same file, in general they get separate file descriptors, pointing to separate internal structures. However, using certain techniques (fork, FD passing, etc.) you can have file descriptors in different processes point to the same internal entity. Generally, though, it is not the case.
To answer your question: both programs would get FD #3 for the newly opened file, since each process allocates the lowest unused descriptor from its own table.

File descriptors in Unix (normally) persist through fork() and exec() calls. So yes, several processes can share file descriptors.
For example, a shell might do a command like:
foo | bar
In this case, foo's stdout must be connected to bar's stdin. To do this, the shell will most likely use pipe() to create reader and writer file descriptors, then fork() twice; the descriptors persist across both forks. The child that will become foo calls close(1); dup(writer_fd); to make writer_fd descriptor 1. It then exec()s, and foo writes its output to the pipe we created. For bar, we close(0); dup(reader_fd); then exec(). And voilà, foo's output goes to bar.
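For concreteness, here is a minimal sketch of that dance. It uses dup2() (which combines the close()/dup() pair described above); foo and bar are placeholder command names, and error handling is omitted.

// Minimal sketch of how a shell might wire up "foo | bar".
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    int fds[2];
    pipe(fds);                    // fds[0] = read end, fds[1] = write end

    if (fork() == 0) {            // first child: becomes "foo"
        dup2(fds[1], 1);          // make the write end its stdout
        close(fds[0]);            // close both originals from pipe()
        close(fds[1]);
        execlp("foo", "foo", (char *)nullptr);
        _exit(127);               // only reached if exec fails
    }

    if (fork() == 0) {            // second child: becomes "bar"
        dup2(fds[0], 0);          // make the read end its stdin
        close(fds[0]);
        close(fds[1]);
        execlp("bar", "bar", (char *)nullptr);
        _exit(127);
    }

    close(fds[0]);                // the shell must close both ends too,
    close(fds[1]);                // or "bar" never sees end-of-file
    while (wait(nullptr) > 0)     // reap both children
        ;
    return 0;
}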

Don't confuse the file descriptors with the resources they represent. You can have ten different processes, each with a file descriptor of '3' open, and each refer to a different open file. When a process does I/O using its file descriptor, the OS knows which process is doing the I/O and is able to disambiguate which file is being referred to.

Related

What "gotchas" should I be aware of when writing to the same file descriptor in a parent and child process?

Background: I'm working in C (and very C-ish C++) on Linux. The parent process has an open file descriptor (edit: not file descriptor, actually a FILE pointer) that it writes data to in a "sectioned" format. The child process uses it for this same purpose. As long as the child process is running, it is guaranteed that the parent will not attempt to write more data to its copy of the FILE pointer. The child exits, the parent waits for it, and then it writes more data to the file.
It appears to be working correctly, but I'm still suspicious of it. Do I need to re-seek to the end in the parent? Are there any synchronization issues I need to handle?
The question changed from 'file descriptors' to 'file pointers' or 'file streams'. That complicates any answer.
File Descriptors
File descriptors come from Unix in general, and Linux in particular, and a lot of the behaviour is standardized by POSIX. You say that the parent and child processes share the same file descriptor. That's not possible; file descriptors are specific to one process.
Suppose the parent opens a file (assume the file descriptor is 3) and therefore there is also a new open file description; then the parent process forks. After the fork, each process has a separate file descriptor (but they're both using file descriptor 3), but they share the same open file description. Yes: 'open file descriptors' and 'open file descriptions' are different! Every open file descriptor has an open file description, but a single open file description can be associated with many open file descriptors, and those descriptors need not all be associated with the same process.
One of the critical bits of data in the open file description is the current position (for reading or writing — writing is what matters here).
Consequently, when the child writes, the current position moves for both the parent and the child. Therefore, whenever the parent does write, it writes after the location where the child finished writing.
What you are seeing is guaranteed. At least under the circumstances where the parent opens the file and forks.
Note that in the scenario discussed, the O_APPEND flag was not needed or relevant. However, if you are worried about it, you could open the file with the O_APPEND flag, and then each normal write (not written via pwrite()) will write the data at the current end of the file. This will work even if the two processes do not share the same open file description.
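For illustration, a short sketch of that alternative; the file name shared.log is invented for the example.

// Sketch: unrelated processes can safely append to the same file if each
// opens it with O_APPEND: the kernel moves the offset to end-of-file and
// writes in one step, even across separate open file descriptions.
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

int main()
{
    int fd = open("shared.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;
    const char msg[] = "a record appended at end-of-file\n";
    write(fd, msg, strlen(msg));
    close(fd);
    return 0;
}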
POSIX specification:
open()
fork()
write()
dup2()
pwrite()
File Streams
File streams come with buffering which makes their behaviour more complex than file descriptors (which have no buffering).
Suppose the scenario is like this pseudo-code (error handling omitted):
FILE *fp = fopen(filename, "w");
…code block 1…
pid_t pid = fork();
if (pid == 0)
{
…child writes to file…
…child exits…
}
else
{
…parent waits for child to exit…
…parent writes to file…
}
The de facto implementation of a file stream uses a file descriptor (and you can find the file descriptor using fileno(fp)). When the fork() occurs, it is important that there is no pending data in the file stream — use fflush(fp) before the fork() if any of the code in '…code block 1…' has written to the file stream.
After the fork(), the two processes share the same open file description, but they have independent file descriptors. Of necessity, they have identical copies of the file stream structure.
When the child writes to its copy of the file stream, the data is stored in its buffer. When the buffer fills, or when the child closes the stream (possibly by exiting in a coordinated manner, not using _exit() or its relatives or as a result of a signal), the child's file data is written to the file. That process will move the current position in the shared open file description.
When the parent is notified that the child has exited, then it can write to its file buffer. That information will be written to disk when the buffer fills or is flushed, or when the parent closes the file stream. Since it will be using the same open file description as the child was using, the write position will be where the child left it.
So, as before, what you're seeing is guaranteed as long as you are careful enough.
In particular, calling fflush(fp) before the fork() is crucial if the file stream has been used by the parent before the fork(). If you don't ensure that the stream is flushed, you will get unflushed data written twice, once by the child and once by the parent.
It is also crucial that the child exits cleanly — closing the file stream and hence flushing any unwritten data to the file. If the child does not exit cleanly, there may be buffered data that never gets written to the file. Similarly, if the parent doesn't exit cleanly, there may be buffered data from the parent that never gets written to the file.
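To make this concrete, here is a small self-contained program along the lines of the pseudo-code above; the file name and messages are invented for the example.

// Demonstrates the fflush()-before-fork() rule for shared file streams.
#include <cstdio>
#include <cstdlib>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    FILE *fp = fopen("demo.txt", "w");
    if (fp == nullptr)
        return EXIT_FAILURE;

    fprintf(fp, "written before the fork\n");
    fflush(fp);                  // crucial: otherwise this line sits in the
                                 // buffer of both processes and is written twice

    pid_t pid = fork();
    if (pid == 0) {
        fprintf(fp, "child section\n");
        fclose(fp);              // clean exit: flushes the child's buffer
        exit(EXIT_SUCCESS);      // exit() (unlike _exit()) would also flush
    }

    waitpid(pid, nullptr, 0);    // parent waits; the shared offset now sits
                                 // just past the child's data
    fprintf(fp, "parent section\n");
    fclose(fp);
    return EXIT_SUCCESS;
}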
If you are talking about POSIX file descriptors, then each write call to a file descriptor is atomic and affects the underlying kernel resource object independently of what other processes might do with file descriptors that refer to the same object. If two processes write at approximately the same time, the operations will get ordered by the kernel, with one happening completely (though it might write less data than requested) and then the other.
In your case, it sounds like you are synchronizing such that you know all parent writes happen either before the child has started (before fork) or after it has completed (after wait), which guarantees the ordering of the write calls.

Is seek_ptr unique per file?

Sorry, but I didn't find a clear answer to my question.
I know that each file has its own seek_ptr. Let's suppose the main process opened a connection to file_A and then, before doing anything, called fork().
The forked process then reads 2 chars. Which is correct?
1. seek_ptr will be equal to 2 for both processes?
2. seek_ptr will be equal to 2 for the child process and still 0 for the main process?
Only if the answer is 1:
How can I open 2 files in Notepad and have each file's indicator/cursor at a different location?
In Unix, (pid, fd) acts as a pointer into the kernel's table of open file descriptions. When a process is forked, the child process will have a different PID, call it pid2. So (pid2, fd) is a different key from (pid, fd). However, these two pointers actually point to the same open file description: fork does not fork the open file descriptions themselves. Therefore, they share a single offset. If one process seeks, it affects the other process as well. If one process reads, it affects the other process as well.
However, either process is free to call close to dissociate fd from the existing open file description, then call open to create a new open file description which may refer to the same file. After this is done, the two processes will have different open file descriptions, and seeking in one does not affect the other.
Each successful call to open always creates a new open file description.
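A short sketch that makes the shared offset visible; it assumes file_A already exists and holds at least 2 bytes.

// Shows that fork() shares the open file description, and hence the offset.
#include <cstdio>
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    int fd = open("file_A", O_RDONLY);
    if (fd < 0)
        return 1;

    if (fork() == 0) {           // child
        char buf[2];
        read(fd, buf, 2);        // moves the shared offset to 2
        _exit(0);
    }
    wait(nullptr);

    // The parent never read, yet its offset is 2 as well: answer 1.
    printf("parent offset: %ld\n", (long)lseek(fd, 0, SEEK_CUR));
    close(fd);
    return 0;
}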

File pointers after returning from a forked child process

Is it normal, for a given file descriptor shared between a forked parent and child process, that the file position in the parent process remains the same after a child process reads from the same file descriptor?
This is happening for me. Here's the setup:
I am writing a C++ CGI program, so it reads HTTP requests from stdin. When processing a multipart_form, I process stdin with an intermediary object (Multipart_Pull) that has a getc() method that detects the boundary strings and returns EOF at the end of each field, so I can pretend a field's contents are a file. When the field is a file upload, I fork twice in order to pipe the results of Multipart_Pull::getc to the stdin of a child process that runs ssconvert to make a CSV file from an Excel file for further processing. I wrote the child process to leave the file pointer at the position where the parent could pick it up. The parent process uses wait() to ensure the child processes are done before continuing.
For testing while developing Multipart_Pull, I am faking stdin by opening a disk file that was copied from a real multipart_form request.
When faking stdin, and after the child process returns, the first character read in the parent process is the same first character that the child process read when it started. That is, the file pointer didn't move in the parent's copy of the file.
I have confirmed that the child process actually reads the data by running gdb and following the appropriate child process by using set follow-fork-mode child, and also confirmed the file position of the parent on return by comparing the characters read against the file from which the data is read.
When I am really reading from stdin, I don't expect that this will be a problem because (correct me if I'm wrong here), when you read a character from stdin, it's gone forever.
I realize that there are workarounds to solve this particular problem, the easiest being to just ignore any fields that follow a file upload on a multipart_form, i.e. the parent doesn't try to continue reading after the fork. However, I hate to cripple the production code or make unnecessary restrictions, and mainly because I really just want to understand what's happening.
Thanks in advance.
Is it normal, for a given file descriptor shared between a forked parent and child process, that the file position in the parent process remains the same after a child process reads from the same file descriptor?
Since you bring up fork(), I presume you are working with a POSIX-compliant system. Otherwise, the answer is subject to the specific details of your C++ implementation.
In POSIX terminology, file descriptors and streams are both types of "handles" on an underlying "open file description". There may be multiple distinct handles on the same open file description, potentially held by different processes. The fork() function is one way in which such a situation may arise.
In the event that multiple handles on the same open file description are manipulated, POSIX explicitly declares the results unspecified except under specific conditions. Your child processes satisfy their part of those requirements by closing their streams, either explicitly or as a consequence of normal process termination. According to POSIX, however, for the parent's subsequent use of its stream to have specified behavior, it "shall perform an lseek() or fseek() (as appropriate to the type of handle) to an appropriate location."
In other words, the parent process cannot rely on the child processes' manipulation of the file offset to automatically be visible to it, and in fact cannot rely on any particular offset at all after the children manipulate their copies of the stream.
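In code, the rule the specification imposes looks roughly like this sketch; resume_at is a hypothetical offset the parent has computed (for example, recorded before the fork plus however many bytes the child consumed).

#include <cstdio>

// Before the parent reuses its stream after another handle on the same
// open file description has been active, it must reposition. An absolute
// fseek() both discards the parent's stale buffer and sets a known position.
void resync_parent_stream(FILE *fp, long resume_at)
{
    fseek(fp, resume_at, SEEK_SET);   // re-establishes specified behaviour
    // ...the parent may now read from fp again...
}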

How to pipe stdout in C++ [duplicate]

I am programming a shell in C++. It needs to be able to pipe the output from one thing to another. For example, in Linux, you can pipe a text file to more by doing cat textfile | more.
My function to pipe one thing to another is declared like this:
void pipeinput(string input, string output);
I send "cat textfile" as the input, and "more" as the output.
In C++ examples that show how to make pipes, fopen() is used. What do I send as my input to fopen()? I have seen C++ examples of piping using dup2 and without using dup2. What's dup2 used for? How do you know if you need to use it or not?
Take a look at popen(3), which is a way to avoid execvp.
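A minimal sketch of that approach, reusing the cat textfile example from the question; popen() is POSIX rather than standard C++, but it hides the pipe()/fork()/exec() plumbing entirely.

#include <cstdio>

int main()
{
    // Run the command through the shell and read its output as a stream.
    FILE *p = popen("cat textfile", "r");
    if (p == nullptr)
        return 1;
    char line[256];
    while (fgets(line, sizeof line, p) != nullptr)
        fputs(line, stdout);          // forward to our own stdout
    pclose(p);                        // closes the pipe and reaps the child
    return 0;
}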
For a simple, two-command pipeline, the function interface you propose may be sufficient. For the general case of an N-stage pipeline, I don't think it is flexible enough.
The pipe() system call is used to create a pipe. In context, you will be creating one pipe before forking. One of the two processes will arrange for the write end of the pipe to become its standard output (probably using dup2()), and will then close both of the file descriptors originally returned by pipe(). It will then execute the command that writes to the pipe (cat textfile in your example). The other process will arrange for the read end of the pipe to become its standard input (probably using dup2() again), and will then close both of the file descriptors originally returned by pipe(). It will then execute the command that reads from the pipe (more in your example).
Of course, there will still be a third process around - the parent shell process - which forked off a child to run the entire pipeline. You might decide you want to refine the mechanism a bit if you want to track the status of each process in the pipeline; the process organization is then a bit different.
fopen() is not used to create pipes. (Its relative fdopen() can wrap an existing file descriptor in a stream, but that is not necessary here.)
Pipes are created with the pipe(2) call, before forking off the process. The subprocess has a little bit of file descriptor management to do before execing the command. See the example in pipe's documentation.

Writing concurrently to a file

I have this tool in which a single log-like file is written to by several processes.
What I want to achieve is to have the file truncated when it is first opened, and then have all writes done at the end by the several processes that have it open.
All writes are systematically flushed and mutex-protected so that I don't get jumbled output.
First, a process creates the file, then starts a sequence of other processes, one at a time, that then open the file and write to it (the master sometimes chimes in with additional content; the slave process may or may not be open and writing something).
I'd like, as much as possible, not to use more IPC than what already exists (all I'm doing now is writing to a popen-created pipe). I have no access to external libraries other than the CRT and Win32 API, and I would like not to start writing serialization code.
Here is some code that shows where I've gone:
// open the file. Truncate it if we're the 'master', append to it if we're a 'slave'
std::ofstream blah(filename, std::ios::out | (isClient ? std::ios::app : std::ios::openmode()));
// do stuff...
// write stuff
myMutex.acquire();
blah << "stuff to write" << std::flush;
myMutex.release();
Well, this does not work: although the output of the slave processes is ordered as expected, what the master writes is either bunched together or at the wrong place, if it appears at all.
I have two questions: is the flag combination given to the ofstream's constructor the right one? Am I going about this the right way at all?
If you'll be writing a lot of data to the log from multiple threads, you'll need to rethink the design, since all threads will block trying to acquire the mutex, and in general you don't want your threads blocked from doing work so they can log. In that case, you'd want your worker threads to log entries to a queue (which just requires moving stuff around in memory), and have a dedicated thread pull entries off the queue and write them to the output. That way your worker threads are blocked for as short a time as possible.
You can do even better than this by using async I/O, but that gets a bit more tricky.
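A sketch of the queue-based design using standard C++ threading primitives; all names here are invented for the example. Worker threads call log() and return almost immediately; only the dedicated thread ever touches the file.

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLog {
public:
    explicit AsyncLog(const char *path)
        : out_(path), writer_(&AsyncLog::drain, this) {}

    ~AsyncLog() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        writer_.join();               // flushes whatever is still queued
    }

    void log(std::string line) {      // called by the worker threads
        { std::lock_guard<std::mutex> lk(m_);
          q_.push(std::move(line)); } // cheap: memory traffic only
        cv_.notify_one();
    }

private:
    void drain() {                    // body of the dedicated thread
        std::unique_lock<std::mutex> lk(m_);
        while (!done_ || !q_.empty()) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                std::string line = std::move(q_.front());
                q_.pop();
                lk.unlock();          // do the slow file I/O unlocked
                out_ << line << '\n';
                lk.lock();
            }
        }
    }

    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool done_ = false;
    std::thread writer_;
};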
As suggested by reinier, the problem was not in the way I use the files but in the way the programs behave.
The fstreams do just fine.
What I missed out is the synchronization between the master and the slave (the former was assuming a particular operation was synchronous where it was not).
Edit: oh well, there was still a problem with the open flags. The process that opened the file with ios::out did not move the file pointer as needed (erasing text other processes were writing), and using seekp() completely scrambled the output written to cout, as another part of the code uses cerr.
My final solution is to keep the mutex and the flush, and, for the master process, open the file in ios::out mode (to create or truncate the file), close it and reopen it using ios::app.
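In code, that final arrangement might look like this sketch (the helper name is invented):

#include <fstream>

std::ofstream open_log(const char *filename, bool isMaster)
{
    if (isMaster) {
        // The master creates or empties the file once, then closes it again.
        std::ofstream truncator(filename, std::ios::trunc);
    }
    // Everyone, master included, then opens in append mode, so every write
    // lands at the current end of the file.
    return std::ofstream(filename, std::ios::app);
}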
I made a little log system that has its own process to handle the writing; the idea is quite simple. The processes that use the log just send entries to a pending queue, which the log process then writes to a file. It's like batch processing in any real-time rendering app. This way you get rid of excessive open/close file operations. If I can, I'll add some sample code.
How do you create that mutex?
For this to work, it needs to be a named mutex so that both processes actually lock on the same thing.
You can check that your mutex is actually working correctly with a small piece of code that locks it in one process while another process tries to acquire it.
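A sketch of such a named mutex with the Win32 API; the mutex name is invented, and the master and the slaves must all use the same one.

#include <windows.h>

HANDLE openLogMutex()
{
    // The same call works in every process: it creates the mutex the first
    // time and opens the existing one afterwards.
    return CreateMutexW(nullptr, FALSE, L"Local\\MyToolLogMutex");
}

void writeLocked(HANDLE mtx /*, ...data to write... */)
{
    WaitForSingleObject(mtx, INFINITE);   // acquire across processes
    // ...write to the file and flush here...
    ReleaseMutex(mtx);
}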
I suggest blocking such that the text is completely written to the file before releasing the mutex. I've had instances where the text from one task is interrupted by text from a higher priority thread; doesn't look very pretty.
Also, use a comma-separated (CSV) format, or some other format that can be easily loaded into a spreadsheet. Include the thread ID and a timestamp. The interlacing of the text lines shows how the threads are interacting. The ID parameter allows you to sort by thread. Timestamps can be used to show sequential access as well as duration. Writing in a spreadsheet-friendly format will allow you to analyze the log file with an external tool without writing any conversion utilities. This has helped me greatly.
One option is to use ACE::logging. It has an efficient implementation of concurrent logging.