Are standard input and standard output independent or not?
Consider a parent program that has launched a child, with the parent's standard output attached to the child's standard input and the child's standard output attached to the parent's standard input:
         stdin   <--   stdout
parent                         child
         stdout  -->   stdin
If the child (asynchronously) continually reads from its standard input and writes the data to its standard output, but the parent only writes to the child's standard input and never reads from the child's standard output at all:
         stdin  |<<     stdout
parent                         child
         stdout ==>==>  stdin
would there eventually be a blockage? Do standard input and standard output share a buffer of any kind? Specifically via C++ std::cin (istream) and std::cout (ostream) if that's needed to answer. Does the standard require they do or do not share such a thing, or does it leave it up to the implementation?
What would happen?
You can't "attach" a file descriptor from a process to a file descriptor of a different process. What you do (if your operating system supports it) is to assign the two file descriptors to the ends of a "pipe". Pipes are not specified anywhere in the C/C++ standard (they are defined by POSIX), and you won't find any standard C/C++ library function which makes any reference to them at all.
As implemented by Unix (and Unix-like) systems, a pipe is little more than a buffer somewhere in the operating system. While the buffer is not full, a process can write data to the input end of the pipe; the data is simply added to the buffer. While the buffer is not empty, a process can read data from the output end of the pipe; the data is removed from the buffer and handed off to the reading process. If a process tries to write to a pipe whose buffer is full or read from a pipe whose buffer is empty, the process "blocks": that is, it is marked by the kernel scheduler as not runnable, and it stays in that state until the pipe can handle its request.
The scenario described in the question needs to involve two pipes. One pipe is used to allow the parent's stdout to send data to the child's stdin, and the other is used to allow the child's stdout to send data to the parent's stdin. These two pipes are wholly independent of each other.
Now, if the parent stops reading from its stdin, but the child continues writing to its stdout, then eventually the pipe buffer will become full. (It actually won't take very long. Pipe buffers are not very big, and they don't grow.) At that point, the child will block trying to write to the pipe. If the child is not multithreaded, then once it blocks, that's it. It stops running, so it won't read from its stdin any more. And if the child stops reading from its stdin, then the other pipe will soon become full and the parent will also block trying to write to its stdout.
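To make the failure mode concrete, here is a minimal sketch (POSIX C; error handling is mostly omitted and the names, buffer sizes and message text are mine): the child echoes its stdin to its stdout through two pipes, while the parent only ever writes. Run it and the parent's progress messages stop once both pipe buffers fill up.

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int to_child[2], to_parent[2];              /* [0] = read end, [1] = write end */
    if (pipe(to_child) == -1 || pipe(to_parent) == -1)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {                             /* child: echo stdin to stdout */
        dup2(to_child[0], STDIN_FILENO);
        dup2(to_parent[1], STDOUT_FILENO);
        close(to_child[0]);  close(to_child[1]);
        close(to_parent[0]); close(to_parent[1]);

        char buf[4096];
        ssize_t n;
        while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0)
            write(STDOUT_FILENO, buf, (size_t)n);   /* blocks once to_parent is full */
        return 0;
    }

    close(to_child[0]);                         /* parent keeps only the ends it uses */
    close(to_parent[1]);

    const char line[] = "some data\n";
    for (unsigned long i = 0; ; i++) {          /* parent writes but never reads */
        if (write(to_child[1], line, sizeof line - 1) == -1)
            break;
        fprintf(stderr, "wrote chunk %lu\n", i); /* this output stalls once both pipes fill */
    }
    return 0;
}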
So there's no requirement that resources be shared in order to achieve deadlock.
This is a very well-known bug in processes which spawn a child and try to feed data to the child while reading the child's response. If the reader does not keep up with the data produced, then deadlock is likely. You'll find lots of information about it by searching for, for example, "pipe buffer deadlock". Here are a few sample links, just at random:
Raymond Chen, on MSDN: http://blogs.msdn.com/b/oldnewthing/archive/2011/07/07/10183884.aspx
Right here on StackOverflow (with reference to Python but the issue is identical): Can someone explain pipe buffer deadlock?
David Glasser, from 2006: http://web.mit.edu/6.033/2006/wwwdocs/writing-samples/unix-DavidGlasser.html ("These limitations are not merely theoretical — they can be seen in practice by the fact that no major form of inter-process communication later developed in Unix is layered on top of pipe.")
Related
Background: I'm working in C (and very C-ish C++) on Linux. The parent process has an open file descriptor (edit: not file descriptor, actually a FILE pointer) that it writes data to in a "sectioned" format. The child process uses it for this same purpose. As long as the child process is running, it is guaranteed that the parent will not attempt to write more data to its copy of the FILE pointer. The child exits, the parent waits for it, and then it writes more data to the file.
It appears to be working correctly, but I'm still suspicious of it. Do I need to re-seek to the end in the parent? Are there any synchronization issues I need to handle?
The question changed from 'file descriptors' to 'file pointers' or 'file streams'. That complicates any answer.
File Descriptors
File descriptors come from Unix in general, and Linux in particular, and a lot of the behaviour is standardized by POSIX. You say that the parent and child processes share the same file descriptor. That's not possible; file descriptors are specific to one process.
Suppose the parent opens a file (assume the file descriptor is 3) and therefore there is also a new open file description; then the parent process forks. After the fork, each process has a separate file descriptor (but they're both using file descriptor 3), but they share the same open file description. Yes: 'open file descriptors' and 'open file descriptions' are different! Every open file descriptor has an open file description, but a single open file description can be associated with many open file descriptors, and those descriptors need not all be associated with the same process.
One of the critical bits of data in the open file description is the current position (for reading or writing — writing is what matters here).
Consequently, when the child writes, the current position moves for both the parent and the child. Therefore, whenever the parent does write, it writes after the location where the child finished writing.
What you are seeing is guaranteed. At least under the circumstances where the parent opens the file and forks.
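A minimal illustration of that guarantee (my own sketch, not taken from the question; the filename and messages are placeholders): the file is opened once before fork(), so parent and child share the open file description and hence the offset.

#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {
        const char a[] = "child section\n";
        write(fd, a, sizeof a - 1);          /* advances the shared offset */
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    const char b[] = "parent section\n";     /* lands right after the child's data */
    write(fd, b, sizeof b - 1);
    close(fd);
    return 0;
}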
Note that in the scenario discussed, the O_APPEND flag was not needed or relevant. However, if you are worried about it, you could open the file with the O_APPEND flag, and then each normal write (not written via pwrite()) will write the data at the current end of the file. This will work even if the two processes do not share the same open file description.
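And a hedged sketch of the O_APPEND variant, where the parent and child deliberately do not share an open file description (each calls open() itself), yet every write still lands at the then-current end of the file (path and messages are placeholders):

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void append_line(const char *path, const char *line)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1)
        return;
    write(fd, line, strlen(line));   /* positioned at end-of-file at write time */
    close(fd);
}

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        append_line("out.txt", "child section\n");
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    append_line("out.txt", "parent section\n");
    return 0;
}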
POSIX specification:
open()
fork()
write()
dup2()
pwrite()
File Streams
File streams come with buffering which makes their behaviour more complex than file descriptors (which have no buffering).
Suppose the scenario is like this pseudo-code (error handling omitted):
FILE *fp = fopen(filename, "w");
…code block 1…
pid_t pid = fork();
if (pid == 0)
{
    …child writes to file…
    …child exits…
}
else
{
    …parent waits for child to exit…
    …parent writes to file…
}
The de facto implementation of a file stream uses a file descriptor (and you can find the file descriptor using fileno(fp)). When the fork() occurs, it is important that there is no pending data in the file stream — use fflush(fp) before the fork() if any of the code in '…code block 1…' has written to the file stream.
After the fork(), the two processes share the same open file description, but they have independent file descriptors. Of necessity, they have identical copies of the file stream structure.
When the child writes to its copy of the file stream, the data is stored in its buffer. When the buffer fills, or when the child closes the stream (for example by exiting normally, not via _exit() or its relatives and not as a result of a signal), the child's buffered data is written to the file. That write moves the current position in the shared open file description.
When the parent is notified that the child has exited, then it can write to its file buffer. That information will be written to disk when the buffer fills or is flushed, or when the parent closes the file stream. Since it will be using the same open file description as the child was using, the write position will be where the child left it.
So, as before, what you're seeing is guaranteed as long as you are careful enough.
In particular, calling fflush(fp) before the fork() is crucial if the file stream has been used by the parent before the fork(). If you don't ensure that the stream is flushed, you will get unflushed data written twice, once by the child and once by the parent.
It is also crucial that the child exits cleanly — closing the file stream and hence flushing any unwritten data to the file. If the child does not exit cleanly, there may be buffered data that never gets written to the file. Similarly, if the parent doesn't exit cleanly, there may be buffered data from the parent that never gets written to the file.
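Putting those rules together, a hedged sketch of the stream-based version might look like this (the filename and messages are placeholders and error handling is minimal):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    FILE *fp = fopen("output.dat", "w");
    if (fp == NULL)
        return 1;

    fprintf(fp, "header written by parent\n");
    fflush(fp);                       /* crucial: no pending data crosses the fork */

    pid_t pid = fork();
    if (pid == 0) {
        fprintf(fp, "section written by child\n");
        fclose(fp);                   /* flushes the child's buffer to the file */
        exit(0);                      /* clean exit, not _exit(), so stdio tidies up */
    }

    waitpid(pid, NULL, 0);            /* parent writes only after the child is done */
    fprintf(fp, "section written by parent\n");
    fclose(fp);
    return 0;
}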
If you are talking about POSIX file descriptors, then each write call to a file descriptor is atomic and affects the underlying kernel resource object independently of what other processes might do with file descriptors that refer to the same object. If two processes do write at approximately the same time, the operations will get ordered by the kernel with one happening completely (though it might write less data than requested) and then the other happening.
In your case, it sounds like you are synchronizing such that you know all parent writes happen either before the child has started (before fork) or after it has completed (after wait), which guarantees the ordering of the write calls.
(no semaphores, or threading, just processes)
I want to read data from a file in parent and pass it to child through pipe.
Suppose data in file is
Is
This
Possible?
Now, after the child has read "Is" through the pipe:
How would the child know that new data ("This") has been passed and should be read?
What would be the terminating condition after reading "Possible?" through the pipe, so that the child can terminate after reading all the data the parent wanted to pass?
(Doing it without using semaphores or threads, just plain processes i-e forking)
Thanks in Advance
A parent writing to a file and the child reading from it would require the synchronization you're thinking of. That is, if the parent has only written the first line and the child has read it, but the parent has not yet written line 2, the child will get a premature EOF.
But, a pipe does not.
A pipe stays open until the parent/sender closes its end (or terminates). So, the child can just read in a loop until it receives EOF.
The child will automatically block in the read if no data is available but will not get EOF prematurely. If you want, the child can do select(2) or poll(2) to check for data being available but I hardly think that's necessary.
The child will not get EOF until the parent has sent all the data and closed its end of the pipe.
So, no synchronization is needed.
On the other side, we may have a parent that sends lots of data quickly while the child reads slowly (i.e., falls behind a bit). Eventually, the [kernel] pipe buffer gets "full" and the parent write will block until the child has been able to "catch up" and drain some of the data. Thus, no data is "lost".
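A minimal sketch of that pattern (my own; the filename is a placeholder): the parent pushes a file through the pipe and then closes its write end, and the child simply reads until read() returns 0.

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];
    if (pipe(pfd) == -1)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {                        /* child: read until EOF */
        close(pfd[1]);                     /* close the unused write end */
        char buf[256];
        ssize_t n;
        while ((n = read(pfd[0], buf, sizeof buf)) > 0)
            write(STDOUT_FILENO, buf, (size_t)n);
        close(pfd[0]);
        return 0;                          /* n == 0: the parent closed its end */
    }

    close(pfd[0]);                         /* parent: push the file into the pipe */
    FILE *in = fopen("data.txt", "r");     /* placeholder filename */
    if (in != NULL) {
        char line[256];
        while (fgets(line, sizeof line, in) != NULL)
            write(pfd[1], line, strlen(line));
        fclose(in);
    }
    close(pfd[1]);                         /* this close is what the child sees as EOF */
    waitpid(pid, NULL, 0);
    return 0;
}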
You can simply read a fixed amount of data using fairly ubiquitous read() or fread() API, and use the fact that these will either block until more data is available or signal end-of-file conditions. This is the most straightforward way to pipe data into child processes: the child simply reads from stdin like an ordinary file, until it encounters the 'end-of-file' condition.
Alternatively, for a more responsive/performant design (or when dealing with hardware that signals on file objects) you need:
Enable support for nonblocking I/O.
Monitor your input file descriptor/handle (the read side of the pipe) using a select()-like API.
Integrate with an event loop (or write a simple one yourself).
Be able to deal with unexpected amounts of data becoming available.
Be able to deal with errors from read()/fread() and friends, in particular EAGAIN.
Be able to deal with end-of-file conditions.
Wiring this up means delving into OS-specific APIs, but fortunately it is also a common enough task that plenty of toolkit libraries (e.g. libuv, Qt) exist and provide a consistent abstraction / higher-level API.
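For what it's worth, a bare-bones sketch of the select()-based read loop from the list above might look like this (my own code, assuming fd is the read end of the pipe; a real program would hang this off an event loop rather than blocking in a helper):

#include <errno.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

/* Returns 0 on EOF, -1 on error; otherwise keeps looping over incoming data. */
int drain_pipe(int fd)
{
    char buf[4096];
    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        /* Block until the pipe is readable (a timeout could be supplied here). */
        if (select(fd + 1, &rfds, NULL, NULL, NULL) == -1) {
            if (errno == EINTR)
                continue;              /* interrupted by a signal: retry */
            return -1;
        }

        ssize_t n = read(fd, buf, sizeof buf);
        if (n == 0)
            return 0;                  /* writer closed its end: EOF */
        if (n == -1) {
            if (errno == EAGAIN || errno == EINTR)
                continue;              /* nonblocking fd or signal: try again */
            return -1;
        }
        /* ...process n bytes of data in buf; here we just echo them... */
        fwrite(buf, 1, (size_t)n, stdout);
    }
}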
Is it normal, for a given file descriptor shared between a forked parent and child process, that the file position in the parent process remains the same after a child process reads from the same file descriptor?
This is happening for me. Here's the setup:
I am writing a C++ CGI program, so it reads HTTP requests from stdin. When processing a multipart_form, I process stdin with an intermediary object (Multipart_Pull) that has a getc() method that detects the boundary strings and returns EOF at the end of each field, so I can pretend a field's contents are a file. When the field is a file upload, I fork twice in order to pipe the results of Multipart_Pull::getc to the stdin of a child process that runs ssconvert to make a CSV file from an Excel file for further processing. I wrote the child process to leave the file pointer at the position where the parent could pick it up. The parent process uses wait() to ensure the child processes are done before continuing.
For testing while developing Multipart_Pull, I am faking stdin by opening a disk file that was copied from a real multipart_form request.
When faking stdin, and after the child process returns, the first character read in the parent process is the same first character that the child process read when it started. That is, the file pointer didn't move in the parent's copy of the file.
I have confirmed that the child process actually reads the data by running gdb and following the appropriate child process by using set follow-fork-mode child, and also confirmed the file position of the parent on return by comparing the characters read against the file from which the data is read.
When I am really reading from stdin, I don't expect that this will be a problem because (correct me if I'm wrong here), when you read a character from stdin, it's gone forever.
I realize that there are workarounds to solve this particular problem, the easiest being to just ignore any fields that follow a file upload on a multipart_form, i.e. the parent doesn't try to continue reading after the fork. However, I hate to cripple the production code or make unnecessary restrictions, and mainly because I really just want to understand what's happening.
Thanks in advance.
Is it normal, for a given file descriptor shared between a forked parent and child process, that the file position in the parent process remains the same after a child process reads from the same file descriptor?
Since you bring up fork(), I presume you are working with a POSIX-compliant system. Otherwise, the answer is subject to the specific details of your C++ implementation.
In POSIX terminology, file descriptors and streams are both types of "handles" on an underlying "open file description". There may be multiple distinct handles on the same open file description, potentially held by different processes. The fork() function is one way in which such a situation may arise.
In the event that multiple handles on the same open file description are manipulated, POSIX explicitly declares the results unspecified except under specific conditions. Your child processes satisfy their part of those requirements by closing their streams, either explicitly or as a consequence of normal process termination. According to POSIX, however, for the parent's subsequent use of its stream to have specified behavior, it "shall perform an lseek() or fseek() (as appropriate to the type of handle) to an appropriate location."
In other words, the parent process cannot rely on the child processes' manipulation of the file offset to automatically be visible to it, and in fact cannot rely on any particular offset at all after the children manipulate their copies of the stream.
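A hedged sketch of what that looks like in practice (my own code, not the asker's; it assumes the child closes its stream cleanly so the shared offset is left where the child stopped, and it takes that offset as the "appropriate location" for the parent's required fseek()):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    FILE *fp = fopen("request.dat", "r");   /* stand-in for the captured stdin */
    if (fp == NULL)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {
        /* child: consume part of the input, then close the stream so the
           shared offset is adjusted to the stream position before exiting */
        char buf[128];
        fread(buf, 1, sizeof buf, fp);
        fclose(fp);
        _exit(0);
    }

    waitpid(pid, NULL, 0);

    /* parent: re-synchronize its stream handle with the shared offset,
       as POSIX requires, before relying on any further reads */
    off_t pos = lseek(fileno(fp), 0, SEEK_CUR);
    fseek(fp, (long)pos, SEEK_SET);

    int c = fgetc(fp);                      /* continue reading from here */
    (void)c;
    fclose(fp);
    return 0;
}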
Yes, I can't. It seems weird that ostream has no close(), since istream can detect end-of-file.
Here's my situation: I am capturing all the output on Posix fd2, in this process, and its children, by creating a pipe and dup2'ing the pipe output end onto fd2. A thread then reads the read end of the pipe using an associated C stream (and happens to write each line with a timestamp to the original fd2 via another associated C stream).
When all the children are dead, I write a closing message to cerr, then I need to close it so the thread echoing it to the original error file will close the pipe and terminate.
The thread is not detecting eof(), even though I am closing both stderr and fd2.
I have duplicated my main program using a simple one, and using C streams instead of C++ iostreams, and everything works just fine by fclosing stderr (there are no child processes in that simplified test though).
Edit: hmm ... do I need to close the original pipe fd after dup2'ing it onto channel 2? I didn't do that, so the underlying pipe still has an open fd attached. Aha ... that's the answer!
When you duplicate a file descriptor with dup2 the original descriptor remains a valid reference to the underlying file. The file won't be closed and the associated resources freed until all file descriptors associated with a particular file are closed (with close).
If you are using dup2 to copy a file descriptor to a well known number (such as 2 for stderr), you usually want to call close on the original file descriptor immediately after a successful dup2.
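For example, a hedged sketch of that pattern as applied to the asker's fd 2 capture (the function name is mine):

#include <unistd.h>

/* Returns the read end of the pipe, or -1 on error. */
int capture_fd2(void)
{
    int pfd[2];
    if (pipe(pfd) == -1)
        return -1;

    if (dup2(pfd[1], 2) == -1)   /* fd 2 now refers to the pipe's write end */
        return -1;
    close(pfd[1]);               /* close the original descriptor right away;  */
                                 /* otherwise this extra writer keeps the pipe */
                                 /* open and the reader never sees EOF         */
    return pfd[0];
}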
The underlying streams used by the standard C++ stream objects are the same as those controlled by the corresponding stdio files. That is, if you fclose(stderr) you also close the stream used by std::cerr. And since you seem to be playing with the various dup() functions, you can also close() file descriptor 2 to close this stream.
The best approach is to put a wrapper around your resource and have the destructor close it when it goes out of scope. See the presentation from Bjarne Stroustrup.
I am programming a shell in C++. It needs to be able to pipe the output from one thing to another. For example, in Linux, you can pipe a text file to more by doing cat textfile | more.
My function to pipe one thing to another is declared like this:
void pipeinput(string input, string output);
I send "cat textfile" as the input, and "more" as the output.
In C++ examples that show how to make pipes, fopen() is used. What do I send as my input to fopen()? I have seen C++ examples of piping using dup2 and without using dup2. What's dup2 used for? How do you know if you need to use it or not?
Take a look at popen(3), which is a way to avoid execvp.
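For instance, a minimal popen() sketch (the command is the question's example; this avoids the explicit pipe()/fork()/dup2()/exec dance entirely):

#include <stdio.h>

int main(void)
{
    FILE *p = popen("cat textfile", "r");   /* read the command's stdout */
    if (p == NULL)
        return 1;

    char line[256];
    while (fgets(line, sizeof line, p) != NULL)
        fputs(line, stdout);                /* here, just echo it */

    pclose(p);                              /* waits for the command to finish */
    return 0;
}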
For a simple, two-command pipeline, the function interface you propose may be sufficient. For the general case of an N-stage pipeline, I don't think it is flexible enough.
The pipe() system call is used to create a pipe. In context, you will be creating one pipe before forking. One of the two processes will arrange for the write end of the pipe to become its standard output (probably using dup2()), and will then close both of the file descriptors originally returned by pipe(). It will then execute the command that writes to the pipe (cat textfile in your example). The other process will arrange for the read end of the pipe to become its standard input (probably using dup2() again), and will then close both of the file descriptors originally returned by pipe(). It will then execute the command that reads from the pipe (more in your example).
Of course, there will still be a third process around - the parent shell process - which forked off a child to run the entire pipeline. You might decide you want to refine the mechanisms a bit if you want to track the statuses of each process in the pipeline; the process organization is then a bit different.
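To make that concrete, here is a hedged sketch of the simple two-command case (the commands are the question's example; error handling is minimal and the shell-refinement points above are ignored):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];
    if (pipe(pfd) == -1)
        return 1;

    pid_t writer = fork();
    if (writer == 0) {                       /* "cat textfile" writes to the pipe */
        dup2(pfd[1], STDOUT_FILENO);
        close(pfd[0]);
        close(pfd[1]);
        execlp("cat", "cat", "textfile", (char *)NULL);
        _exit(127);                          /* only reached if exec fails */
    }

    pid_t reader = fork();
    if (reader == 0) {                       /* "more" reads from the pipe */
        dup2(pfd[0], STDIN_FILENO);
        close(pfd[0]);
        close(pfd[1]);
        execlp("more", "more", (char *)NULL);
        _exit(127);
    }

    /* the shell itself must close both ends, or "more" never sees EOF */
    close(pfd[0]);
    close(pfd[1]);
    waitpid(writer, NULL, 0);
    waitpid(reader, NULL, 0);
    return 0;
}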
fopen() is not used to create pipes. (fdopen() can be used to wrap a pipe's file descriptor in a stdio stream, but it is not necessary to do so.)
Pipes are created with the pipe(2) call, before forking off the process. The subprocess has a little bit of file descriptor management to do before execing the command. See the example in pipe's documentation.