Determine the size of a pipe without calling read() - c++

I need a function called SizeOfPipe() which should return the size of a pipe - I only want to know how much data is in the pipe and not actually read data off the pipe itself.
I thought the following code would work:
fseek (pPipe, 0 , SEEK_END);
*pBytes = ftell (pPipe);
rewind (pPipe);
but fseek() doesn't work on file descriptors. Another option would be to read the pipe and then write the data back, but I would like to avoid this if possible. Any suggestions?

Depending on your unix implementation ioctl/FIONREAD might do the trick
err = ioctl(pipedesc, FIONREAD, &bytesAvailable);
Unless this returns the error code for "invalid argument" (or any other error), bytesAvailable contains the amount of data available for non-blocking read operations at that time.
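A minimal sketch of a SizeOfPipe() built on that call (the descriptor name pipedesc is illustrative, and the header holding FIONREAD varies; on Linux it comes via <sys/ioctl.h>, other systems may need <sys/filio.h>):
#include <sys/ioctl.h>
#include <errno.h>

// Returns 0 on success and stores the readable byte count in *pBytes.
int SizeOfPipe(int pipedesc, int *pBytes)
{
    if (ioctl(pipedesc, FIONREAD, pBytes) == -1)
        return errno;          // e.g. EINVAL if FIONREAD isn't supported here
    return 0;
}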

Some UNIX implementations return the number of bytes that can be read in the st_size field after calling fstat(), but this is not portable.
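A minimal sketch of that check (whether st_size is meaningful for a pipe depends entirely on the platform; pipe_fd and bytes_available are illustrative names):
#include <sys/stat.h>

struct stat st;
if (fstat(pipe_fd, &st) == 0)
    bytes_available = st.st_size;   // meaningful for pipes only on some UNIX implementations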

Unfortunately the system cannot always know the size of a pipe - for example if you are piping a long-running process into another command, the source process may not have finished running yet. In this case there is no possible way (even in theory) to know how much more data is going to come out of it.
If you want to know the amount of data currently available to read out of the pipe that might be possible, but it will depend on OS buffering and other factors which are hard to control. The most common approach here is just to keep reading until there's nothing left to come (if you don't get an EOF then the source process hasn't finished yet). However I don't think this is what you are looking for.
So I'm afraid there is no general solution.

It's not in general possible to know the amount of data you can read from a pipe just from the pipe handle alone. The data may be coming in across a network, or being dynamically generated by another process. If you need to know up front, you should arrange for the information to be sent to you - through the pipe, or out of band - by whatever process is at the other end of the pipe.

There is no generic, portable way to tell how much data is available in a pipe without reading it. At least not under POSIX specifications.
Pipes are not seekable, and neither is it possible to put the data back into the reading end of a pipe.
Platform-specific tricks might be possible, though. If your question is platform-specific, editing your question to say so might improve your chances to get a working answer.

It's almost never necessary to know how many bytes are in the pipe: perhaps you just want to do a non-blocking read() on the pipe, i.e. check if there are any bytes ready, and if so, read them, but never stop and wait for the pipe to be ready.
You can do that in two steps. First, use the select() system call to find out whether data is available or not. An example is here: http://www.developerweb.net/forum/showthread.php?t=2933
Second, if select tells you data is available, call read() once, and only once, with a large block size. It will read only as many bytes as are available, or up to the size of your block, whichever is smaller. If select() says data is ready, read() will always return right away.
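A rough sketch of that two-step pattern (the descriptor name pipe_fd and the 4096-byte block size are just for illustration):
#include <sys/select.h>
#include <unistd.h>

fd_set rfds;
FD_ZERO(&rfds);
FD_SET(pipe_fd, &rfds);
struct timeval tv = {0, 0};                            // zero timeout: return immediately

if (select(pipe_fd + 1, &rfds, NULL, NULL, &tv) > 0) {
    char block[4096];
    ssize_t n = read(pipe_fd, block, sizeof block);    // returns at most what is available
    // process the n bytes read...
}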

I don't think it is possible - isn't the point of a pipe to provide interprocess communication between the two ends (in one direction)? If I'm correct in that assertion, the sender may not yet have finished pushing data into the pipe, so it would be impossible to determine the length.
What platform are you using?

I do not think it's possible. Pipes present a stream-oriented protocol rather than a packet-oriented one. IOW, if you write to a pipe twice, once with, say, 250 bytes and once with, say, 520 bytes, there is no way to tell how many bytes you'll get from the other end in one read request. You could get 256, 256, and then the rest.
If you need to impose packets on a pipe, you need to do it yourself by writing a pre-determined (or delimited) number of bytes as the packet length, and then the rest of the packet. Use select() to find out if there is data to read, and use read() to fill a reasonably-sized buffer. When you have your buffer, it's your responsibility to determine the packet boundaries.

If you want to know the amount of data that is expected to arrive, you could always write the size of the message at the beginning of every message sent over the pipe.
So, for example, write 4 bytes at the start of every message containing the length of your data, and then read those first 4 bytes before reading the rest.
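A sketch of that length-prefix scheme (it assumes both ends agree on a 4-byte host-order length; real code would pick a fixed byte order and loop on short reads/writes; pipe_fd, data, and data_size are illustrative names):
#include <unistd.h>
#include <stdint.h>

// Writer side: length first, then the payload.
uint32_t len = data_size;
write(pipe_fd, &len, sizeof len);
write(pipe_fd, data, data_size);

// Reader side: read the 4-byte header, then you know how much follows.
uint32_t expected;
read(pipe_fd, &expected, sizeof expected);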

There is no portable way to tell the amount of data coming from a pipe.
The only thing you could do is to read and process data as it comes.
For that you could use something like a circular buffer.
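A bare-bones circular buffer sketch along those lines (fixed size, no overflow handling; all names are illustrative):
#include <stddef.h>

#define RING_SIZE 4096

struct ring {
    char   buf[RING_SIZE];
    size_t head, tail, count;       // write index, read index, bytes stored
};

// Append n bytes (e.g. just read from the pipe) into the ring; caller checks for space.
static void ring_put(struct ring *r, const char *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        r->buf[r->head] = src[i];
        r->head = (r->head + 1) % RING_SIZE;
        r->count++;
    }
}

// Consume up to n bytes from the ring; returns how many were copied out.
static size_t ring_get(struct ring *r, char *dst, size_t n) {
    size_t i = 0;
    while (i < n && r->count > 0) {
        dst[i++] = r->buf[r->tail];
        r->tail = (r->tail + 1) % RING_SIZE;
        r->count--;
    }
    return i;
}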

You can wrap the pipe in an object with buffering that can be rewound. This would be feasible only for small amounts of data.
One way to do this in C is to define a struct and wrap all functions operating on pipes for your struct.
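One possible shape for such a wrapper (a sketch only; it keeps everything read so far in memory so rewinding works, which is why it only suits small amounts of data; all names and the 8 KB limit are made up):
#include <unistd.h>
#include <string.h>

struct rewindable_pipe {
    int    fd;                 // the real pipe descriptor
    char   saved[8192];        // everything read so far
    size_t len;                // bytes stored in saved[]
    size_t pos;                // current read position inside saved[]
};

// Read through the wrapper: serve previously read bytes first, then the pipe.
ssize_t rp_read(struct rewindable_pipe *p, char *dst, size_t n) {
    if (p->pos < p->len) {                               // replay buffered data
        size_t avail = p->len - p->pos;
        size_t take  = n < avail ? n : avail;
        memcpy(dst, p->saved + p->pos, take);
        p->pos += take;
        return (ssize_t)take;
    }
    ssize_t got = read(p->fd, dst, n);                   // fresh data from the pipe
    if (got > 0 && p->len + (size_t)got <= sizeof p->saved) {
        memcpy(p->saved + p->len, dst, (size_t)got);     // remember it for rewinds
        p->len += (size_t)got;
        p->pos  = p->len;
    }
    return got;
}

void rp_rewind(struct rewindable_pipe *p) { p->pos = 0; }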

As many have answered, you cannot portably tell how many bytes there are to read. OTOH, what you can do is poll the pipe for data to read. First, be sure to open the pipe with O_RDWR|O_NONBLOCK - POSIX mandates that a pipe be open for both read and write to be able to poll it.
Whenever you want to know if there is data available, just select/poll for data to read. You can also find out whether the pipe is full by checking for write, but see the note below: depending on the size of the write it may be inaccurate.
You won't know how much data there is, but keep in mind that writes of up to PIPE_BUF bytes are guaranteed to be atomic, so if you're concerned about having a full message on the pipe, just make sure messages fit within that or split them up.
Note: when you select for write, even if poll/select says you can write to the pipe, a write <= PIPE_BUF will return EAGAIN if there isn't enough room for the full write. I have no idea how to tell if there is enough room to write... that is what I was looking for (I may end up padding with \0's to PIPE_BUF size... in my case it's just for testing anyway).
I have an old example Perl app that can read one or more pipes in non-blocking mode, OCP_Daemon. The code is pretty close to what you would do in C using an event loop.
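A minimal C sketch of the approach described above (the FIFO path and the one-second timeout are made up; error handling omitted):
#include <fcntl.h>
#include <limits.h>
#include <poll.h>
#include <unistd.h>

int fd = open("/tmp/myfifo", O_RDWR | O_NONBLOCK);   // open R/W so poll() behaves as described

struct pollfd pfd = { fd, POLLIN, 0 };
if (poll(&pfd, 1, 1000) > 0 && (pfd.revents & POLLIN)) {
    char buf[PIPE_BUF];
    ssize_t n = read(fd, buf, sizeof buf);            // won't block; returns what is queued
    // handle the n bytes...
}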

On Windows you can always use PeekNamedPipe, but I doubt that's what you want to do anyway.
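For completeness, a sketch of the Windows call (hPipe is whatever pipe handle you already have; passing a NULL buffer just queries the byte count):
#include <windows.h>

DWORD bytesAvail = 0;
if (PeekNamedPipe(hPipe, NULL, 0, NULL, &bytesAvail, NULL)) {
    // bytesAvail now holds the number of bytes waiting in the pipe
}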

Related

Thread-safe file updates

I need to learn how to update a file concurrently without blocking other threads. Let me explain how it should work, what the requirements are, and how I think it should be implemented; then I'll ask my questions:
Here is how the worker works:
The worker is multithreaded.
There is one very large file (6 terabytes).
Each thread is updating part of this file.
Each write is equal to one or more disk blocks (4096 bytes).
No two workers write to the same block (or the same group of blocks) at the same time.
Needs:
Threads should not block each other (no lock on the file, or the minimum possible number of locks should be used)
In case of (any kind of) failure, it is acceptable if the block being updated gets corrupted.
In case of (any kind of) failure, blocks that are not being updated must not get corrupted.
If a file write was successful, we must be sure that it was not merely buffered but actually written to disk (fsync)
I can convert this large file into as many smaller files as needed (down to 4 KB files), but I prefer not to do that. Handling that many files is difficult, and it requires a lot of file handle open/close operations, which hurts performance.
How I think it should be implemented:
I'm not very familiar with file manipulation and how it works at the operating-system level, but I think writing to a single block should not corrupt other blocks when errors happen. So I think this code should work perfectly as needed, without any change:
char write_value[4096];                 /* 4096 bytes of data to write */
long write_block = 12345;
long block_size = 4096;
FILE *fp;
fp = fopen("file.txt", "r+");           /* "r+" updates in place; "w+" would truncate the file */
fseek(fp, write_block * block_size, SEEK_SET);
fwrite(write_value, 1, block_size, fp);
fsync(fileno(fp));                      /* fsync() takes a file descriptor, not a FILE* */
fclose(fp);
Questions:
Obviously, I'm trying to understand how it should be implemented. So any suggestions are welcome. Specially:
If writing to one block of a large file fails, what is the chance of corrupting other blocks of data?
In short, what should be considered to perfect the code above (with regard to the previous question)?
Is it possible to replace one block of data with another file/block atomically? (like how rename() system call replaces one file with another atomically, but in block-level. Something like replacing next-block-address of previous block in file system or whatever else).
Any device/file system/operating system specific notes? (This code will run on CentOS/FreeBSD (not decided yet), but I can change the OS if there is better alternative for this problem. File is on one 8TB SSD).
Threads should not block each other (no lock on the file, or the minimum possible number of locks should be used)
Your code sample uses fseek followed by fwrite. Without locking between those two calls, you have a race condition because another thread could jump in between. There are three reasonable solutions:
Use flockfile, followed by regular fseek and fwrite_unlocked, then funlockfile (flockfile/funlockfile are POSIX.1-2001; fwrite_unlocked is a GNU extension)
Use separate file handles per thread
Use pread and pwrite to do IO without having to worry about the seek position
Option 3 is the best for you.
You could also use the asynchronous IO from <aio.h> to handle the multithreading. It basically works with a thread-pool calling pwrite on most Unix implementations.
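A sketch of option 3 (pwrite takes an explicit offset, so there is no shared seek position to race on; the names fd, block, and blk_no and the fixed 4096-byte size are illustrative):
#include <unistd.h>
#include <stdint.h>

// Each thread can call this concurrently on the same descriptor.
int write_block_at(int fd, const char *block, uint64_t blk_no) {
    off_t off = (off_t)blk_no * 4096;
    if (pwrite(fd, block, 4096, off) != 4096)
        return -1;                  // short write or error
    return fsync(fd);               // force the block to disk, per the requirements
}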
In case of (any kind of) failure, it is acceptable if the block being updated gets corrupted
I understand this to mean that there should be no file corruption in any failure state. To the best of my knowledge, that is not possible when you overwrite data. When the system fails in the middle of a write command, there is no way to guarantee how many bytes were written, at least not in a file-system agnostic version.
What you can do instead is similar to a database transaction: You write the new content to a new location in the file. Then you do an fsync to ensure it is on disk. Then you overwrite a header to point to the new location. If you crash before the header is written, your crash recovery will see the old content. If the header gets written, you see the new content. However, I'm not an expert in this field. That final header update is a bit of a hand-wave.
In case of (any kind of) failure, blocks that are not being updated must not get corrupted.
Should be fine
If a file write was successful, we must be sure that it was not merely buffered but actually written to disk (fsync)
Your sample code calls fsync, but forgets to call fflush before it. Alternatively, you can set the stream to unbuffered using setvbuf.
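In other words, either of these (sketched assuming fp is the FILE* from the sample above):
fflush(fp);              // push stdio's buffer into the kernel first
fsync(fileno(fp));       // then force the kernel's buffers to disk

// ...or disable stdio buffering for this stream entirely:
setvbuf(fp, NULL, _IONBF, 0);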
I can convert this large file into as many smaller files as needed (down to 4 KB files), but I prefer not to do that. Handling that many files is difficult, and it requires a lot of file handle open/close operations, which hurts performance.
Many calls to fsync will kill your performance anyway. Short of reimplementing database transactions, this seems to be your best bet to achieve maximum crash recovery. The pattern is well documented and understood:
Create a new temporary file on the same file system as the data you want to overwrite
Read-Copy-Update the old content to the new temporary file
Call fsync
Rename the new file to the old file
Renaming within a single file system is atomic. Therefore this procedure ensures that after a crash you get either the old data or the new.
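A minimal sketch of that pattern (the file names are placeholders and error handling is omitted):
#include <stdio.h>
#include <unistd.h>

FILE *tmp = fopen("data.txt.tmp", "w");   // 1. temporary file on the same file system
// 2. write the updated content into tmp ...
fflush(tmp);
fsync(fileno(tmp));                       // 3. make sure it is really on disk
fclose(tmp);
rename("data.txt.tmp", "data.txt");       // 4. atomic replace of the old file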
If writing to one block of a large file fails, what is the chance of corrupting other blocks of data?
None.
Is it possible to replace one block of data with another file/block atomically? (like how rename() system call replaces one file with another atomically, but in block-level. Something like replacing next-block-address of previous block in file system or whatever else).
No.

Can I create a Handle without a file?

I want to create a dump in Windows with the function MiniDumpWriteDump. The problem is that that function takes a handle to a file to write the result to. I want the data in memory so that I can send it over the internet. Therefore, I was wondering if there is a way to create a handle without a file backing it, so that I can just get a pointer to the data?
You can use memory mapped files. See here: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366537(v=vs.85).aspx
You need to pass hFile = INVALID_HANDLE_VALUE and specify the maximum size of the mapping. Please check MSDN for the details.
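A sketch of the pagefile-backed mapping this answer describes (the 64 MB maximum size is an arbitrary example; whether MiniDumpWriteDump accepts anything other than a real file handle is a separate question to verify):
#include <windows.h>

HANDLE hMap = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                0, 64 * 1024 * 1024,   // maximum size: 64 MB, backed by the pagefile
                                NULL);
void *view = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);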
There are a couple of possibilities.
One would be to use CreateFile, but pass FILE_ATTRIBUTE_TEMPORARY. This will create a file, but tells Windows to attempt to keep as much of the file in the cache as possible. While this doesn't completely avoid creating a file, if you have enough memory it can often eliminate any (or much, anyway) I/O to/from the disk from happening.
Another possibility (though one I've never tested) would be to pass a handle to a named (or maybe even an anonymous) pipe. You can generally write to a pipe like you would a file, so as long as the crash dump writer just needs to be able to pass the handle to WriteFile, chances are pretty good this will work fine. From there, you could (for example) have another small program that would read the data from the pipe and write it to a socket. Obviously it would be nice to be able to avoid the extra processing to translate from pipe to socket, but such is life some times.
If you haven't tried it, you might want to test with just passing a socket handle to the crash dump writer. Although it's somewhat limited, Windows does support treating a socket handle like a normal file (or whatever) handle. There's certainly nothing close to a guarantee that it'll work, but it may be worth a shot anyway.
The crash dump is essentially the process's memory, so keeping it in memory doesn't make much sense. Why don't you simply send the file and delete it after a successful send?
By the way, you can compress the file before sending it, because crash dumps are usually big files.
The documentation says to pass a file handle, so if you do anything else you're breaking the contract and (if it works at all) the behaviour will not be reliable.
Pass a named pipe handle. Pipe the data back to yourself.

C++ reading from FIFO without memory operations

I'd like to use a FIFO buffer from a C++ code.
There are two processes, one of them always writes to the FIFO, the other always reads it. It's very simple.
I don't really need to read the contents of the buffer, I just want to know how much data is in it, and clear it.
What is the most sophisticated solution for this in C++?
This code works well, but I never need the buffer's contents:
int num;
char buffer[32];
num = read(FIFO, buffer, sizeof(buffer));
//num is the important variable
Thank you!
You could take a look at this question: Determing the number of bytes ready to be recv()'d
On Linux, code for sockets should work with minimal effort on FIFOs too. On Windows, though, I'm not sure.
The only way to clear a pipe is to read it, so the question of how many bytes are present is moot - you'll know after you read them. The real issue ends up being the same as for any read:
(1) If you don't care about the data then presumably you don't want to block waiting for it so make the FIFO non-blocking.
(2) Since you presumably don't want to sit and waste time polling the FIFO to see if there is something to read, you should put the FIFO fd in a select() call. When there is something to read, drain it and add the count to a counter, as in the sketch below.
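Roughly like this (a sketch only; fifo_fd is assumed to be open in non-blocking mode as in point (1)):
#include <sys/select.h>
#include <unistd.h>
#include <errno.h>

fd_set rfds;
FD_ZERO(&rfds);
FD_SET(fifo_fd, &rfds);

long total = 0;
if (select(fifo_fd + 1, &rfds, NULL, NULL, NULL) > 0) {
    char scratch[4096];
    ssize_t n;
    while ((n = read(fifo_fd, scratch, sizeof scratch)) > 0)
        total += n;                           // count the bytes, discard the contents
    // n == -1 with errno == EAGAIN means the FIFO is drained for now
}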
As far as I am aware, the only way to clear bytes from a Linux FIFO (short of destroying the FIFO) is to read them out. You can clear them out faster by reading larger amounts of data at a time (32 is a very small read size, unless that is the size that is normally written to the FIFO). If you are in blocking mode, then you should query for the bytes as described in the link indicated by Robert Mason. If the descriptor is in non-blocking mode, you can read until EAGAIN is returned to know it was cleared. You may use poll to determine when more data has arrived on the FIFO.
Not sure if I've got you right, sophisticated - do you mean the most efficient, or the most obfuscated?
Anyway, if you don't need the buffer contents - you may just use a (shared) interlocked variable.

When and how read/write blocks (i.e. suspends your program) for different kind of files?

I think I am not clear about when and how read/write blocks for various kind of files.
(disk file, pipe, socket, FIFO)
Could any one explain for both read and write scenarios of each file type?
Thanks!!
For a disk-based file, read and write may block briefly while the requested read/write is performed. A read at the end of a file will always return a short result, and a write to a file on a full FS will fail -- barring various unusual circumstances, read/write to a plain file will never block indefinitely.
For pipes, sockets, and FIFOs, read will block if no data is available, and write will block if the pipe/socket/FIFO is "full" (e.g, you've written a bunch of data and the process on the other end hasn't read it yet). The exact amount of data required to fill the buffer is variable; for a pipe, for instance, it's typically between 4 and 64 kB.

Reading from a socket 1 byte a time vs reading in large chunk

What's the difference - performance-wise - between reading from a socket 1 byte a time vs reading in large chunk?
I have a C++ application that needs to pull pages from a web server and parse the received page line by line. Currently, I'm reading 1 byte at a time until I encounter a CRLF or the max of 1024 bytes is reached.
If reading in large chunk(e.g. 1024 bytes at a time) is a lot better performance-wise, any idea on how to achieve the same behavior I currently have (i.e. being able to store and process 1 html line at a time - until the CRLF without consuming the succeeding bytes yet)?
EDIT:
I can't afford too big a buffer. I'm under a very tight code budget, as the application runs on an embedded device. I prefer keeping only one fixed-size buffer, preferably holding one HTML line at a time. This makes my parsing and other processing easy: any time I access the buffer for parsing, I can assume I'm processing one complete HTML line.
Thanks.
I can't comment on C++, but from other platforms - yes, this can make a big difference; particularly in the number of switches the code needs to do, and the number of times it needs to worry about the async nature of streams etc.
But the real test is, of course, to profile it. Why not write a basic app that churns through an arbitrary file using both approaches, and test it for some typical files... the effect is usually startling, if the code is IO bound. If the files are small and most of your app runtime is spent processing the data once it is in memory, you aren't likely to notice any difference.
If you are reading directly from the socket, and not from an intermediate higher-level representation that can be buffered, then without any possible doubt it is simply better to read the full 1024 bytes, put them in a buffer in RAM, and then parse the data from RAM.
Why? Reading on a socket is a system call, and it causes a context switch on each read, which is expensive. Read more about it: IBM Tech Lib: Boost socket performances
First and simplest:
cin.getline(buffer,1024);
Second, usually all IO is buffered so you don't need to worry too much.
Third, CGI process startup usually costs much more than input processing (unless it is a huge file)... So you may just not think about it.
G'day,
One of the big performance hits by doing it one byte at a time is that your context is going from user time into system time over and over. And over. Not efficient at all.
Grabbing one big chunk, typically up to an MTU size, is measurably more efficient.
Why not scan the content into a vector and iterate over that looking out for \n's to separate your input into lines of web input?
HTH
cheers,
You are not reading one byte at a time from a socket, you are reading one byte at a time from the C/C++ I/O system, which, if you are using CGI, will have already buffered up all the input from the socket. The whole point of buffered I/O is to make the data available to the programmer in a way that is convenient for them to process, so if you want to process one byte at a time, go ahead.
Edit: On reflection, it is not clear from your question whether you are implementing CGI or just using it. You could clarify this by posting a code snippet which shows how you currently read that single byte.
If you are reading the socket directly, then you should simply read the entire response to the GET into a buffer and then process it. This has numerous advantages, including performance and ease of coding.
If you are limited to a small buffer, then use a classic buffering algorithm like:
#include <unistd.h>

static char buf[1024];
static size_t pos = 0, len = 0;

int getbyte(int fd) {
    if (pos == len) {                          // buffer is empty: refill it
        ssize_t n = read(fd, buf, sizeof buf);
        if (n <= 0) return -1;                 // EOF or error
        len = (size_t)n; pos = 0;              // reset buffer pointer to start of buffer
    }
    return (unsigned char)buf[pos++];          // byte at buffer pointer, then advance
}
You can open the socket file descriptor with the fdopen() function. Then you have buffered IO, so you can call fgets() or similar on that descriptor.
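For example (a sketch assuming a POSIX system; sockfd is your connected socket):
#include <stdio.h>

FILE *f = fdopen(sockfd, "r");          // stdio now buffers reads from the socket
char line[1024];
while (fgets(line, sizeof line, f)) {
    // 'line' holds one header/body line, up to and including "\r\n"
}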
There is no difference at the operating system level, data are buffered anyway. Your application, however, must execute more code to "read" bytes one at a time.