Multi-processing and file operations?


On a Windows-based OS, assume several different processes may read and/or write a file frequently using fopen/fopen_s/fwrite etc. In that case, do I need to consider data races, or does the OS handle this automatically, ensuring the file can only be opened/updated by a single process at any given time while the other fopen attempts fail? And what about Linux-based OSes?

In Windows it depends on how you open the file.
See the possible values for the uStyle parameter of OpenFile and the dwShareMode parameter of CreateFile.
Please note that OpenFile is more or less deprecated, so it is better to use CreateFile.

You will have to take care to not open the same file from multiple threads simultaneously - as it's entirely possible to open the file multiple times, and the OS may or may not do what you expect, depending on the mode you are opening the file in - e.g. if you create a new file it will definitely create two different files (one of which will disappear when it gets closed, because it was deleted by the other thread, great, eh?). The rules are pretty complex, and the worst part is that if you don't take extra care, you'll get "mixed up output to the same file" - so lines or even parts of lines get mixed from the two threads.
Even if the OS stops you from opening the same file twice, you will still have to deal with the consequences of "FILE * came back as NULL". What do you do then? Go back and try again, or fail, or?
I'm not sure I can make a good suggestion as to HOW to solve this problem, since you haven't described very well what you are doing to these files. There are a few different things that come to mind:
Keep a "register" of file-names, and a mutex for each file that has to be held to be able to open the file.
Use a single "file-thread" to read/write data on files, and just queue "I want to write this stuff to file aa.txt", and let the worker write as it goes along.
Use lower level file system calls, and use "exclusive" access to the files, with some sort of "backoff" behaviour in case of collision.
I'm sure there are dozens of other ways to solve the problem - it really depends on what you are trying to do.
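A minimal sketch of the first suggestion (a register of file names with one mutex per file), assuming all writers are threads inside a single process. A std::mutex does nothing across separate processes, so this does not cover the multi-process case from the question. The class name FileLockRegistry and its append_line member are hypothetical names chosen for illustration.

```cpp
#include <fstream>
#include <map>
#include <mutex>
#include <string>

// Hypothetical helper: one mutex per file name, shared by all threads
// of this process. It offers no protection against other processes.
class FileLockRegistry {
public:
    // Appends one line to `path` while holding the mutex registered for it.
    bool append_line(const std::string& path, const std::string& line) {
        std::mutex& file_mutex = mutex_for(path);
        std::lock_guard<std::mutex> hold(file_mutex);

        std::ofstream out(path, std::ios::app);
        if (!out) return false;          // the "FILE* came back as NULL" case
        out << line << '\n';
        out.flush();
        return static_cast<bool>(out);
    }

private:
    // Looks up (or creates) the mutex for a path. std::map never moves its
    // elements, so the returned reference stays valid.
    std::mutex& mutex_for(const std::string& path) {
        std::lock_guard<std::mutex> hold(registry_mutex_);
        return registry_[path];
    }

    std::mutex registry_mutex_;
    std::map<std::string, std::mutex> registry_;
};
```

Each thread would call something like registry.append_line("aa.txt", "some text") on a shared FileLockRegistry instance, so that whole lines from different threads cannot interleave.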

Maybe. If you're talking about different processes (and not
threads), the conventional data race conditions which apply to
threads don't apply. However (and there is no difference
between Unix and Windows here):
Any single write/WriteFile operation will be atomic. (I'm
not 100% sure concerning Windows, but I can't imagine it
otherwise.) However, if you're using iostream or the older
FILE* functions, you don't have direct control of when those
operations take place. Normally, they will only occur when the
stream's buffer is full. You'll want to ensure that the buffer
is big enough, and explicitly flush after each output. (If
you're outputting lines of a reasonable length, say 80
characters at the most, it's a safe bet that the buffer will
hold a complete line. In this case, just use std::endl to
terminate the lines in iostreams; for the C style functions,
you'll have to call setvbuf( stream, NULL, _IOLBF, 0 ) before
the first output.)
Each open file in the process has its own idea of where to
write in the file, and its own idea of where the end of file is.
If you want all writes to go to the end of file, you'll need to
open it with std::ios_base::app in C++, or "a" in C. Just
std::ios_base::out/"w" is not enough. (Also, of course,
with just std::ios_base::out or "w", the file will be
truncated when it is opened. Having several different processes
truncating the file could result in loss of data.)
When reading a file that other processes are writing to: when
you reach end of file, the stream or FILE goes into an error
state, and will not try to read further, even if other processes
are appending data. In C, clearerr should (I think) undo
this, but it's not clear what happens next; in C++, clearing the
error in the stream doesn't necessarily mean that further reads
will not immediately encounter end of file either. In both
cases, the safest bet is to memorize where you were before each
read, and if the read fails, close the file, then later reopen
it, seek to where you were, and start reading from there.
Random access, writing other than at the end of file, will
also work, as long as all writes are atomic (see above); you
should always get a consistent state. If what you write depends
on what you have read, however, and other processes are doing
something similar, you'll need file locking, which isn't
available at the iostream/FILE* level.
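Two of the points above can be sketched together: a writer that opens in append mode and flushes once per complete line, and a reader that remembers its position, reopens the file, and seeks back to it. This is only an illustration under stated assumptions: the function names append_line and read_new_lines are hypothetical, both sides open the file in binary mode so that byte offsets match on every platform, and lines are terminated by a single '\n'.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Writer side: append mode, one flush per complete line.
bool append_line(const std::string& path, const std::string& text) {
    std::ofstream out(path, std::ios::app | std::ios::binary);
    if (!out) return false;
    out << text << '\n';
    out.flush();
    return static_cast<bool>(out);
}

// Reader side: reopen the file, seek to the remembered offset, and return
// only complete lines. `offset` is advanced past every line returned, so
// the next call resumes where this one stopped.
std::vector<std::string> read_new_lines(const std::string& path,
                                        std::streamoff& offset) {
    std::vector<std::string> lines;
    std::ifstream in(path, std::ios::binary);
    if (!in) return lines;
    in.seekg(offset, std::ios::beg);

    std::string line;
    while (std::getline(in, line)) {
        if (in.eof()) break;  // no trailing '\n' yet: leave it for next time
        lines.push_back(line);
        offset += static_cast<std::streamoff>(line.size()) + 1;
    }
    return lines;
}
```

A reader would keep one std::streamoff per file, starting at 0, and call read_new_lines periodically. Reopening on every call is the conservative choice described above; it avoids relying on what a stream does after its end-of-file state has been cleared.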

Related

Why think in terms of buffers and not lines? And can fgets() read multiple lines at one time?

Recently I have been reading Unix Network Programming Vol.1. In section 3.9, the last two paragraphs above Figure 3.18 said, here I quote:
...But our advice is to think in terms of buffers and not lines. Write your code to read buffers of data, and if a line is expected, check the buffer to see if it contains that line.
And in the next paragraph, the authors gave a more specific example, here I quote:
...as we'll see in Section 6.3. System functions like select still won't know about readline's internal buffer, so a carelessly written program could easily find itself waiting in select for data already received and stored in readline's buffers.
In section 6.5, the actual problem is "mixing of stdio and select()", which would make the program, here I quote the book, "error-prone". But how?
I know that the authors gave the answer later in the same section and according to my understanding to the book, it is because of the data being hidden from select() and thus select() could not know that the data that has been read is consumed or not.
The answer is literally there, but the first problem here is that I really have a hard time getting it, I cannot imagine what damage would it make to the program, maybe I need a demo program that suffers from the problem to help me understand it.
Still in section 6.5, the authors tried to explain the problem further by giving, here I quote:
... Consider the case when several lines of input are available from the standard input.
select will cause the code at line 20 to read the input using fgets and that, in turn, will read the available lines into a buffer used by stdio. But, fgets only returns a single line and leaves any remaining data sitting in the stdio buffer ...
The "line 20" mentioned above is:
if (Fgets(sendline, MAXLINE, fp) == NULL)
where sendline is an array of char and fp is a pointer to FILE. I looked up into the detailed implementation of Fgets, and it just wrapped fgets() up with some extra error-dealing logic and nothing more.
And here comes my second question, how does fgets manage to, here I quote again, read the available lines? I mean, I looked up the man-page of fgets, it says fgets normally stops on the first newline character. Doesn't this mean that only one line would be read by fgets? More specifically, if I type one line in the terminal and press the enter key, then fgets reads this exact line. I do this again, then the next new line is read by fgets, and the point is one line at a time.
Thanks for your patience in reading all the descriptions, and looking forward to your answers.

One of the main reasons to think about buffers rather than lines (when it comes to network programming) is because TCP is a streaming protocol, where data is just a stream of bytes beginning with a connection and ending with a disconnection.
There are no message boundaries, and there are no "lines", except what the application-level protocol on top of TCP has decided.
That makes it impossible to read a "line" from a TCP connection, there are no such primitive functions for it. You must read using buffers. And because of the streaming and the lack of any kind of boundaries, a single call to receive data may give your application less than you ask for, and it may be a partial application-level message. Or you might get more than a single message, including a partial message at the end.
Another important point is that sockets are blocking by default, so a socket that doesn't have any data ready to be received will cause any read call to block and wait until there is data. The select call only tells you whether a read call would block right now. If you do the read call multiple times in a loop, it can (and ultimately will) block when the data to receive is exhausted.
All this makes it really hard to use high-level functions like fgets (after an fdopen call, of course) to read data from TCP sockets, as it can block at any time if you use a blocking socket. Or it can return with a failure if you use non-blocking sockets and the read call returns with the failure that it would block (yes, that is returned as an error).
If you use your own buffering, you can use select in the same loop as read or recv, to make sure that the call won't block. Or if you use non-blocking sockets, you can gather data (and append it to your buffer) with single read calls, and detect when you have a full message (either by knowing its length or by detecting the message terminator or separator, like a newline).
As for fgets reading "multiple lines", it can cause the underlying reads to fill the buffers with multiple lines, but the fgets function itself will only fill your supplied buffer with a single line.
fgets will never give you multiple lines.

select is a Linux kernel call. It will tell you if the Linux kernel has data that your process hasn't received yet.
fgets is a C library call. To reduce the number of Linux kernel calls (which are typically slower) the C library uses buffering. It will try to read a big chunk of data from the Linux kernel (typically something like 4096 bytes) and then return just the part you asked for. Next time you call it, it will see if it already read the part you asked for, and then it won't need to read it from the kernel. For example, if it's able to read 5 lines at once from the kernel, it will return the first line, and the other 4 will be stored in the C library and returned in the next 4 calls.
When fgets reads 5 lines, returns 1, and stores 4, the Linux kernel will see that all the data has been read. It doesn't know your program is using the C library to read the data. Therefore, select will say there is no data to read, and your program will get stuck waiting for the next line, even though there already is one.
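The stall described above can be reproduced without a network. The following is a sketch under several assumptions: a POSIX system, a pipe standing in for the socket, and a C library whose stdio reads a whole buffer at a time from a pipe (typical, but not guaranteed). The struct StallDemo and the function run_stall_demo are hypothetical names.

```cpp
#include <cstring>
#include <stdio.h>        // fdopen is POSIX and declared in <stdio.h>
#include <sys/select.h>
#include <unistd.h>

struct StallDemo {
    bool select_reports_data;  // what select() says after the first fgets
    bool stdio_has_next_line;  // whether fgets can still return "line 2\n"
};

StallDemo run_stall_demo() {
    StallDemo result = {false, false};
    int fds[2];
    if (pipe(fds) != 0) return result;

    // Three lines arrive "at once", as in the book's example.
    const char text[] = "line 1\nline 2\nline 3\n";
    if (write(fds[1], text, std::strlen(text)) < 0) {
        close(fds[0]);
        close(fds[1]);
        return result;
    }

    FILE* fp = fdopen(fds[0], "r");
    if (fp == nullptr) {
        close(fds[0]);
        close(fds[1]);
        return result;
    }

    char line[64];
    // Returns only "line 1\n", but stdio has typically pulled all three
    // lines out of the kernel into its own buffer.
    if (fgets(line, sizeof line, fp) != nullptr) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(fds[0], &readable);
        struct timeval no_wait = {0, 0};  // poll, do not block
        result.select_reports_data =
            select(fds[0] + 1, &readable, nullptr, nullptr, &no_wait) > 0;

        // The data select() cannot see is still there for fgets.
        result.stdio_has_next_line =
            fgets(line, sizeof line, fp) != nullptr &&
            std::strcmp(line, "line 2\n") == 0;
    }

    fclose(fp);     // also closes fds[0]
    close(fds[1]);
    return result;
}
```

On such a system select_reports_data comes back false while stdio_has_next_line is true. A program that calls select before every fgets would therefore wait for new input even though two lines are already sitting in the stdio buffer.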
So how do you resolve this? You basically have two options: don't do buffering at all, or do your own buffering so you get to control how it works.
Option 1 means you read 1 byte at a time until you get a \n and then you stop reading. The kernel knows exactly how much data you have read, and it will be able to accurately tell you whether there's more data. However, making a kernel call for every single byte is relatively slow (measure it) and also, the computer on the other end of the connection could cause your program to freeze simply by not sending a \n at all.
I want to point out that option 1 is completely viable if you are just making a prototype. It does work, it's just not very good. If you try to fix the problems with option 1, you will find the only way to fix them is to do option 2.
Option 2 means doing your own buffering. You keep an array of say 4096 bytes per connection. Whenever select says there is data, you try to fill up the array as much as possible, and you check whether there is a \n in the array. If so, you process that line, remove the line from the array*, and repeat. This means you minimize kernel calls, and you also won't freeze if the other computer doesn't send a \n since the unfinished line will just stay in the array. If all 4096 bytes are used, and there is still no \n, you can either choose to process it as a big line (if this makes sense, e.g. in a chat program) or you can disconnect the connection, since the other computer is breaking the rules. Of course you can choose to use a bigger number than 4096.
* Extra for experts: "removing the line from the array" can be fast if you implement a "circular buffer" data structure.
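Option 2 might look roughly like the sketch below. It keeps the pending bytes in a std::string rather than a fixed array or circular buffer, which is simpler but not the fastest choice. The class name LineBuffer and its members are hypothetical, and the bytes passed to append are assumed to come from whatever read or recv call the program makes after select reports data.

```cpp
#include <cstddef>
#include <string>

// Hypothetical per-connection buffer: collect received bytes, hand out
// complete lines, keep any unfinished line for later.
class LineBuffer {
public:
    explicit LineBuffer(std::size_t max_pending = 4096)
        : max_pending_(max_pending) {}

    // Store bytes exactly as they arrived; they may hold a partial line,
    // one line, or several lines.
    void append(const char* data, std::size_t size) {
        pending_.append(data, size);
    }

    // Extracts the next complete line (without its '\n') if there is one.
    bool next_line(std::string& line) {
        const std::size_t newline = pending_.find('\n');
        if (newline == std::string::npos) return false;
        line.assign(pending_, 0, newline);
        pending_.erase(0, newline + 1);
        return true;
    }

    // True when the peer has sent more than the limit without any '\n';
    // the caller can then process the data as one big line or disconnect.
    bool overflowed() const {
        return pending_.find('\n') == std::string::npos &&
               pending_.size() > max_pending_;
    }

    std::size_t pending_bytes() const { return pending_.size(); }

private:
    std::string pending_;
    std::size_t max_pending_;
};
```

After each successful read, the caller appends the received bytes and then loops on next_line until it returns false, so no complete line is ever left hidden in the buffer when the program goes back to select.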

Open(), Close(), and Read() applied to Linux pipe file descriptors

This is probably a simple question, but I want to confirm my understanding of these functions - and possibly clarified if I'm completely wrong about them.
Here's what's going on:
I have a multi-threaded program that is passing data via pipes, using the unix pipe() function. Basically, two threads can write to the pipe (they're synchronized of course), and only one can read from the pipe.
From my understanding, the read() command will attempt to read x number of bytes from the passed file descriptor parameter, and it will return 0 if EOF is reached.
The number of bytes I write to the pipe is variable, so this presents a minor difficulty when reading from the pipe. I believe I read somewhere that using close(my_pipe_file_descriptor) throws in EOF. If this is the case, read() will return once it hits EOF - which would be great.
If what I said above is correct - in reference to how close() and read() works - I have a question.
If I call close(my_pipe_file_descriptor), is the pipe destroyed, making any future calls to open(my_pipe_file_descriptor) invalid?
I hope this makes sense.

For the question about close: yes, once you close a pipe descriptor you can no longer use it in the process where you closed it. If you want to use a new pipe you have to create one again. If you close the write end of the pipe, the read end is still valid, allowing the reader to read until all data has been received. That last bit means that the writer doesn't have to wait until it knows the reader has received all data (it generally can't anyway); it can just write whatever data it wants and then close its end of the pipe.
As for your understanding of the read function, it's basically correct. You ask it to read a certain number of bytes, and it will read up to that number of bytes. It may read fewer, so you have to check the returned value to learn exactly how much it has read. That goes not only for pipes, but for sockets and files as well.
I recommend you read the official POSIX references:
pipe
read
write
And for completeness sake (even though it can't be used to open or create anonymous pipes):
open
There are also thousands of examples of how to use pipes if you just search a little.
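A small sketch of the behaviour described in this answer, assuming a POSIX system. The writer sends a message of arbitrary length and closes its end; the reader loops on read, accepting short reads, until read returns 0. The function names read_until_eof and pipe_round_trip are hypothetical, and the demo assumes the message fits in the pipe's kernel buffer, because writer and reader run in the same thread here.

```cpp
#include <cerrno>
#include <cstddef>
#include <string>
#include <unistd.h>

// Reads from fd until read() returns 0, which for a pipe means that every
// write end has been closed and all data has been consumed.
std::string read_until_eof(int fd) {
    std::string data;
    char chunk[256];
    for (;;) {
        const ssize_t n = read(fd, chunk, sizeof chunk);
        if (n > 0) {
            data.append(chunk, static_cast<std::size_t>(n));  // may be a short read
        } else if (n == 0) {
            break;                                            // end of file
        } else if (errno != EINTR) {
            break;                                            // real error
        }
    }
    return data;
}

// Writes `message` into a fresh pipe, closes the write end, reads it back.
std::string pipe_round_trip(const std::string& message) {
    int fds[2];
    if (pipe(fds) != 0) return std::string();

    const ssize_t written = write(fds[1], message.data(), message.size());
    close(fds[1]);  // this is what lets the reader see end of file

    std::string received;
    if (written >= 0) received = read_until_eof(fds[0]);
    close(fds[0]);
    return received;
}
```

Note that end of file on a pipe ends the whole conversation. If the writers need to send many variable-length messages over one pipe, they need some framing (a length prefix or a terminator) rather than close.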

How read file functions recognize end of a text file in C++?

As far as I know, there are two standard methods to read a text file in C++ (in this case, 2 numbers on every line):
Assume that every line consists of 2 numbers and read token by token:
#include <fstream>
std::ifstream infile("thefile.txt");
int a, b;
while (infile >> a >> b)
{
    // process pair (a,b)
}
Line-based parsing, using string streams:
#include <sstream>
#include <string>
#include <fstream>
std::ifstream infile("thefile.txt");
std::string line;
while (std::getline(infile, line))
{
    std::istringstream iss(line);
    int a, b;
    if (!(iss >> a >> b)) { break; } // error
    // process pair (a,b)
}
And also I can use the below code to see if the files ends or not :
while (!infile.eof())
My question is :
Question1: how this functions understand that one line is the last
line? I mean "how eof() returns false\true?"
As far as I know, they reading a part of memory. what is the
difference between the part that belongs to the file and the parts
that not?
Question2: Is there anyway to cheat this function?! I mean, Is it
possible to add something in the middle of the text file (for example
by a Hex editor tools) and make the eof() wrongly returns True in
the middle of the text file?
Appreciate your time and consideration.

Question1: how this functions understand that one line is the last line? I mean "how eof() returns false\true?"
It doesn't. The functions know when you've tried to read past the very last character in the file. They don't necessarily know whether a line is the last line. "Files" aren't the only things that you can read with streams. Keyboard input, a special purpose device, internet sockets: all can be read with the right kind of I/O stream. When reading from standard input, the stream has no way of knowing whether the very next thing I type is control-Z.
With regard to files on a computer disk, most modern operating systems store metadata regarding the file separate from the file. These metadata include the length of the file (and oftentimes when the file was last modified and when it was last read). On these systems, the stream buffer that underlies the I/O stream knows the current read location within the file and knows how long the file is. The stream buffer signals EOF when the read location reaches the length of the file.
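That eof() only becomes true after a read has tried to go past the last character can be seen in a small sketch. It uses std::istringstream in place of a file so that it is self-contained, and the two function names are hypothetical. The second function deliberately uses the while (!in.eof()) pattern from the question and is only safe for well-formed demo input.

```cpp
#include <sstream>
#include <string>

// Loop on the result of the read itself: stops as soon as a read fails.
int count_pairs_checking_reads(const std::string& text) {
    std::istringstream in(text);
    int a = 0, b = 0, pairs = 0;
    while (in >> a >> b) ++pairs;
    return pairs;
}

// Loop on eof(): the stream does not yet know it is at the end after the
// last successful read, so the body can run one extra time.
int count_iterations_checking_eof(const std::string& text) {
    std::istringstream in(text);
    int a = 0, b = 0, iterations = 0;
    while (!in.eof()) {
        in >> a >> b;   // result not checked, as in the question
        ++iterations;
    }
    return iterations;
}
```

For the input "1 2\n3 4\n" the first function counts 2 pairs, while the second runs its body 3 times: after reading 4 the stream has only seen the newline, not the end, so eof() is still false and a third, failing read is attempted.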
That's not universal, however. There are some not-so-common operating systems that don't use this concept of metadata stored elsewhere. End of file on a disk file is just as surprising on these systems as is end of file from user input on a keyboard.
As far as I know, they reading a part of memory. what is the difference between the part that belongs to the file and the parts that not?
Learn the difference between memory and disk files. There's a huge difference between the two. Unless you're working with an embedded computer, memory is much more limited than is disk space.
Question2: Is there anyway to cheat this function?! I mean, Is it possible to add something in the middle of the text file (for example by a Hex editor tools) and make the eof() wrongly returns True in the middle of the text file?
That depends very much on how the operating system implements files. On most modern operating systems, the answer is not just "no" but "No!". The concept of using some special signature that indicates end of file in a disk file is one of many computer science concepts that for the most part have been dumped into the pile of "that wasn't very smart" ideas. You asked your question on the internet. That most likely means you are using a Windows machine, a Linux machine, or a Mac. All of them store the length of a file as metadata separate from the contents of a file.
However, there is a need for the ability to clear the end of file indicator. One program might be writing to a file while at the same time another is reading from it. The reader might hit EOF while the writer is still active. The reader needs to clear the EOF indicator to continue reading what the writer has written. The C++ I/O streams provide the ability to do just that. Every I/O stream has a clear function. Whether it works is a different story. The clear will work temporarily, but the very next read might well set the EOF bit again. For example, when I type control-Z on my keyboard, that means I am done interacting with the program, period. My next action might well be to go out for lunch.

Read/Write at the same time

What I am doing is opening my file using fstream at the start of main and closing it at the end. In between I am writing "Hello World" and after that reading what I wrote, but the result is always weird characters and not "Hello World". I did do a cast to char but that didn't help. Is there any way I can do this?

You need to interpose an fseek call when you switch from reading to writing, or vice versa. (Of course, you also need to fopen with "r+" or the like, so that both reading and writing are allowed, but I imagine you are already aware of that; the need for seeking in order to switch between reading and writing is a lesser-known fact.)
As this page puts it,
For the modes where both read and
writing (or appending) are allowed
(those which include a "+" sign), the
stream should be flushed (fflush) or
repositioned (fseek, fsetpos, rewind)
between either a reading operation
followed by a writing operation or a
writing operation followed by a
reading operation.

I'd be amused if this works, because I always had to open a file twice to do that: once for reading and once for writing. Even then, I had to write the whole file out and close it (which flushed the OS buffers) before I could be sure I could read the whole file and not get an early EOF.
Nowadays, since I use Unix-style operating systems, I would just use the pipe() function. Not sure if that works in Windows (because so much doesn't, like select() on files).

Make sure you are seeking to the beginning of the file before reading, like so:
fileFStream.seekg(0, ios_base::beg);
If that doesn't work, post your code.
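A minimal sketch of the whole sequence with a single std::fstream, since the question does not show its code. The function name write_then_read_back is hypothetical, and the file is opened with trunc so that it is created if missing and starts empty.

```cpp
#include <fstream>
#include <string>

// Writes `text` through an fstream, seeks back to the start, and reads
// the first line through the same stream.
std::string write_then_read_back(const std::string& path,
                                 const std::string& text) {
    // in | out allows both directions; trunc creates or empties the file.
    std::fstream file(path, std::ios::in | std::ios::out | std::ios::trunc);
    if (!file) return std::string();

    file << text;

    // Reposition before switching from writing to reading. Without this the
    // read would start at the current position, which is the end of the file.
    file.seekg(0, std::ios::beg);

    std::string line;
    std::getline(file, line);
    return line;
}
```

Calling write_then_read_back("hello.txt", "Hello World") returns "Hello World". If the read is attempted without the seekg, the extraction fails, and printing a buffer that was never filled is one way to end up with garbage characters.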

Can we write an EOF character ourselves?

Most of the languages like C++ when writing into a file, put an EOF character even if we miss to write statements like :
filestream.close
However is there any way, we can put the EOF character according to our requirement, in C++, for an instance.
Or any other method we may use apart from using the functions provided in C++.
If you need to ask more of information then kindly do give a comment.
EDIT:
What if, we want to trick the OS and place an EOF character in a file and write some data after the EOF so that an application like notepad.exe is not able to read after our EOF character.
I have read answers to questions related to this topic and have come to know that nowadays the OS generally doesn't look for an EOF character but rather checks the length of the file. But there must be a procedure in the OS that checks the length of the file and then updates the file records.
I am sorry if I am wrong at any point in my estimation but please do help me because it can lead to a lot of new ideas.

There is no EOF character. EOF by definition "is unequal to any valid character code". Often it is -1. It is not written into the file at any point.
There is a historical EOF character value (CTRL+Z) in DOS, but it is obsolete these days.
To answer the follow-up question of Apoorv: The OS never uses the file data to determine file length (files are not 'null terminated' in any way). So you cannot trick the OS. Perhaps old, stupid programs won't read after CTRL+Z character. I wouldn't assume that any Windows application (even Notepad) would do that. My guess is that it would be easier to trick them with a null (\0) character.
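That EOF is a return value rather than something stored in the file can be checked with the C stdio functions. In this sketch (the struct and function names are hypothetical) a temporary file holds the single byte 0xFF, the byte most easily confused with -1 when stored in a char. std::fgetc returns an int precisely so that it can report this byte and end of file as two different values.

```cpp
#include <cstdio>

struct ByteThenEnd {
    int first;   // result of the first fgetc: the byte that is in the file
    int second;  // result of the second fgetc: nothing left to read
};

ByteThenEnd read_ff_byte_then_end() {
    ByteThenEnd result = {0, 0};
    std::FILE* fp = std::tmpfile();   // binary update mode, removed on close
    if (fp == nullptr) return result;

    std::fputc(0xFF, fp);             // the file now contains exactly one byte
    std::rewind(fp);

    result.first = std::fgetc(fp);    // the byte, as an unsigned char value
    result.second = std::fgetc(fp);   // EOF: reported, never stored

    std::fclose(fp);
    return result;
}
```

Here first is 255 and second is EOF. Nothing was written to the file to produce the second result; the library reports it because the read position has reached the file's length.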

Well, EOF is just a value returned by the functions defined in the C stdio.h header file. It's actually reported to all the reading functions by the OS, so it's system dependent. When the OS reaches the end of the file, it signals this to the function, which then returns most commonly (-1), but not always. So, to summarize, EOF is not a character, but a constant returned when the OS reports the end of the file.
EDIT: Well, you need to know more about filesystem, look at this.
Hi, to your second question:
once again, you should look better into filesystems. FAT is a very nice example because you can find many articles about it, and its principles are very similar to NTFS. Anyway, once again, EOF is NOT a character. You cannot place it in file directly. If you could do so, imagine the consequences, even "dumb" image file could not be read by the system.
Why? Because OS works like very complex structure of layers. One of the layers is the filesystem driver. It makes sure that it transfers data from every filesystem known to the driver. It provides a bridge between applications and the actual system of storing files into HDD.
To be exact, FAT filesystem uses the so-called FAT table - it is a table located close to the start of the HDD (or partition) address space, and it contains map of all clusters (little storage cells). OK, so now, when you want to save some file to the HDD, OS (filesystem driver) looks into FAT table, and searches for the value "0x0". This "0x0" value says to the OS that cluster which address is described by the location of that value in FAT table is free to write.
So it writes into it the first part of the file. Then, it looks for another "0x0" value in FAT, and if found, it writes the second part of the file into cluster which it points to. Then, it changes the value of the first FAT table record where the file is located to the physical address of the next in our case second part of the file.
When your file is all stored on the HDD, there comes the final part: it writes the desired EOF value, but into the FAT table, not into the "data part" of the HDD. So when the file is read next time, it knows this is the end, don't look any further.
So, now you see, if you would want to manually write EOF value into the place it doesn't belong to, you have to write your own driver which would be able to rewrite the FAT record, but this is practically impossible to do for beginners.

I came here while going through the Kernighan & Ritchie C exercises.
Ctrl+D sends the character that matches the EOF constant from stdio.h.
(Edit: this is on Mac OS X; thanks to @markmnl for pointing out that the Windows 10 equivalent is Ctrl+Z)

Actually in C++ there is no physical EOF character written to a file using either the fprintf() or ostream mechanisms. EOF is an I/O condition to indicate no more data to read.
Some early disk operating systems like CP/M actually did use a physical 0x1A (ASCII SUB character) to indicate EOF because the file system only maintained file size in blocks so you never knew exactly how long a file was in bytes. With the advent of storing actual length counts in the directory it is no longer typical to store an "EOF" character as part of the 'in-band' file data.
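A quick way to see that 0x1A is just another byte today is to put one in the middle of a file and count what comes back. This sketch opens the file in binary mode on both sides so that no text-mode translation can interfere. The function name is hypothetical, and the file contents are an arbitrary five-byte example.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Writes the five bytes 'a' 'b' 0x1A 'c' 'd' and counts how many bytes
// can be read back before the stream reports end of file.
std::size_t count_bytes_with_embedded_ctrl_z(const std::string& path) {
    {
        std::ofstream out(path, std::ios::binary | std::ios::trunc);
        // The literal is split so the hex escape stops after 1A.
        out.write("ab\x1A" "cd", 5);
    }   // closing the stream here flushes the data to the file

    std::ifstream in(path, std::ios::binary);
    std::size_t count = 0;
    char c;
    while (in.get(c)) ++count;
    return count;
}
```

The function returns 5: reading continues past the 0x1A and stops only when the recorded length of the file is reached.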

Under Windows, if you encounter an ASCII 26 (EOF) in stdin, it will stop reading the rest of the data. I believe writing this character will also terminate output sent to stdout, but I haven't confirmed this. You can switch the stream to binary mode as in this SO question:
#include <io.h>
#include <fcntl.h>
...
_setmode(0, _O_BINARY)
And not only will you stop 0x0A being converted to 0x0D 0x0A, but you'll also gain the ability to read/write 0x1A as well. Note you may have to switch both stdin (0) and stdout (1).

If by the EOF character you mean something like Control-Z, then modern operating systems don't need such a thing, and the C++ runtime will not write one for you. You can of course write one yourself:
filestream.put( 26 ); // write Ctrl-Z
but there is no good reason to do so. There is also no need to do:
filestream.close();
as the file stream will be closed for you automatically when its destructor is called, but it is (I think) good practice to do so.

There is no such thing as the "EOF" character. The fact of closing the stream in itself is the "EOF" condition.
When you press Ctrl+D in a unix shell, that simply closes the standard input stream, which in turn is recognized by the shell as "EOF" and it exits.
So, to "send" an "EOF", just close the stream to which the "EOF" needs to be sent.

Nobody has yet mentioned the [f]truncate system calls, which are how you make a file shorter without recreating it from scratch.
The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes.
If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0').
Understand that this is a distinct operation from writing any sort of data to the file. The file is a linear array of bytes, laid out on disk somehow, with metadata that says how long it is; truncate changes the metadata.
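As a sketch of that point, assuming a POSIX system: the function below (a hypothetical name) writes ten bytes, calls truncate, and then asks stat for the length. No data is written by the truncate call itself; only the recorded length changes.

```cpp
#include <fstream>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Creates a 10-byte file, sets its length with truncate(), and returns the
// size reported afterwards by stat(), or -1 on failure.
long long size_after_truncate(const std::string& path, off_t new_length) {
    {
        std::ofstream out(path, std::ios::binary | std::ios::trunc);
        out << "0123456789";
    }   // closing the stream here flushes the ten bytes to the file

    if (truncate(path.c_str(), new_length) != 0) return -1;

    struct stat info;
    if (stat(path.c_str(), &info) != 0) return -1;
    return static_cast<long long>(info.st_size);
}
```

With a new length of 4 the file shrinks to 4 bytes and the rest is lost; with 16 it grows to 16 bytes, and the added part reads as null bytes, as the quoted description says.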

On modern filesystems EOF is not a character, so you don't have to issue it when finishing to write to a file. You just have to close the file or let the OS do it for you when your process terminates.

Yes, you can manually add EOF to a file.
1) In the Mac Terminal, create a new file: touch filename.txt
2) Open the file in VI
vi filename.txt
3) In Insert mode (hit i), type Control+V and then Control+D. Do not let go of the Control key on the Mac.
Alternatively, if I want other ^NewLetters, like ^N^M^O^P, etc., I could do Control+V and then Control+NewLetter. So for example, to do ^O, hold down Control, and then type V and O, then let go of Control.
