This recently asked question has raised another interesting issue, as discussed in the comments to one of its answers.
To summarize: the OP there was having issues with code like that below, when subsequently attempting to read and write data from/to the two streams 'concurrently':
ifstream infile;
infile.open("accounts.txt");
ofstream outfile;
outfile.open("accounts.txt");
Although the issue, in itself, was successfully resolved, it has raised a question to which I cannot find an authoritative answer (and I've made some quite extensive searches of Stack Overflow and the wider web).
It is very clearly stated what happens when calling the open() method of a stream that is already associated with a file (cppreference), but what I cannot find an answer to is what happens when (as in this case) the file is already associated with a (different) stream.
If the stream is already associated with a file (i.e., it is already
open), calling this function fails.
I can see several possible scenarios here:
The second open call will fail and any attempted writes to it will also fail (but that is not the case in the cited question).
The second open call will 'override' the first, effectively closing it (this could explain the issues encountered in said code).
Both streams remain open but enter into a 'mutual clobbering' match regarding their internal file pointers and buffers.
We enter the realm of undefined (or implementation-defined) behaviour.
Note that, as the first open() call is made by an input stream, the operating system will not necessarily 'lock' the file, as it probably would for an output stream.
So, does anyone have a definitive answer to this? Or a citation from the Standard (cppreference will be 'acceptable' if nothing more authoritative can be found)?
The specification of basic_filebuf::open (and of everything that depends on it, like fstream::open) says nothing about what will happen in this case. A filesystem may allow it, or it may not.
What the standard says is that, if the file successfully opens, then you can play with it in accord with the interface. And if it doesn't successfully open, then there will be an error. That is, the standard allows a filesystem to permit it or forbid it, but it doesn't say which must happen. The implementation can even randomly forbid it. Or forbid you from opening any files in any way. All are (theoretically) valid.
To me, this falls outside even the 'implementation defined' category. The very same code will behave differently depending on the underlying filesystem or OS (some OSes forbid opening a file twice).
No.
Such a scenario is not discussed by the standard.
It's not even managed by the implementation (your compiler, standard library implementation etc).
The stream ultimately asks the operating system for access to that file in the desired mode, and it's up to the operating system to decide whether that access shall be granted at that time.
A simple analogy would be your program making some API call to a web application over a network. Perhaps the web application does not permit more than ten calls per minute, and returns some error code if you attempt more than that. But that doesn't mean your program has undefined behaviour in such a case.
C implementations exist for many different platforms, whose underlying file systems may handle such corner cases differently. For the Standard to mandate any particular corner-case behavior would have made the language practical only on platforms whose file systems behave in such fashion. Instead, the Standard regards such issues as being outside its jurisdiction (i.e. to use its own terminology, "Undefined Behavior"). That doesn't mean that implementations whose target OS offers useful guarantees shouldn't make such guarantees to programs when practical, but implementation designers are presumed to know more than the Committee about how best to serve their customers.
On the other hand, it may sometimes be helpful for an implementation not to expose the underlying OS behavior. Consider an OS that doesn't have a distinct "append" mode, where code needing an "open for append" could do an "open existing file for write" followed by a "seek to end of file". There, an attempt to open two streams for appending to the same file could result in data corruption when one stream writes part of a file and the other stream then rewrites that same portion. It may be helpful for an implementation that detects that condition to either inject its own logic to ensure smooth merging of the data or block the second open request. Either course of action might be better, depending upon an application's purpose, but--as noted above--the choice is outside the Standard's jurisdiction.
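To make the emulated-append hazard concrete, here is a minimal sketch in C (the file name is hypothetical, and the example assumes the file already exists and that the platform has no true append mode):

#include <stdio.h>

int main(void)
{
    /* Two streams both emulate "append" as open-for-write + seek-to-end. */
    FILE *a = fopen("log.txt", "r+");
    FILE *b = fopen("log.txt", "r+");
    if (!a || !b)
        return 1;

    fseek(a, 0, SEEK_END);    /* both streams now point at the     */
    fseek(b, 0, SEEK_END);    /* same offset: the old end-of-file  */

    fputs("first\n", a);
    fflush(a);                /* lands at the old end-of-file      */
    fputs("second\n", b);
    fflush(b);                /* rewrites that same region         */

    fclose(a);
    fclose(b);
    return 0;
}

With a true atomic append mode (such as POSIX O_APPEND), each write would be repositioned to the current end of file and both lines would survive.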
I open the zip file as a stream twice. The zip file contains some XML files.
std::ifstream("filename") file;
zipstream *p1 = new zipstream(file);
zipstream *p2 = new zipstream(file);
p1->getNextEntry();
auto p3 = p1.rdbuf();
autp p4 = p2.rdbuf();
Then I see that p3 and p4 hold the same address, but some of their member variables differ, such as _IGfirst.
The contents of one of the XML files are as follows:
<test>
<one value="0.00001"/>
</test>
When the contents of the file are read in two threads at the same time, an error happens:
string One = p1->getPropertyValue("one");
// One == "0001two" (expected "0.00001")
Given a HANDLE to a file (e.g. C:\\FolderA\\file.txt), I want a function which will return a HANDLE to the containing directory (in the previous example, it would be a HANDLE to C:\\FolderA). For example:
HANDLE hFile = CreateFileA(
    "C:\\FolderA\\file.txt",
    GENERIC_READ,
    FILE_SHARE_READ,
    NULL,
    OPEN_EXISTING,
    FILE_ATTRIBUTE_NORMAL,
    NULL);
HANDLE hDirectory = someFunc(hFile);
Possible implementation for someFunc:
HANDLE someFunc(HANDLE h)
{
    char *path = getPath(h);            // "C:\\FolderA\\file.txt"
    char *parent = getParentPath(path); // "C:\\FolderA"

    HANDLE hFile = CreateFileA(
        parent,
        GENERIC_READ,
        FILE_SHARE_READ,
        NULL,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        NULL);

    free(parent);
    free(path);
    return hFile;
}
But is there a way to implement someFunc without getParentPath or without making it look at the string and removing everything after the last directory separator (because this is terrible from a performance point of view)?
I don't know what getParentPath is. I assume it's a function that searches for the trailing backslash in the string and uses that to strip off the file specification. You don't have to define such a function yourself; Windows already provides one for you—PathCchRemoveFileSpec. (Note that this assumes the specified path actually contains a file name to remove. If the path doesn't contain a file name, it will remove the trailing directory name. There are other functions you can use to verify whether a path contains a file specification.)
The older version of this function is PathRemoveFileSpec, which is what you would use on downlevel operating systems where the newer, safer function is not available.
Outside of the Windows API, there are other ways of doing the same thing. If you're targeting C++17, there is the filesystem::path class. Boost provides something similar. Or you could write it yourself with the find_last_of member function of the std::string class, if you absolutely have to. (But prefer not to re-invent the wheel. There are lots of edge cases when it comes to path manipulation that you probably won't think of, and that your testing probably won't reveal.)
You express concerns about the performance of this approach. This is nonsense. Stripping some characters from a string is not a slow operation. It wouldn't even be slow if you started searching from the beginning of the string and then, once you found the file specification, made a second copy of the string, again starting from the beginning of the string. It's a simple loop searching through the characters of a reasonable-length string, and then a simple memcpy. There is absolutely no way that this operation could be a performance bottleneck in code that does file I/O.
But, the implementation probably isn't even going to be that naïve. You can optimize it by starting the search from the end of the path string, reducing the number of characters that you have to iterate through, and you can avoid any type of memory copy altogether if you're allowed to manipulate the original string. With a C-style string, you just replace the trailing path separator (the one that demarcates the beginning of the file specification) with a NUL character (\0). With a C++-style string, you just call the erase member function.
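As a minimal sketch of that in-place approach (the function name is invented for illustration; strrchr finds the last separator, which is then overwritten with a NUL):

#include <string.h>

void stripFileSpec(char *path)
{
    char *sep = strrchr(path, '\\');  /* last path separator */
    if (sep != NULL)
        *sep = '\0';  /* "C:\\FolderA\\file.txt" becomes "C:\\FolderA" */
}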
In fact, if you really care about performance, this is virtually guaranteed to be faster than making a system call to retrieve the containing folder from a file object. System calls are a lot slower than some compiler-generated, inlinable code to iterate through a string and strip out a sub-string.
Once you have the path to the directory, you can obtain a HANDLE to it by calling the CreateFile function with the FILE_FLAG_BACKUP_SEMANTICS flag. (It is necessary to pass that flag if you want to retrieve a handle to a directory.)
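Putting the pieces together, someFunc might be sketched like this (assumes Windows 8+ for PathCchRemoveFileSpec and Vista+ for GetFinalPathNameByHandle, which is discussed below; uses a fixed MAX_PATH buffer for brevity):

#include <windows.h>
#include <pathcch.h>  /* link with pathcch.lib */

HANDLE someFunc(HANDLE h)
{
    WCHAR path[MAX_PATH];
    DWORD len = GetFinalPathNameByHandleW(h, path, MAX_PATH, FILE_NAME_NORMALIZED);
    if (len == 0 || len >= MAX_PATH)
        return INVALID_HANDLE_VALUE;  /* failed, or needs a larger buffer */

    if (FAILED(PathCchRemoveFileSpec(path, MAX_PATH)))
        return INVALID_HANDLE_VALUE;

    /* FILE_FLAG_BACKUP_SEMANTICS is what allows opening a directory. */
    return CreateFileW(path,
                       GENERIC_READ,
                       FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                       NULL,
                       OPEN_EXISTING,
                       FILE_FLAG_BACKUP_SEMANTICS,
                       NULL);
}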
I have measured that this is slow and am looking for a faster way.
Your measurements are wrong. Either you've made the common mistake of benchmarking a debugging build, where the standard library functionality (e.g., std::string) is not optimized, and/or the real performance bottleneck is the file I/O. CreateFile is not a speedy function by any stretch of the imagination. I can almost guarantee that is going to be your hotspot.
Note that if you don't already have the path, it is straightforward to obtain the path from a HANDLE to a file. As was pointed out in the comments, on Windows Vista and later, you simply need to call the GetFinalPathNameByHandle function. More details are available in this article on MSDN, including sample code and an alternative for use on downlevel versions of Windows.
As was mentioned already in the comments to the question, you can optimize this further by allocating a buffer of length MAX_PATH (or perhaps even larger) on the stack. That compiles to a single instruction to adjust the stack pointer, so it won't be a performance bottleneck, either. (Okay, I lied: you actually will need two instructions—one to create space on the stack, and the other to free the allocated space on the stack. Still not a performance problem.) That way, you don't even have to do any dynamic memory allocation.
Note that for maximum robustness, especially on Windows 10, you want to handle the case that a path is longer than MAX_PATH. In such cases, your stack-allocated buffer will be too small, and the function you call to fill it will return an error. Handle that error, and allocate a larger buffer on the free store. That will be slower, but this is an edge case and probably not one that is worth optimizing. The 99% common case will use the stack-allocated buffer.
Furthermore, eryksun points out (in comments to this answer) that, although it is convenient, GetFinalPathNameByHandle requires multiple system calls to map the file object between the NT and DOS namespaces and to normalize the path. I haven't disassembled this function, so I can't confirm his claims, but I have no reason to doubt them. Under normal circumstances, you wouldn't worry about this sort of overhead or possible performance costs, but since this seems to be a big concern for your application, you can use eryksun's alternative suggestion of calling GetFileInformationByHandleEx and requesting the FileNameInfo class. GetFileInformationByHandleEx is a general, multi-purpose function that can retrieve all different sorts of information about a file, including the path. Its implementation is simpler, calling directly down to the native NtQueryInformationFile function. I would have thought GetFinalPathNameByHandle was just a user-mode wrapper providing exactly this service, but eryksun's research suggests it is doing extra work that you might want to avoid if this is truly a performance hot-spot. I have to qualify this slightly by noting that GetFileInformationByHandleEx, in order to retrieve the FileNameInfo, is going to have to create an I/O Request Packet (IRP) and call down to the underlying device driver. That's not a cheap operation, so I'm not sure that the additional overhead of normalizing the path is really going to matter. But in this case, there's no real harm in using the GetFileInformationByHandleEx approach, since it's a documented function.
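For completeness, a sketch of the GetFileInformationByHandleEx route (the helper name is invented; note that the FileNameInfo path is volume-relative, i.e. it lacks the drive letter, and that FileName is not NUL-terminated):

#include <windows.h>
#include <string.h>

BOOL getPathByHandle(HANDLE h, WCHAR *out, size_t cchOut)
{
    /* Room for the header plus a MAX_PATH-sized name. */
    union {
        FILE_NAME_INFO info;
        BYTE raw[sizeof(FILE_NAME_INFO) + MAX_PATH * sizeof(WCHAR)];
    } buf;

    if (!GetFileInformationByHandleEx(h, FileNameInfo, &buf, sizeof(buf)))
        return FALSE;  /* e.g. ERROR_MORE_DATA for longer paths */

    size_t cch = buf.info.FileNameLength / sizeof(WCHAR);
    if (cch + 1 > cchOut)
        return FALSE;

    memcpy(out, buf.info.FileName, buf.info.FileNameLength);
    out[cch] = L'\0';  /* FileName is not NUL-terminated */
    return TRUE;
}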
If you've written the code as described but are still having measurable performance problems, then please post that code for someone to review and help you optimize. The Code Review Stack Exchange site is a great place to get help like that on working code. Feel free to leave me a link to such a question in a comment under this answer so that I don't miss it.
Whatever you do, please stop calling the ANSI versions of the Windows API functions (the ones that end with an A suffix). You want the wide-character (Unicode) versions. These end with a W suffix, and work with strings composed of WCHAR (== wchar_t) characters. Aside from the fact that the ANSI versions have been deprecated for decades now because they do not provide Unicode support (it is not optional for any application written after the year 2000 to support Unicode characters in paths), as much as you care about performance, you should be aware of the fact that all A-suffixed API functions are just stubs that convert the passed-in ANSI string to a Unicode string and then delegate to the W-suffixed version. If the function returns a string, a second conversion also must be done by the A-suffixed version, since all native APIs work with Unicode strings. Performance isn't the real reason why you should avoid calling ANSI functions, but perhaps it's one that you'll find more convincing.
There might be a way to do what you want (map a file object via a HANDLE to its containing directory), but it would require undocumented usage of the NT native API. I don't see anything at all in the documented functions that would allow you to obtain this information. It certainly isn't accessible via the GetFileInformationByHandleEx function. For better or worse, the user-mode file system API is almost entirely path-based. Presumably, it is tracked internally, but even the documented NT native API functions that take a root directory HANDLE (e.g., NtDeleteFile via the OBJECT_ATTRIBUTES structure) allow this field to be NULL, in which case the full path string is used.
As always, if you had provided more details on the bigger picture, we could probably provide a more appropriate solution. This is what the commenters were driving at when they mentioned an XY problem. Yes, people are questioning your motives because that's how we provide the most appropriate help.
I'm writing a C shared library for internal use (I'll be dlopen()'ing it from a C++ application, if that matters). The shared library loads (amongst other things) some Java code through a JNI module, which means all manner of nightmare error modes can come out of the JVM that I need to handle intelligently in the application. Additionally, this library needs to be re-entrant. Is there an idiom for passing error strings back in this case, or am I stuck mapping errors to integers and using printfs to debug things?
Thanks!
My approach to the problem would be a little different from everyone else's. They're not wrong, it's just that I've had to wrestle with a different aspect of this problem.
A C API needs to provide numeric error codes, so that the code using the API can take sensible measures to recover from errors when appropriate, and pass them along when not. The errno.h codes demonstrate a good categorization of errors; in fact, if you can reuse those codes (or just pass them along, e.g. if all your errors come ultimately from system calls), do so.
Do not copy errno itself. If possible, return error codes directly from functions that can fail. If that is not possible, have a GetLastError() method on your state object. You have a state object, yes?
If you have to invent your own codes (the errno.h codes don't cut it), provide a function analogous to strerror, that converts these codes to human-readable strings.
It may or may not be appropriate to translate these strings. If they're meant to be read only by developers, don't bother. But if you need to show them to the end user, then yeah, you need to translate them.
The untranslated version of these strings should indeed be just string constants, so you have no allocation headaches. However, do not waste time and effort coding your own translation infrastructure. Use GNU gettext.
If your code is layered on top of another piece of code, it is vital that you provide direct access to all the error information and relevant context information that that code produces, and you make it easy for developers against your code to wrap up all that information in an error message for the end user.
For instance, if your library produces error codes of its own devising as a direct consequence of failing system calls, your state object needs methods that return the errno value observed immediately after the system call that failed, the name of the file involved (if any), and ideally also the name of the system call itself. People get this wrong waaay too often -- for instance, SQLite, otherwise a well designed API, does not expose the errno value or the name of the file, which makes it infuriatingly hard to distinguish "the file permissions on the database are wrong" from "you have a bug in your code".
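A minimal sketch of what this advice amounts to in code (every name here is hypothetical, invented for illustration):

typedef enum {
    MYLIB_OK = 0,
    MYLIB_E_IO,      /* underlying I/O failure; details in the state object */
    MYLIB_E_JVM,     /* the embedded JVM reported an error */
    MYLIB_E_BADARG
} mylib_error;

/* strerror-style translation of codes into human-readable strings. */
const char *mylib_strerror(mylib_error code)
{
    switch (code) {
    case MYLIB_OK:       return "no error";
    case MYLIB_E_IO:     return "I/O error";
    case MYLIB_E_JVM:    return "error reported by embedded JVM";
    case MYLIB_E_BADARG: return "invalid argument";
    default:             return "unknown error";
    }
}

/* A per-instance state object keeps the library re-entrant: the saved
   errno and the offending file name live here, not in globals. */
typedef struct mylib_state {
    int  last_errno;
    char last_path[256];
} mylib_state;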
EDIT: Addendum: common mistakes in this area include:
Contorting your API (e.g. with use of out-parameters) so that functions that would naturally return some other value can return an error code.
Not exposing enough detail for callers to be able to produce an error message that allows a knowledgeable human to fix the problem. (This knowledgeable human may not be the end user. It may be that your error messages wind up in server log files or crash reports for developers' eyes only.)
Exposing too many different fine distinctions among errors. If your callers will never plausibly do different things in response to two different error codes, they should be the same code.
Providing more than one success code. This is asking for subtle bugs.
Also, think very carefully about which APIs ought to be allowed to fail. Here are some things that should never fail:
Read-only data accessors, especially those that return scalar quantities, most especially those that return Booleans.
Destructors, in the most general sense. (This is a classic mistake in the UNIX kernel API: close and munmap should not be able to fail. Thankfully, at least _exit can't.)
There is a strong case that you should immediately call abort if malloc fails rather than trying to propagate it to your caller. (This is not true in C++ thanks to exceptions and RAII -- if you are so lucky as to be working on a C++ project that uses both of those properly.)
In closing: for an example of how to do just about everything wrong, look no further than XPCOM.
You return pointers to static const char [] objects. This is always the correct way to handle error strings. If you need them localized, you return pointers to read-only memory-mapped localization strings.
In C, if you don't have internationalization (I18N) or localization (L10N) to worry about, then pointers to constant data is a good way to supply error message strings. However, you often find that the error messages need some supporting information (such as the name of the file that could not be opened), which cannot really be handled by constant data.
With I18N/L10N to worry about, I'd recommend storing the fixed message strings for each language in an appropriately formatted file, and then using mmap() to 'read' the file into memory before you fork any threads. The area so mapped should then be treated as read-only (use PROT_READ in the call to mmap()).
This avoids complicated issues of memory management and avoids memory leaks.
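A sketch of that approach (POSIX; the catalog file name is hypothetical):

#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the message catalog read-only, before any threads are started. */
const char *load_messages(size_t *len)
{
    int fd = open("errors.en.msg", O_RDONLY);
    if (fd == -1)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;

    *len = (size_t)st.st_size;
    return (const char *)p;
}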
Consider whether to provide a function that can be called to get the latest error. It can have a prototype such as:
int get_error(int errnum, char *buffer, size_t buflen);
I'm assuming that the error number is returned by some other function call; the library function then consults any threadsafe memory it has about the current thread and the last error condition returned to that thread, and formats an appropriate error message (possibly truncated) into the given buffer.
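One possible shape for such a function, sketched with C11 _Thread_local for the per-thread record (how the library fills in that record is elided):

#include <stdio.h>

/* Per-thread record of the last error, filled in by the library. */
static _Thread_local struct {
    int  errnum;
    char detail[128];  /* e.g. the name of the file involved */
} last_error;

int get_error(int errnum, char *buffer, size_t buflen)
{
    if (errnum != last_error.errnum)
        return -1;  /* not the error recorded for this thread */

    /* snprintf truncates gracefully if the caller's buffer is small. */
    snprintf(buffer, buflen, "error %d: %s", errnum, last_error.detail);
    return 0;
}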
With C++, you can return (a reference to) a standard string (std::string) from the error reporting mechanism; this means you can format the string to include the file name or other dynamic attributes. The code that collects the information will be responsible for releasing the string, which isn't (shouldn't be) a problem because of the destructors that C++ has. You might still want to use mmap() to load the format strings for the messages.
You do need to be careful about the files you load and, in particular, any strings used as format strings. (Also, if you are dealing with I18N/L10N, you need to worry about whether to use the '%n$' notation to allow for argument reordering; and you have to worry about different rules in different cultures/languages for the order in which the words of a sentence are presented.)
I guess you could use PWideChars, as Windows does. It's thread safe. What you need is for the calling app to create a PWideChar that the DLL will use to set an error. Then the calling app needs to read that PWideChar and free its memory.
R. has a good answer (use static const char []), but if you are going to have various spoken languages, I like to use an enum to define the error codes, which is better than #defining a bunch of names to int values.
return integers; don't set some global variable (like errno, even if it is potentially TLSed by an implementation), akin to the Linux kernel's style of return -ENOENT;.
have a function similar to strerror that takes such an integer and returns a pointer to a const string. This function can transparently do I18N if needed, too, as gettext-returnable strings also remain constant over the lifetime of the translation database.
If you need to provide non-static error messages, then I recommend returning strings like this: error_code_t function(..., char **err_msg). Then provide a function to free the error message: void free_error_message(char *err_msg). This way you hide how the error strings are allocated and freed. This is of course only worth implementing if your error strings are dynamic in nature, meaning that they convey more than just a translation of error codes.
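A sketch of that pattern (the failing operation here is just an illustrative fopen):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef int error_code_t;

/* On failure, a dynamically formatted message is handed back via *err_msg. */
error_code_t do_work(const char *path, char **err_msg)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        if (err_msg != NULL) {
            size_t n = strlen(path) + 32;
            *err_msg = malloc(n);
            if (*err_msg != NULL)
                snprintf(*err_msg, n, "cannot open '%s'", path);
        }
        return -1;
    }
    fclose(fp);
    return 0;
}

/* The matching deallocator hides how the message was allocated. */
void free_error_message(char *err_msg)
{
    free(err_msg);
}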
Please excuse my formatting; I'm writing this on a cell phone...
I get this warning saying that tmpnam is dangerous, but I would prefer to use it, since it can be used as is in Windows as well as Linux. I was wondering why it would be considered dangerous (I'm guessing it's because of the potential for misuse rather than it actually not working properly).
From tmpnam manpage :
The tmpnam() function generates a different string each time it is called, up to TMP_MAX times. If it is called more than TMP_MAX times, the behavior is implementation defined.
Although tmpnam() generates names that are difficult to guess, it is nevertheless possible that between the time that tmpnam() returns a pathname, and the time that the program opens it, another program might create that pathname using open(2), or create it as a symbolic link. This can lead to security holes. To avoid such possibilities, use the open(2) O_EXCL flag to open the pathname. Or better yet, use mkstemp(3) or tmpfile(3).
mkstemp really creates the file, so you are assured it works, whereas tmpnam returns a name that may refer to an already-existing file.
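For example (POSIX; the template prefix is hypothetical):

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char tmpl[] = "/tmp/myapp-XXXXXX";  /* the Xs are replaced in place */
    int fd = mkstemp(tmpl);             /* creates with O_CREAT | O_EXCL */
    if (fd == -1)
        return 1;

    /* ... use fd; tmpl now holds the actual name ... */

    close(fd);
    unlink(tmpl);  /* remove the file when done */
    return 0;
}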
If you want to use the same symbol on multiple platforms, use a macro to define TMPNAM. As long as you pick more secure functions with the same interface, you'll be able to use it on both. You have conditional compilation somewhere in your code anyway, right?
If you're talking about the MSVC compiler warning:
These functions are deprecated because more secure versions are available;
see tmpnam_s, _wtmpnam_s.
(http://msdn.microsoft.com/de-de/library/hs3e7355(VS.80).aspx)
Otherwise, just read what the manpages say about the drawbacks of this function. It is mostly about a second process creating exactly the same file name that your process just did.
From the tmpnam(3) manpage:
Although tmpnam() generates names that are difficult to guess, it is nevertheless possible that between the time that tmpnam() returns a pathname, and the time that the program opens it, another program might create that pathname using open(2), or create it as a symbolic link. This can lead to security holes. To avoid such possibilities, use the open(2) O_EXCL flag to open the pathname. Or better yet, use mkstemp(3) or tmpfile(3).
The function is dangerous because you are responsible for allocating a buffer that will be big enough to handle the string that tmpnam() is going to write into that buffer. If you allocate a buffer that is too small, tmpnam() has no way of knowing that, and will overrun the buffer (causing havoc). tmpnam_s() (MS's secure version) requires you to pass the length of the buffer, so tmpnam_s() knows when to stop.
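A sketch of the bounds-checked call (MSVC, or a C11 Annex K implementation):

#define __STDC_WANT_LIB_EXT1__ 1
#include <stdio.h>

int main(void)
{
    char name[L_tmpnam_s];
    /* The length argument lets the function fail cleanly
       instead of overrunning the buffer. */
    if (tmpnam_s(name, sizeof name) != 0)
        return 1;
    /* ... */
    return 0;
}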
In the mold of a previous question I asked about the so-called safe library deprecations, I find myself similarly bemused as to why fopen() should be deprecated.
The function takes two C strings, and returns a FILE* ptr, or NULL on failure. Where are the thread-safety problems / string overrun problems? Or is it something else?
Thanks in advance
You can use fopen(). Seriously, don't take any notice of Microsoft here; they're doing programmers a real disservice by deviating from the ISO standards. They seem to think that people writing code are somehow brain-dead and don't know how to check parameters before calling library functions.
If someone isn't willing to learn the intricacies of C programming, they really have no business doing it. They should move on to a safer language.
This appears to be just another attempt by Microsoft at vendor lock-in of developers (although they're not the only ones who try it, so I'm not specifically berating them). I usually add:
#define _CRT_SECURE_NO_WARNINGS
(or the "-D" variant on the command line) to most of my projects to ensure I'm not bothered by the compiler when writing perfectly valid, legal C code.
Microsoft has provided extra functionality in the fopen_s() function (file encodings, for one) as well as changing how things are returned. This may make it better for Windows programmers but makes the code inherently unportable.
If you're only ever going to code for Windows, by all means use it. I myself prefer the ability to compile and run my code anywhere (with as little change as possible).
As of C11, these safe functions are now a part of the standard, though optional. Look into Annex K for full details.
There is an official ISO/IEC JTC1/SC22/WG14 (C Language) technical report TR24731-1 (bounds checking interfaces) and its rationale available at:
http://www.open-std.org/jtc1/sc22/wg14
There is also work towards TR24731-2 (dynamic allocation functions).
The stated rationale for fopen_s() is:
6.5.2 File access functions
When creating a file, the fopen_s and freopen_s functions improve security by protecting the file from unauthorized access by setting its file protection and opening the file with exclusive access.
The specification says:
6.5.2.1 The fopen_s function
Synopsis
#define __STDC_WANT_LIB_EXT1__ 1
#include <stdio.h>
errno_t fopen_s(FILE * restrict * restrict streamptr,
const char * restrict filename,
const char * restrict mode);
Runtime-constraints
None of streamptr, filename, or mode shall be a null pointer.
If there is a runtime-constraint violation, fopen_s does not attempt to open a file.
Furthermore, if streamptr is not a null pointer, fopen_s sets *streamptr to the
null pointer.
Description
The fopen_s function opens the file whose name is the string pointed to by
filename, and associates a stream with it.
The mode string shall be as described for fopen, with the addition that modes starting
with the character ’w’ or ’a’ may be preceded by the character ’u’, see below:
uw            truncate to zero length or create text file for writing, default permissions
ua            append; open or create text file for writing at end-of-file, default permissions
uwb           truncate to zero length or create binary file for writing, default permissions
uab           append; open or create binary file for writing at end-of-file, default permissions
uw+           truncate to zero length or create text file for update, default permissions
ua+           append; open or create text file for update, writing at end-of-file, default permissions
uw+b or uwb+  truncate to zero length or create binary file for update, default permissions
ua+b or uab+  append; open or create binary file for update, writing at end-of-file, default permissions
To the extent that the underlying system supports the concepts, files opened for writing
shall be opened with exclusive (also known as non-shared) access. If the file is being
created, and the first character of the mode string is not ’u’, to the extent that the
underlying system supports it, the file shall have a file permission that prevents other
users on the system from accessing the file. If the file is being created and the first
character of the mode string is ’u’, then by the time the file has been closed, it shall
have the system default file access permissions.10)
If the file was opened successfully, then the pointer to FILE pointed to by streamptr
will be set to the pointer to the object controlling the opened file. Otherwise, the pointer
to FILE pointed to by streamptr will be set to a null pointer.
Returns
The fopen_s function returns zero if it opened the file. If it did not open the file or if
there was a runtime-constraint violation, fopen_s returns a non-zero value.
10) These are the same permissions that the file would have been created with by fopen.
The fopen_s() function has been added by Microsoft to the C runtime with the following fundamental differences from fopen():
if the file is opened for writing ("w" or "a" specified in the mode) then the file is opened for exclusive (non-shared) access (if the platform supports it).
if the "u" specifier is used in the mode argument with the "w" or "a" specifiers, then by the time the file is closed, it will have system default permissions for others users to access the file (which may be no access if that's the system default).
if the "u" specified is not used in those cases, then when the file is closed (or before) the permissions for the file will be set such that other users will not have access to the file.
Essentially it means that files the application writes are protected from other users by default.
They did not do this to fopen() due to the likelihood that existing code would break.
Microsoft has chosen to deprecate fopen() to encourage developers for Windows to make conscious decisions about whether the files their applications use will have loose permissions or not.
Jonathan Leffler's answer provides the proposed standardization language for fopen_s(). I added this answer hoping to make clear the rationale.
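For comparison, the calling convention looks like this (a sketch; the file name is hypothetical):

#define __STDC_WANT_LIB_EXT1__ 1
#include <stdio.h>

int main(void)
{
    FILE *fp = NULL;
    /* fopen_s returns an errno_t and writes the stream through an
       out-parameter, rather than returning FILE* or NULL directly. */
    errno_t err = fopen_s(&fp, "data.txt", "r");
    if (err != 0 || fp == NULL)
        return 1;

    fclose(fp);
    return 0;
}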
Or is it something else?
Some implementations of the FILE structure used by fopen() have the file descriptor defined as an unsigned char. This leaves you with a maximum of 255 simultaneously open files, minus stdin, stdout, and stderr.
While the value of being able to have 255 open files is debatable, of course, this implementation detail materializes on the Solaris 8 platform when you have more than 252 socket connections! What first appeared as a seemingly random failure to establish an SSL connection using libcurl in my application turned out to be caused by this, but it took deploying debug versions of libcurl and openssl and stepping the customer through a debugger script to finally figure it out.
While it's not entirely the fault of 'fopen', one can see the virtues of throwing off the shackles of old interfaces; the choice to deprecate might be based on the pain of maintaining binary compatibility with an antiquated implementation.
The new versions do parameter validation whereas the old ones didn't.
See this SO thread for more information.
Thread safety. fopen() reports errors through the global variable errno, while the fopen_s() replacement returns an errno_t directly and takes a FILE** argument through which to store the file pointer.