I have created a class, SettingsClass that contains static strings that hold db connection strings to be used by the MySQL C++ connector library (for e.g. hostname, dbname, username, password).
Whenever a function needs to connect to the database, it calls the .c_str() function on these static strings. For example:
Class SettingsClass
{
public:
static string hostname;
...
}SettingsClass;
string SettingsClass::hostname;
//A function that needs to connect to the DB uses:
driver = get_griver_instance();
driver->connect(SettingsClass.hostname.c_str(),...);
The static strings are populated once in the process lifetime. Its value is read from a configuration file.
My application is multithreaded. Am I using c_str() in a safe way?
The c_str() in itself should be threadsafe. However, if you have another thread that accesses (writes to) the string that you are taking c_str() of, then you're playing with matches sitting in a pool of petrol.
Typically c_str() is implemented by adding a zero value (the null character 0x00, not the character for zero which is 0x30) on the end of the existing string (if there isn't one there already) and passes back the address of where the string is stored.
Anyone interested can read the libstdc++ code here:
libstdc++, basic_string.h
The interesting lines are 1802 and 294 - 1802 is the c_str() function, and 294 is the function that c_str() calls.
In theoretically at least, the standard allows implementations
which wouldn't be thread safe. In practice, I don't think
you'll run any risk (although formally, there could be undefined
behavior), but if you want to be sure, just call .c_str() once
on all of the strings during initialization (before threads have
been started). The standard guarantees the validity of the
returned pointer until the next non-const function is called
(even if you don't keep a copy of it), which means that the
implementation cannot mutate any data.
Since the string type is constant (so, assume, it is immutable), it is never modified. Any function only reading it should be thread-safe.
So, I think .c_str() is thread-safe.
As long the std::string variables are initialized correctly before your threads are started and aren't changed afterwards using the const char* representation via c_str() should be thread safe.
Related
I want to use the getenv() function.
Now I got a remark from somebody that if multiple threads are calling this function, this will not be thread-safe. However if I look at the information page for this function, it states that:
Concurrently calling this function is safe, provided that the environment remains unchanged.
I understand the concept of a static block of data, and the function returns a pointer to it. I understand that the contents of the block can change over time, by making multiple calls to the function, as the reference pages state.
If one thread is calling
getenv("myEnvVar1")
and another one is calling
getenv("myEnvVar2")
will the same memory block be used where the returned pointers are pointing to? How should I interpret the fact that "Concurrently calling this function is safe"?
getenv returns a pointer to the ACTUAL environment content - so the process has an array of strings with the environment variables in them, and you get back, not a copy, but the ACTUAL pointer to that.
Note that char *p = getenv("foo"); ... setenv("foo", "new value"); ... use p is also undefined, as the string p points at may well have changed now [and not in a well-defined way]
It's not.
The function getenv is part of the C Standard Library, thus it's behaviour is specified by the set of C standards.
POSIX.1-2017 (Which defers to the ISO C Standard) states:
The returned string pointer might be invalidated or the string content might be overwritten by a subsequent call to getenv(),
setenv(), unsetenv(), or (if supported) putenv() but they shall not be
affected by a call to any other function in this volume of
POSIX.1-2017.
The returned string pointer might also be invalidated if the calling thread is terminated.
The getenv() function need not be thread-safe.
ISO C11 (Which POSIX defers to), says:
The set of environment names and the method for altering the
environment list are implementation-defined. The getenv function
need not avoid data races with other threads of execution that modify
the environment list.
You cannot be sure that getenv itself does not modify the environment when searching through it (I cannot think why it would, but doing so would not violate the standard). If you want to be sure that the version of getenv you're using is a thread-safe implementation, you must consult your implementation's documentation to confirm that it is.
I've been doing some reading on immutable strings in general and in c++, here, here, and I think I have a decent understanding of how things work. However I have built a few assumptions that I would just like to run by some people for verification. Some of the assumptions are more general than the title would suggest:
While a const string in c++ is the closest thing to an immutable string in STL, it is only locally immutable and therefore doesn't experience the benefit of being a smaller object. So it has all the trimmings of a standard string object but it can't access all of the member functions. This means that it doesn't create any optimization in the program over non-const? But rather just protects the object from modification? I understand that this is an important attribute but I'm simply looking to know what it means to use this
I'm assuming that an object's member functions exist only once in read-only memory, and how is probably implementation specific, so does a const object have a separate location in memory? Or are the member functions limited in another way? If there are only 'const string' objects and no non-const strings in a code base, does the compiler leave out the inaccessible functions?
I recall hearing that each string literal is stored only once in read-only memory in c++, however I don't find anything on this here. In other words, if I use some string literal multiple times in the same program, each instance references the same location in memory. I'm going to assume no, but would two string objects initialized by the same string literal point to the same string until one is modified?
I apologize if I have included too many disjunct thoughts in the same post, they are all related to me as string representation and just learning how to code better.
As far as I know, std::string cannot assume that the input string is a read-only constant string from your data segment. Therefore, point (3) does not apply. It will most likely allocate a buffer and copy the string in the buffer.
Note that C++ (like C) has a const qualifier for compilation time, it is a good idea to use it for two reasons: (a) it will help you find bugs, a statement such as a = 5; if a is declared const fails to compile; (b) the compile may be able to optimize the code more easily (it may otherwise not be able to figure out that the object is constant.)
However, C++ has a special cast to remove the const-ness of a variable. So our a variable can be cast and assigned a value as in const_cast<int&>(a) = 5;. An std::string can also get its const-ness removed. (Note that C does not have a special cast, but it offers the exact same behavior: * (int *) &a = 5)
Are all class members defined in the final binary?
No. std::string as most of the STL uses templates. Templates are compiled once per unit (your .o object files) and the link will reduce duplicates automatically. So if you look at the size of all the .o files and add them up, the final output will be a lot small.
That also means only the functions that are used in a unit are compiled and saved in the object file. Any other function "disappear". That being said, often function A calls function B, so B will be defined, even if you did not explicitly call it.
On the other hand, because these are templates, very often the functions get inlined. But that is a choice by the compiler, not the language or the STL (although you can use the inline keyword for fun; the compiler has the right to ignore it anyway).
Smaller objects... No, in C++ an object has a very specific size that cannot change. Otherwise the sizeof(something) would vary from place to place and C/C++ would go berserk!
Static strings that are saved in read-only data sections, however, can be optimized. If the linker/compiler are good enough, they will be able to merge the same string in a single location. These are just plan char * or wchar_t *, of course. The Microsoft compiler has been able to do that one for a while now.
Yet, the const on a string does not always force your string to be put in a read-only data section. That will generally depend on your command line option. C++ may have corrected that, but I think C still put everything in a read/write section unless you use the correct command line option. That's something you need to test to make sure (your compiler is likely to do it, but without testing you won't know.)
Finally, although std::string may not use it, C++ offers a quite interesting keyword called mutable. If you heard about it, you would know that a variable member can be marked as mutable and that means even const functions can modify that variable member. There are two main reason for using that keyword: (1) you are writing a multi-thread program and that class has to be multi-thread safe, in that case you mark the mutex as mutable, very practical; (2) you want to have a buffer used to cache a computed value which is costly, that buffer is only initialized when that value is requested to not waste time otherwise, that buffer is made mutable too.
Therefore the "immutable" concept is only really something that you can count on at a higher level. In practice, reality is often quite different. For example, an std::string c_str() function may reallocate the buffer to add the necessary '\0' terminator, yet that function is marked as being a const:
const CharT* c_str() const;
Actually, an implementation is free to allocate a completely different buffer, copy its existing data to that buffer and return that bare pointer. That means internally the std::string could be allocate many buffers to store large strings (instead of using realloc() which can be costly.)
Once thing, though... when you copy string A into string B (B = A;) the string data does not get copied. Instead A and B will share the same data buffer. Once you modify A or B, and only then, the data gets copied. This means calling a function which accepts a string by copy does not waste that much time:
int func(std::string a)
{
...
if(some_test)
{
// deep copy only happens here
a += "?";
}
}
std::string b;
func(b);
The characters of string b do not get copied at the time func() gets called. And if func() never modifies 'a', the string data remains the same all along. This is often referenced as a shallow copy or copy on write.
In our C++ code, we have our own string class (for legacy reasons). It supports a method c_str() much like std::string. What I noticed is that many developers are using it incorrectly. I have reduced the problem to the following line:
const char* x = std::string("abc").c_str();
This seemingly innocent code is quite dangerous in the sense that the destructor on std::string gets invoked immediately after the call to c_str(). As a result, you are holding a pointer to a de-allocated memory location.
Here is another example:
std::string x("abc");
const char* y = x.substr(0,1).c_str();
Here too, we are using a pointer to de-allocated location.
These problems are not easy to find during testing as the memory location still contains valid data (although the memory location itself is invalid).
I am wondering if you have any suggestions on how I can modify class/method definition such that developers can never make such a mistake.
The modern part of the code should not deal with raw pointers like that.
Call c_str only when providing an argument to a legacy function that takes const char*. Like:
legacy_print(x.substr(0,1).c_str())
Why would you want to create a local variable of type const char*? Even if you write a copying version c_str_copy() you will just get more headache because now the client code is responsible for deleting the resulting pointer.
And if you need to keep the data around for a longer time (e.g. because you want to pass the data to multiple legacy functions) then just keep the data wrapped in a string instance the whole time.
For the basic case, you can add a ref qualifier on the "this" object, to make sure that .c_str() is never immediately called on a temporary. Of course, this can't stop them from storing in a variable that leaves scope before the pointer does.
const char *c_str() & { return ...; }
But the bigger-picture solution is to replace all functions from taking a "const char *" in your codebase with functions that take one of your string classes (at the very least, you need two: an owning string and a borrowed slice) - and make sure that none of your string class does cannot be implicitly constructed from a "const char *".
The simplest solution would be to change your destructor to write a null at the beginning of the string at destruction time. (Alternatively, fill the entire string with an error message or 0's; you can have a flag to disable this for release code.)
While it doesn't directly prevent programmers from making the mistake of using invalid pointers, it will definitely draw attention to the problem when the code doesn't do what it should do. This should help you flush out the problem in your code.
(As you mentioned, at the moment the errors go unnoticed because for the most part the code will happily run with the invalid memory.)
Consider using Valgrind or Electric Fence to test your code. Either of these tools should trivially and immediately find these errors.
I am not sure that there is much you can do about people using your library incorrectly if you warn them about it. Consider the actual stl string library. If i do this:
const char * lala = std::string("lala").c_str();
std::cout << lala << std::endl;
const char * lala2 = std::string("lalb").c_str();
std::cout << lala << std::endl;
std::cout << lala2 << std::endl;
I am basically creating undefined behavior. In the case where i run it on ideone.com i get the following output:
lala
lalb
lalb
So clearly the memory of the original lala has been overwritten. I would just make it very clear to the user in the documentation that this sort of coding is bad practice.
You could remove the c_str() function and instead provide a function that accepts a reference to an already created empty smart pointer that resets the value of the smart pointer to a new copy of the string. This would force the user to create a non temporary object which they could then use to get the raw c string and it would be destructed and free the memory when exiting the method scope.
This assumes though that your library and its users would be sharing the same heap.
EDIT
Even better, create your own smart pointer class for this purpose whose destructor calls a library function in your library to free the memory so it can be used across DLL boundaries.
Is it valid (defined behavior) to access a string literal simultaneously with multiple threads? Given a function like this:
const char* give()
{
return "Hello, World!";
}
Would it be save to call the function and dereference the pointer simultaneously?
Edit: Many answers. Will accept the first one who can show me the section out of the standard.
According to the standard:
C++11 1.10/3: The value of an object visible to a thread T at a particular point is the initial value of the object, a value assigned to the object by T, or a value assigned to the object by another thread, according to the rules below.
A string literal, like any other constant object, cannot legally be assigned to; it has static storage duration, and so is initialised before the program starts; therefore, all threads will see its initial value at all times.
Older standards had nothing to say about threads; so if your compiler doesn't support the C++11 threading model then you'll have to consult its documentation for any thread-safety guarantees. However, it's hard to imagine any implementation under which access to immutable objects were not thread-safe.
Yes, it's safe. Why wouldn't it be? It would be unsafe if you'd try to modify the string, but that's illegal anyway.
It is always safe to access immutable data from multiple threads. String literals are an example of immutable data (since it's illegal to modify them at run-time), so it is safe to access them from multiple threads.
As long as you only read data, you can access it from as many threads as you want. When data needs to be changed, that's when it gets complicated.
This depends on the implementation of the C Compiler. But I do not know of an implementation where concurrent read accesses might be unsafe, so in practice this is safe.
String literals are (conceptually) stored in read only memory and initialised on loading (rather than at runtime). It's therefore safe to access them from multiple threads at any time.
Note that more complex structures might not be initialised at load time, and so multiple thread access might have the possibility of issues immediately after the creation of the object.
But string literals are completely safe.
while I was reading nVidia CUDA source code, I stumbled upon these two lines:
std::string stdDevString;
stdDevString = std::string(device_string);
Note that device_string is a char[1024]. The question is: Why construct an empty std::string, then construct it again with a C string as an argument? Why didn't they call std::string stdDevString = std::string(device_string); in just one line?
Is there a hidden string initialization behavior that this code tries to evade/use? Is to ensure that the C string inside stdDevString remains null terminated no matter what? Because as far as I know, initializing an std::string to a C string that's not null terminated will still exhibit problems.
Why didn't they call std::string stdDevString = std::string(device_string); in just one line?
No good reason for what they did. Given the std::string::string(const char*) constructor, you can simply use any of:
std::string stdDevString = device_string;
std::string stdDevString(device_string);
std::string stdDevString{device_string}; // C++11 { } syntax
The two-step default construction then assignment is just (bad) programmer style or oversight. Sans optimisation, it does do a little unnecessary construction, but that's still pretty cheap. It's likely removed by optimisation. Not a biggie - I doubt if I'd bother to mention it in a code review unless it was in an extremely performance sensitive area, but it's definitely best to defer declaring variables until a useful initial value is available to construct them with, localising it all in one place: not only is it less error prone and cross-referenceable, but it minimises the scope of the variable simplifying the reasoning about its use.
Is to ensure that the C string inside stdDevString remains null terminated no matter what?
No - it made no difference to that. Since C++11 the internal buffer in stdDevString would be kept NUL terminated regardless of which constructor is used, while for C++03 isn't not necessarily terminated - see dedicated heading for C++03 details below - but there's no guarantees regardless of how construction / assignment is done.
Because as far as I know, initializing an std::string to a C string that's not null terminated will still exhibit problems.
You're right - any of the construction options you've listed will only copy ASCIIZ text into the std::string - considering the first NUL ('\0') the terminator. If the char array isn't NUL-terminated there will be problems.
(That's a separate issue to whether the buffer inside the std::string is kept NUL terminated - discussed above).
Note that there's a separate string(const char*, size_type) constructor that can create strings with embedded NULs, and won't try to read further than told (Constructor (4) here)
C++03 std::strings were not guaranteed NUL-terminated internally
Whichever way the std::string is constructed and initialised, before C++11 the Standard did not require it to be NUL-terminated within the string's buffer. std::string was best imagined as containing a bunch of potentially non-printable (loosely speaking, binary in the ftp/file I/O sense) characters starting at address data() and extending for size() characters. So, if you had:
std::string x("help");
x[4]; // undefined behaviour: only [0]..[3] are safe
x.at(4); // will throw rather than return '\0'
x.data()[4]; // undefined behaviour, equivalent to x[4] above
x.c_str()[4]; // safely returns '\0', (perhaps because a NUL was always
// at x[4], one was just added, or a new NUL-terminated
// buffer was just prepared - in which case data() may
// or may not start returning it too)
Note that the std::string API requires c_str() to return a pointer to a NUL-terminated value. To do so, it can either:
proactively keep an extra NUL on the end of the string buffer at all times (in which case data[5] would happen to be safe on that implementation, but the code could break if the implementation changed or the code was ported to another Standard library implementation etc.)
reactively wait until c_str() is called, then:
if it has enough capacity at the current address (i.e. data()), append a NUL and return the same pointer value that data() would return
otherwise, allocate a new, larger buffer, copy the data over, NUL terminate it, and return a pointer to it (typically but optionally this buffer would replace the old buffer which would be deleted, such that calling data() immediately afterwards would return the same pointer returned by c_str())
I would say that it's equivalent of writing:
std::string stdDevString = std::string(device_string);
Or, even simpler:
std::string stdDevString = device_string;
Once the std::string has been created, it contains a private copy of the data in the C string.
I think it is ignorant to dismiss this as poor coding. If we assume that this string was allocated at file scope or as a static variable, it could be good coding.
When programming C++ for embedded systems with non-volatile memory present, there are many reasons why you wish to avoid static initialization: the main reason is that it adds lots of overhead code in the beginning of the program, where all such variables much be initialized. If they are instances of classes, constructors will be called.
This will lead to a delay peak at the beginning of the program execution. You don't want this workload peak there, because there are much more important tasks to do when starting up the program, like setting up various hardware.
To avoid this, you typically enable an option in the compiler which removes such static initialization, and then write your code in such a manner that no static/global variables are initialized, but instead set them in runtime.
On such a system, the code posted by the OP is the correct way to do it.
Looks like an artefact to me. Perhaps there was some other code in between, then it got removed, and someone was too lazy to join those two remaining lines into a single one.