Strange Way to Initialize std::string with C string

Strange Way to Initialize std::string with C string - c++

while I was reading nVidia CUDA source code, I stumbled upon these two lines:
std::string stdDevString;
stdDevString = std::string(device_string);
Note that device_string is a char[1024]. The question is: Why construct an empty std::string, then construct it again with a C string as an argument? Why didn't they call std::string stdDevString = std::string(device_string); in just one line?
Is there a hidden string initialization behavior that this code tries to evade/use? Is to ensure that the C string inside stdDevString remains null terminated no matter what? Because as far as I know, initializing an std::string to a C string that's not null terminated will still exhibit problems.

Why didn't they call std::string stdDevString = std::string(device_string); in just one line?
No good reason for what they did. Given the std::string::string(const char*) constructor, you can simply use any of:
std::string stdDevString = device_string;
std::string stdDevString(device_string);
std::string stdDevString{device_string}; // C++11 { } syntax
The two-step default construction then assignment is just (bad) programmer style or oversight. Sans optimisation, it does do a little unnecessary construction, but that's still pretty cheap. It's likely removed by optimisation. Not a biggie - I doubt if I'd bother to mention it in a code review unless it was in an extremely performance sensitive area, but it's definitely best to defer declaring variables until a useful initial value is available to construct them with, localising it all in one place: not only is it less error prone and cross-referenceable, but it minimises the scope of the variable simplifying the reasoning about its use.
Is to ensure that the C string inside stdDevString remains null terminated no matter what?
No - it made no difference to that. Since C++11 the internal buffer in stdDevString would be kept NUL terminated regardless of which constructor is used, while for C++03 isn't not necessarily terminated - see dedicated heading for C++03 details below - but there's no guarantees regardless of how construction / assignment is done.
Because as far as I know, initializing an std::string to a C string that's not null terminated will still exhibit problems.
You're right - any of the construction options you've listed will only copy ASCIIZ text into the std::string - considering the first NUL ('\0') the terminator. If the char array isn't NUL-terminated there will be problems.
(That's a separate issue to whether the buffer inside the std::string is kept NUL terminated - discussed above).
Note that there's a separate string(const char*, size_type) constructor that can create strings with embedded NULs, and won't try to read further than told (Constructor (4) here)
C++03 std::strings were not guaranteed NUL-terminated internally
Whichever way the std::string is constructed and initialised, before C++11 the Standard did not require it to be NUL-terminated within the string's buffer. std::string was best imagined as containing a bunch of potentially non-printable (loosely speaking, binary in the ftp/file I/O sense) characters starting at address data() and extending for size() characters. So, if you had:
std::string x("help");
x[4]; // undefined behaviour: only [0]..[3] are safe
x.at(4); // will throw rather than return '\0'
x.data()[4]; // undefined behaviour, equivalent to x[4] above
x.c_str()[4]; // safely returns '\0', (perhaps because a NUL was always
// at x[4], one was just added, or a new NUL-terminated
// buffer was just prepared - in which case data() may
// or may not start returning it too)
Note that the std::string API requires c_str() to return a pointer to a NUL-terminated value. To do so, it can either:
proactively keep an extra NUL on the end of the string buffer at all times (in which case data[5] would happen to be safe on that implementation, but the code could break if the implementation changed or the code was ported to another Standard library implementation etc.)
reactively wait until c_str() is called, then:
if it has enough capacity at the current address (i.e. data()), append a NUL and return the same pointer value that data() would return
otherwise, allocate a new, larger buffer, copy the data over, NUL terminate it, and return a pointer to it (typically but optionally this buffer would replace the old buffer which would be deleted, such that calling data() immediately afterwards would return the same pointer returned by c_str())

I would say that it's equivalent of writing:
std::string stdDevString = std::string(device_string);
Or, even simpler:
std::string stdDevString = device_string;
Once the std::string has been created, it contains a private copy of the data in the C string.

I think it is ignorant to dismiss this as poor coding. If we assume that this string was allocated at file scope or as a static variable, it could be good coding.
When programming C++ for embedded systems with non-volatile memory present, there are many reasons why you wish to avoid static initialization: the main reason is that it adds lots of overhead code in the beginning of the program, where all such variables much be initialized. If they are instances of classes, constructors will be called.
This will lead to a delay peak at the beginning of the program execution. You don't want this workload peak there, because there are much more important tasks to do when starting up the program, like setting up various hardware.
To avoid this, you typically enable an option in the compiler which removes such static initialization, and then write your code in such a manner that no static/global variables are initialized, but instead set them in runtime.
On such a system, the code posted by the OP is the correct way to do it.

Looks like an artefact to me. Perhaps there was some other code in between, then it got removed, and someone was too lazy to join those two remaining lines into a single one.

Related

Remove last char in stringstream?

For example, the stringstream contains "abc\n", I want to remove the last char '\n'.
I know it can be done by using 'str' first.
But could it be done without stringstream::str()?

No, there isn't, at least not in a guaranteed manner. Although internally, it maintains a string buffer, you currently do not have access to it without a copy being made. There is a proposal to change this:
Streams have been the oldest part of the C++ standard library and their specification doesn’t take into account many things introduced since C++11. One of the oversights is that there is no non-copying access to the internal buffer of a basic_stringbuf which makes at least the obtaining
of the output results from an ostringstream inefficient, because a copy is always made. I personally speculate that this was also the reason why basic_strbuf took so long to get deprecated with its char * access.
With move semantics and basic_string_view there is no longer a reason to keep this pessimissation alive on basic_stringbuf.
Internally, there is no reason why there should be this limited, as I believe (I may be wrong) that basic_stringbuf requires a basic_string buffer, and Clang certainly implements basic_stringbuf in such a manner.
Right now, you can stringstream like any other stream, or access a copy of it's underlying buffer, however, you cannot modify the buffer directly. This means that any attempts to modify the end of the stream require copying the underlying buffer or reading bytes until the end.

stringstream ss;
ss<<"abc\n";
ss.seekp(-1, std::ios_base::end);
ss << '\0';

Questions and Verifications on immutable [string] objects c++

I've been doing some reading on immutable strings in general and in c++, here, here, and I think I have a decent understanding of how things work. However I have built a few assumptions that I would just like to run by some people for verification. Some of the assumptions are more general than the title would suggest:
While a const string in c++ is the closest thing to an immutable string in STL, it is only locally immutable and therefore doesn't experience the benefit of being a smaller object. So it has all the trimmings of a standard string object but it can't access all of the member functions. This means that it doesn't create any optimization in the program over non-const? But rather just protects the object from modification? I understand that this is an important attribute but I'm simply looking to know what it means to use this
I'm assuming that an object's member functions exist only once in read-only memory, and how is probably implementation specific, so does a const object have a separate location in memory? Or are the member functions limited in another way? If there are only 'const string' objects and no non-const strings in a code base, does the compiler leave out the inaccessible functions?
I recall hearing that each string literal is stored only once in read-only memory in c++, however I don't find anything on this here. In other words, if I use some string literal multiple times in the same program, each instance references the same location in memory. I'm going to assume no, but would two string objects initialized by the same string literal point to the same string until one is modified?
I apologize if I have included too many disjunct thoughts in the same post, they are all related to me as string representation and just learning how to code better.

As far as I know, std::string cannot assume that the input string is a read-only constant string from your data segment. Therefore, point (3) does not apply. It will most likely allocate a buffer and copy the string in the buffer.
Note that C++ (like C) has a const qualifier for compilation time, it is a good idea to use it for two reasons: (a) it will help you find bugs, a statement such as a = 5; if a is declared const fails to compile; (b) the compile may be able to optimize the code more easily (it may otherwise not be able to figure out that the object is constant.)
However, C++ has a special cast to remove the const-ness of a variable. So our a variable can be cast and assigned a value as in const_cast<int&>(a) = 5;. An std::string can also get its const-ness removed. (Note that C does not have a special cast, but it offers the exact same behavior: * (int *) &a = 5)
Are all class members defined in the final binary?
No. std::string as most of the STL uses templates. Templates are compiled once per unit (your .o object files) and the link will reduce duplicates automatically. So if you look at the size of all the .o files and add them up, the final output will be a lot small.
That also means only the functions that are used in a unit are compiled and saved in the object file. Any other function "disappear". That being said, often function A calls function B, so B will be defined, even if you did not explicitly call it.
On the other hand, because these are templates, very often the functions get inlined. But that is a choice by the compiler, not the language or the STL (although you can use the inline keyword for fun; the compiler has the right to ignore it anyway).
Smaller objects... No, in C++ an object has a very specific size that cannot change. Otherwise the sizeof(something) would vary from place to place and C/C++ would go berserk!
Static strings that are saved in read-only data sections, however, can be optimized. If the linker/compiler are good enough, they will be able to merge the same string in a single location. These are just plan char * or wchar_t *, of course. The Microsoft compiler has been able to do that one for a while now.
Yet, the const on a string does not always force your string to be put in a read-only data section. That will generally depend on your command line option. C++ may have corrected that, but I think C still put everything in a read/write section unless you use the correct command line option. That's something you need to test to make sure (your compiler is likely to do it, but without testing you won't know.)
Finally, although std::string may not use it, C++ offers a quite interesting keyword called mutable. If you heard about it, you would know that a variable member can be marked as mutable and that means even const functions can modify that variable member. There are two main reason for using that keyword: (1) you are writing a multi-thread program and that class has to be multi-thread safe, in that case you mark the mutex as mutable, very practical; (2) you want to have a buffer used to cache a computed value which is costly, that buffer is only initialized when that value is requested to not waste time otherwise, that buffer is made mutable too.
Therefore the "immutable" concept is only really something that you can count on at a higher level. In practice, reality is often quite different. For example, an std::string c_str() function may reallocate the buffer to add the necessary '\0' terminator, yet that function is marked as being a const:
const CharT* c_str() const;
Actually, an implementation is free to allocate a completely different buffer, copy its existing data to that buffer and return that bare pointer. That means internally the std::string could be allocate many buffers to store large strings (instead of using realloc() which can be costly.)
Once thing, though... when you copy string A into string B (B = A;) the string data does not get copied. Instead A and B will share the same data buffer. Once you modify A or B, and only then, the data gets copied. This means calling a function which accepts a string by copy does not waste that much time:
int func(std::string a)
{
...
if(some_test)
{
// deep copy only happens here
a += "?";
}
}
std::string b;
func(b);
The characters of string b do not get copied at the time func() gets called. And if func() never modifies 'a', the string data remains the same all along. This is often referenced as a shallow copy or copy on write.

C++ strings - How to avoid obtaining invalid pointer?

In our C++ code, we have our own string class (for legacy reasons). It supports a method c_str() much like std::string. What I noticed is that many developers are using it incorrectly. I have reduced the problem to the following line:
const char* x = std::string("abc").c_str();
This seemingly innocent code is quite dangerous in the sense that the destructor on std::string gets invoked immediately after the call to c_str(). As a result, you are holding a pointer to a de-allocated memory location.
Here is another example:
std::string x("abc");
const char* y = x.substr(0,1).c_str();
Here too, we are using a pointer to de-allocated location.
These problems are not easy to find during testing as the memory location still contains valid data (although the memory location itself is invalid).
I am wondering if you have any suggestions on how I can modify class/method definition such that developers can never make such a mistake.

The modern part of the code should not deal with raw pointers like that.
Call c_str only when providing an argument to a legacy function that takes const char*. Like:
legacy_print(x.substr(0,1).c_str())
Why would you want to create a local variable of type const char*? Even if you write a copying version c_str_copy() you will just get more headache because now the client code is responsible for deleting the resulting pointer.
And if you need to keep the data around for a longer time (e.g. because you want to pass the data to multiple legacy functions) then just keep the data wrapped in a string instance the whole time.

For the basic case, you can add a ref qualifier on the "this" object, to make sure that .c_str() is never immediately called on a temporary. Of course, this can't stop them from storing in a variable that leaves scope before the pointer does.
const char *c_str() & { return ...; }
But the bigger-picture solution is to replace all functions from taking a "const char *" in your codebase with functions that take one of your string classes (at the very least, you need two: an owning string and a borrowed slice) - and make sure that none of your string class does cannot be implicitly constructed from a "const char *".

The simplest solution would be to change your destructor to write a null at the beginning of the string at destruction time. (Alternatively, fill the entire string with an error message or 0's; you can have a flag to disable this for release code.)
While it doesn't directly prevent programmers from making the mistake of using invalid pointers, it will definitely draw attention to the problem when the code doesn't do what it should do. This should help you flush out the problem in your code.
(As you mentioned, at the moment the errors go unnoticed because for the most part the code will happily run with the invalid memory.)

Consider using Valgrind or Electric Fence to test your code. Either of these tools should trivially and immediately find these errors.

I am not sure that there is much you can do about people using your library incorrectly if you warn them about it. Consider the actual stl string library. If i do this:
const char * lala = std::string("lala").c_str();
std::cout << lala << std::endl;
const char * lala2 = std::string("lalb").c_str();
std::cout << lala << std::endl;
std::cout << lala2 << std::endl;
I am basically creating undefined behavior. In the case where i run it on ideone.com i get the following output:
lala
lalb
lalb
So clearly the memory of the original lala has been overwritten. I would just make it very clear to the user in the documentation that this sort of coding is bad practice.

You could remove the c_str() function and instead provide a function that accepts a reference to an already created empty smart pointer that resets the value of the smart pointer to a new copy of the string. This would force the user to create a non temporary object which they could then use to get the raw c string and it would be destructed and free the memory when exiting the method scope.
This assumes though that your library and its users would be sharing the same heap.
EDIT
Even better, create your own smart pointer class for this purpose whose destructor calls a library function in your library to free the memory so it can be used across DLL boundaries.

Can I do a zero-copy std::string allocation in C++ from a const char * array?

Profiling of my application reveals that it is spending nearly 5% of CPU time in string allocation. In many, many places I am making C++ std::string objects from a 64MB char buffer. The thing is, the buffer never changes during the running of the program. My analysis of std::string(const char *buf,size_t buflen) calls is that that the string is being copied because the buffer might change after the string is made. That isn't the problem here. Is there a way around this problem?
EDIT: I am working with binary data, so I can't just pass around char *s. Besides, then I would have a substantial overhead from always scanning for the NULL, which the std::string avoids.

If the string isn't going to change and if its lifetime is guaranteed to be longer than you are going to use the string, then don't use std::string.
Instead, consider a simple C string wrapper, like the proposed string_ref<T>.

Binary data? Stop using std::string and use std::vector<char>. But that won't fix your issue of it being copied. From your description, if this huge 64MB buffer will never change, you truly shouldn't be using std::string or std::vector<char>, either one isn't a good idea. You really ought to be passing around a const char* pointer (const uint8_t* would be more descriptive of binary data but under the covers it's the same thing, neglecting sign issues). Pass around both the pointer and a size_t length of it, or pass the pointer with another 'end' pointer. If you don't like passing around separate discrete variables (a pointer and the buffer’s length), make a struct to describe the buffer & have everyone use those instead:
struct binbuf_desc {
uint8_t* addr;
size_t len;
binbuf_desc(addr,len) : addr(addr), len(len) {}
}
You can always refer to your 64MB buffer (or any other buffer of any size) by using binbuf_desc objects. Note that binbuf_desc objects don’t own the buffer (or a copy of it), they’re just a descriptor of it, so you can just pass those around everywhere without having to worry about binbuf_desc’s making unnecessary copies of the buffer.

There is no portable solution. If you tell us what toolchain you're using, someone might know a trick specific to your library implementation. But for the most part, the std::string destructor (and assignment operator) is going to free the string content, and you can't free a string literal. (It's not impossible to have exceptions to this, and in fact the small string optimization is a common case that skips deallocation, but these are implementation details.)
A better approach is to not use std::string when you don't need/want dynamic allocation. const char* still works just fine in modern C++.

Since C++17, std::string_view may be your way. It can be initialized both from a bare C string (with or without a length), or a std::string
There is no constraint that the data() method returns a zero-terminated string though.
If you need this "zero-terminated on request" behaviour, there are alternatives such as str_view from Adam Sawicki that looks satisfying (https://github.com/sawickiap/str_view)

Seems that using const char * instead of std::string is the best way to go for you. But you should also consider how you are using strings. It may be possible that there could be going on implicit conversion from char pointers to std::string objects. This could happen during function calls, for example.

Would this optimization in the implementation of std::string be allowed?

I was just thinking about the implementation of std::string::substr. It returns a new std::string object, which seems a bit wasteful to me. Why not return an object that refers to the contents of the original string and can be implicitly assigned to a std::string? A kind of lazy evaluation of the actual copying. Such a class could look something like this:
template <class Ch, class Tr, class A>
class string_ref {
public:
// not important yet, but *looks* like basic_string's for the most part
private:
const basic_string<Ch, Tr, A> &s_;
const size_type pos_;
const size_type len_;
};
The public interface of this class would mimic all of the read-only operations of a real std::string, so the usage would be seamless. std::string could then have a new constructor which takes a string_ref so the user would never be the wiser. The moment you try to "store" the result, you end up creating a copy, so no real issues with the reference pointing to data and then having it modified behind its back.
The idea being that code like this:
std::string s1 = "hello world";
std::string s2 = "world";
if(s1.substr(6) == s2) {
std::cout << "match!" << std::endl;
}
would have no more than 2 std::string objects constructed in total. This seems like a useful optimization for code which that performs a lot of string manipulations. Of course, this doesn't just apply to std::string, but to any type which can return a subset of its contents.
As far as I know, no implementations do this.
I suppose the core of the question is:
Given a class that can be implicitly converted to a std::string as needed, would it be conforming to the standard for a library writer to change the prototype of a member's to return type? Or more generally, do the library writers have the leeway to return "proxy objects" instead of regular objects in these types of cases as an optimization?
My gut is that this is not allowed and that the prototypes must match exactly. Given that you cannot overload on return type alone, that would leave no room for library writers to take advantage of these types of situations. Like I said, I think the answer is no, but I figured I'd ask :-).

This idea is copy-on-write, but instead of COW'ing the entire buffer, you keep track of which subset of the buffer is the "real" string. (COW, in its normal form, was (is?) used in some library implementations.)
So you don't need a proxy object or change of interface at all because these details can be made completely internal. Conceptually, you need to keep track of four things: a source buffer, a reference count for the buffer, and the start and end of the string within this buffer.
Anytime an operation modifies the buffer at all, it creates its own copy (from the start and end delimiters), decreases the old buffer's reference count by one, and sets the new buffer's reference count to one. The rest of the reference counting rules are the same: copy and increase count by one, destruct a string and decrease count by one, reach zero and delete, etc.
substr just makes a new string instance, except with the start and end delimiters explicitly specified.

This is a quite well-known optimization that is relatively widely used, called copy-on-write or COW. The basic thing is not even to do with substrings, but with something as simple as
s1 = s2;
Now, the problem with this optimization is that for C++ libraries that are supposed to be used on targets supporting multiple threads, the reference count for the string has to be accessed using atomic operations (or worse, protected with a mutex in case the target platform doesn't supply atomic operations). This is expensive enough that in most cases the simple non-COW string implementation is faster.
See GOTW #43-45:
http://www.gotw.ca/gotw/043.htm
http://www.gotw.ca/gotw/044.htm
http://www.gotw.ca/gotw/045.htm
To make matters worse, libraries that have used COW, such as the GNU C++ library, cannot simply revert to the simple implementation since that would break the ABI. (Although, C++0x to the rescue, as that will require an ABI bump anyway! :) )

Since substr returns std::string, there is no way to return a proxy object, and they can't just change the return type or overload on it (for the reasons you mentioned).
They could do this by making string itself capable of being a sub of another string. This would mean a memory penalty for all usages (to hold an extra string and two size_types). Also, every operation would need to check to see if it has the characters or is a proxy. Perhaps this could be done with an implementation pointer -- the problem is, now we're making a general purpose class slower for a possible edge case.
If you need this, the best way is to create another class, substring, that constructs from a string, pos, and length, and coverts to string. You can't use it as s1.substr(6), but you can do
substring sub(s1, 6);
You would also need to create common operations that take a substring and string to avoid the conversion (since that's the whole point).

Regarding your specific example, this worked for me:
if (&s1[6] == s2) {
std::cout << "match!" << std::endl;
}
That may not answer your question for a general-purpose solution. For that, you'd need sub-string CoW, as #GMan suggests.

What you are talking about is (or was) one of the core features of Java's java.lang.String class (http://fishbowl.pastiche.org/2005/04/27/the_string_memory_gotcha/). In many ways, the designs of Java's String class and C++'s basic_string template are similar, so I would imagine that writing an implementation of the basic_string template utilizing this "substring optimization" is possible.
One thing that you will need to consider is how to write the implementation of the c_str() const member. Depending on the location of a string as a substring of another, it may have to create a new copy. It definitely would have to create a new copy of the internal array if the string for which the c_str was requested is not a trailing substring. I think that this necessitates using the mutable keyword on most, if not all, of the data members of the basic_string implementation, greatly complicating the implementation of other const methods because the compiler is no longer able to assist the programmer with const correctness.
EDIT: Actually, to accommodate c_str() const and data() const, you could use a single mutable field of type const charT*. Initially set to NULL, it could be per-instance, initialized to a pointer to a new charT array whenever c_str() const or data() const are called, and deleted in the basic_string destructor if non-NULL.

If and only if you really need more performance than std::string provides then go ahead and write something that works the way you need it to. I have worked with variants of strings before.
My own preference is to use non-mutable strings rather than copy-on-write, and to use boost::shared_ptr or equivalent but only when the string is actually beyond 16 in length, so the string class also has a private buffer for short strings.
This does mean that the string class might carry a bit of weight.
I also have in my collection list a "slice" class that can look at a "subset" of a class that lives elsewhere as long as the lifetime of the original object is intact. So in your case I could slice the string to see a substring. Of course it would not be null-terminated, nor is there any way of making it such without copying it. And it is not a string class.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Strange Way to Initialize std::string with C string - c++

I would say that it's equivalent of writing: std::string stdDevString = std::string(device_string); Or, even simpler: std::string stdDevString = device_string; Once the std::string has been created, it contains a private copy of the data in the C string.

Looks like an artefact to me. Perhaps there was some other code in between, then it got removed, and someone was too lazy to join those two remaining lines into a single one.

Related

Remove last char in stringstream?

Questions and Verifications on immutable [string] objects c++

C++ strings - How to avoid obtaining invalid pointer?

Can I do a zero-copy std::string allocation in C++ from a const char * array?

Would this optimization in the implementation of std::string be allowed?

Categories

Resources