Like many people I'm in the habit of writing new string functions as functions of const std::string &. The advantages are efficiency (you can pass existing std::string objects without incurring overhead for copying/moving) and flexibility/readability (if all you have is a const char * you can just pass it and have the construction done implicitly, without cluttering up your code with an explicit std::string construction):
#include <string>
#include <iostream>
unsigned int LengthOfStringlikeObject(const std::string & s)
{
return s.length();
}
int main(int argc, const char * argv[])
{
unsigned int n = LengthOfStringlikeObject(argv[0]);
std::cout << "'" << argv[0] << "' has " << n << " characters\n";
}
My aim is to write efficient cross-platform code that can handle long strings efficiently. My question is, what happens during the implicit construction? Are there any guarantees that the string will not be copied? It strikes me that, because everything is const, copying is not necessary—a thin STL wrapper around the existing pointer is all that's needed—but I'm not sure how compiler- and platform-dependent I should expect that behavior to be. Would it be safer to always explicitly write two versions of the function, one for const std::string & and one for const char *?
If you pass a const char* to something that takes a std::string, reference or not, a string will be constructed. A compiler might even complain if you send it to a reference with a warning that there is an implicit temporary object.
Now this may be optimized by the compiler and also some implementations will not allocate memory for small strings. The compiler might also internally optimize it to use a C++17 string_view. It essentially depends on what you will do to the string in your code. If you only use constant member functions, a clever compiler might optimize out.
But that is up to the implementation and outside your control. You can use explicitly std::string_view if you want to take over.
If you don't want copying, then string_view is what you want.
However, with this benefit comes problems. Specifically, you have to ensure that the storage that you pass lasts "long enough".
For string literals, that's no problem. For argv[0], that's almost certainly not a problem. For arbitrary sequences of characters, then you'll need to think about them.
but you can write:
unsigned int LengthOfStringlikeObject(std::string_view sv)
{
return sv.length();
}
and call it with a string, or a const char *, and it will be fine.
It strikes me that, because everything is const, copying is not
necessary—a thin STL wrapper around the existing pointer is all that's
needed
I don't think this assumption is correct. Just because you have a pointer to const, it does not imply that the underlying value cannot change. It only implies that the value cannot be changed through that pointer. The pointer could be pointing to non-const storage which can change at any time.
Because of this, the library must make its own copy (to provide the "correct" string observable behavior). A quick review of libstdc++ shows that it always makes a copy. The construction from char* is not inline, so it cannot be optimized away without static linking and LTO.
While extremely trivial statically linked programs might have the copy optimized away with LTO (I wasn't able to reproduce this), I think in general it would be unlikely this optimization could be performed (especially considering the aliasing rules for char*). g++ doesn't even perform this optimization for a string literal.
Related
I made a mistake in a socket interface I wrote a while back and I just noticed the problem while looking through the code for a different issue. The socket receives a string of characters and passes it to jsoncpp to complete the json parsing. I can almost understand what is happening here but I can't get my head around it. I would like to grasp what is actually happening under the hood. Here is the minimum example:
#include <iostream>
#include <cstring>
void doSomethingWithAString(const std::string &val) {
std::cout << val.size() << std::endl;
std::cout << val << std::endl;
}
int main()
{
char responseBufferForSocket[10000];
memset(responseBufferForSocket, 0, 10000);
//Lets simulate a response from a socket connection
responseBufferForSocket[0] = 'H';
responseBufferForSocket[1] = 'i';
responseBufferForSocket[2] = '?';
// Now lets pass a .... the address of the first char in the array...
// wait a minute..that's not a const std::string& ... but hey, it's ok it *works*!
doSomethingWithAString(responseBufferForSocket);
return 0;
}
The code above is not causing any obvious issues but I would like to correct it if there is a problem lurking. Obviously the character array is being transformed to a string, but by what mechanism? I guess I have four questions:
Is this string converted on the stack and passed by reference or is it passed by value?
Is it using the operator= overload? A "from c-string" constructor? Some other mechanism?
Based on 2 is this less efficient in than converting to a string explicitly using a constructor?
Is this dangerous. :)
compiled with g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
std::string has a non explicit constructor (i.e. not marked with the explicit keyword) that takes a const char* parameter and copies characters until the first '\0' (the behaviour is undefined if no such character exists in the string). In other words, it performs a copy of the source data. It's overload #5 on this page.
const char[] implicitly decays to const char*, and you can pass a temporary to a function taking a const reference parameter. This only works if the reference is const, by the way; if you can't use const, pass it by value.
And so, when you pass a const char[] to that function, a temporary object of type std::string is constructed using that constructor, and bound to the parameter. The temporary will remain alive for the duration of the function call, and will be destroyed when it returns.
With all that in mind, let's address your questions:
It's passed by reference, but the reference is to a temporary object.
A constructor, since we're constructing an object. std::string also has an operator= taking a const char* parameter, but that's never used for implicit conversions: you'll need to be explicitly assigning something.
The performance is the same since the same code runs, but you do incur some overhead because the data is copied instead of referenced. If that is an issue, use std::string_view instead.
It's safe as long as you don't try to keep a reference or pointer to the parameter for longer than the function call, because the object might not be alive afterwards (but then you should always keep that in mind with reference parameters). You also need to make sure that the C string you're passing is properly null terminated.
Is this string converted on the stack
The language doesn't specify the storage of temporary objects, but in this case it is probably stored on the stack, yes.
or is it passed by value?
The argument is a reference. Therefore you are "passing by reference".
Is it using the operator= overload?
No. You aren't using operator= there, so why would it?
A "from c-string" constructor?
Yes.
Based on 2 is this less efficient in than converting to a string explicitly using a constructor?
No. Whether object is created implicitly or explicitly is irrelevant to efficiency.
Creating a std::string is however potentially less efficient than not creating it which you could achieve by not accepting a reference to a string as the argument. You could use a string view instead.
Is this dangerous.
Not particularly. In some cases implicit conversions can cause a bit of problems when the programmers doesn't notice them, but typically they simplify the language by reducing verbosity.
Does it matter anymore if I use boost::string_ref over std::string& ? I mean, is it really more efficient to use boost::string_ref over the std version when you are processing strings ? I don't really get the explanation offered here: http://www.boost.org/doc/libs/1_61_0/libs/utility/doc/html/string_ref.html . What really confuses me is the fact that std::string is also a handle class that only points to the allocated memory, and since c++11, with move semantics the copy operations noted in the article above are not going to happen. So, which one is more efficient ?
The use case for string_ref (or string_view in recent Boost and C++17) is for substring references.
The case where
the source string happens to be std::string
and the full length of a source string is referenced
is a (a-typical) special case, where it does indeed resemble std::string const&.
Note also that operations on string_ref (like sref.substring(...)) automatically return more string_ref objects, instead of allocating a new std::string.
I have never used it be it seems to me that its purpose is to provide an interface similar to std::string but without having to allocate a string for manipulation. Take the example given extract_part(): it is given a hard-coded C array "ABCDEFG", but because the initial function takes a std::string an allocation takes place (std::string will have its own version of "ABCDEFG"). Using string_ref, no allocation occurs, it uses the reference to the initial "ABCDEFG". The constraint is that the string is read-only.
This answer uses the new name string_view to mean the same as string_ref.
What really confuses me is the fact that std::string is also a handle class that only points to the allocated memory
A string allocates, owns, and manages its own memory. A string_view is a handle to some memory that was already allocated. The memory is managed by some other mechanism, unrelated to the string_view.
If you already have some text data, for example in a char array, then the additional memory allocation involved in constructing a string might be redundant. A string_view could be more efficient because it would allow you to operate directly on the original data in the char array. However, it would not permit the data to be modified; string_view allows no non-const access, because it doesn't own the data it refers to.
and since c++11, with move semantics the copy operations noted in the article above are not going to happen.
You can only move from an object that is ready to be discarded. Copying still serves a purpose and is necessary in many cases.
The example in the article constructs two new strings (not copies) and also constructs two copies of existing strings. In C++98 the copies could already be elided by RVO without move semantics, so they're not a big deal. By using string_view it avoids constructing the two new strings. Move semantics are irrelevant here.
In the call to extract_part("ABCDEFG") a string_view is constructed which refers to the char array represented by the string literal. Constructing a string here would have involved a memory allocation and a copy of the char array.
In the call to bar.substr(2,3) a string_view is constructed which refers to parts of the data already referred to by the first string_view. Using a string here would have involved another memory allocation and copy of part of the data.
So, which one is more efficient?
This is a bit like asking if a hammer is more efficient than a screwdriver. They serve different purposes, so it depends what it is you're trying to accomplish.
You need to be careful when using string_view that the memory it refers to remains valid throughout its lifetime.
If you stick to std::string it does not matter, but boost::string_ref also supports const char*. That is, do you intend to call your string processing function foo with std::string only?
void foo(const std::string&);
foo("won't work"); // no support for `const char*`
Since boost::string_ref is constructable from const char*, it is more flexible since it works with both const char* and std::string.
The proposal N3442 might be helpful.
In short: The main benefit of std::string_view over const std::string& is that you can pass both const char* and std::string objects without doing a copy. As others have said, it also allows you to pass substrings without copying, although (in my experience) this is somewhat less often important.
Consider the following (silly) function (yes I know you could just call s.at(2)):
char getThird(std::string s)
{
if (s.size() < 3) throw std::runtime_error("String too short");
return s[2];
}
This function works, but the string is passed by value. This means the whole length of the string is copied even though we don't look at all of it, and it also (often) incurs a dynamic memory allocation. Doing this in a tight loop can be very expensive. One solution to this is to pass the string by const reference instead:
char getThird(const std::string& s);
This works a lot better if you have a std::string variable and you pass it as a parameter to getThird. But now there's a problem: what if you have a null-terminated const char* string? When you call this function, a temporary std::string will get constructed, so you still get still get the copy and dynamic memory allocation.
Here's another attempt:
char getThird(const char* s)
{
if (std::strlen(s) < 3) throw std::runtime_error("String too short");
return s[2];
}
This will obviously now work fine for const char* variables. It will also work for std::string variables, but calling it is a little awkward: getThird(myStr.c_str()). What's more, std::string supports embedded null characters, and getThird will misinterpret the string as ended at the first of these. At worst this could cause a security vulnerability - imagine if the function were called checkStringForBadHacks!
Another problem is simply that it's annoying to write a function in terms of old null-terminated strings instead of std::string objects with their handy methods. Did you notice, for example, that this function looks at the whole length of the string even though only the first few characters are important? It's hidden in std::strlen, which iterates over all characters looking for the null terminator. We could replace that with a manual check that the first three characters aren't null, but you can see this is a lot less convenient than the other versions.
Step in std::string_view (or boost::string_view, previously known as boost::string_ref):
char getThird(std::string_view s)
{
if (s.size() < 3) throw std::runtime_error("String too short");
return s[2];
}
This gives you the nice methods you expect from a proper string class, like .size(), and it works in both the situations discussed above, plus another:
It works with std::string objects, which can be implicitly be converted to std::string_view objects.
It works with const char* null-terminated strings, which can also be implicitly be converted to std::string_view objects.
This does have the potential disadvantage that constructing the std::string_view requires iterating over the whole string to find the length, even if the function that uses it never needs it (as is the case here). However, if a caller is using a const char* as a parameter to several functions (or one function in a loop) that take std::string_view objects it could always manually construct that object beforehand. This could even give a performance increase, because if that function(s) do need the length then it is precomputed once and reused.
As other answers have mentioned, it also avoids a copy when you only want to pass a substring. For example, this is very useful in parsing. But std::string_view is justified even without this feature.
It's worth noting that there is a case where the original function signature, taking a std::string by value, may actually be better than a std::string_view. That's where you were going to make a copy of the string anyway, for example to store in some other variable or to return from the function. Imagine this function:
std::string changeThird(std::string s, char c)
{
if (s.size() < 3) throw std::runtime_error("String too short");
s[2] = c;
return s;
}
// vs.
std::string changeThird(std::string_view s, char c)
{
if (s.size() < 3) throw std::runtime_error("String too short");
std::string result = s;
result[2] = c;
return result;
}
Note that both of these involve exactly one copy: In the first case this is done implicitly when the parameter s is constructed from whatever is passed in (including if it is another std::string). In the second case we do it explicitly when we create result. But the return statement does not do a copy, because uses move semantics (as if we had done std::move(result)), or more likely uses the return value optimisation.
The reason the first version can be better is that it is actually possible for it to perform zero copies, if the caller moves the argument:
std::string something = getMyString();
std::string other = changeThird(std::move(something), "x");
In this case, the first changeThird does not involve any copy at all, whereas the second one does.
It maybe seems to be a silly question but i really need to clarify this:
Will this bring any danger to my program?
Is the const_cast even needed?
If i change the input pointers values in place will it work safely with std::string or will it create undefined behaviour?
So far the only concern is that this could affect the string "some_text" whenever I modify the input pointer and makes it unusable.
std::string some_text = "Text with some input";
char * input = const_cast<char*>(some_text.c_str());
Thanks for giving me some hints, i would like to avoid the shoot in my own foot
As an example of evil behavior: the interaction with gcc's Copy On Write implementation.
#include <string>
#include <iostream>
int main() {
std::string const original = "Hello, World!";
std::string copy = original;
char* c = const_cast<char*>(copy.c_str());
c[0] = 'J';
std::cout << original << "\n";
}
In action at ideone.
Jello, World!
The issue ? As the name implies, gcc's implementation of std::string uses a ref-counted shared buffer under the cover. When a string is modified, the implementation will neatly check if the buffer is shared at the moment, and if it is, copy it before modifying it, ensuring that other strings sharing this buffer are not affected by the new write (thus the name, copy on write).
Now, with your evil program, you access the shared buffer via a const-method (promising not to modify anything), but you do modify it!
Note that with MSVC's implementation, which does not use Copy On Write, the behavior would be different ("Hello, World!" would be correctly printed).
This is exactly the essence of Undefined Behavior.
To modify an inherently const object by casting away its constness using const_cast is an Undefined Behavior.
string::c_str() returns a const char *, i.e: a pointer to a constant c-style string. Technically, modifying this will result in Undefined Behavior.
Note, that the use of const_cast is when you have a const pointer to a non const data and you wish to modify the non-constant data.
Simply casting will not bring forth an undefined behavior. Modifying the data pointed at, however, will. (Also see ISO 14882:98 5.2.7-7).
If you want a pointer to modifiable data, you can have a
std::vector<char> wtf(str.begin(), str.end());
char* lol= &wtf[0];
The std::string manages it's own memory internally, which is why it returns a pointer to that memory directly as it does with the c_str() function. It makes sure it's constant so that your compiler will warn you if you try to do modifiy it.
Using const_cast in that way literally casts away such safety and is only an arguably acceptable practice if you are absolutely sure that memory will not be modified.
If you can't guarantee this then you must copy the string and use the copy.; it's certainly a lot safer to do this in any event (you can use strcpy).
See the C++ reference website:
const char* c_str ( ) const;
"Generates a null-terminated sequence of characters (c-string) with the same content as the string object and returns it as a pointer to an array of characters.
A terminating null character is automatically appended.
The returned array points to an internal location with the required storage space for this sequence of characters plus its terminating null-character, but the values in this array should not be modified in the program and are only guaranteed to remain unchanged until the next call to a non-constant member function of the string object."
Yes, it will bring danger, because
input points to whatever c_str happens to be right now, but if some_text ever changes or goes away, you'll be left with a pointer that points to garbage. The value of c_str is guaranteed to be valid only as long as the string doesn't change. And even, formally, only if you don't call c_str() on other strings too.
Why do you need to cast away the const? You're not planning on writing to *input, are you? That is a no-no!
This is a very bad thing to do. Check out what std::string::c_str() does and agree with me.
Second, consider why you want a non-const access to the internals of the std::string. Apparently you want to modify the contents, because otherwise you would use a const char pointer. Also you are concerned that you don't want to change the original string. Why not write
std::string input( some_text );
Then you have a std::string that you can mess with without affecting the original, and you have std::string functionality instead of having to work with a raw C++ pointer...
Another spin on this is that it makes code extremely difficult to maintain. Case in point: a few years ago I had to refactor some code containing long functions. The author had written the function signatures to accept const parameters but then was const_casting them within the function to remove the constness. This broke the implied guarantee given by the function and made it very difficult to know whether the parameter has changed or not within the rest of the body of the code.
In short, if you have control over the string and you think you'll need to change it, make it non-const in the first place. If you don't then you'll have to take a copy and work with that.
it is UB.
For example, you can do something like this this:
size_t const size = (sizeof(int) == 4 ? 1024 : 2048);
int arr[size];
without any cast and the comiler will not report an error. But this code is illegal.
The morale is that you need consider action each time.
If all a function needs to do with a parameter is see its value, shouldn't you always pass that parameter by constant reference?
A colleague of mine stated that it doesn't matter for small types, but I disagree.
So is there any advantage to do this:
void function(char const& ch){ //<- const ref
if (ch == 'a'){
DoSomething(ch);
}
return;
}
over this:
void function(char ch){ //<- value
if (ch == 'a'){
DoSomething(ch);
}
return;
}
They appear to be the same size to me:
#include <iostream>
#include <cstdlib>
int main(){
char ch;
char& chref = ch;
std::cout << sizeof(ch) << std::endl; //1
std::cout << sizeof(chref) << std::endl; //1
return EXIT_SUCCESS;
}
But I do not know if this is always the case.
I believe I'm right, because it does not produce any additional overhead and it is self documenting.
However, I want to ask the community if my reasoning and assumptions are correct?
Your colleague is correct. For small types (char, int) it makes no sense to pass by reference, when the variable is not to be modified. Passing by value would be better, as size of pointer (used in case of passing by reference) is about the size of small types.
And moreover, passing by value, is lesser typing, as well as slightly more readable.
Even though the sizeof(chref) is the same as sizeof(ch), passing character by reference does take more bytes on most systems: although the standard does not say anything specific about the implementation of references, an address (i.e. a pointer) is regularly passed behind the scenes. With optimization on, it probably would not matter. When you code template functions, items of unknown type that will not be modified should always be passed by const reference.
As far as small types go, you can pass them by value with a const qualifier to emphasize the point that you aren't going to touch the argument through the signature of your function:
void function(const char ch){ //<- value
if (ch == 'a'){
DoSomething(ch);
}
return;
}
For small values, the cost of creating a reference and dereferencing it is likely to be greater than the cost of copying it (if there is a difference at all). This is especially true when you consider that reference parameters are pretty much always implemented as a pointer. Both document equally well if you just declare your value as const (I'm using this value for input only and it will not be modified). I generally just make all of the standard built-in types by const value and all user-defined / STL types as const &.
Your sizeof example is flawed because chref is just an alias for ch. You'd get equal results for sizeof(T) for any type T.
The sizes are not the same as passed. The result depends on the ABIs calling convention, but the sizeof(referenceVariable) produces the sizeof(value).
If all a function needs to do with a parameter is see its value, shouldn't you always pass that parameter by constant reference?
That's what I do. I know people disagree with me, and argue for passing small builtins by value, or prefer to omit the const. Passing by reference can add instructions and/or consume more space. I pass this way for consistency, and because always measuring the best way to pass for any given platform is a lot of hassle to maintain.
There isn't an advantage beyond readability (if that's your preference). Performance could suffer very slightly, but it will not be a consideration in most cases.
Passing these small builtins by value is more common. If passing by value, you can const qualify the definition (independent of the declaration).
My recommendation is that the vast majority of teams should simply choose one way to pass and stick with it, and performance should not influence that unless every instruction counts. The const never hurts.
In my opinion, your general approach of passing by const reference is a good practice (but see below for some caveats on your example). On the other hand, your friend is correct that for built-in types, passing by reference should not result in any significant performance gains, and could even result in marginal performance losses. I come from a C background, so I tend to think of references in terms of pointers (even though there are some subtle differences), and a "char*" will be bigger than a "char" on any platform with which I'm familiar.
[EDIT: removed incorrect information.]
The bottom line, in my opinion, is that when you're passing larger user-defined types, and the called function only needs to read values without modifying them, passing by "type const&" is a good practice. As you say, it's self-documenting, and helps clarify the roles of the various pieces of your internal API.
i have a specific problem with pointers and references, with std::vector and std::string. Some question are in the following code snipped, some below.
I have basically this code
//first of all: is this function declaration good?
//or shoudl I already pass the std::vector in another way?
//I'm only reading from it
void func(const std::vector<std::string>& vec)
{
//I need to call otherFunc(void*) with a single element from vec:
//otherFunc(void* arg);
//with arg being a void* to vec[0]
}
My IDE tells me that only &*vec[0] works as parameter for otherFunc, but this doesn't compile...
How is the best way to do these kind of parameter passing?
That is a good declaration, as long as the function is not intended to modify the vector. It's more efficient than passing by value, since it avoids copying the vector - an expensive operation requiring a memory allocation.
However, the other function requires a non-const pointer. How to handle this depends on whether it might modify the data.
If it won't (as you imply when you say "I'm only reading from it") then the options are:
Change it to otherFunc(void const * arg) to give a stronger guarantee that it won't, or
Remove the const qualification with a const_cast<void*> when calling it
Note that &*vec[0] won't compile; you want vec[0].c_str() to get a C-compatible pointer to the first string's data, assuming that's what you need.
If it might modify the vector, then you'll have to do something else, since there's no legal way to modify a std::string through a pointer to its data. Probably the best option is to use std::vector<char> rather than std::string, but that depends on exactly what the function does.
First of all, since you only read from the vector, passing it as a const reference is a good idea and should work well.
To your second question: it is difficult to say what otherFunc will do to your object, since it expects a void pointer. This is C style and should not be used in C++, there are better and type safe ways to do so.
Firstly, &*vec[0] is meaningless; you probably mean &vec[0] (i.e. the address of the first string in your vector).
But even this won't compile because &vec[0] is of type const std::string *. Most importantly, it's const. You could do this:
otherFunc(const_cast<std::string *>(&vec[0]));
BUT!!! Trying to use a const std::string * in the context of a void * sounds like a very bad idea. How could that possibly ever do anything useful?