Why doesn't std::string take a null pointer? - c++

I recently passed a null pointer to a std::string constructor and got undefined behavior. I'm certain this is something that thousands or tens of thousands of programmers have done before me, and this same bug has no doubt crashed untold numbers of programs. It comes up a lot when converting from code using char* to code using std::string, and it's the kind of thing that is not catchable at compile time and can easily be missed in run time unit tests.
What I'm confused about is the reason for specifying std::string this way.
Why not just define std::string(NULL)==""?
The efficiency loss would be negligible, I doubt it's even measurable in a real program.
Does anyone know what the possible reason for making std::string(NULL) undefined is?

No good reason as far as I know.
Someone just proposed a change to this a month ago. I encourage you to support it.
std::string is not the best example of well-done standardization. The version initially standardized was impossible to implement; the requirements placed on it were not consistent with each other.
At some point that inconsistency was fixed.
In C++11 the rules were changed to prevent COW (copy-on-write) implementations, which broke the ABI of existing, reasonably compliant std::strings. That change may have been the point where the inconsistency was fixed; I do not recall.
Its API is different from the rest of std's containers because it didn't come from the same pre-std STL.
Treating this legacy behavior of std::string as some kind of reasoned decision that takes into account performance costs is not realistic. If any such testing was done, it was 20+ years ago on a non-standard compliant std::string (because none could exist, the standard was inconsistent).
Passing (char const*)0 or nullptr continues to be UB due to inertia, and will stay that way until someone makes a proposal and demonstrates that the cost of changing it is tiny while the benefit is not.
Constructing a std::string from a literal char const[N] is already a low-performance solution: you have the size of the string at compile time, drop it on the ground, and then at run time walk the buffer to find the '\0' character (unless the compiler optimizes the scan away; and if it can, the null check is equally optimizable). The high-performance solution is to know the length and tell std::string about it, instead of copying from a '\0'-terminated buffer.
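As a sketch of that last point (the string contents here are arbitrary), the length-aware constructor avoids the run-time scan for '\0':

#include <string>

int main() {
    // Typically scans the literal at run time for '\0', unless the compiler
    // constant-folds the length:
    std::string a("hello world");

    // Passes the length explicitly, so no scan is needed:
    std::string b("hello world", 11);

    // The C++14 string literal suffix also carries the known length through:
    using namespace std::string_literals;
    auto c = "hello world"s;
}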

The sole reason is: Runtime performance.
It would indeed be easy to define that std::string(NULL) results in the empty string. But it would cost an extra check at the construction of every std::string from a const char *, which can add up.
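To illustrate the scale of that check, here is a minimal user-side sketch (the helper name is made up) that provides the null-means-empty behavior today with exactly one extra branch:

#include <string>

// Hypothetical helper: treats a null pointer as an empty string, at the
// cost of the single branch discussed above.
std::string to_string_or_empty(const char* p) {
    return p ? std::string(p) : std::string();
}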
On the balance between absolute maximum performance and convenience, C++ always goes for absolute maximum performance, even if this means compromising the robustness of programs.
The most famous example is to not initialize POD member variables in classes by default: Even though in 99% of all cases programmers want all POD member variables to be initialized, C++ decides not to do so to allow the 1% of all classes to achieve slightly higher runtime performance. This pattern repeats itself all over the place in C++. Performance over everything else.
There is no "the performance impact would be negligible" in C++. :-)
(Note that I personally do not like this behavior in C++ either. I would have made it so that the default behavior is safe, and that the unchecked and uninitialized behavior has to be requested explicitly, for example with an extra keyword. Uninitialized variables are still a major problem in a lot of programs in 2018.)
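As a minimal illustration of the uninitialized-POD default mentioned above (the struct here is made up):

struct Point {   // a simple aggregate with POD members
    int x;
    int y;
};

int main() {
    Point a;     // default-initialized: x and y hold indeterminate values
    Point b{};   // value-initialized: x and y are zero
    (void)a;     // reading a.x before assignment would be undefined behavior
    return b.x;  // returns 0
}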

Related

Integer constant 0 causes std::string to throw [duplicate]


Why did C++11 make std::string::data() add a null terminating character?

Previously that was std::string::c_str()'s job, but as of C++11, data() also provides it. Why was c_str()'s null-terminating character added to std::string::data()? To me it seems like a waste of CPU cycles: in cases where the null terminator is not relevant at all and only data() is used, a C++03 implementation doesn't have to care about the terminator and doesn't have to write 0 to it every time the string is resized, but a C++11 implementation, because of the data() null guarantee, has to waste cycles writing 0 every time the string is resized. Since that potentially makes code slower, I guess they had some reason to add the guarantee. What was it?
There are two points to discuss here:
Space for the null-terminator
In theory a C++03 implementation could have avoided allocating space for the terminator and/or may have needed to perform copies (e.g. unsharing).
However, all sane implementations allocated room for the null terminator anyway in order to support c_str(), because c_str() would be virtually unusable if it were not a trivial call.
The null-terminator itself
It is true that some very (1999), very old implementations (2001) wrote the \0 on every c_str() call.
However, major implementations changed (2004) or were already like that (2010) to avoid such a thing way before C++11 was released, so when the new standard came, for many users nothing changed.
Now, whether a C++03 implementation should have done it or not:
To me it seems like a waste of CPU cycles
Not really. If you are calling c_str() more than once, you are already wasting cycles by writing it several times. Not only that, you are messing with the cache hierarchy, which is important to consider in multithreaded systems. Recall that multi-core/SMT CPUs started to appear between 2001 and 2006, which explains the switch to modern, non-CoW implementations (even if there were multi-CPU systems a couple of decades before that).
The only situation where you would save anything is if you never called c_str(). However, note that when you are resizing the string, you are rewriting everything anyway. An additional byte is hardly going to be measurable.
In other words, by not writing the terminator on resize, you are exposing yourself to worse performance/latency. By writing it once, at the same time you have to perform a copy of the string anyway, the performance behavior is way more predictable and you avoid performance pitfalls if you end up using c_str(), especially on multithreaded systems.
Advantages of the change:
When data also guarantees the null terminator, the programmer doesn't need to know obscure details of differences between c_str and data and consequently would avoid undefined behaviour from passing strings without guarantee of null termination into functions that require null termination. Such functions are ubiquitous in C interfaces, and C interfaces are used in C++ a lot.
The subscript operator was also changed to allow read access to str[str.size()]. Not allowing access to str.data() + str.size() would be inconsistent.
While not initialising the null terminator upon resize etc. may make that operation faster, it forces the initialisation in c_str which makes that function slower¹. The optimisation case that was removed was not universally the better choice. Given the change mentioned in point 2. that slowness would have affected the subscript operator as well, which would certainly not have been acceptable for performance. As such, the null terminator was going to be there anyway, and therefore there would not be a downside in guaranteeing that it is.
Curious detail: str.at(str.size()) still throws an exception.
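A short sketch of the guarantees listed above (C++11 or later assumed):

#include <cassert>
#include <stdexcept>
#include <string>

int main() {
    std::string s = "abc";

    assert(s[s.size()] == '\0');          // reading one past the last character is allowed
    assert(s.data()[s.size()] == '\0');   // data() is null-terminated since C++11
    assert(s.c_str() == s.data());        // both name the same buffer

    try {
        s.at(s.size());                   // at() still bounds-checks and throws
    } catch (const std::out_of_range&) {
        // expected
    }
}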
P.S. There was another change, that is to guarantee that strings have contiguous storage (which is why data is provided in the first place). Prior to C++11, implementations could have used roped strings, and reallocate upon call to c_str. No major implementation had chosen to exploit this freedom (to my knowledge).
P.P.S. Old versions of GCC's libstdc++, for example, apparently set the null terminator only in c_str() until version 3.4. See the related commit for details.
¹ A factor to this is concurrency that was introduced to the language standard in C++11. Concurrent non-atomic modification is data-race undefined behaviour, which is why C++ compilers are allowed to optimize aggressively and keep things in registers. So a library implementation written in ordinary C++ would have UB for concurrent calls to .c_str()
In practice (see comments) having multiple threads writing the same thing wouldn't cause a correctness problem because asm for real CPUs doesn't have UB. And C++ UB rules mean that multiple threads actually modifying a std::string object (other than calling c_str()) without synchronization is something the compiler + library can assume doesn't happen.
But it would dirty cache and prevent other threads from reading it, so is still a poor choice, especially for strings that potentially have concurrent readers. Also it would stop .c_str() from basically optimizing away because of the store side-effect.
The premise of the question is problematic.
A string class has to do a lot of expensive things, like allocating dynamic memory, copying bytes from one buffer to another, freeing the underlying memory, and so on.
What upsets you is one lousy mov assembly instruction? Believe me, this doesn't affect your performance even by 0.5%.
When writing a programming language runtime, you can't be obsessive about every small assembly instruction. You have to choose your optimization battles wisely, and optimizing away an unnoticeable null termination is not one of them.
In this specific case, being compatible with C is way more important than null termination.
Actually, it's the other way around.
Before C++11, c_str() may in theory have cost "additional cycles" as well as a copy, so as to ensure the presence of a null terminator at the end of the buffer.
This was unfortunate, particularly as it can be fixed very simply, with effectively no additional runtime cost, by simply incorporating a null byte at the end of every buffer to begin with. Only one additional byte to allocate (and a teensie little write), with no runtime cost at point of use, in exchange for thread-safety and a boatload of sanity.
Once you've done that, c_str() is literally the same as data() by definition. So, the "change" to data() actually came for free. Nobody's adding an extra byte to the result of data(); it's already there.
Helping matters is the fact that most implementations already did this under C++03 anyway, to avoid the hypothetical runtime cost ascribed to c_str().
So, in short, this has almost certainly cost you literally nothing.

How to efficiently implement infinity and minus infinity supporting arithmetic in C++?

The trivial solution would be:
class Number
{
public:
    bool isFinite();
    bool isPositive();
    double value();
    ...
private:
    double value_;
    bool isFinite_;
    bool isPositive_;
    ...
};
The thing that worries me is efficiency:
From Effective C++: 55 Specific Ways to Improve Your Programs and Designs (3rd Edition) by Scott Meyers:
Even when small objects have inexpensive copy constructors, there can be performance issues. Some compilers treat built-in and user-defined types differently, even if they have the same underlying representation. For example, some compilers refuse to put objects consisting of only a double into a register, even though they happily place naked doubles there on a regular basis. When that kind of thing happens, you can be better off passing such objects by reference, because compilers will certainly put pointers (the implementation of references) into registers.
Is there a way to bypass the efficiency problem? For example a library that uses some assembly language magic?
There is very little reason to implement a Number class around a double. The double format already implements infinity, NaN, and the sign as part of the raw basic double type.
Second, you should write your code first aiming for correctness, and only then try to optimize, at which point you can look at factoring out specific data structures and rewriting code and algorithms.
Modern compilers are typically very good at generating efficient code, and usually do a far better job than most human programmers.
For your specific example, I would just use doubles as they are rather than in a class. They're well adapted and defined for handling infinities.
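A small sketch of that built-in support (assuming IEEE 754 doubles, which std::numeric_limits<double>::is_iec559 reports on mainstream platforms):

#include <cmath>
#include <iostream>
#include <limits>

int main() {
    const double inf = std::numeric_limits<double>::infinity();

    std::cout << std::isinf(inf)       << '\n';  // 1: infinity is representable directly
    std::cout << std::isinf(-inf)      << '\n';  // 1: and so is minus infinity
    std::cout << std::signbit(-inf)    << '\n';  // 1: the sign travels with the value
    std::cout << (inf + 1.0 == inf)    << '\n';  // 1: arithmetic saturates at infinity
    std::cout << std::isnan(inf - inf) << '\n';  // 1: undefined forms yield NaN
}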
In a more general sense, you should use the trivial solution and only worry about performance when (or, more likely, if) it becomes a problem.
That means code it up and test it in many of the scenarios you're going to use it for.
If it still performs within the bounds of your performance requirements, don't worry about trying to optimise it. And you should have some performance requirements slightly more specific than "I want it to run as fast as possible" :-)
Remember that efficiency doesn't always mean "as fast as possible, no matter the cost". It means achieving your goals without necessarily sacrificing other things (like readability or maintainability).
If you take a complete operation that makes the user wait 0.1 seconds and optimise it to the point where it's ten times faster, the user won't notice that at all (I say "complete" because obviously, the user would notice a difference if it was done ten thousand times without some sort of interim result).
And remember, measure, don't guess!

Is using strings this way inefficient?

I am a newbie to C++ and am running into problems with my teacher over using strings in my code. Though it is clear to me that I have to stop doing that in her class, I am curious as to why it is wrong. In this program the five strings I assigned were going to be reused no less than 4 to 5 times, so I put the text into strings. I was told to stop doing it because it is inefficient. Why? In C++, are textual strings supposed to be typed out as opposed to being stored in strings, and if so, why? Below is some of the program; please tell me why it is bad.
string Bry = "berries";
string Veg = "vegetables";
string Flr = "flowers";
string AllStr;
float Tmp1, Precip;
int Tmp, FlrW, VegW, BryW, x, Selct;
bool Cont = true;
AllStr = Flr + ", " + Bry + ", " + "and " + Veg;
Answering whether using strings is inefficient is really something that very much depends on how you're using them.
First off, I would argue that you should be using C++ strings as a default - only going to raw C strings if you actually measure and find C++ strings to be too slow. The advantages (primarily for security) are just too great - it's all too easy to screw up buffer management with raw C strings. So I would disagree with your teacher that this is overly inefficient.
That said, it's important to understand the performance implications of using C++ strings. Since they are always dynamically allocated, you may end up spending a lot of time copying and reallocating buffers. This is usually not a problem; usually there are other things which take up much more time. However, if you're doing this right in the middle of a loop that's critical to your program's performance, you may need to find another method.
In short, premature optimization is usually a bad idea. Write code that is obviously correct, even if it takes ever-so-slightly longer to run. But be aware of the costs and trade-offs you're making at the same time; that way, if it turns out that C++ strings are actually slowing your program down a lot, you'll know what to change to fix that.
Yes, it's fairly inefficient, for the following reasons:
When you construct a std::string object, it has to allocate storage space for the string content (which may or may not be a separate dynamic memory allocation, depending on whether the small-string optimization kicks in) and copy the literal string that is the parameter of the constructor. For example, when you say string Bry = "berries", it allocates a separate memory block (potentially from dynamic memory), then copies "berries" to that block.
So you potentially have an extra dynamic memory allocation (costing time),
have to perform the copy (costing more time),
and end up with two copies of the same string (costing space).
Using std::string::operator+ produces a new string that is the result of concatenation. So when you write several + operators in a row, you have several temporary concatenation results and a lot of unnecessary copying.
For your example, I recommend:
Using string literals unless you actually need the functionality only available in std::string.
Using std::stringstream to concatenate several strings together.
Normally, code readability is preferred over micro-optimizations of this sort, but luckily you can have both performance and readability in this case.
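For illustration, a sketch of the alternatives discussed above, reusing the question's variable names (the reserve() size is an arbitrary guess):

#include <sstream>
#include <string>

int main() {
    std::string Bry = "berries", Veg = "vegetables", Flr = "flowers";

    // Chained +: each + yields a temporary result that is carried along
    // (C++11 implementations may reuse the rvalue's buffer).
    std::string a = Flr + ", " + Bry + ", " + "and " + Veg;

    // += appends into one buffer; reserve() avoids repeated reallocation.
    std::string b;
    b.reserve(64);
    b += Flr; b += ", "; b += Bry; b += ", and "; b += Veg;

    // std::stringstream, as recommended above, also builds the result incrementally.
    std::ostringstream ss;
    ss << Flr << ", " << Bry << ", and " << Veg;
    std::string c = ss.str();
}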
Your teacher is both right and wrong. S/he's right that building up strings from substrings at runtime is less CPU-efficient than simply providing the fully pre-built string in the code to start with -- but s/he's wrong in thinking that efficiency is necessarily an important factor to worry about in this case.
In a lot of cases, efficiency simply doesn't matter. At all. For example, if your code above is only going to be executed rarely (e.g. no more than once per second), then it's going to be literally impossible to measure any difference between the "most efficient version" and your not-so-efficient version. Given that, it's quite justifiable to decide that other factors (such as code readability and maintainability) are more important than maximizing efficiency.
Of course, if your program is going to be reconstructing these strings thousands or millions of times per second, then making sure your code is maximally efficient, even at the expense of readability/maintainability, is a good tradeoff to make. But I doubt that is the case here.
Your approach is almost perfect - try to declare everything only once. But if something is not used more than once, don't waste your fingers typing it :-) i.e. in a 10-line program.
The only change I would suggest is to make the strings const to help the compiler optimize your program.
If your instructor still disagrees - get a new instructor.
It is inefficient. Doing that last line right would be 4-5 times faster.
At the very least you should use +=.
+= means that you would avoid creating new strings with the + operator.
The instructor knows that when you do string = string + string, C++ creates a new string that is immediately destroyed.
Efficiency is probably not a good argument against using string in school assignments, but yes, if I were a teacher and the topic were not some very high-level application, I wouldn't want my students using string.
The real reason is that string hides the low-level memory management. A student coming out of college should have basic memory management skills. Nowadays, in a working environment, programmers don't deal with memory management most of the time, but there are always situations where you need to understand what's happening under the hood to be able to reason about the problem you are encountering.
With the context given, it looks like you should just be able to declare AllStr as a const or a string literal, without all the substrings and addition. Assuming there's more to it, declaring them as string objects allocates memory at runtime. (And, not that there is any practical impact here, but you should be aware that STL container objects sometimes allocate a default minimum of space that is larger than the number of things initially in them, as part of their optimizations in anticipation of later modifying operations. I'm not sure whether std::string does so on a declare/assign or not.) If you are only ever going to use them as literals, declaring them as a const char* or in a #define is easier on both memory and runtime performance, and you can still use them as r-values in string operations. If you use them in other ways in code you are not showing us, then whether they need to be strings depends on whether they ever need to be changed or manipulated.
If you are trying to learn coding, inefficiencies that don't matter in practice are still things you should be aware of and avoid if unnecessary. In production code, there are sometimes reasons to do something like this, but if it is not for any good reason, it's just sloppy. She's right to point it out, but what she should be doing is using that as a starting point for a conversation about the various tradeoffs involved - memory, speed, readability, maintainability, etc. If she's a teacher, she should be looking for "teaching moments" like this rather than just an opportunity to scold.
You can use string.append();
it's better than + or +=.

NULL terminated string and its length

I have legacy code that receives some proprietary message, parses it, and creates a bunch of static char arrays (embedded in the class representing the message) to hold NULL-terminated strings. Afterwards, pointers to the strings are passed all around and finally serialized to some buffer.
Profiling shows that str*() methods take a lot of time.
Therefore I would like to use memcpy() where possible. To achieve that I need a way to associate a length with the pointer to the NULL-terminated string. I thought about:
Using std::string looks less efficient, since it requires memory allocation and thread synchronization.
I can use std::pair<pointer to string, length>. But in this case I need to maintain length "manually".
What do you think?
use std::string
Profiling shows that str*() methods take a lot of time
Sure they do ... operating on any array takes a lot of time.
Therefore I would like to use memcpy() where possible. To achieve that I need a way to associate a length with the pointer to the NULL-terminated string. I thought about:
memcpy is not really any slower than strcpy. In fact, if you perform a strlen to identify how much you are going to memcpy, then strcpy is almost certainly faster.
Using std::string looks less efficient, since it requires memory allocation and thread synchronization
It may look less efficient, but a lot of better minds than yours or mine have worked on it.
I can use std::pair. But in this case I need to maintain length "manually".
That's one way to save yourself time on the length calculation. Obviously you need to maintain the length manually. This is effectively how Windows BSTRs work (though the length is stored in memory immediately before the actual string data). std::string, for example, already does this ...
What do you think?
I think your question is asked terribly. There is no real question asked, which makes answering next to impossible. I advise you to ask specific questions in the future.
Use std::string. That advice has already been given, but let me explain why:
One, it uses a custom memory allocation scheme. Your char* strings are probably malloc'ed. That means they are worst-case aligned, which really isn't needed for a char[]. std::string doesn't suffer from needless alignment. Furthermore, common implementations of std::string use the "Small String Optimization", which eliminates a heap allocation altogether and improves locality of reference. The string size will be on the same cache line as the char[] itself.
Two, it keeps the string length, which is indeed a speed optimization. Most str* functions are slower because they don't have this information up front.
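As a sketch of why that matters for the memcpy() plan in the question (the function name and buffer contract are made up; buf must have room for size() + 1 bytes):

#include <cstring>
#include <string>

// Hypothetical serializer: because std::string carries its length, the copy
// can be a single memcpy with no strlen() pass.
char* append_to_buffer(char* buf, const std::string& s) {
    std::memcpy(buf, s.c_str(), s.size() + 1);  // +1 copies the terminator too
    return buf + s.size();                      // next write position
}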
A second option would be a rope class, e.g. from SGI. This may be more efficient by eliminating some string copies.
Your post doesn't explain where the str*() function calls are coming from; passing around char * certainly doesn't invoke them. Identify the sites that actually do the string manipulation and then try to find out if they're doing so inefficiently. One common pitfall is that strcat first needs to scan the destination string for the terminating 0 character. If you call strcat several times in a row, you can end up with a O(N^2) algorithm, so be careful about this.
Replacing strcpy by memcpy doesn't make any significant difference; strcpy doesn't do an extra pass to find the length of the string, it's simply (conceptually!) a character-by-character copy that stops when it encounters the terminating 0. This is not much more expensive than memcpy, and always cheaper than strlen followed by memcpy.
The way to gain performance on string operations is to avoid copies where possible; don't worry about making the copying faster, instead try to copy less! And this holds for all string (and array) implementations, whether it be char *, std::string, std::vector<char>, or some custom string / array class.
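For illustration, a sketch of the strcat pitfall mentioned above and one way around it (hypothetical helper; the caller must guarantee dst is large enough):

#include <cstring>

// Appends each part to dst while tracking the end of the destination,
// so dst is scanned only once instead of once per strcat call.
void append_all(char* dst, const char* const* parts, int n) {
    char* end = dst + std::strlen(dst);
    for (int i = 0; i < n; ++i) {
        const std::size_t len = std::strlen(parts[i]);
        std::memcpy(end, parts[i], len);
        end += len;
    }
    *end = '\0';
}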
What do I think? I think that you should do what everyone else obsessed with pre-optimization does. You should find the most obscure, unmaintainable, yet intuitively (to you anyway) high-performance way you can and do it that way. Sounds like you're onto something with your pair<char*,len> with malloc/memcpy idea there.
Whatever you do, do NOT use pre-existing, optimized wheels that make maintenance easier. Being maintainable is simply the least important thing imaginable when you're obsessed with intuitively measured performance gains. Further, as you well know, you're quite a bit smarter than those who wrote your compiler and its standard library implementation. So much so that you'd be seriously silly to trust their judgment on anything; you should really consider rewriting the entire thing yourself because it would perform better.
And ... the very LAST thing you'll want to do is use a profiler to test your intuition. That would be too scientific and methodical, and we all know that science is a bunch of bunk that's never gotten us anything; we also know that personal intuition and revelation is never, ever wrong. Why waste the time measuring with an objective tool when you've already intuitively grasped the situation's seemingliness?
Keep in mind that I'm being 100% honest in my opinion here. I don't have a sarcastic bone in my body.