How is std::string implemented? - c++

I am curious to know how std::string is implemented and how does it differ from c string?If the standard does not specify any implementation then any implementation with explanation would be great with how it satisfies the string requirement given by standard?

Virtually every compiler I've used provides source code for the runtime - so whether you're using GCC or MSVC or whatever, you have the capability to look at the implementation. However, a large part or all of std::string will be implemented as template code, which can make for very difficult reading.
Scott Meyer's book, Effective STL, has a chapter on std::string implementations that's a decent overview of the common variations: "Item 15: Be aware of variations in string implementations".
He talks about 4 variations:
several variations on a ref-counted implementation (commonly known as copy on write) - when a string object is copied unchanged, the refcount is incremented but the actual string data is not. Both object point to the same refcounted data until one of the objects modifies it, causing a 'copy on write' of the data. The variations are in where things like the refcount, locks etc are stored.
a "short string optimization" (SSO) implementation. In this variant, the object contains the usual pointer to data, length, size of the dynamically allocated buffer, etc. But if the string is short enough, it will use that area to hold the string instead of dynamically allocating a buffer
Also, Herb Sutter's "More Exceptional C++" has an appendix (Appendix A: "Optimizations that aren't (in a Multithreaded World)") that discusses why copy on write refcounted implementations often have performance problems in multithreaded applications due to synchronization issues. That article is also available online (but I'm not sure if it's exactly the same as what's in the book):
http://www.gotw.ca/publications/optimizations.htm
Both those chapters would be worthwhile reading.

std::string is a class that wraps around some kind of internal buffer and provides methods for manipulating that buffer.
A string in C is just an array of characters
Explaining all the nuances of how std::string works here would take too long. Maybe have a look at the gcc source code http://gcc.gnu.org to see exactly how they do it.

There's an example implementation in an answer on this page.
In addition, you can look at gcc's implementation, assuming you have gcc installed. If not, you can access their source code via SVN. Most of std::string is implemented by basic_string, so start there.
Another possible source of info is Watcom's compiler

The c++ solution for strings are quite different from the c-version. The first and most important difference is while the c using the ASCIIZ solution, the std::string and std::wstring are using two iterators (pointers) to store the actual string. The basic usage of the string classes provides a dynamic allocated solution, so in the cost of CPU overhead with the dynamic memory handling it makes the string handling more comfortable.
As you probably already know, the C doesn't contain any built-in generic string type, only provides couple of string operations through the standard library. One of the major difference between C and C++ that the C++ provides a wrapped functionality, so it can be considered as a faked generic type.
In C you need to walk through the string if you would like to know the length of it, the std::string::size() member function is only one instruction (end - begin) basically. You can safely append strings one to an other as long as you have memory, so there is no need to worry about the buffer overflow bugs (and therefore the exploits), because the appending creates a bigger buffer if it is needed.
As somebody told here before, the string is derivated from the vector functionality, in a templated way, so it makes easier to deal with the multibyte-character systems. You can define your own string type using the typedef std::basic_string specific_str_t; expression with any arbitary data type in the template parameter.
I think there are enough pros and contras both side:
C++ string Pros:
- Faster iteration in certain cases (using the size definitely, and it doesn't need the data from the memory to check if you are at the end of the string, comparing two pointers. that could make a difference with the caching)
- The buffer operation are packed with the string functionality, so less worries about the buffer problems.
C++ string Cons:
- due to the dynamic memory allocation stuff, the basic usage could cause impact on the performance. (fortunately you can tell to the string object what should be the original buffer size, so unless you are exceed it, it won't allocate dynamic blocks from the memory)
- often weird and inconsistent names compared to other languages. this is the bad thing about any stl stuff, but you can use to it, and it makes a bit specific C++ish feeling.
- the heavy usage of the templating forces the standard library to use header based solutions so it is a big impact on the compiling time.

That depends on the standard library you use.
STLPort for example is a C++ Standard Library implementation which implements strings among other things.

Related

Why is string not a (subclass of) vector?

C strings are char arrays.
Vectors are the new arrays in C++.
Why aren't strings vectors (of chars) then?
Most of the methods of vectors and strings seem duplicated too. Is there a reason for making strings a different thing in C++?
It's pretty much just historical. Strings and vectors were developed in parallel with little thought going to how they could be considered one and the same, for T==char.
That's also why standard containers are nice and generic, whereas std::basic_string is a total monolith of member function after member function.
Edge case optimisation opportunities since made it difficult or impossible to transform std::basic_string<T, Alloc> into std::vector<T, Alloc> in any sort of standard way. Take the small string optimisation, for example. Although, now that GCC's copy-on-write mechanism is officially dead, we're a little closer.
The ability to legally dereference std::string::end() (and obtain '\0' for your trouble) is still problematic, though. A bunch of fairly stringent iterator invalidation rules for .c_str() basically prevent us from using std::vector<char> for this right from the start.
tl;dr: this is what happens when you create a camel
The various answers in Vector vs string show several differences in interfaces between vector and string, and since the typical pattern in the standard is to use static polymorphism rather than dynamic, they were created as two different classes.
Since strings do have different characteristics from vectors it doesn't seem that you would want to use public inheritance but I don't think there's anything in the standard that would prohibit protected or private inheritance, or composition to provide the underlying space management.
Additionally I suspect that string may have been developed earlier and orthogonally to vector which likely explains why there are member methods that more likely would have been made free algorithms if developed in parallel to vector.
Yes because vector is a container which can contain T and std::string is a wrapper around char* and can't contain int or other datatypes. Wouldn't make any sense to have it any other way.

What is string_view?

string_view was a proposed feature within the C++ Library Fundamentals TS(N3921) added to C++17
As far as i understand it is a type that represent some kind of string "concept" that is a view of any type of container that could store something viewable as a string.
Is this right ?
Should the canonical
const std::string& parameter type become string_view ?
Is there another important point about string_view to take into consideration ?
The purpose of any and all kinds of "string reference" and "array reference" proposals is to avoid copying data which is already owned somewhere else and of which only a non-mutating view is required. The string_view in question is one such proposal; there were earlier ones called string_ref and array_ref, too.
The idea is always to store a pair of pointer-to-first-element and size of some existing data array or string.
Such a view-handle class could be passed around cheaply by value and would offer cheap substringing operations (which can be implemented as simple pointer increments and size adjustments).
Many uses of strings don't require actual owning of the strings, and the string in question will often already be owned by someone else. So there is a genuine potential for increasing the efficiency by avoiding unneeded copies (think of all the allocations and exceptions you can save).
The original C strings were suffering from the problem that the null terminator was part of the string APIs, and so you couldn't easily create substrings without mutating the underlying string (a la strtok). In C++, this is easily solved by storing the length separately and wrapping the pointer and the size into one class.
The one major obstacle and divergence from the C++ standard library philosophy that I can think of is that such "referential view" classes have completely different ownership semantics from the rest of the standard library. Basically, everything else in the standard library is unconditionally safe and correct (if it compiles, it's correct). With reference classes like this, that's no longer true. The correctness of your program depends on the ambient code that uses these classes. So that's harder to check and to teach.
(Educating myself in 2021)
From Microsoft's <string_view>:
The string_view family of template specializations provides an efficient way to pass a read-only, exception-safe, non-owning handle to the character data of any string-like objects with the first element of the sequence at position zero. (...)
From Microsoft's C++ Team Blog std::string_view: The Duct Tape of String Types from August 21st, 2018 (retrieved 2021 Apr 01):
string_view solves the “every platform and library has its own string type” problem for parameters. It can bind to any sequence of characters, so you can just write your function as accepting a string view:
void f(wstring_view); // string_view that uses wchar_t's
and call it without caring what stringlike type the calling code is using (and > for (char*, length) argument pairs just add {} around them)
(...)
(...)
Today, the most common “lowest common denominator” used to pass string data around is the null-terminated string (or as the standard calls it, the Null-Terminated Character Type Sequence). This has been with us since long before C++, and provides clean “flat C” interoperability. However, char* and its support library are associated with exploitable code, because length information is an in-band property of the data and susceptible to tampering. Moreover, the null used to delimit the length prohibits embedded nulls and causes one of the most common string operations, asking for the length, to be linear in the length of the string.
(...)
Each programming domain makes up their own new string type, lifetime semantics, and interface, but a lot of text processing code out there doesn’t care about that. Allocating entire copies of the data to process just to make differing string types happy is suboptimal for performance and reliability.

std::string with no free store memory allocation

I have a question very similar to
How do I allocate a std::string on the stack using glibc's string implementation?
but I think it's worth asking again.
I want an std::string with local storage that overflows into the free store. std::basic_string provides an allocator as a template parameter, so it seems like the thing to do is to write an allocator with local storage and use it to parameterize the basic_string, like so:
std::basic_string<
char,
std::char_traits<char>,
inline_allocator<char, 10>
>
x("test");
I tried to write the inline_allocator class that would work the way you'd expect: it reserves 10 bytes for storage, and if the basic_string needs more than 10 bytes, then it calls ::operator new(). I couldn't get it to work. In the course of executing the above line of code, my GCC 4.5 standard string library calls the copy constructor for inline_allocator 4 times. It's not clear to me that there's a sensible way to write the copy constructor for inline_allocator.
In the other StackOverflow thread, Eric Melski provided this link to a class in Chromium:
http://src.chromium.org/svn/trunk/src/base/stack_container.h
which is interesting, but it's not a drop-in replacement for std::string, because it wraps the std::basic_string in a container so that you have to call an overloaded operator->() to get at the std::basic_string.
I can't find any other solutions to this problem. Could it be that there is no good solution? And if that's true, then are the std::basic_string and std::allocator concepts badly flawed? I mean, it seems like this should be a very basic and simple use case for std::basic_string and std::allocator. I suppose the std::allocator concept is designed primarily for pools, but I think it ought to cover this as well.
It seems like the rvalue-reference move semantics in C++0x might make it possible to write inline_allocator, if the string library is re-written so that basic_string uses the move constructor of its allocator instead of the copy constructor. Does anyone know what the prospect is for that outcome?
My application needs to construct a million tiny ASCII strings per second, so I ended up writing my own fixed-length string class based on Boost.Array, which works fine, but this is still bothering me.
Andrei Alexandrescu, C++ programmer extraordinaire who wrote "Modern C++ Design" once wrote a great article about building different string implementations with customizable storage systems. His article (linked here) describes how you can do what you've described above as a special case of a much more general system that can handle all sorts of clever memory allocation requirements. This doesn't talk so much about std::string and focuses more on a completely customized string class, but you might want to look into it as there are some real gems in the implementation.
C++2011 is really going to help you here :)
The fact is that the allocator concept in C++03 was crippled. One of the requirement was that an allocator of type A should be able to deallocate memory from any other allocator from type A... Unfortunately this requirement is also at odds with stateful allocators each hooked to its own pool.
Howard Hinnant (who manages the STL subgroup of the C++ commitee and is implementing a new STL from scratch for C++0x) has explored stack-based allocators on his website, which you could get inspiration from.
This is generally unnecessary. It's called the "short string optimization", and most implementations of std::string already include it. It may be hard to find, but it's usually there anyway.
Just for example, here's the relevant piece of sso_string_base.h that's part of MinGW:
enum { _S_local_capacity = 15 };
union
{
_CharT _M_local_data[_S_local_capacity + 1];
size_type _M_allocated_capacity;
};
The _M_local_data member is the relevant one -- space for it to store (up to) 15 characters (plus a NUL terminator) without allocating any space on the heap.
If memory serves, the Dinkumware library included with VC++ allocates space for 20 characters, though it's been a while since I looked, so I can't swear to that (and tracking down much of anything in their headers tends to be a pain, so I prefer to avoid looking if I can).
In any case, I'd give good odds that you've been engaged in that all-too-popular pass-time known as premature optimization.
I believe the code from Chromium just wraps things into a nice shell. But you can get the same effect without using the Chromium wrapper container.
Because the allocator object gets copied so often, it needs to hold a reference or pointer to the memory. So what you'd need to do is create the storage buffer, create the allocator object, then call the std::string constructor with the allocator.
It will be a lot wordier than using the wrapper class but should get the same effect.
You can see an example of the verbose method (still using the chromium stuff) in my question about stack vectors.

Does the imlementation of std::string, I use, implement ref-counting or not?

I develop for iOS and use XCode 3.2.5, GCC 4.2.
UPD
This code works:
string s = "aaaa";
string s1 = s;
assert(s.data() == s1.data());
Does it mean ref-counting is used? Or '==' is overloaded for const char* somehow to compare contents, not addresses?
UPD
Okay, it does.
There are different ways of finding out, the first of which is plainly looking at the code. std::string is a typedef to an instantiation of the basic_string template, and being a template, all the code is available to you in the headers. Note that reading standard library headers can be both enlightening and hard. And yet, you don't even need to understand the code, you might get some good hints from a cursory look (as by the fact that basic_string contains a member _M_p with a _M_refcount sub member)
If you don't want to read the code, you can approach the problem from a practical point of view and measure the effects that a copy-on-write implementation would have. You can, for example create a long string [*], then copy it to a different string and compare the addresses of the data() that stores the actual contents.
[*] The reason for the long string is to avoid getting confused with some other implementations, as small object implementation that could be used by the compiler and by which a string could contain a small buffer to avoid dynamic memory allocations for very small uses.
An easy way to find out would be to copy-construct or assign a string, and compare the results of their data() method - if their data area is at the same location in memory, they must be using some form of reference counting.
One obvious answer is: it's unspecified. As far as I know, it's
not only unspecified in the standard, but in every
implementation. But for what it's worth, g++ uses a reference
counted implementation, at least through the latest version I've
looked at (4.4.2).
ref counting is really useful, copy on write...etc. a lot of code relies on it to be efficient. it's probably a bad idea to abandon it. better to have a function that explicitly obtains a copy of a string the way MS does it (lock buffer, etc.) if you're going to tinker with internals in an unsafe manner.

immutable strings vs std::string

I've recent been reading about immutable strings Why can't strings be mutable in Java and .NET? and Why .NET String is immutable? as well some stuff about why D chose immutable strings. There seem to be many advantages.
trivially thread safe
more secure
more memory efficient in most use cases.
cheap substrings (tokenizing and slicing)
Not to mention most new languages have immutable strings, D2.0, Java, C#, Python, etc.
Would C++ benefit from immutable strings?
Is it possible to implement an immutable string class in c++ (or c++0x) that would have all of these advantages?
update:
There are two attempts at immutable strings const_string and fix_str. Neither have been updated in half a decade. Are they even used? Why didn't const_string ever make it into boost?
I found most people in this thread do not really understand what immutable_string is. It is not only about the constness. The really power of immutable_string is the performance (even in single thread program) and the memory usage.
Imagine that, if all strings are immutable, and all string are implemented like
class string {
char* _head ;
size_t _len ;
} ;
How can we implement a sub-str operation? We don't need to copy any char. All we have to do is assign the _head and the _len. Then the sub-string shares the same memory segment with the source string.
Of course we can not really implement a immutable_string only with the two data members. The real implementation might need a reference-counted(or fly-weighted) memory block. Like this
class immutable_string {
boost::fly_weight<std::string> _s ;
char* _head ;
size_t _len ;
} ;
Both the memory and the performance would be better than the traditional string in most cases, especially when you know what you are doing.
Of course C++ can benefit from immutable string, and it is nice to have one. I have checked the boost::const_string and the fix_str mentioned by Cubbi. Those should be what I am talking about.
As an opinion:
Yes, I'd quite like an immutable string library for C++.
No, I would not like std::string to be immutable.
Is it really worth doing (as a standard library feature)? I would say not. The use of const gives you locally immutable strings, and the basic nature of systems programming languages means that you really do need mutable strings.
My conclusion is that C++ does not require the immutable pattern because it has const semantics.
In Java, if you have a Person class and you return the String name of the person with the getName() method, your only protection is the immutable pattern. If it would not be there you would have to clone() your strings all night and day (as you have to do with data members that are not typical value-objects, but still needs to be protected).
In C++ you have const std::string& getName() const. So you can write SomeFunction(person.getName()) where it is like void SomeFunction(const std::string& subject).
No copy happened
If anyone wants to copy he is free to do so
Technique applies to all data types, not just strings
You're certainly not the only person who though that. In fact, there is const_string library by Maxim Yegorushkin, which seems to have been written with inclusion into boost in mind. And here's a little newer library, fix_str by Roland Pibinger. I'm not sure how tricky would full string interning at run-time be, but most of the advantages are achievable when necessary.
I don't think there's a definitive answer here. It's subjective—if not because personal taste then at least because of the type of code one most often deals with. (Still, a valuable question.)
Immutable strings are great when memory is cheap—this wasn't true when C++ was developed, and it isn't the case on all platforms targeted by C++. (OTOH on more limited platforms C seems much more common than C++, so that argument is weak.)
You can create an immutable string class in C++, and you can make it largely compatible with std::string—but you will still lose when comparing to a built-in string class with dedicated optimizations and language features.
std::string is the best standard string we get, so I wouldn't like to see any messing with it. I use it very rarely, though; std::string has too many drawbacks from my point of view.
const std::string
There you go. A string literal is also immutable, unless you want to get into undefined behavior.
Edit: Of course that's only half the story. A const string variable isn't useful because you can't make it reference a new string. A reference to a const string would do it, except that C++ won't allow you to reassign a reference as in other languages like Python. The closest thing would be a smart pointer to a dynamically allocated string.
Immutable strings are great if, whenever it's necessary to create a new a string, the memory manager will always be able to determine determine the whereabouts of every string reference. On most platforms, language support for such ability could be provided at relatively modest cost, but on platforms without such language support built in it's much harder.
If, for example, one wanted to design a Pascal implementation on x86 that supported immutable strings, it would be necessary for the string allocator to be able to walk the stack to find all string references; the only execution-time cost of that would be requiring a consistent function-call approach [e.g. not using tail calls, and having every non-leaf function maintain a frame pointer]. Each memory area allocated with new would need to have a bit to indicate whether it contained any strings and those that do contain strings would need to have an index to a memory-layout descriptor, but those costs would be pretty slight.
If a GC wasn't table to walk the stack, then it would be necessary to have code use handles rather than pointers, and have code create string handles when local variables come into scope, and destroy the handles when they go out of scope. Much greater overhead.
Qt also uses immutable strings with copy-on-write.
There is some debate about how much performance it really buys you with decent compilers.
constant strings make little sense with value semantics, and sharing isn't one of C++'s greatest strengths...
Strings are mutable in Ruby.
$ irb
>> foo="hello"
=> "hello"
>> bar=foo
=> "hello"
>> foo << "world"
=> "helloworld"
>> print bar
helloworld=> nil
trivially thread safe
I would tend to forget safety arguments. If you want to be thread-safe, lock it, or don't touch it. C++ is not a convenient language, have your own conventions.
more secure
No. As soon as you have pointer arithmetics and unprotected access to the address space, forget about being secure. Safer against innocently bad coding, yes.
more memory efficient in most use cases.
Unless you implement CPU-intensive mechanisms, I don't see how.
cheap substrings (tokenizing and slicing)
That would be one very good point. Could be done by referring to a string with backreferences, where modifications to a string would cause a copy. Tokenizing and slicing become free, mutations become expensive.
C++ strings are thread safe, all immutable objects are guaranteed to be thread safe but Java's StringBuffer is mutable like C++ string is and the both of them are thread safe. Why worry about speed, define your method or function parameters with the const keyword to tell the compiler the string will be immutable in that scope. Also if string object is immutable on demand, waiting when you absolutely need to use the string, in other words, when you append other strings to the main string, you have a list of strings until you actually need the whole string then they are joined together at that point.
immutable and mutable object operate at the same speed to my knowledge , except their methods which is a matter of pro and cons. constant primitives and variable primitives move at different speeds because at the machine level, variables are assigned to a register or a memory space which require a few binary operations, while constants are labels that don't require any of those and are thus faster (or less work is done). works only for primitives and not for object.