When we have to work with string manipulation, is there any significants performance difference between std::string and std::stringbuf, and if yes why.
More generally when it is good to use std::stringbuf over std::string ?
A std::stringbuf uses a string internally to buffer data, so it is probably a bit slower. I don't think the difference would be significant though, because it basically just delegation. To be sure you'd have to run some performance-tests though.
std::stringbuf is useful when you want an IO-stream to use a string as buffer (like std::stringstream, which uses a std::stringbuf by default).
First of all, a std::stringbuf does not necessarily (or even ordinarily) use an std::string for its internal storage. For example, the standard describes initialization from an std::string as follows:
Constructs an object of class basic_stringbuf ... Then copies the content of
str into the basic_stringbuf underlying character sequence [...]
Note the wording: "character sequence" -- at least to me, this seems to be quite careful to avoid saying (or even implying) that the content should be stored in an actual string.
Past that, I think efficiency is probably a red herring. Both of them are fairly thin wrappers for managing dynamically allocated buffers of some sort of character-like sequence. There's a big difference in capabilities (e.g., string has lots of searching and insertion/deletion in the middle of a string that are entirely absent from stringbuf). Given its purpose, it might make sense to implement stringbuf on top of something like a std::deque, to optimize the (usual) path of insertion/deletion at the ends, but this is likely to be insubstantial for most uses.
If I were doing it, I'd probably be most worried by the fact that stringbuf is probably only tested along with stringstream, so if I used it differently than stringstream did, I might encounter problems, even if I'm following what the standard says it should support.
std::stringbuf extends std::string container, for reading and writing to std::string.
Generally, there is no significant performance difference between std::string and std::stringbuf.
Because std::streambuf <- std::stringbuf, both because:
typedef basic_stringbuf<char> stringbuf;
typedef basic_string<char> string;
Read this for more details.
Related
This is something the professor showed us in his scripts. I have not used this method in any code I have written.
Basically, we take a class, or struct, and reinterpret_cast it and save off the entire struct like so:
struct Account
{
Account()
{ }
Account(std::string one, std::string two)
: login_(one), pass_(two)
{ }
private:
std::string login_;
std::string pass_;
};
int main()
{
Account *acc = new Account("Christian", "abc123");
std::ofstream out("File.txt", std::ios::binary);
out.write(reinterpret_cast<char*>(acc), sizeof(Account));
out.close();
This produces the output (in the file)
ÍÍÍÍChristian ÍÍÍÍÍÍ ÍÍÍÍabc123 ÍÍÍÍÍÍÍÍÍ
I'm confused. Does this method actually work, or does it cause UB because magical things happen within classes and structs that are at the whims of individual compilers?
It doesn't actually work, but it also does not cause undefined behavior.
In C++ it is legal to reinterpret any object as an array of char, so there is no undefined behavior here.
The results, however, are usually only usable if the class is POD (effectively, if the class is a simple C-style struct) and self-contained (that is, the struct doesn't have pointer data members).
Here, Account is not POD because it has std::string members. The internals of std::string are implementation-defined, but it is not POD and it usually has pointers that refer to some heap-allocated block where the actual string is stored (in your specific example, the implementation is using a small-string optimization where the value of the string is stored in the std::string object itself).
There are a few issues:
You aren't always going to get the results you expect. If you had a longer string, the std::string would use a buffer allocated on the heap to store the string and so you will end up just serializing the pointer, not the pointed-to string.
You can't actually use the data you've serialized here. You can't just reinterpret the data as an Account and expect it to work, because the std::string constructors would not get called.
In short, you cannot use this approach for serializing complex data structures.
It's not undefined. Rather, it's platform dependent or implementation defined behavior. This is, in general bad code, because differing versions of the same compiler, or even different switches on the same compiler, can break your save file format.
This could work depending on the contents of the struct, and the platform on which the data is read back. This is a risky, non-portable hack which your teacher should not be propagating.
Do you have pointers or ints in the struct? Pointers will be invalid in the new process when read back, and int format is not the same on all machines (to name but two show-stopping problems with this approach). Anything that's pointed to as part of an object graph will not be handled. Structure packing could be different on the target machine (32-bit vs 64-bit) or even due to compiler options changing on the same hardware, making sizeof(Account) unreliable as a read back data size.
For a better solution, look at a serialization library which handles those issues for you. Boost.Serialization is a good example.
Here, we use the term "serialization"
to mean the reversible deconstruction
of an arbitrary set of C++ data
structures to a sequence of bytes.
Such a system can be used to
reconstitute an equivalent structure
in another program context. Depending
on the context, this might used
implement object persistence, remote
parameter passing or other facility.
Google Protocol Buffers also works well for simple object hierarchies.
It's no substitute for proper serialization. Consider the case of any complex type that contains pointers - if you save the pointers to a file, when you load them up later, they're not going to point to anything meaningful.
Additionally, it's likely to break if the code changes, or even if it's recompiled with different compiler options.
So it's really only useful for short-term storage of simple types - and in doing so, it takes up way more space than necessary for that task.
This method, if it works at all, is far from robust. It is much better to decide on some "serialized" form, whether it is binary, text, XML, etc., and write that out.
The key here: You need a function/code to reliably convert your class or struct to/from a series of bytes. reinterpret_cast does not do this, as the exact bytes in memory used to represent the class or struct can change for things like padding, order of members, etc.
No.
In order for it to work, the structure must be a POD (plain old data: only simple data members and POD data members, no virtual functions... probably some other restrictions which I can't remember).
So if you wanted to do that, you'd need a struct like this:
struct Account {
char login[20];
char password[20];
};
Note that std::string's not a POD, so you'd need plain arrays.
Still, not a good approach for you. Keyword: "serialization" :).
Some version of string don;t actually use dynamic memory for the string when the string is small. Thus store the string internally in the string object.
Think of this:
struct SimpleString
{
char* begin; // beginning of string
char* end; // end of string
char* allocEnd; // end of allocated buffer end <= allocEnd
int* shareCount; // String are usually copy on write
// as a result you need to track the number of people
// using this buffer
};
Now on a 64 bit system. Each pointer is 8 bytes. Thus a string of less than 32 bytes could fit into the same structure without allocating a buffer.
struct CompressedString
{
char buffer[sizeof(SimpleString)];
};
stuct OptString
{
int type; // Normal /Compressed
union
{
SimpleString simple;
CompressedString compressed;
}
};
So this is what I believe is happening above.
A very efficient string implementation is being used thus allowing you to dump the object to file without worrying about pointers (as the std::string are not using pointers).
Obviously this is not portable as it depends on the implementation details of std::string.
So interesting trick, but not portable (and liable to break easily without some compile time checks).
I'm working with a C API and a lot of the functions take arguments that are character arrays. I've heard that using char arrays is now frowned upon. But on the other hand, using c_str() to convert the string to a char array over-and-over seems wasteful.
Are there any reasons to do it one way vs the other?
The c_str() call is quite likely to be inlined—it's very small in terms of the required code. I would use std::string if that's the only thing holding you back.
Of course, if you're very worried, this standard advice applies:
Profile it
Read the assembly
Also be aware that this is a micro-optimization; you're quite likely to be wasting development time worrying about something completely different than from what you should be worrying about.
It depends on what you're doing, and what the interface functions are
doing. At one extreme, I don't think anyone would recommend converting
a string literal to an std::string, just so you can call c_str on
it. At the other: any code building up strings dynamically should use
std::string. Functions like strcpy and strcat are invitations to
buffer overflow. Between two, it depends. I'd say that the criteria
should be ease and safety: anytime something is easier or safer to do
using std::string, use std::string. As long as what you're doing
doesn't require dynamic allocation of char[], and things like
operator+ on strings wouldn't be used, you can use char[].
As mentioned, c_str() is going to be inlined. But what hasn't been mentioned but what I think is one of the most important aspects of your question is that std::string follows the principles of RAII. When using std::string, you won't need to remember to free the string or need to worry about exception safety. Just make sure each instance of std::string is not destructed until the C code is done with the string. That could especially be an issue if the std::string is a temporary made by the compiler.
If your C function writes back a string, you could use a vector<char> and set the size to your desired buffer size. That way you'll still follow C++ RAII principles.
Most implementations of std::string probably store the actual string as a C string anyway, so the c_str function is just an inline function that returns a pointer. So generally, I would say the proper way to go is with std::string.
Of course, if the string is intended to be modified by the function you call, then you can't use the std::string approach. Instead, you'll have to make a copy to your own buffer before calling the function, in which case using arrays may be the way to go.
There is a very simple reason to use string: it works.
Working with C-strings is a pain:
manual memory allocation and deallocation is error-prone
the interface itself is error-prone (lack of null character termination, off-by-one errors and buffer overflows are common)
the operations are inefficient (strlen, strcpy and strcat) because the length need be recomputed at each time
I really see no good reason to ever work with C-strings.
It's such a pain that many platforms have provided their own specific operations and that a number of "better strings" have been proposed (oh, the joy of having multiple standards).
This is something the professor showed us in his scripts. I have not used this method in any code I have written.
Basically, we take a class, or struct, and reinterpret_cast it and save off the entire struct like so:
struct Account
{
Account()
{ }
Account(std::string one, std::string two)
: login_(one), pass_(two)
{ }
private:
std::string login_;
std::string pass_;
};
int main()
{
Account *acc = new Account("Christian", "abc123");
std::ofstream out("File.txt", std::ios::binary);
out.write(reinterpret_cast<char*>(acc), sizeof(Account));
out.close();
This produces the output (in the file)
ÍÍÍÍChristian ÍÍÍÍÍÍ ÍÍÍÍabc123 ÍÍÍÍÍÍÍÍÍ
I'm confused. Does this method actually work, or does it cause UB because magical things happen within classes and structs that are at the whims of individual compilers?
It doesn't actually work, but it also does not cause undefined behavior.
In C++ it is legal to reinterpret any object as an array of char, so there is no undefined behavior here.
The results, however, are usually only usable if the class is POD (effectively, if the class is a simple C-style struct) and self-contained (that is, the struct doesn't have pointer data members).
Here, Account is not POD because it has std::string members. The internals of std::string are implementation-defined, but it is not POD and it usually has pointers that refer to some heap-allocated block where the actual string is stored (in your specific example, the implementation is using a small-string optimization where the value of the string is stored in the std::string object itself).
There are a few issues:
You aren't always going to get the results you expect. If you had a longer string, the std::string would use a buffer allocated on the heap to store the string and so you will end up just serializing the pointer, not the pointed-to string.
You can't actually use the data you've serialized here. You can't just reinterpret the data as an Account and expect it to work, because the std::string constructors would not get called.
In short, you cannot use this approach for serializing complex data structures.
It's not undefined. Rather, it's platform dependent or implementation defined behavior. This is, in general bad code, because differing versions of the same compiler, or even different switches on the same compiler, can break your save file format.
This could work depending on the contents of the struct, and the platform on which the data is read back. This is a risky, non-portable hack which your teacher should not be propagating.
Do you have pointers or ints in the struct? Pointers will be invalid in the new process when read back, and int format is not the same on all machines (to name but two show-stopping problems with this approach). Anything that's pointed to as part of an object graph will not be handled. Structure packing could be different on the target machine (32-bit vs 64-bit) or even due to compiler options changing on the same hardware, making sizeof(Account) unreliable as a read back data size.
For a better solution, look at a serialization library which handles those issues for you. Boost.Serialization is a good example.
Here, we use the term "serialization"
to mean the reversible deconstruction
of an arbitrary set of C++ data
structures to a sequence of bytes.
Such a system can be used to
reconstitute an equivalent structure
in another program context. Depending
on the context, this might used
implement object persistence, remote
parameter passing or other facility.
Google Protocol Buffers also works well for simple object hierarchies.
It's no substitute for proper serialization. Consider the case of any complex type that contains pointers - if you save the pointers to a file, when you load them up later, they're not going to point to anything meaningful.
Additionally, it's likely to break if the code changes, or even if it's recompiled with different compiler options.
So it's really only useful for short-term storage of simple types - and in doing so, it takes up way more space than necessary for that task.
This method, if it works at all, is far from robust. It is much better to decide on some "serialized" form, whether it is binary, text, XML, etc., and write that out.
The key here: You need a function/code to reliably convert your class or struct to/from a series of bytes. reinterpret_cast does not do this, as the exact bytes in memory used to represent the class or struct can change for things like padding, order of members, etc.
No.
In order for it to work, the structure must be a POD (plain old data: only simple data members and POD data members, no virtual functions... probably some other restrictions which I can't remember).
So if you wanted to do that, you'd need a struct like this:
struct Account {
char login[20];
char password[20];
};
Note that std::string's not a POD, so you'd need plain arrays.
Still, not a good approach for you. Keyword: "serialization" :).
Some version of string don;t actually use dynamic memory for the string when the string is small. Thus store the string internally in the string object.
Think of this:
struct SimpleString
{
char* begin; // beginning of string
char* end; // end of string
char* allocEnd; // end of allocated buffer end <= allocEnd
int* shareCount; // String are usually copy on write
// as a result you need to track the number of people
// using this buffer
};
Now on a 64 bit system. Each pointer is 8 bytes. Thus a string of less than 32 bytes could fit into the same structure without allocating a buffer.
struct CompressedString
{
char buffer[sizeof(SimpleString)];
};
stuct OptString
{
int type; // Normal /Compressed
union
{
SimpleString simple;
CompressedString compressed;
}
};
So this is what I believe is happening above.
A very efficient string implementation is being used thus allowing you to dump the object to file without worrying about pointers (as the std::string are not using pointers).
Obviously this is not portable as it depends on the implementation details of std::string.
So interesting trick, but not portable (and liable to break easily without some compile time checks).
I've recent been reading about immutable strings Why can't strings be mutable in Java and .NET? and Why .NET String is immutable? as well some stuff about why D chose immutable strings. There seem to be many advantages.
trivially thread safe
more secure
more memory efficient in most use cases.
cheap substrings (tokenizing and slicing)
Not to mention most new languages have immutable strings, D2.0, Java, C#, Python, etc.
Would C++ benefit from immutable strings?
Is it possible to implement an immutable string class in c++ (or c++0x) that would have all of these advantages?
update:
There are two attempts at immutable strings const_string and fix_str. Neither have been updated in half a decade. Are they even used? Why didn't const_string ever make it into boost?
I found most people in this thread do not really understand what immutable_string is. It is not only about the constness. The really power of immutable_string is the performance (even in single thread program) and the memory usage.
Imagine that, if all strings are immutable, and all string are implemented like
class string {
char* _head ;
size_t _len ;
} ;
How can we implement a sub-str operation? We don't need to copy any char. All we have to do is assign the _head and the _len. Then the sub-string shares the same memory segment with the source string.
Of course we can not really implement a immutable_string only with the two data members. The real implementation might need a reference-counted(or fly-weighted) memory block. Like this
class immutable_string {
boost::fly_weight<std::string> _s ;
char* _head ;
size_t _len ;
} ;
Both the memory and the performance would be better than the traditional string in most cases, especially when you know what you are doing.
Of course C++ can benefit from immutable string, and it is nice to have one. I have checked the boost::const_string and the fix_str mentioned by Cubbi. Those should be what I am talking about.
As an opinion:
Yes, I'd quite like an immutable string library for C++.
No, I would not like std::string to be immutable.
Is it really worth doing (as a standard library feature)? I would say not. The use of const gives you locally immutable strings, and the basic nature of systems programming languages means that you really do need mutable strings.
My conclusion is that C++ does not require the immutable pattern because it has const semantics.
In Java, if you have a Person class and you return the String name of the person with the getName() method, your only protection is the immutable pattern. If it would not be there you would have to clone() your strings all night and day (as you have to do with data members that are not typical value-objects, but still needs to be protected).
In C++ you have const std::string& getName() const. So you can write SomeFunction(person.getName()) where it is like void SomeFunction(const std::string& subject).
No copy happened
If anyone wants to copy he is free to do so
Technique applies to all data types, not just strings
You're certainly not the only person who though that. In fact, there is const_string library by Maxim Yegorushkin, which seems to have been written with inclusion into boost in mind. And here's a little newer library, fix_str by Roland Pibinger. I'm not sure how tricky would full string interning at run-time be, but most of the advantages are achievable when necessary.
I don't think there's a definitive answer here. It's subjective—if not because personal taste then at least because of the type of code one most often deals with. (Still, a valuable question.)
Immutable strings are great when memory is cheap—this wasn't true when C++ was developed, and it isn't the case on all platforms targeted by C++. (OTOH on more limited platforms C seems much more common than C++, so that argument is weak.)
You can create an immutable string class in C++, and you can make it largely compatible with std::string—but you will still lose when comparing to a built-in string class with dedicated optimizations and language features.
std::string is the best standard string we get, so I wouldn't like to see any messing with it. I use it very rarely, though; std::string has too many drawbacks from my point of view.
const std::string
There you go. A string literal is also immutable, unless you want to get into undefined behavior.
Edit: Of course that's only half the story. A const string variable isn't useful because you can't make it reference a new string. A reference to a const string would do it, except that C++ won't allow you to reassign a reference as in other languages like Python. The closest thing would be a smart pointer to a dynamically allocated string.
Immutable strings are great if, whenever it's necessary to create a new a string, the memory manager will always be able to determine determine the whereabouts of every string reference. On most platforms, language support for such ability could be provided at relatively modest cost, but on platforms without such language support built in it's much harder.
If, for example, one wanted to design a Pascal implementation on x86 that supported immutable strings, it would be necessary for the string allocator to be able to walk the stack to find all string references; the only execution-time cost of that would be requiring a consistent function-call approach [e.g. not using tail calls, and having every non-leaf function maintain a frame pointer]. Each memory area allocated with new would need to have a bit to indicate whether it contained any strings and those that do contain strings would need to have an index to a memory-layout descriptor, but those costs would be pretty slight.
If a GC wasn't table to walk the stack, then it would be necessary to have code use handles rather than pointers, and have code create string handles when local variables come into scope, and destroy the handles when they go out of scope. Much greater overhead.
Qt also uses immutable strings with copy-on-write.
There is some debate about how much performance it really buys you with decent compilers.
constant strings make little sense with value semantics, and sharing isn't one of C++'s greatest strengths...
Strings are mutable in Ruby.
$ irb
>> foo="hello"
=> "hello"
>> bar=foo
=> "hello"
>> foo << "world"
=> "helloworld"
>> print bar
helloworld=> nil
trivially thread safe
I would tend to forget safety arguments. If you want to be thread-safe, lock it, or don't touch it. C++ is not a convenient language, have your own conventions.
more secure
No. As soon as you have pointer arithmetics and unprotected access to the address space, forget about being secure. Safer against innocently bad coding, yes.
more memory efficient in most use cases.
Unless you implement CPU-intensive mechanisms, I don't see how.
cheap substrings (tokenizing and slicing)
That would be one very good point. Could be done by referring to a string with backreferences, where modifications to a string would cause a copy. Tokenizing and slicing become free, mutations become expensive.
C++ strings are thread safe, all immutable objects are guaranteed to be thread safe but Java's StringBuffer is mutable like C++ string is and the both of them are thread safe. Why worry about speed, define your method or function parameters with the const keyword to tell the compiler the string will be immutable in that scope. Also if string object is immutable on demand, waiting when you absolutely need to use the string, in other words, when you append other strings to the main string, you have a list of strings until you actually need the whole string then they are joined together at that point.
immutable and mutable object operate at the same speed to my knowledge , except their methods which is a matter of pro and cons. constant primitives and variable primitives move at different speeds because at the machine level, variables are assigned to a register or a memory space which require a few binary operations, while constants are labels that don't require any of those and are thus faster (or less work is done). works only for primitives and not for object.
I am curious to know how std::string is implemented and how does it differ from c string?If the standard does not specify any implementation then any implementation with explanation would be great with how it satisfies the string requirement given by standard?
Virtually every compiler I've used provides source code for the runtime - so whether you're using GCC or MSVC or whatever, you have the capability to look at the implementation. However, a large part or all of std::string will be implemented as template code, which can make for very difficult reading.
Scott Meyer's book, Effective STL, has a chapter on std::string implementations that's a decent overview of the common variations: "Item 15: Be aware of variations in string implementations".
He talks about 4 variations:
several variations on a ref-counted implementation (commonly known as copy on write) - when a string object is copied unchanged, the refcount is incremented but the actual string data is not. Both object point to the same refcounted data until one of the objects modifies it, causing a 'copy on write' of the data. The variations are in where things like the refcount, locks etc are stored.
a "short string optimization" (SSO) implementation. In this variant, the object contains the usual pointer to data, length, size of the dynamically allocated buffer, etc. But if the string is short enough, it will use that area to hold the string instead of dynamically allocating a buffer
Also, Herb Sutter's "More Exceptional C++" has an appendix (Appendix A: "Optimizations that aren't (in a Multithreaded World)") that discusses why copy on write refcounted implementations often have performance problems in multithreaded applications due to synchronization issues. That article is also available online (but I'm not sure if it's exactly the same as what's in the book):
http://www.gotw.ca/publications/optimizations.htm
Both those chapters would be worthwhile reading.
std::string is a class that wraps around some kind of internal buffer and provides methods for manipulating that buffer.
A string in C is just an array of characters
Explaining all the nuances of how std::string works here would take too long. Maybe have a look at the gcc source code http://gcc.gnu.org to see exactly how they do it.
There's an example implementation in an answer on this page.
In addition, you can look at gcc's implementation, assuming you have gcc installed. If not, you can access their source code via SVN. Most of std::string is implemented by basic_string, so start there.
Another possible source of info is Watcom's compiler
The c++ solution for strings are quite different from the c-version. The first and most important difference is while the c using the ASCIIZ solution, the std::string and std::wstring are using two iterators (pointers) to store the actual string. The basic usage of the string classes provides a dynamic allocated solution, so in the cost of CPU overhead with the dynamic memory handling it makes the string handling more comfortable.
As you probably already know, the C doesn't contain any built-in generic string type, only provides couple of string operations through the standard library. One of the major difference between C and C++ that the C++ provides a wrapped functionality, so it can be considered as a faked generic type.
In C you need to walk through the string if you would like to know the length of it, the std::string::size() member function is only one instruction (end - begin) basically. You can safely append strings one to an other as long as you have memory, so there is no need to worry about the buffer overflow bugs (and therefore the exploits), because the appending creates a bigger buffer if it is needed.
As somebody told here before, the string is derivated from the vector functionality, in a templated way, so it makes easier to deal with the multibyte-character systems. You can define your own string type using the typedef std::basic_string specific_str_t; expression with any arbitary data type in the template parameter.
I think there are enough pros and contras both side:
C++ string Pros:
- Faster iteration in certain cases (using the size definitely, and it doesn't need the data from the memory to check if you are at the end of the string, comparing two pointers. that could make a difference with the caching)
- The buffer operation are packed with the string functionality, so less worries about the buffer problems.
C++ string Cons:
- due to the dynamic memory allocation stuff, the basic usage could cause impact on the performance. (fortunately you can tell to the string object what should be the original buffer size, so unless you are exceed it, it won't allocate dynamic blocks from the memory)
- often weird and inconsistent names compared to other languages. this is the bad thing about any stl stuff, but you can use to it, and it makes a bit specific C++ish feeling.
- the heavy usage of the templating forces the standard library to use header based solutions so it is a big impact on the compiling time.
That depends on the standard library you use.
STLPort for example is a C++ Standard Library implementation which implements strings among other things.