Does std::string use string interning? - c++

I'm especially interested of windows, mingw.
Thanks.
Update:
First, I thought everyone is familiar with string interning.
http://en.wikipedia.org/wiki/String_interning
Second, my problem is in detail:
I knocked up a string class for practice. Nothing fancy you know, i just store the size and a char * in a class.
I use memcpy for the assignment.
When i do this to measure the assignment speed of std::string and my string class:
string test1 = " 65 kb text ", test2;
for(int i=0; i<1000000; i++)
{
test2 = test1;
}
mystring test3 = "65 kb text", test4;
for (int i=0; i<1000000; i++)
{
test4 = test3
}
The std::string is a winner by a large margin. I do not do anything in the assignment operator (in my class) but copy with memcpy. I do not even create a new array with the "new" operator, cause i check for size equality, and only request new if needed. How come?
For small strings, there is no problem. I cant see how can std::string assign values faster than memcpy, i bet it uses it too in the background, or something similar, so that's why i asked interning.
Update2:
by modifying the loops with a single character assignment like this: test2[15] = 78, I avoided the effect of copy-on-write of std::string. Now both codes takes exactly the same time (okay, there is an 1-2% difference, but that is negligible). So if I am not entirely mistaken, the mingw std::string must use COW.
Thank you all for your help.

Simply put, no. String interning is not feasible with mutable strings, such as all std::string-objects.

String interning may be done by the compiler only for string literals appearing in the code. If you initialise std:strings with string literals, and some of the literals occur multiple times, the compiler may store only one copy of this string in your binary. There is no string interning at run time. mingw supports compile time string interning as explained before.

Not so much, since std::string is modifiable.
Implementations have been known to attempt the use of copy-on-write, but that causes such problems in multi-threaded code that I think it's out of fashion. It's also very hard to implement correctly - perhaps impossible? If someone takes a pointer to a character in the string, and then modifies another character, I'm not sure that this is permitted to invalidate the first pointer. If it's not allowed, then COW is out of the question too, I think, but I can't remember how it works out.

No, there is no string interning in the STL. It doesn't fit the C++ design philosophy to have such a feature.

Two ideas:
Is myclass a template class? The std::string class is a typedef of the template class basic_string. This means that the complete source of basic_string instead of just the header is accessible to the compiler when your test function is compiled. This additional information enables more optimisations in exchange for higher compilation time.
Most c++ standard library implementations are highly optimised (and sadly almost unreadable).

Related

When to use std::string vs char*? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
C++ char* vs std::string
I'm new to C++ coming from C# but I really do like C++ much better.
I have an abstract class that defines two constant strings (not static). And I wondered if a const char* would be a better choice. I'm still getting the hang of the C++ standards, but I just figured that there really isn't any reason why I would need to use std::string in this particular case (no appending or editing the string, just writing to the console via printf).
Should I stick to std::string in every case?
Should I stick to std::string in every case?
Yes.
Except perhaps for a few edge cases where you writing a high performance multi-threaded logging lib and you really need to know when a memory allocation is going to take place, or perhaps in fiddling with individual bits in a packet header in some low level protocol/driver.
The problem is that you start with a simple char and then you need to print it, so you use printf(), then perhaps a sprintf() to parse it because std::stream would be a pain for just one int to string. And you end up with an unsafe and unmaintainable mix oc c/c++
I would stick to using std::string instead of const char*, simply because most of the built-in C++ libraries work with strings and not character arrays. std::string has a lot of built-in methods and facilities that give the programmer a lot of power when manipulating strings.
Should I stick to std::string in every case?
There are cases where std::string isn't needed and just a plain char const* will do. However you do get other functionality besides manipulation, you also get to do comparison with other strings and char arrays, and all the standard algorithms to operate on them.
I would say go with std::string by default (for members and variables), and then only change if you happen to see that is the cause of a performance drop (which it won't).
Use std::string when you need to store a value.
Use const char * when you want maximum flexibility, as almost everything can be easily converted to or from one.
This like comparing Apples to Oranges. std::string is a container class while char* is just a pointer to a character sequence.
It really all depends on what you want to do with the string.
Std::string on the other hand can give you a quick access for simple string calculation and manipulation function. Most of those are simple string manipulation functions, nothing fancy really.
So it basically depends on your needs and how your functions are declared. The only advantage for std::string over a char pointer is that it doesnt require a specific lenghth decleration.

Strange error in variable values C++

I have used this code. Here a string is present from location starting from 4 and length of string is 14. All these calculations are done prior to this code. I am pasting a small snippet of the error containing code.
void *data = malloc(4096);
int len = 14;
int fileptr = 4;
string str;
cout<<len<<endl;
cout<<fileptr<<endl;
memcpy(&str, (char *)data+fileptr, len);
cout<<len<<endl;
cout<<fileptr<<endl;
Output i get is:
14
4
4012176
2009288233
Here i am reading a string "System Catalog" from memory. Its displaying the string correctly. But the values of fileptr and len are abruptly changing after using memcpy() function.
string is not the same as a char*. string is an object. So you can't just memcpy() data to it. So the behavior of this code is undefined.
In your case, you are copying 14 bytes of junk data into str and corrupting the stack.
The result is that you are overwriting both len and fileptr with junk from the malloc().
I'm not sure exactly what you're trying to do, but if you want to create a string, you should do it like this:
string str = "System Catalog";
A string is an object and is not just a sequence of bytes. You cannot just memcpy over it from raw memory.
My guess is that in your code the str variable is allocated before other variables in stack memory and memcpy-ing over it you are overwriting them.
Note that your phrase "It's displaying the string correctly" has the seed of a common misconception about C++ in it.
When you do bad things in C++ (e.g. writing bytes over an object) you should expect the worst possible behavior. The worst possible behavior however is NOT an ugly result, a crash or a runtime error... but something that seems to work but that has bad consequences in the future.
You want to assign this many characters from that char pointer into a std::string, so you should look at what facilities a string object provides for doing that rather than hitting it over the head with memcpy(). As others have noted, memcpy() is for use in low-level C-style code, not for interacting with C++ objects.
In particular, you should study the assignment methods provided by std::string, one of which does exactly what you want -- which isn't a coincidence.
string is an object - please look up the semantics for it. Why are you doing this and what are you trying to achieve?
If for some reason you actually MUST use memcpy you can get the Internal address of the string to copy to (provided the string is big enough to contain the information you want to copy)
static_cast < char * >(&(str[0]));
But this is VERY VERY BAD. If you use it, I'm quite sure there are more crazy things going on in your code :-)

C++: Is it safe to move a jpeg around in memory using a std::string?

I have an external_jpeg_func() that takes jpeg data in a char array to do stuff with it. I am unable to modify this function. In order to provide it the char array, I do something like the following:
//what the funcs take as inputs
std::string my_get_jpeg();
void external_jpeg_func(const char* buf, unsigned int size);
int main ()
{
std::string myString = my_get_jpeg();
external_jpeg_func(myString.data(), myString.length() );
}
My question is: Is it safe to use a string to transport the char array around? Does jpeg (or perhaps any binary file format) be at risk of running into characters like '\0' and cause data loss?
My recommendation would be to use std::vector<char>, instead of std::string, in this case; the danger with std::string is that it provides a c_str() function and most developers assume that the contents of a std::string are NUL-terminated, even though std::string provides a size() function that can return a different value than what you would get by stopping at NUL. That said, as long as you are careful to always use the constructor that takes a size parameter, and you are careful not to pass the .c_str() to anything, then there is no problem with using a string here.
While there is no technical advantage to using a std::vector<char> over a std::string, I feel that it does a better job of communicating to other developers that the content is to be interpreted as an arbitrary byte sequence rather than NUL-terminated textual content. Therefore, I would choose the former for this added readability. That said, I have worked with plenty of code that uses std::string for storing arbitrary bytes. In fact, the C++ proto compiler generates such code (though, I should add, that I don't think this was a good choice for the readability reasons that I mentioned).
std::string does not treat null characters specially, unless you don't give it an explicit string length. So your code will work fine.
Although, in C++03, strings are technically not required to be stored in contiguous memory. Just about every std::string implementation you will find will in fact store them that way, but it is not technically required. C++11 rectifies this.
So, I would suggest you use a std::vector<char> in this case. std::string doesn't buy you anything over a std::vector<char>, and it's more explicit that this is an array of characters and not a possibly printable string.
I think it is better to use char array char[] or std::vector<char>. This is standard way to keep images. Of course, binary file may contain 0 characters.

Why isn't ("Maya" == "Maya") true in C++?

Any idea why I get "Maya is not Maya" as a result of this code?
if ("Maya" == "Maya")
printf("Maya is Maya \n");
else
printf("Maya is not Maya \n");
Because you are actually comparing two pointers - use e.g. one of the following instead:
if (std::string("Maya") == "Maya") { /* ... */ }
if (std::strcmp("Maya", "Maya") == 0) { /* ... */ }
This is because C++03, §2.13.4 says:
An ordinary string literal has type “array of n const char”
... and in your case a conversion to pointer applies.
See also this question on why you can't provide an overload for == for this case.
You are not comparing strings, you are comparing pointer address equality.
To be more explicit -
"foo baz bar" implicitly defines an anonymous const char[m]. It is implementation-defined as to whether identical anonymous const char[m] will point to the same location in memory(a concept referred to as interning).
The function you want - in C - is strmp(char*, char*), which returns 0 on equality.
Or, in C++, what you might do is
#include <string>
std::string s1 = "foo"
std::string s2 = "bar"
and then compare s1 vs. s2 with the == operator, which is defined in an intuitive fashion for strings.
The output of your program is implementation-defined.
A string literal has the type const char[N] (that is, it's an array). Whether or not each string literal in your program is represented by a unique array is implementation-defined. (§2.13.4/2)
When you do the comparison, the arrays decay into pointers (to the first element), and you do a pointer comparison. If the compiler decides to store both string literals as the same array, the pointers compare true; if they each have their own storage, they compare false.
To compare string's, use std::strcmp(), like this:
if (std::strcmp("Maya", "Maya") == 0) // same
Typically you'd use the standard string class, std::string. It defines operator==. You'd need to make one of your literals a std::string to use that operator:
if (std::string("Maya") == "Maya") // same
What you are doing is comparing the address of one string with the address of another. Depending on the compiler and its settings, sometimes the identical literal strings will have the same address, and sometimes they won't (as apparently you found).
Any idea why i get "Maya is not Maya" as a result
Because in C, and thus in C++, string literals are of type const char[], which is implicitly converted to const char*, a pointer to the first character, when you try to compare them. And pointer comparison is address comparison.
Whether the two string literals compare equal or not depends whether your compiler (using your current settings) pools string literals. It is allowed to do that, but it doesn't need to. .
To compare the strings in C, use strcmp() from the <string.h> header. (It's std::strcmp() from <cstring>in C++.)
To do so in C++, the easiest is to turn one of them into a std::string (from the <string> header), which comes with all comparison operators, including ==:
#include <string>
// ...
if (std::string("Maya") == "Maya")
std::cout << "Maya is Maya\n";
else
std::cout << "Maya is not Maya\n";
C and C++ do this comparison via pointer comparison; looks like your compiler is creating separate resource instances for the strings "Maya" and "Maya" (probably due to having an optimization turned off).
My compiler says they are the same ;-)
even worse, my compiler is certainly broken. This very basic equation:
printf("23 - 523 = %d\n","23"-"523");
produces:
23 - 523 = 1
Indeed, "because your compiler, in this instance, isn't using string pooling," is the technically correct, yet not particularly helpful answer :)
This is one of the many reasons the std::string class in the Standard Template Library now exists to replace this earlier kind of string when you want to do anything useful with strings in C++, and is a problem pretty much everyone who's ever learned C or C++ stumbles over fairly early on in their studies.
Let me explain.
Basically, back in the days of C, all strings worked like this. A string is just a bunch of characters in memory. A string you embed in your C source code gets translated into a bunch of bytes representing that string in the running machine code when your program executes.
The crucial part here is that a good old-fashioned C-style "string" is an array of characters in memory. That block of memory is often referred to by means of a pointer -- the address of the start of the block of memory. Generally, when you're referring to a "string" in C, you're referring to that block of memory, or a pointer to it. C doesn't have a string type per se; strings are just a bunch of chars in a row.
When you write this in your code:
"wibble"
Then the compiler provides a block of memory that contains the bytes representing the characters 'w', 'i', 'b', 'b', 'l', 'e', and '\0' in that order (the compiler adds a zero byte at the end, a "null terminator". In C a standard string is a null-terminated string: a block of characters starting at a given memory address and continuing until the next zero byte.)
And when you start comparing expressions like that, what happens is this:
if ("Maya" == "Maya")
At the point of this comparison, the compiler -- in your case, specifically; see my explanation of string pooling at the end -- has created two separate blocks of memory, to hold two different sets of characters that are both set to 'M', 'a', 'y', 'a', '\0'.
When the compiler sees a string in quotes like this, "under the hood" it builds an array of characters, and the string itself, "Maya", acts as the name of the array of characters. Because the names of arrays are effectively pointers, pointing at the first character of the array, the type of the expression "Maya" is pointer to char.
When you compare these two expressions using "==", what you're actually comparing is the pointers, the memory addresses of the beginning of these two different blocks of memory. Which is why the comparison is false, in your particular case, with your particular compiler.
If you want to compare two good old-fashioned C strings, you should use the strcmp() function. This will examine the contents of the memory pointed two by both "strings" (which, as I've explained, are just pointers to a block of memory) and go through the bytes, comparing them one-by-one, and tell you whether they're really the same.
Now, as I've said, this is the kind of slightly surprising result that's been biting C beginners on the arse since the days of yore. And that's one of the reasons the language evolved over time. Now, in C++, there is a std::string class, that will hold strings, and will work as you expect. The "==" operator for std::string will actually compare the contents of two std::strings.
By default, though, C++ is designed to be backwards-compatible with C, i.e. a C program will generally compile and work under a C++ compiler the same way it does in a C compiler, and that means that old-fashioned strings, "things like this in your code", will still end up as pointers to bits of memory that will give non-obvious results to the beginner when you start comparing them.
Oh, and that "string pooling" I mentioned at the beginning? That's where some more complexity might creep in. A smart compiler, to be efficient with its memory, may well spot that in your case, the strings are the same and can't be changed, and therefore only allocate one block of memory, with both of your names, "Maya", pointing at it. At which point, comparing the "strings" -- the pointers -- will tell you that they are, in fact, equal. But more by luck than design!
This "string pooling" behaviour will change from compiler to compiler, and often will differ between debug and release modes of the same compiler, as the release mode often includes optimisations like this, which will make the output code more compact (it only has to have one block of memory with "Maya" in, not two, so it's saved five -- remember that null terminator! -- bytes in the object code.) And that's the kind of behaviour that can drive a person insane if they don't know what's going on :)
If nothing else, this answer might give you a lot of search terms for the thousands of articles that are out there on the web already, trying to explain this. It's a bit painful, and everyone goes through it. If you can get your head around pointers, you'll be a much better C or C++ programmer in the long run, whether you choose to use std::string instead or not!

Runtime dependency for std::string concatenation

std::string sAttr("");
sAttr = sAttr+VAL_TAG_OPEN+sVal->c_str()+VAL_TAG_CLOSE;
else where in the code I have defined
const char VAL_TAG_OPEN[] = "<value>";
sVal is a variable that is retrieved off of a array of string pointers. This works fine in most of the system, windows and linux. However at a customer site, where to my belief has a version of linux on which we had done some extensive testing, produce a result as if I have never used the VAL_TAG_OPEN and VAL_TAG_CLOSE. The results I recieve is for
sAttr = sAttr+sVal->c_str();
Whats going on ?. Does std::string concatenation varies across runtime ?
Why the ->c_str()? If sVal is a std::string, try removing this call. Remember that the order of evaluation is undefined, so you may end up adding pointers instead of concatenating strings, because VAL_TAG_OPEN, sVal->c_str() and VAL_TAG_CLOSE are all plain C strings. I suggest you use the addition assignment operator +=, e.g. :
sAttr += VAL_TAG_OPEN;
sAttr += *sVal; /* sVal->c_str() ? */
sAttr += VAL_TAG_CLOSE;
(which should be faster anyway).
No, std::string concatenation should definitely not depend upon runtime, but somehow VAL_TAG_OPEN and VAL_TAG_CLOSE seem to be empty strings.
I'd guess you've some kind of buffer overflow or invalid pointer arithmetic somewhere, so that your program overwrites the memory containing those "constant" values. Wherever your memory ends up being is indeed runtime (and therefore OS version) specific. I've been trapped by similar things in the past by switching compilers or optimizer options.
As you mention keeping raw pointers to std::string instances in raw arrays, such errors are indeed not all to improbable, but may be difficult to detect, as using a DEBUG build won't give you any iterator checks with all these all to RAW things... Good luck.
I don't thing its the order of evaluation that is causing the issue. Its because of the constant char arrays at the beginning and end
const char VAL_TAG_OPEN[] = "<value>";
const char VAL_TAG_CLOSE[] = "</value>"
The concatenation operator thought VAL_TAG_OPN and VAL_TAG_CLOSE as not a null terminator string. Hence the optimizer just ignored them thinking it as garbage.
sAttr += std::string(VAL_TAG_OPEN);
sAttr += *sVal;
sAttr += std::string(VAL_TAG_CLOSE);
This does solve it.
sAttr = sAttr+VAL_TAG_OPEN+sVal->c_str()+VAL_TAG_CLOSE;
Like fbonnet said, it's an order of evaluation issue.
If that line gets evaluated strictly left to right, then the result of each addition is a std::string object, which has an operator overload for addition and things work as you expect.
If it doesn't get evaluated left to right, then you wind up adding pointers together and who knows what that will get you.
Avoid this construct and just use the += operator on std::string.